
Proposal: Stream support reopen parser #146

Merged · 1 commit · May 23, 2017

Conversation

Freyskeyd (Contributor)

Hello,

I'm working on an XMPP library and I need to be able to reopen the parser so it can consume new bytes.

It's a first step towards being able to reset some conditions; I use it like this:

loop {
    // My parser is an instance of EventReader<Buffer>.
    // Buffer is a custom struct that holds my buffer and exposes some useful methods.
    // It implements Read.
    if self.parser.source().available_data() > 0 {
        self.parser.reopen_parser();
    }
    match self.parser.next() { ... }
}

What do you think?

Signed-off-by: Freyskeyd simon.paitrault@gmail.com

@netvl (Owner) commented May 3, 2017

I'm not familiar enough with XMPP; just to clarify: as far as I understand, this kind of "reopening" is necessary only if the stream contains multiple root elements. Is that your case? If yes, then it is probably better to make it a configuration option which would allow multiple root elements. That would make the usage much simpler.

@Freyskeyd (Contributor, Author) commented May 3, 2017

XMPP uses a unique root element called stream:stream; when this root element is closed, it means that the communication between client and server is closed.

When communicating, client and server write 'stanzas' into a stream. Here's an example of an XMPP stream when connecting:

<?xml version="1.0" encoding="UTF-8"?>
<stream:stream xmlns:stream="http://etherx.jabber.org/streams" xmlns="jabber:client" to="example.com" version="1.0">
   <message from="juliet@example.com" to="romeo@example.net" xml:lang="en">
      <body>first message</body>
   </message>

After that, the client sends another message, and the stream looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<stream:stream xmlns:stream="http://etherx.jabber.org/streams" xmlns="jabber:client" to="example.com" version="1.0">
   <message from="juliet@example.com" to="romeo@example.net" xml:lang="en">
      <body>first message</body>
   </message>
   <message from="juliet@example.com" to="romeo@example.net" xml:lang="en">
      <body>second message</body>
   </message>

When the client wants to close all communication with the server, it writes the closing tag:

<?xml version="1.0" encoding="UTF-8"?>
<stream:stream xmlns:stream="http://etherx.jabber.org/streams" xmlns="jabber:client" to="example.com" version="1.0">
   <message from="juliet@example.com" to="romeo@example.net" xml:lang="en">
      <body>first message</body>
   </message>
   <message from="juliet@example.com" to="romeo@example.net" xml:lang="en">
      <body>second message</body>
   </message>
</stream:stream>

@netvl (Owner) commented May 3, 2017

Okay, I see. If it is like this, then why do you actually need to "reset" the parser? The parser should continue reading until the closing tag for the root element is reached. Take a look at this example: https://github.com/netvl/xml-rs/blob/streaming-test/tests/streaming.rs. You can clone that branch and run cargo test to verify that it does work.

Granted, streaming parsing is not supported entirely well (e.g. character data cannot be streamed, as can be seen from the comment in the test), since I didn't have this use case in mind when I was writing the parser, but it should still be usable for the XMPP use case.
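For reference, the `write_and_reset_position` helper that the linked test relies on can be sketched with std types only. This is an assumed reconstruction of the test's helper (not code copied from the branch): append the new chunk to the underlying buffer, then restore the read position so the next parser pull sees the fresh bytes.

```rust
use std::io::{Cursor, Seek, SeekFrom, Write};

// Assumed reconstruction of the helper used in the linked streaming test:
// append `data` to the cursor's buffer, then rewind to the old position
// so the reader continues from the first unread byte.
fn write_and_reset_position(cur: &mut Cursor<Vec<u8>>, data: &[u8]) {
    let pos = cur.position();            // where the parser stopped reading
    cur.seek(SeekFrom::End(0)).unwrap(); // jump to the end of the buffer
    cur.write_all(data).unwrap();        // append the newly arrived bytes
    cur.set_position(pos);               // resume reading at the old spot
}

fn main() {
    let mut cur = Cursor::new(b"<root>".to_vec());
    cur.set_position(6); // pretend the parser already consumed "<root>"
    write_and_reset_position(&mut cur, b"<child-1/>");
    assert_eq!(cur.position(), 6);
    assert_eq!(cur.get_ref().as_slice(), &b"<root><child-1/>"[..]);
    println!("ok");
}
```

This only simulates a stream; a real transport would append bytes as they arrive from the socket instead.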

@Freyskeyd (Contributor, Author) commented May 3, 2017

If you try this:

    write_and_reset_position(it.source_mut(), b"<child-2/>");
    assert_match!(it.next(), Some(Ok(XmlEvent::StartElement { ref name, .. })) if name.local_name == "child-2");
    assert_match!(it.next(), Some(Ok(XmlEvent::EndElement { ref name })) if name.local_name == "child-2");

    // Add an extra call; this one returns an Error
    let _ = it.next();

    write_and_reset_position(it.source_mut(), b"</root>");
    // it.next() returns None instead of the expected EndElement
    assert_match!(it.next(), Some(Ok(XmlEvent::EndElement { ref name })) if name.local_name == "root");
    assert_match!(it.next(), Some(Ok(XmlEvent::EndDocument)));
    assert_match!(it.next(), None);

It's because of this: https://github.com/netvl/xml-rs/blob/streaming-test/src/reader/parser/mod.rs#L243

We call next() and we have data, but the parser thinks we already reached the end of the stream.
Could we just check whether new bytes are available instead of returning the last value?

@netvl (Owner) commented May 3, 2017

First, I'm actually really surprised that it returns None; it should return an error instead (about an unexpected end of stream). This is worth investigating.

Second, why do you actually need to call next() an extra time here? I don't think I'm willing to change the default behavior - if a document is legitimately incomplete, the parser should return an error upon the end of stream, per the XML spec. Maybe a separate method or a special configuration option would be better.

On second thought, I probably understand why - it's because multiple new elements are possible, and you don't know in advance how many, right? If so, then yes, I believe it is a valid case.

I think that the proper solution would be to add a configuration option which would allow retrying the read at the end of stream, and which would make the parser always return the ErrorKind::UnexpectedEof if the stream is currently at its end:

let mut it = parser.into_iter();
read_more_data: loop {
    read_data(it.source_mut());
    for e in &mut it {
        match e {
            Ok(...) => ...,  // process data
            Err(ref e) if e.is_unexpected_eof() => continue read_more_data,
            Err(e) => return Err(e),
        }
    }
}

where Error::is_unexpected_eof() is defined like this:

impl Error {
    pub fn is_unexpected_eof(&self) -> bool {
        match self.kind {
            ErrorKind::Io(e) if e.kind() == io::ErrorKind::UnexpectedEof => true,
            _ => false, 
        }
    }
}
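Assuming the refactoring above lands, the classification reduces to a plain std::io check. A minimal std-only sketch, where a free function stands in for the proposed `Error::is_unexpected_eof` method (the function and its placement are assumptions, not a released API):

```rust
use std::io;

// Std-only sketch of the proposed check: an error counts as a premature
// end of stream exactly when its io kind is UnexpectedEof. Such errors
// are retryable once more bytes arrive; anything else is fatal.
fn is_unexpected_eof(err: &io::Error) -> bool {
    err.kind() == io::ErrorKind::UnexpectedEof
}

fn main() {
    let eof = io::Error::new(io::ErrorKind::UnexpectedEof, "stream ended mid-event");
    let pipe = io::Error::new(io::ErrorKind::BrokenPipe, "peer closed the connection");
    assert!(is_unexpected_eof(&eof));   // retryable: wait for more bytes
    assert!(!is_unexpected_eof(&pipe)); // fatal: propagate to the caller
    println!("ok");
}
```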

This, of course, also requires removal of the ErrorKind::UnexpectedEof variant since it is superseded by io::ErrorKind::UnexpectedEof, which I think is actually good and should be done, while we are at it.

A naive solution for returning UnexpectedEof and allowing reading to resume once more data is available in the stream would indeed be similar to this PR - it would involve keeping the final_result empty. I think, however, that this is suboptimal, because it would only allow certain kinds of events (namely elements and probably entities), but it would fail miserably for other events, like characters, and could even lead to loss of data.

The proper solution would require modifying the parser and the lexer to read as much as they can from the stream and then either output what they could read (a characters event, for example), or return an UnexpectedEof error and go back to the previous position if the input contains only part of an event (e.g. when the stream stops inside an element). This, however, seems to require unbounded buffering, because almost any incomplete event, in particular CDATA, can be arbitrarily large. I'm not really sure how this should best be handled.
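The rollback idea can be illustrated without touching the real lexer. A toy sketch (hypothetical tokenizer, not xml-rs code): the scanner reports "unexpected eof" without consuming anything when the buffer ends mid-token, so the caller can retry the same position once more bytes arrive.

```rust
// Toy tokenizer illustrating the rollback-on-incomplete-input idea:
// scan `buf` starting at `pos` for one complete `<...>` token. If the
// buffer ends mid-token, return an error without advancing, so the
// caller keeps `pos` and retries after more data has been appended.
fn next_token(buf: &[u8], pos: usize) -> Result<(usize, &[u8]), &'static str> {
    if buf.get(pos) != Some(&b'<') {
        return Err("expected '<'");
    }
    match buf[pos..].iter().position(|&b| b == b'>') {
        Some(end) => Ok((pos + end + 1, &buf[pos..pos + end + 1])),
        None => Err("unexpected eof"), // position unchanged; retry later
    }
}

fn main() {
    let mut data = b"<a/><b".to_vec();
    let (pos, tok) = next_token(&data, 0).unwrap();
    assert_eq!(tok, &b"<a/>"[..]);
    // The second token is incomplete: report eof, consume nothing.
    assert_eq!(next_token(&data, pos), Err("unexpected eof"));
    data.extend_from_slice(b"/>"); // more bytes arrive on the stream
    let (_, tok) = next_token(&data, pos).unwrap();
    assert_eq!(tok, &b"<b/>"[..]);
    println!("ok");
}
```

The unbounded-buffering concern shows up here too: the incomplete token must stay in `data` until it can be completed, however large it grows.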

Nevertheless, I think that implementing a partial solution first does bring some benefits, since it would allow certain use cases like yours.

@Freyskeyd (Contributor, Author)

How do you want to proceed for a partial implementation?

@netvl (Owner) commented May 4, 2017

I think that it would be something like what you did in this PR, but triggered not by an external request through a special method, but by checking a special flag in the parser configuration. If you would like to do it, I would be very glad, since I don't know right now when I will be able to - I'm quite busy at my main job at the moment.

@Freyskeyd (Contributor, Author)

@netvl I think doing a stream-based analysis is a huge refactor.

A stream can contain an unclosed tag; it will be caught as an error at one point, but the next iteration will feed the end of the tag.

I really need to be able to reopen the parser. Can we consider this method as unsafe and exclude it from the documentation?

@Freyskeyd Freyskeyd force-pushed the master branch 2 times, most recently from 774498d to 4a11aae on May 11, 2017 at 17:27
@Freyskeyd (Contributor, Author)

@netvl I found a solution; let me know what you think about it. Thanks ❤️

@Freyskeyd Freyskeyd force-pushed the master branch 3 times, most recently from 8ddf8d6 to 4365c2e on May 16, 2017 at 09:47
@netvl (Owner) left a comment

Thank you! There are minor changes which need to be done, but I like the general idea. 👍

@@ -221,7 +221,7 @@ pub struct Lexer {
     skip_errors: bool,
     inside_comment: bool,
     inside_token: bool,
-    eof_handled: bool
+    pub eof_handled: bool
netvl (Owner)

I would prefer having a mutator method like fn reset_eof_handled(&mut self) instead of making the field public, simply because the other fields are not.

/// By default this option is setted to false but if you want to deal with progressive inputs
/// like a stream over tcp, you can set it to true to make the parser ignore end of stream
/// error.
pub ignore_end_of_stream: bool
netvl (Owner)

I don't want to nitpick, but I prefer the documentation to be as clean as possible :)

Whether or not the parser should ignore the end of stream. Default is false.

By default the parser will either error out when it encounters a premature end of stream or complete normally if the end of stream was expected. If you want to continue reading from a stream whose input is supplied progressively, you can set this option to true. In this case the parser will allow you to invoke the next() method even if a supposed end of stream has happened.

Note that support for this functionality is incomplete; for example, the parser will fail if the premature end of stream happens inside PCDATA. Therefore, use this option at your own risk.

@@ -120,6 +120,9 @@ impl PullParser {
pop_namespace: false
}
}

/// Tells if this parser are ignoring end of stream errors
netvl (Owner)

Checks if this parser ignores the end of stream errors.

assert!(false);
},
Some(Err(_)) => {
assert!(true);
netvl (Owner)

You don't actually need this.

match reader.next() {
None |
Some(Ok(_)) => {
assert!(false);
netvl (Owner)

It is better to write a panic!() with some helpful error message.

Signed-off-by: Freyskeyd <simon.paitrault@gmail.com>
@Freyskeyd (Contributor, Author)

@netvl I fixed it :)

@netvl (Owner) commented May 23, 2017

This is great, thanks! I'll release a new version as soon as I get to a computer with the Rust environment on it.

@netvl netvl merged commit 763ada5 into netvl:master May 23, 2017
@linkmauve linkmauve mentioned this pull request May 23, 2017
@netvl (Owner) commented Jun 1, 2017

Sorry for the delay! I just released version 0.5.0 with this and one other fix.

@netvl (Owner) commented Jun 1, 2017

It's actually 0.6.0 now, since another incompatible change has just got merged :)
