
Proposal: Stream support reopen parser #146

Merged · 1 commit · May 23, 2017

Conversation

Freyskeyd (Contributor)

Hello,

I'm working on an XMPP library and I need to be able to reopen the parser so it can consume new bytes.

It's a first step towards being able to reset some conditions; I use it like this:

loop {
    // My parser is an instance of EventReader<Buffer>.
    // Buffer is a custom struct that holds my buffer and exposes some useful methods.
    // It implements Read.
    if self.parser.source().available_data() > 0 {
        self.parser.reopen_parser();
    }
    match self.parser.next() { ... }
}

What do you think?

Signed-off-by: Freyskeyd simon.paitrault@gmail.com

@netvl (Owner) commented May 3, 2017

I'm not familiar enough with XMPP; just to clarify: as far as I understand, this kind of "reopening" is necessary only if the stream contains multiple root elements. Is that your case? If yes, then it is probably better to make it a configuration option which would allow multiple root elements. That would make the usage much simpler.

@Freyskeyd (Contributor, Author) commented May 3, 2017

XMPP uses a unique root element called stream:stream; when this root element is closed, it means that the communication between client and server is closed.

When communicating, client and server write 'stanzas' into a stream. Here's an example of an XMPP stream when connecting:

<?xml version="1.0" encoding="UTF-8"?>
<stream:stream xmlns:stream="http://etherx.jabber.org/streams" xmlns="jabber:client" to="example.com" version="1.0">
   <message from="juliet@example.com" to="romeo@example.net" xml:lang="en">
      <body>first message</body>
   </message>

After that, the client sends another message, and the stream looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<stream:stream xmlns:stream="http://etherx.jabber.org/streams" xmlns="jabber:client" to="example.com" version="1.0">
   <message from="juliet@example.com" to="romeo@example.net" xml:lang="en">
      <body>first message</body>
   </message>
   <message from="juliet@example.com" to="romeo@example.net" xml:lang="en">
      <body>second message</body>
   </message>

When the client wants to close all communication with the server, it writes the closing tag:

<?xml version="1.0" encoding="UTF-8"?>
<stream:stream xmlns:stream="http://etherx.jabber.org/streams" xmlns="jabber:client" to="example.com" version="1.0">
   <message from="juliet@example.com" to="romeo@example.net" xml:lang="en">
      <body>first message</body>
   </message>
   <message from="juliet@example.com" to="romeo@example.net" xml:lang="en">
      <body>second message</body>
   </message>
</stream:stream>

@netvl (Owner) commented May 3, 2017

Okay, I see. If it is like this, then why do you actually need to "reset" the parser? The parser should continue reading until the closing tag for the root element is reached. Take a look at this example: https://github.com/netvl/xml-rs/blob/streaming-test/tests/streaming.rs. You can clone that branch and run cargo test to verify that it does work.

Granted, streaming parsing is not supported entirely well (e.g. character data cannot be streamed, as can be seen from the comment in the test), since I didn't have this use case in mind when I was writing the parser, but it should still be usable for the XMPP use case.
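For reference, the `write_and_reset_position` helper that the linked test relies on can be sketched with std types only. This is an assumed reconstruction of the test's helper (not code copied from the branch): append the new chunk to the underlying buffer, then restore the read position so the next parser pull sees the fresh bytes.

```rust
use std::io::{Cursor, Seek, SeekFrom, Write};

// Assumed reconstruction of the helper used in the linked streaming test:
// append `data` to the cursor's buffer, then rewind to the old position
// so the reader continues from the first unread byte.
fn write_and_reset_position(cur: &mut Cursor<Vec<u8>>, data: &[u8]) {
    let pos = cur.position();            // where the parser stopped reading
    cur.seek(SeekFrom::End(0)).unwrap(); // jump to the end of the buffer
    cur.write_all(data).unwrap();        // append the newly arrived bytes
    cur.set_position(pos);               // resume reading at the old spot
}

fn main() {
    let mut cur = Cursor::new(b"<root>".to_vec());
    cur.set_position(6); // pretend the parser already consumed "<root>"
    write_and_reset_position(&mut cur, b"<child-1/>");
    assert_eq!(cur.position(), 6);
    assert_eq!(cur.get_ref().as_slice(), &b"<root><child-1/>"[..]);
    println!("ok");
}
```

This only simulates a stream; a real transport would append bytes as they arrive from the socket instead.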

@Freyskeyd (Contributor, Author) commented May 3, 2017

If you try this:

    write_and_reset_position(it.source_mut(), b"<child-2/>");
    assert_match!(it.next(), Some(Ok(XmlEvent::StartElement { ref name, .. })) if name.local_name == "child-2");
    assert_match!(it.next(), Some(Ok(XmlEvent::EndElement { ref name })) if name.local_name == "child-2");

    // Add an extra call; this one returns an Error
    let _ = it.next();

    write_and_reset_position(it.source_mut(), b"</root>");
    // it.next() returns None instead of the expected EndElement
    assert_match!(it.next(), Some(Ok(XmlEvent::EndElement { ref name })) if name.local_name == "root");
    assert_match!(it.next(), Some(Ok(XmlEvent::EndDocument)));
    assert_match!(it.next(), None);

It's because of this: https://github.com/netvl/xml-rs/blob/streaming-test/src/reader/parser/mod.rs#L243

We call next() and we have data, but the parser thinks we already reached the end of the stream.
Could we just check whether new bytes are available instead of returning the last value?

@netvl (Owner) commented May 3, 2017

First, I'm actually really surprised that it returns None; it should return an error instead (about an unexpected end of stream). This is worth investigating.

Second, why do you actually need to call next() an extra time here? I don't think I'm willing to change the default behavior - if a document is legitimately incomplete, the parser should return an error upon the end of stream, per the XML spec. Maybe a separate method or a special configuration option would be better.

On second thought, I probably understand why - it's because multiple new elements are possible, and you don't know in advance how many, right? If so, then yes, I believe it is a valid case.

I think that the proper solution would be to add a configuration option which would allow retrying the read at the end of stream, and which would make the parser always return the ErrorKind::UnexpectedEof if the stream is currently at its end:

let mut it = parser.into_iter();
read_more_data: loop {
    read_data(it.source_mut());
    for e in &mut it {
        match e {
            Ok(...) => ...,  // process data
            Err(ref e) if e.is_unexpected_eof() => continue read_more_data,
            Err(e) => return Err(e),
        }
    }
}

where Error::is_unexpected_eof() is defined like this:

impl Error {
    pub fn is_unexpected_eof(&self) -> bool {
        match self.kind {
            ErrorKind::Io(e) if e.kind() == io::ErrorKind::UnexpectedEof => true,
            _ => false, 
        }
    }
}
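Assuming the refactoring above lands, the classification reduces to a plain std::io check. A minimal std-only sketch, where a free function stands in for the proposed `Error::is_unexpected_eof` method (the function and its placement are assumptions, not a released API):

```rust
use std::io;

// Std-only sketch of the proposed check: an error counts as a premature
// end of stream exactly when its io kind is UnexpectedEof. Such errors
// are retryable once more bytes arrive; anything else is fatal.
fn is_unexpected_eof(err: &io::Error) -> bool {
    err.kind() == io::ErrorKind::UnexpectedEof
}

fn main() {
    let eof = io::Error::new(io::ErrorKind::UnexpectedEof, "stream ended mid-event");
    let pipe = io::Error::new(io::ErrorKind::BrokenPipe, "peer closed the connection");
    assert!(is_unexpected_eof(&eof));   // retryable: wait for more bytes
    assert!(!is_unexpected_eof(&pipe)); // fatal: propagate to the caller
    println!("ok");
}
```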

This, of course, also requires removal of the ErrorKind::UnexpectedEof variant since it is superseded by io::ErrorKind::UnexpectedEof, which I think is actually good and should be done, while we are at it.

A naive solution for returning UnexpectedEof and allowing reading to resume once more data is available in the stream would indeed be similar to this PR - it would involve keeping the final_result empty. I think, however, that this is suboptimal, because it would only allow certain kinds of events (namely elements and probably entities), but it would fail miserably for other events, like characters, and could even lead to loss of data.

The proper solution would require modifying the parser and the lexer to read as much as they can from the stream and then either output what they could read (a characters event, for example), or return an UnexpectedEof error and go back to the previous position if the input contains only part of an event (e.g. when the stream stops inside an element). This, however, seems to require unbounded buffering, because almost any incomplete event, in particular CDATA, can be arbitrarily large. I'm not really sure how this should best be handled.
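The rollback idea can be illustrated without touching the real lexer. A toy sketch (hypothetical tokenizer, not xml-rs code): the scanner reports "unexpected eof" without consuming anything when the buffer ends mid-token, so the caller can retry the same position once more bytes arrive.

```rust
// Toy tokenizer illustrating the rollback-on-incomplete-input idea:
// scan `buf` starting at `pos` for one complete `<...>` token. If the
// buffer ends mid-token, return an error without advancing, so the
// caller keeps `pos` and retries after more data has been appended.
fn next_token(buf: &[u8], pos: usize) -> Result<(usize, &[u8]), &'static str> {
    if buf.get(pos) != Some(&b'<') {
        return Err("expected '<'");
    }
    match buf[pos..].iter().position(|&b| b == b'>') {
        Some(end) => Ok((pos + end + 1, &buf[pos..pos + end + 1])),
        None => Err("unexpected eof"), // position unchanged; retry later
    }
}

fn main() {
    let mut data = b"<a/><b".to_vec();
    let (pos, tok) = next_token(&data, 0).unwrap();
    assert_eq!(tok, &b"<a/>"[..]);
    // The second token is incomplete: report eof, consume nothing.
    assert_eq!(next_token(&data, pos), Err("unexpected eof"));
    data.extend_from_slice(b"/>"); // more bytes arrive on the stream
    let (_, tok) = next_token(&data, pos).unwrap();
    assert_eq!(tok, &b"<b/>"[..]);
    println!("ok");
}
```

The unbounded-buffering concern shows up here too: the incomplete token must stay in `data` until it can be completed, however large it grows.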

Nevertheless, I think that implementing a partial solution first does bring some benefits, since it would allow certain use cases like yours.

@Freyskeyd (Contributor, Author)

How do you want to proceed for a partial implementation?

@netvl (Owner) commented May 4, 2017

I think that it would be something like what you did in this PR, but triggered not by an external request through a special method, but by checking a special flag in the parser configuration. If you would like to do it, I would be very glad, since I don't know right now when I will be able to - I'm quite busy at my main job at the moment.

@Freyskeyd (Contributor, Author)

@netvl I think doing a stream-based analysis is a huge refactor.

A stream can contain an unclosed tag; it will be caught as an error at one point, but the next iteration will feed the end of the tag.

I really need to be able to reopen the parser. Can we consider this method as unsafe and exclude it from the documentation?

@Freyskeyd Freyskeyd force-pushed the master branch 2 times, most recently from 774498d to 4a11aae on May 11, 2017 at 17:27
@Freyskeyd (Contributor, Author)

@netvl I found a solution; let me know what you think about it. Thanks ❤️

@Freyskeyd Freyskeyd force-pushed the master branch 3 times, most recently from 8ddf8d6 to 4365c2e on May 16, 2017 at 09:47
@netvl (Owner) left a comment

Thank you! There are minor changes which need to be done, but I like the general idea. 👍

@@ -221,7 +221,7 @@ pub struct Lexer {
     skip_errors: bool,
     inside_comment: bool,
     inside_token: bool,
-    eof_handled: bool
+    pub eof_handled: bool
netvl (Owner)

I would prefer having a mutator method like fn reset_eof_handled(&mut self) instead of making the field public, simply because the other fields are not.

/// By default this option is setted to false but if you want to deal with progressive inputs
/// like a stream over tcp, you can set it to true to make the parser ignore end of stream
/// error.
pub ignore_end_of_stream: bool
netvl (Owner)

I don't want to nitpick, but I prefer the documentation to be as clean as possible :)

Whether or not the parser should ignore the end of stream. Default is false.

By default the parser will either error out when it encounters a premature end of stream or complete normally if the end of stream was expected. If you want to continue reading from a stream whose input is supplied progressively, you can set this option to true. In this case the parser will allow you to invoke the next() method even if a supposed end of stream has happened.

Note that support for this functionality is incomplete; for example, the parser will fail if the premature end of stream happens inside PCDATA. Therefore, use this option at your own risk.

@@ -120,6 +120,9 @@ impl PullParser {
pop_namespace: false
}
}

/// Tells if this parser are ignoring end of stream errors
netvl (Owner)

Checks if this parser ignores the end of stream errors.

assert!(false);
},
Some(Err(_)) => {
assert!(true);
netvl (Owner)

You don't actually need this.

match reader.next() {
None |
Some(Ok(_)) => {
assert!(false);
netvl (Owner)

It is better to write a panic!() with some helpful error message.

Signed-off-by: Freyskeyd <simon.paitrault@gmail.com>
@Freyskeyd (Contributor, Author)

@netvl I fixed it :)

@netvl (Owner) commented May 23, 2017

This is great, thanks! I'll release a new version as soon as I get to a computer with the Rust environment on it.

@netvl netvl merged commit 763ada5 into netvl:master May 23, 2017
@linkmauve linkmauve mentioned this pull request May 23, 2017
@netvl (Owner) commented Jun 1, 2017

Sorry for the delay! I just released version 0.5.0 with this and one other fix.

@netvl (Owner) commented Jun 1, 2017

It's actually 0.6.0 now, since another incompatible change has just got merged :)
