Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Re-enable scanning interactively or from a network byte stream [sf#130] #131

Closed
lsf37 opened this issue Feb 15, 2015 · 24 comments
Closed
Labels
bug Not working as intended
Milestone

Comments

@lsf37
Copy link
Member

lsf37 commented Feb 15, 2015

Reported by jeningar on 2014-08-05 13:14 UTC
Hi,

We have an application that receives commands over a TCP/IP socket. After each command a reply is sent. Normally the communication is interactive, which means a strict order of questions and answers. This works perfectly with jflex 1.4.x. (The grammar is easy from the jflex point of view. Every command is terminated by a semicolon, no read ahead necessary; this follows your FAQ answer on "I want my scanner to read from a network byte stream or from interactive stdin. Can I do this with JFlex?").

Now we have some problems with Jflex 1.6.0.

Jflex generates following code in zzRefill():

...
    int requested = zzBuffer.length - zzEndRead;
    int totalRead = 0;
    while (totalRead < requested) {
      int numRead = zzReader.read(zzBuffer, zzEndRead + totalRead, requested - totalRead);
      if (numRead == -1) {
        break;
      }
      totalRead += numRead;
    }
...

This code reads bytes from the input as long as the buffer isn't full or until EOF is reached. This is nice in case of reading a file, but leads to deadlocks if using an interactive scanner.

The easiest way to avoid the deadlock is to eliminate the while loop. If the assumption that a read() returns at least one character (or EOF) isn't valid, the while loop must be executed as long as nothing is read. (I tested this; it seems to work flawlessly).

Since I assume there was a good reason for this piece of code, I'd like to have a command line option to eliminate the while loop here.
I'd be delighted if someone would explain me why a repeated call of zzRefill() is worse than the while loop. It can't really be a performance issue. The costs of an I/O exceed by far the costs of a function call and a bit of basic arithmetic.

Regards,

Ronald

@lsf37 lsf37 changed the title Re-enable scanning interactively or from a network byte stream [Bug] Re-enable scanning interactively or from a network byte stream [sf#130] Feb 15, 2015
@lsf37 lsf37 added this to the feature request milestone Feb 15, 2015
@lsf37
Copy link
Member Author

lsf37 commented Feb 15, 2015

Updated by steve_rowe on 2014-08-18 23:22 UTC

  • status: open --> open-fixed
  • assigned_to: Steve Rowe

@lsf37
Copy link
Member Author

lsf37 commented Feb 15, 2015

Commented by steve_rowe on 2014-08-18 23:22 UTC
I put that while loop in place when I was making changes to the skeleton to enable supplementary character handling (chars above the Unicode basic multilingual plane). I had forgotten about the interactive/network byte stream use cases (I've never used JFlex in those ways myself).

I've removed the while loop, and put back in place the previous code to perform a single retry if zero characters are read.

The changes are in r852 (for trunk; I'll merge the changes into the 1.6.X branch after I create it later today). This change will be part of the 1.6.1 JFlex release.

Thanks Ronald!

@lsf37
Copy link
Member Author

lsf37 commented Feb 15, 2015

Updated by steve_rowe on 2014-08-18 23:55 UTC

  • summary: configurable read strategy --> Re-enable scanning interactively or from a network byte stream

@lsf37
Copy link
Member Author

lsf37 commented Feb 15, 2015

Commented by dawidweiss on 2014-09-18 12:06 UTC
I think the solution in the code is wrong though -- there should be a while loop running until there's either an EOF or at least one byte received from the input. The single 'if' retry is not enough (if the stream can return 0 as the result, it can do it twice, but still have lingering input).

Patch attached.

@lsf37
Copy link
Member Author

lsf37 commented Feb 15, 2015

Commented by steve_rowe on 2014-09-18 20:16 UTC
Ronald, will the changes in Dawid's patch work for your interactive scanner?

Also, we don't have an interactive scanner test case under testsuite/testcases/src/test/cases/ - would you be willing to contribute one? It's hard for me to evaluate changes like this without a test.

@lsf37
Copy link
Member Author

lsf37 commented Feb 15, 2015

Commented by dawidweiss on 2014-09-18 21:37 UTC
Come on, it's not that difficult :) It's actually a typical error in handling stream reads -- read(...) returning 0 is not an EOF and should simply retry.

What my patch does will work for Ronald assuming his input reader is not doing any internal buffering (which it shouldn't if it's a simple socket-based read). The patch simply corrects the current behavior in which the stream returning 0 twice will cause the handler to return EOF (which isn't the case).

I would add a test case... if they were in some regular format, but I'm kind of lost in the current testing harness :)

@lsf37
Copy link
Member Author

lsf37 commented Feb 15, 2015

Commented by dawidweiss on 2014-09-18 22:17 UTC
I've put together a stand-alone test case, hope you can integrate it with jflex's architecture (I tried, but failed).

to compile, run:

java -jar jflex-1.7.0-SNAPSHOT.jar refill.flex && javac *.java && java -cp . Refill

The current code prints (incorrectly):

Reading "refill.flex"
Constructing NFA : 6 states in NFA
Converting NFA to DFA :
..
4 states before minimization, 3 states in minimized DFA
Old file "Echo.java" saved as "Echo.java~"
Writing code to "Echo.java"

while after patching it prints:

Reading "refill.flex"
Constructing NFA : 6 states in NFA
Converting NFA to DFA :
..
4 states before minimization, 3 states in minimized DFA
Old file "Echo.java" saved as "Echo.java~"
Writing code to "Echo.java"

Letter received: a
Letter received: b
Letter received: c

The reads are not buffered as you can clearly see because I added an explicit 1-second interval between each parsed letter.

@lsf37
Copy link
Member Author

lsf37 commented Feb 15, 2015

Commented by jeningar on 2014-09-19 05:41 UTC
Hi Steve, hi Dawid,

Dawid's change seems absolutely correct to me. In fact it covers the case I already mentioned in my first post:

If the assumption that a read() returns at least one character (or EOF) isn't valid, the while loop must be executed as long as nothing is read.

And I guess that the general case is such that anything can happen, including a read() returning after having read nothing (though that will the exception rather than the rule). The code I tested was the original code having "requested" in the while condition replaced by the literal 1. (see above).
Though now expressed slightly differently, the actual code is semantically the same.

I regard Dawid's testcase absolutely convincing and wouldn't know how to improve the test.

I'll test our software with the latest jflex and Dawid's patch this weekend.
If I encounter any problems, I'll repair it and send a patch. If not, I'll report success.

@lsf37
Copy link
Member Author

lsf37 commented Feb 15, 2015

Commented by dawidweiss on 2014-09-19 06:50 UTC

including a read() returning after having read nothing

This has happened to me a few times in the past, actually. It's possible when you have, for example, filters over your actual I/O stream -- then the filtering stream can eat up some of the buffer and return zero bytes consecutively, without asking for more content.

@lsf37
Copy link
Member Author

lsf37 commented Feb 15, 2015

Commented by jeningar on 2014-09-19 07:28 UTC
In my opinion a blocking reader should always return at least 1 item or an EOF. A blocking reader returning 0 items looks like a design flaw to me. Nevertheless, jflex wants to handle abstract data sources and returning 0 items can be a property of such an abstract source as you pointed out.
In fact, the current code (with the while() instead of the if()) is even capable of handling non-blocking readers, although it would be an expensive way of heating the room ;-)
PS. please excuse me for initially misspelling your name

@lsf37
Copy link
Member Author

lsf37 commented Feb 15, 2015

Commented by dawidweiss on 2014-09-19 08:02 UTC
Yeah, unfortunately this isn't an ideal world in which everything works as advertised... It's not even that there are many non-blocking (spin-looping) readers out there; it's just that it does happen with third party code that occasionally such situations do happen. Better safe than sorry.

@lsf37
Copy link
Member Author

lsf37 commented Feb 15, 2015

Commented by jeningar on 2014-09-23 08:43 UTC
The good news is that the interactive scanner seems to work (both with "if" and "while").
The bad news is that it runs into some other flaw which doesn't occur using jflex 1.4.x. I'm still investigating the situation. To point at the direction, there's a difference between

echo "list sessions;" | sdmsh

and

echo "list sessions;" > /tmp/x
sdmsh < /tmp/x

as the first commandline hangs (wrong), the second terminates (correct).
There shouldn't be a difference in behaviour though.

As soon as I've found the cause, I'll drop a note (my mistake) or report a bug (jflex mistake).

@lsf37
Copy link
Member Author

lsf37 commented Feb 15, 2015

Commented by dawidweiss on 2014-09-23 16:38 UTC
Dump the stack trace from the hung process, Ronald. It's very odd as technically a pipe and a shell redirect are not possible to be distinguished from Java level.

As for the 'if' vs. 'while' -- only two options are correct, really:

  • strict mode: throw an exception on the input stream returning zero (not conforming to contract),
  • lenient mode: on zero returned from read() retry until EOF or some bytes are returned from stream. you could yield() or wait, but I don't think it makes sense (typically there should be SOME I/O at the end of the readers chain anyway).

The double 'if' idiom is incorrect (given 0-returning input), as my simple example showed. That it works for you is accidental (apparently your input doesn't return 0 twice in a sequence, but it just as well could).

@lsf37
Copy link
Member Author

lsf37 commented Feb 15, 2015

Commented by jeningar on 2014-09-23 16:59 UTC
Hi Dawid,

I've already found the bug which is in skeleton.nested (see bug #132). Better said I found out how to repair it :-)

I fully agree with you regarding if vs while. I tested both if and while, and both work because the base java classes (unfiltered, so to say) obviously conform to the contract.

The lenient mode will be better. In the normal case there's no difference in executed code (the read will return at least one byte or an EOF). But it is the robust implementation.

In the diff of the skeleton.nested in bug 132, you can see I replaced 'if' with 'while'.

Regarding the stack traces. That's a bit hard to get. If I redirect the file, everything's over before I can even start looking for a pid. In the other case, zzrefill() is waiting in a read().

@lsf37
Copy link
Member Author

lsf37 commented Feb 15, 2015

Commented by dawidweiss on 2014-09-23 17:41 UTC
I don't know anything about the .nested skeleton, it looks like an odd concept to me (although I can see some benefits for people who want to dynamically include stuff while preserving the parser state).

Thanks for clarifying.

@lsf37
Copy link
Member Author

lsf37 commented Feb 15, 2015

Commented by jeningar on 2014-09-23 18:31 UTC
Well, you might or might not know mysql (as an example).
if you use mysql interactively (that can be a script that is piped into the tool of course), you can

source somefile.sql

which then executes the statements contained in that file. (This is true for about every sql command line frontend I know, which are quite a few).
The same is true for the command line interface to our scheduling system.

In order to do so, you'd have to save the internal state of the scanner and save the buffer. You can't do that without help of jflex. Hence the "nested" skeleton.

@lsf37
Copy link
Member Author

lsf37 commented Feb 15, 2015

Commented by dawidweiss on 2014-09-23 18:45 UTC
I do know mysql :)

What I meant is that typically when doing parser includes you'd restart nested new parsing context instead of reusing the same parser for the include... unless you really want the inclusion to support partial state evaluation, etc.

I'm not arguing to remove it, of course -- different things for different folks.

@lsf37 lsf37 added the bug Not working as intended label Feb 15, 2015
@lsf37 lsf37 removed this from the feature request milestone Feb 15, 2015
@lsf37
Copy link
Member Author

lsf37 commented Feb 15, 2015

Fixed in 9a226be

@dweiss
Copy link

dweiss commented Feb 15, 2015

If 9a226be is the fix then I don't agree with it. It's not a fix -- it's a (bad) workaround for certain code patterns using the parser. If you need to terminate the scanner on 0 returned from the input stream then you should wrap that stream and pass the wrapper to the parser; the wrapper would emit a proper EOF on a condition of "no more input available" -- if it knows that indeed the underlying stream (network, or other) has such semantics.

In general, readers and input streams are allowed to return 0 as many times as they need and this should not be a cause of breaking out of the parser loop.

@lsf37
Copy link
Member Author

lsf37 commented Feb 15, 2015

Yes, looks like this needs to be revisited. Leaving the issue open.

@lsf37 lsf37 added this to the 1.6.1 milestone Feb 15, 2015
@lsf37
Copy link
Member Author

lsf37 commented Feb 21, 2015

Alright, sorry for getting into this discussion very late. Let me record for posteriority that the spec for java.io.Reader.read says

 * Reads characters into a portion of an array.  This method will block
 * until some input is available, an I/O error occurs, or the end of the
 * stream is reached.

which, I'd argue, means it can never return 0 (unless the meaning of the words some input is not what I think).

As has been pointed out, not everything works according to spec. Back in the day, I added the single if because that fixed the only occurrences of a Reader returning 0 that we had encountered. I agree this was a hack and not a good solution (Steve just reverted the skeleton to that old code).

I also agree that there are basically two ways to deal with this: loop or throw an exception/error.

Neither will fix all situations. Obviously you will get the exception in strict mode for some Readers, since there are some that return 0. It is possible, but annoying for users to work around that. The loop has the problem of non-termination if we always get 0. A bit of googling reveals that this can happen: there are InputStream implementations such as javax.sound.AudioInputStream#read(byte[] b, int off, int len) that will always return 0 unless you happen to request the right number of characters (here more than the frame size). Wrap that in an InputStreamReader, and you'll always get 0 if the scanner buffer is too small. Again, you can work around that by increasing buffer size or wrapping the Reader.

Come to think of it, there is a third alternative: you could introduce a configurable arbitrary limit for the number of iterations in the while loop (including no limit). By default, the limit could be large, so if you happen to work with a nonconforming Reader that always returns 0, you'll get an exception, but the usual interactive cases will work. And if you know what you're doing and there will be input eventually, you can always configure it to "no limit". The danger with this one is that you will get sporadic exceptions with interactive Readers that return 0 very often but not always (i.e. it may be hard to figure out what the real limit should be or that you should set it to "no limit"). Then again, there is only so much we can do in the scanner about broken input code.

Any opinions/feedback?

@dweiss
Copy link

dweiss commented Feb 21, 2015

I would give my head that the spec says readers are allowed to return 0 bytes from array-read... life is full of surprises, thanks for proving me wrong, Gerwin.

In any case (and unfortunately) real life zero-returning streams (especially filtering streams) are pretty common and their semantics is definitely not equivalent with an EOS marker. So if I were to choose I'd lean towards a "clean" solution of respecting the contract literally -- throw an IOException if read returns zero. This may seem like a breaker, but it really isn't -- it forces both camps to come up with workarounds for bad streams above the parser layer. Those who need zero to be treated like an EOF can wrap with an appropriate filter. Those who pass the parser with a reader that is non spec-confirming would have to deal with it somehow (and one of the options is, again, to write a filter that retries, yields, interrupts, whatever on zero returned from read(byte[]).

I honestly don't think this should be dealt with anyhow at the parser level.

A little bit on the side of the above I would love to see the parser code to be more tweakable. As a rationale you can take a look at the build we have in Apache Lucene -- doing certain things (like removing finalization from certain variables) requires dirty hacks that feel... well, dirty.

https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/build.xml#L84

@lsf37
Copy link
Member Author

lsf37 commented Feb 21, 2015

There is certainly something to be said for failing early if the Reader doesn't conform to spec. Since it does seem fairly common, we could add an example in the distribution with a wrapper that reads from interactive stdin or similar so people aren't completely left alone.

Btw, I finally remembered why the original hack had the weird one-time read() in it: note that it is read() without parameters and not the array read. The Readers I was looking at back then had different implementations for read(), overriding the default one (which is just a length 1 array read). They always returned at least one character or EOF, since they can't really do anything else apart from throwing an exception. One character was all I needed to get over the 0 problem. All that can be done in a wrapped Reader as well, though.

As for more tweakable code: in principle, sure. The question is how to implement it without going into complete config option hell. Adding an option for leaving out final on ZZ_BUFFER_SIZE would not be too bad, though.

Looking at https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/build.xml#L84 you can possibly avoid some of these with using a custom skeleton file or maintaining a patch for the default skeleton. Not the prettiest either, but maybe a bit more explicit.

@dweiss
Copy link

dweiss commented Feb 23, 2015

I think adding an example but making the code more strict with regard to the spec sounds very sensible. In fact, there could be two examples -- one eagerly terminating parsing, the other just catering for the non-spec conformant reader (sleeping for a few milliseconds and retrying, for example).

Thanks for providing the hack's rationale. I know exactly what you're talking about -- there are more caveats with streams. The one I've hit in the past (which was really nasty) was with FilterInputStream. The thing with this class is that it delegates to the target but inconsistently -- read() and read(byte[],int,int) do it but read(byte[]) invokes read(byte[],int,int). The problem is if you override only a subset of these methods (for example read()); then any other read will bypass your "filtering" code. I've seen a lot of non-intuitive bugs arising from this pattern. Eh, life.

As for more tweakable code - yes, a custom pattern is of course doable. I was wondering if this could be done in any other way but don't have any brilliant ideas to share (yet). It's on my (long) todo list; perhaps I'll return to it once I'm back from short holidays I'm on right now :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Not working as intended
Projects
None yet
Development

No branches or pull requests

2 participants