-
Notifications
You must be signed in to change notification settings - Fork 117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug] Re-enable scanning interactively or from a network byte stream [sf#130] #131
Comments
Updated by steve_rowe on 2014-08-18 23:22 UTC
|
Commented by steve_rowe on 2014-08-18 23:22 UTC I've removed the while loop, and put back in place the previous code to perform a single retry if zero characters are read. The changes are in r852 (for trunk; I'll merge the changes into the 1.6.X branch after I create it later today). This change will be part of the 1.6.1 JFlex release. Thanks Ronald! |
Updated by steve_rowe on 2014-08-18 23:55 UTC
|
Commented by dawidweiss on 2014-09-18 12:06 UTC Patch attached. |
Commented by steve_rowe on 2014-09-18 20:16 UTC Also, we don't have an interactive scanner test case under testsuite/testcases/src/test/cases/ - would you be willing to contribute one? It's hard for me to evaluate changes like this without a test. |
Commented by dawidweiss on 2014-09-18 21:37 UTC What my patch does will work for Ronald assuming his input reader is not doing any internal buffering (which it shouldn't if it's a simple socket-based read). The patch simply corrects the current behavior in which the stream returning 0 twice will cause the handler to return EOF (which isn't the case). I would add a test case... if they were in some regular format, but I'm kind of lost in the current testing harness :) |
Commented by dawidweiss on 2014-09-18 22:17 UTC to compile, run: java -jar jflex-1.7.0-SNAPSHOT.jar refill.flex && javac *.java && java -cp . Refill The current code prints (incorrectly): Reading "refill.flex" while after patching it prints: Reading "refill.flex" The reads are not buffered as you can clearly see because I added an explicit 1-second interval between each parsed letter. |
Commented by jeningar on 2014-09-19 05:41 UTC Dawid's change seems absolutely correct to me. In fact it covers the case I already mentioned in my first post:
And I guess that the general case is such that anything can happen, including a read() returning after having read nothing (though that will the exception rather than the rule). The code I tested was the original code having "requested" in the while condition replaced by the literal 1. (see above). I regard Dawid's testcase absolutely convincing and wouldn't know how to improve the test. I'll test our software with the latest jflex and Dawid's patch this weekend. |
Commented by dawidweiss on 2014-09-19 06:50 UTC
This has happened to me a few times in the past, actually. It's possible when you have, for example, filters over your actual I/O stream -- then the filtering stream can eat up some of the buffer and return zero bytes consecutively, without asking for more content. |
Commented by jeningar on 2014-09-19 07:28 UTC |
Commented by dawidweiss on 2014-09-19 08:02 UTC |
Commented by jeningar on 2014-09-23 08:43 UTC echo "list sessions;" | sdmsh and echo "list sessions;" > /tmp/x as the first commandline hangs (wrong), the second terminates (correct). As soon as I've found the cause, I'll drop a note (my mistake) or report a bug (jflex mistake). |
Commented by dawidweiss on 2014-09-23 16:38 UTC As for the 'if' vs. 'while' -- only two options are correct, really:
The double 'if' idiom is incorrect (given 0-returning input), as my simple example showed. That it works for you is accidental (apparently your input doesn't return 0 twice in a sequence, but it just as well could). |
Commented by jeningar on 2014-09-23 16:59 UTC I've already found the bug which is in skeleton.nested (see bug #132). Better said I found out how to repair it :-) I fully agree with you regarding if vs while. I tested both if and while, and both work because the base java classes (unfiltered, so to say) obviously conform to the contract. The lenient mode will be better. In the normal case there's no difference in executed code (the read will return at least one byte or an EOF). But it is the robust implementation. In the diff of the skeleton.nested in bug 132, you can see I replaced 'if' with 'while'. Regarding the stack traces. That's a bit hard to get. If I redirect the file, everything's over before I can even start looking for a pid. In the other case, zzrefill() is waiting in a read(). |
Commented by dawidweiss on 2014-09-23 17:41 UTC Thanks for clarifying. |
Commented by jeningar on 2014-09-23 18:31 UTC source somefile.sql which then executes the statements contained in that file. (This is true for about every sql command line frontend I know, which are quite a few). In order to do so, you'd have to save the internal state of the scanner and save the buffer. You can't do that without help of jflex. Hence the "nested" skeleton. |
Commented by dawidweiss on 2014-09-23 18:45 UTC What I meant is that typically when doing parser includes you'd restart nested new parsing context instead of reusing the same parser for the include... unless you really want the inclusion to support partial state evaluation, etc. I'm not arguing to remove it, of course -- different things for different folks. |
Fixed in 9a226be |
If 9a226be is the fix then I don't agree with it. It's not a fix -- it's a (bad) workaround for certain code patterns using the parser. If you need to terminate the scanner on 0 returned from the input stream then you should wrap that stream and pass the wrapper to the parser; the wrapper would emit a proper EOF on a condition of "no more input available" -- if it knows that indeed the underlying stream (network, or other) has such semantics. In general, readers and input streams are allowed to return 0 as many times as they need and this should not be a cause of breaking out of the parser loop. |
Yes, looks like this needs to be revisited. Leaving the issue open. |
Alright, sorry for getting into this discussion very late. Let me record for posteriority that the spec for
which, I'd argue, means it can never return 0 (unless the meaning of the words As has been pointed out, not everything works according to spec. Back in the day, I added the single I also agree that there are basically two ways to deal with this: loop or throw an exception/error. Neither will fix all situations. Obviously you will get the exception in strict mode for some Readers, since there are some that return 0. It is possible, but annoying for users to work around that. The loop has the problem of non-termination if we always get 0. A bit of googling reveals that this can happen: there are InputStream implementations such as Come to think of it, there is a third alternative: you could introduce a configurable arbitrary limit for the number of iterations in the while loop (including no limit). By default, the limit could be large, so if you happen to work with a nonconforming Reader that always returns 0, you'll get an exception, but the usual interactive cases will work. And if you know what you're doing and there will be input eventually, you can always configure it to "no limit". The danger with this one is that you will get sporadic exceptions with interactive Readers that return 0 very often but not always (i.e. it may be hard to figure out what the real limit should be or that you should set it to "no limit"). Then again, there is only so much we can do in the scanner about broken input code. Any opinions/feedback? |
I would give my head that the spec says readers are allowed to return 0 bytes from array-read... life is full of surprises, thanks for proving me wrong, Gerwin. In any case (and unfortunately) real life zero-returning streams (especially filtering streams) are pretty common and their semantics is definitely not equivalent with an EOS marker. So if I were to choose I'd lean towards a "clean" solution of respecting the contract literally -- throw an IOException if read returns zero. This may seem like a breaker, but it really isn't -- it forces both camps to come up with workarounds for bad streams above the parser layer. Those who need zero to be treated like an EOF can wrap with an appropriate filter. Those who pass the parser with a reader that is non spec-confirming would have to deal with it somehow (and one of the options is, again, to write a filter that retries, yields, interrupts, whatever on zero returned from read(byte[]). I honestly don't think this should be dealt with anyhow at the parser level. A little bit on the side of the above I would love to see the parser code to be more tweakable. As a rationale you can take a look at the build we have in Apache Lucene -- doing certain things (like removing finalization from certain variables) requires dirty hacks that feel... well, dirty. https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/build.xml#L84 |
There is certainly something to be said for failing early if the Reader doesn't conform to spec. Since it does seem fairly common, we could add an example in the distribution with a wrapper that reads from interactive stdin or similar so people aren't completely left alone. Btw, I finally remembered why the original hack had the weird one-time As for more tweakable code: in principle, sure. The question is how to implement it without going into complete config option hell. Adding an option for leaving out Looking at https://github.com/apache/lucene-solr/blob/trunk/lucene/analysis/common/build.xml#L84 you can possibly avoid some of these with using a custom skeleton file or maintaining a patch for the default skeleton. Not the prettiest either, but maybe a bit more explicit. |
I think adding an example but making the code more strict with regard to the spec sounds very sensible. In fact, there could be two examples -- one eagerly terminating parsing, the other just catering for the non-spec conformant reader (sleeping for a few milliseconds and retrying, for example). Thanks for providing the hack's rationale. I know exactly what you're talking about -- there are more caveats with streams. The one I've hit in the past (which was really nasty) was with FilterInputStream. The thing with this class is that it delegates to the target but inconsistently -- read() and read(byte[],int,int) do it but read(byte[]) invokes read(byte[],int,int). The problem is if you override only a subset of these methods (for example read()); then any other read will bypass your "filtering" code. I've seen a lot of non-intuitive bugs arising from this pattern. Eh, life. As for more tweakable code - yes, a custom pattern is of course doable. I was wondering if this could be done in any other way but don't have any brilliant ideas to share (yet). It's on my (long) todo list; perhaps I'll return to it once I'm back from short holidays I'm on right now :) |
Reported by jeningar on 2014-08-05 13:14 UTC
Hi,
We have an application that receives commands over a TCP/IP socket. After each command a reply is sent. Normally the communication is interactive, which means a strict order of questions and answers. This works perfectly with jflex 1.4.x. (The grammar is easy from the jflex point of view. Every command is terminated by a semicolon, no read ahead necessary; this follows your FAQ answer on "I want my scanner to read from a network byte stream or from interactive stdin. Can I do this with JFlex?").
Now we have some problems with Jflex 1.6.0.
Jflex generates following code in zzRefill():
This code reads bytes from the input as long as the buffer isn't full or until EOF is reached. This is nice in case of reading a file, but leads to deadlocks if using an interactive scanner.
The easiest way to avoid the deadlock is to eliminate the while loop. If the assumption that a read() returns at least one character (or EOF) isn't valid, the while loop must be executed as long as nothing is read. (I tested this; it seems to work flawlessly).
Since I assume there was a good reason for this piece of code, I'd like to have a command line option to eliminate the while loop here.
I'd be delighted if someone would explain me why a repeated call of zzRefill() is worse than the while loop. It can't really be a performance issue. The costs of an I/O exceed by far the costs of a function call and a bit of basic arithmetic.
Regards,
Ronald
The text was updated successfully, but these errors were encountered: