Provide the ability to disable scanner buffer growth #197
Logging some design thoughts here:

If we remove buffer growth, we are obviously putting a limit on the length of returned tokens (that is the whole purpose). That means there is a new error condition when we reach a token of that length. I think it would not make sense for the scanning engine to try to recover from that condition by itself. The application might treat it as an unrecoverable error and just shut down the scanner, or try to handle it in some way, e.g. by returning the text matched so far, or anything in between.

If we just removed the buffer growth code, generated scanners would currently produce an IOException, because the scanner would request 0 characters from the Reader and therefore get 0 characters back, which is already a checked error case.

Which recovery options are open to the user depends on what kind of state the scanner keeps. What would not be easily possible is figuring out what kind of token it was that created the error: because scanning is not finished, multiple expressions could still be matching simultaneously, some expressions might have matched but not be the longest match, or no expression might have matched yet while multiple are still possible to match.

For now, I'm going to leave all of these unexplored, because the use case is mostly guarding against malicious input, where it would be OK to abort.
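A minimal sketch of what abort-style handling could look like on the application side, assuming a hypothetical generated class `MyScanner` (with the usual `%int` interface: `int yylex() throws IOException`, `YYEOF` at end of input) and the `EOFException`-on-overflow behavior proposed in the commit message below:

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ScanWithLimit {
    public static void main(String[] args) throws IOException {
        Reader input = new StringReader("some possibly malicious input");
        // "MyScanner" is a hypothetical JFlex-generated scanner class.
        MyScanner scanner = new MyScanner(input);
        try {
            int token;
            while ((token = scanner.yylex()) != MyScanner.YYEOF) {
                // Normal token processing.
                System.out.println("token " + token + ": " + scanner.yytext());
            }
        } catch (EOFException e) {
            // Under the proposal, a token longer than the configured limit
            // surfaces as an EOFException. The simplest recovery is to abort.
            System.err.println("Token exceeded size limit; aborting scan.");
        }
    }
}
```

Catching EOFException separately from other IOExceptions would let the application distinguish "token too long" from genuine I/O failures, which fits the abort-on-malicious-input use case described above.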
In terms of spec language, I'm currently thinking of introducing a directive %token_size_limit <limit>, where <limit> is a Java number literal or qualified identifier.
New directive %token_size_limit <limit>, where <limit> is a Java number literal or qualified identifier, provides an optional hard limit on token length. The scanner buffer will not be grown beyond this limit, and if a longer token is encountered the scanner will throw an EOFException. This is for applications that require memory limits when parsing untrusted input. Implements #197
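For illustration, a minimal .flex spec sketch using the proposed directive. The directive name and argument form come from the commit message above; everything else (class name, rules, the 4096 value) is made up for the example:

```jflex
%%
%class MyScanner
%int
/* Proposed: never grow the scanner buffer past 4096 chars; a token
   longer than that makes yylex() throw java.io.EOFException. */
%token_size_limit 4096
%%
[a-zA-Z]+    { return 1; /* word */ }
[ \t\r\n]+   { /* skip whitespace */ }
[^]          { return 2; /* any other single character */ }
```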
The Lucene project hacks generated scanners to do this - see https://issues.apache.org/jira/browse/LUCENE-5897 and https://issues.apache.org/jira/browse/LUCENE-5400 for background.