
Provide the ability to disable scanner buffer growth #197

Closed
madrob opened this issue Apr 1, 2016 · 2 comments · Fixed by #1045
Labels
enhancement Feature requests
Milestone

Comments


madrob commented Apr 1, 2016

The Lucene project hacks generated scanners to do this - see https://issues.apache.org/jira/browse/LUCENE-5897 and https://issues.apache.org/jira/browse/LUCENE-5400 for background.

@regisd regisd added the enhancement Feature requests label Nov 1, 2017
@lsf37 lsf37 added this to the 1.9.0 milestone Jan 22, 2023

lsf37 commented Jan 28, 2023

Logging some design thoughts here:

If we remove buffer growth, we are obviously putting a limit on the length of returned tokens (that is the whole purpose). That means there is a new error condition when a token reaches that length. I think it would not make sense for the scanning engine to try to recover from that condition by itself. The application might treat this as an unrecoverable error and just shut down the scanner, try to handle it in some way by returning the text matched so far, or do anything in between.

If we just removed the buffer growth code, a generated scanner would currently produce an IOException: it would request 0 characters from the Reader, therefore get 0 characters back, and reading 0 characters is already a checked error case.
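To make the failure mode concrete, here is a minimal, self-contained sketch (not generated JFlex code; `refill`, the field names, and the exception message are all hypothetical) of what a fixed-size refill could look like once growth is removed: when the buffer is full, there is no space left to request, so the only sensible move is to raise a checked exception.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class FixedBufferDemo {
    // Hypothetical refill with buffer growth disabled: if the current
    // token already fills the whole buffer, there is zero space left,
    // so signal the new error condition instead of growing.
    static int refill(Reader in, char[] buffer, int endRead) throws IOException {
        int space = buffer.length - endRead;
        if (space == 0) {
            throw new EOFException("token exceeds buffer size " + buffer.length);
        }
        return in.read(buffer, endRead, space);
    }

    public static void main(String[] args) throws IOException {
        char[] buffer = new char[8];
        Reader in = new StringReader("a-token-longer-than-the-buffer");
        int endRead = 0;
        try {
            while (true) {
                int n = refill(in, buffer, endRead);
                if (n <= 0) break;
                endRead += n;
            }
        } catch (EOFException e) {
            System.out.println("limit hit: " + e.getMessage());
        }
    }
}
```

The only design decision here is *where* the error surfaces: raising it from the refill path means the scanning loop itself needs no changes.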

In terms of recovery options for the user, depending on what kind of state the scanner keeps, one could for instance yyreset() the scanning engine with the current Reader and keep scanning. Line/column counts will be reset and lost, though. One could also add a scanner method (in standard lexer user class code) that resets only the buffer-related fields (zzEndRead, zzMarkedPos, etc.), but not other state in the scanner. Depending on what the scanner does, this could be safe and would keep line/column counts up to date. It would lose the content of the offending over-long token.
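As a sketch of that buffer-only reset idea: the field names below mirror those in JFlex-generated scanners (zzStartRead, zzMarkedPos, ...), but this is a stand-alone mock written for illustration, not generated code, and `yyresetBuffer` is a hypothetical name.

```java
public class ScannerMock {
    // Buffer-related state, as in a generated scanner.
    char[] zzBuffer = new char[16];
    int zzStartRead, zzCurrentPos, zzMarkedPos, zzEndRead;

    // Counting state we want to KEEP across a recovery.
    int yyline, yycolumn;

    /**
     * Hypothetical user-code method: discards the offending token's
     * buffered text but, unlike a full yyreset(), preserves line and
     * column counts.
     */
    void yyresetBuffer() {
        zzStartRead = 0;
        zzCurrentPos = 0;
        zzMarkedPos = 0;
        zzEndRead = 0;
    }
}
```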

What would not be easily possible is figuring out what kind of token created the error: because scanning is not finished, multiple expressions could still be matching simultaneously, some expressions might have matched but not as the longest match, or no expression might have matched yet while multiple are still possible. For now, I'm going to leave all of this unexplored, because the use case is mostly guarding against malicious input, where it would be OK to abort.
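For the abort-on-malicious-input case, the application-side handling is simple. A hedged sketch (the `TokenScanner` interface is a stand-in for a generated scanner; only the EOFException-on-over-long-token behavior is taken from the discussion above):

```java
import java.io.EOFException;
import java.io.IOException;

public class AbortOnLimitDemo {
    // Stand-in for a generated scanner's tokenizing method.
    interface TokenScanner {
        int yylex() throws IOException; // returns -1 at end of input
    }

    // Treat an over-long token as unrecoverable: stop scanning
    // and report, rather than trying to resume.
    static String scanAll(TokenScanner scanner) {
        try {
            while (scanner.yylex() != -1) {
                // consume tokens
            }
            return "ok";
        } catch (EOFException e) {
            return "aborted: token size limit exceeded";
        } catch (IOException e) {
            return "io error: " + e.getMessage();
        }
    }
}
```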


lsf37 commented Jan 28, 2023

In terms of spec language, I'm currently thinking of introducing a directive

%token_size_limit (<identifier>|<number>)

where <number> would be a compile-time limit on the buffer size, and <identifier> could name either a constant or a mutable field, declared in and adjustable from user code at runtime. It should probably be a qualified identifier rather than just a simple identifier.
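For illustration, a spec using the number-literal form of the proposed directive might look like this (the class name and rule are hypothetical; the directive syntax is the one proposed above and not yet final):

```
%%
%class LimitedScanner
%token_size_limit 4096
%%
[^ \t\n]+    { /* token handled here; an over-long token raises the limit error instead */ }
```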

lsf37 added a commit that referenced this issue Jan 28, 2023
New directive

    %token_size_limit <limit>

where <limit>, a Java number literal or qualified identifier, provides
an optional hard limit on token length.

The scanner buffer will not be increased beyond this length limit, and
if a longer token is encountered the scanner will throw an EOFException.

This is for applications that require memory limits when parsing
untrusted input.

Implements #197
@lsf37 lsf37 linked a pull request Jan 28, 2023 that will close this issue