New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for validating Regexp in Ripper #4902

Merged
merged 7 commits into from Jan 14, 2018

Conversation

Projects
None yet
2 participants
@grddev
Contributor

grddev commented Dec 17, 2017

This represents a stab at implementing the remaining Ripper compatibility as mentioned in #4898.

This uses the fact that Regexp tokenization is handled by a single StringTerm, and thus all tSTRING_CONTENT fragments are easily collectable until the tREGEXP_END comes with the options that we need for validation.

The validation itself is a copied/simplified version of what is performed by the main parser, as large parts the validation depended on the AST structure, which we do not have here.

Technically, this doesn't perform the validation at the same point in time as the main parser, as it performs the validation when encountering the tREGEXP_END token rather than when processing the regexp rule.

I speculate that the difference doesn't really matter given that the only thing we could do with the tREGEXP_END token is to apply the regexp rule.

In order to reduce copy-paste between the main parser and Ripper, I opted to shift some Regexp-related code into the Lexer. In theory, the code doesn't belong in the lexer, but putting it in the Lexer has some benefits. First, it is a component that is shared in a reasonable way between the two parsers. Second, it is essentially required by the proposed implementation, as the new validation takes place effectively inside the Lexer.

Unsurprisingly, it turns out that the coverage for Ripper parsing of Regexp isn't very extensive, and I haven't had time to put the code through any additional tests.

@grddev

This comment has been minimized.

Contributor

grddev commented Dec 17, 2017

Forgot to mention, but this does nothing to address local variables as mentioned in #4898, but some quick testing seemed to indicate that MRI didn't handle this either (as the below output has a vcall rather than a var_ref towards the end.

% RBENV_VERSION=2.4.2 ruby -rripper -e 'p Ripper.sexp("/\$(?<dollars>\d+)\.(?<cents>\d+)/ =~ \"$3.67\"\ndollars")'
[:program, [[:binary, [:regexp_literal, [[:@tstring_content, "$(?<dollars>d+).(?<cents>d+)", [1, 1]]], [:@regexp_end, "/", [1, 29]]], :=~, [:string_literal, [:string_content, [:@tstring_content, "$3.67", [1, 35]]]]], [:vcall, [:@ident, "dollars", [2, 0]]]]]
@enebo

Really I just want a comment on that null fragment if since it is not obvious what scenario that happens in.

lexer.checkRegexpFragment(runtime, fragment, options);
}
}
if (last != null && regexpFragments.size() == 1) {

This comment has been minimized.

@enebo

enebo Dec 18, 2017

Member

Is this only possible for /#{ffofofofo}/ and /@{gfogfogogog}/? If so can you add a comment here as to what scenario this is for? Without printing out the lex stream I would almost think this was not possible. A comment will help later on.

This comment has been minimized.

@grddev

grddev Dec 19, 2017

Contributor

Rewrote the implementation with an additional regexpDynamic variable instead to make the logic simpler to follow instead on commenting on the cryptic logic.

}
}
private boolean is7BitASCII(ByteList value) {

This comment has been minimized.

@enebo

enebo Dec 18, 2017

Member

Note for myself. We should make sure CR is calculated as we build up string so we do not need to rescan all bytelists after creation.

@@ -1453,13 +1453,8 @@ public Node arg_append(Node node1, Node node2) {
// MRI: reg_fragment_check
public void regexpFragmentCheck(RegexpNode end, ByteList value) {

This comment has been minimized.

@enebo

enebo Dec 18, 2017

Member

Since 9.2 will be considered a major version you can remove this from ParserSupport if you want (or not) and just make callers call lexer directly. Optional.

This comment has been minimized.

@grddev

grddev Dec 19, 2017

Contributor

Done

@enebo

This comment has been minimized.

Member

enebo commented Jan 10, 2018

@grddev sorry our last merge broke this. I did not notice you had addressed the comments so I spaced out landing this. Can you update it?

grddev added some commits Dec 17, 2017

Extract parseRegexpFlags to Lexer
And use the same function from both Ripper and main parser.
Move Regexp checking from ParserSupport to Lexer
While not really related to Lexing, this is a component that is shared between Ripper and the main parser, and that seemed like the lesser evil.
Align validation code with MRI
It seems to have been `!ENCODING_IS_ASCII8BIT(str)` from the beginning, so I'm not sure why it was the opposite here.
Make sure to clear $! when rescuing RaiseException
The code was copied from Parser support, so it was clearly broken before, but it had to be fixed now as parts of the Ripper test suite relies on $! rather than explicitly catching the exception.
Add support for validating Regexp in Ripper
This uses the fact that Regexp tokenization is handled by a single StringTerm, and thus all tSTRING_CONTENT fragments are easily collectable until the tREGEXP_END comes with the options that we need for validation.

The validation itself is a copied/simplified version of what is performed by the main parser, as large parts the validation depended on the AST structure, which we do not have here.

Technically, this doesn't perform the validation at the same point in time as the main parser, as it performs the validation when encountering the tREGEXP_END token rather than when processing the regexp rule.

I speculate that the difference doesn't really matter given that the only thing we could do with the tREGEXP_END token is to apply the regexp rule.
Simplify the regexp validation logic
Use a separate variable to track whether things are dynamic or not, and use a List to avoid tracking the last element explicitly.
Inline the regexp validation inside ParserSupport
The methods were only retained to provide the old interface, but by directly calling the new methods in the Lexer, we can remove the old methods, given that we don't really need to be backwards compatible here.
@grddev

This comment has been minimized.

Contributor

grddev commented Jan 14, 2018

Rebased the changes on top of master

@enebo enebo merged commit 2614c77 into jruby:master Jan 14, 2018

1 check failed

continuous-integration/travis-ci/pr The Travis CI build could not complete due to an error
Details
@enebo

This comment has been minimized.

Member

enebo commented Jan 14, 2018

@grddev thanks. sorry I lost this one in the weeds :)

@enebo enebo added this to the JRuby 9.2.0.0 milestone Feb 21, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment