Add support for validating Regexp in Ripper #4902
Conversation
Forgot to mention, but this does nothing to address local variables as mentioned in #4898, but some quick testing seemed to indicate that MRI didn't handle this either (as the below output has a vcall rather than a var_ref towards the end.
|
Really I just want a comment on that null fragment if since it is not obvious what scenario that happens in. |
lexer.checkRegexpFragment(runtime, fragment, options); | ||
} | ||
} | ||
if (last != null && regexpFragments.size() == 1) { |
enebo
Dec 18, 2017
Member
Is this only possible for /#{ffofofofo}/ and /@{gfogfogogog}/? If so can you add a comment here as to what scenario this is for? Without printing out the lex stream I would almost think this was not possible. A comment will help later on.
Is this only possible for /#{ffofofofo}/ and /@{gfogfogogog}/? If so can you add a comment here as to what scenario this is for? Without printing out the lex stream I would almost think this was not possible. A comment will help later on.
grddev
Dec 19, 2017
Author
Contributor
Rewrote the implementation with an additional regexpDynamic
variable instead to make the logic simpler to follow instead on commenting on the cryptic logic.
Rewrote the implementation with an additional regexpDynamic
variable instead to make the logic simpler to follow instead on commenting on the cryptic logic.
} | ||
} | ||
|
||
private boolean is7BitASCII(ByteList value) { |
enebo
Dec 18, 2017
Member
Note for myself. We should make sure CR is calculated as we build up string so we do not need to rescan all bytelists after creation.
Note for myself. We should make sure CR is calculated as we build up string so we do not need to rescan all bytelists after creation.
@@ -1453,13 +1453,8 @@ public Node arg_append(Node node1, Node node2) { | |||
|
|||
// MRI: reg_fragment_check | |||
public void regexpFragmentCheck(RegexpNode end, ByteList value) { |
enebo
Dec 18, 2017
Member
Since 9.2 will be considered a major version you can remove this from ParserSupport if you want (or not) and just make callers call lexer directly. Optional.
Since 9.2 will be considered a major version you can remove this from ParserSupport if you want (or not) and just make callers call lexer directly. Optional.
grddev
Dec 19, 2017
Author
Contributor
Done
Done
@grddev sorry our last merge broke this. I did not notice you had addressed the comments so I spaced out landing this. Can you update it? |
And use the same function from both Ripper and main parser.
While not really related to Lexing, this is a component that is shared between Ripper and the main parser, and that seemed like the lesser evil.
It seems to have been `!ENCODING_IS_ASCII8BIT(str)` from the beginning, so I'm not sure why it was the opposite here.
The code was copied from Parser support, so it was clearly broken before, but it had to be fixed now as parts of the Ripper test suite relies on $! rather than explicitly catching the exception.
This uses the fact that Regexp tokenization is handled by a single StringTerm, and thus all tSTRING_CONTENT fragments are easily collectable until the tREGEXP_END comes with the options that we need for validation. The validation itself is a copied/simplified version of what is performed by the main parser, as large parts the validation depended on the AST structure, which we do not have here. Technically, this doesn't perform the validation at the same point in time as the main parser, as it performs the validation when encountering the tREGEXP_END token rather than when processing the regexp rule. I speculate that the difference doesn't really matter given that the only thing we could do with the tREGEXP_END token is to apply the regexp rule.
Use a separate variable to track whether things are dynamic or not, and use a List to avoid tracking the last element explicitly.
The methods were only retained to provide the old interface, but by directly calling the new methods in the Lexer, we can remove the old methods, given that we don't really need to be backwards compatible here.
Rebased the changes on top of master |
@grddev thanks. sorry I lost this one in the weeds :) |
This represents a stab at implementing the remaining Ripper compatibility as mentioned in #4898.
This uses the fact that Regexp tokenization is handled by a single StringTerm, and thus all tSTRING_CONTENT fragments are easily collectable until the tREGEXP_END comes with the options that we need for validation.
The validation itself is a copied/simplified version of what is performed by the main parser, as large parts the validation depended on the AST structure, which we do not have here.
Technically, this doesn't perform the validation at the same point in time as the main parser, as it performs the validation when encountering the tREGEXP_END token rather than when processing the regexp rule.
I speculate that the difference doesn't really matter given that the only thing we could do with the tREGEXP_END token is to apply the regexp rule.
In order to reduce copy-paste between the main parser and Ripper, I opted to shift some Regexp-related code into the Lexer. In theory, the code doesn't belong in the lexer, but putting it in the Lexer has some benefits. First, it is a component that is shared in a reasonable way between the two parsers. Second, it is essentially required by the proposed implementation, as the new validation takes place effectively inside the Lexer.
Unsurprisingly, it turns out that the coverage for Ripper parsing of Regexp isn't very extensive, and I haven't had time to put the code through any additional tests.