New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve compatibility with MRI's Ripper #4898

Merged
merged 13 commits into from Dec 16, 2017

Conversation

Projects
None yet
2 participants
@grddev
Contributor

grddev commented Dec 16, 2017

After having discovered #4882 and then discovering that (block-)local variable references didn't work, it seemed like the whole process would be quicker if I looked for the solution myself, rather than wait for someone else to solve one problem before I could discover the next one.

And once I had figured out roughly how the code worked, I spent a little time to try to get all the MRI specs to pass. Apart from the tests that rely on the MRI directory layout, the only remaining issue is that JRuby's Ripper doesn't validate regular expressions during parsing. I looked into it briefly, but given that the Ripper parser effectively doesn't have access to a parse tree (as the dispatch callbacks replace the raw values), it wasn't obvious to me how to do that.

In addition to the MRI tests, this adds test/jruby/test_ripper.rb that tries to cover additional aspects of Ripper that the MRI specs do not cover. I guess these would ideally go into MRI to clarify that MRI's Ripper behavior is intentional rather than accidental.

Given that this is essentially a collection of a whole bunch of small fixes, I refer to the individual commit messages for additional information, instead of trying to summarize the changes here.

grddev added some commits Dec 12, 2017

Return false on tokadd_mbchar failure
Otherwise, an invalid byte sequence inside an identifier causes Ripper to continuously parse the same invalid byte sequence over and over.
Call tokadd_mbchar from tokenAddMBC
They are identical (after fixing tokadd_mbchar)
Use current scope in is_id_var
Otherwise, it doesn't recognize variables introduced in the block (parameters), resulting in vcall nodes instead of var_ref
Rework Ripper scope variable analysis
The previous implementation called assignable on a bunch of things that were unlikely to be strings, while only supporting string arguments. This changes the logic so that it relies on getting the identifier value from the lexer instead.

In addition, this inlines the logic from assignable in all cases where the implementation is known, and splits the cases for identifiers and constants, as they have very little to do with each other. Given that the parser has inlined the definition of MRI's user_varaible and keyword variable, it seems to make sense to also inline the assignable actions.

While inlining the assignable implementation, it seamed to make sense to also inline is_id_var (and remove its unused argument).
Align Ripper parser with main Ruby parser
Not 100% sure about the impact of this, but it seems unlikely that having two separate Ruby grammars for Ripper and the main parser would be a good thing.
Add command-args-state begin to Ripper
The corresponding construct was in the main parser, and allowed it correctly parse "m x do; end", where Ripper would fail and assign the block to x rather than m.
Rework handling of heredoc end
Instead of ending the heredoc before emitting the final string content part, this instead just emits the final string content part and positions the cursor right before the end marker, such that the next call to the heredoc term should end the heredoc using the early return.

The motivation for doing this was that with ripper the final string content of a heredoc wouldn't be dispatched to any token handler, but instead just propagated.

The old behavior was clearly wrong for this corner-case, but this approach might have other problems.
Clear command-args-state in brace blocks
It might not be important, but MRI seems to do this.
Dispatch on_magic_comment for Ripper
As an added bonus, the parsing of magic comment key values no longer stops at the first unrecognized key, and thus writing

# -*- warn_past_scope: true; coding: ISO-8859-1 -*-

will now apply the encoding even if JRuby doesn't support the magic comment attribute warn_past_scope.
Update the MRI ripper spec excludes
First, this ignores the entire file that relies on the MRI file structure, as it seems unlikely that any of those tests will ever be applicable.

Second, it adds the test_sexp.rb file which was absent from the index for some unknown reason. It also introduces a couple of excludes for failing tests in that file.

Third, it clears the excludes for a couple of tests that are currently passing, and updates the description for one of the excludes.
@enebo

This comment has been minimized.

Member

enebo commented Dec 16, 2017

@grddev This is fantastic.

If you have interest in 9.1.x supporting these changes then you might maybe be up for the challenge of opening up another PR for jruby-9.1 branch so our current 9.1.x can get these changes. Note though that a few of these commits will not cherry-pick over (notably RipperParser.y) since ruby 2.4 added new productions 2.3 does not have like arg_rhs (ah also master changed Tokens to use keyword_XXX).

If you don't want to do that work I can try cp'ing as much as is reasonable and then fix the rest.

Another test case I frequently use is Yard although we pass their specs so as you can see ripper coverage is not fantastic.

For regexp checking did you see ParserSupport.regexpFragmentCheck? I have not spent any time looking at this but that method will check syntax but only require a ByteList. For named captures as lvars perhaps that is more involved than that?

@enebo enebo added this to the JRuby 9.2.0.0 milestone Dec 16, 2017

@enebo enebo merged commit 0a4be2d into jruby:master Dec 16, 2017

1 check failed

continuous-integration/travis-ci/pr The Travis CI build failed
Details
@grddev

This comment has been minimized.

Contributor

grddev commented Dec 16, 2017

For regexp checking did you see ParserSupport.regexpFragmentCheck? I have not spent any time looking at this but that method will check syntax but only require a ByteList. For named captures as lvars perhaps that is more involved than that?

I did see that, but the main part that the fragment validation does is ensure that the encoding is valid in each fragment, which isn't really possible to do before seeing the flags for the regexp, which in turn would require the Parser/Lexer to keep track of the regular expression fragments until getting to the options. Once the fragments are recorded, it would also be relatively simple to perform full validation/compilation of the regexp if the expression is static.

I didn't even think about variable captures, and haven't looked into how MRI Ripper deals with that, so that might require some investigation as well.

Its definitely solvable, but I believe it would be complicated enough to warrant its own PR (and I'm also not sure I have the time to look into it right now).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment