Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Candidate sets for Antlr grammars #37

Open
kaby76 opened this issue Dec 21, 2019 · 11 comments
Open

Candidate sets for Antlr grammars #37

kaby76 opened this issue Dec 21, 2019 · 11 comments
Labels

Comments

@kaby76
Copy link

@kaby76 kaby76 commented Dec 21, 2019

Mike,

Thanks for this library.

I am interested in computing the set of possible lookaheads in an Antlr grammar itself, specifically Expr.g4, defined at https://github.com/mike-lischke/antlr4-c3/blob/master/ports/c%23/test/Antlr4CodeCompletion.CoreUnitTest/Grammar/Expr.g4.

I've ported your library to use Antlr4.Runtime.Standard (v4.7.2), and updated the code to use the Java Antlr tool generator (https://github.com/kaby76/antlr4-c3). (Note, unless there is a compelling reason why people keep using Harwell's Antlr4cs generator, https://github.com/tunnelvisionlabs/antlr4cs, I prefer to use the standard Antlr Java tool and Antlr4.Runtime.Standard. As far as I know, no other target has a separate Antlr tool generator, and Harwell's tool is several version behind the current tool version 4.7.2.) I then added a test (https://github.com/kaby76/antlr4-c3/blob/master/ports/c%23/XUnitTestProject1/UnitTest1.cs#L121) that has the input of Expr.g4 erased from line 9 to the end of the file:

grammar Expr;

expression: assignment | simpleExpression;

assignment
    : (VAR | LET) ID EQUAL simpleExpression
;

The input has no syntax errors, but it is, of course, incomplete. I want to simulate typing in a new rule after the last semi-colon, e.g., code completion. I'm expecting at least RULE_REF to appear in the candidate.tokens list because after ";", I can start a new rule. When I call the library to compute the candidate set after the last semi-colon "core.CollectCandidates(index, null)", I get instead a token list of only CATCH, FINALLY, and -2. CATCH and FINALLY make sense because the rule for parserRuleSpec is "parserRuleSpec : DOC_COMMENT? ruleModifiers? RULE_REF argActionBlock? ruleReturns? throwsSpec? localsSpec? rulePrequel* COLON ruleBlock SEMI exceptionGroup ;". FOLLOW(SEMI) should contain FIRST(exceptionGroup), which in this case, contains CATCH, FINALLY. However, exceptionGroup can derive the empty string, so it should also contain FIRST(parserRuleSpec). If I add to the input "simpleExpression" after the last semi-colon, again I would expect RULE_REF at the caret positioned at "simpleExpression", but I only get CATCH, FINALLY, -2.

My question is why isn't RULE_REF in the lookahead? I've only started to debug your code, and it'll take a while for me to fully comprehend.

--Ken

@mike-lischke

This comment has been minimized.

Copy link
Owner

@mike-lischke mike-lischke commented Dec 22, 2019

Ken, can you please clarify: are you looking for completion candidates for the Expr grammar or the ANTLR4 grammar? You mention elements from the ANTRL4 grammar, but also mention Expr (which wouldn't play a role at all when you are interested in the ANTLR4 grammar instead).

@kaby76

This comment has been minimized.

Copy link
Author

@kaby76 kaby76 commented Dec 22, 2019

I'm looking for candidates for Antlr grammars. If the caret is placed after a semicolon of a grammar parser rule, I would expect to see RULE_REF in the list because I can certainly type in a new rule after a semicolon which ended the previous rule. Expr is just sample input, but it could be any grammar.

@mike-lischke

This comment has been minimized.

Copy link
Owner

@mike-lischke mike-lischke commented Dec 23, 2019

Now I understand, thanks. The RULE_REF candidate is returned for me at this position. Have you perhaps included that in your ignored token list?

Note: including RULE_REF at this point is probably not helpful, because you would start a new rule when typing a name, for which you cannot provide meaningful completion items (which only can show existing elements).

@kaby76

This comment has been minimized.

Copy link
Author

@kaby76 kaby76 commented Dec 23, 2019

I'm not filtering; the test is all in my forked repo at the link I gave. I'll check it against Harwell's version now.

@kaby76

This comment has been minimized.

Copy link
Author

@kaby76 kaby76 commented Dec 23, 2019

Harwell's tool and runtime (all C#) produces the same result. It looks like the port of C3 to C# may be wrong. Checking this next.

@kaby76

This comment has been minimized.

Copy link
Author

@kaby76 kaby76 commented Dec 23, 2019

Typescript version (generated 4.7.3 version from non-standard build), Harwell's tool 4.6.6 , and Antlr4 4.7.2 standard C# all produce Tokens={26,27,-2} = {CATCH, FINALLY, -2} at token index 34, which is the space just after the very last ";" of input "grammar Expr; expression: assignment | simpleExpression; assignment : (VAR | LET) ID EQUAL simpleExpression ; ". I don't think C3 is computing the token sets correctly. TOKEN_REF should be there. This seems to be similar to issue #23 -- FOLLOW(GET) should contain FIRST(fooBarBaz withQux? EOF). But, since fooBarBase derives epsilon, you need to add to the set FIRST(withQux? EOF). I'll now delve into the code.

@kaby76

This comment has been minimized.

Copy link
Author

@kaby76 kaby76 commented Dec 24, 2019

A short note: the Seek() call at https://github.com/mike-lischke/antlr4-c3/blob/master/ports/c%23/src/Antlr4CodeCompletion.Core/CodeCompletion/CodeCompletionCore.cs#L100 should not be there! It is misleading; the stream is never used in the ProcessRule() ATN walker. And, it does not exist in the corresponding Typescript code https://github.com/mike-lischke/antlr4-c3/blob/master/src/CodeCompletionCore.ts. Continuing my code reading.

@mike-lischke

This comment has been minimized.

Copy link
Owner

@mike-lischke mike-lischke commented Dec 24, 2019

If that call is not in the typescript code then it shouldn't be in any of the ports, you are right.

@kaby76

This comment has been minimized.

Copy link
Author

@kaby76 kaby76 commented Dec 30, 2019

After a bit of reading and debugging of your code, the Antlr ATN parser, and one of Parr's paper on LL(*) over the last week, I believe the sets are not computed correctly in the case where the ATN recognizing the last token of the input contains a stop state of an ATN. ProcessRule() correctly parses the input to the caret, for a particular ATN corresponding to a rule in the grammar. The main loop for the ATN automaton interpreter is at https://github.com/mike-lischke/antlr4-c3/blob/master/ports/c%23/src/Antlr4CodeCompletion.Core/CodeCompletion/CodeCompletionCore.cs#L468. It performs the parse from a state within that ATN for the input from the pipeline queue, up until it gets to the stop state of the ATN or when a RULE transition occurs and ProcessRule() is then called recursively. The parser "accepts" the input when we arrive at a state via a non-epsilon transition on the last token in the input. You now place that state on the pipeline queue https://github.com/mike-lischke/antlr4-c3/blob/master/ports/c%23/src/Antlr4CodeCompletion.Core/CodeCompletion/CodeCompletionCore.cs#L584 and continue to compute the lookahead sets after the caret. At this point, the lookahead set should contain the non-epsilon symbols from the e-closure from the "accept" state, which you obtain via DetermineFollowSets(). But, more importantly, if we are in the stop state for the rule's ATN, the lookahead must also be obtained from the next enclosing state from the call stack of ATNs. It looks like you accumulate the sets in "this.Tokens[]", but ProcessRule() could contain multiple uses of a particular rule in the path of the parse tree down to the symbol at the caret. The final set of lookaheads reported back to the user should not include epsilon (= -2), but your code does anyways. What is interesting is that Parr's Antlr ATN parser runtime also does not correctly predict the lookahead sets either for reporting (https://github.com/antlr/antlr4/blob/master/runtime/CSharp/runtime/CSharp/Antlr4.Runtime/DefaultErrorStrategy.cs#L410). (Take an Antlr grammar like Expr.g4, and place a "." in the input following a ";" at the end of a rule, create a parser for Antlr grammars itself and call this parser on this input. Antlr's default reporter says it expects EOF/mode because these can occur after the semi-colon. But, it does not say that TOKEN_REF or RULE_REF is expected, which clearly it should, since a new rule--whether parser or lexer grammar--can occur at the point of the ".".) What I plan to do is to write my own implementation. A parse is already done via Antlr, but unfortunately, the Antlr runtime only records the start state of a rule (Antlr4.Runtime.RuleContext.invokingState), and not for each state transition from the start state to accept state in the parse tree. But, I'm not sure if modifying the parser would help because I think it just stops the search after obtaining the first valid parse; your code continues, which is correct. --Ken

@mike-lischke

This comment has been minimized.

Copy link
Owner

@mike-lischke mike-lischke commented Dec 31, 2019

In addition to your observations there's the fact that the C3 engine has to track back and follow other possible paths through the ATN, which the normal ANTLR4 parser does not do (and which is one of the main reasons why I don't use it for code completion).

@kaby76

This comment has been minimized.

Copy link
Author

@kaby76 kaby76 commented Dec 31, 2019

Yes, I think I read that somewhere in your code comments. I agree, it's important to do a full search of the parse space. -K

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.