Commit
hand/lexer: Add long-char.c test case.
Not quite sure what the best approach is for error detection here: trying to be clever, to potentially give better error messages, or being simple but consistent, failing early for invalid tokens.

Approach 1) If trying to be clever, where is the cut-off point? 1 byte of look-ahead to identify `'cc'` as invalid, 2 bytes of look-ahead to also identify `'ccc'`, etc.

Approach 2) Backtrack to the first possible starting position of a valid token within the input byte stream. In the case of `'cc';`, the lexer would emit an "unterminated character literal" error for the first apostrophe and then backtrack to the position directly succeeding it (i.e. `cc';`) to continue lexing. The lexer would then emit the identifier `cc`, an error for the second unterminated apostrophe, and a semicolon token.

Approach 3) (current approach) Lex as many bytes as are valid for the token currently being lexed. In the case of `'cc'; // Not OK`, lex `'c` before emitting an "unterminated character literal" error: the first apostrophe indicates that a character literal is to be lexed, the first `c` is a valid character within the literal, and when the lexer tries to locate the terminating apostrophe but fails to do so, it emits an error and continues lexing from the next byte in the byte stream, i.e. from `c'; // Not OK`. This would be lexed as the identifier `c`, an "unterminated character literal" error for `';`, and a comment for `// Not OK`.

@sangisos any idea on which approach is preferable? What are the advantages and disadvantages of the three approaches? Are there any other approaches we may try?
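The recovery described in Approach 3 can be sketched as follows. This is a minimal illustrative lexer in Python, not the repository's actual implementation; the token names and tuple representation are made up for the example:

```python
# Approach 3 sketch: lex as many bytes as are valid for the current token.
# On failure, emit an error and resume from the first unconsumed byte.

def lex(src):
    toks = []  # (kind, lexeme) pairs; errors recorded as ("error", message)
    i, n = 0, len(src)
    while i < n:
        c = src[i]
        if c.isspace():
            i += 1
        elif c == "'":
            # Character literal: consume the opening apostrophe and, if
            # present, one character of content.
            j = i + 1
            if j < n and src[j] != "'":
                j += 1
            if j < n and src[j] == "'":
                toks.append(("char", src[i:j + 1]))
                i = j + 1
            else:
                # No terminating apostrophe: report the error and continue
                # lexing from the next byte after the consumed prefix.
                toks.append(("error", "unterminated character literal"))
                i = j
        elif src.startswith("//", i):
            j = src.find("\n", i)
            j = n if j == -1 else j
            toks.append(("comment", src[i:j]))
            i = j
        elif c.isalpha() or c == "_":
            j = i
            while j < n and (src[j].isalnum() or src[j] == "_"):
                j += 1
            toks.append(("ident", src[i:j]))
            i = j
        elif c == ";":
            toks.append(("semicolon", c))
            i += 1
        else:
            toks.append(("error", f"unexpected byte {c!r}"))
            i += 1
    return toks
```

For the input `'cc'; // Not OK` this produces exactly the stream described above: an "unterminated character literal" error for `'c`, the identifier `c`, a second "unterminated character literal" error for `';` (the semicolon is consumed as the literal's content), and the comment `// Not OK`.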
5849769
We could discuss the various approaches in this commit, or in a dedicated issue. For now, I'll put my thoughts here.
First and foremost, I'm not particularly attached to any of the three approaches. My main concern is that whichever solution we arrive at should feel intuitive and consistent.
Approach 1
Pros
Cons
- `'aa'` → multichar
- `'a\'` → unterminated
- `'aaa'` → multichar, with potentially infinite look-ahead (what about a new line inside a character literal?)

How far ahead should we scan before deciding which error message to produce? In which state should we leave the lexer once the error has been recorded? After the initial apostrophe? After the final apostrophe of a multichar character literal? After the first invalid byte in an otherwise valid token?

Approach 2
Pros
Cons
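For contrast, the backtracking recovery of Approach 2 could be sketched like this (a minimal illustrative Python lexer based on the behaviour described in the commit message; names and representation are made up, and only character literals, identifiers, and semicolons are handled):

```python
# Approach 2 sketch: on an unterminated character literal, emit an error and
# backtrack to the byte directly after the opening apostrophe, i.e. the first
# possible starting position of a valid token.

def lex_backtracking(src):
    toks = []  # (kind, lexeme) pairs; errors recorded as ("error", message)
    i, n = 0, len(src)
    while i < n:
        c = src[i]
        if c.isspace():
            i += 1
        elif c == "'":
            j = i + 1
            if j < n and src[j] != "'":
                j += 1  # one character of literal content
            if j < n and src[j] == "'":
                toks.append(("char", src[i:j + 1]))
                i = j + 1
            else:
                toks.append(("error", "unterminated character literal"))
                i = i + 1  # backtrack: resume right after the apostrophe
        elif c.isalpha() or c == "_":
            j = i
            while j < n and (src[j].isalnum() or src[j] == "_"):
                j += 1
            toks.append(("ident", src[i:j]))
            i = j
        elif c == ";":
            toks.append(("semicolon", c))
            i += 1
        else:
            toks.append(("error", f"unexpected byte {c!r}"))
            i += 1
    return toks
```

For `'cc';` this yields the stream from the commit message: an error for the first apostrophe, the identifier `cc`, an error for the second apostrophe, and a semicolon token.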
Approach 3
Pros
Example:
Try to consume the closing apostrophe of the character literal. If not present, output error and continue lexing a new token from the current position.
Cons
ping @sangisos :) Even if we discuss this topic in person, let's collect our thoughts here to make them publicly visible.
I feel that I still have a lot to learn to make an informed decision, and I really would like to avoid cleverly shooting ourselves in the foot. I would opt for approach 3 to avoid the complexity of tedious work, but maybe try to help the user REALLY look at the first error.

Another approach might be to also lex the file backwards from the end, to perhaps more accurately determine the start/end of the error and help with recovery. But that also sounds tedious, as I guess a whole new lexer would need to be written, looking at what to expect BEFORE some construct.

As stated, I think approach 3 is good enough, but it would be better with some guidance for the user.
Wouldn't this create another instance of the same problem? An unterminated character literal is unterminated no matter which direction we are lexing. E.g. which tokens and errors would the lexer produce for the following input:
It is possible that lexing it backwards would facilitate better error messages. From a quick glance, I fail to see how the problem is made easier.
Ok, let's go with approach 3 for now, and try to help the user by outlining the first error and including any additional errors below.
No, I didn't mean it would be easier, quite the opposite: a more advanced "clever" approach that could possibly give better error messages, but one that would be much more difficult and less straightforward to implement. That is why I didn't really recommend it as a starting point. It would need more than lexing alone to solve.

Maybe have a helpful block right at the end repeating the first error with extra help/solutions, and explaining that some of the problems above could follow from that first repeated error. We could also mention that a -v flag gives extra help for all errors, or something along those lines.