hand/lexer: Add long-char.c test case.
Not quite sure what the best approach is for error detection here. We could
try to be clever to potentially give better error messages, or be simple but
consistent and fail early on invalid tokens.

Approach 1)

If we try to be clever, where is the cut-off point? 1 byte of look-ahead to
locate 'cc' as invalid, 2 bytes of look-ahead to also identify 'ccc', etc.
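
To make the cut-off question concrete, here is a minimal, self-contained Go
sketch of approach 1; the classifyCharLit helper and its maxLookahead
parameter are made up for illustration and are not part of the uc lexer:

package main

import "fmt"

// classifyCharLit assumes input[0] is the opening apostrophe of a character
// literal and scans at most maxLookahead bytes past the first character for
// the terminating apostrophe.
func classifyCharLit(input string, maxLookahead int) string {
	for i := 2; i < 2+maxLookahead && i < len(input); i++ {
		switch input[i] {
		case '\'':
			if i == 2 {
				return "valid character literal"
			}
			return "multi-character character literal"
		case '\n':
			return "unterminated character literal" // never scan past the end of the line
		}
	}
	// The cut-off problem: anything longer than the look-ahead allows is
	// reported as unterminated, even if a terminating apostrophe follows.
	return "unterminated character literal"
}

func main() {
	fmt.Println(classifyCharLit(`'c';`, 2))   // valid character literal
	fmt.Println(classifyCharLit(`'cc';`, 2))  // multi-character character literal
	fmt.Println(classifyCharLit(`'ccc';`, 2)) // unterminated character literal
}

With maxLookahead = 2 the sketch classifies 'cc' as a multi-character literal
but still reports 'ccc' as unterminated, which is exactly the cut-off question
above.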

Approach 2)

Another approach is to backtrack to the first possible starting position
of a valid token within the input byte stream. In the case of `'cc';`,
the lexer would emit an "unterminated character literal" error for the
first apostrophe and then backtrack to the position directly succeeding
the first apostrophe (i.e. `cc';`) to continue lexing. The lexer would
then emit the identifier `cc`, an error for the second unterminated
apostrophe and a semicolon token.
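
As a minimal, self-contained sketch of this recovery strategy (a toy scanner
with made-up names, not the real uc lexer), the scanner below reports an error
for the apostrophe and restarts lexing at the byte directly succeeding it:

package main

import "fmt"

func isLetter(b byte) bool {
	return 'a' <= b && b <= 'z' || 'A' <= b && b <= 'Z'
}

func lexWithBacktrack(input string) {
	pos := 0
	for pos < len(input) {
		switch {
		case input[pos] == '\'':
			// Character literal: accept exactly one non-apostrophe byte
			// between two apostrophes.
			if pos+2 < len(input) && input[pos+1] != '\'' && input[pos+2] == '\'' {
				fmt.Printf("char literal %q\n", input[pos:pos+3])
				pos += 3
				continue
			}
			// Invalid: report an error for the apostrophe and restart lexing
			// at the byte directly succeeding it.
			fmt.Printf("error: unterminated character literal at offset %d\n", pos)
			pos++
		case isLetter(input[pos]):
			start := pos
			for pos < len(input) && isLetter(input[pos]) {
				pos++
			}
			fmt.Printf("identifier %q\n", input[start:pos])
		default:
			fmt.Printf("token %q\n", string(input[pos]))
			pos++
		}
	}
}

func main() {
	// For `'cc';` this prints: an error for the first apostrophe, the
	// identifier "cc", an error for the second apostrophe, and ";".
	lexWithBacktrack("'cc';")
}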

Approach 3) (current approach)

Lex as many bytes as are valid for the token currently being lexed. For the
input `'cc'; // Not OK`, lex `'c` before emitting an "unterminated character
literal" error: the first apostrophe indicates that a character literal is to
be lexed, and the first `c` is a valid character within the character literal,
but when the lexer tries to locate the terminating apostrophe and fails to do
so, it emits an error and continues lexing from the next byte in the byte
stream, i.e. from `c'; // Not OK`. That remainder is then lexed as the
identifier `c`, an "unterminated character literal" error for `';`, and a
comment for `// Not OK`.
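
A minimal, self-contained sketch of this behaviour (a toy scanner with made-up
names, not the real uc lexer): the character-literal branch consumes `'c`,
reports the error, and then continues from the byte it stopped at:

package main

import "fmt"

func isLetter(b byte) bool {
	return 'a' <= b && b <= 'z' || 'A' <= b && b <= 'Z'
}

func lexGreedy(input string) {
	pos := 0
	for pos < len(input) {
		switch b := input[pos]; {
		case b == '\'':
			pos++ // opening apostrophe
			if pos < len(input) && input[pos] != '\'' && input[pos] != '\n' {
				pos++ // the character inside the literal
			}
			if pos < len(input) && input[pos] == '\'' {
				pos++ // terminating apostrophe
				fmt.Println("char literal")
				continue
			}
			// `'c` has been consumed but no terminating apostrophe follows:
			// report the error and continue from the current byte.
			fmt.Println("error: unterminated character literal")
		case isLetter(b):
			start := pos
			for pos < len(input) && isLetter(input[pos]) {
				pos++
			}
			fmt.Printf("identifier %q\n", input[start:pos])
		case b == ';':
			fmt.Println("semicolon")
			pos++
		case b == '/' && pos+1 < len(input) && input[pos+1] == '/':
			fmt.Printf("comment %q\n", input[pos:])
			pos = len(input)
		default:
			pos++ // skip whitespace and anything else in this toy
		}
	}
}

func main() {
	// Prints: an error for `'c`, the identifier "c", an error for `';`, and
	// the comment "// Not OK".
	lexGreedy("'cc'; // Not OK")
}

Compared with approach 2, the identifier emitted after the first error is `c`
rather than `cc`, since the leading `c` has already been consumed as part of
the invalid literal.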

@sangisos any idea which approach is preferable? What are the advantages and
disadvantages of the three different approaches? Are there any other
approaches we could try?
mewmew committed Feb 22, 2016
1 parent 3ecddc7 commit 5849769
Showing 1 changed file with 118 additions and 2 deletions.
120 changes: 118 additions & 2 deletions uc/hand/lexer/lexer_test.go
@@ -537,8 +537,124 @@ func TestLexer(t *testing.T) {
},
},
},
{
path: "../../testdata/incorrect/lexer/long-char.c",
toks: []token.Token{
{
Kind: token.Ident,
Val: "int",
Pos: 0,
},
{
Kind: token.Ident,
Val: "main",
Pos: 4,
},
{
Kind: token.Lparen,
Val: "(",
Pos: 8,
},
{
Kind: token.Ident,
Val: "void",
Pos: 9,
},
{
Kind: token.Rparen,
Val: ")",
Pos: 13,
},
{
Kind: token.Lbrace,
Val: "{",
Pos: 15,
},
{
Kind: token.Ident,
Val: "char",
Pos: 19,
},
{
Kind: token.Ident,
Val: "c",
Pos: 24,
},
{
Kind: token.Semicolon,
Val: ";",
Pos: 25,
},
{
Kind: token.Ident,
Val: "c",
Pos: 29,
},
{
Kind: token.Assign,
Val: "=",
Pos: 31,
},
{
Kind: token.CharLit,
Val: "'c'",
Pos: 33,
},
{
Kind: token.Semicolon,
Val: ";",
Pos: 36,
},
{
Kind: token.Comment,
Val: " OK",
Pos: 38,
},
{
Kind: token.Ident,
Val: "c",
Pos: 46,
},
{
Kind: token.Assign,
Val: "=",
Pos: 48,
},
{
Kind: token.Error,
Val: "unterminated character literal",
// TODO: Figure out how to handle position of errors.
Pos: 51,
},
{
Kind: token.Ident,
Val: "c",
Pos: 52,
},
{
Kind: token.Error,
Val: "unterminated character literal",
// TODO: Figure out how to handle position of errors.
Pos: 54,
},
{
Kind: token.Comment,
Val: " Not OK",
Pos: 56,
},
{
Kind: token.Rbrace,
Val: "}",
Pos: 66,
},
{
Kind: token.EOF,
Val: "",
Pos: 68,
},
},
},
// TODO: Add tokens for the following test cases.
{path: "../../testdata/incorrect/lexer/long-char.c"},
{path: "../../testdata/incorrect/lexer/ugly.c"},
{path: "../../testdata/quiet/lexer/l01.c"},
{path: "../../testdata/quiet/lexer/l02.c"},
@@ -575,7 +691,7 @@ func TestLexer(t *testing.T) {
break
}
}
-if looprun >= 1 {
+if looprun >= 2 {
break // TODO: Remove this break to test all test cases and not just the first.
}
}

5 comments on commit 5849769

@mewmew (Owner, Author) commented on 5849769 on Feb 23, 2016

We could discuss the various approaches in this commit, or in a dedicated issue. For now, I'll put my thoughts here.

First and foremost, I'm not particularly attached to any of the three approaches. My main concern is that whichever solution we arrive at, it should feel intuitive and consistent.

Approach 1

Pros

  • Facilitates intuitive error messages for the user; e.g. "invalid character literal 'cc'; contains multiple characters"

Cons

  • Unintuitive implementation. Which error cases should we guard the user against and present good error messages for? E.g. 'aa' (multichar), 'a\' (unterminated), 'aaa' (multichar, with potentially unbounded look-ahead), or a newline in a character literal? How far ahead should we scan before deciding which error message to produce? And in which state should we leave the lexer once the error has been recorded: after the initial apostrophe, after the final apostrophe of a multichar character literal, or after the first invalid byte in an otherwise valid token?

Approach 2

Pros

  • Intuitive and consistent handling of invalid tokens (although not necessarily the easiest to implement). Always skip the first character of an invalid token, and try to restart the lexer from the character directly succeeding it in the input byte stream.

Cons

  • Will produce a lot of additional error messages, which may confuse the user and hide the original error.

Approach 3

Pros

  • Simple and intuitive implementation of the lexer. At any point while lexing, if an unexpected character is located, output an error and continue from the unexpected character.

Example:

// Consume the closing apostrophe of the character literal.
if !l.accept("'") {
    l.emitErrorf("unterminated character literal")
    return lexToken
}

Try to consume the closing apostrophe of the character literal. If it is not present, output an error and continue lexing a new token from the current position.

Cons

  • Results in additional, perhaps unintuitive error messages. Perhaps not as bad as with approach 2, but may still confuse the user.

@mewmew (Owner, Author) commented on 5849769 on Feb 23, 2016

ping @sangisos :) Even if we discuss this topic in person, let's collect our thoughts here to make them publicly visible.

@sangisos (Collaborator) commented on 5849769

I feel that I still have a lot to learn before making an informed decision, and I really would like to avoid cleverly shooting ourselves in the foot. I would opt for approach 3, to avoid complexity and tedious work, but maybe try to help the user REALLY look at the first error.

Another approach might be to also lex the file backwards from the end, to maybe determine the start/end of the error more accurately and help with recovery, but that also sounds tedious, as I guess a whole new lexer would need to be written, one looking at what to expect BEFORE some construct.

As stated, I think approach 3 is good enough, but better with some guidance to the user.

@mewmew (Owner, Author) commented on 5849769 on Feb 23, 2016

Another approach might be to also lex the file backwards from the end, to maybe determine the start/end of the error more accurately and help with recovery, but that also sounds tedious, as I guess a whole new lexer would need to be written, one looking at what to expect BEFORE some construct.

Wouldn't this create another instance of the same problem? An unterminated character literal is unterminated no matter which direction we are lexing. E.g. which tokens and errors would the lexer produce for the following input:

// Multichar character literals pose the same problem whether lexed forwards or backwards.
int c = 'cc';
// An unterminated character literal may be missing its first (opening) apostrophe ...
int c = c';
// ... or its second (closing) apostrophe.
int c = 'c;

It is possible that lexing it backwards would facilitate better error messages. From a quick glance, I fail to see how the problem is made easier.

As stated, I think approach 3 is good enough, but better with some guidance to the user.

OK, let's go with approach 3 for now, and try to help the user by outlining the first error, and including any additional errors below.

@sangisos (Collaborator) commented on 5849769

It is possible that lexing it backwards would facilitate better error messages. From a quick glance, I fail to see how the problem is made easier.

No, I didn't mean it would be easier, quite the opposite: a more advanced "clever" approach could possibly give better error messages, but it would be much more difficult and less straightforward to implement. That is why I didn't really recommend it as a starting point. It would need more than just lexing to solve.

OK, let's go with approach 3 for now, and try to help the user by outlining the first error, and including any additional errors below.

Maybe have a helpful block right at the end that repeats the first error with extra help/solutions, and explains that some of the problems above could follow from that first, repeated error. Also add a note that a -v flag will give extra help for all errors, or something along those lines.
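
Something along these lines, as a rough, self-contained Go sketch; the error
list, the message texts, and the -v flag handling are all made up here and not
part of uc:

package main

import (
	"flag"
	"fmt"
)

var verbose = flag.Bool("v", false, "print extra help for every error")

// report prints every recorded error, then repeats the first one at the end
// with a hint that the later errors may simply follow from it.
func report(errs []string) {
	for _, e := range errs {
		fmt.Println("error:", e)
	}
	if len(errs) == 0 {
		return
	}
	fmt.Println()
	fmt.Println("note: the first error was:", errs[0])
	if len(errs) > 1 {
		fmt.Println("note: the errors following it may be caused by it; try fixing it first")
	}
	if !*verbose {
		fmt.Println("note: rerun with -v for extra help on each error")
	}
}

func main() {
	flag.Parse()
	report([]string{
		"unterminated character literal",
		"unterminated character literal",
	})
}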
