
Implement basic Lexer #102

Closed
rengolin opened this issue May 7, 2020 · 8 comments
@rengolin
Contributor

rengolin commented May 7, 2020

Tokenise the source into a simple structure. Examine the trade-offs of representing lexical blocks `{}` as lists of tokens instead of using a simple stream and keeping the context in the parser.

Acceptance criteria:

  • Can parse the existing Verona examples (write more examples if necessary)
  • Can dump the tokens in tree form
  • Has tests that show the expected tree structure / tokens from a given source
  • Test coverage is 100% of what's exposed in existing examples

Not included:

  • Support for the full language
  • AST building, MLIR lowering
@rengolin
Contributor Author

rengolin commented May 8, 2020

@plietar I think we should reuse, or at least heavily borrow from your prototype.

@plietar
Contributor

plietar commented May 8, 2020

The prototype doesn’t really have a lexer, it just feeds the source input into pegmatite.

It’s actually quite broken: something like `new_foo` is interpreted as the `new` keyword followed by the `_foo` identifier, whereas it should be a single identifier.
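For reference, maximal-munch lexing avoids that bug: consume the full identifier run first, then check whether the result is a keyword. A sketch (the keyword set here is invented for illustration):

```cpp
#include <cctype>
#include <set>
#include <string>

// Hypothetical keyword set, standing in for the real Verona keywords.
static const std::set<std::string> keywords = {"new", "class", "where"};

// Lex one identifier-or-keyword starting at `pos`, longest match first.
// The kind is decided *after* the full run of identifier characters has
// been consumed, so "new_foo" never splits into "new" + "_foo".
std::string lexWord(const std::string& src, size_t& pos, bool& isKeyword) {
  size_t start = pos;
  while (pos < src.size() &&
         (std::isalnum(static_cast<unsigned char>(src[pos])) || src[pos] == '_'))
    ++pos;
  std::string lexeme = src.substr(start, pos - start);
  isKeyword = keywords.count(lexeme) != 0;
  return lexeme;
}
```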

@davidchisnall
Contributor

Pegmatite doesn't really have a lexer vs parser separation. Everything is a sequence of characters. Rules either refer to expressions of characters or to other rules. Though I strongly recommend against reusing any of the Pegmatite code, I actually like this model.

The distinction makes sense in a push model, where the lexer produces a stream of tokens and drives a state machine to produce reductions. Both Clang and ponyc have a pull model (at least, if I understood @sylvanc's explanation of ponyc correctly), where the parser asks the lexer to interpret the next part of the stream as a particular token. This is necessary for any complex grammar (e.g. one with context-dependent keywords).
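A minimal sketch of that pull shape (the names here are illustrative, not taken from Clang or ponyc): the lexer holds a cursor and only does work when the parser asks for the next token, so the parser can decide per call site how to interpret a word.

```cpp
#include <cctype>
#include <string>

// A pull-model lexer: it holds a cursor into the source and only does
// work when the parser requests the next token. Because the parser
// drives it, the parser can decide at each call site whether a word
// like "where" is a keyword or an ordinary identifier.
class Lexer {
  const std::string src;
  size_t pos = 0;

public:
  explicit Lexer(std::string s) : src(std::move(s)) {}

  void skipSpace() {
    while (pos < src.size() &&
           std::isspace(static_cast<unsigned char>(src[pos])))
      ++pos;
  }

  // Pull the next word from the stream; the caller interprets it.
  std::string nextWord() {
    skipSpace();
    size_t start = pos;
    while (pos < src.size() &&
           (std::isalnum(static_cast<unsigned char>(src[pos])) || src[pos] == '_'))
      ++pos;
    return src.substr(start, pos - start);
  }

  bool atEnd() {
    skipSpace();
    return pos == src.size();
  }
};
```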

In this model, the 'lexer' is really just a set of helpful routines for consuming one or more characters from the stream. It probably is useful to have something like Clang's SourceLocation and SourceRange, managed by whatever consumes characters. The prototype has this, with a 32-bit encoding that either directly embeds the file index, line, and character or refers into a table. That may be worth reusing.

Clang has a notion of a Token, which includes a SourceRange and a TokenKind. As I recall, Clang also maintains a table of identifiers (basically, interned strings) so that identifier comparison is a simple index comparison in this table. This may be a useful abstraction to keep around if it can be made sufficiently small, but I expect that Verona compilation units will be much smaller than [Objective-]C[++] ones (where the lack of a module system means that #include can make an individual compilation unit many MBs of text), and so the trade-offs are different. We don't need to optimise for huge input files, but we may want to keep some clean abstractions so that we can do so later if necessary. Instantiations of generics are likely to make our AST / MLIR size significantly larger than our input text size.
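For concreteness, here is one way such a 32-bit location encoding could look. The bit split below is invented for illustration; it is not the prototype's actual layout. If the top bit is clear, the location packs file/line/column directly; if set, the remaining bits index a side table of full locations.

```cpp
#include <cstdint>
#include <vector>

// Full, unpacked source location.
struct FullLoc { uint32_t file, line, column; };

class LocTable {
  std::vector<FullLoc> overflow; // locations too large to pack directly

public:
  static constexpr uint32_t kIndirect = 1u << 31;

  // Pack into 31 bits (file:8, line:16, column:7) when it fits,
  // otherwise store in the overflow table and return an index.
  uint32_t encode(uint32_t file, uint32_t line, uint32_t column) {
    if (file < (1u << 8) && line < (1u << 16) && column < (1u << 7))
      return (file << 23) | (line << 7) | column;
    overflow.push_back({file, line, column});
    return kIndirect | static_cast<uint32_t>(overflow.size() - 1);
  }

  FullLoc decode(uint32_t loc) const {
    if (loc & kIndirect)
      return overflow[loc & ~kIndirect];
    return {loc >> 23, (loc >> 7) & 0xFFFF, loc & 0x7F};
  }
};
```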

@mjp41
Member

mjp41 commented May 11, 2020

I like the pull model, so would be keen to keep that route.

@rengolin
Contributor Author

We just had a long discussion about the grammar, lexing and parsing, and @sylvanc has a prototype that can produce an AST from Verona code that we want to use instead of writing a whole new one. This is in the interest of a better division of work, as @mjp41 put it, allowing us to work on different levels of the project at the same time. It is not a statement of what we want in the end.

The pull model really is a great idea and I think we should pursue it, but that can be done in parallel with other parts of the project if we already have a working AST producer. Sylvan's prototype also allows us to change the grammar on the fly, which is really handy at this stage of development.

So, I'll close this ticket for now. Once we have a better idea on how we'll do this, we should create a new one, with more specific tasks.

@davidchisnall
Contributor

> We just had a long discussion about the grammar, lexing and parsing

Who is 'we' in this context?

@rengolin
Contributor Author

Sorry, @sylvanc, @mjp41 and I.

@mjp41
Member

mjp41 commented May 15, 2020

@davidchisnall sorry for not inviting you. It started out as a ten minute chat about the prefix/infix operators, which snowballed into over an hour.
