
Implement basic Lexer #102

Closed
rengolin opened this issue May 7, 2020 · 8 comments
@rengolin
Contributor

rengolin commented May 7, 2020

Tokenise the source into a simple structure. Examine the trade-offs of representing lexical blocks `{}` as lists of tokens instead of using a simple stream and keeping the context in the parser.

Acceptance criteria:

  • Can parse the existing Verona examples (write more examples if necessary)
  • Can dump the tokens in tree form
  • Has tests that show the expected tree structure / tokens from a given source
  • Test coverage is 100% of what's exposed in existing examples

Not included:

  • Support for the full language
  • AST building, MLIR lowering
@rengolin
Contributor Author

rengolin commented May 8, 2020

@plietar I think we should reuse, or at least heavily borrow from your prototype.

@plietar
Contributor

plietar commented May 8, 2020

The prototype doesn’t really have a lexer, it just feeds the source input into pegmatite.

It’s actually quite broken: something like `new_foo` is interpreted as the `new` keyword followed by the `_foo` identifier, whereas it should be a single identifier.
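For reference, maximal-munch lexing avoids that bug: consume the full identifier run first, then check whether the result is a keyword. A sketch (the keyword set here is invented for illustration):

```cpp
#include <cctype>
#include <set>
#include <string>

// Hypothetical keyword set, standing in for the real Verona keywords.
static const std::set<std::string> keywords = {"new", "class", "where"};

// Lex one identifier-or-keyword starting at `pos`, longest match first.
// The kind is decided *after* the full run of identifier characters has
// been consumed, so "new_foo" never splits into "new" + "_foo".
std::string lexWord(const std::string& src, size_t& pos, bool& isKeyword) {
  size_t start = pos;
  while (pos < src.size() &&
         (std::isalnum(static_cast<unsigned char>(src[pos])) || src[pos] == '_'))
    ++pos;
  std::string lexeme = src.substr(start, pos - start);
  isKeyword = keywords.count(lexeme) != 0;
  return lexeme;
}
```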

@davidchisnall
Contributor

Pegmatite doesn't really have a lexer vs parser separation. Everything is a sequence of characters. Rules either refer to expressions of characters or to other rules. Though I strongly recommend against reusing any of the Pegmatite code, I actually like this model.

The distinction makes sense in a push model, where the lexer produces a stream of tokens and drives a state machine to produce reductions. Both Clang and ponyc have a pull model (at least, if I understood @sylvanc's explanation of ponyc correctly), where the parser asks the lexer to interpret the next part of the stream as a particular token. This is necessary for any complex grammar (e.g. one with context-dependent keywords).
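A minimal sketch of that pull shape (the names here are illustrative, not taken from Clang or ponyc): the lexer holds a cursor and only does work when the parser asks for the next token, so the parser can decide per call site how to interpret a word.

```cpp
#include <cctype>
#include <string>

// A pull-model lexer: it holds a cursor into the source and only does
// work when the parser requests the next token. Because the parser
// drives it, the parser can decide at each call site whether a word
// like "where" is a keyword or an ordinary identifier.
class Lexer {
  const std::string src;
  size_t pos = 0;

public:
  explicit Lexer(std::string s) : src(std::move(s)) {}

  void skipSpace() {
    while (pos < src.size() &&
           std::isspace(static_cast<unsigned char>(src[pos])))
      ++pos;
  }

  // Pull the next word from the stream; the caller interprets it.
  std::string nextWord() {
    skipSpace();
    size_t start = pos;
    while (pos < src.size() &&
           (std::isalnum(static_cast<unsigned char>(src[pos])) || src[pos] == '_'))
      ++pos;
    return src.substr(start, pos - start);
  }

  bool atEnd() {
    skipSpace();
    return pos == src.size();
  }
};
```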

In this model, the 'lexer' is really just a set of helpful routines for consuming one or more characters from the stream. It probably is useful to have something like Clang's SourceLocation and SourceRange, managed by whatever consumes characters. The prototype has this, with a 32-bit encoding that either directly embeds the file index, line, and character or refers into a table. That may be worth reusing.

Clang has a notion of a Token, which includes a SourceRange and a TokenKind. As I recall, Clang also maintains a table of identifiers (basically, interned strings) so that identifier comparison is a simple index comparison in this table. This may be a useful abstraction to keep around if it can be made sufficiently small, but I expect that Verona compilation units will be much smaller than [Objective-]C[++] ones (where the lack of a module system means that #include can make an individual compilation unit many MBs of text), and so the trade-offs are different. We don't need to optimise for huge input files, but we may want to keep some clean abstractions so that we can do so later if necessary. Instantiations of generics are likely to make our AST / MLIR size significantly larger than our input text size.
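For concreteness, here is one way such a 32-bit location encoding could look. The bit split below is invented for illustration; it is not the prototype's actual layout. If the top bit is clear, the location packs file/line/column directly; if set, the remaining bits index a side table of full locations.

```cpp
#include <cstdint>
#include <vector>

// Full, unpacked source location.
struct FullLoc { uint32_t file, line, column; };

class LocTable {
  std::vector<FullLoc> overflow; // locations too large to pack directly

public:
  static constexpr uint32_t kIndirect = 1u << 31;

  // Pack into 31 bits (file:8, line:16, column:7) when it fits,
  // otherwise store in the overflow table and return an index.
  uint32_t encode(uint32_t file, uint32_t line, uint32_t column) {
    if (file < (1u << 8) && line < (1u << 16) && column < (1u << 7))
      return (file << 23) | (line << 7) | column;
    overflow.push_back({file, line, column});
    return kIndirect | static_cast<uint32_t>(overflow.size() - 1);
  }

  FullLoc decode(uint32_t loc) const {
    if (loc & kIndirect)
      return overflow[loc & ~kIndirect];
    return {loc >> 23, (loc >> 7) & 0xFFFF, loc & 0x7F};
  }
};
```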

@mjp41
Member

mjp41 commented May 11, 2020

I like the pull model, so would be keen to keep that route.

@rengolin
Contributor Author

We just had a long discussion about the grammar, lexing and parsing, and @sylvanc has a prototype that can produce an AST from Verona code that we want to use instead of writing a whole new one. This is in the interest of a better division of work, as @mjp41 put it, allowing us to work on different levels of the project at the same time. It is not a statement of what we want in the end.

The pull model really is a great idea and I think we should pursue it, but that can be done in parallel with other parts of the project if we already have a working AST producer. Sylvan's prototype also allows us to change the grammar on the fly, which is really handy at this stage of development.

So, I'll close this ticket for now. Once we have a better idea on how we'll do this, we should create a new one, with more specific tasks.

@davidchisnall
Contributor

> We just had a long discussion about the grammar, lexing and parsing

Who is 'we' in this context?

@rengolin
Contributor Author

Sorry, @sylvanc, @mjp41 and I.

@mjp41
Member

mjp41 commented May 15, 2020

@davidchisnall sorry for not inviting you. It started out as a ten minute chat about the prefix/infix operators, which snowballed into over an hour.
