Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mode for a grammar without actions #354

Open
matklad opened this Issue Mar 29, 2018 · 5 comments

Comments

Projects
None yet
2 participants
@matklad
Copy link
Member

matklad commented Mar 29, 2018

Add a mode to LALRPOP, where the grammar does not include any actions, and LALRPOP instead generates a generic syntax tree. A possible implementation of a tree could be https://docs.rs/parse_tree (also published on crates.io). Ideally, created tree should include whitespace, comments and other trivia.

An interesting implementation approach is to have parser to generate some sort of events (abstracted with a trait with methods like start_node, finish_node, add_child). That way, the tree construction part can vary independently, and can do fancy stuff with white space as well.

@matklad

This comment has been minimized.

Copy link
Member Author

matklad commented Apr 2, 2018

So, I've manually implemented an "actionless" parser for TOML, as we've discussed on all-hands.

The code lives here:

https://github.com/matklad/tom/tree/dfefe55117800cfb8ef56626d38d59d44c24428c

Note that there are at least two approaches for "actionless" parsing: one is to generate a traditional AST semi-automatically from the grammar, another is to produce a generic parse-tree, libsyntax2 style. tom uses the latter approach.

tom overview

The ParseTree data structure lives in a separate repository:

https://github.com/matklad/parse_tree/tree/70969791c8389276bd2f2bbd0622cf2801165a00

It works, but a couple of major pieces are missing:

  • There's no place where to store errors.
  • It hard-codes String as an underling text type, for simplicity.

The parse_tree has the following moving parts:

  • text.rs -- utility structs for working with text offsets and ranges.
  • lib.rs -- the ParseTree data structure itself.
  • bottom_up/top_down _builder.rs -- procedural "constructors" of the ParseTree for LR and LL parsers (we'll be using only bottom_up in tom).
  • search.rs -- helpers for traversing the ParseTree. Note that this helpers could be reused for any language, which would be harded to achieve with more traditional AST.

The tom tree has roughly three layers: a LALRPOP-based parse-tree parser, an AST wrapping the parse tree and a couple of refactorings to show how the IDE might use all of this.

The parser is actionless, in the sense that all productions has () type, and they just log shift/reduce events via Events trait.

There's a bit of copy-paste in the parser, which I don't think it is possible to get rid of with the help of the macros: I need to be able parametrize by values and to count symbols.

A little bookkeeping is needed to feed the events to the ParseTree builder, because we need to account for whitespace (no support for comments, because LALRPOP lexer apparently does not support them yet).

The AST is layered on top of the parse tree as described in the libsyntax2 RFC, and is mostly generated, with a couple of hand-written bits:

  • AstNode trait, which describes how the layering happens (via the cast fn)
  • Generated AstNodes for various symbols of the grammar. Note how Val and Key are enums.
  • The code-generation is described in build.rs.
  • Hand-written extensions of the AST, which neatly abstract over identical syntactic elements of different symbols. In theory, this could be generated as well perhaps?

This whole AST thing is the main reason why I rave about the parse trees approach so enthusiastically, it seems very neat to me at lest :-)

Finally, to check that all of the above actually makes any sense, the ide module contains a couple of style-preserving transformation of the source code, which you would expect to find in the good IDE. The pub functions and tests in this modules are a good place to start reading the source code deeply. The ide module relies on the edit module for actual source code transformations.

lessons for LALRPOP

If we decide that we should add a parse tree mode to LALRPOP, here's the list of things we need to think about:

  • Errors. With parse trees I think the parsing can, and should, be infailable. All syntax errors must be a part of the parse tree itself. Worst case, the parse tree for a file is a giant Error node, which contains all tokens as children.

  • Error recovery. One of the crucial benefits of parse trees is that they naturally represent partial parses. But to do that, we need some kind of error recovery. I haven't experimented with current LALRPOP error recovery strategies, but I am somewhat unenthusiastic about <!> rules: they seem to require to predict the places where error could happen, and the truth is that errors can happen anywhere! I would prefer an ability to say, positively, "if you have seen a production up to this symbol, than do a reduce even in presence of errors", sort-of like GrammarKit's pin and fall's <commit> work.

  • Syntax for designating auxiliaryly rules: in the tom's grammar, not every rule produces a node in a ParseTree, some rules are there just to organize code in a cleaner way.

  • Syntax for arbitrary nesting of choices: my understanding is that LALRPOP supports choice operator only at the top-level of a production, because specifying actions for nested choices would be unwidely. However, if we remove actions, we totally can support nested choices :-)

next steps

So, what are the next steps here? :-) I think first of all we need to decide if we at all want to go in this actionless untyped incomplete parse trees direction.

If we are, we probably should start to move some of the bits directly into lalrpop; at least the events based parser, but probably the parse trees and the generated ast as well.

And then I would actually love to do a crazy thing: to reimagine the concrete syntax of LALRPOP itself, implement it with parse trees and write a library for LALRPOP-IDE. That way, we'll get precise syntax highting, go to definition and similar stuff for LALRPOP itself, and we will be able to experiment with applying libsyntax2 ideas in real life.

@nikomatsakis curious to hear what do you think about all this and hope you'll have some "free" time to take a look here :-)

@nikomatsakis

This comment has been minimized.

Copy link
Collaborator

nikomatsakis commented Apr 6, 2018

@matklad

Awesome work! Some initial thoughts:

The AST is layered on top of the parse tree as described in the libsyntax2 RFC, and is mostly generated, with a couple of hand-written bits:

As I mentioned at the All Hands, I feel pretty strongly that we should refactor the ParseTree library to not use &'p str slices but rather ranges of locations. This is meant to be a long-lived data structure, and Rust references are really best suited to "temporary" things that are confined to some stack frame. (If you use super long-lived arenas, like the compiler, then that's different, but that is probably not something we want to force everyone to do.)

Errors.

There are still some errors that LALRPOP presently does not recover from. We should be able to fix that, though, particularly with the work in #303 -- which I've kinda' almost got working, but not quite, because of some annoying problems. That work makes LALRPOP much more malleable though. The main errors we don't recover from (iirc) are tokenizer errors and when =>? yields a Err.

Error recovery.

I agree with your instincts here -- I think though you may be able to simulate this with ! pretty easily. Can you maybe sketch out how you imagine error recovery working, or give a more specific pointer to how it works in one of those projects? (I'll try to go digging later, but pointers that let me avoid searching would be helpful)

Anyway, I definitely think we can make changes here. There is nothing special about the current error recovery setup, we just kind of copied it from YACC. I suppose it'd be nice to look also at the patterns that Gluon has -- you may want to preserve the ability to write ! and do customizable recovery, but also have some "top-level" recovery.

Syntax for designating auxiliaryly rules: in the tom's grammar, not every rule produces a node in a ParseTree, some rules are there just to organize code in a cleaner way.

Hmm, so, we have #[inline], which means this but also something else -- that is, it inlines the rule into its callers, which helps to avoid LR(1) conflicts. I think we don't want to conflate the two, because overuse of #[inline] can make for very inefficient grammars.

What I had imagined though is that we would have some general annotations for controlling your parse tree. For example, you might also want to conflate various nonterminals into one and rename them:

#[parse_tree(name = "Foo")]
Foo1 = { ... };

#[parse_tree(name = "Foo")]
Foo2 = { ... };

This could arise, for example, in Rust's grammar, where expressions can have many different starting points, depending on which subset of expressions are legal in a particular place.

If we have that, it seems reasonable to have #[parse_tree(inline)] or something, which means "inline this into my parent's parse tree"? Is that what you were suggesting?

Syntax for arbitrary nesting of choices

That just sounds like a lowering step to me. I agree the main obstacle is choosing a syntax. Can you say a bit more about when you would want this?

One thing I've always wnated to add is a precedence lowering step, that lets you write out operator precedence and associativity more declaratively. Something like:

Expr: Expr = {
    Expr "*" Expr,
    Expr "/" Expr,
} else {
    Expr "+" Expr,
    Expr "-" Expr,
};

That would desugar to something like this (I'm probably getting the details wrong here):

Expr: Expr = Expr1;
Expr1 = { Expr1 "+" Expr, Expr1 "-" Expr, Expr2 };
Expr2 = { Expr2 "*" Expr, ... };

So, what are the next steps here? :-) I think first of all we need to decide if we at all want to go in this actionless untyped incomplete parse trees direction.

I want to, this sounds great. I think in my ideal world it would work like this:

  • Some form of annotation that indicates your grammar is going to be generating events.
    • This could be having no actions at all, anywhere.
  • LALRPOP then generates something like <E>(&mut E) where E: Events, as you specified, and actions to invoke it appropriately.
    • This trait Events lives in lalrpop-util, which maybe ought to be renamed to lalrpop-runtime.
  • The parse_tree crate is entirely independent, but lives in the lalrpop org.

In other words, I hope that we can make it so that the generated parser code only knows about events, but we offer a super nice parse tree library that one can readily use and is well-integrated. It would also handle spans, error reporting, etc.

And then I would actually love to do a crazy thing: to reimagine the concrete syntax of LALRPOP itself, implement it with parse trees and write a library for LALRPOP-IDE.

I don't quite understand what you mean here yet =) but I was also going to suggest that once we get things working, I'd love to port LALRPOP to use it. I am all about the bootstrapping.

@matklad

This comment has been minimized.

Copy link
Member Author

matklad commented Apr 6, 2018

As I mentioned at the All Hands, I feel pretty strongly that we should refactor the ParseTree library to not use &'p str slices but rather ranges of locations.

I am not sure I understand what this refers to, but I agree. Specifically, at rust hands we chatted about how to abstract over various kinds of text storage, and one solution was to parametrize parse tree over the underlying test, like ParseTree<String>. Now I feel that parse trees should be completely ignorant about the underling text, and store only offsets. This means that clients would have to do some API-wrapping to bind together parse tree nodes and the original text, but I now thing that such wrapping is inevitable. Here's the example of why is this so:

// A file for a `Foo` language, which wraps a parse tree and adds a cache which maps
// references to definitions. 
struct FooFile {
  parse_tree: parse_tree::ParseTree<String>,
  use_def: HashMap<NodeId, NodeId>,
}

fn resolve_reference(node: parse_tree::Node<String>) -> Option<parse_tree::Node<String>> {
  // because `ParseTree` has access to text, we can do
  // `node.file()` and `node.text()`.
  // However, `file` is only ParseTree, it won't give us access to `use_def`
  
  let tree: &ParseTree<String> = node.parse_tree();
  let use_id = node.id();
  // Does not work, because tree does not have access to use_def
  let def_id = tree.use_def[use_id]; 
  tree.get_node(def_id)
}

So, I think it makes sense to make ParseTree as simple as possible, and instead let users to create there own "node refenrece", which bind together a node id and a context, which include ParseTree, source text and probably some other data like symbol tables.

tokenizer errors

Tokenizer errors are easy: you just make tokenizer infailable. Given a non-empty input, tokenizer should always produce a first token of the input. The token's length must be positive, but it's symbol might be BAD_TOKEN.

give a more specific pointer to how it works in one of those projects?

I think this comment summarizes all I know: rust-lang/rfcs#2256 (comment). If you like, we can schedule a 15 minutes video call so that I can show how error recovery attributes affect parsing in fall (this might be fun, because fall is very interactive, and you can see in almost real time how changes to grammar affect parse trees).

Is that what you were suggesting?

Exactly! And you example with #[parse_tree(name = "Foo")] is also something I had in mind.

Can you say a bit more about when you would want this?

I find them just generally useful. For example, here's the fall's grammar for use declarations, which uses nested alternatives ({} are just parenthesis, | is used to specify alternatives):

pub rule use_decl {
  'use' <commit> {
    mod_path {alias | {'::' use_spec}?}
  | '::'? use_spec
  } ';'
}

I don't quite understand what you mean here yet =)

So, actually there were three thoughts in there:

  1. because grammars for parse trees don't use actions at all, we might want to change syntax to optimize for that use-case.
  2. I am generally not to happy about current concrete syntax of LALRPOP. It is nice, but I think we probably could do better.
  3. If we switch to parse trees for LALRPOP itself, we'll be able to provide IDE support.

So, parse trees for LALRPOP itself might be a good excuse to experiment with LALRPOP's concrete syntax (here's a straw man proposal: https://gist.github.com/matklad/c15ba52dff7450303962ba5c0dda00e9 ) :)

@matklad

This comment has been minimized.

Copy link
Member Author

matklad commented Apr 19, 2018

Current status: https://github.com/matklad/lalrpop/tree/acb0445ea4715d3653c5b1f11f18e7a58f9a5acc.

  • I've managed to bolt action less mode onto LALRPOP, as a separate lowering phase. This is probably the worst code I've ever written :)

  • At least one test works (grammar).

As a next step, I'd love to try to bootstrap LALRPOP itself from this style of parsing. I think that modifying existing front-end would be tricky, so my plan is to implement, with parse trees, an alternative surface syntax for LALRPOP, tailored for usage with parse trees, and lower that directly to repr::Grammar, bypassing pt::Grammar altogether.

@matklad

This comment has been minimized.

Copy link
Member Author

matklad commented Apr 20, 2018

Status update: apparently, I can't resist the desire to implement IDEs, so, instead of implementing the actual lowering, I've implemented syntax highlighting, extend selection and parse tree view :-)

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.