Proposal: inheritance for stateful lexers #93

Merged: 5 commits into master on Sep 18, 2018
Conversation

@nathan (Collaborator) commented Aug 19, 2018

PR so you can play around with it. I'll add tests if this seems like the right idea.

This adds a new include rule, which allows you to include all the rules from another state. include behaves as follows:

  • Each rule is included at most once.
  • Each rule is included at the earliest possible position in the rule list (i.e., the position where it is most able to match).
  • Including a state in itself does nothing.
  • Direct and indirect cyclical inclusions are allowed.

Here's an example of parsing JS-like templates:

const lexer = moo.states({
  $all: { err: moo.error },
  main: {
    include: 'std',
  },
  brace: {
    include: 'std',
    rbrace: { match: '}', pop: 1 },
  },
  template: {
    include: 'std',
    tmid: { match: /}(?:\\[^]|[^\\`])*?\${/, value: s => s.slice(1, -2) },
    tend: { match: /}(?:\\[^]|[^\\`])*?`/, value: s => s.slice(1, -1), pop: 1 },
  },
  std: {
    include: ['comment', 'ws'],
    id: /[A-Za-z]\w*/,
    op: /[!=]==|\+[+=]?|-[-=]|<<=?|>>>?=?|&&?|\|\|?|[<>!=/*&|^%]=|[~!,/*^?:%]/,
    tbeg: { match: /`(?:\\[^]|[^\\`])*?\${/, value: s => s.slice(1, -2), push: 'template' },
    tsim: { match: /`(?:\\[^]|[^\\`])*?`/, value: s => s.slice(1, -1) },
    str: { match: /'(?:\\[^]|[^\\'])*?'|"(?:\\[^]|[^\\"])*?"/, value: s => s.slice(1, -1) },
    lbrace: { match: '{', push: 'brace'},
  },
  ws: {
    ws: { match: /\s+/, lineBreaks: true },
  },
  comment: {
    lc: /\/\/.+/,
    bc: /\/\*[^]*?\*\//,
  },
})

lexer.reset('`just ` + /* comment */ // line\n`take ${one} and ${a}${two} and a` + {three: `${{four: five}}}`} / "six"')
for (const tok of lexer) {
  if (tok.type !== 'ws') console.log(tok.type, JSON.stringify(tok.value))
}

Here's how it behaves with cycles:

const lex2 = moo.states({
  $all: {
    ws: { match: /\s+/, lineBreaks: true },
  },
  a: {
    a: /a\w/,
    switch: { match: '|', next: 'b' },
    include: 'b',
  },
  b: {
    b: /\wb/,
    switch: { match: '|', next: 'a' },
    include: 'a',
  },
})
lex2.reset('ab ac bb ac cb | ab ac bb ac cb')
for (const tok of lex2) {
  if (tok.type !== 'ws') console.log(tok.type, JSON.stringify(tok.value))
}

@tjvr (Collaborator) commented Aug 19, 2018

This is cool, thanks for investigating this!

My first thought is: wow, that's a fairly sophisticated example tokenizer you have there (the JS-template-like one). I'm not sure I could have written it myself; I wonder if we've made a tool that only you can use correctly ;-)

(As a minor point of style, I'd have kept the rbrace rule in std, and have my parser generate "unexpected }" errors.)

> Each rule is included at the earliest possible position in the rule list (i.e., the position where it is most able to match).

Can you explain what you mean by this? Order is reasonably important in Moo lexers, so it would be good to understand this (and make sure we get the semantics right). Is it that in normal behaviour, the rules are inserted in place of the include; and the "earliest possible position" constraint only applies if a rule would be included more than once?

I suppose using a dictionary for this means we can't allow multiple include directives in different places in the rule list of a state. I'm not sure whether this would be common in practice.

I'm tempted to recommend writing this as a separate transform -- e.g. moo.withStates() -- which takes in a set of states with includes, and emits a set of states with all the includes expanded. But mostly I'd just find this easier to comprehend and test; there's no reason the public API has to be broken up this way.
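
A rough sketch of what that transform could look like, written as a standalone pass (the function name, and deduplicating by rule name, are assumptions here rather than anything moo actually implements):

// Hypothetical helper: takes states containing `include` directives and returns
// states with them expanded. Assumed semantics, matching the rules described above:
// a rule keeps the earliest position at which it appears, later duplicates are
// skipped, and self/cyclic inclusion is a no-op. ($all and other special handling
// is glossed over here, and only the object rule notation is considered.)
function expandIncludes(states) {
  function expandState(name, seen, out) {
    if (seen.has(name)) return out            // cycle or self-inclusion: do nothing
    seen.add(name)
    for (const [key, value] of Object.entries(states[name])) {
      if (key === 'include') {
        const targets = Array.isArray(value) ? value : [value]
        for (const target of targets) expandState(target, seen, out)
      } else if (!(key in out)) {
        out[key] = value                      // first occurrence keeps the position
      }
    }
    return out
  }
  const result = {}
  for (const name of Object.keys(states)) {
    result[name] = expandState(name, new Set(), {})
  }
  return result
}

// usage: moo.states(expandIncludes(statesWithIncludes))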

@nathan (Collaborator, Author) commented Aug 19, 2018

> Is it that in normal behaviour, the rules are inserted in place of the include; and the "earliest possible position" constraint only applies if a rule would be included more than once?

Correct. So if std includes comment, and I include comment before I include std, the rules from comment match at the level of the comment inclusion.
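
Concretely, with hypothetical rule names, that scenario looks like this:

moo.states({
  comment: { lc: /\/\/.+/ },
  std: {
    include: 'comment',
    id: /[A-Za-z]\w*/,
  },
  main: {
    include: ['comment', 'std'],  // comment's rules land here, at their first mention;
    kw: ['if', 'else'],           // the nested include inside std doesn't re-add them
  },
})

Here main would effectively try lc, then id, then kw, in that order.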

> I suppose using a dictionary for this means we can't allow multiple include directives in different places in the rule list of a state. I'm not sure whether this would be common in practice.

That's a good point. I can think of two solutions:

  1. If that case seems uncommon enough, just use the array rule notation we already support:
state: [
  {include: 'comment'},
  {name: 'id', match: /\w+/},
  {include: 'std'},
  // ...
],
  2. Otherwise, switch to or allow a notation like this:
state: {
  include_comment: 1,
  id: /\w+/,
  include_std: 1,
  // ...
},

(FWIW, include rules already optionally accept an array of states to include, e.g., {include: ['ws', 'comment']}, so this is only necessary if you want to include in two different places in the rule ordering.)

The include resolver is pretty separable; it's just this chunk of compileStates. Feel free to relegate it to its own function for testing.

@tjvr (Collaborator) commented Aug 31, 2018

I'd missed your suggestion of $all there.

I think this is cool, and includes are important! However, is there a good reason not to use JavaScript's object spread syntax?

const ws = {
  ws: { match: /\s+/, lineBreaks: true },
}

const comment = {
  lc: /\/\/.+/,
  bc: /\/\*[^]*?\*\//,
}

const std = {
  ...comment,
  ...ws,
  id: /[A-Za-z]\w*/,
  op: /[!=]==|\+[+=]?|-[-=]|<<=?|>>>?=?|&&?|\|\|?|[<>!=/*&|^%]=|[~!,/*^?:%]/,
  tbeg: { match: /`(?:\\[^]|[^\\`])*?\${/, value: s => s.slice(1, -2), push: 'template' },
  tsim: { match: /`(?:\\[^]|[^\\`])*?`/, value: s => s.slice(1, -1) },
  str: { match: /'(?:\\[^]|[^\\'])*?'|"(?:\\[^]|[^\\"])*?"/, value: s => s.slice(1, -1) },
  lbrace: { match: '{', push: 'brace' },
}

const main = {
  ...std,
}

const brace = {
  ...std,
  rbrace: { match: '}', pop: 1 },
}

const template = {
  ...std,
  tmid: { match: /}(?:\\[^]|[^\\`])*?\${/, value: s => s.slice(1, -2) },
  tend: { match: /}(?:\\[^]|[^\\`])*?`/, value: s => s.slice(1, -1), pop: 1 },
}

const lexer = moo.states({
  $all: { err: moo.error },
  main, brace, template,
})

@nathan (Collaborator, Author) commented Sep 1, 2018

> is there a good reason not to use JavaScript's object spread syntax?

I'm not sure it's a good reason, but a reason is that this implements adding rather than replacing rules. Compare:

moo.states({
  base: {op: ['*', '/', '+', '-']},
  mod: {include: 'base', op: ['%']},
})

where mod matches *, /, +, -, and % as op, with:

const base = {op: ['*', '/', '+', '-']}
const mod = {...base, op: ['%']}
moo.states({base, mod})

where mod only matches % as op. I was originally doing this in my JS-ish lexer above to distinguish /-op from /-regex, but it became too complicated for an example, so I took it out.
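
The spread version behaves that way purely because of plain object semantics: when two properties share a key, the later one wins, so moo.states never sees base's op patterns at all. A quick illustration:

const base = {op: ['*', '/', '+', '-']}
const mod = {...base, op: ['%']}
console.log(mod)  // { op: ['%'] } -- base's op list is replaced, not extended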

tjvr added a commit that referenced this pull request on Sep 18, 2018
@tjvr merged commit ce77f13 into master on Sep 18, 2018
@tjvr deleted the include branch on Sep 18, 2018
@tjvr (Collaborator) commented Sep 18, 2018

> this implements adding rather than replacing rules

I'm sold.

I'm still not sure if this is the right interface, as discussed previously. But I'd like to merge this and play around with it for a bit.
