Proposal: inheritance for stateful lexers #93

Merged: 5 commits into master on Sep 18, 2018
Conversation

@nathan (Collaborator) commented Aug 19, 2018

PR so you can play around with it. I'll add tests if this seems like the right idea.

This adds a new include rule, which allows you to include all the rules from another state. include behaves as follows:

  • Each rule is included at most once.
  • Each rule is included at the earliest possible position in the rule list (i.e., the position where it is most able to match).
  • Including a state in itself does nothing.
  • Direct and indirect cyclical inclusions are allowed.

Here's an example of parsing JS-like templates:

const lexer = moo.states({
  $all: { err: moo.error },
  main: {
    include: 'std',
  },
  brace: {
    include: 'std',
    rbrace: { match: '}', pop: 1 },
  },
  template: {
    include: 'std',
    tmid: { match: /}(?:\\[^]|[^\\`])*?\${/, value: s => s.slice(1, -2) },
    tend: { match: /}(?:\\[^]|[^\\`])*?`/, value: s => s.slice(1, -1), pop: 1 },
  },
  std: {
    include: ['comment', 'ws'],
    id: /[A-Za-z]\w*/,
    op: /[!=]==|\+[+=]?|-[-=]|<<=?|>>>?=?|&&?|\|\|?|[<>!=/*&|^%]=|[~!,/*^?:%]/,
    tbeg: { match: /`(?:\\[^]|[^\\`])*?\${/, value: s => s.slice(1, -2), push: 'template' },
    tsim: { match: /`(?:\\[^]|[^\\`])*?`/, value: s => s.slice(1, -1) },
    str: { match: /'(?:\\[^]|[^\\'])*?'|"(?:\\[^]|[^\\"])*?"/, value: s => s.slice(1, -1) },
    lbrace: { match: '{', push: 'brace'},
  },
  ws: {
    ws: { match: /\s+/, lineBreaks: true },
  },
  comment: {
    lc: /\/\/.+/,
    bc: /\/\*[^]*?\*\//,
  },
})

lexer.reset('`just ` + /* comment */ // line\n`take ${one} and ${a}${two} and a` + {three: `${{four: five}}}`} / "six"')
for (const tok of lexer) {
  if (tok.type !== 'ws') console.log(tok.type, JSON.stringify(tok.value))
}

Here's how it behaves with cycles:

const lex2 = moo.states({
  $all: {
    ws: { match: /\s+/, lineBreaks: true },
  },
  a: {
    a: /a\w/,
    switch: { match: '|', next: 'b' },
    include: 'b',
  },
  b: {
    b: /\wb/,
    switch: { match: '|', next: 'a' },
    include: 'a',
  },
})
lex2.reset('ab ac bb ac cb | ab ac bb ac cb')
for (const tok of lex2) {
  if (tok.type !== 'ws') console.log(tok.type, JSON.stringify(tok.value))
}

@tjvr (Collaborator) commented Aug 19, 2018

This is cool, thanks for investigating this!

My first thought is: wow, that's a fairly sophisticated example tokenizer you have there (the JS-template-like one). I'm not sure I could have written it myself; I wonder if we've made a tool that only you can use correctly ;-)

(As a minor point of style, I'd have kept the rbrace rule in std, and have my parser generate "unexpected }" errors.)

> Each rule is included at the earliest possible position in the rule list (i.e., the position where it is most able to match).

Can you explain what you mean by this? Order is reasonably important in Moo lexers, so it would be good to understand this (and make sure we get the semantics right). Is it that in normal behaviour, the rules are inserted in place of the include; and the "earliest possible position" constraint only applies if a rule would be included more than once?

I suppose using a dictionary for this means we can't allow multiple include directives in different places in the rule list of a state. I'm not sure whether this would be common in practice.

I'm tempted to recommend writing this as a separate transform -- e.g. moo.withStates() -- which takes in a set of states with includes, and emits a set of states with all the includes expanded. But mostly I'd just find this easier to comprehend and test; there's no reason the public API has to be broken up this way.
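
A rough sketch of what that transform could look like, written as a standalone pass (the function name, and deduplicating by rule name, are assumptions here rather than anything moo actually implements):

// Hypothetical helper: takes states containing `include` directives and returns
// states with them expanded. Assumed semantics, matching the rules described above:
// a rule keeps the earliest position at which it appears, later duplicates are
// skipped, and self/cyclic inclusion is a no-op. ($all and other special handling
// is glossed over here, and only the object rule notation is considered.)
function expandIncludes(states) {
  function expandState(name, seen, out) {
    if (seen.has(name)) return out            // cycle or self-inclusion: do nothing
    seen.add(name)
    for (const [key, value] of Object.entries(states[name])) {
      if (key === 'include') {
        const targets = Array.isArray(value) ? value : [value]
        for (const target of targets) expandState(target, seen, out)
      } else if (!(key in out)) {
        out[key] = value                      // first occurrence keeps the position
      }
    }
    return out
  }
  const result = {}
  for (const name of Object.keys(states)) {
    result[name] = expandState(name, new Set(), {})
  }
  return result
}

// usage: moo.states(expandIncludes(statesWithIncludes))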

@nathan (Collaborator, Author) commented Aug 19, 2018

> Is it that in normal behaviour, the rules are inserted in place of the include; and the "earliest possible position" constraint only applies if a rule would be included more than once?

Correct. So if std includes comment, and I include comment before I include std, the rules from comment match at the level of the comment inclusion.
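
Concretely, with hypothetical rule names, that scenario looks like this:

moo.states({
  comment: { lc: /\/\/.+/ },
  std: {
    include: 'comment',
    id: /[A-Za-z]\w*/,
  },
  main: {
    include: ['comment', 'std'],  // comment's rules land here, at their first mention;
    kw: ['if', 'else'],           // the nested include inside std doesn't re-add them
  },
})

Here main would effectively try lc, then id, then kw, in that order.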

> I suppose using a dictionary for this means we can't allow multiple include directives in different places in the rule list of a state. I'm not sure whether this would be common in practice.

That's a good point. I can think of two solutions:

  1. If that case seems uncommon enough, just use the array rule notation we already support:
state: [
  {include: 'comment'},
  {name: 'id', match: /\w+/},
  {include: 'std'},
  // ...
],
  2. Otherwise, switch to or allow a notation like this:
state: {
  include_comment: 1,
  id: /\w+/,
  include_std: 1,
  // ...
},

(FWIW, include rules already optionally accept an array of states to include, e.g., {include: ['ws', 'comment']}, so this is only necessary if you want to include in two different places in the rule ordering.)

The include resolver is pretty separable; it's just this chunk of compileStates. Feel free to relegate it to its own function for testing.

@tjvr (Collaborator) commented Aug 31, 2018

I'd missed your suggestion of $all there.

I think this is cool, and includes are important! However, is there a good reason not to use JavaScript's object spread syntax?

const ws = {
  ws: { match: /\s+/, lineBreaks: true },
}

const comment = {
  lc: /\/\/.+/,
  bc: /\/\*[^]*?\*\//,
}

const std = {
  ...comment,
  ...ws,
  id: /[A-Za-z]\w*/,
  op: /[!=]==|\+[+=]?|-[-=]|<<=?|>>>?=?|&&?|\|\|?|[<>!=/*&|^%]=|[~!,/*^?:%]/,
  tbeg: { match: /`(?:\\[^]|[^\\`])*?\${/, value: s => s.slice(1, -2), push: 'template' },
  tsim: { match: /`(?:\\[^]|[^\\`])*?`/, value: s => s.slice(1, -1) },
  str: { match: /'(?:\\[^]|[^\\'])*?'|"(?:\\[^]|[^\\"])*?"/, value: s => s.slice(1, -1) },
  lbrace: { match: '{', push: 'brace' },
}

const main = {
  ...std,
}

const brace = {
  ...std,
  rbrace: { match: '}', pop: 1 },
}

const template = {
  ...std,
  tmid: { match: /}(?:\\[^]|[^\\`])*?\${/, value: s => s.slice(1, -2) },
  tend: { match: /}(?:\\[^]|[^\\`])*?`/, value: s => s.slice(1, -1), pop: 1 },
}

const lexer = moo.states({
  $all: { err: moo.error },
  main, brace, template,
})

@nathan (Collaborator, Author) commented Sep 1, 2018

> is there a good reason not to use JavaScript's object spread syntax?

I'm not sure it's a good reason, but a reason is that this implements adding rather than replacing rules. Compare:

moo.states({
  base: {op: ['*', '/', '+', '-']},
  mod: {include: 'base', op: ['%']},
})

where mod matches *, /, +, -, and % as op, with:

const base = {op: ['*', '/', '+', '-']}
const mod = {...base, op: ['%']}
moo.states({base, mod})

where mod only matches % as op. I was originally doing this in my JS-ish lexer above to distinguish /-op from /-regex, but it became too complicated for an example, so I took it out.
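
The spread version behaves that way purely because of plain object semantics: when two properties share a key, the later one wins, so moo.states never sees base's op patterns at all. A quick illustration:

const base = {op: ['*', '/', '+', '-']}
const mod = {...base, op: ['%']}
console.log(mod)  // { op: ['%'] } -- base's op list is replaced, not extended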

tjvr added a commit that referenced this pull request on Sep 18, 2018
@tjvr merged commit ce77f13 into master on Sep 18, 2018
@tjvr deleted the include branch on Sep 18, 2018
@tjvr (Collaborator) commented Sep 18, 2018

> this implements adding rather than replacing rules

I'm sold.

I'm still not sure if this is the right interface, as discussed previously. But I'd like to merge this and play around with it for a bit.
