Tracking state #158

dynajoe · 2017-05-21T20:05:06Z

How might you track state while using this parsing library? Parsing a language that's indentation sensitive would require such state tracking.

wavebeem · 2017-05-21T22:18:49Z

It's possible to create new parsers while parsing a file using .chain, which can be used to solve this problem.

const util = require('util');
const P = require('parsimmon');

const Statement =
  P.lazy(() =>
    P.alt(
      Number.skip(P.string('\n')),
      Block
    )
  );

const Number =
  P.regexp(/[0-9]+/)
    .map(s => ['Number', +s])
    .desc('a number');

const Block =
  P.seq(P.string('block:\n'), P.whitespace, Statement)
    .chain(([block, indent, statement]) => {
      const Indent = P.string(indent);
      return Indent
        .then(Statement)
        .many()
        .map(statements => [statement, ...statements]);
    })
    .map(statements => ['Block', statements]);

const text = `\
block:
  1
  2
  3
  block:
    4
    5
  block:
    6
    7
    8
    9
`;
const ast = Block.tryParse(text);
const s = util.inspect(ast, {depth: null});
console.log(s);

The key is using .chain inside the Block parser to grab the indentation level of the first statement in the block and then require all other statements in that same block to start with that indentation. Please let me know if you have any questions about that. I think I'll add this to the examples folder.

And best of luck parsing an indentation sensitive language, they're certainly a little trickier than languages that aren't.

Fixes #158; Adds Python example

wavebeem · 2017-05-22T14:49:40Z

If this doesn't answer your question please let me know.

dynajoe · 2017-05-22T14:56:10Z

I may not be understanding it thoroughly enough. Here's an example:

In the language this works because it's all on one line:

type alias Foo = Maybe Int

this is also valid because Int is indented:

type alias Foo = Maybe 
    Int

However, this is NOT valid because Int must not occur at the beginning of a line. However, while parsing Maybe ... you want to allow newlines and other whitespace. Consuming newlines as part of parsing Maybe\t\s\n etc etc may need to change depending on context.

type alias Foo = Maybe 
Int

Maybe I'm not thinking about this right?

dynajoe · 2017-05-22T14:57:14Z

Also, thank you for the example!

wavebeem · 2017-05-22T15:04:26Z

Oh, so is this more of a Haskell-like indentation then? Because frankly I don't even understand how their rules work, and I frequently had indentation errors in my code when using it. My guess is that something like type alias starts off effectively a statement here, but after that… there needs to be some amount of indentation?

type alias Foo = Either
  Int
       String

Is that valid?

dynajoe · 2017-05-22T15:06:23Z

Indeed that would be valid. I'm trying to parse the Elm programming language: http://elm-lang.org/

dynajoe · 2017-05-22T15:08:16Z

It may be a 2-pass kind of deal. Split the input on statements \n(type|import) etc etc and then run each of those through statement parsers and aggregate the results.

wavebeem · 2017-05-22T15:14:10Z

So maybe for your whitespace parser you need some kind of setup like "either it's just spaces, or it's any amount of whitespace (including newlines), followed by AT LEAST one space.

type WhyNotBoth a b =
    Left a
             | Right b
 | Both a b

Apparently that snippet is valid, so I . think my whitespace idea might be sufficient to work with Elm?

As for the 2-pass ordeal, is this so you can know whether identifiers represent types or imports while parsing?

dynajoe · 2017-05-22T15:19:08Z

Ah, no i thought this would be a way to not have to specialize what is valid whitespace throughout the parsing of a statement. Essentially, I'm just trying to think about how to break the problem down into smaller parts.

wavebeem · 2017-05-22T15:26:50Z

x =
 let
  y = 1
 in
 y + y

it looks like here the variables in a let expression must be indented more than the let itself… but it's also legal just to write it all on one line.

x =
 let y = 1 in y + y

This sounds like a tricky language to parse. You might need some kind of combination approach of the python-ish example I gave but also allowing inline forms in addition to block forms.

wavebeem · 2017-05-22T15:32:06Z

You might have to give let two cases where it either has exactly one binding inside with no newlines or it has a "block" inside…

wavebeem · 2017-06-02T03:59:10Z

I wish you good luck in your parsing adventures for Elm. If you end up writing a parser for it, please send me an email or open an issue with your findings. I've never parsed a language such as it, and would like to learn a good strategy for it.

In the mean time, I don't have any plans to add explicit state tracking to Parsimmon because it's not a very good fit for JS (given lack of native persistent data structures) and I'm skeptical it could be added in to Parsimmon without a breaking API change.

wavebeem · 2017-06-04T04:47:51Z

Hmm, I just realized my Python-ish example doesn't actually work right. It just requires blocks be indented the same way, but indentation can actually DECREASE, which is obviously incorrect. I'm gonna reopen this one so I can remember to think about it.

wavebeem · 2017-06-08T05:00:11Z

NOTE TO SELF:

If each statement carried around its own indentation level as metadata, then I could make the block parser assert that the new indentation level is deeper than the old indentation level.

bd82 · 2017-06-08T06:57:50Z

Indentation is normally resolved at the lexing stage.
However, Parsimmon is a lexerless parsing library.

But what is the difference between parsing a sequence of characters or a sequence of "Tokens"?
Both are a sequence of "things" which we can match against.

Could Parsimmon be adapted to handle "generic" tokens?
That will enable resolving issues such as this by plugging in a separate tokenizer.

wavebeem · 2017-06-08T14:29:45Z

Right, and I have written such a lexer before.

https://github.com/wavebeem/lattescript/blob/master/ls/lexer.js#L108

https://github.com/wavebeem/lattescript/blob/master/ls/lexer.js#L196-L214

But I'm still thinking about how to translate this into Parsimmon, given that Parsimmon can backtrack, unlike a usual lexer, so global state may become incorrect.

mixtur · 2017-06-08T16:36:39Z

Well. Given that state is immutable, you can store it on stack. In fact you already have that kind of state. It is current position in input string. So, pass some other immutable state together with it and that's it.

wavebeem · 2017-06-09T02:27:16Z

Easier said than done, given that JS data structures are not built to be immutable, and this would probably need to be a breaking API change for the Parsimmon constructor. I'll have to keep thinking on this right now.

wavebeem · 2017-06-09T04:13:27Z

This would work, but I explicitly tell people not to mutate global state during parse time… heh.

let util = require('util');
let P = require('..');

///////////////////////////////////////////////////////////////////////

// If this were actually Python, "Block" wouldn't be a statement on its own,
// but rather "If" and "While" would be statements that used "Block" inside.
let Statement =
  P.lazy(() => P.alt(ExpressionStatement, Block));

// Just a simple `foo()` style function call.
let FunctionCall =
  P.regexp(/[a-z]+/).skip(P.string('()'))
    .map(s => ['FunctionCall', s])
    .desc('a function call');

// To make it a statement we just need a newline afterward.
let ExpressionStatement =
  FunctionCall.skip(P.string('\n'));

let Indent =
  P.regexp(/[ ]+/).map(s => s.length);

let indentStack = [0];

function peek(stack) {
  return stack[stack.length - 1];
}

function indentSized(n) {
  return P.string(' ').times(n).desc(n + ' spaces');
}

// The general idea of this is to assume there's "block:" on its own line,
// then we capture the whitespace used to indent the first statement of the
// block, and require that every other statement has the same exact string of
// indentation in front of it.
let Block =
  P.seq(
    P.string('block:\n').then(Indent),
    Statement
  ).chain(args => {
    // `.chain` is called after a parser succeeds. It returns the next parser
    // to use for parsing. This allows subsequent parsing to be dependent on
    // previous text.
    let [indent, statement] = args;
    let max = peek(indentStack);
    if (indent < max) {
      return P.fail('indentation deeper than ' + max);
    } else {
      indentStack.push(indent);
    }
    return indentSized(indent).then(Statement).many()
      .map(statements => [statement].concat(statements));
  })
  .map(statements => ['Block', statements]);

///////////////////////////////////////////////////////////////////////

let text = `\
block:
    a()
    b()
    c()
    block:
        d()
        e()
        f()
    q()
    block:
        g()
        h()
        i()
        j()
    x()
    y()
    z()
`;

function prettyPrint(x) {
  let opts = {depth: null, colors: 'auto'};
  let s = util.inspect(x, opts);
  console.log(s);
}

let ast = Block.tryParse(text);
prettyPrint(ast);

caballus · 2017-06-09T12:50:58Z

I don't get it. You never pop indentStack. Why is it still correct?

This input will not pass your parser.

let text = `\
block:
    a()
    block:
        b()
        block:
            c()
        d()
    block:
        e()
    f()
`;

But I see the idea now. You can use .chain instead of .map. Then you can track whatever state you want. And it will rollback on alts etc. But it is kinda awkward. Even when you don't need that state you still have to use .chain for every subsequent parser to not lose it.

wavebeem · 2017-06-09T13:39:46Z

It's probably not correct. I didn't test it very thoroughly last night, heh.

wavebeem · 2017-07-04T21:18:29Z

If anyone in this thread is interested, I'm taking a stab at a solution for this issue right now over in this PR: #190

I would appreciate any and all feedback on the PR.

cc @anko since you 👍 'd my post but aren't a watcher.

Fixes #158 adds example of parsing Python

wavebeem · 2017-07-09T22:14:42Z

Check out the new python-ish example for my suggestion on how to parse an indentation-based language. I declined the PR I made because I think this solution is better.

https://github.com/jneen/parsimmon/blob/master/examples/python-ish.js

wavebeem added the question label May 21, 2017

wavebeem closed this as completed in 8b6767b May 22, 2017

wavebeem pushed a commit that referenced this issue May 22, 2017

Merge pull request #159 from /issues/158

ccf3f64

Fixes #158; Adds Python example

wavebeem reopened this May 22, 2017

wavebeem closed this as completed Jun 2, 2017

wavebeem reopened this Jun 4, 2017

wavebeem closed this as completed in f3ab4ec Jul 9, 2017

wavebeem pushed a commit that referenced this issue Jul 9, 2017

Merge pull request #194 from /issues/158

ff13033

Fixes #158 adds example of parsing Python

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tracking state #158

Tracking state #158

dynajoe commented May 21, 2017

wavebeem commented May 21, 2017

wavebeem commented May 22, 2017

dynajoe commented May 22, 2017

dynajoe commented May 22, 2017

wavebeem commented May 22, 2017

dynajoe commented May 22, 2017

dynajoe commented May 22, 2017

wavebeem commented May 22, 2017

dynajoe commented May 22, 2017

wavebeem commented May 22, 2017

wavebeem commented May 22, 2017

wavebeem commented Jun 2, 2017

wavebeem commented Jun 4, 2017

wavebeem commented Jun 8, 2017

bd82 commented Jun 8, 2017

wavebeem commented Jun 8, 2017

mixtur commented Jun 8, 2017

wavebeem commented Jun 9, 2017

wavebeem commented Jun 9, 2017

caballus commented Jun 9, 2017 •

edited

Loading

wavebeem commented Jun 9, 2017

wavebeem commented Jul 4, 2017

wavebeem commented Jul 9, 2017

Tracking state #158

Tracking state #158

Comments

dynajoe commented May 21, 2017

wavebeem commented May 21, 2017

wavebeem commented May 22, 2017

dynajoe commented May 22, 2017

dynajoe commented May 22, 2017

wavebeem commented May 22, 2017

dynajoe commented May 22, 2017

dynajoe commented May 22, 2017

wavebeem commented May 22, 2017

dynajoe commented May 22, 2017

wavebeem commented May 22, 2017

wavebeem commented May 22, 2017

wavebeem commented Jun 2, 2017

wavebeem commented Jun 4, 2017

wavebeem commented Jun 8, 2017

bd82 commented Jun 8, 2017

wavebeem commented Jun 8, 2017

mixtur commented Jun 8, 2017

wavebeem commented Jun 9, 2017

wavebeem commented Jun 9, 2017

caballus commented Jun 9, 2017 • edited Loading

wavebeem commented Jun 9, 2017

wavebeem commented Jul 4, 2017

wavebeem commented Jul 9, 2017

caballus commented Jun 9, 2017 •

edited

Loading