Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking state #158

Closed
dynajoe opened this issue May 21, 2017 · 23 comments
Closed

Tracking state #158

dynajoe opened this issue May 21, 2017 · 23 comments
Labels

Comments

@dynajoe
Copy link

dynajoe commented May 21, 2017

How might you track state while using this parsing library? Parsing a language that's indentation sensitive would require such state tracking.

@wavebeem
Copy link
Collaborator

It's possible to create new parsers while parsing a file using .chain, which can be used to solve this problem.

const util = require('util');
const P = require('parsimmon');

const Statement =
  P.lazy(() =>
    P.alt(
      Number.skip(P.string('\n')),
      Block
    )
  );

const Number =
  P.regexp(/[0-9]+/)
    .map(s => ['Number', +s])
    .desc('a number');

const Block =
  P.seq(P.string('block:\n'), P.whitespace, Statement)
    .chain(([block, indent, statement]) => {
      const Indent = P.string(indent);
      return Indent
        .then(Statement)
        .many()
        .map(statements => [statement, ...statements]);
    })
    .map(statements => ['Block', statements]);

const text = `\
block:
  1
  2
  3
  block:
    4
    5
  block:
    6
    7
    8
    9
`;
const ast = Block.tryParse(text);
const s = util.inspect(ast, {depth: null});
console.log(s);

The key is using .chain inside the Block parser to grab the indentation level of the first statement in the block and then require all other statements in that same block to start with that indentation. Please let me know if you have any questions about that. I think I'll add this to the examples folder.

And best of luck parsing an indentation sensitive language, they're certainly a little trickier than languages that aren't.

wavebeem pushed a commit that referenced this issue May 22, 2017
Fixes #158; Adds Python example
@wavebeem
Copy link
Collaborator

If this doesn't answer your question please let me know.

@dynajoe
Copy link
Author

dynajoe commented May 22, 2017

I may not be understanding it thoroughly enough. Here's an example:

In the language this works because it's all on one line:

type alias Foo = Maybe Int

this is also valid because Int is indented:

type alias Foo = Maybe 
    Int

However, this is NOT valid because Int must not occur at the beginning of a line. However, while parsing Maybe ... you want to allow newlines and other whitespace. Consuming newlines as part of parsing Maybe\t\s\n etc etc may need to change depending on context.

type alias Foo = Maybe 
Int

Maybe I'm not thinking about this right?

@dynajoe
Copy link
Author

dynajoe commented May 22, 2017

Also, thank you for the example!

@wavebeem
Copy link
Collaborator

Oh, so is this more of a Haskell-like indentation then? Because frankly I don't even understand how their rules work, and I frequently had indentation errors in my code when using it. My guess is that something like type alias starts off effectively a statement here, but after that… there needs to be some amount of indentation?

type alias Foo = Either
  Int
       String

Is that valid?

@wavebeem wavebeem reopened this May 22, 2017
@dynajoe
Copy link
Author

dynajoe commented May 22, 2017

Indeed that would be valid. I'm trying to parse the Elm programming language: http://elm-lang.org/

@dynajoe
Copy link
Author

dynajoe commented May 22, 2017

It may be a 2-pass kind of deal. Split the input on statements \n(type|import) etc etc and then run each of those through statement parsers and aggregate the results.

@wavebeem
Copy link
Collaborator

So maybe for your whitespace parser you need some kind of setup like "either it's just spaces, or it's any amount of whitespace (including newlines), followed by AT LEAST one space.

type WhyNotBoth a b =
    Left a
             | Right b
 | Both a b

Apparently that snippet is valid, so I . think my whitespace idea might be sufficient to work with Elm?

As for the 2-pass ordeal, is this so you can know whether identifiers represent types or imports while parsing?

@dynajoe
Copy link
Author

dynajoe commented May 22, 2017

Ah, no i thought this would be a way to not have to specialize what is valid whitespace throughout the parsing of a statement. Essentially, I'm just trying to think about how to break the problem down into smaller parts.

@wavebeem
Copy link
Collaborator

x =
 let
  y = 1
 in
 y + y

it looks like here the variables in a let expression must be indented more than the let itself… but it's also legal just to write it all on one line.

x =
 let y = 1 in y + y

This sounds like a tricky language to parse. You might need some kind of combination approach of the python-ish example I gave but also allowing inline forms in addition to block forms.

@wavebeem
Copy link
Collaborator

You might have to give let two cases where it either has exactly one binding inside with no newlines or it has a "block" inside…

@wavebeem
Copy link
Collaborator

wavebeem commented Jun 2, 2017

I wish you good luck in your parsing adventures for Elm. If you end up writing a parser for it, please send me an email or open an issue with your findings. I've never parsed a language such as it, and would like to learn a good strategy for it.

In the mean time, I don't have any plans to add explicit state tracking to Parsimmon because it's not a very good fit for JS (given lack of native persistent data structures) and I'm skeptical it could be added in to Parsimmon without a breaking API change.

@wavebeem wavebeem closed this as completed Jun 2, 2017
@wavebeem
Copy link
Collaborator

wavebeem commented Jun 4, 2017

Hmm, I just realized my Python-ish example doesn't actually work right. It just requires blocks be indented the same way, but indentation can actually DECREASE, which is obviously incorrect. I'm gonna reopen this one so I can remember to think about it.

@wavebeem wavebeem reopened this Jun 4, 2017
@wavebeem
Copy link
Collaborator

wavebeem commented Jun 8, 2017

NOTE TO SELF:

If each statement carried around its own indentation level as metadata, then I could make the block parser assert that the new indentation level is deeper than the old indentation level.

@bd82
Copy link
Contributor

bd82 commented Jun 8, 2017

Indentation is normally resolved at the lexing stage.
However, Parsimmon is a lexerless parsing library.

But what is the difference between parsing a sequence of characters or a sequence of "Tokens"?
Both are a sequence of "things" which we can match against.

Could Parsimmon be adapted to handle "generic" tokens?
That will enable resolving issues such as this by plugging in a separate tokenizer.

@wavebeem
Copy link
Collaborator

wavebeem commented Jun 8, 2017

Right, and I have written such a lexer before.

https://github.com/wavebeem/lattescript/blob/master/ls/lexer.js#L108

https://github.com/wavebeem/lattescript/blob/master/ls/lexer.js#L196-L214

But I'm still thinking about how to translate this into Parsimmon, given that Parsimmon can backtrack, unlike a usual lexer, so global state may become incorrect.

@mixtur
Copy link

mixtur commented Jun 8, 2017

Well. Given that state is immutable, you can store it on stack. In fact you already have that kind of state. It is current position in input string. So, pass some other immutable state together with it and that's it.

@wavebeem
Copy link
Collaborator

wavebeem commented Jun 9, 2017

Easier said than done, given that JS data structures are not built to be immutable, and this would probably need to be a breaking API change for the Parsimmon constructor. I'll have to keep thinking on this right now.

@wavebeem
Copy link
Collaborator

wavebeem commented Jun 9, 2017

This would work, but I explicitly tell people not to mutate global state during parse time… heh.

let util = require('util');
let P = require('..');

///////////////////////////////////////////////////////////////////////

// If this were actually Python, "Block" wouldn't be a statement on its own,
// but rather "If" and "While" would be statements that used "Block" inside.
let Statement =
  P.lazy(() => P.alt(ExpressionStatement, Block));

// Just a simple `foo()` style function call.
let FunctionCall =
  P.regexp(/[a-z]+/).skip(P.string('()'))
    .map(s => ['FunctionCall', s])
    .desc('a function call');

// To make it a statement we just need a newline afterward.
let ExpressionStatement =
  FunctionCall.skip(P.string('\n'));

let Indent =
  P.regexp(/[ ]+/).map(s => s.length);

let indentStack = [0];

function peek(stack) {
  return stack[stack.length - 1];
}

function indentSized(n) {
  return P.string(' ').times(n).desc(n + ' spaces');
}

// The general idea of this is to assume there's "block:" on its own line,
// then we capture the whitespace used to indent the first statement of the
// block, and require that every other statement has the same exact string of
// indentation in front of it.
let Block =
  P.seq(
    P.string('block:\n').then(Indent),
    Statement
  ).chain(args => {
    // `.chain` is called after a parser succeeds. It returns the next parser
    // to use for parsing. This allows subsequent parsing to be dependent on
    // previous text.
    let [indent, statement] = args;
    let max = peek(indentStack);
    if (indent < max) {
      return P.fail('indentation deeper than ' + max);
    } else {
      indentStack.push(indent);
    }
    return indentSized(indent).then(Statement).many()
      .map(statements => [statement].concat(statements));
  })
  .map(statements => ['Block', statements]);

///////////////////////////////////////////////////////////////////////

let text = `\
block:
    a()
    b()
    c()
    block:
        d()
        e()
        f()
    q()
    block:
        g()
        h()
        i()
        j()
    x()
    y()
    z()
`;

function prettyPrint(x) {
  let opts = {depth: null, colors: 'auto'};
  let s = util.inspect(x, opts);
  console.log(s);
}

let ast = Block.tryParse(text);
prettyPrint(ast);

@caballus
Copy link

caballus commented Jun 9, 2017

I don't get it. You never pop indentStack. Why is it still correct?

This input will not pass your parser.

let text = `\
block:
    a()
    block:
        b()
        block:
            c()
        d()
    block:
        e()
    f()
`;

But I see the idea now. You can use .chain instead of .map. Then you can track whatever state you want. And it will rollback on alts etc. But it is kinda awkward. Even when you don't need that state you still have to use .chain for every subsequent parser to not lose it.

@wavebeem
Copy link
Collaborator

wavebeem commented Jun 9, 2017

It's probably not correct. I didn't test it very thoroughly last night, heh.

@wavebeem
Copy link
Collaborator

wavebeem commented Jul 4, 2017

If anyone in this thread is interested, I'm taking a stab at a solution for this issue right now over in this PR: #190

I would appreciate any and all feedback on the PR.

cc @anko since you 👍 'd my post but aren't a watcher.

wavebeem pushed a commit that referenced this issue Jul 9, 2017
Fixes #158 adds example of parsing Python
@wavebeem
Copy link
Collaborator

wavebeem commented Jul 9, 2017

Check out the new python-ish example for my suggestion on how to parse an indentation-based language. I declined the PR I made because I think this solution is better.

https://github.com/jneen/parsimmon/blob/master/examples/python-ish.js

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants