-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tracking state #158
Comments
It's possible to create new parsers while parsing a file using const util = require('util');
const P = require('parsimmon');
const Statement =
P.lazy(() =>
P.alt(
Number.skip(P.string('\n')),
Block
)
);
const Number =
P.regexp(/[0-9]+/)
.map(s => ['Number', +s])
.desc('a number');
const Block =
P.seq(P.string('block:\n'), P.whitespace, Statement)
.chain(([block, indent, statement]) => {
const Indent = P.string(indent);
return Indent
.then(Statement)
.many()
.map(statements => [statement, ...statements]);
})
.map(statements => ['Block', statements]);
const text = `\
block:
1
2
3
block:
4
5
block:
6
7
8
9
`;
const ast = Block.tryParse(text);
const s = util.inspect(ast, {depth: null});
console.log(s); The key is using And best of luck parsing an indentation sensitive language, they're certainly a little trickier than languages that aren't. |
If this doesn't answer your question please let me know. |
I may not be understanding it thoroughly enough. Here's an example: In the language this works because it's all on one line:
this is also valid because
However, this is NOT valid because
Maybe I'm not thinking about this right? |
Also, thank you for the example! |
Oh, so is this more of a Haskell-like indentation then? Because frankly I don't even understand how their rules work, and I frequently had indentation errors in my code when using it. My guess is that something like
Is that valid? |
Indeed that would be valid. I'm trying to parse the Elm programming language: http://elm-lang.org/ |
It may be a 2-pass kind of deal. Split the input on statements |
So maybe for your whitespace parser you need some kind of setup like "either it's just spaces, or it's any amount of whitespace (including newlines), followed by AT LEAST one space.
Apparently that snippet is valid, so I . think my whitespace idea might be sufficient to work with Elm? As for the 2-pass ordeal, is this so you can know whether identifiers represent types or imports while parsing? |
Ah, no i thought this would be a way to not have to specialize what is valid whitespace throughout the parsing of a statement. Essentially, I'm just trying to think about how to break the problem down into smaller parts. |
it looks like here the variables in a
This sounds like a tricky language to parse. You might need some kind of combination approach of the python-ish example I gave but also allowing inline forms in addition to block forms. |
You might have to give |
I wish you good luck in your parsing adventures for Elm. If you end up writing a parser for it, please send me an email or open an issue with your findings. I've never parsed a language such as it, and would like to learn a good strategy for it. In the mean time, I don't have any plans to add explicit state tracking to Parsimmon because it's not a very good fit for JS (given lack of native persistent data structures) and I'm skeptical it could be added in to Parsimmon without a breaking API change. |
Hmm, I just realized my Python-ish example doesn't actually work right. It just requires blocks be indented the same way, but indentation can actually DECREASE, which is obviously incorrect. I'm gonna reopen this one so I can remember to think about it. |
NOTE TO SELF: If each statement carried around its own indentation level as metadata, then I could make the block parser assert that the new indentation level is deeper than the old indentation level. |
Indentation is normally resolved at the lexing stage. But what is the difference between parsing a sequence of characters or a sequence of "Tokens"? Could Parsimmon be adapted to handle "generic" tokens? |
Right, and I have written such a lexer before. https://github.com/wavebeem/lattescript/blob/master/ls/lexer.js#L108 https://github.com/wavebeem/lattescript/blob/master/ls/lexer.js#L196-L214 But I'm still thinking about how to translate this into Parsimmon, given that Parsimmon can backtrack, unlike a usual lexer, so global state may become incorrect. |
Well. Given that state is immutable, you can store it on stack. In fact you already have that kind of state. It is current position in input string. So, pass some other immutable state together with it and that's it. |
Easier said than done, given that JS data structures are not built to be immutable, and this would probably need to be a breaking API change for the Parsimmon constructor. I'll have to keep thinking on this right now. |
This would work, but I explicitly tell people not to mutate global state during parse time… heh. let util = require('util');
let P = require('..');
///////////////////////////////////////////////////////////////////////
// If this were actually Python, "Block" wouldn't be a statement on its own,
// but rather "If" and "While" would be statements that used "Block" inside.
let Statement =
P.lazy(() => P.alt(ExpressionStatement, Block));
// Just a simple `foo()` style function call.
let FunctionCall =
P.regexp(/[a-z]+/).skip(P.string('()'))
.map(s => ['FunctionCall', s])
.desc('a function call');
// To make it a statement we just need a newline afterward.
let ExpressionStatement =
FunctionCall.skip(P.string('\n'));
let Indent =
P.regexp(/[ ]+/).map(s => s.length);
let indentStack = [0];
function peek(stack) {
return stack[stack.length - 1];
}
function indentSized(n) {
return P.string(' ').times(n).desc(n + ' spaces');
}
// The general idea of this is to assume there's "block:" on its own line,
// then we capture the whitespace used to indent the first statement of the
// block, and require that every other statement has the same exact string of
// indentation in front of it.
let Block =
P.seq(
P.string('block:\n').then(Indent),
Statement
).chain(args => {
// `.chain` is called after a parser succeeds. It returns the next parser
// to use for parsing. This allows subsequent parsing to be dependent on
// previous text.
let [indent, statement] = args;
let max = peek(indentStack);
if (indent < max) {
return P.fail('indentation deeper than ' + max);
} else {
indentStack.push(indent);
}
return indentSized(indent).then(Statement).many()
.map(statements => [statement].concat(statements));
})
.map(statements => ['Block', statements]);
///////////////////////////////////////////////////////////////////////
let text = `\
block:
a()
b()
c()
block:
d()
e()
f()
q()
block:
g()
h()
i()
j()
x()
y()
z()
`;
function prettyPrint(x) {
let opts = {depth: null, colors: 'auto'};
let s = util.inspect(x, opts);
console.log(s);
}
let ast = Block.tryParse(text);
prettyPrint(ast); |
I don't get it. You never pop indentStack. Why is it still correct? This input will not pass your parser. let text = `\
block:
a()
block:
b()
block:
c()
d()
block:
e()
f()
`; But I see the idea now. You can use .chain instead of .map. Then you can track whatever state you want. And it will rollback on alts etc. But it is kinda awkward. Even when you don't need that state you still have to use .chain for every subsequent parser to not lose it. |
It's probably not correct. I didn't test it very thoroughly last night, heh. |
Fixes #158 adds example of parsing Python
Check out the new python-ish example for my suggestion on how to parse an indentation-based language. I declined the PR I made because I think this solution is better. https://github.com/jneen/parsimmon/blob/master/examples/python-ish.js |
How might you track state while using this parsing library? Parsing a language that's indentation sensitive would require such state tracking.
The text was updated successfully, but these errors were encountered: