Implement parametrizable rules #45

Open
ceymard opened this Issue Aug 25, 2011 · 16 comments

Projects

None yet

10 participants

@ceymard
ceymard commented Aug 25, 2011

It would be great to be able to parametrize rules with variables ;

string
   = '\"' parse_contents '\"' ->
   / '\'' parse_contents('\'') '\'' ->
   / '+' parse_contents('+') '+' -> /* sure why not :) */

parse_contents(terminator='\"')
    = ('\\' terminator / !terminator .)+ -> return stuff
@dmajda
Member
dmajda commented Jan 10, 2012

Do you have specific use case where this would save you significant amount of work or make something currently impossible possible?

@ceymard
ceymard commented Jan 10, 2012

It makes parsing indentation levels much easier by calling rules that have the level passed as an argument.

Also, in a pure DRY logic, when doing things like "stuff delimited by this character with escape sequence such", it is nicer to call something like delimited('\'', '\\') than just do the rule (and its actions !) three times.

@dmajda
Member
dmajda commented Jan 11, 2012

I should have been more clear. By "specific" I was looking for something like "I was working on a grammar of language X and there are 5 rules in there that could have been combined in one, here they are:" That is, I wanted to see real-world use case and real-world code. From that I can judge better in what cases this feature would be useful and for how many people.

Please don't take this as I am opposed to this feature per se. I just generally don't want to implement features useful only for a tiny fraction of languages or developers because of complexity and implementation cost. And in this case the cost is relatively high.

@ceymard
ceymard commented Jan 11, 2012

Just writing a parser for javascript, I could have string = delimited_by('\'') / delimited_by('\"') and then later regexp = delimited_by('/').

Lately, I've been writing a parser for a language of my own. I have something like this in a PEG framework I wrote for python :

LeftAssociative(op, subrule): l:subrule rest:(op subrule)* -> a_recursive_function_that_reverses_rest_and_makes_bintrees(l, op, subrule)

And I can then write:

...
exp_mult = LeftAssociative(/[\*\/%]/, exp_paren)
exp_add = LeftAssociative(/[\+-]/, exp_mult)

Since I have about as many levels of priority as in C++ (all its operators plus a few more), I'll let you imagine how useful in can be. I'm not done yet in the parsing expressions, yet I already use it 12 times.

@mcasimir

This would be great if combined with an 'import' feature

require(CommaSeparatedList.pegjs)
require(StringLiteral.pegjs)
require(IntegerLiteral.pegjs)

...

Function
 = name:FuncName "(" args:CommaSeparatedList(Literal)  ")" 

Hash
= "{"   props:CommaSeparatedList(Prop)   "}"

Prop
= Key ":" Literal

Literal =
  StringLiteral / IntegerLiteral
@krockot
krockot commented Jun 12, 2012

(This is a bit more complicated than OP's request, but it seemed too close to justify its own thread.)

I'm building an R5RS Scheme parser with the help of PEG.js. Everything's rosy except for quasiquotations, which require context-aware parsing. It would be useful to be able to parameterize rules for the sake of on-the-fly rule generation from templates, avoiding a large amount of awkward post-processing. For example, a simplified quasiquotation grammar might look like:

    quasiquotation = qq[1]
    qq[n] = "`" qq_template[n]
    qq_template[0] = expression
    qq_template[n] = simple_datum / list_qq_template[n] / unquotation[n]
    list_qq_template[n] = "(" qq_template[n]* ")" / qq[n+1]
    unquotation[n] = "," qq_template[n-1]

I am interested in contributing to the development of this feature if there's any interest in adding it to the tool.

@fresheneesz

The main reason to do this would be to support grammars sensitive to context, which if I'm not mistaken, most popular languages are (I know for sure that C and python have context specific stuff). According to Trevor Jim, Haskell is also not context free, and asserts that most langauges aren't:

http://trevorjim.com/haskell-is-not-context-free/
http://trevorjim.com/how-to-prove-that-a-programming-language-is-context-free/

Using external state in a parser that can backtrack (like PEG can) is dangerous, and can produce issues such as can be seen in this parser:

{   var countCs = 0;
}

start = ((x/y) ws*)* { return countCs }

x = "ab" c "d"
y = "a" bc "e"

c = "c" { countCs++; }
bc = "bc" { countCs++; }

ws = " " / "\n"

The above returns 2 instead of the correct answer of 1. Issues like this can be hard to reason about, can create insidious hard-to-find bugs, and when found can be very hard to work around at all, much less doing it elegantly. It unclear to me how to even do this without doing post-processing of data returned by PEG. If somehow your parser itself needs the count, its simply out of luck.

Currently, (dangerously) using external state is the only way to parse grammar that are sensitive to context. With parameterized rules, a parser could parse this without risking invalid state:

start = countCs:((x<0>/y<0>) ws*)* { return countCs.reduce(function(a,b){return a+b[0];}, 0); }

x<count> = "ab" theCount:c<count> "d" { return theCount; }
y<count> = "a" theCount:bc<count> "e" { return theCount; }

c<count> = "c" { return count++; }
bc<count> = "bc" { return count++; }

ws = " " / "\n"

David, you asked for real situations, and python's whitespace indentation syntax is clearly an example here. I want to do similar white-space indentation syntax in Lima (the programming language I'm making with PEG). But I would not want to implement anything like that when I could inadvertantly create invalid state that blows everything to hell. I could name any parsing construct that requires context, like C's x* y (is it x times y or y being defined as a pointer to an x-typed value?).

Note that for grammars sensitive to context to be parsable, one would necessarily need to pass information returned from subexpressions already matched into a parameterized rule - otherwise the parser can't actually use any of the context. Here's a real example of a string type I'm considering for Lima that only works if parameterized parsing is available and can access (as variables) lablels of previously matched expressions:

literalStringWithExplicitLength = "string[" n:number ":" characters<n> "]"
number = n:[0-9]* {return parseInt(n.join(''));}
characters<n> = c:. { // base case
  if(n>0) return null; // don't match base case unless n is 0
  else return c;
}
/ c:. cs:characters<n-1> {
  ret c+cs
}

This would be able parse a string like string[10:abcdefghij] . You can't do that with nice pure PEG.js as it stands. You have do something awful like:

{ var literalStringLengthLeft=undefined;
}
literalStringWithExplicitLength = "string[" n:specialNumber ":" characters "]"
specialNumber = n:number {
  literalStringLengthLeft = n;
  return n;
}
number = n:[0-9]* {return parseInt(n.join(''));}
characters = c:character cs:characters? {
  return c + cs
}
character = c:. {
  if(literalStringLengthLeft > 0) {
    literalStringLengthLeft--;
    return c;
  } else {
    literalStringLengthLeft = undefined;
    return null; // doesn't match
  }
}

Many many protocols have this kind of parsing need - for example, IPv4 packets have a field describing its total length. You need that context to properly parse the rest of the packet. Same is true for IPv6, UDP, and probably any other packet-based protocol. Most protocols using TCP are also going to need something like this, since one needs to be able to trasmit multiple completely separate objects using the same conceptual character stream.

Anyways, I hope I've given some good examples and reasons why I think this is not only a nice feature, not only a powerful feature, but really an essential feature that many parsers are missing (including, for the moment, PEG.js).

@otac0n
otac0n commented Jan 6, 2013

Pegasus (a project that shares most of its syntax with peg.js) solves this by having a #STATE{} expression that is given the ability to mutate a dictionary of state. This dictionary of state is backtracked when rules are backtracked. This allows it to support significant whitespace parsing (see the wiki entry about Significant Whitespace for the details).

Also, by backtracking the state along with the parsing cursor, memoization can be accomplished for stateful rules as well.

Peg.js could easily do the same, I reckon.

@fresheneesz

How does Pegasus manage backtracking state when rules backtrack? I can imagine that you could keep a snapshot of the whole program state that changed, and revert it back, but that would be expensive. I could imagine keeping a snapshot of only the variables that changed, but that would either require the user to specify it which would add complexity to creating parsers, or would require the parser to somehow figure out all the state changed in some bit of code. None of these sound ideal, so how does Pegasus do it?

Theoretically, the parser could avoid invalidly executed actions if A. actions are queued up in closures and only executed once the parser has fully completed, and B. because they execute after the parser has completed they couldn't cancel a rule match. Perhaps that scheme would be more optimal than the state backtracking done in pegasus?

Also, fixing the problem of invalid state is very nice indeed, but it doesn't solve the problem of expressibility I brought up related to a string literal like string[10:abcdefghij], but I'm definitely interested in how it works

@otac0n
otac0n commented Jan 6, 2013

It doesn't backtrack the whole program's state. It maintains an immutable dictionary of state. It saves the current state dictionary along with the cursor and whenever the cursor is backtracked, the state dictionary gets backtracked with it. The dictionary is immutable anywhere outside of #STATE{} actions, and is COPIED just before each state change.

There is a small performance penalty for setting an extra variable every time you advance the cursor, but this is far outweighed by the ability to memoize stateful rules. Also, this doesn't lead to tons of memory allocation, because the immutable nature of the state dictionary allows it to be shared until it is mutated. For example, if you didn't have state in your parser, there would be only one allocation: a single (empty) state dictionary.

JavaScript doesn't (to my knowledge) have the ability to make an object immutable, but that was mostly a safety feature. Peg.js would just need to copy a state dictionary before processing each #STATE{} code block and backtrack that whenever the cursor is backtracked.

@fresheneesz

Oh ok, so the user basically does have to specify what state they're changing. Thats pretty cool. But I still don't think it really covers the same benefits that parameterization does. It sounds like its probably useful in its own right for other things.

@dmajda dmajda changed the title from Parametrizable rules to Implement parametrizable rules Aug 14, 2015
@alkchr alkchr referenced this issue in PuerkitoBio/pigeon Sep 10, 2015
Open

Significant Whitespace Parsing #7

@tebbi
tebbi commented Nov 2, 2015

I just wrote a fork that supplies an environment, accessible using the variable env: https://github.com/tebbi/pegjs
This is the same as the #STATE{} object suggested above.
It is a quick hack, using a (package-)global variable, which is restored whenever a parsing function is left. The copying of env is accomplished with Object.create().

Here is an example grammar that uses it to parse whitespace-defined blocks a la Python:

{
  env.indLevel = -1
}

block =
  empty
  ind:ws* &{
    if (ind.length <= env.indLevel) return false;
    env.indLevel = ind.length;
    return true;
  }
  first:statement? rest:indStatement*
  {
    if (first) rest.unshift(first);
    return rest;
  }

indStatement =
  "\n" empty ind:ws* &{ return env.indLevel === ind.length; }
  stm:statement
  {return stm; }

statement =
    id:identifier ws* ":" ws* "\n"
    bl:block { return [id, bl]; }
  / identifier

identifier = s:[a-z]* { return s.join(""); }

empty = (ws* "\n")*

ws = [ \t\r]

Here is an example input for the resulting parser:

a
b:
   c
   d:
       e
   f
g
@Mingun Mingun added a commit to Mingun/pegjs that referenced this issue Jan 3, 2016
@Mingun Mingun Issue #45 (template rules): Support in grammar. Syntax for template d…
…efining:

List<expr, delim> 'Generic list'
  = h:expr t:(delim e:expr {return e;})* { return [h].concat(t); };

Syntax for template instantiation:

CommaSeparatedIntList = List<n, ','>;
n = n:$[0-9]+ {return parseInt(n, 10);}
9ff84c2
@Mingun Mingun added a commit to Mingun/pegjs that referenced this issue Jan 3, 2016
@Mingun Mingun Issue #45 (template rules): Regenerate parser. ae9149a
@Mingun Mingun added a commit to Mingun/pegjs that referenced this issue Jan 3, 2016
@Mingun Mingun Issue #45 (template rules): Add precheck pass for completes creating …
…AST and default handling for new AST nodes.
3e32263
@Mingun Mingun added a commit to Mingun/pegjs that referenced this issue Jan 3, 2016
@Mingun Mingun Issue #45 (template rules): Implement core components for template ru…
…le feature.
3591e77
@Mingun Mingun added a commit to Mingun/pegjs that referenced this issue Jan 3, 2016
@Mingun Mingun Issue #45 (template rules): Implement template rules for |optimize=sp…
…eed| mode.
65338da
@Mingun Mingun added a commit to Mingun/pegjs that referenced this issue Jan 3, 2016
@Mingun Mingun Issue #45 (template rules): Implement template rules for |optimize=si…
…ze| mode.
8445817
@Mingun Mingun added a commit to Mingun/pegjs that referenced this issue Jan 3, 2016
@Mingun Mingun Issue #45 (template rules): Correct existing passes for support templ…
…ate rules and correct testcases.
d58c710
@qwertie
qwertie commented May 29, 2016 edited

I have the impression that PEG.js doesn't support parameters of any kind on rules - which is surprising. This feature is very important to me.

What I need is simpler than the OP's request - the OP wants to modify the grammar itself depending on the parameter, but at a minimum I just need to pass an integer into a rule. Basically I want to translate an LLLPG rule that looks like this (where PrefixExpr is a high-precedence expression such as a prefix expression like -x, or an identifier...):

@[LL(1)]
rule Expr(context::Precedence)::LNode @{
    {prec::Precedence;}
    e:PrefixExpr(context)
    greedy
    (   // Infix operator
        &{context.CanParse(prec=InfixPrecedenceOf(LT($LI)))}
        t:(NormalOp|BQString|Dot|Assignment)
        rhs:Expr(prec)
        { ... }
    |   // Method_calls(with arguments), indexers[with indexes], generic!arguments
        &{context.CanParse(P.Primary)}
        e=FinishPrimaryExpr(e)
    |   // Suffix operator
        ...
    )*
    {return e;}
};
// Helper rule that parses one of the syntactically special primary expressions
@[private] rule FinishPrimaryExpr(e::LNode)::LNode @{
(   // call(function)
    "(" list:ExprList(ref endMarker) ")"
    { ... }
    |   // ! operator (generic #of)
        "!" ...
    |   // Indexer / square brackets
        {var args = (new VList!LNode { e });}
        "[" args=ExprList(args) "]"
        { ... }
    )
    {return e;}
};

My language has 25 precedence levels, and with these rules I have collapsed almost all of them to be processed by a single rule (you can think of Precedence as a wrapper around a couple of integers that describe the precedence of an operator). Moreover, my language has an infinite number of operators (basically any sequence of punctuation) and the precedence of an operator is decided after it is parsed. While it would be technically possible to parse the language in the usual way, with a separate rule for each of the 25 kinds of operator, that would be a horrible way to do it.

Also you can see here that the inner rule FinishPrimaryExpr constructs a syntax tree that incorporates a parameter passed from the enclosing rule.

So ... is there any way to pass parameters to a PEG.js rule?

@burnpanck

+1! In my case, I simply want to generate a parser for a syntax, where some delimiters are globally configurable. In this case, I can achieve this by replacing the delimiter literals by match anything expressions combined with a predicated, but it would be much more elegant (and also more efficient) if the match-everything could simply be replaced by a variable.

@bd82 bd82 referenced this issue in burnpanck/MathJax-third-party-extensions Jun 27, 2016
Open

Dynamically defined Delimiters. #4

@nunocastromartins

+1, any chance to see this implemented in the foreseeable future?

Another use-case. This is from your javascript.pegjs example:

(...)

RelationalExpression
  = head:ShiftExpression
    tail:(__ RelationalOperator __ ShiftExpression)*
    { return buildBinaryExpression(head, tail); }

RelationalOperator
  = "<="
  / ">="
  / $("<" !"<")
  / $(">" !">")
  / $InstanceofToken
  / $InToken

RelationalExpressionNoIn
  = head:ShiftExpression
    tail:(__ RelationalOperatorNoIn __ ShiftExpression)*
    { return buildBinaryExpression(head, tail); }

RelationalOperatorNoIn
  = "<="
  / ">="
  / $("<" !"<")
  / $(">" !">")
  / $InstanceofToken

(...)

  (...)
  / ForToken __
    "(" __
    init:(ExpressionNoIn __)? ";" __
    test:(Expression __)? ";" __
    update:(Expression __)?
    ")" __
    body:Statement
  (...)

(...)

All these ...NoIn rules (and there are a LOT of them) are required simply because of the for in statement. Wouldn't a much better approach to this be something like:

(...)

RelationalExpression<allowIn>
  = head:ShiftExpression
    tail:(__ RelationalOperator<allowIn> __ ShiftExpression)*
    { return buildBinaryExpression(head, tail); }

RelationalOperator<allowIn>
  = "<="
  / ">="
  / $("<" !"<")
  / $(">" !">")
  / $InstanceofToken
  / &{ return allowIn; } InToken
    {return "in";}

(...)

  (...)
  / ForToken __
    "(" __
    init:(Expression<false> __)? ";" __
    test:(Expression<true> __)? ";" __
    update:(Expression<true> __)?
    ")" __
    body:Statement
  (...)

(...)

This feels very similar to how, for example, the JavaScript grammar is specified: https://tc39.github.io/ecma262/#prod-IterationStatement (note the ~In)

A language I'm currently developing has this exact problem: I'd like to disable/enable some rules at certain points only. I'd very much like to refrain from duplicating every affected rule like you did for the JavaScript grammar.

Is there any alternative way to achieve this without duplicating the rules?

@dmajda
Member
dmajda commented Oct 11, 2016

+1, any chance to see this implemented in the foreseeable future?

This issue has a milestone assigned (post-1.0.0). The current version of PEG.js is 0.10.0. Obviously, post-1.0.0 issues will be dealt with after releasing 1.0.0, which will happen at some point after releasing 0.11.0 according to the roadmap.

This should answer your question. The best way to accelerate the whole process is to help with issues targeted for 0.11.0 and 1.0.0.

Is there any alternative way to achieve this without duplicating the rules?

A possible way is to track the state manually and then use semantic predicates. But this approach has issues with backtracking and I would not recommend it (in other words, when given a choice between rule duplication and manual state tracking, I would choose duplication).

@Mingun Mingun added a commit to Mingun/pegjs that referenced this issue Nov 27, 2016
@Mingun Mingun Issue #45 (template rules): Support in grammar. Syntax for template d…
…efining:

List<expr, delim> 'Generic list'
  = h:expr t:(delim e:expr {return e;})* { return [h].concat(t); };

Syntax for template instantiation:

CommaSeparatedIntList = List<n, ','>;
n = n:$[0-9]+ {return parseInt(n, 10);}
e426f6c
@Mingun Mingun added a commit to Mingun/pegjs that referenced this issue Nov 27, 2016
@Mingun Mingun Issue #45 (template rules): Regenerate parser. f596991
@Mingun Mingun added a commit to Mingun/pegjs that referenced this issue Nov 27, 2016
@Mingun Mingun Issue #45 (template rules): Add precheck pass for completes creating …
…AST and default handling for new AST nodes.
5f418c7
@Mingun Mingun added a commit to Mingun/pegjs that referenced this issue Nov 27, 2016
@Mingun Mingun Issue #45 (template rules): Implement core components for template ru…
…le feature.
ffe78df
@Mingun Mingun added a commit to Mingun/pegjs that referenced this issue Nov 27, 2016
@Mingun Mingun Issue #45 (template rules): Implement template rules for |optimize=sp…
…eed| mode.
d69fc59
@Mingun Mingun added a commit to Mingun/pegjs that referenced this issue Nov 27, 2016
@Mingun Mingun Issue #45 (template rules): Implement template rules for |optimize=si…
…ze| mode.
2b0ea7b
@Mingun Mingun added a commit to Mingun/pegjs that referenced this issue Nov 27, 2016
@Mingun Mingun Issue #45 (template rules): Correct existing passes for support templ…
…ate rules and correct testcases.
2c9bd36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment