Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

Parametrizable rules #45

Open
ceymard opened this Issue · 11 comments

6 participants

@ceymard

It would be great to be able to parametrize rules with variables ;

string
   = '\"' parse_contents '\"' ->
   / '\'' parse_contents('\'') '\'' ->
   / '+' parse_contents('+') '+' -> /* sure why not :) */

parse_contents(terminator='\"')
    = ('\\' terminator / !terminator .)+ -> return stuff
@dmajda
Owner

Do you have specific use case where this would save you significant amount of work or make something currently impossible possible?

@ceymard

It makes parsing indentation levels much easier by calling rules that have the level passed as an argument.

Also, in a pure DRY logic, when doing things like "stuff delimited by this character with escape sequence such", it is nicer to call something like delimited('\'', '\\') than just do the rule (and its actions !) three times.

@dmajda
Owner

I should have been more clear. By "specific" I was looking for something like "I was working on a grammar of language X and there are 5 rules in there that could have been combined in one, here they are:" That is, I wanted to see real-world use case and real-world code. From that I can judge better in what cases this feature would be useful and for how many people.

Please don't take this as I am opposed to this feature per se. I just generally don't want to implement features useful only for a tiny fraction of languages or developers because of complexity and implementation cost. And in this case the cost is relatively high.

@ceymard

Just writing a parser for javascript, I could have string = delimited_by('\'') / delimited_by('\"') and then later regexp = delimited_by('/').

Lately, I've been writing a parser for a language of my own. I have something like this in a PEG framework I wrote for python :

LeftAssociative(op, subrule): l:subrule rest:(op subrule)* -> a_recursive_function_that_reverses_rest_and_makes_bintrees(l, op, subrule)

And I can then write:

...
exp_mult = LeftAssociative(/[\*\/%]/, exp_paren)
exp_add = LeftAssociative(/[\+-]/, exp_mult)

Since I have about as many levels of priority as in C++ (all its operators plus a few more), I'll let you imagine how useful in can be. I'm not done yet in the parsing expressions, yet I already use it 12 times.

@mcasimir

This would be great if combined with an 'import' feature

require(CommaSeparatedList.pegjs)
require(StringLiteral.pegjs)
require(IntegerLiteral.pegjs)

...

Function
 = name:FuncName "(" args:CommaSeparatedList(Literal)  ")" 

Hash
= "{"   props:CommaSeparatedList(Prop)   "}"

Prop
= Key ":" Literal

Literal =
  StringLiteral / IntegerLiteral
@krockot

(This is a bit more complicated than OP's request, but it seemed too close to justify its own thread.)

I'm building an R5RS Scheme parser with the help of PEG.js. Everything's rosy except for quasiquotations, which require context-aware parsing. It would be useful to be able to parameterize rules for the sake of on-the-fly rule generation from templates, avoiding a large amount of awkward post-processing. For example, a simplified quasiquotation grammar might look like:

    quasiquotation = qq[1]
    qq[n] = "`" qq_template[n]
    qq_template[0] = expression
    qq_template[n] = simple_datum / list_qq_template[n] / unquotation[n]
    list_qq_template[n] = "(" qq_template[n]* ")" / qq[n+1]
    unquotation[n] = "," qq_template[n-1]

I am interested in contributing to the development of this feature if there's any interest in adding it to the tool.

@fresheneesz

The main reason to do this would be to support grammars sensitive to context, which if I'm not mistaken, most popular languages are (I know for sure that C and python have context specific stuff). According to Trevor Jim, Haskell is also not context free, and asserts that most langauges aren't:

http://trevorjim.com/haskell-is-not-context-free/
http://trevorjim.com/how-to-prove-that-a-programming-language-is-context-free/

Using external state in a parser that can backtrack (like PEG can) is dangerous, and can produce issues such as can be seen in this parser:

{   var countCs = 0;
}

start = ((x/y) ws*)* { return countCs }

x = "ab" c "d"
y = "a" bc "e"

c = "c" { countCs++; }
bc = "bc" { countCs++; }

ws = " " / "\n"

The above returns 2 instead of the correct answer of 1. Issues like this can be hard to reason about, can create insidious hard-to-find bugs, and when found can be very hard to work around at all, much less doing it elegantly. It unclear to me how to even do this without doing post-processing of data returned by PEG. If somehow your parser itself needs the count, its simply out of luck.

Currently, (dangerously) using external state is the only way to parse grammar that are sensitive to context. With parameterized rules, a parser could parse this without risking invalid state:

start = countCs:((x<0>/y<0>) ws*)* { return countCs.reduce(function(a,b){return a+b[0];}, 0); }

x<count> = "ab" theCount:c<count> "d" { return theCount; }
y<count> = "a" theCount:bc<count> "e" { return theCount; }

c<count> = "c" { return count++; }
bc<count> = "bc" { return count++; }

ws = " " / "\n"

David, you asked for real situations, and python's whitespace indentation syntax is clearly an example here. I want to do similar white-space indentation syntax in Lima (the programming language I'm making with PEG). But I would not want to implement anything like that when I could inadvertantly create invalid state that blows everything to hell. I could name any parsing construct that requires context, like C's x* y (is it x times y or y being defined as a pointer to an x-typed value?).

Note that for grammars sensitive to context to be parsable, one would necessarily need to pass information returned from subexpressions already matched into a parameterized rule - otherwise the parser can't actually use any of the context. Here's a real example of a string type I'm considering for Lima that only works if parameterized parsing is available and can access (as variables) lablels of previously matched expressions:

literalStringWithExplicitLength = "string[" n:number ":" characters<n> "]"
number = n:[0-9]* {return parseInt(n.join(''));}
characters<n> = c:. { // base case
  if(n>0) return null; // don't match base case unless n is 0
  else return c;
}
/ c:. cs:characters<n-1> {
  ret c+cs
}

This would be able parse a string like string[10:abcdefghij] . You can't do that with nice pure PEG.js as it stands. You have do something awful like:

{ var literalStringLengthLeft=undefined;
}
literalStringWithExplicitLength = "string[" n:specialNumber ":" characters "]"
specialNumber = n:number {
  literalStringLengthLeft = n;
  return n;
}
number = n:[0-9]* {return parseInt(n.join(''));}
characters = c:character cs:characters? {
  return c + cs
}
character = c:. {
  if(literalStringLengthLeft > 0) {
    literalStringLengthLeft--;
    return c;
  } else {
    literalStringLengthLeft = undefined;
    return null; // doesn't match
  }
}

Many many protocols have this kind of parsing need - for example, IPv4 packets have a field describing its total length. You need that context to properly parse the rest of the packet. Same is true for IPv6, UDP, and probably any other packet-based protocol. Most protocols using TCP are also going to need something like this, since one needs to be able to trasmit multiple completely separate objects using the same conceptual character stream.

Anyways, I hope I've given some good examples and reasons why I think this is not only a nice feature, not only a powerful feature, but really an essential feature that many parsers are missing (including, for the moment, PEG.js).

@otac0n

Pegasus (a project that shares most of its syntax with peg.js) solves this by having a #STATE{} expression that is given the ability to mutate a dictionary of state. This dictionary of state is backtracked when rules are backtracked. This allows it to support significant whitespace parsing (see the wiki entry about Significant Whitespace for the details).

Also, by backtracking the state along with the parsing cursor, memoization can be accomplished for stateful rules as well.

Peg.js could easily do the same, I reckon.

@fresheneesz

How does Pegasus manage backtracking state when rules backtrack? I can imagine that you could keep a snapshot of the whole program state that changed, and revert it back, but that would be expensive. I could imagine keeping a snapshot of only the variables that changed, but that would either require the user to specify it which would add complexity to creating parsers, or would require the parser to somehow figure out all the state changed in some bit of code. None of these sound ideal, so how does Pegasus do it?

Theoretically, the parser could avoid invalidly executed actions if A. actions are queued up in closures and only executed once the parser has fully completed, and B. because they execute after the parser has completed they couldn't cancel a rule match. Perhaps that scheme would be more optimal than the state backtracking done in pegasus?

Also, fixing the problem of invalid state is very nice indeed, but it doesn't solve the problem of expressibility I brought up related to a string literal like string[10:abcdefghij], but I'm definitely interested in how it works

@otac0n

It doesn't backtrack the whole program's state. It maintains an immutable dictionary of state. It saves the current state dictionary along with the cursor and whenever the cursor is backtracked, the state dictionary gets backtracked with it. The dictionary is immutable anywhere outside of #STATE{} actions, and is COPIED just before each state change.

There is a small performance penalty for setting an extra variable every time you advance the cursor, but this is far outweighed by the ability to memoize stateful rules. Also, this doesn't lead to tons of memory allocation, because the immutable nature of the state dictionary allows it to be shared until it is mutated. For example, if you didn't have state in your parser, there would be only one allocation: a single (empty) state dictionary.

JavaScript doesn't (to my knowledge) have the ability to make an object immutable, but that was mostly a safety feature. Peg.js would just need to copy a state dictionary before processing each #STATE{} code block and backtrack that whenever the cursor is backtracked.

@fresheneesz

Oh ok, so the user basically does have to specify what state they're changing. Thats pretty cool. But I still don't think it really covers the same benefits that parameterization does. It sounds like its probably useful in its own right for other things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.