Greedy repeat parser problem #9

tpluscode · 2014-05-30T14:19:26Z

Hi

I'm trying to create a grammar to parse SPARQL property paths, as defined here. I only need part of that vocabulary. Unfortunately W3C uses their own EBNF syntax so instead of tranlating it to vanilla EBNF I decided to try rewrite the relevant rules using your shortcut syntax, which I find quite neat.

However, I've bumped into problems with rule PN_PREFIX.

PN_PREFIX = PN_CHARS_BASE ((PN_CHARS | '.')* PN_CHARS)?
PN_CHARS_BASE = (* basically any Unicode letter character *)
PN_CHARS = PN_CHARS_U | '-' | [0-9] | #x00B7 | [#x0300-#x036F] | [#x203F-#x2040]
PN_CHARS_U = PN_CHARS_BASE | '_'

In short, PN_PREFIX should match the prefix of a QName URI. For example, given a QName rdf:type it would be matching the rdf part. As per the rule, the first character must be a letter, and then additionally characters are allowed.

I rewrote PN_PREFIX as

pn_chars_base & ~(-(pn_chars | '.') & pn_chars)

pn_chars_base matches the r and then the RepeatParser matches df, which is unfortunate because then pn_chars fails, because it doesn't match the colon, thus failing entire optional pattern.

The intent is that pn_chars_base, pn_chars inside repeat and the last pn_chars matched r, d and f respectively so that the entire pn_prefix matched rdf.

Any idea what's not right with my grammar?

The text was updated successfully, but these errors were encountered:

cwensley · 2014-05-30T15:22:50Z

Hi,

Yeah, the RepeatParser is certainly greedy. Though you do have options for this scenario:

First, you can enforce the rule that a '.' must follow another character such as:

pn_chars_base & -((pn_chars & '.' & pn_chars) | pn_chars);

Or, if performance is a big concern (i.e. you are parsing millions of these things in a server scenario) you can use the RepeatParser's separator to allow an optional period in-between each character:

pn_chars_base & (-pn_chars).SeparatedBy(~(Parser)'.');

Hope this helps!

cwensley · 2014-05-30T15:25:09Z

Hm, thinking about that for a second, this will also work and is a little cleaner:

pn_chars_base & -(('.' & pn_chars) | pn_chars)

tpluscode · 2014-06-02T07:03:18Z

Thanks it works of course. I haven't done any grammars since studies. However the W3C syntax makes me wonder. I understand that the semantics of

(PN_CHARS | '.')* PN_CHARS

mean that matching string contains a PN_CHARS preceeded by zero or more PN_CHARS or a dot. However without peeking forward while parsing the repeated group the parser cannot know if any given matched character should actually be matched by the next production in sequence. Hence the greedy behaviour and it is logical.

The question though is, is it a valid EBNF notation and such parsers are actually used? Or is this syntax just more human-readable and indented to be easier to comprehend in written form?

cwensley · 2014-06-02T14:07:52Z

This is valid EBNF notation, and works with LALR parsers, though does not work with LL parsers like Eto.Parse. Being recursive descent, the repeat parser knows nothing about what should come after it. With LALR parsers, a huge 'table of possibilities' is typically created which allows it to handle patterns like this.

I've pondered the concept of adding look-ahead to Eto.Parse and it might be doable, however it may degrade performance which is not what I'd like to see.

tpluscode · 2014-06-02T14:13:02Z

Thanks for clarifying this for me.

I certainly won't need the enchanced functonality, given that I was able to achieve the desired result by adjusting my productions.

tpluscode mentioned this issue Jun 2, 2014

Matching surrogate pair characters #10

Closed

tpluscode closed this as completed Jun 5, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Greedy repeat parser problem #9

Greedy repeat parser problem #9

tpluscode commented May 30, 2014

cwensley commented May 30, 2014

cwensley commented May 30, 2014

tpluscode commented Jun 2, 2014

cwensley commented Jun 2, 2014

tpluscode commented Jun 2, 2014

Greedy repeat parser problem #9

Greedy repeat parser problem #9

Comments

tpluscode commented May 30, 2014

cwensley commented May 30, 2014

cwensley commented May 30, 2014

tpluscode commented Jun 2, 2014

cwensley commented Jun 2, 2014

tpluscode commented Jun 2, 2014