Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Greedy repeat parser problem #9

Closed
tpluscode opened this issue May 30, 2014 · 5 comments
Closed

Greedy repeat parser problem #9

tpluscode opened this issue May 30, 2014 · 5 comments

Comments

@tpluscode
Copy link
Contributor

Hi

I'm trying to create a grammar to parse SPARQL property paths, as defined here. I only need part of that vocabulary. Unfortunately W3C uses their own EBNF syntax so instead of tranlating it to vanilla EBNF I decided to try rewrite the relevant rules using your shortcut syntax, which I find quite neat.

However, I've bumped into problems with rule PN_PREFIX.

PN_PREFIX = PN_CHARS_BASE ((PN_CHARS | '.')* PN_CHARS)?
PN_CHARS_BASE = (* basically any Unicode letter character *)
PN_CHARS = PN_CHARS_U | '-' | [0-9] | #x00B7 | [#x0300-#x036F] | [#x203F-#x2040]
PN_CHARS_U = PN_CHARS_BASE | '_'    

In short, PN_PREFIX should match the prefix of a QName URI. For example, given a QName rdf:type it would be matching the rdf part. As per the rule, the first character must be a letter, and then additionally characters are allowed.

I rewrote PN_PREFIX as

pn_chars_base & ~(-(pn_chars | '.') & pn_chars)

pn_chars_base matches the r and then the RepeatParser matches df, which is unfortunate because then pn_chars fails, because it doesn't match the colon, thus failing entire optional pattern.

The intent is that pn_chars_base, pn_chars inside repeat and the last pn_chars matched r, d and f respectively so that the entire pn_prefix matched rdf.

Any idea what's not right with my grammar?

@cwensley
Copy link
Member

Hi,

Yeah, the RepeatParser is certainly greedy. Though you do have options for this scenario:

First, you can enforce the rule that a '.' must follow another character such as:

pn_chars_base & -((pn_chars & '.' & pn_chars) | pn_chars);

Or, if performance is a big concern (i.e. you are parsing millions of these things in a server scenario) you can use the RepeatParser's separator to allow an optional period in-between each character:

pn_chars_base & (-pn_chars).SeparatedBy(~(Parser)'.');

Hope this helps!

@cwensley
Copy link
Member

Hm, thinking about that for a second, this will also work and is a little cleaner:

pn_chars_base & -(('.' & pn_chars) | pn_chars)

@tpluscode
Copy link
Contributor Author

Thanks it works of course. I haven't done any grammars since studies. However the W3C syntax makes me wonder. I understand that the semantics of

(PN_CHARS | '.')* PN_CHARS

mean that matching string contains a PN_CHARS preceeded by zero or more PN_CHARS or a dot. However without peeking forward while parsing the repeated group the parser cannot know if any given matched character should actually be matched by the next production in sequence. Hence the greedy behaviour and it is logical.

The question though is, is it a valid EBNF notation and such parsers are actually used? Or is this syntax just more human-readable and indented to be easier to comprehend in written form?

@cwensley
Copy link
Member

cwensley commented Jun 2, 2014

This is valid EBNF notation, and works with LALR parsers, though does not work with LL parsers like Eto.Parse. Being recursive descent, the repeat parser knows nothing about what should come after it. With LALR parsers, a huge 'table of possibilities' is typically created which allows it to handle patterns like this.

I've pondered the concept of adding look-ahead to Eto.Parse and it might be doable, however it may degrade performance which is not what I'd like to see.

@tpluscode
Copy link
Contributor Author

Thanks for clarifying this for me.

I certainly won't need the enchanced functonality, given that I was able to achieve the desired result by adjusting my productions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants