
Limiting Parsings for Ambiguous Grammars (Bocages) #53

Closed
YafahEdelman opened this issue Mar 24, 2015 · 23 comments

Comments

@YafahEdelman (Contributor)

I know it's a long shot and not really something Earley was meant for, but are there any methods or optimizations we could implement to deal with ambiguous grammars that have exponentially many parsings? For instance:

num -> num "+" num | [0-9]:+

This seems simple, but nearley can't deal well with large statements of this type because the number of possible parsings grows exponentially with the number of plus signs. Could we add an option to limit the number of parsings?
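For concreteness, the blow-up can be counted: under the rule num -> num "+" num, an input with n plus signs has one parse tree per way of bracketing the sum, which is the nth Catalan number. A small sketch in plain JavaScript (illustrative, not part of nearley):

```javascript
// Number of distinct parse trees for an expression with n binary "+"
// operators under the ambiguous rule `num -> num "+" num | [0-9]:+`.
// This is the nth Catalan number, which grows roughly like 4^n / n^1.5.
function parseCount(n) {
  // C(0) = 1; C(n) = sum over i of C(i) * C(n - 1 - i)
  const c = [1];
  for (let k = 1; k <= n; k++) {
    let total = 0;
    for (let i = 0; i < k; i++) total += c[i] * c[k - 1 - i];
    c.push(total);
  }
  return c[n];
}

console.log(parseCount(1));  // 1
console.log(parseCount(5));  // 42
console.log(parseCount(10)); // 16796
```

So "1+2+...+11" (ten plus signs) already has 16,796 distinct parse trees, which is why enumerating them all quickly becomes infeasible.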

@kach (Owner) commented Mar 24, 2015

That's not a well-written grammar, though. You should write num -> num "+" [0-9]:+ | [0-9]:+. If there's an ambiguous grammar, the assumption is that you want all possible parsings (I can think of use-cases where this is a legitimate assumption).

@YafahEdelman (Contributor, Author)

But it would be nice for this to be an option rather than having nearley just make that assumption. Do you have any ideas on how to do this?

@kach (Owner) commented Mar 25, 2015

Well, I think it would be better to write a linting service that warns you if your grammar is ambiguous. In fact, that would be really nice! It would also be algorithmically nontrivial, though, because I can't quickly think of a way to statically determine whether a grammar is ambiguous.

@YafahEdelman (Contributor, Author)

After a bit of searching around (building such an algorithm did seem quite interesting), I discovered that it is extremely non-trivial; in fact, it's not computable. Maybe it's possible to write a heuristic that works in a lot of basic cases, though; I'll have to research some more.
Edit: Maybe it's only not computable for general grammars? I'm not sure; I'll keep researching.

@tjvr (Collaborator) commented Apr 10, 2015

Well, I think it would be better to write a linting service that warns you if your grammar is ambiguous.

You know that's undecidable in general, right? As JacobEdelman points out above.

@YafahEdelman (Contributor, Author)

Although it's undecidable in general, it is possible to add a mode that allows precedence and finds only one possible parsing. I'm still not sure exactly what kind of speed increase this would give, but I've implemented a (toy) parser that can do it.

@tjvr (Collaborator) commented Apr 10, 2015

Because Earley builds a parse table with all possible states, how can you avoid computing all possible parsings?

I'm not sure what you mean by adding precedence, but that sounds like a bad idea. @Hardmath123 is a fan of using BNF to express operator precedence :-)

@YafahEdelman (Contributor, Author)

A general outline: you selectively compute parts of the parse table in an attempt to reach a final parsing, and if that parsing is unsuccessful, you compute more of the table. I'll look into doing that, but I do agree that if it doesn't end up giving a dramatic speed increase, it won't be worth it.

@patrickhuber

You could build the shared packed parse forest (SPPF) as you go. Elizabeth Scott describes this in her 2008 paper "SPPF-Style Parsing From Earley Recognisers".

Marpa's Jeffrey Kegler uses the term "bocage" to denote this structure, as you can see here.
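An illustrative sketch of the structure being discussed (plain JavaScript, not Scott's full construction): an SPPF packs every derivation into one shared graph, where each node stores "packed" alternatives pointing at shared children, so an exponential family of trees takes only polynomial space.

```javascript
// Toy SPPF node: `packed` holds one child list per distinct derivation.
// Children are shared node references, which is where the saving comes from.
class SppfNode {
  constructor(label, start, end) {
    this.label = label;   // symbol (or dotted rule) this node covers
    this.start = start;   // extent of input covered
    this.end = end;
    this.packed = [];     // one entry per derivation alternative
  }
  addAlternative(children) {
    this.packed.push(children);
  }
  // Count the trees represented without materialising any of them.
  countTrees() {
    if (this.packed.length === 0) return 1; // terminal leaf
    let total = 0;
    for (const children of this.packed) {
      let product = 1;
      for (const child of children) product *= child.countTrees();
      total += product;
    }
    return total;
  }
}

// "1+2+3": both derivations of num(0,5) share the same leaf nodes.
const d1 = new SppfNode("[0-9]", 0, 1);
const d2 = new SppfNode("[0-9]", 2, 3);
const d3 = new SppfNode("[0-9]", 4, 5);
const left = new SppfNode("num", 0, 3);
left.addAlternative([d1, d2]);
const right = new SppfNode("num", 2, 5);
right.addAlternative([d2, d3]);
const root = new SppfNode("num", 0, 5);
root.addAlternative([left, d3]);  // (1+2)+3
root.addAlternative([d1, right]); // 1+(2+3)
console.log(root.countTrees()); // 2
```

The real construction interleaves node creation with the Earley completer steps; this sketch only shows the sharing idea.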

@kach (Owner) commented Apr 14, 2015

Hmm, @patrickhuber, that looks pretty interesting. From what I can tell, this should allow us to create some sort of lazy iterable data structure where each iteration results in one valid parsing. The obvious advantage is that for very ambiguous grammars with exponential parses, it would be more memory-efficient. On the other hand, building an SPPF takes time, and so there's a speed tradeoff. This sounds like a fun experiment.

(Again, my first choice is that people just learn to write unambiguous grammars—in 99.9% of cases, it's very obvious how to write your grammar unambiguously and you don't actually want to enumerate all possible parsings anyway.)
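The "lazy iterable" idea above could look roughly like this (a toy sketch over plain objects, not nearley's actual forest representation): a generator yields one concrete parse tree per iteration, so a caller wanting only the first parse never pays for the rest.

```javascript
// Lazily enumerate concrete trees from a packed forest. Terminals are
// strings; internal nodes are { label, alts } where each alt is a child list.
function* enumerate(node) {
  if (typeof node === "string") { // terminal leaf
    yield node;
    return;
  }
  for (const children of node.alts) { // each packed alternative
    yield* combine(node.label, children, 0, []);
  }
}

// Cross-product of the children's own enumerations, built recursively.
function* combine(label, children, i, acc) {
  if (i === children.length) {
    yield [label, ...acc];
    return;
  }
  for (const sub of enumerate(children[i])) {
    yield* combine(label, children, i + 1, [...acc, sub]);
  }
}

// Toy forest for "1+2" with two (artificial) derivations.
const forest = {
  label: "num",
  alts: [
    ["1", "+", "2"],                             // flat derivation
    [{ label: "num", alts: [["1"]] }, "+", "2"], // nested derivation
  ],
};

const it = enumerate(forest);
console.log(it.next().value); // first parse only; the rest stay unbuilt
```

This is exactly the memory/speed tradeoff mentioned above: the forest must be built first, but enumeration afterwards is incremental.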

@patrickhuber

Near the bottom of the paper she gives an inline representation that runs in the same time complexity as the Earley algorithm. Jeffrey Kegler stated in the Marpa forums that building the parse tree out of binary, reusable nodes has the same complexity as recognition. On trivial samples you may notice a very small bit of overhead, but as the size of the parse grows, the overhead becomes inconsequential.

I agree that the easy option is to rewrite the grammar, but part of the appeal of the Earley algorithm is its ability to handle ambiguity with ease.

@kach (Owner) commented Apr 14, 2015

Whoo, okay, I'm sold. Time to dig into some papers and find someone smart enough to teach me how to implement it (@JacobEdelman ? @blob8108 ? @rwindelz ?).

@YafahEdelman (Contributor, Author)

I'll look it over later today/tomorrow.

@patrickhuber

If it helps, I have an implementation in C#: https://github.com/patrickhuber/Pliant/tree/shared-packed-parse-forest/libraries/Pliant. It doesn't currently work for Leo items. The main body of the code is in PulseRecognizer.cs.

@kach (Owner) commented Apr 15, 2015

Ah, nice to know. This will be a handy reference (the collection of links in your README is also very useful!).

(We haven't implemented the Leo optimization yet, so that's not an issue. I should really go read about it soon.)

@kach kach changed the title Limiting Parsings for Ambiguous Grammars Limiting Parsings for Ambiguous Grammars (Bocages) Apr 15, 2015
@erezsh commented Jun 13, 2016

Personally, I would be very happy if there were some boolean parameter I could give the parser (say, disable_ambiguity) that would throw an exception upon detecting ambiguous input.

Obviously we shouldn't write ambiguous grammars if we don't want them, but we also shouldn't write bugs in general, and that still happens sometimes. And some grammars can be pretty big. It'd be nice to have a graceful exit instead of a hang or a memory error.

@kach (Owner) commented Jun 13, 2016

See, the problem with this is that since nearley is streaming, you never know if the input is "ambiguous" because the ambiguity might be resolved later. Let me give you an example:

main -> cow_or_colorado | cone_or_colorado
cow_or_colorado -> "CO" "W":?
cone_or_colorado -> "CO" "NE":?

Suppose I gave this parser "C" and "O". Now, it thinks you could either have a cow_or_colorado or a cone_or_colorado—but it can't throw an error yet! This is because if you feed the parser "W" or "NE", the ambiguity will get resolved in favor of the former or latter, respectively. Nearley never knows when you're done giving it input.

This means, by the way, that you could always detect ambiguity on your end by looking at the results array and ensuring that its length is exactly 1.
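That check can be wrapped in a few lines. A minimal sketch: it operates on a plain results array (standing in for nearley's parser.results after the full input has been fed), and the helper name is hypothetical, not part of nearley's API.

```javascript
// Caller-side ambiguity check: nearley's results array holds one entry
// per valid parse of the complete input, so length 1 means unambiguous.
function assertUnambiguous(results) {
  if (results.length === 0) {
    throw new Error("input did not parse");
  }
  if (results.length > 1) {
    throw new Error(`ambiguous input: ${results.length} parses`);
  }
  return results[0]; // the unique parse
}

console.log(assertUnambiguous([{ value: 42 }])); // the single parse
```

Note this only works once you know the input is complete; as explained above, mid-stream the parser cannot tell whether an apparent ambiguity will be resolved by later input.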

@erezsh commented Jun 22, 2016

Well, the reason I ask is that I wrote a Python version of nearley (just the parser) for my parsing library, and I want to be able to protect my users from writing ambiguous grammars whose parses practically never finish. Right now it's very hard to discover exactly which rule is ambiguous.

You can see it here, btw: https://github.com/erezsh/plyplus/blob/master/plyplus/pearley.py
It works very well, but unfortunately it's still very slow (about 10x the time of PLY).

@tjvr (Collaborator) commented Jun 22, 2016

@erezsh On the subject of optimisation: I wrote a Python version a while ago; it might have some useful optimisations.

@kach (Owner) commented Jun 22, 2016

@erezsh pearley looks really cool! Very exciting.

I notice most of the code seems to be a direct port of nearley's—in this case, I think I can answer your comment on line 67 (https://github.com/erezsh/plyplus/blob/master/plyplus/pearley.py#L67). Let me know if you'd like that.

Again, as I've said, you can't detect ambiguity automatically. Not possible. The solution is to inspect the output with small test cases—then it becomes really easy to see, intuitively, what is getting parsed in multiple ways. Perhaps you can figure out a clever way to automate this process—if you do, let me know and I would be happy to include it in the nearley package. :-)

@wizzard0 commented Feb 24, 2017

My 5 cents: being able to enumerate an exponential number of parses is very important, and is what makes nearley, um, the parser of choice (for NLP and for generated grammars, at least).
Of course it makes things slower at times... but hey, universality is almost always at odds with performance.

btw, the grammar num -> num "+" num | [0-9]:+ does not in fact generate ambiguous parses, in the playground at least. How can I get my 2^n parses there? :)

@kach (Owner) commented Feb 24, 2017

It's certainly supposed to show all parses… I think the author is using an outdated version of nearley, a version which had a bug that suppressed ambiguous output in some cases (including your case). Not sure, though, because the client JS is packed.

@kach (Owner) commented Aug 24, 2017

Closing because #163 can continue this discussion.

@kach kach closed this as completed Aug 24, 2017