
Wrong(?) result on some grammars with infinite results #35

Open · acertain opened this issue Jan 7, 2018 · 10 comments

@acertain

acertain commented Jan 7, 2018

grammar = mdo
  r <- rule $ (pure 1 <|> ((+1) <$> r))
  pure r

produces just 1, not [1,2,3...]
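For reference, here it is as a complete program (the type annotations are just my reading of the Text.Earley API, so the exact signatures may differ by version):

{-# LANGUAGE RecursiveDo #-}
import Control.Applicative ((<|>))
import Text.Earley

grammar :: Grammar r (Prod r () Char Int)
grammar = mdo
  r <- rule $ pure 1 <|> ((+1) <$> r)
  pure r

main :: IO ()
main = print $ fullParses (parser grammar) ""
-- prints a singleton result list, ([1], ...), rather than [1,2,3,...]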

I don't think it's possible to have this work and

grammar = mdo
  r <- rule $ (pure 1 <|> r)
  pure r

not produce an infinite list of results

I think this is a sensible choice because it currently can't deal with infinite results, but it should be documented somewhere. Theoretically it should be possible to return an infinite lazy list.

@acertain acertain changed the title Wrong result on some grammars with infinite results Wrong(?) result on some grammars with infinite results Jan 7, 2018
@ollef
Owner

ollef commented Jan 8, 2018

Hey, and thanks for your report!

You're right: unless the current behaviour is documented, this should probably be considered a bug.

The reason for the current behaviour is that the epsilon derivations of a rule r (those that don't consume any input) are computed assuming that r has no epsilon derivations. That is why the part using r recursively in your grammars doesn't contribute to the result.

The first idea that might spring to mind then is to tie the knot when computing epsilon derivations and assume that r's epsilon derivations are the result of the computation while computing the epsilon derivations. The problem with this approach is that it makes left-recursive grammars loop.
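Concretely, tying the knot amounts to defining the epsilon results as a lazy, self-referential list. A toy sketch (not the library's actual representation):

-- epsilon results of  r <- rule $ pure 1 <|> ((+1) <$> r)
-- with the knot tied: the list is defined in terms of itself
epsR :: [Int]
epsR = [1] ++ map (+1) epsR    -- [1,2,3,...]

-- the same trick on a left-recursive rule,  r <- rule $ r <|> pure 1:
-- the head of the list now depends on itself, so forcing it loops
epsLoop :: [Int]
epsLoop = epsLoop ++ [1]       -- <<loop>>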

So instead I'm experimenting with an idea of productivity: while traversing the production looking for epsilon derivations, keep track of whether we've produced any results yet. When we encounter r again after we've already produced some results, it's safe to tie the knot.

A problem with this is that it's potentially order-dependent:

r <- rule $ pure 1 <|> r

would return [1, 1, 1, ...] because r comes after pure 1, while

r <- rule $ r <|> pure 1

would return [1] because we haven't produced anything yet when encountering r.

Maybe we could reorder the alternatives to make this one also return an infinite list, but I'm worried that might not work in general, because you don't know what order to use if there are several productive alternatives.

Do you have any input or ideas? Do you have any more good test cases or important use cases?

Also, what do you mean by "I don't think it's possible to have this work and ... not produce an infinite list of results"? If the first one should return [1, 2, 3, ...], shouldn't the second return [1, 1, 1, ...]?

Cheers!

@acertain
Author

acertain commented Jan 13, 2018

If the first one should return [1, 2, 3, ...], shouldn't the second return [1, 1, 1, ...]?

That's what I mean

It should be possible to expose multiple variants of rule:

  1. use a HashMap or StableNames etc to dedup results
  2. return infinite results
  3. the current behavior
  4. just loop
  5. something else??

I'm working on a rewrite of the parser in which all variants except 2 should be easy to implement. Any ideas for other variants?

My rewrite doesn't compute epsilon derivations or interpret Prods and instead has instances directly on

newtype Parser s e i a = Parser {
  unParser ::
    ParseEnv s e i ->
    (a -> ParseEnv s e i -> ST s (ParseEnv s e i)) ->
    ST s (ParseEnv s e i)
}

including Monad and a fix :: (Parser s e i a -> Parser s e i a) -> Parser s e i a function
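Bind for this shape of parser looks roughly like this (a sketch against the newtype above, not necessarily the rewrite's actual code):

instance Functor (Parser s e i) where
  fmap f (Parser p) = Parser $ \env k -> p env (k . f)

instance Applicative (Parser s e i) where
  pure a = Parser $ \env k -> k a env
  pf <*> pa = pf >>= \f -> fmap f pa

instance Monad (Parser s e i) where
  -- run p; for each result a, run (f a) with the updated
  -- environment and the original continuation
  Parser p >>= f = Parser $ \env k ->
    p env $ \a env' -> unParser (f a) env' k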

So far, computing epsilon derivations has been avoidable but I think it might be unavoidable for infinite results

@ollef
Owner

ollef commented Jan 14, 2018

Hmm, I'm not too fond of the idea of exposing several variants of rule, unless I can be convinced that there are important use cases for such variants. I think I'd prefer this library just to do the Right Thing, provided we can figure out what that should be. Of course, that doesn't stop you from doing these things in your projects. :)

I think I've come up with something better than what I outlined in my previous post. What do you think of the following?

  1. Compute epsilon derivations for r assuming that r's epsilons are [] (i.e. the current behaviour).
  2. If step 1 produces any results, compute the epsilon derivations again but tie the knot.

This scheme supports left recursion without looping provided the rule doesn't have any epsilon derivations, and additionally makes both of your grammars produce infinite results. It makes r <- rule $ r <|> pure 1 loop if you try to force the results, but you can still get a report, which I think is reasonable.
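In pseudo-Haskell the scheme looks something like this, where computeEps is a made-up stand-in for the real epsilon computation, parameterised by the assumed epsilon results of r:

-- step 1: compute r's epsilon results assuming they are [];
-- step 2: if that produced anything, recompute with the knot tied
epsilons :: ([a] -> [a]) -> [a]
epsilons computeEps =
  case computeEps [] of
    [] -> []
    _  -> let es = computeEps es in es

-- for  r <- rule $ pure 1 <|> ((+1) <$> r):
--   computeEps assumed = [1] ++ map (+1) assumed
--   epsilons computeEps == [1,2,3,...]
-- for  r <- rule $ r <|> pure 1:
--   computeEps assumed = assumed ++ [1]
--   step 1 gives [1], so step 2 ties the knot and loops when forced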

So to me it seems promising --- I just have to convince myself that this is the right thing to do in general. I have an implementation here, and the commit before has an implementation of the scheme from my previous message.

Your rewrite sounds intriguing. I'd be interested to hear how you're handling things like left-recursion in it! :)

@acertain
Author

acertain commented Aug 20, 2018

A wip version of my rewrite is at https://gist.github.com/fread2281/256e47aff8903d7da98d9ea6b4cff63f. A couple things:

  • Currently it doesn't compute epsilons and instead builds up a Results structure differently. I'd like to compute epsilons and expose a bindResults :: Parser s e i a -> ([a] -> Parser s e i b) -> Parser s e i b so the user can capture ambiguity errors in subexpressions and give nicer error messages (e.g. a language with infix operators that prints the possible parses when an operator expression is ambiguous; see the sketch after this list). I think epsilons should be possible to do final-encoded, but I haven't worked it out yet.
  • It provides instances directly instead of building and interpreting a GADT. GHC should be able to optimize it more, but it's a different interface.
  • I haven't yet looked at the order in which it returns results
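E.g. something like this, where everything except bindResults itself is made up for illustration:

-- hypothetical: disambiguate an operator expression, reporting
-- all candidate parses when there is more than one
opExpr :: Parser s String Char Expr
opExpr = bindResults opExprAmbiguous $ \parses ->
  case parses of
    [e] -> pure e
    es  -> parseError ("ambiguous operator expression: " ++ show es)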

@acertain
Author

acertain commented Dec 1, 2018

Here's a problem(?) with my implementation that I'm trying to figure out how to fix. I think Earley has the same problem? If not, it'd be very helpful if you could explain how Earley avoids it. I've benchmarked my implementation against Earley, and in the prof files they have the same order of magnitude of function calls (for the most commonly called functions; it's still within O(n^3), I think), so I suspect Earley has this problem too.

Results for the same rule with different start positions but the same end position can be merged! (If a calls b at positions i and j, and both i-k and j-k return results, we only need to return once, with the merged results.) However, with fully binarized/memoized grammars, ignoring alts, we can assume that a = b <*> c, where c must be a rule, which does that merging for us.

But this still adds too many callbacks to c (if b returns at i-k and j-k, then c gets two instances of a in its conts), and callbacks in c cost per position where c succeeds starting from k.

Practical, General Parser Combinators (Meerkat) avoids this by using a HashSet of Conts in rules.

However, this might not cause worst case complexity to go over cubic with fully binarized grammars. According to the meerkat paper,

In Appendix A.3 we show that the execution of memoized CPS recognizers of Figure 2 and 6 can require O(n^(m+1)) operations, where m is the length of the longest rule in the grammar. The reason for such unbounded polynomial behaviour is that the same continuation can be called multiple times at the same input position. As illustrated in Appendix A.3, this happens when the same continuation is added to different continuation lists that are associated with calls made to the same recognizer but at different input positions. If the recognizer produces the same input position starting from different input positions, duplicate calls are made. The duplicate calls further result in adding the same continuations to the same continuation lists multiple times.

I think it still affects non-worst case complexity though :(

I don't know if this is avoidable without using StableNames or similar to compare Conts
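Roughly what I have in mind (a sketch; addCont and the two refs are hypothetical, not Meerkat's or Earley's actual code):

import Data.IORef
import qualified Data.HashSet as HS
import System.Mem.StableName (StableName, makeStableName)
import Unsafe.Coerce (unsafeCoerce)

type Cont a = a -> IO ()

-- register k in a rule's continuation list unless an identical
-- continuation (by StableName identity) has been seen before
addCont :: IORef (HS.HashSet (StableName ()))
        -> IORef [Cont a] -> Cont a -> IO ()
addCont seenRef contsRef k = do
  sn0 <- makeStableName k
  let sn = unsafeCoerce sn0 :: StableName ()  -- erase the type for the set
  seen <- readIORef seenRef
  if HS.member sn seen
    then pure ()                              -- duplicate: drop it
    else do
      modifyIORef' seenRef (HS.insert sn)
      modifyIORef' contsRef (k :)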

@ollef
Owner

ollef commented Dec 2, 2018

If I understand you correctly, the question is if we can merge the continuations of invocations of the same rule that start at different positions and end at the same position. Is this what you're asking?

This library doesn't do that, but it doesn't seem to me to be a sound optimisation. Maybe I misunderstood what you meant, though.

@acertain
Author

acertain commented Dec 2, 2018

There are two ways to deal with the issue in the example:

  1. Merge the continuations of invocations of the same rule that start at different positions and end at the same position and come from the same position in their parent rule: this requires extra bookkeeping per rule, and some way to compare where in the grammar conts return to.
  2. Merge continuations of a rule that return to the same position in the grammar & start from the same position: this still needs some way to compare where in the grammar conts return to.

In the example, 2 would be merging the continuations after they get added to c, so when c returns results in the future it costs less, while 1 would be merging the continuations as b returns (before c gets called).

The Meerkat paper does 2, but I think 1 would work too.

I think they should have the same asymptotics, except with a Monad instance, where 1 can do better.

@ChristopherKing42

Maybe you could tie the knot, but then do a breadth-first search instead of a depth-first search. That way, both r <- rule $ (pure 1 <|> ((+1) <$> r)) and r <- rule $ (((+1) <$> r) <|> pure 1) would return infinite lists. In fact, it would work sensibly with any binary tree with infinitely many leaves (although a binary tree with infinitely many nodes but only finitely many leaves would diverge after returning all of its leaves).
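For example, with Alt as a stand-in for however the library represents alternatives internally:

{-# LANGUAGE DeriveFunctor #-}

data Alt a = Leaf a | Branch (Alt a) (Alt a)
  deriving Functor

-- breadth-first enumeration of the leaves: every leaf at a finite
-- depth is eventually produced, even when the tree is infinite
-- (the O(n) queue append is for clarity, not speed)
bfs :: Alt a -> [a]
bfs t = go [t]
  where
    go []               = []
    go (Leaf a : q)     = a : go q
    go (Branch l r : q) = go (q ++ [l, r])

-- r <- rule $ pure 1 <|> ((+1) <$> r):
r1 :: Alt Int
r1 = Branch (Leaf 1) (fmap (+1) r1)
-- take 5 (bfs r1) == [1,2,3,4,5]

-- the flipped order works too:  r <- rule $ ((+1) <$> r) <|> pure 1
r2 :: Alt Int
r2 = Branch (fmap (+1) r2) (Leaf 1)
-- take 5 (bfs r2) == [1,2,3,4,5]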

@ollef
Owner

ollef commented Aug 28, 2019

Yeah, that might work actually!

@ChristopherKing42

Also, an idea on how to do that efficiently. First you create a value representing a lazy binary tree, and then you search that binary tree using iterative deepening search. Then, if the compiler fuses the binary tree, it would be equivalent to an imperative iterative deepening search. If it chooses to keep the binary tree in memory instead, it would be equivalent to an imperative breadth-first search (except that you are scanning the tree itself in an iterative deepening manner). An optimizing compiler (like GHC) can choose which one is better based on the cost of recalculating the values in the tree, I think.
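As a sketch (Alt as in the previous comment's sketch; a real version would stop deepening once a pass visits no internal nodes, so that finite trees terminate):

data Alt a = Leaf a | Branch (Alt a) (Alt a)

-- pass d enumerates the leaves at exactly depth d; concatenating the
-- passes visits leaves in breadth order without keeping a queue, at
-- the cost of re-walking the top of the tree on every pass
idLeaves :: Alt a -> [a]
idLeaves t = concatMap (\d -> atDepth d t) [0 ..]
  where
    atDepth 0 (Leaf a)     = [a]
    atDepth _ (Leaf _)     = []
    atDepth 0 (Branch _ _) = []
    atDepth d (Branch l r) = atDepth (d - 1) l ++ atDepth (d - 1) r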
