Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

out-of-order results #15

Open
sboosali opened this issue Nov 6, 2015 · 29 comments
Open

out-of-order results #15

sboosali opened this issue Nov 6, 2015 · 29 comments

Comments

@sboosali
Copy link
Collaborator

sboosali commented Nov 6, 2015

below, edit has nullable rules, and insertion has a wildcard. the parse results are all present, but some alternatives that appear before other alternatives ("lexically" in the grammar), appear later in the results.

e.g. given (full script is below):

          Editing <$> edit
    <|> Insertion <$> insertion

we get:

["whole","line"]
Insertion ["whole","line"]
Editing (Edit Copy Whole Line)

rather than:

["whole","line"]
Editing (Edit Copy Whole Line)
Insertion ["whole","line"]

I don't know if early parsers specify the order of the results for an ambiguous grammar, but it's how other parser libraries behave. and either way, would be nice to have: my hack is to postprocess the set of results, and rank them. this is messier (as ranking can be "nonlocal"), and maybe slower, than the first result being the expected one.

to reproduce:

import Control.Applicative
import Text.Earley

data Edit   = Edit Action Slice Region     deriving (Show)
data Action = Copy | Delete | Cut          deriving (Show)
data Slice  = Whole | Backwards | Forwards deriving (Show)
data Region = Char | Word | Line           deriving (Show)

type Text = String

edit :: Grammar r (Prod r String Text Edit) 
edit = do

 action <- rule $ "action"
   <=> Copy        <$ symbol "copy"
   <|> Delete      <$ symbol "delete"
   <|> Cut         <$ symbol "cut"

 slice <- rule $ "slice"
   <=> Whole     <$ symbol "whole"
   <|> Backwards <$ symbol "backwards"
   <|> Forwards  <$ symbol "forwards"

 region <- rule $ "region"
   <=> Char <$ symbol "char"
   <|> Word <$ symbol "word"
   <|> Line <$ symbol "line"

 edit' <- rule $ "edit"
   <=> Edit <$> action            <*> (slice -?- Whole) <*> (region -?- Line)
   <|> Edit <$> (action -?- Copy) <*> slice             <*> region
   -- <=> Edit <$> action <*> slice <*> region

 return edit'

data Command
 = Editing Edit
 | Insertion [Text]
 deriving (Show)

command :: Grammar r (Prod r String Text Command) 
command = do

  edit' <- edit

  insertion' <- rule $ some anyWord

  command' <- rule $ "command"
    <=> Editing <$> edit'
    <|> Insertion <$> insertion'

  return command'

anyWord :: Prod z String Text Text
anyWord = satisfy (const True) 

parseCommand :: [Text] -> ([Command], Report String [Text])
parseCommand s = fullParses (parser command) s 

(<=>) :: e -> Prod r e t a -> Prod r e t a 
(<=>) = flip (<?>)
infix 2 <=> 

(-?-) :: Prod r e t a -> a -> Prod r e t a 
(-?-) p x = maybe x id <$> optional p 

printParses :: (Show a, Show e, Show ts) => (ts -> ([a], Report e ts)) -> ts -> IO ()
printParses theParser theSentence = do 
 let (theResults, theReport) = theParser theSentence 
 putStrLn"" 
 print theSentence 
 _ <- traverse print theResults 
 print theReport 

main :: IO () 
main = do

 putStrLn"" 
 printParses parseCommand (words "line")
 printParses parseCommand (words "copy")
 printParses parseCommand (words "copy whole")
 printParses parseCommand (words "whole line")
 printParses parseCommand (words "copy whole line")
 printParses parseCommand (words "not an edit")

{- outputs: 

["line"]
Insertion ["line"]
Report {position = 1, expected = [], unconsumed = []}

["copy"]
Editing (Edit Copy Whole Line)
Insertion ["copy"]
Report {position = 1, expected = ["region","slice"], unconsumed = []}

["copy","whole"]
Insertion ["copy","whole"]
Editing (Edit Copy Whole Line)
Report {position = 2, expected = ["region"], unconsumed = []}

["whole","line"]
Insertion ["whole","line"]
Editing (Edit Copy Whole Line)
Report {position = 2, expected = [], unconsumed = []}

["copy","whole","line"]
Editing (Edit Copy Whole Line)
Editing (Edit Copy Whole Line)
Insertion ["copy","whole","line"]
Report {position = 3, expected = [], unconsumed = []}

["not","an","edit"]
Insertion ["not","an","edit"]
Report {position = 3, expected = [], unconsumed = []}

-} 
@ollef
Copy link
Owner

ollef commented Nov 9, 2015

Agreed, this would be nice to have. I will think about it some more, but I don't think it can be done easily given the current implementation. Especially now that the fix of #14 has been pushed, since that means we're sort of partitioning rules into nullable and non-nullable parts. We would have to undo that partitioning after the fact.

@ollef
Copy link
Owner

ollef commented Nov 10, 2015

If we're not able to do this we should update the documentation to make it clear that there are no guarantees about the order of the results.

@sboosali
Copy link
Collaborator Author

I was thinking about taking a look at this today (though I doubt I'll make any progress).

it seems that (in parse) the messiness of the result order comes from how the states are taken off the list of states, not just the order they are put on it. since this involves the following references, I still haven't wrapped my head around how things run.

(I know this is vague, but) would it be better to (1) actually compute results in the correct order, or (2) manually thread an alternative's "list of indices" from its position in an Alter (and that Alter's position, et cetera), and then re-create the correct order?

(I've been working off of 0.10.0.1)

@ollef
Copy link
Owner

ollef commented Nov 18, 2015

I would prefer (1). Setting aside the complication that is the partitioning into nullables and non-nullables, I think you're on the right track. Continuations are added to a list at various points in time and later consumed in the order they appear in the list which means that they're processed last first. They're in stacks but they should have been in queues. The list of states for the next position has the same problem. A few well-placed reverses might do the trick as a first experiment --- at least for grammars without nullable rules.

@sboosali
Copy link
Collaborator Author

sboosali commented Mar 6, 2016

i tried blindly reordering the state list a while back ( master...sboosali:master ) but couldn't get it to work.

if we want this, (having learned more about chart parsing) i think we have to ensure that the expansion of the nonterminals (earley's "predict" rule) is performed in a top-down and left-to-right way. i think it's top-down already, but not left-to-right. same with terminals (earley's "scan" rule). i think predict must be applied before scan, but not sure yet.

as you say, keeping separate sets of states complicates this; and, i cant wrap my head around how this works with the "build all the edges, then advance a token" approach of this library. on paper, it seemed like you might need to keep track of earlier states to "pop" back to them, but i could be wrong.

another benefit of the depth-first-search approach / in-order results for ambiguous grammars is: we can return the first result long before (unless there's enough sharing, in which case it doesn't matter) the rest of the results, and this first result is likely to be what the user expects.

(sorry for the possibly incoherent braindump, wanted to get this down before i forgot)

@acertain
Copy link

acertain commented Oct 3, 2016

This isn't possible in every case, consider

grammar = mdo
  r <- rule $ fmap f r <|> pure a
  return r

The first result would be let res = f res in res which is bad, I would expect to get an infinite list let l = a:fmap f l

@ollef
Copy link
Owner

ollef commented Oct 3, 2016

I think the ordering that we're (somewhat implicitly) talking about here is what we might call "left-to-right grammar order" (basically that the results from parsing p come before those of q in p <|> q). Using that ordering I think what you call bad is the correct result.

Of course, there may be other orderings that would sometimes be convenient. It looks like you'd like to order the results by recursion depth. I don't quite understand what you're referring to by calling the grammar order bad or impossible though.

I did do some experiments with ordered results a while back, and I think I have a working implementation, but it affected performance in a very bad way because I couldn't find a way to do it without a bunch of list reversals.

@sboosali
Copy link
Collaborator Author

sboosali commented Oct 3, 2016

a working implementation

Olle, I want that so much! Can you share it?

I don't care if it's slow or buggy because:

  1. The strings I parse are short, take far less than a millisecond to do so, and even 100 ms would be acceptable.
  2. I've spent a few hours a week for the past few months researching chart parsers, and they're very tricky, especially with typed representations. I don't want to have to write my own, when you clearly know what you're doing (w.r.t. correctness / efficiency / safety). But I have been because of the following.
  3. I code by voice a lot. I have a type (like your Prod) that, for consistency, both (a) is serialized into a grammar that can be loaded by a speech recognition engine and (b) induces a parser on that same grammar, which parses recognitions from that speech engine. The grammar is extremely ambiguous, and my hack is to postprocess parse results, which are returned completely unordered (like breadth-first-ish), even though the "ranking" is boilerplate and exactly left-to-right. So even if your implementation is buggy, I can continue to postprocess the results until I debug it. Not needing to manage these "rankings" whenever I add a voice command or refactor an old one would make my life a lot easier. No exaggeration.

https://github.com/sboosali/commands-spiros/blob/8d2f127ec7fd9f0c596b35c596547adf1766ca86/config/Commands/Plugins/Spiros/Phrase/Run.hs#L65

@sboosali
Copy link
Collaborator Author

sboosali commented Oct 3, 2016

fread2281,

Yes, that's exactly the behavior I want. (but for less-contrived grammars, whose recursive productions consume input).

@acertain
Copy link

acertain commented Oct 3, 2016

The reason I call that behavior bad is because it makes the parser library non-total. I want to use it in agda and my toy agda-like language, so that's a problem. Total Parser Combinators forbids that grammar with the type system, I think it's possible to do similar in haskell.

@sboosali
Copy link
Collaborator Author

sboosali commented Oct 3, 2016

Only as an option. The fact that Earley parses any CFG is also why I use it
:-)

On Tue, Oct 4, 2016 at 1:31 AM, Valerie Charbonneau <
notifications@github.com> wrote:

The reason I call that behavior "bad" is because it makes the parser
library non-total, which makes me sad because I want to use it in agda and
my toy agda-like language. Total Parser Combinators forbids that grammar
with the type system, I think it's possible to do similar in haskell.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#15 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/ACNoMVvfO0IJXY7-W3T-jKg-A8yD6RHNks5qwYIwgaJpZM4GdrlE
.

(this message was composed with dictation: charitably interpret typos)Sam
Boosalis

@ollef
Copy link
Owner

ollef commented Oct 4, 2016

@sboosali: I pushed it here. Feel free to play around with it. There may be bugs.

@fread2281: Regarding totality: It's still up to the user of the library to force the result, just like it's up to the user to force the whole result list in case it's infinite. The parser library, if written correctly, should still be terminating. Basically, if I feed in a grammar that has infinite/circular solutions, I would expect a correct parser library to return those solutions. So I don't see this as bad behaviour, but I can see that it's not always convenient. :) The situation in Agda is different since it doesn't allow normal data to be circular.

@sboosali
Copy link
Collaborator Author

sboosali commented Oct 4, 2016

Thanks!!

On Tue, Oct 4, 2016 at 12:17 PM, Olle Fredriksson notifications@github.com
wrote:

@sboosali https://github.com/sboosali: I pushed it here
https://github.com/ollef/Earley/tree/OrderedResults. Feel free to play
around with it. There may be bugs.

@fread2281 https://github.com/fread2281: Regarding totality: It's still
up to the user of the library to force the result, just like it's up to the
user to force the whole result list in case it's infinite. The parser
library, if written correctly, should still be terminating. Basically, if I
feed in a grammar that has infinite/circular solutions, I would expect a
correct parser library to return those solutions. So I don't see this as
bad behaviour, but I can see that it's not always convenient. :) The
situation in Agda is different since it doesn't allow normal data to be
circular.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#15 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/ACNoMUtcPcxNB3ho5DxyHbgUUJ1cYCQhks5qwhnHgaJpZM4GdrlE
.

(this message was composed with dictation: charitably interpret typos)Sam
Boosalis

@ollef
Copy link
Owner

ollef commented Oct 7, 2016

Just found out from Guillaum on #haskell that the tests on that branch are failing, so it might need some work still. :/

@sboosali
Copy link
Collaborator Author

sboosali commented Oct 7, 2016

Yeah, my own tests failed too.

You said that epsilons were expanded separately, right?

On Friday, October 7, 2016, Olle Fredriksson notifications@github.com
wrote:

Just found out from Guillaum on #haskell that the tests on that branch are
failing, so it might need some work still. :/


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#15 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/ACNoMZxNol9QfeWPo6ld3HSvQ8h8SHCIks5qxnS5gaJpZM4GdrlE
.

(this message was composed with dictation: charitably interpret typos)Sam
Boosalis

@ollef
Copy link
Owner

ollef commented Oct 9, 2016

Yeah, epsilons are calculated separately to avoid loops in the more general parsing machinery. That's what the ruleNulls field contains.

@ollef
Copy link
Owner

ollef commented Oct 9, 2016

Does this branch work for you?

@sboosali
Copy link
Collaborator Author

sboosali commented Oct 9, 2016

for
https://github.com/sboosali/Earley/blob/master/examples/OrderedResults.hs
it does!

On Sun, Oct 9, 2016 at 3:43 PM, Olle Fredriksson notifications@github.com
wrote:

Does this branch https://github.com/ollef/Earley/tree/OrderedResults2
work for you?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#15 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/ACNoMeO4qaUYEgRW06L1QGaIEOdVO95Oks5qyUPrgaJpZM4GdrlE
.

(this message was composed with dictation: charitably interpret typos)Sam
Boosalis

@sboosali
Copy link
Collaborator Author

sboosali commented Oct 9, 2016

I'll run it on my (more complicated/ambiguous) dictation grammars too, but
I've been having multiple linker errors on Windows

On Sun, Oct 9, 2016 at 3:59 PM, Spiros Boosalis samboosalis@gmail.com
wrote:

for https://github.com/sboosali/Earley/blob/master/examples/Orde
redResults.hs it does!

On Sun, Oct 9, 2016 at 3:43 PM, Olle Fredriksson <notifications@github.com

wrote:

Does this branch https://github.com/ollef/Earley/tree/OrderedResults2
work for you?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#15 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/ACNoMeO4qaUYEgRW06L1QGaIEOdVO95Oks5qyUPrgaJpZM4GdrlE
.

(this message was composed with dictation: charitably interpret typos)Sam
Boosalis

(this message was composed with dictation: charitably interpret typos)Sam
Boosalis

@sboosali
Copy link
Collaborator Author

sboosali commented Oct 9, 2016

I see a few more swapped insertions, can you explain the changes? I'm looking at OrderedResults...OrderedResults2

@ollef
Copy link
Owner

ollef commented Oct 9, 2016

Nice to hear that it works. :)

Take how the list of next states is built up during processing the current list of states: Previously it was consed to, which means that the states stemming from the last state processed in the current list are first. Since alternatives are processed left-to-right that means that the next list will be right-to-left with respect to alternatives.

The list of continuations is similar: the processing order gets reversed if we just cons to it.

@sboosali
Copy link
Collaborator Author

sboosali commented Oct 9, 2016

Cool. I'd tried blindly reversing the order months ago, but that hadn't worked.

@guibou
Copy link

guibou commented Oct 9, 2016

Hello.

I did some tests and I don't know if I'm wrong in my understanding or if something is broken.

The following code:

data Test = T0 Test | T1 Test | T2 deriving (Show)

simpleParser = mdo
  draw <- E.rule $
    T0 <$> (sym "rot" *> draw)
    <|> T1 <$> (draw <* (sym "XX"))
    <|> T2 <$ sym "ident"

  let tok p   = (many $ E.satisfy isSpace) *> p
      sym x   = tok $ E.list x

  return draw

parse' = E.fullParses (E.parser simpleParser) "rot ident XX"

For me, it should returns as first results T0 (T1 T2), however it does not and returns T1 (T0 T2). However if I'm flipping the first two subrules of draw, such as:

  draw <- E.rule $
    T1 <$> (draw <* (sym "XX"))
    <|> T0 <$> (sym "rot" *> draw)
    <|> T2 <$ sym "ident"

I'm still getting T1 (T0 T2) which is the expected result for this parser, and which is the same result as the previous parser.

@guibou
Copy link

guibou commented Oct 9, 2016

I can add that, in this context, the result order is not influenced by any change in the ordering.

@sboosali
Copy link
Collaborator Author

sboosali commented Oct 9, 2016

Okay, I ported my actual grammar to Prod, and the results are still out-of-order:

["camel","start","date"]
Phrase [Dictated_ (Dictation ["camel","start","date"])]
Phrase [Joined_ CamelJoiner,Dictated_ (Dictation ["start","date"])]
Phrase [Dictated_ (Dictation ["camel"]),Dictated_ (Dictation ["start","date"])]
Phrase [Joined_ CamelJoiner,Dictated_ (Dictation ["start"]),Dictated_ (Dictation ["date"])]
Phrase [Dictated_ (Dictation ["camel"]),Dictated_ (Dictation ["start"]),Dictated_ (Dictation ["date"])]
Report {position = 3, expected = ["word_","phraseW","splitter","brackets","joiner","casing","separator","phraseA","dictation","phraseD","phraseC","phoneticAlphabet","pasted","phraseB"], unconsumed = []}

Phrase [Joined_ CamelJoiner,Dictated_ (Dictation ["start","date"])] should come first, since phraseA (a production with specific terminals) comes before phraseW / phraseD (the wildcard productions that match anything).

https://github.com/sboosali/Earley/blob/master/examples/OrderedExtremelyAmbiguous.hs#L170

I got to go eat, but I (1) still need to double check that my messy grammar is written in a "left to right" way, i.e. that the finite productions come to the left of the wildcard productions; and (2) will work on a minimal reproduction.

@ollef
Copy link
Owner

ollef commented Oct 10, 2016

@guibou: I think your expectations are correct, so it seems that recursive rules are not ordered properly yet.

@sboosali: Maybe you've run into the same problem as @guibou, but a minimal repro would help immensely just in case it's not.

Thanks to both of you!

I ran some measurements on the few benchmarks we have with an optimised version of the OrderedResults2 branch and it seems that it takes roughly 50% longer to parse in order.

@sboosali
Copy link
Collaborator Author

If it's relevant, my grammar is non-recursive (RecursiveDo was just for forwards declaring them), and the results are the same without it. (I pushed the new commit).

And, no thank you Olle!

@ollef
Copy link
Owner

ollef commented Oct 15, 2016

OK, here are the results of some more investigations on the branch:

I think there is a well-defined ordering that's being followed in the case of recursive rules, and I'm trying to pinpoint what it is. A problem is that the order of the results of an ambiguous grammar doesn't seem to be considered in the literature so there's no established nomenclature for it (like there is for e.g. leftmost/rightmost derivations etc). We seem to be favouring associating lexically to the left, and then the left-to-right grammar order. That explains @guibou's result where we have the following grammar:

r ::= 'pre' r | 'done' | r 'post'

When parsing the input pre done post the result that's associated (pre done) post comes first. The same thing happens with the following ambiguous expression grammar:

e ::= 'x' | e '+' e | e '*' e

Parsing the string x+x*x, the result that associates as (x+x)*x comes first regardless of the order of e's productions.

I don't currently see a way to change this ordering.

Note that it's possible to write unambiguous versions of both of the above grammars, i.e. something like:

r1 ::= 'pre' r1 | r2
r2 ::= r2 'post' | 'done'

and

e1 ::= e1 '+' e2 | e2
e2 ::= e2 '*' e3 | e3
e3 ::= 'x'

In my opinion this is a better option, at least when e.g. dealing with programming languages that generally have simple grammars. The exception is when you want the ambiguity or when you just have an exceptionally tricky grammar. Unambiguous grammars can be parsed faster.

Anyway, this ordering needs more investigation, tests, and documentation.

@ollef
Copy link
Owner

ollef commented Feb 8, 2017

I've added info about the order of results to the docs so this is no longer a bug. 😁

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants