Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternative formulation of Stream #72

Open
safareli opened this issue Jul 12, 2017 · 9 comments
Open

Alternative formulation of Stream #72

safareli opened this issue Jul 12, 2017 · 9 comments

Comments

@safareli
Copy link

safareli commented Jul 12, 2017

Here is what kind of Stream I came to independently, when I was playing with purescript-parsing:

class Stream f c | f -> c where
  uncons :: f -> Maybe { head :: c, tail :: f, updatePos :: Position -> Position }
  stripPrefix :: Prefix f -> f -> Maybe {  rest :: f, updatePos :: Position -> Position }

class HasUpdatePosition a where
  updatePos :: Position -> a -> Position

newtype Prefix a = Prefix a


-- example implementation for list
instance (Eq a, HasUpdatePosition a) => Stream (List.List a) a where
  uncons f = L.uncons f <#> \({ head, tail}) ->
    { head, tail, updatePos: (_ `updatePos` head)}
  stripPrefix (Prefix p) s = List.stripPrefix (List.Pattern p) s <#> \rest ->
    { rest, updatePos: unwrap (fold (p <#> (flip updatePos >>> Endo)))}

purescript-contrib/purescript-parsing#62

I think this formulation is nicer as you don't need to carry updatePosition function around, and I don't see why you need the m in uncons too.

Would like your thoughts on this formulation.

@ekmett
Copy link
Member

ekmett commented Jul 18, 2017

Folks use the m in uncons to allow "streams" that aren't fully in memory. E.g. Streaming from disk or pipes.

@safareli
Copy link
Author

Thanks @ekmett I have added the m to PureScript PR .

What you think on moving Position -> Position into Stream class?

@safareli
Copy link
Author

Also in mtl we have class Monad m => MonadState s m | m -> s where should the Stream also have m -> s dep?

@ekmett
Copy link
Member

ekmett commented Aug 1, 2017

Lots of people use different states that happen to run in the IO monad.

@ekmett
Copy link
Member

ekmett commented Aug 1, 2017

I'm not sure about the Position -> Position thing. To me it seems to entangle a couple of concerns, and complicates uncons. It seems you'd really want

Position -> f -> Maybe { head :: c, tail :: f, newPos :: Position }

which would allow for different parsing based on position, and avoids capturing a very expensive closure. An example where that might matter is if your lexer prepass did something like c preprocessing where it might need to know if it is at the start of file to know how to properly handle finding # directives. Regardless, building a function there means that this will never fully unbox into something that just runs on the stack no matter what instance you work with unlike the existing solution.

You're always paying full price.

@safareli
Copy link
Author

safareli commented Dec 3, 2017

This is what i came to:

class Stream s m t | s -> t where
  uncons :: ParserState s -> m (Maybe (Tuple t (ParserState s)))
  stripPrefix :: Prefix s -> ParserState s -> m (Maybe (ParserState s))

type ParserState s = { input :: s, pos :: Position }

@ekmett
Copy link
Member

ekmett commented Dec 3, 2017

Poking at this from the other side:

First uncons:

You've now converged almost back to the existing design. The major difference is that uncons doesn't get fed the parsec SourcePos, but rather just whatever s you ask for. Why? When it is a string, then the 's' can just be a tail of the string. When it is a file, you need something like a number of bytes consumed so far for seeking or a file handle and a pushback buffer. In neither case do you use the source position and we don't even look at the Char returned at present.

Here you're bolting in machinery to update position, but 'uncons' doesn't need to inspect the char it gives back, so there is no "double dipping" on work, there. Only the "update position" combinator passed to tokenPrim actually looks at the Char or whatever is given back by your stream.

So the common use-cases don't use the extra information you are supplying here to access the stream.

This consciously leaves open the notion of what is a 'column' to tokenPrim to fill in, enabling you to work with a more traditional notion of tabstops out of the box with the char combinators, etc. or plug in your own, so you can support, say, modelines, or choose to have every character (even tabs) be one 'column' wide, or using the number of utf-16 code units seen so far in the line so you can write a language server protocol implementation, etc.

At present Stream factors out the concern of updating position from getting data. In the scenario you propose wed couple them more tightly forcing users to use newtypes to split these cases back apart.

It feels to me that this part of your design couples the stream to the notion of columns in a really tight way that then require more code duplication (copying the update position function into stream types that previously lived agnostic of parsec's rather awful SourcePos type) and newtypes to tease apart, but doesn't actually pay out in a win in terms of avoiding duplicating any work at runtime.

Now stripPrefix:

Adding something like stripPrefix could be quite useful for performance. The major objections to overcome there I'd see is that currently Parsec doesn't make any use of type families and the aforementioned issues with entangling the update of position with the internals of the stream types in the API you provide.

Here things become more nebulous as at this point you are looking at the characters given back and so my argument about how uncons need not concern itself with what is fetched falls apart a bit. Interestingly there is a second option when evaluating stripPrefix-like combinator, which is to have whatever calls it take an argument like Prefix s -> SourcePos -> SourcePos so that some calculations can be shared across calls rather than having it come back as part of the state update. This way it can be lifted out by the compiler and computed once for a given prefix you might be trying to strip, e.g. a given keyword "foo" will always be 3 columns wide. For sharing, many practical implementations could represent the SourcePos -> SourcePos using something like a fairly simple

data Delta = Delta !Int !Int
instance Monoid Delta where
  mempty = Delta 0 0
  mappend (Delta a b) (Delta 0 d) = Delta a (b + d)
  mappend (Delta a _) (Delta c d) = Delta (a + c) d

Then convert Prefix s -> Delta to get a cacheable result, then apply the delta with Delta -> SourcePos -> SourcePos to get the Prefix s -> SourcePos -> SourcePos while carefully caching that intermediate delta in the environment.

With that in mind, it seems even stripPrefix benefits from going to something like the current approach and splitting position updates from reading from the stream.

@safareli
Copy link
Author

safareli commented Dec 4, 2017

Thanks for detailed response! I would need to think on this, meanwhile, where can I find definition of Delta?

@ekmett
Copy link
Member

ekmett commented Dec 7, 2017

I wrote it inline above as an example. It isn't used by parsec.

You can read Delta x y as 'move down x lines, and if x is > 1 the column is absolute, otherwise its relative.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants