Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix left context with UTF-8 input in bytestring wrappers #165

Closed
wants to merge 3 commits into from
Closed

Fix left context with UTF-8 input in bytestring wrappers #165

wants to merge 3 commits into from

Conversation

abt8601
Copy link

@abt8601 abt8601 commented Nov 17, 2020

Fixes #53

In the original implementation of the bytestring wrappers, alexGetByte maintains the last seen byte instead of the last seen character. This causes the left context to cease proper function. This patch introduces a fix of the issue.

Since I have to change the structure of the AlexInput type, this is a breaking change.

@abt8601 abt8601 changed the title Fix forward context with UTF-8 input in bytestring wrappers Fix left context with UTF-8 input in bytestring wrappers Nov 17, 2020
Two of the three (or four) return values are actually useless and are
removed.
@Ericson2314
Copy link
Collaborator

This does make me wonder, is alexGetByte even the right layer of abstraction? Might it be better to pop a whole character?

With this change, both the String (native pop char) and ByteString (native pop byte) are complex, and rightfully so. With alexGetChar, the String could become trivial, and the ByteString case basically becomes no worse, since, as you demonstrate, we already need to track what byte of the character we're at anyways.

What do you think?

@abt8601
Copy link
Author

abt8601 commented Jan 2, 2021

I think having alexGetChar makes things simpler even on ByteString, since we wouldn't even need to track which byte we're at of the current character in AlexInput.

I guess the reason for having alexGetByte is performance, since Alex internally uses UTF-8 encoded byte sequence, as stated in the documentation.

@Ericson2314
Copy link
Collaborator

I'll have to ponder more what "internally uses UTF-8" means. Maybe @alanz or @jyp remember something from writing 892688f a decade ago? :D

@alanz
Copy link

alanz commented Jan 3, 2021

It's a long time ago :)

IIRC, getByte accumulates input one Word8 at a time, and only cranks the state machine when it hits a character boundary. So basically it does the [Word8] -> Char` conversion. And because the commit talks about the NFA blowing up in size and needing to be minimized, I think this may be pushed right into the generated DFA too.

I am not sure if there was proper unicode support in Char at the time.

Either way, GHC parses from a StringBuffer, so I think it needs the unicode processing for a list of bytes.

My brain dump.

@Ericson2314
Copy link
Collaborator

Thanks!

I am not sure if there was proper unicode support in `Char at the time.

Ah, wonderful, this is just the thing I was hoping to hear. Yes I am getting more sure we should just be outlining the UTF-8 state machine at this point. We could even have support other encodings that way.

(It's wonderful how regular languages serially compose, I only wish someone would do the research so we can do the same with context free ones!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

basic-bytestring wrapper does not work with left contexts
3 participants