Fix left context with UTF-8 input in bytestring wrappers #165

abt8601 · 2020-11-17T06:57:30Z

Fixes #53

In the original implementation of the bytestring wrappers, alexGetByte records the last byte seen rather than the last character seen. As a result, left context does not work correctly once the input contains multi-byte UTF-8 characters. This patch fixes that.

Since I have to change the structure of the AlexInput type, this is a breaking change.
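To illustrate the idea (this is only a sketch, not the code in this patch): the input can carry a small decoder state so that the previous character is updated only once a whole UTF-8 sequence has been consumed. The names `Input`, `prevChar`, `pending`, `partial`, `rest`, `initInput`, `getByte` and `step` are all hypothetical, and the sketch assumes well-formed UTF-8.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical input type (not the wrapper's real AlexInput).
data Input = Input
  { prevChar :: Char          -- last fully decoded character (left context)
  , pending  :: Int           -- continuation bytes still expected
  , partial  :: Int           -- code point bits accumulated so far
  , rest     :: B.ByteString  -- unread input
  }

-- Start with a newline as the previous character and no partial state.
initInput :: B.ByteString -> Input
initInput = Input '\n' 0 0

-- Hand out one byte; prevChar changes only at a character boundary.
getByte :: Input -> Maybe (Word8, Input)
getByte inp = case B.uncons (rest inp) of
  Nothing      -> Nothing
  Just (b, bs) -> Just (b, step b inp { rest = bs })

-- Feed one byte into the decoder state. Assumes well-formed UTF-8.
step :: Word8 -> Input -> Input
step b inp
  | pending inp > 0 =                        -- continuation byte 10xxxxxx
      finish (pending inp - 1) ((partial inp `shiftL` 6) .|. (w .&. 0x3F))
  | w < 0x80  = finish 0 w                   -- single-byte (ASCII)
  | w < 0xE0  = finish 1 (w .&. 0x1F)        -- lead byte of a 2-byte sequence
  | w < 0xF0  = finish 2 (w .&. 0x0F)        -- lead byte of a 3-byte sequence
  | otherwise = finish 3 (w .&. 0x07)        -- lead byte of a 4-byte sequence
  where
    w = fromIntegral b :: Int
    -- No bytes left to read: the character is complete, so record it.
    finish 0 acc = inp { prevChar = chr acc, pending = 0, partial = 0 }
    -- Otherwise keep accumulating and leave prevChar untouched.
    finish n acc = inp { pending = n, partial = acc }
```

With the bytes of "aé" (0x61 0xC3 0xA9), the previous character after the last byte is 'é'. Taking the last byte 0xA9 on its own as a character would give '©' instead, which is the kind of mismatch described above.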

Two of the three (or four) return values are actually useless and are removed.

Ericson2314 · 2020-12-25T18:56:46Z

This does make me wonder, is alexGetByte even the right layer of abstraction? Might it be better to pop a whole character?

With this change, both the String (native pop char) and ByteString (native pop byte) are complex, and rightfully so. With alexGetChar, the String could become trivial, and the ByteString case basically becomes no worse, since, as you demonstrate, we already need to track what byte of the character we're at anyways.
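For reference, a rough sketch of why the String side is not trivial at the byte level: each character has to be encoded to UTF-8 and its bytes handed out one at a time, so the input has to remember the bytes not yet delivered. This is an illustration under my own assumptions, not the wrapper's actual code; `StrInput`, `encodeUtf8` and `getByteStr` are hypothetical names.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical String-backed input: previous character, bytes of the current
-- character not yet handed out, and the remaining characters.
type StrInput = (Char, [Word8], String)

-- Encode one character as its UTF-8 byte sequence.
encodeUtf8 :: Char -> [Word8]
encodeUtf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, lo 0]
  | n < 0x10000 = [0xE0 .|. hi 12, lo 6, lo 0]
  | otherwise   = [0xF0 .|. hi 18, lo 12, lo 6, lo 0]
  where
    n = ord c
    -- Leading bits of the code point, placed in the lead byte.
    hi s = fromIntegral (n `shiftR` s)
    -- Six payload bits in a continuation byte (10xxxxxx).
    lo s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Hand out one byte, encoding the next character when no bytes are queued.
getByteStr :: StrInput -> Maybe (Word8, StrInput)
getByteStr (p, b : bs, s) = Just (b, (p, bs, s))
getByteStr (_, [], c : s) = case encodeUtf8 c of
  b : bs -> Just (b, (c, bs, s))
  []     -> Nothing
getByteStr (_, [], [])    = Nothing
```

A character-level interface would make the queue of undelivered bytes unnecessary here, which is the simplification being suggested.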

What do you think?

abt8601 · 2021-01-02T15:32:57Z

I think having alexGetChar makes things simpler even on ByteString, since AlexInput would no longer need to track which byte of the current character we're at.

I guess the reason for having alexGetByte is performance, since Alex internally uses a UTF-8 encoded byte sequence, as stated in the documentation.
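A minimal sketch of what a character-level pop on a ByteString could look like, assuming well-formed UTF-8. `CharInput` and `popChar` are hypothetical names used only for illustration; the input is just the previous character plus the unread bytes, with no per-byte position.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)

-- Hypothetical character-level input: previous character and unread bytes.
type CharInput = (Char, B.ByteString)

-- Pop one whole UTF-8 encoded character. Assumes well-formed input and
-- returns Nothing on empty or truncated input.
popChar :: CharInput -> Maybe (Char, CharInput)
popChar (_, bs) = do
  (b0, _) <- B.uncons bs
  let w = fromIntegral b0 :: Int
      -- Sequence length and payload bits, read off the lead byte.
      (len, bits)
        | w < 0x80  = (1, w)
        | w < 0xE0  = (2, w .&. 0x1F)
        | w < 0xF0  = (3, w .&. 0x0F)
        | otherwise = (4, w .&. 0x07)
      (cur, rest') = B.splitAt len bs
      -- Fold six payload bits from each continuation byte into the code point.
      cont acc b = (acc `shiftL` 6) .|. (fromIntegral b .&. 0x3F)
      c = chr (B.foldl' cont bits (B.drop 1 cur))
  if B.length cur < len then Nothing else Just (c, (c, rest'))
```

The popped character becomes the previous character of the new input directly, so left context needs no extra bookkeeping in this shape.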

Ericson2314 · 2021-01-02T17:49:46Z

I'll have to ponder more what "internally uses UTF-8" means. Maybe @alanz or @jyp remember something from writing 892688f a decade ago? :D

alanz · 2021-01-03T12:03:38Z

It's a long time ago :)

IIRC, getByte accumulates input one Word8 at a time, and only cranks the state machine when it hits a character boundary. So basically it does the `[Word8] -> Char` conversion. And because the commit talks about the NFA blowing up in size and needing to be minimized, I think this may be pushed right into the generated DFA too.

I am not sure if there was proper unicode support in Char at the time.

Either way, GHC parses from a StringBuffer, so I think it needs the unicode processing for a list of bytes.

My brain dump.

Ericson2314 · 2021-01-03T16:40:54Z

Thanks!

> I am not sure if there was proper unicode support in `Char` at the time.

Ah, wonderful, this is just the thing I was hoping to hear. Yes, I am getting more sure we should just be outlining the UTF-8 state machine at this point. We could even support other encodings that way.

(It's wonderful how regular languages serially compose, I only wish someone would do the research so we can do the same with context free ones!)

abt8601 added 2 commits November 17, 2020 14:29

bytestring wrappers: Maintain last character

2207e59

Add test case corresponding to #53

c47ad0d

abt8601 changed the title ~~Fix forward context with UTF-8 input in bytestring wrappers~~ Fix left context with UTF-8 input in bytestring wrappers Nov 17, 2020

Bytestring wrappers: Rewrite flush

3faf110

Two of the three (or four) return values are actually useless and are removed.

Ericson2314 mentioned this pull request Jan 4, 2021

Provide macros corresponding to the Unicode general categories #126

Open

abt8601 closed this Jun 9, 2021

abt8601 deleted the fix-forward-context branch June 9, 2021 03:15
