Token-length reported by alexScan miscomputed in --latin1 mode #119
Comments
hvr added a commit to haskell-hvr/microaeson that referenced this issue on Apr 28, 2018
andreasabel added a commit to andreasabel/alex that referenced this issue on Jan 26, 2020
The computation of the length component of AlexToken was tailored to the utf8 encoding, and didn't work correctly for latin1.

This is fixed by having a new flag ALEX_LATIN1 in templates/GenericTemplate.hs that turns on code that increases the length by 1 for each byte, while for utf8 something more sophisticated is done.

The fix requires more template instances to be generated. To streamline the instance generation, now all 2^4 = 16 template instances are generated for the 4 flags

- ghc
- latin1
- nopred
- debug

To ensure consistent reference to the template instance, a function templateFileName residing both in src/Main and gen-alex-sdist/Main needs to be kept consistent, should more dimensions be added to the template. (Putting this function into a separate file that is included by both modules could be an option, but seemed not enough in the spirit of cabal-organized projects.)
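To illustrate how four boolean flags yield 2^4 = 16 template instances, here is a minimal sketch. Only the function name templateFileName and the four flag names come from the commit message; the signature, the "AlexTemplate" base name, and the dash-separated suffix scheme are assumptions made for illustration and may not match what alex actually does.

```haskell
import Data.List (intercalate)

-- Hypothetical naming scheme: a base name followed by one suffix per
-- enabled flag, always in the fixed order ghc, latin1, nopred, debug.
-- The base name and the separator are assumptions, not taken from alex.
templateFileName :: Bool -> Bool -> Bool -> Bool -> FilePath
templateFileName ghc latin1 nopred debug =
  intercalate "-" ("AlexTemplate" : suffixes)
  where
    -- Entries whose flag is False fail the (True, name) pattern
    -- and are silently skipped by the list comprehension.
    suffixes =
      [ name
      | (True, name) <- [ (ghc, "ghc"), (latin1, "latin1")
                        , (nopred, "nopred"), (debug, "debug") ]
      ]

-- Enumerate every combination of the four flags: 2^4 = 16 names.
allTemplates :: [FilePath]
allTemplates =
  [ templateFileName g l n d
  | g <- bools, l <- bools, n <- bools, d <- bools ]
  where
    bools = [False, True]
```

Because the suffix order is fixed, any two modules that share this function will agree on the file name for a given flag combination, which is the consistency concern the commit message raises.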
simonmar added a commit that referenced this issue on Jan 27, 2020
[ fixed #119 ] latin1 encoding: each byte counts as 1 char
This issue is still open in 3.2.6.
Yes, I do want to release with all the good new stuff added since then. But at the time I did 3.2.6, CI was broken, so I was being very cautious.
The following repro-case demonstrates the problem:
The cause is most likely the one already pointed out in #63, i.e.
https://github.com/simonmar/alex/blob/ff84f447bbca5f3b660fcdc5c3124920c7197b1c/templates/GenericTemplate.hs#L178-L180
which tries to count code points encoded in UTF-8, but which makes no sense in the 8-bit clean --latin1 mode.
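The mismatch can be sketched in a few lines. The code below is not the template code itself. It is a simplified model that assumes the UTF-8 rule counts only bytes that can start a code point, skipping continuation bytes in the range 0x80..0xBF. The names utf8Length, latin1Length, and sample are hypothetical.

```haskell
import Data.Word (Word8)

-- Count only bytes that can start a UTF-8 encoded code point,
-- skipping continuation bytes of the form 10xxxxxx (0x80..0xBF).
utf8Length :: [Word8] -> Int
utf8Length = length . filter isLeadByte
  where
    isLeadByte b = b < 0x80 || b >= 0xC0

-- In an 8-bit clean encoding such as Latin-1, every byte is one character.
latin1Length :: [Word8] -> Int
latin1Length = length

-- Two Latin-1 characters: 0xA9 (copyright sign) and 0xE9 (e with acute).
sample :: [Word8]
sample = [0xA9, 0xE9]
```

Under the UTF-8 rule, 0xA9 looks like a continuation byte and is not counted, so utf8Length sample is 1 while latin1Length sample is 2. Any Latin-1 input containing bytes in 0x80..0xBF would be undercounted in the same way.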