Token-length reported by alexScan miscomputed in --latin1 mode #119

Closed
hvr opened this issue Oct 11, 2017 · 2 comments

@hvr
Member

hvr commented Oct 11, 2017

The following repro-case demonstrates the problem:

-- -*- haskell -*-
{
module Main where

import qualified Data.ByteString as B
import Data.Word
}

%encoding "latin1"

:-

<0> [\x01-\xff]+ { False }
<0> [\x00]       { True  }

{
type AlexInput = B.ByteString

alexGetByte :: AlexInput -> Maybe (Word8,AlexInput)
alexGetByte = B.uncons

alexInputPrevChar :: AlexInput -> Char
alexInputPrevChar = undefined

-- generated by @alex@
alexScan :: AlexInput -> Int -> AlexReturn Bool

{-

GOOD cases:

("012\NUL3","012","\NUL3",3,3,False)
("\NUL0","\NUL","0",1,1,True)
("012","012","",3,3,False)

BAD case:

("0@P`p\128\144\160","0@P`p","",5,8,False)

expected:

("0@P`p\128\144\160","0@P`p\128\144\160","",8,8,False)

-}
main :: IO ()
main = do
    go (B.pack [0x30,0x31,0x32,0x00,0x33]) -- GOOD
    go (B.pack [0x00,0x30]) -- GOOD
    go (B.pack [0x30,0x31,0x32]) -- GOOD

    go (B.pack [0x30,0x40,0x50,0x60,0x70,0x80,0x90,0xa0]) -- BAD
  where
    go inp = case (alexScan inp 0) of
               -- expected invariant: len == B.length inp - B.length inp'
               AlexToken inp' len b -> print (inp, B.take len inp, inp',len,B.length inp - B.length inp',b)
}

The cause is most likely the one already pointed out in #63, i.e.

https://github.com/simonmar/alex/blob/ff84f447bbca5f3b660fcdc5c3124920c7197b1c/templates/GenericTemplate.hs#L178-L180

which tries to count UTF-8-encoded code points; that makes no sense in the 8-bit-clean --latin1 mode.
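
To illustrate the suspected cause, here is a minimal standalone sketch. It assumes that the template only increments the length for bytes that can start a UTF-8 character (anything outside 0x80..0xBF); the names startsChar, utf8StyleLength and latin1Length are hypothetical and do not appear in the template. Under that assumption the BAD input from the repro yields 5 rather than 8, matching the reported output.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- | True for bytes that can start a UTF-8 encoded character (ASCII or a
-- lead byte); False for continuation bytes in the range 0x80..0xBF.
-- Hypothetical helper, assumed to mirror the check in the template.
startsChar :: Word8 -> Bool
startsChar c = c < 0x80 || c >= 0xC0

-- | Length as counted when the input is assumed to be UTF-8:
-- continuation bytes do not contribute.
utf8StyleLength :: B.ByteString -> Int
utf8StyleLength = B.length . B.filter startsChar

-- | Length for an 8-bit-clean encoding: every byte is one character.
latin1Length :: B.ByteString -> Int
latin1Length = B.length

main :: IO ()
main = do
  -- The BAD input from the repro case above.
  let bad = B.pack [0x30,0x40,0x50,0x60,0x70,0x80,0x90,0xa0]
  print (utf8StyleLength bad, latin1Length bad)  -- prints (5,8)
```

The three trailing bytes 0x80, 0x90 and 0xa0 all fall in the continuation range, which is why only the first five bytes are counted.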

hvr added a commit to haskell-hvr/microaeson that referenced this issue Apr 28, 2018
andreasabel added a commit to andreasabel/alex that referenced this issue Jan 26, 2020
The computation of the length component of AlexToken was tailored to
the utf8 encoding, and didn't work correctly for latin1.

This is fixed by having a new flag ALEX_LATIN1 in
templates/GenericTemplate.hs that turns on code that increases the
length by 1 for each byte, while for utf8 something more sophisticated
is done.

The fix requires more template instances to be generated.  To streamline
the instance generation, now all 2^4 = 16 template instances are
generated for the 4 flags

  - ghc
  - latin1
  - nopred
  - debug

To ensure consistent reference to the template instance, a function

  templateFileName

residing both in src/Main and gen-alex-sdist/Main needs to be kept
consistent, should more dimensions be added to the template.

(Putting this function into a separate file that is included by both
modules could be an option, but seemed not enough in the spirit of
cabal-organized projects.)
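
As a rough sketch of the instance enumeration described in this commit message: four boolean flags give 2^4 = 16 combinations, each of which needs a distinct template file name. The Flags record, the "AlexTemplate" prefix and the dash-separated naming below are assumptions made for illustration only; the actual templateFileName in src/Main may have a different signature and naming scheme.

```haskell
import Data.List (intercalate)

-- | The four template dimensions named in the commit message.
-- (Hypothetical record; not taken from the alex sources.)
data Flags = Flags { ghc, latin1, nopred, debug :: Bool }

-- | Hypothetical naming scheme: a fixed prefix followed by the name of
-- each enabled flag, separated by dashes.
templateFileName :: Flags -> FilePath
templateFileName f =
    intercalate "-" ("AlexTemplate" : [ name | (name, True) <- tags ])
  where
    tags = [ ("ghc", ghc f), ("latin1", latin1 f)
           , ("nopred", nopred f), ("debug", debug f) ]

-- | All 2^4 = 16 flag combinations.
allFlags :: [Flags]
allFlags = [ Flags g l n d | g <- bs, l <- bs, n <- bs, d <- bs ]
  where bs = [False, True]

main :: IO ()
main = mapM_ (putStrLn . templateFileName) allFlags
```

Deriving every name from a single function like this is what keeps the generator and the consumer in agreement; with two copies of the function, as the message notes, they have to be kept in sync by hand.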
simonmar added a commit that referenced this issue Jan 27, 2020
[ fixed #119 ] latin1 encoding: each byte counts as 1 char
@andreasabel
Member

This issue is still present in 3.2.6.

@Ericson2314
Collaborator

Yes, I do want to release with all the good new stuff since. But at the time I did 3.2.6, CI was broken, so I was being very, very cautious.
