Token-length reported by `alexScan` miscomputed in `--latin1` mode #119

hvr · 2017-10-11T08:10:47Z

The following repro-case demonstrates the problem:

-- -*- haskell -*-
{
module Main where

import qualified Data.ByteString as B
import Data.Word
}

%encoding "latin1"

:-

<0> [\x01-\xff]+ { False }
<0> [\x00]       { True  }

{
type AlexInput = B.ByteString

alexGetByte :: AlexInput -> Maybe (Word8,AlexInput)
alexGetByte = B.uncons

alexInputPrevChar :: AlexInput -> Char
alexInputPrevChar = undefined

-- generated by @alex@
alexScan :: AlexInput -> Int -> AlexReturn Bool

{-

GOOD cases:

("012\NUL3","012","\NUL3",3,3,False)
("\NUL0","\NUL","0",1,1,True)
("012","012","",3,3,False)

BAD case:

("0@P`p\128\144\160","0@P`p","",5,8,False)

expected:

("0@P`p\128\144\160","0@P`p\128\144\160","",8,8,False)

-}
main :: IO ()
main = do
    go (B.pack [0x30,0x31,0x32,0x00,0x33]) -- GOOD
    go (B.pack [0x00,0x30]) -- GOOD
    go (B.pack [0x30,0x31,0x32]) -- GOOD

    go (B.pack [0x30,0x40,0x50,0x60,0x70,0x80,0x90,0xa0]) -- BAD
  where
    go inp = case (alexScan inp 0) of
               -- expected invariant: len == B.length inp - B.length inp'
               AlexToken inp' len b -> print (inp, B.take len inp, inp',len,B.length inp - B.length inp',b)
}

The cause is most likely the one already pointed out in #63, i.e.

https://github.com/simonmar/alex/blob/ff84f447bbca5f3b660fcdc5c3124920c7197b1c/templates/GenericTemplate.hs#L178-L180

which tries to count code points encoded in UTF-8, but that makes no sense in the 8-bit-clean `--latin1` mode.
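To illustrate why the BAD case reports 5 instead of 8, here is a minimal sketch. It assumes that the linked template lines increment the length only for bytes that are not UTF-8 continuation bytes (0x80 to 0xBF). The names `utf8StyleLen` and `latin1Len` are made up for this illustration and are not part of Alex.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- Assumed UTF-8-style rule: count a byte only if it could start a
-- code point, i.e. skip continuation bytes in the range 0x80..0xBF.
utf8StyleLen :: B.ByteString -> Int
utf8StyleLen = B.foldl' step 0
  where
    step :: Int -> Word8 -> Int
    step n c
      | c < 0x80 || c >= 0xC0 = n + 1
      | otherwise             = n

-- In an 8-bit-clean latin1 mode every byte is one character.
latin1Len :: B.ByteString -> Int
latin1Len = B.length

main :: IO ()
main = do
  -- The input from the BAD case of the repro above.
  let inp = B.pack [0x30,0x40,0x50,0x60,0x70,0x80,0x90,0xa0]
  print (utf8StyleLen inp, latin1Len inp)  -- prints (5,8)
```

Under that assumption the three trailing bytes 0x80, 0x90 and 0xa0 are treated as continuation bytes and left uncounted, which matches the `5` versus `8` mismatch in the repro output.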


The computation of the length component of AlexToken was tailored to the utf8 encoding, and didn't work correctly for latin1. This is fixed by having a new flag ALEX_LATIN1 in templates/GenericTemplate.hs that turns on code that increases the length by 1 for each byte, while for utf8 something more sophisticated is done.

The fix requires more template instances to be generated. To streamline the instance generation, now all 2^4 = 16 template instances are generated for the 4 flags

- ghc
- latin1
- nopred
- debug

To ensure consistent reference to the template instance, a function templateFileName residing both in src/Main and gen-alex-sdist/Main needs to be kept consistent, should more dimensions be added to the template. (Putting this function into a separate file that is included by both modules could be an option, but seemed not enough in the spirit of cabal-organized projects.)

[ fixed #119 ] latin1 encoding: each byte counts as 1 char
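As a rough sketch of the "2^4 = 16 template instances" idea in the commit message: four boolean flags select one template each. The function name `templateFileName` comes from the commit message, but its signature, the `AlexTemplate` prefix and the dash-separated naming scheme below are assumptions made for illustration and may differ from the real code in src/Main.

```haskell
import Data.List (intercalate)

-- Hypothetical naming scheme: a base name plus one suffix per enabled flag.
-- The real Alex implementation may use different arguments and names.
templateFileName :: Bool -> Bool -> Bool -> Bool -> FilePath
templateFileName ghc latin1 nopred debug =
  intercalate "-" ("AlexTemplate" : [ name | (True, name) <- flags ])
  where
    flags =
      [ (ghc,    "ghc")
      , (latin1, "latin1")
      , (nopred, "nopred")
      , (debug,  "debug")
      ]

-- One template instance per combination of the four flags.
allTemplates :: [FilePath]
allTemplates =
  [ templateFileName g l n d
  | g <- [False, True]
  , l <- [False, True]
  , n <- [False, True]
  , d <- [False, True]
  ]

main :: IO ()
main = do
  print (length allTemplates)  -- prints 16
  mapM_ putStrLn allTemplates
```

Deriving every name from a single function like this is what makes it important that the two copies mentioned in the commit message stay in sync: the generator and the consumer must agree on the name for each flag combination.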

andreasabel · 2021-01-10T20:09:07Z

This issue is still open in 3.2.6.

Ericson2314 · 2021-01-10T20:11:15Z

Yes, I do want to release with all the good new stuff since. But at the time I did 3.2.6, CI was broken, so I was being very very cautious.

hvr added a commit to haskell-hvr/microaeson that referenced this issue Apr 28, 2018

Workaround haskell/alex#119

13c2805

andreasabel mentioned this issue Jan 26, 2020

[ #71 ] warn about nullable regexs in the absence of start codes #155

Merged

simonmar closed this as completed in ae525e3 Jan 27, 2020

simonmar added a commit that referenced this issue Jan 27, 2020

Merge pull request #156 from andreasabel/issue119

574ec8c

[ fixed #119 ] latin1 encoding: each byte counts as 1 char
