Corrupted word breaking with fairly large text #19
Comments
Can confirm (Linux, ICU 56.1). For me it always happens after ~6300 processed words, no matter what part of the text I take. It seems to have something to do with garbage collection.

Exhibit A, doesn't fail:

    {-# LANGUAGE OverloadedStrings, ScopedTypeVariables #-}
    import qualified Data.Text.IO as T
    import qualified Data.Text as T
    import qualified Data.Text.ICU as ICU
    import qualified Data.Text.ICU.Break as IO
    import Unsafe.Coerce
    import Data.Text.Foreign

    main = do
      file <- T.readFile "test.txt"
      breaks' (ICU.breakWord "en-US") file
      return ()

    breaks' :: forall a. ICU.Breaker a -> T.Text -> IO ()
    breaks' b t = do
      bi :: IO.BreakIterator IO.Word <-
        IO.clone (unsafeCoerce (b :: ICU.Breaker a))
      IO.setText bi t
      let go p = do
            mix <- IO.next bi
            case mix of
              Nothing -> return ()
              Just n -> do
                s <- IO.getStatus bi
                let d = n-p
                    u = dropWord16 p t
                print (n, p, takeWord16 d u)
                go n
      go =<< IO.first bi

Exhibit B, fails unless run with

    breaks' :: forall a. ICU.Breaker a -> T.Text -> IO [I16]
    breaks' b t = do
      bi :: IO.BreakIterator IO.Word <-
        IO.clone (unsafeCoerce (b :: ICU.Breaker a))
      IO.setText bi t
      let go p = do
            mix <- IO.next bi
            case mix of
              Nothing -> return []
              Just n -> do
                s <- IO.getStatus bi
                let d = n-p
                    u = dropWord16 p t
                print (n, p, takeWord16 d u)
                (n:) `fmap` go n
      go =<< IO.first bi
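
One way to probe the garbage-collection suspicion is to force a major collection before every step of the iterator and compare the result with an unforced run over the same text. The following is only a sketch under assumptions: `countBoundaries` is a hypothetical helper (not part of text-icu), the locale is assumed to be "en-US" as in the exhibits, and it assumes that `IO.breakWord`, `IO.next`, and `System.Mem.performGC` behave as documented.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad (when)
import qualified Data.Text as T
import qualified Data.Text.ICU.Break as IO
import System.Mem (performGC)

-- | Count the word boundaries ICU reports for the given text.
-- When the first argument is True, a major GC is forced before each
-- call to 'IO.next'. Hypothetical helper, for diagnosis only.
countBoundaries :: Bool -> T.Text -> IO Int
countBoundaries forceGC t = do
  bi <- IO.breakWord "en-US" t   -- fresh iterator over the text
  let go acc = do
        when forceGC performGC   -- try to provoke the suspected failure early
        mix <- IO.next bi
        case mix of
          Nothing -> return acc
          Just _  -> go $! acc + 1
  go (0 :: Int)
```

If the two counts differ on the same input, or the forced run degrades much earlier than the ~6300 words mentioned above, that would support the suspicion; identical counts would not rule it out.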
Should be fixed in
I suppose that #4 had the same cause.
Hi,
First of all, thank you for the library.
I ran into a strange problem with word breaking on a large amount of text.
With test.txt (just copied and pasted from the Wikipedia article on Haskell) and this snippet:
Partway through the text, ICU starts breaking at character boundaries. Here is the critical transition:
After this point, nearly every character is isolated. But not always: sometimes characters are bundled pairwise.
Note: I first encountered this bug with German text extracted from EPUB chapters. The behavior seems a bit chaotic: mostly the characters are separated, but sometimes words or parts of words survive.
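
To find the transition described above in a long list of pieces, a small pure helper can look for the first long run of single-character pieces. This is a rough sketch: `firstDegradedRun` is a hypothetical name, the run length threshold is an arbitrary choice, and the list of pieces is assumed to come from whatever word-breaking call is being tested.

```haskell
import qualified Data.Text as T

-- | Index of the first piece that starts a run of at least @k@
-- consecutive single-character pieces, or Nothing if there is no such run.
-- Hypothetical helper; @k@ is a heuristic threshold chosen by the caller.
firstDegradedRun :: Int -> [T.Text] -> Maybe Int
firstDegradedRun k = go 0
  where
    go _ [] = Nothing
    go i ps@(_ : rest)
      | k > 0 && length run >= k = Just i
      | otherwise                = go (i + 1) rest
      where
        -- leading single-character pieces among the next k pieces
        run = takeWhile ((== 1) . T.length) (take k ps)
```

Healthy word breaking also yields single-character pieces (spaces, punctuation, one-letter words), so a small threshold will give false positives; something like 8 or more is probably a safer starting point. Note that `T.length` counts code points, which is adequate for this heuristic.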
I'm using icu4c/56.1 on OS X, installed via brew install icu4c.