Unicode Case Mapping #461

picnoir · 2018-01-12T11:25:58Z

Fixes issue #271.

This PR contains two features:

A script that reads the Unicode case mapping tables and generates a CaseMapping.hs file.
The caseConvert function, which uses those mappings to both upper-case and lower-case a UTF-8 string.

CaseMapping.hs Generator

This part has been inspired by the text package. Basically, I just translated their parser using the foundation's Parse semantics.

This generator reads both the CaseFolding.txt and SpecialCasing.txt files and generates a Haskell equivalent at basement/Basement/String/CaseMapping.hs.

I included a generateCaseMapping.sh convenience script. I had some trouble finding the right stack flags, so you can see this file more as a bit of documentation than anything else :)

This script should be manually run for each Unicode revision: usually once a year, around June.

Upper and Lower Case Conversion

The conversion is done in two passes:

A first pass (see the caseConvertNBuff function) where we check:
- the new buffer size needed for the case conversion.
- whether the String needs any modification for this case conversion. (See @ndmitchell's comment.)
A second pass (see the caseConvert function) where we allocate the new buffer (if needed) and fill it using the case conversion function applied on the source string.
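
To make the two passes concrete, here is a minimal sketch on plain Haskell lists rather than on foundation's UTF-8 buffers. The names `Mapping`, `utf8Len`, `neededSize`, `convert` and `toyUpper` are made up for illustration and are not the functions from this PR; the point is only the shape: measure first, then rebuild only if something changed.

```haskell
import Data.Char (ord, toUpper)

-- Hypothetical stand-in for the generated mapping: one source character
-- maps to one to three result characters.
type Mapping = Char -> [Char]

-- Number of bytes needed to encode a character in UTF-8.
utf8Len :: Char -> Int
utf8Len c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

-- First pass: total output size in bytes, or Nothing when no character
-- changes and the input can be returned untouched.
neededSize :: Mapping -> String -> Maybe Int
neededSize op = go 0 False
  where
    go size changed [] = if changed then Just size else Nothing
    go size changed (c:cs) =
      let out = op c
      in go (size + sum (map utf8Len out)) (changed || out /= [c]) cs

-- Second pass: only rebuild the string when the first pass asks for it.
convert :: Mapping -> String -> String
convert op s = case neededSize op s of
  Nothing -> s
  Just _  -> concatMap op s

-- Toy mapping (an assumption for this sketch): simple upper-casing plus
-- the special case for U+00DF, written here as '\223'.
toyUpper :: Mapping
toyUpper '\223' = "SS"
toyUpper c      = [toUpper c]
```

In this sketch the size from the first pass is discarded by `convert`; in the real code it is what sizes the newly allocated buffer.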

Possible Issues

The parser will not be built by the CI, so it could break at any moment in the future. Is there a way to force Travis to build this module? How should we do it?
I have not benchmarked this change yet.

Side Notes

This is my first contribution to foundation, and I am totally out of my comfort zone here. I think this PR should be carefully reviewed before even thinking about merging it, especially the caseConvertNBuff function, which could lead to memory problems if not implemented correctly.

I did this for training purposes, so any kind of feedback is more than welcome :)

Part of issue haskell-foundation#271.

ndmitchell · 2018-01-12T11:30:47Z

Awesome! Thanks for your work on this. I haven't yet reviewed but will do so.

The way I would handle updating and keeping the script alive would be to have the generator run every time in the CI and at the end of the CI do a git diff and fail if there are changes - indicating the generator needs rerunning. The cost to Vincent is that the CI will fail once a year for reasons that are not his fault, but it does serve as a very handy reminder.

ndmitchell · 2018-01-12T11:31:39Z

basement/Basement/String.hs


+-- | A unicode string size may increase during a case conversion operation.
+--   This function calculates the new buffer size for a case conversion.
+--   Returns Nothing if no case conversion is needed.


Can the size decrease as well?

Indeed, it could!
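
A small sketch of why the byte size can move in both directions. It uses `Data.Char.toLower` from base plus one hand-written entry taken from SpecialCasing.txt; `utf8Len`, `sampleLower` and `sizeChange` are illustrative names, not part of this PR.

```haskell
import Data.Char (ord, toLower)

-- Number of bytes needed to encode a character in UTF-8.
utf8Len :: Char -> Int
utf8Len c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

-- Hand-written sample mapping (an assumption, not the generated table):
-- U+0130 lower-cases to 'i' followed by U+0307; everything else falls
-- back to the simple mapping from base.
sampleLower :: Char -> [Char]
sampleLower '\x0130' = "i\x0307"
sampleLower c        = [toLower c]

-- Byte sizes before and after applying a per-character mapping.
sizeChange :: (Char -> [Char]) -> Char -> (Int, Int)
sizeChange op c = (utf8Len c, sum (map utf8Len (op c)))
```

With these definitions, `sizeChange sampleLower '\x212A'` (KELVIN SIGN, which lower-cases to a plain `k`) shrinks from 3 bytes to 1, while `sizeChange sampleLower '\x0130'` grows from 2 bytes to 3.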

ndmitchell · 2018-01-12T11:32:18Z

basement/Basement/String.hs

+      where
+        !nextI = nextWithIndexer getIdx
+        eSize !e = if e == '\0' 
+                      then 0


Why is 0 of size 0 and not 1? Scratch that, I understand now

ndmitchell · 2018-01-12T11:33:21Z

basement/Basement/UTF8/Types.hs

 newtype StepASCII = StepASCII Word8

+-- | Specialized tuple used for case mapping.
+data CM = CM {-# UNPACK #-} !Char {-# UNPACK #-} !Char {-# UNPACK #-} !Char deriving (Eq)


Should these be Char or Word8?

Please document the invariant that it's isomorphic to [Char] and that all trailing Char must be \0.

Suggest instead using CM Int Int Int with -1 as the sentinel, since the rest of the foundation stuff is Int not Char for the UTFsize etc.

ndmitchell · 2018-01-12T11:36:08Z

tests/Test/Foundation/String.hs

+         , CheckPlan "B should capitalize to B" $ validate "B" $ upper "B" == "B"
+         , CheckPlan "é should capitalize to É" $ validate "é" $ upper "é" == "É"
+         , CheckPlan "ß should capitalize to SS" $ validate "ß" $ upper "ß" == "SS"
+         , CheckPlan "ﬄ should capitalize to FFL" $ validate "ﬄ" $ upper "ﬄ" == "FFL"


Can you please add a random test that capitalising is idempotent or something? Anything that hammers on long strings will show up GC and segfault issues which are the most likely problems.

Given all the sentinels around 0 please make sure you test both \0a and \0\0 (one that requires transforming and one that does not). I suspect they are broken.

Great idea!

Alright, I added the property test.

I am a bit confused by your second comment, though. Isn't \x0a LF and \0\0 two null chars? I am assuming you would like a test case where the leading bytes are 0s; would upper à == À (upper \x00e0 == \x00c0) do the trick? (I just tested, and it capitalizes this case successfully.)

Maybe I am missing the point here...

I meant ['\0','a']. Specifically, try a test where nothing in the string converts (so it fails out early), and a test where they do convert (so you test the conversion path too).

Right, got it!

Done.
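
For readers following along, the checks discussed in this thread could look roughly like the sketch below. It relies only on base and uses `map toUpper` as a stand-in for foundation's `upper` (an assumption: `Data.Char.toUpper` only applies one-to-one mappings, so it does not expand ß to SS). The names `idempotent`, `nulCases`, `checkNul` and `longInput` are invented for this example.

```haskell
import Data.Char (toUpper)

-- Stand-in for foundation's upper (assumption, see the text above).
upper :: String -> String
upper = map toUpper

-- Upper-casing twice must give the same result as upper-casing once.
idempotent :: String -> Bool
idempotent s = upper (upper s) == upper s

-- The two NUL cases from the review: one where nothing converts
-- (early exit) and one where a conversion happens after a NUL.
nulCases :: [(String, String)]
nulCases =
  [ ("\0\0", "\0\0")
  , ("\0a",  "\0A")
  ]

checkNul :: Bool
checkNul = all (\(input, expected) -> upper input == expected) nulCases

-- A long input mixing NUL, ASCII and a two-byte character (U+00E9),
-- to exercise buffer handling on something bigger than a few bytes.
longInput :: String
longInput = concat (replicate 10000 "\0a\233")
```

In the real test suite, a generator-driven property would replace the fixed `longInput`, but the two NUL cases are worth keeping as explicit unit tests.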

ndmitchell · 2018-01-12T11:37:20Z

scripts/caseMapping/CaseMapping.hs

@@ -0,0 +1,41 @@
+{-# LANGUAGE OverloadedStrings #-}
+
+module CaseMapping where


I always require an explicit export list. Not sure how Vincent feels though (if he does then an HLint rule could enforce it)

Oops, my bad, I made that during the prototyping phase, completely forgot to add it when cleaning...

I'm fine either way, since it's an on-the-side script

ndmitchell

I think the use of \0 as the sentinel in CM is problematic for strings containing \0. Suggest you:

Switch to Int and use -1 as the sentinel.
Always have a fast-path for the second char being -1, as that will be the common case by a vast margin and can avoid looking at the third char.
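
A minimal sketch of what this suggestion could look like, assuming a hypothetical Int-based `CM` (the type in the PR uses `Char` fields). `sentinel`, `fromChars` and `toChars` are illustrative helpers, not code from the patch.

```haskell
import Data.Char (chr, ord)

-- Hypothetical Int-based variant of CM: slots hold code points and -1
-- marks an unused slot. Invariant: the first slot is always a real code
-- point, and once a slot is -1 every later slot is -1 too.
data CM = CM !Int !Int !Int deriving (Eq, Show)

sentinel :: Int
sentinel = -1

-- Build a CM from one to three characters (extra characters are dropped;
-- an empty list would break the invariant and is not handled here).
fromChars :: [Char] -> CM
fromChars cs = case map ord cs ++ repeat sentinel of
  (a:b:c:_) -> CM a b c
  _         -> CM sentinel sentinel sentinel  -- unreachable: the list is infinite

-- Recover the characters. The first guard is the fast path: when the
-- second slot is empty, the third slot is never inspected.
toChars :: CM -> [Char]
toChars (CM a b c)
  | b == sentinel = [chr a]
  | c == sentinel = [chr a, chr b]
  | otherwise     = [chr a, chr b, chr c]
```

Because -1 is not a valid code point, `toChars (fromChars "\0")` round-trips to `"\0"`, which is exactly the case a `'\0'` sentinel cannot distinguish from an empty slot.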

ndmitchell · 2018-01-12T12:42:51Z

basement/Basement/String.hs

+            | otherwise = do
+                let !(c, idx') = nextI idx
+                    !cm@(CM c1 c2 c3) = op c 
+                    !cSize = (eSize c1) + (eSize c2) + (eSize c3)


Unnecessary brackets

Also if c2 == 0 we can avoid adding in c3, which is going to be a saving in the common case.

ndmitchell · 2018-01-12T12:43:36Z

basement/Basement/String.hs

+                let !(c, idx') = nextI idx
+                    !cm@(CM c1 c2 c3) = op c 
+                    !cSize = (eSize c1) + (eSize c2) + (eSize c3)
+                    !nconvert = convert || not ((c1 == c) && (c2 == '\0') && (c3 == '\0'))


Do you need to test c3? If c2 == 0 doesn't that imply c3 == 0?

Indeed it does :)

Fixed.

The whole expression might be clearer as changed' = changed || c1 /= c || c2 /= 0
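
The shorter form is equivalent to the original as long as the invariant "c2 empty implies c3 empty" holds. A quick sketch that checks this on a few hand-picked samples; the local `CM` copy and the names `changedLong`, `changedShort`, `samples` and `agree` exist only for this illustration.

```haskell
-- Local copy of the Char-based tuple, with '\0' marking an empty slot.
data CM = CM !Char !Char !Char

-- The condition as written in the diff under review.
changedLong :: Char -> CM -> Bool
changedLong c (CM c1 c2 c3) = not ((c1 == c) && (c2 == '\0') && (c3 == '\0'))

-- The suggested simplification: c3 is not inspected at all.
changedShort :: Char -> CM -> Bool
changedShort c (CM c1 c2 _) = c1 /= c || c2 /= '\0'

-- Samples that respect the invariant (trailing slots are '\0').
samples :: [(Char, CM)]
samples =
  [ ('a',      CM 'A' '\0' '\0')  -- changes
  , ('A',      CM 'A' '\0' '\0')  -- unchanged
  , ('\223',   CM 'S' 'S'  '\0')  -- U+00DF expands to two characters
  , ('\xFB04', CM 'F' 'F'  'L')   -- U+FB04 expands to three characters
  , ('\0',     CM '\0' '\0' '\0') -- NUL maps to itself
  ]

agree :: Bool
agree = all (\(c, cm) -> changedLong c cm == changedShort c cm) samples
```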

ndmitchell · 2018-01-12T12:44:03Z

basement/Basement/String.hs

+                      then 0
+                      else charToBytes (fromEnum e)
+        loop !idx ns convert
+            | idx == end = if convert


Suggest renaming convert to changed - you are always doing the conversion, the question is if things change during that process.

ndmitchell · 2018-01-12T12:47:03Z

basement/Basement/String.hs

+caseConvert op s@(String ba) 
+  = case nBuff of
+      Nothing -> s
+      (Just nLen) -> runST $ unsafeCopyFrom s nLen go


Unnecessary brackets.

ndmitchell · 2018-01-12T12:48:03Z

basement/Basement/String.hs

+      let !(CM c1 c2 c3) = op c 
+      dstIdx'    <- writeChar c1 dstIdx 
+      dstIdx''   <- writeChar c2 dstIdx'
+      nextDstIdx <- writeChar c3 dstIdx''


if c2 is 0 can't you shortcut testing c3? It makes the code less clean, but it's so very common that it's probably worth it.

It sadly makes the code quite messy. But you're right, the common case should be short.

Fixed, if the code looks too messy, I can revert this change.

ndmitchell · 2018-01-12T12:53:16Z

I will say this patch looks really really good - my requested changes (which @vincenthz really needs to confirm are the direction he wants to take things) are pretty minor - looks very nice. I definitely owe you a beer.

picnoir · 2018-01-12T14:39:29Z

Thanks for this quick review Neil!

Should I squash the requested changes or should I push a proper commit?

ndmitchell · 2018-01-12T14:44:19Z

That's a question for @vincenthz (although my general feeling is if people want squashed commits they can do it with the merge so no reason to get the contributors to do it)

ndmitchell · 2018-01-12T14:45:24Z

Do let me know when you've done all the changes you are planning on doing and I'll re-review.

picnoir · 2018-01-12T15:48:22Z

I just posted the changes.

The only change I did not push was the CM elements type change to Int. I'll wait for @vincenthz's feedback to make up my mind on that. It is a minor change anyway: we just need to add a toEnum call on the code generator side and a fromEnum call on the library side. I'll add this change tomorrow night if needed.

Until then, thanks again for the review and enjoy your weekend!

ndmitchell · 2018-01-12T15:59:26Z

Ah, not "0a" but "\0a" - so a null character followed by a. My suspicion is that the code is wrong for strings containing \0, and that you'll have to move to Int so you can have access to a sentinel value (e.g. -1).

ndmitchell · 2018-01-12T16:00:10Z

basement/Basement/String.hs

+                let !(c, idx') = nextI idx
+                    !cm@(CM c1 c2 c3) = op c 
+                    !cSize = if c2 == '\0' -- if c2 is empty, c3 will be empty as well.
+                              then eSize c1 


For c1 you can just call charToBytes directly and avoid one test against \0.

*facepalm

Thanks!

ndmitchell · 2018-01-12T16:01:46Z

basement/Basement/String.hs

+            !(Step c nextSrcIdx) = next src' srcIdx
+            writeChar cc wIdx =
+                if cc == '\0'
+                    then return wIdx 


You can avoid the \0 test for c1

I am directly using write for C1, see line 1360. But you're right, writeChar is a confusing name, I guess writeValidChar would be more appropriate!

Maybe writeMaybeChar?

ndmitchell · 2018-01-12T16:01:48Z

basement/Basement/String.hs

+          then return dstIdx'
+          else do
+            dstIdx''  <- writeChar c2 dstIdx'
+            writeChar c3 dstIdx''


Personally I'd just keep reusing dstIdx shadowing it as that eliminates a nasty class of failures where you get the wrong number of primes.
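
A sketch of the shadowing style, using an `IORef` list as a toy buffer instead of the mutable byte array from the PR. `writeMaybeChar`, `writeThree` and `demo` are hypothetical names. Each `<-` rebinds `dstIdx`, and the right-hand side still sees the previous binding, so there is no prime to miscount.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Hypothetical writer: appends the character to a buffer unless it is the
-- '\0' sentinel, and returns the next write index.
writeMaybeChar :: IORef [Char] -> Char -> Int -> IO Int
writeMaybeChar buf c idx
  | c == '\0' = return idx
  | otherwise = do
      modifyIORef' buf (c :)
      return (idx + 1)

-- Writes up to three characters, shadowing dstIdx at every step.
writeThree :: IORef [Char] -> (Char, Char, Char) -> Int -> IO Int
writeThree buf (c1, c2, c3) dstIdx = do
  dstIdx <- writeMaybeChar buf c1 dstIdx
  dstIdx <- writeMaybeChar buf c2 dstIdx
  writeMaybeChar buf c3 dstIdx

-- Runs the writer on a two-character expansion and returns the final
-- index together with the buffer contents in write order.
demo :: IO (Int, String)
demo = do
  buf <- newIORef []
  end <- writeThree buf ('S', 'S', '\0') 0
  out <- readIORef buf
  return (end, reverse out)
```

One trade-off: GHC's `-Wname-shadowing` warning (part of `-Wall`) flags this style, so a project that builds with warnings as errors would need to disable that warning for the module or stick with primes.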

picnoir · 2018-01-12T16:10:20Z

Oh yes! I finally get the problem!

Indeed, it feels like the change to Int is mandatory...

picnoir · 2018-01-12T16:42:46Z

As you can see, handling \0 is not problematic anymore. This is a side effect of the latest changes (we do not check C1 anymore).

It still feels like a time bomb to me; we need to change CM's inner type to Int.

I don't have the time to do it right now, but it'll be done before Monday.

ndmitchell · 2018-01-12T17:18:00Z

Ah, I didn't think of just not checking c1 - with that I'm happy for it to be either Int or Char - you can argue either way.

vincenthz · 2018-01-12T18:18:03Z

basement/Basement/String.hs

+            writeValidChar cc wIdx =
+                if cc == '\0'
+                    then return wIdx 
+                else do


style nitpick: the usual style is:

if x then a else b

vincenthz · 2018-01-12T18:19:38Z

scripts/caseMapping/CaseMapping.hs

+  cfs <- case pcf of
+           Left err -> putStrLn (show err) >> undefined
+           Right cf -> return cf
+  h <- openFile ("../../basement/Basement/String/CaseMapping.hs") WriteMode


might be easier to have the haskell program write to stdout and redirect the stdout to the file when wanting to rewrite
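
A rough sketch of that idea, under the assumption that the parse result is an `Either` of an error message and a list of character pairs (the real parser in this PR has its own types). `renderModule` and `emit` are hypothetical names. The generated text goes to stdout, and a parse failure is reported on stderr with a non-zero exit code rather than falling through to `undefined`.

```haskell
import System.Exit (exitFailure)
import System.IO (hPutStrLn, stderr)

-- Hypothetical renderer: turns parsed (source, target) pairs into the
-- text of a Haskell module, ending with an identity fallback clause.
renderModule :: [(Char, Char)] -> String
renderModule pairs = unlines (header ++ map row pairs ++ ["mapping c = c"])
  where
    header =
      [ "module CaseMapping (mapping) where"
      , ""
      , "mapping :: Char -> Char"
      ]
    row (from, to) = "mapping " ++ show from ++ " = " ++ show to

-- Print the module on stdout; report parse errors on stderr and exit
-- with a failure code.
emit :: Either String [(Char, Char)] -> IO ()
emit (Left err)    = hPutStrLn stderr err >> exitFailure
emit (Right pairs) = putStr (renderModule pairs)
```

The convenience script can then redirect stdout into the target file, which keeps the output path out of the Haskell source.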

vincenthz · 2018-01-12T18:21:33Z

nice work @NinjaTrappeur ! also thanks for the review @ndmitchell !

vincenthz · 2018-01-13T12:16:32Z

I've squashed the commit list since @NinjaTrappeur requested it, but otherwise I don't have a problem with including the history (even if not perfect) in a normal merge. I believe there was nothing else on @ndmitchell's list of things (despite the review still being in red condition; let us know, @ndmitchell, if I forgot anything). Thanks again @NinjaTrappeur

ndmitchell · 2018-01-13T16:07:12Z

The main thing we need to know is your approach to generation. Mine would be to run it in every Travis run and fail if there are changes, which means it will fail once a year. Other people have different approaches.

ndmitchell · 2018-01-13T16:08:48Z

And thinking of it, a test showing it is equal to text package would give a lot of confidence.

picnoir · 2018-01-13T22:05:23Z

I agree with @ndmitchell about the travis part:

it will run the script periodically, preventing it from becoming outdated.
it will act as a heads-up for every Unicode release. I just realized the text package is one release behind.

I was actually just writing that part here.

It seems to fail on GHC < 8.2 and needs some adjustments. I'll take a deeper look at that tomorrow.

I'll also look into a test showing it is equal to text. I think it will also be interesting to benchmark the two implementations and see if everything is alright performance-wise.

@vincenthz Do you mind if I open a new PR where I can submit those changes?

vincenthz · 2018-01-14T10:30:04Z

@NinjaTrappeur I don't mind at all, the more the merrier.

As to a testing plan, I don't mind too much: the more automated it is, the less likely it is to create out-of-date problems, but I would also like to make sure we still have fast tests (which means minimal extra/optional dependencies).

Showing equivalence to text is useful (provided that it's not out of date) but it's probably a good idea to limit it to the edge travis run which depends on text and bytestring already.

picnoir added 4 commits December 6, 2017 16:26

add unicode casefolding files parser

2e2f3f9

Part of issue haskell-foundation#271.

Add unicode special case parser.

bac11eb

Part of issue haskell-foundation#271.

Pretty printing hex values.

f9ead6c

Part of issue haskell-foundation#271.

Implement unicode multi character upper and lower.

219bc8a

Part of issue haskell-foundation#271.

ndmitchell approved these changes Jan 12, 2018

ndmitchell suggested changes Jan 12, 2018

picnoir added 2 commits January 12, 2018 16:42

Optimize caseConvert for single char convert.

dd5084b

Part of issue haskell-foundation#271.

Property testing upper.

175f5a8

Part of issue haskell-foundation#271.

ndmitchell reviewed Jan 12, 2018

NMichels review, batch 2.

18e80d6

Part of issue haskell-foundation#271.

vincenthz changed the title ~~Issue 271~~ Unicode Case Mapping Jan 12, 2018

vincenthz reviewed Jan 12, 2018

vincenthz merged commit 1bb7fca into haskell-foundation:master Jan 13, 2018
