Unicode Case Mapping #461
Part of issue haskell-foundation#271.
Awesome! Thanks for your work on this. I haven't yet reviewed but will do so. The way I would handle updating and keeping the script alive would be to have the generator run every time in the CI and at the end of the CI do a git diff and fail if there are changes - indicating the generator needs rerunning. The cost to Vincent is that the CI will fail once a year for reasons that are not his fault, but it does serve as a very handy reminder.
-- | A unicode string size may increase during a case conversion operation.
-- This function calculates the new buffer size for a case conversion.
-- Returns Nothing if no case conversion is needed.
Can the size decrease as well?
Indeed, it could!
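To make the size question concrete, here is a minimal list-based sketch of the size computation. It is a model only, not the PR's `caseConvertNBuff`: `toyUpper`, `cmSize` and `newBufferSize` are hypothetical names, the mapping covers a handful of code points, and the `CM` type is a simplified stand-in. It shows one input whose UTF-8 size shrinks (dotless i, U+0131, upper-cases to the one-byte `I`).

```haskell
import Data.Char (chr, ord)

-- Hypothetical stand-in for the PR's CM type: up to three result
-- characters, with unused trailing slots set to '\0'.
data CM = CM !Char !Char !Char deriving (Eq, Show)

-- Bytes needed to encode one code point in UTF-8.
utf8Size :: Char -> Int
utf8Size c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

-- Toy upper-case mapping (assumption: only these few cases are covered).
toyUpper :: Char -> CM
toyUpper c
  | c == '\xDF'          = CM 'S' 'S' '\0'   -- sharp s: 2 bytes -> 2 bytes
  | c == '\x131'         = CM 'I' '\0' '\0'  -- dotless i: 2 bytes -> 1 byte
  | c >= 'a' && c <= 'z' = CM (chr (ord c - 32)) '\0' '\0'
  | otherwise            = CM c '\0' '\0'

-- Size of one mapping result; the first slot is always counted,
-- later slots only until the first '\0'.
cmSize :: CM -> Int
cmSize (CM c1 c2 c3)
  | c2 == '\0' = utf8Size c1
  | c3 == '\0' = utf8Size c1 + utf8Size c2
  | otherwise  = utf8Size c1 + utf8Size c2 + utf8Size c3

-- Nothing when no character changes, otherwise the new size in bytes.
newBufferSize :: (Char -> CM) -> String -> Maybe Int
newBufferSize op s
  | any changes s = Just (sum (map (cmSize . op) s))
  | otherwise     = Nothing
  where changes c = op c /= CM c '\0' '\0'
```

In this model `newBufferSize toyUpper ['\x131']` yields a size of one byte for a two-byte input, so the result can indeed be smaller than the source.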
    where
        !nextI = nextWithIndexer getIdx
        eSize !e = if e == '\0'
                       then 0
Why is 0 of size 0 and not 1? Scratch that, I understand now
newtype StepASCII = StepASCII Word8

-- | Specialized tuple used for case mapping.
data CM = CM {-# UNPACK #-} !Char {-# UNPACK #-} !Char {-# UNPACK #-} !Char deriving (Eq)
Should these be Char or Word8?
Please document the invariant that it's isomorphic to [Char] and that all trailing Char must be \0.
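One way to read the requested invariant is sketched below. The helper names `cmToList`, `cmFromList` and `cmValid` are hypothetical and do not appear in the patch; the `CM` type is a simplified copy without the `UNPACK` pragmas. The sketch treats `'\0'` as padding in every slot, which is an assumption about the intended meaning.

```haskell
-- Simplified copy of the CM type under review.
data CM = CM !Char !Char !Char deriving (Eq, Show)

-- Read a CM as the list of characters it stands for:
-- everything up to the first '\0' padding slot.
cmToList :: CM -> [Char]
cmToList (CM c1 c2 c3) = takeWhile (/= '\0') [c1, c2, c3]

-- The invariant: once a slot holds '\0', every later slot does too.
cmValid :: CM -> Bool
cmValid (CM c1 c2 c3) = all (== '\0') (dropWhile (/= '\0') [c1, c2, c3])

-- Build a CM from a list of at most three characters.
-- A list containing '\0' has no encoding under this reading.
cmFromList :: [Char] -> Maybe CM
cmFromList cs
  | any (== '\0') cs = Nothing
  | otherwise = case cs of
      []        -> Just (CM '\0' '\0' '\0')
      [a]       -> Just (CM a '\0' '\0')
      [a, b]    -> Just (CM a b '\0')
      [a, b, c] -> Just (CM a b c)
      _         -> Nothing
```

Under this reading the round trip `fmap cmToList (cmFromList xs)` returns `Just xs` for any list of up to three non-NUL characters, while a result that itself contains `'\0'` cannot be represented.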
Suggest instead using CM Int Int Int with -1 as the sentinel, since the rest of the foundation stuff is Int not Char for the UTFsize etc.
, CheckPlan "B should capitalize to B" $ validate "B" $ upper "B" == "B"
, CheckPlan "é should capitalize to É" $ validate "é" $ upper "é" == "É"
, CheckPlan "ß should capitalize to SS" $ validate "ß" $ upper "ß" == "SS"
, CheckPlan "ffl should capitalize to FFL" $ validate "ffl" $ upper "ffl" == "FFL"
Can you please add a random test that capitalising is idempotent or something? Anything that hammers on long strings will show up GC and segfault issues which are the most likely problems.
Given all the sentinels around 0 please make sure you test both \0a and \0\0 (one that requires transforming and one that does not). I suspect they are broken.
Great idea!
Alright, I added the property test.
I am a bit confused by your second comment though. Isn't \x0a LF, and \0\0 two null chars? I am assuming you would like a test case where the leading bytes are 0s; would upper à == À (upper \x00e0 == \x00c0) do the trick? (I just tested, it capitalizes this case successfully.)
I am maybe missing the point here...
I meant ['\0','a']. Specifically, try a test where nothing in the string converts (so it fails out early), and a test where they do convert (so you test the conversion path too).
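The checks requested in this thread can be modelled on plain lists as follows. This is a sketch under assumptions: `toyUpper`, `expand` and `upperModel` are hypothetical stand-ins rather than foundation's `upper`, and the exhaustive `samples` list replaces the random generator a real property test would use.

```haskell
import Data.Char (chr, ord)

-- Simplified stand-in for the CM type; '\0' pads slots two and three.
data CM = CM !Char !Char !Char

-- Toy upper-case mapping (assumption: only these cases are covered).
toyUpper :: Char -> CM
toyUpper c
  | c == '\xDF'          = CM 'S' 'S' '\0'
  | c >= 'a' && c <= 'z' = CM (chr (ord c - 32)) '\0' '\0'
  | otherwise            = CM c '\0' '\0'

-- The first slot is always emitted, so an input '\0' survives.
expand :: CM -> String
expand (CM c1 c2 c3)
  | c2 == '\0' = [c1]
  | c3 == '\0' = [c1, c2]
  | otherwise  = [c1, c2, c3]

upperModel :: String -> String
upperModel = concatMap (expand . toyUpper)

-- Every string of length 0..3 over a small alphabet that includes '\0'.
samples :: [String]
samples = concatMap (\n -> sequence (replicate n alphabet)) [0 .. 3]
  where alphabet = ['\0', 'a', 'B', '\xDF']

-- Upper-casing twice gives the same result as upper-casing once.
idempotent :: Bool
idempotent = all (\s -> upperModel (upperModel s) == upperModel s) samples

-- One case that takes the conversion path and one that changes nothing.
nulThenLetter, nulOnly :: Bool
nulThenLetter = upperModel ['\0', 'a'] == ['\0', 'A']
nulOnly       = upperModel ['\0', '\0'] == ['\0', '\0']
```

Both NUL cases pass in this model only because `expand` never tests the first slot against `'\0'`; a variant that skipped a NUL first slot would drop the character.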
Right, got it!
Done.
scripts/caseMapping/CaseMapping.hs
Outdated
@@ -0,0 +1,41 @@
{-# LANGUAGE OverloadedStrings #-}

module CaseMapping where
I always require an explicit export list. Not sure how Vincent feels though (if he does then an HLint rule could enforce it)
Oops, my bad, I made that during the prototyping phase, completely forgot to add it when cleaning...
I'm fine either way, since it's an on-the-side script
I think the use of \0 as the sentinel in CM is problematic for strings containing \0. Suggest you:
- Switch to Int and use -1 as the sentinel.
- Always have a fast-path for the second char being -1, as that will be the common case by a vast margin and can avoid looking at the third char.
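A minimal sketch of the two suggestions, assuming a simplified `CM` without `UNPACK` pragmas; `noChar`, `single` and `cmChars` are hypothetical names that do not come from the patch.

```haskell
import Data.Char (chr, ord)

-- Code points stored as Int so that -1 can mark an unused slot.
data CM = CM !Int !Int !Int deriving (Eq, Show)

-- Sentinel for "no character"; it is not a valid code point.
noChar :: Int
noChar = -1

-- Mapping result made of exactly one character.
single :: Char -> CM
single c = CM (ord c) noChar noChar

-- Decode a CM. The first guard is the fast path: when the second slot
-- is empty the third slot is never inspected.
cmChars :: CM -> [Char]
cmChars (CM c1 c2 c3)
  | c2 == noChar = [chr c1]
  | c3 == noChar = [chr c1, chr c2]
  | otherwise    = [chr c1, chr c2, chr c3]
```

With this encoding `cmChars (single '\0')` returns the NUL character itself, because 0 is an ordinary code point and no longer doubles as padding.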
basement/Basement/String.hs
Outdated
        | otherwise = do
            let !(c, idx') = nextI idx
                !cm@(CM c1 c2 c3) = op c
                !cSize = (eSize c1) + (eSize c2) + (eSize c3)
Unnecessary brackets
Also if c2 == 0 we can avoid adding in c3, which is going to be a saving in the common case.
Fixed.
basement/Basement/String.hs
Outdated
let !(c, idx') = nextI idx
    !cm@(CM c1 c2 c3) = op c
    !cSize = (eSize c1) + (eSize c2) + (eSize c3)
    !nconvert = convert || not ((c1 == c) && (c2 == '\0') && (c3 == '\0'))
Do you need to test c3? If c2 == 0 doesn't that imply c3 == 0?
Indeed it does :)
Fixed.
The whole expression might be clearer as changed' = changed || c1 /= c || c2 /= 0
basement/Basement/String.hs
Outdated
                       then 0
                       else charToBytes (fromEnum e)
        loop !idx ns convert
            | idx == end = if convert
Suggest renaming convert to changed - you are always doing the conversion, the question is if things change during that process.
Fixed.
basement/Basement/String.hs
Outdated
caseConvert op s@(String ba)
    = case nBuff of
        Nothing -> s
        (Just nLen) -> runST $ unsafeCopyFrom s nLen go
Unnecessary brackets.
Fixed.
basement/Basement/String.hs
Outdated
let !(CM c1 c2 c3) = op c
dstIdx' <- writeChar c1 dstIdx
dstIdx'' <- writeChar c2 dstIdx'
nextDstIdx <- writeChar c3 dstIdx''
If c2 is 0 can't you shortcut testing c3? It makes the code less clean, but it's so very common that it's probably worth it.
It sadly makes the code quite messy. But you're right, the common case should be short.
Fixed. If the code looks too messy, I can revert this change.
I will say this patch looks really really good - my requested changes (which @vincenthz really needs to confirm are the direction he wants to take things) are pretty minor - looks very nice. I definitely owe you a beer.
Thanks for this quick review Neil! Should I squash the requested changes or should I push a proper commit?
That's a question for @vincenthz (although my general feeling is if people want squashed commits they can do it with the merge so no reason to get the contributors to do it)
Do let me know when you've done all the changes you are planning on doing and I'll re-review.
I just posted the changes. The only change I did not push was the CM elements type change to Int. I'll wait for @vincenthz's feedback to make up my mind on that. It is a minor change anyway, we just need to add a toEnum call on the code generator side and a fromEnum call on the library side. I'll add this change tomorrow night if needed. Until then, thanks again for the review and enjoy your weekend!
Ah, not
basement/Basement/String.hs
Outdated
let !(c, idx') = nextI idx
    !cm@(CM c1 c2 c3) = op c
    !cSize = if c2 == '\0' -- if c2 is empty, c3 will be empty as well.
                 then eSize c1
For c1 you can just call charToBytes directly and avoid one test against \0.
*facepalm
Thanks!
!(Step c nextSrcIdx) = next src' srcIdx
writeChar cc wIdx =
    if cc == '\0'
        then return wIdx
You can avoid the \0 test for c1
I am directly using write for C1, see line 1360. But you're right, writeChar is a confusing name, I guess writeValidChar would be more appropriate!
Maybe writeMaybeChar?
basement/Basement/String.hs
Outdated
    then return dstIdx'
    else do
        dstIdx'' <- writeChar c2 dstIdx'
        writeChar c3 dstIdx''
Personally I'd just keep reusing dstIdx, shadowing it, as that eliminates a nasty class of failures where you get the wrong number of primes.
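The shadowing style, combined with the c2 shortcut discussed earlier in the review, might look like the sketch below. It is a model under assumptions: the buffer is an `STRef` holding a reversed list instead of foundation's mutable byte array, indices count characters rather than bytes, and `write`, `writeMaybeChar`, `writeCM` and `runWrite` are hypothetical helpers.

```haskell
import Control.Monad.ST (ST, runST)
import Data.STRef (STRef, modifySTRef', newSTRef, readSTRef)

-- Simplified stand-in for the CM type; '\0' pads slots two and three.
data CM = CM !Char !Char !Char

-- Append one character to the model buffer and return the next index.
write :: STRef s [Char] -> Int -> Char -> ST s Int
write buf idx c = do
  modifySTRef' buf (c :)
  return (idx + 1)

-- Write a character unless it is the '\0' padding value.
writeMaybeChar :: STRef s [Char] -> Char -> Int -> ST s Int
writeMaybeChar buf c idx
  | c == '\0' = return idx
  | otherwise = write buf idx c

-- Each bind rebinds dstIdx, so no primed name can be mixed up.
writeCM :: STRef s [Char] -> CM -> Int -> ST s Int
writeCM buf (CM c1 c2 c3) dstIdx = do
  dstIdx <- write buf dstIdx c1          -- c1 is written unconditionally
  if c2 == '\0'
    then return dstIdx                   -- common case: c3 is never touched
    else do
      dstIdx <- writeMaybeChar buf c2 dstIdx
      writeMaybeChar buf c3 dstIdx

-- Run the writer on an empty buffer; returns the contents and end index.
runWrite :: CM -> (String, Int)
runWrite cm = runST (do
  buf <- newSTRef []
  end <- writeCM buf cm 0
  out <- readSTRef buf
  return (reverse out, end))
```

The trade-off is that GHC's `-Wname-shadowing` warning, which is part of `-Wall`, flags each rebinding, so a project building with warnings as errors would need to disable that warning for this style.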
Oh yes! I finally get the problem! Indeed, it feels like the change to Int is mandatory...
As you can see, handling \0 is not problematic anymore. This is a side effect of the latest changes (we do not check c1 anymore). It still feels like a time bomb to me; we need to change CM's inner type to Int. I don't have the time to do it right now, it'll be done before Monday.
Ah, I didn't think of just not checking c1 - with that I'm happy for it to be either
writeValidChar cc wIdx =
    if cc == '\0'
    then return wIdx
    else do
style nitpick: the usual style is:
if x
    then a
    else b
cfs <- case pcf of
    Left err -> putStrLn (show err) >> undefined
    Right cf -> return cf
h <- openFile ("../../basement/Basement/String/CaseMapping.hs") WriteMode
might be easier to have the haskell program write to stdout and redirect the stdout to the file when wanting to rewrite
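A sketch of that shape: the generator builds the module text as a pure value and leaves the destination to the caller. Everything here is a placeholder for illustration, including the module name `ToyCaseMapping`, the function `renderModule` and the table format; none of it reflects the PR's actual generated output.

```haskell
-- Render a toy mapping table as Haskell source text.
-- Each table entry becomes one clause; unknown characters map to themselves.
renderModule :: [(Char, String)] -> String
renderModule table = unlines (header ++ map clause table ++ [fallback])
  where
    header =
      [ "module ToyCaseMapping (toyUpper) where"
      , ""
      , "toyUpper :: Char -> String"
      ]
    -- show produces valid Haskell literals for both Char and String.
    clause (c, s) = "toyUpper " ++ show c ++ " = " ++ show s
    fallback = "toyUpper c = [c]"
```

A `main` that calls `putStr (renderModule table)` would then print the module, and the wrapper script could redirect standard output to the target file, which keeps the relative output path out of the Haskell program.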
Nice work @NinjaTrappeur! Also thanks for the review @ndmitchell!
I've squashed the commit list since @NinjaTrappeur requested it, but otherwise I don't have a problem including the history (even if not perfect) in a normal merge. I believe there was nothing else on @ndmitchell's list of things (despite the review still being in red condition; let us know, @ndmitchell, if I forgot anything). Thanks again @NinjaTrappeur
The main thing we need to know is your approach to generation. Mine would be to run it in every Travis run and fail if there are changes, which means it will fail once a year. Other people have different approaches.
And thinking of it, a test showing it is equal to
I agree with @ndmitchell about the travis part:
I was actually just writing that part here. It seems to fail on GHC < 8.2, it needs some adjustments. I'll have a deeper look at that tomorrow. I'll also have a look at a test showing it is equal to @vincenthz Do you mind if I open a new PR where I can submit those changes?
@NinjaTrappeur I don't mind at all, the more the merrier. As to a testing plan, I don't mind too much: the more automated it is, the less likely it is to create out-of-date problems, but I would also like to make sure we still have fast tests (which means minimal extra/optional dependencies). Showing equivalence to text is useful (provided that it's not out of date) but it's probably a good idea to limit it to the edge Travis run, which depends on text and bytestring already.
Fixes issue #271.
This PR contains two features:
CaseMapping.hs Generator
This part has been inspired by the text package. Basically, I just translated their parser using the foundation's Parse semantics.
This generator reads both the CaseFolding.txt and SpecialCasing.txt files and generates a Haskell equivalent at basement/Basement/String/CaseMapping.hs.
I included a generateCaseMapping.sh convenience script. I had some trouble finding the right stack flags, you can see this file more as a bit of documentation than anything else :)
This script should be manually run for each Unicode revision: usually once a year, around June.
Upper and Lower Case Conversion
The conversion is done in two passes:
Possible Issues
Side Notes
This is my first contribution to foundation, I am totally out of my comfort zone here. I think this PR should be carefully reviewed before even thinking about merging it, especially the caseConvertNBuff function which could lead to some memory problems if not implemented right.
I did this for training purposes, any kind of feedback is more than welcome :)