Skip to content
This repository was archived by the owner on Sep 20, 2023. It is now read-only.

Conversation

picnoir
Copy link
Contributor

@picnoir picnoir commented Jan 12, 2018

Fixes issue #271.

This PR contains two features:

  1. A script that read Unicode case mapping tables and generates a CaseMapping.hs file.
  2. The caseConvert function which uses those mapping in order to both upper and lower case a UTF-8 string.

CaseMapping.hs Generator

This part has been inspired by the text package. Basically, I just translated their parser using the foundation's Parse semantics.

This generator reads both CaseFolding.txt and SpecialCasing.txt files and generate an Haskell equivalent at basement/Basement/String/CaseMapping.hs .

I included a generateCaseMapping.sh convenience script. I had some trouble finding the right stack flags, you can see this file more as a bit of documentation than anything else :)

This script should be manually run for each Unicode revision: usually once a year, around June.

Upper and Lower Case Conversion

The conversion is done in two pass:

  • A first pass (see the caseConvertNBuff function) where we check:
    • the new buffer size needed for the case conversion.
    • if the String needs any modification for this case conversion. (See @ndmitchell comment)
  • A second pass (see the caseConvert function) where we allocate the new buffer (if needed) and fill it using the case conversion function applied on the source string.

Possible Issues

  • The parser will not be built by the CI, hence, it could break at any moment in the future. Have we any way to force the build of this module by Travis? How should we do it?
  • I did not benchmarked this change yet.

Side Notes

This is my first contribution to foundation, I am totally out of my comfort zone here. I think this PR should be carefully reviewed before even thinking about merging it, especially the caseConvertNBuff function which could lead to some memory problems if not implemented right.

I did this for training purposes, any kind of feedback is more than welcome :)

@ndmitchell
Copy link
Contributor

Awesome! Thanks for your work on this. I haven't yet reviewed but will do so.

The way I would handle updating and keeping the script alive would be to have the generator run every time in the CI and at the end of the CI do a git diff and fail if there are changes - indicating the generator needs rerunning. The cost to Vincent is that the CI will fail once a year for reasons that are not his fault, but it does serve as a very handy reminder.


-- | A unicode string size may increase during a case conversion operation.
-- This function calculates the new buffer size for a case conversion.
-- Returns Nothing if no case conversion is needed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the size decrease as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, it could!

where
!nextI = nextWithIndexer getIdx
eSize !e = if e == '\0'
then 0
Copy link
Contributor

@ndmitchell ndmitchell Jan 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is 0 of size 0 and not 1? Scratch that, I understand now

newtype StepASCII = StepASCII Word8

-- | Specialized tuple used for case mapping.
data CM = CM {-# UNPACK #-} !Char {-# UNPACK #-} !Char {-# UNPACK #-} !Char deriving (Eq)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be Char or Word8?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document the invariant that it's isomorphic to [Char] and that all trailing Char must be \0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest instead using CM Int Int Int with -1 as the sentinel, since the rest of the foundation stuff is Int not Char for the UTFsize etc.

, CheckPlan "B should capitalize to B" $ validate "B" $ upper "B" == "B"
, CheckPlan "é should capitalize to É" $ validate "é" $ upper "é" == "É"
, CheckPlan "ß should capitalize to SS" $ validate "ß" $ upper "ß" == "SS"
, CheckPlan "ffl should capitalize to FFL" $ validate "" $ upper "" == "FFL"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a random test that capitalising is idempotent or something? Anything that hammers on long strings will show up GC and segfault issues which are the most likely problems.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given all the sentinels around 0 please make sure you test both \0a and \0\0 (one that requires transforming and one that does not). I suspect they are broken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea!

Copy link
Contributor Author

@picnoir picnoir Jan 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I added the property test.

I am a bit confused by your second comment thought. Isn't \x0a LF and \0\0 two nulls char? I am assuming you would like a test case where the heading bytes are 0s, would upper à == À (upper \x00e0 == \x00c0) do the trick? (I just tested, it capitalizes this case successfully)

I am maybe missing the point here...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant ['\0','a']. Specifically, try a test where nothing in the string converts (so it fails out early), and a test where they do convert (so you test the conversion path too).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, got it!

Done.

@@ -0,0 +1,41 @@
{-# LANGUAGE OverloadedStrings #-}

module CaseMapping where
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always require an explicit export list. Not sure how Vincent feels though (if he does then an HLint rule could enforce it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, my bad, I made that during the prototyping phase, completely forgot to add it when cleaning...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine either way, since it's a on the side script

Copy link
Contributor

@ndmitchell ndmitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the use of \0 as the sentinel in CM is problematic for strings containing Int. Suggest you:

  • Switch to Int and use -1 as the sentinel.
  • Always have a fast-path for the second char being -1, as that will be the common case by a vast margin and can avoid looking at the third char.

newtype StepASCII = StepASCII Word8

-- | Specialized tuple used for case mapping.
data CM = CM {-# UNPACK #-} !Char {-# UNPACK #-} !Char {-# UNPACK #-} !Char deriving (Eq)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document the invariant that it's isomorphic to [Char] and that all trailing Char must be \0.

| otherwise = do
let !(c, idx') = nextI idx
!cm@(CM c1 c2 c3) = op c
!cSize = (eSize c1) + (eSize c2) + (eSize c3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary brackets

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also if c2 == 0 we can avoid adding in c3, which is going to be a saving in the common case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

let !(c, idx') = nextI idx
!cm@(CM c1 c2 c3) = op c
!cSize = (eSize c1) + (eSize c2) + (eSize c3)
!nconvert = convert || not ((c1 == c) && (c2 == '\0') && (c3 == '\0'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to test c3? If c2 == 0 doesn't that imply c3 == 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed it does :)

Fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole expression might be clearer as changed' = changed || c1 /= c || c2 /= 0

then 0
else charToBytes (fromEnum e)
loop !idx ns convert
| idx == end = if convert
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest renaming convert to changed - you are always doing the conversion, the question is if things change during that process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

caseConvert op s@(String ba)
= case nBuff of
Nothing -> s
(Just nLen) -> runST $ unsafeCopyFrom s nLen go
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary brackets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

let !(CM c1 c2 c3) = op c
dstIdx' <- writeChar c1 dstIdx
dstIdx'' <- writeChar c2 dstIdx'
nextDstIdx <- writeChar c3 dstIdx''
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if c2 is 0 can't you shortcut testing c3? It makes the code less clean, but it's so very common that it's probably worth it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sadly makes the code quite messy. But your're right, the common case should be short.

Fixed, if the code looks too messy, I can revert this change.

newtype StepASCII = StepASCII Word8

-- | Specialized tuple used for case mapping.
data CM = CM {-# UNPACK #-} !Char {-# UNPACK #-} !Char {-# UNPACK #-} !Char deriving (Eq)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest instead using CM Int Int Int with -1 as the sentinel, since the rest of the foundation stuff is Int not Char for the UTFsize etc.

, CheckPlan "B should capitalize to B" $ validate "B" $ upper "B" == "B"
, CheckPlan "é should capitalize to É" $ validate "é" $ upper "é" == "É"
, CheckPlan "ß should capitalize to SS" $ validate "ß" $ upper "ß" == "SS"
, CheckPlan "ffl should capitalize to FFL" $ validate "" $ upper "" == "FFL"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given all the sentinels around 0 please make sure you test both \0a and \0\0 (one that requires transforming and one that does not). I suspect they are broken.

@ndmitchell
Copy link
Contributor

I will say this patch looks really really good - my requested changes (which @vincenthz really needs to confirm are the direction he wants to take things) are pretty minor - looks very nice. I definitely owe you a beer.

@picnoir
Copy link
Contributor Author

picnoir commented Jan 12, 2018

Thanks for this quick review Neil!

Should I squash the requested changes or should I push a proper commit?

@ndmitchell
Copy link
Contributor

That's a question for @vincenthz (although my general feeling is if people want squashed commits they can do it with the merge so no reason to get the contributors to do it)

@ndmitchell
Copy link
Contributor

Do let me know when you've done all the changes you are planning on doing and I'll re-review.

@picnoir
Copy link
Contributor Author

picnoir commented Jan 12, 2018

I just posted the changes.

The only change I did not push was the CM elements type change to Int. I'll wait for @vincenthz feedback to make my mind on that. It is a minor change anyways, we just need to add a toEnum call on the code generator side and a fromEnum call on the library side. I'll add this change tomorrow night if needed.

Until then, thanks again for the review and enjoy your weekend!

@ndmitchell
Copy link
Contributor

Ah, not "0a" but "\0a" - so a null character followed by a. My suspicion is that the code is wrong for strings containing \0, and that you'll have to move to Int so you can have access to a sentinel value (e.g. -1).

let !(c, idx') = nextI idx
!cm@(CM c1 c2 c3) = op c
!cSize = if c2 == '\0' -- if c2 is empty, c3 will be empty as well.
then eSize c1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For c1 you can just call charToBytes directly and avoid one test against \0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*facepalm

Thanks!

!(Step c nextSrcIdx) = next src' srcIdx
writeChar cc wIdx =
if cc == '\0'
then return wIdx
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can avoid the \0 test for c1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am directly using write for C1, see line 1360. But you're right, writeChar is a confusing name, I guess writeValidChar would be more appropriate!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am directly using write for C1, see line 1360. But you're right, writeChar is a confusing name, I guess writeValidChar would be more appropriate!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe writeMaybeChar?

then return dstIdx'
else do
dstIdx'' <- writeChar c2 dstIdx'
writeChar c3 dstIdx''
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I'd just keep reusing dstIdx shadowing it as that eliminates a nasty class of failures where you get the wrong number of primes.

@picnoir
Copy link
Contributor Author

picnoir commented Jan 12, 2018

Oh yes! I finally get the problem!

Indeed, it feels like the change to Int is mandatory...

@picnoir
Copy link
Contributor Author

picnoir commented Jan 12, 2018

As you can see, handling \0 is not problematic anymore. This is a side effect of the latest changes (we do not check C1 anymore).

It still does feel like a time bomb to me, we need to change CM inner type to Int.

I don't have the time to do it right now, it'll be done before Monday.

@ndmitchell
Copy link
Contributor

Ah, I didn't think of just not checking c1 - with that I'm happy for it to be either Int or Char - you can argue either way.

@vincenthz vincenthz changed the title Issue 271 Unicode Case Mapping Jan 12, 2018
writeValidChar cc wIdx =
if cc == '\0'
then return wIdx
else do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style nitpick: the usual style is:

if x
    then a
    else b

cfs <- case pcf of
Left err -> putStrLn (show err) >> undefined
Right cf -> return cf
h <- openFile ("../../basement/Basement/String/CaseMapping.hs") WriteMode
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be easier to have the haskell program write to stdout and redirect the stdout to the file when wanting to rewrite

@vincenthz
Copy link
Member

nice work @NinjaTrappeur ! also thanks for the review @ndmitchell !

@vincenthz vincenthz merged commit 1bb7fca into haskell-foundation:master Jan 13, 2018
@vincenthz
Copy link
Member

I've squashed the commit list since @NinjaTrappeur requested it, but otherwise I don't have a problem to include the history (even if not perfect) in a normal merge. I believe there was no other stuff on @ndmitchell list of things (despite still being in red condition, let use @ndmitchell if I forgot anything). Thanks again @NinjaTrappeur

@ndmitchell
Copy link
Contributor

The main thing we need to know is your approach to generation. Mine would be run in ever travis run, fail if there are changes, which means it will fail once a year. Other people have different approaches.

@ndmitchell
Copy link
Contributor

And thinking of it, a test showing it is equal to text package would give a lot of confidence.

@picnoir
Copy link
Contributor Author

picnoir commented Jan 13, 2018

I agree with @ndmitchell about the travis part:

  • it will run periodically the script, preventing it being outdated.
  • it will act as a heads up for every Unicode release. I just realized the text package is 1 release late.

I was actually just writing that part here.

It seems to fail on GHC < 8.2, it needs some adjustments. I'll have a deeper look on that tomorrow.

I'll also have a look on a test showing it is equal to text. I think it will also interesting to benchmark the two implementation and see if everything is alright performance wise.

@vincenthz Do you mind if I open a new PR where I can submit those changes?

@vincenthz
Copy link
Member

@NinjaTrappeur I don't mind at all, the more the merrier.

As to a testing plan, I don't mind too much, the more automated the less likely it's going to create out-of-date problem, but also would like to make sure we still have fast tests (which means minimal extra/optional dependencies).

Showing equivalence to text is useful (provided that it's not out of date) but it's probably a good idea to limit it to the edge travis run which depends on text and bytestring already.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants