Convert functions to use code points #15

MonoidMusician · 2017-11-20T21:18:26Z

This is more accurate because CodePoint represents any unicode character, whereas Char only works with non-astral characters, due to its JavaScript representation. Right?

I'll probably submit a PR to -strings to add an equivalent of charPoint directly to Data.String.CodePoints, which will clean this up a bit.

cdepillabout · 2017-11-21T02:58:04Z

@MonoidMusician Thanks for this PR!

To be honest, I'm not super knowledgeable about Unicode. Would you be able to link me to something explaining the difference between JavaScript Chars and actual Unicode characters?

Before I merge this, could you do the following things:

update the tests with unit tests that are not possible to write when using Char, but are now possible when using CodePoint
update the examples in the documentation for each function (for instance the isPunctuation function: https://github.com/purescript-contrib/purescript-unicode/pull/15/files#diff-a8371e327cfff9b134c465f53ecdd381L407)
create a Changelog.md, add it to package.json (or bower.json or wherever it needs to go), and add an explanation about changing Char to CodePoint, linking back to this PR

Also, purescript-parsing depends on purescript-unicode. After this is merged, would you be willing to send a small PR to purescript-parsing updating their uses of purescript-unicode?

cdepillabout · 2017-11-21T03:02:53Z

@michaelficarra, would it be possible for you to take a quick glance at this PR and make sure everything looks sane?

It looks like you're the author of Data.String.CodePoint, so I was thinking you might be the most qualified to tell whether this use of it is good.

In case you haven't seen purescript-unicode before, it is basically a straight copy of Haskell's Data.Char module.

cdepillabout · 2017-11-21T03:15:26Z

Also, I wonder if it makes sense to change the module name from Data.Char.Unicode, to something like Data.CodePoint.Unicode.

michaelficarra · 2017-11-21T15:59:28Z

src/Data/Char/Unicode.purs

-toUpper :: Char -> Char
-toUpper = fromCharCode <<< uTowupper <<< toCharCode
+toUpper :: CodePoint -> CodePoint
+toUpper = modify uTowupper


toUpper can't be CodePoint -> CodePoint, it must be CodePoint -> String. But maybe that's something we should address separately, since I see it's currently broken anyway.

Okay ... Is this because some precomposed characters don't have direct upper case equivalents and would need to be decomposed?

Possibly, but also see https://unicode.org/faq/casemap_charprop.html#11

Okay, cool ... I'm not familiar with the representation of the rules, so I wouldn't know how to fix that. Also, would we want to return String or Array CodePoint? (maybe the latter for the Internal module?)

Array CodePoint sounds good.

Currently it looks like toUpper does not change the eszett.

@MonoidMusician could you also add some documentation to toUpper explaining why the type is CodePoint -> Array CodePoint?

Also, I'm wondering if it would be helpful to have a function like unsafeToUpper :: CodePoint -> CodePoint for people who want to easily map over a list of codepoints, without worrying about being 100% correct. Its use should be discouraged of course.
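
To make the type question above concrete, here is a minimal TypeScript sketch of why a full uppercase mapping has to return a sequence of code points. It is not the library's implementation: it leans on the JavaScript runtime's `toUpperCase` instead of generated tables, and the names `toCodePoints`, `fromCodePoint`, `toUpperFull`, and `toUpperSingle` are hypothetical. The length-preserving variant only approximates what an `unsafeToUpper` might do, since it falls back to the input whenever the full mapping expands.

```typescript
// Decode a UTF-16 JavaScript string into an array of code points.
function toCodePoints(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const hi = s.charCodeAt(i);
    const lo = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
    if (hi >= 0xd800 && hi <= 0xdbff && lo >= 0xdc00 && lo <= 0xdfff) {
      out.push((hi - 0xd800) * 0x400 + (lo - 0xdc00) + 0x10000);
      i++; // the low surrogate was consumed together with the high one
    } else {
      out.push(hi);
    }
  }
  return out;
}

// Encode a single code point as a UTF-16 JavaScript string.
function fromCodePoint(cp: number): string {
  if (cp < 0x10000) return String.fromCharCode(cp);
  const off = cp - 0x10000;
  return String.fromCharCode(0xd800 + (off >> 10), 0xdc00 + (off & 0x3ff));
}

// Full uppercase mapping: one code point in, possibly several out.
// Assumption: the runtime's toUpperCase stands in for the library's tables.
function toUpperFull(cp: number): number[] {
  return toCodePoints(fromCodePoint(cp).toUpperCase());
}

// Hypothetical length-preserving variant: keep the input unchanged
// whenever the full mapping would produce more than one code point.
function toUpperSingle(cp: number): number {
  const full = toUpperFull(cp);
  return full.length === 1 ? full[0] : cp;
}
```

With this sketch, `toUpperFull(0xdf)` yields two code points (U+0053 U+0053), while `toUpperSingle(0xdf)` leaves the eszett as it is, which matches the behaviour observed for the current toUpper.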

michaelficarra · 2017-11-21T16:00:53Z

I don't like the use of unsafePartial, but yes, this looks like the right direction. Also, I agree that something like charPoint should be built in.

MonoidMusician · 2017-11-21T17:57:33Z

@cdepillabout Thanks for the suggestions!

According to the documentation for Prim.Char [1], it represents one UTF-16 code unit, i.e. all non-astral code points. But CodePoint is a newtype around Int so it can represent any Unicode value.

Int `superset` CodePoint `superset` Char I believe

I should make a PR for -strings to make most of those changes ... maybe codePoint{To,From}Char would be a good name instead of charPoint?

[1] https://pursuit.purescript.org/builtins/docs/Prim#t:Char
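
The code unit versus code point distinction can be seen directly in the JavaScript string representation. The following TypeScript sketch is only an illustration: `toCodePoints` is a hypothetical helper, not part of purescript-strings, and it decodes surrogate pairs by hand so the arithmetic is visible.

```typescript
// Decode a JavaScript string (a sequence of UTF-16 code units)
// into an array of Unicode code points.
function toCodePoints(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const hi = s.charCodeAt(i);
    const lo = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
    // A high surrogate followed by a low surrogate encodes one astral code point.
    if (hi >= 0xd800 && hi <= 0xdbff && lo >= 0xdc00 && lo <= 0xdfff) {
      out.push((hi - 0xd800) * 0x400 + (lo - 0xdc00) + 0x10000);
      i++; // skip the low surrogate that was just consumed
    } else {
      out.push(hi);
    }
  }
  return out;
}

// U+0041 fits in a single code unit, so a Char can hold it.
const bmp = "A";
// U+1D400 MATHEMATICAL BOLD CAPITAL A needs two code units (a surrogate pair),
// so no single Char can represent it, but a single CodePoint can.
const astral = "\ud835\udc00";

const bmpUnits = bmp.length;                     // number of code units
const astralUnits = astral.length;               // number of code units
const astralPoints = toCodePoints(astral).length; // number of code points
```

Here `astralUnits` is 2 while `astralPoints` is 1, which is the gap a Char-based API cannot bridge.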

cdepillabout · 2017-11-22T03:18:42Z

Also, I wonder if it makes sense to change the module name from Data.Char.Unicode, to something like Data.CodePoint.Unicode.

After thinking about it some more, I think this might be a good change to make.

It would make it possible for us to go back and re-add a Data.Char.Unicode module that works on Char instead of CodePoint. That would make it easy for people to use without having to figure out the relationship between Char and CodePoint. (I'm assuming that most users

The documentation for Data.Char.Unicode should point out its deficiencies, and also point users to the more-correct Data.CodePoint.Unicode module.

michaelficarra · 2017-11-22T03:34:32Z

Why even have the Data.Char one? We have code unit based functions in -strings only for legacy and are considering moving around the modules to make CodePoint the default.

cdepillabout · 2017-11-22T05:21:40Z

@michaelficarra

We have code unit based functions in -strings only for legacy and are considering moving around the modules to make CodePoint the default.

Does "code unit" mean PureScript's Char?

I was thinking that there would be a lot of [beginner?] PureScript programmers that are using Char (since it is explained in the PureScript book and available in Prim). Many of them may want functions like toUpper, toLower, isSymbol, etc. that they can easily use, without having to figure out the CodePoint type. That's why we should also provide a Data.Char.Unicode module in addition to (my proposed) Data.CodePoint.Unicode.

However, there will certainly be people that want to handle Unicode correctly. They can use the Data.CodePoint.Unicode module. The Data.Char.Unicode module should specifically recommend using Data.CodePoint.Unicode.

However, if the rest of the PureScript community is moving to completely use CodePoint instead of Char, then I think that you are correct. We don't really need the Data.Char.Unicode module.

MonoidMusician · 2017-12-27T06:29:33Z

I will update this PR once this is merged and released: purescript/purescript-strings#92

It looks like the mapping of “ß” (00DF) to “ss” (0073 0073) is not specified in UnicodeData.txt but rather in CaseFolding.txt. Should we parse that as well and use that for case data? Like you said, CodePoint -> Array CodePoint makes sense here, and we can also expose String -> String by using concatMap with the appropriate conversions.

# UnicodeData.txt specifies:
00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;
# CaseFolding.txt specifies:
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
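
As a rough sketch of what parsing these two lines involves (TypeScript, with hypothetical names `FoldEntry`, `parseFoldLine`, `simpleUpperField`, and `mapAll`; this is not the project's awk or PureScript generator): the CaseFolding.txt row carries a multi-code-point mapping, while field 12 of the UnicodeData.txt row (simple uppercase) is empty for 00DF. Note that 0073 0073 is the lowercase fold; the uppercase expansion to 0053 0053 is listed in SpecialCasing.txt instead.

```typescript
// One parsed CaseFolding.txt row: "<code>; <status>; <mapping>; # <name>".
interface FoldEntry {
  code: number;
  status: string;
  mapping: number[];
}

// Parse a single CaseFolding.txt line; comment-only or blank lines give null.
function parseFoldLine(line: string): FoldEntry | null {
  const hash = line.indexOf("#");
  const data = (hash >= 0 ? line.slice(0, hash) : line).trim();
  if (data === "") return null;
  const fields = data.split(";").map((f) => f.trim());
  if (fields.length < 3) return null;
  return {
    code: parseInt(fields[0], 16),
    status: fields[1],
    mapping: fields[2].split(/\s+/).map((h) => parseInt(h, 16)),
  };
}

// UnicodeData.txt: field 12 is the simple uppercase mapping; empty means none.
function simpleUpperField(line: string): number | null {
  const fields = line.split(";");
  const upper = fields.length > 12 ? fields[12] : "";
  return upper === "" ? null : parseInt(upper, 16);
}

// concatMap-style application over a sequence of code points:
// code points missing from the table are passed through unchanged.
function mapAll(table: { [code: number]: number[] }, cps: number[]): number[] {
  let out: number[] = [];
  for (let i = 0; i < cps.length; i++) {
    const mapped = table[cps[i]];
    out = out.concat(mapped !== undefined ? mapped : [cps[i]]);
  }
  return out;
}
```

A String -> String wrapper would decode the string to code points, call `mapAll`, and encode the result again.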

I've started my own standard parsing library, so I could probably attempt to convert our parser here to PS from awk: https://github.com/MonoidMusician/purescript-uievents-key/blob/master/test/Generate.purs

Also should we generate it from the latest 10.0.0 instead of 6.0.0 linked in the README? I'm not sure how significant the changes are (more emoji anyone?? 😄)

And my last point: I renamed it to Data.CodePoints.Unicode but the strings package uses Data.String.CodePoints ... should I go to Data.String.CodePoints.Unicode to be consistent with that? I personally don't think that keeping Char around makes much sense, especially since we will be able to easily upcast that once the PR is merged.

Thanks for the feedback from both of you!

michaelficarra · 2018-01-02T03:57:04Z

Should we parse that as well and use that for case data?

Yes, if you want to provide case mapping functions, you should base them on that table.

Also should we generate it from the latest 10.0.0 instead of 6.0.0 linked in the README?

Yes, but stay consistent within major releases of purescript-unicode.

should I go to Data.String.CodePoints.Unicode to be consistent with that?

No, see purescript/purescript-strings#95.

cdepillabout · 2018-05-22T12:46:14Z

It looks like there has been some discussion about moving to CodePoints on purescript/purescript-strings#95, so it seems like it might be a good time to revisit this issue!

kritzcreek · 2018-05-25T09:35:57Z

I'm going to need a response to this "quickly" or I'll make a 0.12 release with the old API and we'll need to make another breaking release to incorporate this PR.

cdepillabout · 2018-05-25T12:25:41Z

@MonoidMusician Does this PR need to be updated? Or can it be merged in as-is?

MonoidMusician · 2018-05-25T12:31:38Z

I am starting to work on updating it :) Special casing might be a separate PR though ... maybe we should keep toUpper/toLower as-is (or rename now), and add new functions for Special Casing?

kritzcreek · 2018-05-25T12:48:22Z

@MonoidMusician For anything that you do, make sure you do it against the compiler/0.12 branch. This package-set should have all the dependencies you need: https://github.com/kRITZCREEK/package-sets/tree/test-core

EDIT:

fromCharCode returns a Maybe now, but since this library is pretty low-level you might want to redefine a non-Maybe version with the FFI and use that whenever you're sure you'll stay within bounds.
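
The trade-off between a checked and an unchecked conversion can be sketched in TypeScript as follows. Both function names are hypothetical, and `undefined` stands in for PureScript's `Nothing`; the point is only that JavaScript's `String.fromCharCode` silently truncates its argument to 16 bits, so the unchecked version is safe only when the caller already knows the value is in range.

```typescript
// Checked conversion: returns undefined (standing in for Nothing)
// when n is not an integer in the UTF-16 code unit range 0..0xFFFF.
function fromCharCodeChecked(n: number): string | undefined {
  if (n % 1 !== 0 || n < 0 || n > 0xffff) return undefined;
  return String.fromCharCode(n);
}

// Unchecked conversion: a thin wrapper over the native call.
// Out-of-range input is truncated to 16 bits rather than rejected,
// so 0x10041 silently becomes 0x0041.
function fromCharCodeUnchecked(n: number): string {
  return String.fromCharCode(n);
}
```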

MonoidMusician · 2018-05-25T12:56:00Z

Hm, the dev dependency purescript-spec isn't updated ...

They work with local modifications ...

MonoidMusician · 2018-05-25T15:24:36Z

I had to fix up the testing libraries locally, but the tests pass here now. @kritzcreek suggested using purescript-assert so we don't have to "wait" on spec being updated ...

jk tests still pass

thomashoneyman

This is really impressive, @MonoidMusician, and thank you for your work. I only have minor comments to make; on the whole it's good to see this modernized and ready for Unicode 13.0.0, as well as on a sounder footing with the switch to code points.

Admittedly, I'm not particularly well-versed in the subtleties of unicode, so perhaps if @michaelficarra has time for another look that would be helpful as well.

src/Data/String/Unicode.purs

src/Data/CodePoint/Unicode.purs

.github/workflows/ci.yml

thomashoneyman · 2021-01-25T00:08:33Z

src/Data/CodePoint/Unicode/Internal.purs

@@ -1,10 +1,10 @@
 -----------------------------------------------------------
 -- This is an automatically generated file: do not edit
-- Generated by ubconfc at Fri Nov 10 20:05:16 PST 2017
+-- Generated by ubconfc at Tue Jan 12 18:57:20 CST 2021


I don't have a good way to verify this file other than seeing that the tests still pass. From a scan through it looks fine to me. If you have any suggestions on verifying this, @MonoidMusician, I'm all ears!

Also, we have some documentation that describes generating these files anew:

https://github.com/purescript-contrib/purescript-unicode/tree/main/docs#generating-internal-modules

You've made some changes to how these are generated; if there's any information you think would be helpful to maintainers working on this in the future, please consider adding it to that documentation file as well. Thanks!

Thanks, I missed that it was moved to there, will update.

thomashoneyman · 2021-01-25T00:11:16Z

fullcase.js

+import Data.CodePoint.Unicode.Internal (bsearch, uTowlower, uTowtitle, uTowupper)
+import Data.Maybe (Maybe(..))
+
+type CaseRec =


Could we add documentation comments to this and the declarations throughout the file, as this is a public module? Simple ones are fine, but they can help guide maintainers in the future as well.

I moved it to a new Internal folder – users should not use these functions on Ints but instead the ones on CodePoints.
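
For future maintainers, the general shape of a binary-search lookup over a sorted casing table might look like the TypeScript sketch below. Everything here is an assumption for illustration: the `CaseRow` record, its field names, the three sample rows, and the `simple` fallback parameter are hypothetical and do not reproduce the generated module's `CaseRec` or `bsearch`.

```typescript
// Hypothetical row of a full-casing table, sorted by `code`.
interface CaseRow {
  code: number;
  upper: number[];
}

// A tiny sample table; the real table is generated from the Unicode data files.
const rows: CaseRow[] = [
  { code: 0x00df, upper: [0x0053, 0x0053] }, // ß -> SS
  { code: 0x0149, upper: [0x02bc, 0x004e] }, // ŉ -> ʼN
  { code: 0xfb00, upper: [0x0046, 0x0046] }, // ﬀ -> FF
];

// Classic binary search on the sorted `code` field.
function bsearchRow(table: CaseRow[], code: number): CaseRow | undefined {
  let lo = 0;
  let hi = table.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const row = table[mid];
    if (row.code === code) return row;
    if (row.code < code) lo = mid + 1;
    else hi = mid - 1;
  }
  return undefined;
}

// Use the full mapping when the table has one, else the simple 1:1 mapping.
function upperWithFallback(
  table: CaseRow[],
  code: number,
  simple: (cp: number) => number
): number[] {
  const row = bsearchRow(table, code);
  return row !== undefined ? row.upper : [simple(code)];
}
```

Keeping a lookup like this on raw integers in an internal module, with CodePoint wrappers on top, matches the split described in the reply above.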

Co-authored-by: Thomas Honeyman <admin@thomashoneyman.com>

And update documentation for generating internal modules

Instead of hypothetical performance benefits

MonoidMusician · 2021-01-25T19:50:01Z

Okay, thanks for the feedback! I think I addressed what was addressable.

packages.dhall

thomashoneyman

👍 Looks good to me! Thanks for your work! If no other maintainer notices an issue before this must be merged in order to make the 0.14 release, then I will merge this PR. I'm going to leave it open in the meantime.

fullcase.js

README.md

fullcase.js

michaelficarra

Generally LGTM. What do you think about also exporting the Unicode version as a String?

Co-authored-by: Michael Ficarra <github@michael.ficarra.me>

MonoidMusician · 2021-01-27T04:43:02Z

Okay, I think I addressed your feedback, thanks! I don't think we need to export the Unicode version from the library, I just want to make sure it's available for documentation when someone goes looking for it. Unicode seems to be pretty stable so I don't think it affects most users.

MonoidMusician · 2021-01-27T05:00:13Z

I just noticed that this fixes #28 (I used hexadecimal escapes).

thomashoneyman · 2021-01-27T05:49:49Z

Fantastic work, @MonoidMusician! Thank you, and if anyone does happen to notice an issue we can merge a fix.

michaelficarra reviewed Nov 21, 2017

MonoidMusician force-pushed the master branch from 7a544c3 to 99e1549 Compare December 26, 2017 05:47

kritzcreek changed the base branch from master to compiler/0.12 May 25, 2018 13:41

MonoidMusician and others added 5 commits May 25, 2018 08:46

Convert functions to use code points

54d4ecf

This is more accurate because`CodePoint` represents any unicode character, whereas `Char` only works with non-astral characters, due to its JavaScript representation.

Change namespace to Data.CodePoint.Unicode

8bf5762

Update docs too

e38d89a

Update for 0.12

2926f07

Update tests

fbfddea

They work with local modifications ...

MonoidMusician force-pushed the master branch from f6147f7 to fbfddea Compare May 25, 2018 13:47

MonoidMusician added 3 commits May 25, 2018 10:12

Clean up docs again, fix out of bounds error

2535356

Update to Unicode 10.0.0

5a47d12

Fix import warnings

ef04b33

YOLO

61fe785

jk tests still pass

MonoidMusician and others added 2 commits January 13, 2021 22:41

Update changelog

2dcb0d1

Update .gitignore

9c9fca3

thomashoneyman requested changes Jan 25, 2021

MonoidMusician and others added 8 commits January 25, 2021 12:51

Apply suggestions from code review

40f0ff8

Co-authored-by: Thomas Honeyman <admin@thomashoneyman.com>

conv -> convert

dbf5853

Co-authored-by: Thomas Honeyman <admin@thomashoneyman.com>

Move Casing to Internal module

3975999

And update documentation for generating internal modules

Address more review suggestions

3886937

Update README

b7ccb27

Mention that *Simple variants preserve the number of code points

04e348f

Instead of hypothetical performance benefits

Merge remote-tracking branch 'purescript-contrib/main'

6cccaaf

Oops, need to rename usages of conv -> convert

ae100fa

thomashoneyman reviewed Jan 25, 2021

packages.dhall

Update packages.dhall

ba77c47

thomashoneyman approved these changes Jan 25, 2021
