Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert functions to use code points #15

Merged
merged 34 commits into from Jan 27, 2021

Conversation

MonoidMusician
Copy link
Contributor

This is more accurate because CodePoint represents any unicode character, whereas Char only works with non-astral characters, due to its JavaScript representation. Right?

I'll probably submit a PR to -strings to add an equivalent of charPoint directly to Data.String.CodePoints, which will clean this up a bit.

@cdepillabout
Copy link
Member

cdepillabout commented Nov 21, 2017

@MonoidMusician Thanks for this PR!

To be honest, I'm not super knowledgable about unicode. Would you be able to link me to something explaining the difference between Javascript Chars and actual unicode characters?

Before I merge this, could you do the following things:

Also, purescript-parsing depends on purescript-unicode. After this is merged, would you be willing to send a small PR to purescript-parsing updating their uses of purescript-unicode?

@cdepillabout
Copy link
Member

@michaelficarra, would it be possible for you to take a quick glance at this PR and make sure everything looks sane?

It looks like you're the author of Data.String.CodePoint, so I was thinking you might be the most qualified to tell whether this use of it is good.

In case you haven't see purescript-unicode before, it is basically a straight copy of Haskell's Data.Char module.

@cdepillabout
Copy link
Member

Also, I wonder if it makes sense to change the module name from Data.Char.Unicode, to something like Data.CodePoint.Unicode.

toUpper :: Char -> Char
toUpper = fromCharCode <<< uTowupper <<< toCharCode
toUpper :: CodePoint -> CodePoint
toUpper = modify uTowupper
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

toUpper can't be CodePoint -> CodePoint, it must be CodePoint -> String. But maybe that's something we should address separately, since I see it's currently broken anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay ... Is this because a some precomposed characters don't have direct upper case equivalents and would need to be decomposed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, cool ... I'm not familiar with the representation of the rules, so I wouldn't know how to fix that. Also, would we want to return String or Array CodePoint? (maybe the latter for the the Internal module?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Array CodePoint sounds good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently it looks like toUpper does not change the eszett.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MonoidMusician could you also add some documentation to toUpper explaining why the type is CodePoint -> Array CodePoint?

Also, I'm wondering if it would be helpful to have a function like unsafeToUpper :: CodePoint -> CodePoint for people who want to easily map over a list of codepoints, without worrying about being 100% correct. Its use should be discouraged of course.

@michaelficarra
Copy link
Contributor

I don't like the use of unsafePartial, but yes, this looks like the right direction. Also, I agree that something like charPoint should be built in.

@MonoidMusician
Copy link
Contributor Author

@cdepillabout Thanks for the suggestions!

According the the documentation for Prim.Char [1], it represents one UTF-16 code unit, i.e. all non-astral code points. But CodePoint is a newtype around Int so it can represent any Unicode value.

Int `superset` CodePoint `superset` Char I believe

I should make a PR for -strings to make most of those changes ... maybe codePoint{To,From}Char would be a good name instead of charPoint?

[1] https://pursuit.purescript.org/builtins/docs/Prim#t:Char

@cdepillabout
Copy link
Member

Also, I wonder if it makes sense to change the module name from Data.Char.Unicode, to something like Data.CodePoint.Unicode.

After thinking about it some more, I think this might be a good change to make.

It would make it possible for us to go back and re-add a Data.Char.Unicode module that works on Char instead of CodePoint. That would make it easy for people to use without having to figure out the relationship between Char and CodePoint. (I'm assuming that most users

The documentation for Data.Char.Unicode should point out its deficiencies, and also point users to the more-correct Data.CodePoint.Unicode module.

@michaelficarra
Copy link
Contributor

Why even have the Data.Char one? We have code unit based functions in -strings only for legacy and are considering moving around the modules to make CodePoint the default.

@cdepillabout
Copy link
Member

cdepillabout commented Nov 22, 2017

@michaelficarra

We have code unit based functions in -strings only for legacy and are considering moving around the modules to make CodePoint the default.

Does "code unit" mean PureScript's Char?

I was thinking that there would be a lot of [beginner?] PureScript programmers that are using Char (since it is explained in the PureScript book and available in Prim). Many of them may want functions like toUpper, toLower, isSymbol etc they can easily use, without having to figure out the CodePoint type. That's why we should also provide a Data.Char.Unicode module in addition to (my proposed) Data.CodePoint.Unicode.

However, there will certainly be people that want to handle unicode correctly. They can use the Data.CodePoint.Unicode module. The Data.Char.Unicode should specifically recommend using Data.CodePoint.Unicode.


However, if the rest of the PureScript community is moving to completely use CodePoint instead of Char, then I think that you are correct. We don't really need the Data.Char.Unicode module.

@MonoidMusician
Copy link
Contributor Author

I will update this PR once this is merged and released: purescript/purescript-strings#92

It looks like the mapping of “ß” 00DF to “SS” 0073 0073 is not specified in UnicodeData.txt but rather CaseFolding.txt. Should we parse that as well and use that for case data? Like you said, CodePoint -> Array CodePoint makes sense here, and we can also expose String -> String by using concatMap with the appropriate conversions.

# UnicodeData.txt specifies:
00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;
# CaseFolding.txt specifies:
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S

I've started my own standard parsing library so if I could probably attempt to convert our parser here to PS from awk: https://github.com/MonoidMusician/purescript-uievents-key/blob/master/test/Generate.purs

Also should we generate it from the latest 10.0.0 instead of 6.0.0 linked in the README? I'm not sure how significant the changes are (more emoji anyone?? 😄)

And my last point: I renamed it to Data.CodePoints.Unicode but the strings package uses Data.String.CodePoints ... should I go to Data.String.CodePoints.Unicode to be consistent with that? I personally don't think that keeping Char around makes much sense, especially since we will be able to easily upcast that once the PR is merged.

Thanks for the feedback from both of you!

@michaelficarra
Copy link
Contributor

Should we parse that as well and use that for case data?

Yes, if you want to provide case mapping functions, you should base them on that table.

Also should we generate it from the latest 10.0.0 instead of 6.0.0 linked in the README?

Yes, but stay consistent within major releases of purescript-unicode.

should I go to Data.String.CodePoints.Unicode to be consistent with that?

No, see purescript/purescript-strings#95.

@cdepillabout
Copy link
Member

It looks like there has been some discussion about moving to CodePoints on purescript/purescript-strings#95, so it seems like it might be a good time to revisit this issue!

@kritzcreek
Copy link
Contributor

I'm going to need a response to this "quickly" or I'll make a 0.12 release with the old API and we'll need to make another breaking release to incorporate this PR.

@cdepillabout
Copy link
Member

@MonoidMusician Does this PR need to be updated? Or can it be merged in as-is?

@MonoidMusician
Copy link
Contributor Author

I am starting to work on updating it :) Special casing might be a separate PR though ... maybe we should keep toUpper/toLower as-is (or rename now), and add new functions for Special Casing?

@kritzcreek
Copy link
Contributor

kritzcreek commented May 25, 2018

@MonoidMusician For anything that you do, make sure you do it against the compiler/0.12 branch. This package-set should have all the dependencies you need: https://github.com/kRITZCREEK/package-sets/tree/test-core

EDIT:

fromCharCode returns a Maybe now, but since this library is pretty low-level you might want to redefine a non-Maybe version with the FFI and use that whenever you're sure you'll stay within bounds.

@MonoidMusician
Copy link
Contributor Author

Hm, the dev dependency purescript-spec isn't updated ...

@kritzcreek kritzcreek changed the base branch from master to compiler/0.12 May 25, 2018 13:41
MonoidMusician and others added 5 commits May 25, 2018 08:46
This is more accurate because`CodePoint` represents any unicode character, whereas `Char` only works with non-astral characters, due to its JavaScript representation.
They work with local modifications ...
@MonoidMusician
Copy link
Contributor Author

MonoidMusician commented May 25, 2018

I had to fix up the testing libraries locally, but the tests pass here now. @kritzcreek suggested using purescript-assert so we don't have to "wait" on spec being updated ...

jk tests still pass
Copy link
Contributor

@thomashoneyman thomashoneyman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really impressive, @MonoidMusician, and thank you for your work. I only have minor comments to make; on the whole it's good to see this modernized and ready for Unicode 13.0.0, as well as on a sounder footing with the switch to code points.

Admittedly, I'm not particularly well-versed in the subtleties of unicode, so perhaps if @michaelficarra has time for another look that would be helpful as well.

src/Data/String/Unicode.purs Outdated Show resolved Hide resolved
src/Data/String/Unicode.purs Outdated Show resolved Hide resolved
src/Data/String/Unicode.purs Outdated Show resolved Hide resolved
src/Data/String/Unicode.purs Show resolved Hide resolved
src/Data/String/Unicode.purs Outdated Show resolved Hide resolved
src/Data/CodePoint/Unicode.purs Outdated Show resolved Hide resolved
src/Data/CodePoint/Unicode.purs Outdated Show resolved Hide resolved
.github/workflows/ci.yml Show resolved Hide resolved
@@ -1,10 +1,10 @@
-----------------------------------------------------------
-- This is an automatically generated file: do not edit
-- Generated by ubconfc at Fri Nov 10 20:05:16 PST 2017
-- Generated by ubconfc at Tue Jan 12 18:57:20 CST 2021
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a good way to verify this file other than seeing that the tests still pass. From a scan through it looks fine to me. If you have any suggestions on verifying this, @MonoidMusician, I'm all ears!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we have some documentation that describes generating these files anew:

https://github.com/purescript-contrib/purescript-unicode/tree/main/docs#generating-internal-modules

You've made some changes to how these are generated; if there's any information you think would be helpful to maintainers working on this in the future, please consider adding it to that documentation file as well. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I missed that it was moved to there, will update.

import Data.CodePoint.Unicode.Internal (bsearch, uTowlower, uTowtitle, uTowupper)
import Data.Maybe (Maybe(..))

type CaseRec =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add documentation comments to this and the declarations throughout the file, as this is a public module? Simple ones are fine, but they can help guide maintainers in the future as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it to a new Internal folder – users should not use these functions on Ints but instead the ones on CodePoints.

MonoidMusician and others added 8 commits January 25, 2021 12:51
Co-authored-by: Thomas Honeyman <admin@thomashoneyman.com>
Co-authored-by: Thomas Honeyman <admin@thomashoneyman.com>
And update documentation for generating internal modules
Instead of hypothetical performance benefits
@MonoidMusician
Copy link
Contributor Author

Okay, thanks for the feedback! I think I addressed what was addressable.

packages.dhall Outdated Show resolved Hide resolved
Copy link
Contributor

@thomashoneyman thomashoneyman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Thanks for your work! If no other maintainer notices an issue before this must be merged in order to make the 0.14 release, then I will merge this PR. I'm going to leave it open in the meantime.

fullcase.js Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
fullcase.js Outdated Show resolved Hide resolved
Copy link
Contributor

@michaelficarra michaelficarra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM. What do you think about also exporting the Unicode version as a String?

@MonoidMusician
Copy link
Contributor Author

Okay, I think I addressed your feedback, thanks! I don't think we need to export the Unicode version from the library, I just want to make sure it's available for documentation when someone goes looking for it. Unicode seems to be pretty stable so I don't think it affects most users.

@MonoidMusician
Copy link
Contributor Author

I just noticed that this fixes #28 (I used hexadecimal escapes).

@thomashoneyman thomashoneyman merged commit 124607f into purescript-contrib:main Jan 27, 2021
@thomashoneyman
Copy link
Contributor

Fantastic work, @MonoidMusician! Thank you, and if anyone does happen to notice an issue we can merge a fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
purs-0.14 A reminder to address this issue or merge this PR before we release PureScript v0.14.0 type: breaking change A change that requires a major version bump.
Development

Successfully merging this pull request may close these issues.

None yet

6 participants