New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert functions to use code points #15
Convert functions to use code points #15
Conversation
@MonoidMusician Thanks for this PR! To be honest, I'm not super knowledgable about unicode. Would you be able to link me to something explaining the difference between Javascript Before I merge this, could you do the following things:
Also, purescript-parsing depends on purescript-unicode. After this is merged, would you be willing to send a small PR to purescript-parsing updating their uses of purescript-unicode? |
@michaelficarra, would it be possible for you to take a quick glance at this PR and make sure everything looks sane? It looks like you're the author of In case you haven't see |
Also, I wonder if it makes sense to change the module name from |
src/Data/Char/Unicode.purs
Outdated
toUpper :: Char -> Char | ||
toUpper = fromCharCode <<< uTowupper <<< toCharCode | ||
toUpper :: CodePoint -> CodePoint | ||
toUpper = modify uTowupper |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toUpper
can't be CodePoint -> CodePoint
, it must be CodePoint -> String
. But maybe that's something we should address separately, since I see it's currently broken anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay ... Is this because a some precomposed characters don't have direct upper case equivalents and would need to be decomposed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly, but also see https://unicode.org/faq/casemap_charprop.html#11
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, cool ... I'm not familiar with the representation of the rules, so I wouldn't know how to fix that. Also, would we want to return String
or Array CodePoint
? (maybe the latter for the the Internal
module?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Array CodePoint
sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently it looks like toUpper
does not change the eszett.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MonoidMusician could you also add some documentation to toUpper
explaining why the type is CodePoint -> Array CodePoint
?
Also, I'm wondering if it would be helpful to have a function like unsafeToUpper :: CodePoint -> CodePoint
for people who want to easily map over a list of codepoints, without worrying about being 100% correct. Its use should be discouraged of course.
I don't like the use of |
@cdepillabout Thanks for the suggestions! According the the documentation for Prim.Char [1], it represents one UTF-16 code unit, i.e. all non-astral code points. But CodePoint is a newtype around Int so it can represent any Unicode value.
I should make a PR for [1] https://pursuit.purescript.org/builtins/docs/Prim#t:Char |
After thinking about it some more, I think this might be a good change to make. It would make it possible for us to go back and re-add a The documentation for |
Why even have the |
Does "code unit" mean PureScript's I was thinking that there would be a lot of [beginner?] PureScript programmers that are using However, there will certainly be people that want to handle unicode correctly. They can use the However, if the rest of the PureScript community is moving to completely use |
7a544c3
to
99e1549
Compare
I will update this PR once this is merged and released: purescript/purescript-strings#92 It looks like the mapping of “ß”
I've started my own standard parsing library so if I could probably attempt to convert our parser here to PS from awk: https://github.com/MonoidMusician/purescript-uievents-key/blob/master/test/Generate.purs Also should we generate it from the latest 10.0.0 instead of 6.0.0 linked in the README? I'm not sure how significant the changes are (more emoji anyone?? 😄) And my last point: I renamed it to Thanks for the feedback from both of you! |
Yes, if you want to provide case mapping functions, you should base them on that table.
Yes, but stay consistent within major releases of purescript-unicode.
No, see purescript/purescript-strings#95. |
It looks like there has been some discussion about moving to CodePoints on purescript/purescript-strings#95, so it seems like it might be a good time to revisit this issue! |
I'm going to need a response to this "quickly" or I'll make a 0.12 release with the old API and we'll need to make another breaking release to incorporate this PR. |
@MonoidMusician Does this PR need to be updated? Or can it be merged in as-is? |
I am starting to work on updating it :) Special casing might be a separate PR though ... maybe we should keep |
@MonoidMusician For anything that you do, make sure you do it against the compiler/0.12 branch. This package-set should have all the dependencies you need: https://github.com/kRITZCREEK/package-sets/tree/test-core EDIT: fromCharCode returns a Maybe now, but since this library is pretty low-level you might want to redefine a non-Maybe version with the FFI and use that whenever you're sure you'll stay within bounds. |
Hm, the dev dependency |
This is more accurate because`CodePoint` represents any unicode character, whereas `Char` only works with non-astral characters, due to its JavaScript representation.
They work with local modifications ...
I had to fix up the testing libraries locally, but the tests pass here now. @kritzcreek suggested using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really impressive, @MonoidMusician, and thank you for your work. I only have minor comments to make; on the whole it's good to see this modernized and ready for Unicode 13.0.0, as well as on a sounder footing with the switch to code points.
Admittedly, I'm not particularly well-versed in the subtleties of unicode, so perhaps if @michaelficarra has time for another look that would be helpful as well.
@@ -1,10 +1,10 @@ | |||
----------------------------------------------------------- | |||
-- This is an automatically generated file: do not edit | |||
-- Generated by ubconfc at Fri Nov 10 20:05:16 PST 2017 | |||
-- Generated by ubconfc at Tue Jan 12 18:57:20 CST 2021 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a good way to verify this file other than seeing that the tests still pass. From a scan through it looks fine to me. If you have any suggestions on verifying this, @MonoidMusician, I'm all ears!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, we have some documentation that describes generating these files anew:
https://github.com/purescript-contrib/purescript-unicode/tree/main/docs#generating-internal-modules
You've made some changes to how these are generated; if there's any information you think would be helpful to maintainers working on this in the future, please consider adding it to that documentation file as well. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I missed that it was moved to there, will update.
import Data.CodePoint.Unicode.Internal (bsearch, uTowlower, uTowtitle, uTowupper) | ||
import Data.Maybe (Maybe(..)) | ||
|
||
type CaseRec = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add documentation comments to this and the declarations throughout the file, as this is a public module? Simple ones are fine, but they can help guide maintainers in the future as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved it to a new Internal folder – users should not use these functions on Int
s but instead the ones on CodePoint
s.
Co-authored-by: Thomas Honeyman <admin@thomashoneyman.com>
Co-authored-by: Thomas Honeyman <admin@thomashoneyman.com>
And update documentation for generating internal modules
Instead of hypothetical performance benefits
Okay, thanks for the feedback! I think I addressed what was addressable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me! Thanks for your work! If no other maintainer notices an issue before this must be merged in order to make the 0.14 release, then I will merge this PR. I'm going to leave it open in the meantime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM. What do you think about also exporting the Unicode version as a String?
Co-authored-by: Michael Ficarra <github@michael.ficarra.me>
Okay, I think I addressed your feedback, thanks! I don't think we need to export the Unicode version from the library, I just want to make sure it's available for documentation when someone goes looking for it. Unicode seems to be pretty stable so I don't think it affects most users. |
I just noticed that this fixes #28 (I used hexadecimal escapes). |
Fantastic work, @MonoidMusician! Thank you, and if anyone does happen to notice an issue we can merge a fix. |
This is more accurate because
CodePoint
represents any unicode character, whereasChar
only works with non-astral characters, due to its JavaScript representation. Right?I'll probably submit a PR to
-strings
to add an equivalent ofcharPoint
directly to Data.String.CodePoints, which will clean this up a bit.