
Include unicode escape sequence in docs on String literal #232

Open
chexxor opened this issue Jan 14, 2019 · 4 comments

Comments

chexxor (Collaborator) commented Jan 14, 2019

I had to dig into the PureScript source code and test cases to learn how to write Unicode characters by code point, like this:

At the same time, it might also be good to document that PureScript strings use UTF-16 encoding internally, and whatever other details that entails.

> a = "Pok\x00E9mon"
> b = "Poke\x0301mon"
> a == b
false
-- hmm... that's concerning.

It looks like Swift compares Strings by canonical equivalence [1] (so a and b above would compare equal there), rather than by code units as many languages do. I wonder if PureScript should treat this as a bug to fix.

[1] https://oleb.net/blog/2017/11/swift-4-strings/

hdgarrood (Collaborator) commented Jan 14, 2019

We do mention the encoding in the Prim docs, but it’s probably worth saying it here too. See https://pursuit.purescript.org/builtins/docs/Prim#t:String.

The example you have there isn’t to do with encoding, though. That would evaluate to false regardless of the encoding, because you have a different set of code points between a and b.

I consider the current behaviour of the String instances like Eq and Ord to be the most sensible option though: see purescript/purescript-strings#79 (comment) and the few comments following it.

chexxor (Collaborator, Author) commented Jan 14, 2019

Interestingly, it looks like Perl 6 has two different built-in types for strings: Str and Uni.
https://docs.perl6.org/type.html

chexxor (Collaborator, Author) commented Jan 15, 2019

You're right, but here's an example that demonstrates my point:

-- PureScript example of differing lengths.
> logShow (Data.String.CodeUnits.length "𝐀bc")
4
> logShow (Data.String.CodePoints.length "𝐀bc")
3

// Compiles to the following JavaScript
> "𝐀bc".length; // CodeUnits
4
> Array.from("𝐀bc", function (str) { return str.codePointAt(0); }).length; // CodePoints
3

Data.String re-exports Data.String.CodePoints, effectively a soft recommendation to use it.
https://github.com/purescript/purescript-strings/blob/1fbc4c0cf0fb816870a6841fa83c5cbdcddaaf22/src/Data/String.purs#L3

The question, then, is: How should PureScript-the-language define how Strings work?

I'd prefer that it do the most intuitive thing by default and allow you to opt in to a data type which is faster on the specific backend you are targeting.

hdgarrood (Collaborator) commented Jan 15, 2019

Data.String re-exports Data.String.CodePoints, effectively a soft recommendation to use it.

Yes, this is totally on purpose: you should be using Data.String.CodePoints unless you have a specific reason to use Data.String.CodeUnits. Using the functions in Data.String.CodeUnits makes it way too easy to accidentally do things like split surrogate pairs in half:

> CU.splitAt 1 "🐱🐲"
{ after: "�🐲", before: "�" }

> CP.splitAt 1 "🐱🐲"
{ after: "🐲", before: "🐱" }

I'd prefer that it do the most intuitive thing by default and allow you to opt in to a data type which is faster on the specific backend you are targeting.

I'm not sure it's safe to say that any option is the most intuitive, or even the fastest on any given platform. It's highly context-dependent: UTF-8 is often a good choice, especially for English text, because you generally need just one byte per code point, and because (I think) it's the most common encoding on the web. However, UTF-16 can be better for other languages, e.g. East Asian languages: many characters which fit into two bytes when encoded as UTF-16 require three bytes in UTF-8. Also, many programming languages' default string type (including JS's) uses UTF-16.
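The size trade-off above can be measured directly in JavaScript: `TextEncoder` gives UTF-8 byte counts, while a JS string's `.length` counts UTF-16 code units (two bytes each):

```javascript
// Byte counts under each encoding, for comparison.
const utf8Bytes  = (s) => new TextEncoder().encode(s).length;
const utf16Bytes = (s) => s.length * 2; // 2 bytes per UTF-16 code unit

console.log(utf8Bytes("cat"), utf16Bytes("cat"));   // 3 vs 6: UTF-8 smaller for ASCII
console.log(utf8Bytes("漢字"), utf16Bytes("漢字")); // 6 vs 4: UTF-16 smaller for CJK
```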

PureScript's String type is defined as a sequence of UTF-16 code units (the same as JS). Most of the time, this isn't very different from defining String as, say, a sequence of Unicode scalar values, but the difference does matter in cases like my example above: if String were actually a sequence of Unicode scalar values, then it wouldn't be possible to split across a surrogate pair like that. The problem with that, though, is that it's such a subtle difference that it would be very easy to run into undefined behaviour by declaring something which comes from JS to be a String, if it then turns out to contain lone surrogates. See also purescript/purescript#2488 for some discussion around this.
