
Include unicode escape sequence in docs on String literal #232

Open
chexxor opened this issue Jan 14, 2019 · 4 comments

Comments

chexxor (Collaborator) commented Jan 14, 2019

I had to dig into the PureScript source code and test cases to learn how to write Unicode characters by code point, like this:

At the same time, it might also be good to document that PureScript strings use UTF-16 encoding internally, and whatever other details that entails.

> a = "Pok\x00E9mon"
> b = "Poke\x0301mon"
> a == b
false
-- hmm... that's concerning.

It looks like Swift compares Strings by canonical equivalence [1] (so a and b above would compare equal there), rather than by code units as many languages do. I wonder if PureScript should treat this as a bug to fix.

[1] https://oleb.net/blog/2017/11/swift-4-strings/

hdgarrood (Collaborator) commented Jan 14, 2019

We do mention the encoding in the Prim docs, but it’s probably worth saying it here too. See https://pursuit.purescript.org/builtins/docs/Prim#t:String.

The example you have there isn’t to do with encoding, though. That would evaluate to false regardless of the encoding, because you have a different set of code points between a and b.

I consider the current behaviour of the String instances like Eq and Ord to be the most sensible option though: see purescript/purescript-strings#79 (comment) and the few comments following it.

chexxor (Collaborator, Author) commented Jan 14, 2019

Interestingly, it looks like Perl 6 has two different built-in types for strings: Str and Uni.
https://docs.perl6.org/type.html

chexxor (Collaborator, Author) commented Jan 15, 2019

You're right, but here's an example that demonstrates my point:

-- PureScript example of differing lengths.
> logShow (Data.String.CodeUnits.length "𝐀bc")
4
> logShow (Data.String.CodePoints.length "𝐀bc")
3

// Compiles to the following JavaScript
> "𝐀bc".length; // CodeUnits
4
> Array.from("𝐀bc", function (str) { return str.codePointAt(0); }).length; // CodePoints
3

Data.String re-exports Data.String.CodePoints, effectively a soft recommendation to use it.
https://github.com/purescript/purescript-strings/blob/1fbc4c0cf0fb816870a6841fa83c5cbdcddaaf22/src/Data/String.purs#L3

The question, then, is: How should PureScript-the-language define how Strings work?

I'd prefer that it do the most intuitive thing by default and allow you to opt in to a data type which is faster on the specific backend you are targeting.

hdgarrood (Collaborator) commented Jan 15, 2019

Data.String re-exports Data.String.CodePoints, effectively a soft recommendation to use it.

Yes, this is totally on purpose: you should be using Data.String.CodePoints unless you have a specific reason to use Data.String.CodeUnits. Using the functions in Data.String.CodeUnits makes it way too easy to accidentally do things like split surrogate pairs in half:

> CU.splitAt 1 "🐱🐲"
{ after: "�🐲", before: "�" }

> CP.splitAt 1 "🐱🐲"
{ after: "🐲", before: "🐱" }

I'd prefer that it do the most intuitive thing by default and allow you to opt in to a data type which is faster on the specific backend you are targeting.

I'm not sure it's safe to say that any option is the most intuitive, or even the fastest on any given platform. It's highly context-dependent: UTF-8 is often a good choice, especially for English text, because you generally need just one byte per code point, and because (I think) it's the most common encoding on the web. However, UTF-16 can be better for other languages, e.g. East Asian languages: many characters which fit into two bytes when encoded as UTF-16 require three bytes in UTF-8. Also, many programming languages' default string type (including JS's) uses UTF-16.
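The size trade-off above can be measured directly in JavaScript: `TextEncoder` gives UTF-8 byte counts, while a JS string's `.length` counts UTF-16 code units (two bytes each):

```javascript
// Byte counts under each encoding, for comparison.
const utf8Bytes  = (s) => new TextEncoder().encode(s).length;
const utf16Bytes = (s) => s.length * 2; // 2 bytes per UTF-16 code unit

console.log(utf8Bytes("cat"), utf16Bytes("cat"));   // 3 vs 6: UTF-8 smaller for ASCII
console.log(utf8Bytes("漢字"), utf16Bytes("漢字")); // 6 vs 4: UTF-16 smaller for CJK
```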

PureScript's String type is defined as a sequence of UTF-16 code units (the same as JS). Most of the time, this isn't very different from defining String as, say, a sequence of Unicode scalar values, but the difference does matter in cases like my example above: if String were actually a sequence of Unicode scalar values, then it wouldn't be possible to split across a surrogate pair like that. The problem with that, though, is that it's such a subtle difference that it would be very easy to run into undefined behaviour by declaring something which comes from JS to be a String, if it then turns out to contain lone surrogates. See also purescript/purescript#2488 for some discussion around this.
