
Treatment of surrogate pairs in string literals #2434

Closed
hdgarrood opened this issue Nov 17, 2016 · 25 comments

Comments

@hdgarrood
Contributor

After looking at #2418 I realised that we allow invalid surrogate pairs in string literals. For example:

module Main where

import Prelude
import Control.Monad.Eff (Eff)
import Control.Monad.Eff.Console (CONSOLE, log)

main :: forall e. Eff (console :: CONSOLE | e) Unit
main = do
  log "\xDFFF\xD800"

Produces the following:

��

That is, two copies of U+FFFD, the unicode replacement character. At the moment, valid surrogate pairs come out how you would expect them to (although I expect backends other than JS will handle this less gracefully).

I think we should almost certainly throw an error in the parser if a string literal contains an invalid surrogate pair. We should also consider throwing a parser error for any escape sequence in the range reserved for surrogates, U+D800 to U+DFFF; that is, we should consider erroring even on valid surrogate pairs.

Ideally I think string literals in PureScript source code would be represented in the compiler as the precise values they refer to, e.g. if you wrote "\x1F4A1" in a PureScript source file, the compiler would represent that internally as that actual code point (using some encoding which I don't think is relevant to this discussion). Then whichever backend is being used would be responsible for writing that value out as a string literal which would be interpreted correctly by whatever language implementation will end up reading it (i.e. usually JS but potentially many others).

The reason I would vote for erroring even on valid surrogate pairs is that allowing them suggests that strings are always going to be encoded as UTF-16 at runtime, which is true for JS, but probably not for other backends. It also makes the implementation of the lexer simpler. I also think it makes more sense to put any awkward code for handling surrogate pairs in the JS code generator (as opposed to the lexer), as dealing with this kind of encoding issue in generated code will be specific to the backend you're targeting.

Relevant background info from Wikipedia:

U+D800 to U+DFFF

The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points.

However UCS-2, UTF-8, and UTF-32 can encode these code points in trivial and obvious ways, and large amounts of software does so even though the standard states that such arrangements should be treated as encoding errors. It is possible to unambiguously encode them in UTF-16 by using a code unit equal to the code point, as long as no sequence of two code units can be interpreted as a legal surrogate pair (that is, as long as a high surrogate is never followed by a low surrogate). The majority of UTF-16 encoder and decoder implementations translate between encodings as though this were the case[citation needed] and Windows allows such sequences in filenames.

/cc @kritzcreek @michaelficarra @andyarvanitis

@kritzcreek
Member

awkward code for handling surrogate pairs in the JS code generator

refers to this bit I think:

encodeChar c | fromEnum c > 0xFFFF =
  "\\u" ++ showHex highSurrogate ("\\u" ++ showHex lowSurrogate "")
  where
    (h, l) = divMod (fromEnum c - 0x10000) 0x400
    highSurrogate = h + 0xD800
    lowSurrogate = l + 0xDC00

I think we shouldn't allow surrogate pairs in PureScript string literals, but I've just read into the topic for the last two days and have never dealt with them before so my opinion doesn't weigh much 🤷

@hdgarrood
Contributor Author

Ah great, so I guess we already have that to deal with the case where people write string literals which include characters that must be encoded as surrogate pairs in UTF-16, but are written in the normal way as one code point, e.g. the RHS in that test that's causing problems in your Text PR:

surrogatePair = "\xD834\xDF06" == "\x1D306"
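The arithmetic the code generator performs for such characters can be sketched in JavaScript as follows. This is a minimal sketch of the standard UTF-16 encoding formula, not compiler code; `toSurrogatePair` is a hypothetical helper name.

```javascript
// Sketch of the surrogate-pair arithmetic for code points above the BMP.
// Mirrors the divMod-based encodeChar logic quoted above.
function toSurrogatePair(codePoint) {
  if (codePoint <= 0xFFFF) throw new RangeError("BMP code point needs no pair");
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low  = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

const [hi, lo] = toSurrogatePair(0x1D306);
console.log(hi.toString(16), lo.toString(16)); // d834 df06
console.log(String.fromCharCode(hi, lo) === "\u{1D306}"); // true
```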

@michaelficarra
Contributor

tl;dr I don't think we can get away with language independence and full JavaScript interoperability simultaneously. But I do think we should make a clear decision about PureScript string semantics.

I have a feeling this is going to be a long discussion. JavaScript strings are a bit strange. If you want to have full interoperability with JS programs (which we do), you need to be able to represent what we call "lone surrogates". See this excerpt from §6.1.4:

Some operations interpret String contents as UTF-16 encoded Unicode code points. In that case the interpretation is:

  • A code unit in the range 0 to 0xD7FF or in the range 0xE000 to 0xFFFF is interpreted as a code point with the same value.
  • A sequence of two code units, where the first code unit c1 is in the range 0xD800 to 0xDBFF and the second code unit c2 is in the range 0xDC00 to 0xDFFF, is a surrogate pair and is interpreted as a code point with the value (c1 - 0xD800) × 0x400 + (c2 - 0xDC00) + 0x10000. (See 10.1.2)
  • A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value.

These are valid code points, but there would be no way to represent them using UTF-16 encoding. When attempting to do so, you get those U+FFFD replacement characters.
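A quick Node session illustrates this. Lone surrogates live happily in a JS string (which is just a sequence of UTF-16 code units), but they cannot survive a round trip through a real UTF encoding; Node's Buffer, like the WHATWG TextEncoder, substitutes U+FFFD for each unpaired surrogate.

```javascript
// Lone surrogates are representable in a JS string in memory...
const lone = "\uDFFF\uD800"; // the literal from this issue
console.log(lone.length);                     // 2 code units
console.log(lone.charCodeAt(0).toString(16)); // dfff

// ...but encoding to UTF-8 and back replaces each one with U+FFFD.
const roundTripped = Buffer.from(lone, "utf8").toString("utf8");
console.log(roundTripped === "\uFFFD\uFFFD"); // true
```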

I think the first thing we should talk about is whether PureScript strings are sequences of code points or UTF-16 code units. I think we can all agree that we would ideally have PureScript strings be a sequence of code points, not code units. But it appears that the purescript-strings library currently treats strings as sequences of code units. In order to maintain JS interoperability, we have to be able to represent the code points that would form lone surrogates. If we can do that, we can combine strings with lone surrogates, forming surrogate pairs. So in this theoretical world of PureScript strings as sequences of code points, string concatenation will surprise some who are not familiar with the relationship between code points and code units.

let a = "\xD834"
let b = "\xDF06"
length a -- 1
length b -- 1
let c = a <> b -- "\x1D306"
length c -- 1
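The same surprise is observable in plain JavaScript today: `.length` counts UTF-16 code units, while string iteration counts code points, which is the semantics the hypothetical code-point `String` would expose.

```javascript
const a = "\uD834"; // high surrogate alone
const b = "\uDF06"; // low surrogate alone
const c = a + b;    // concatenation forms a well-formed pair: U+1D306

// .length counts UTF-16 code units...
console.log(a.length, b.length, c.length); // 1 1 2
// ...while iteration counts code points.
console.log([...a].length, [...b].length, [...c].length); // 1 1 1
console.log(c === "\u{1D306}"); // true
```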

@hdgarrood
Contributor Author

Oh wow, I had no idea, thanks for this! In a theoretical world where PureScript strings are sequences of code points, would we want to disallow people from constructing strings like a and b in your example?

@michaelficarra
Contributor

Sure, but then there goes our JavaScript interop.

@michaelficarra
Contributor

Oh man, I just realised another surprising consequence. This is no longer necessarily true.

length (a <> b) == length (b <> a)
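The asymmetry can be checked directly in JavaScript by counting code points with iteration: a high surrogate followed by a low one fuses into a single code point, while the reverse order leaves two lone surrogates.

```javascript
const a = "\uD834"; // high surrogate
const b = "\uDF06"; // low surrogate

// Code-point count, i.e. what strings-as-code-points would call length.
const codePoints = (s) => [...s].length;

console.log(codePoints(a + b)); // 1 - high then low forms a surrogate pair
console.log(codePoints(b + a)); // 2 - low then high stays two lone surrogates
```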

@paf31 I see you've marked this as approved/bug. I didn't realise that we've identified an action item yet. Mind clarifying those labels?

@paf31 paf31 modified the milestones: Discussion, Approved Nov 18, 2016
@paf31
Contributor

paf31 commented Nov 18, 2016

Good point, I've moved it to discussion.

@paf31 paf31 removed the type: bug label Nov 18, 2016
@hdgarrood
Contributor Author

I suppose everyone is going to expect <> and the functions in Data.String to work the same way as + and the JS string functions of the same names, and that PureScript strings are the same as JS strings (for interop). In order to achieve that, I guess we need to allow any sequence of code units in literals, and pass them all through unchanged?

@michaelficarra
Contributor

@hdgarrood I think you're right; we really need seamless JS FFI for string values. In that case, I think we should move the rendering of Unicode code points to UTF-16 surrogate pairs from the JS code generator to the internal representation of strings in the compiler. That will officially codify that representation as the PureScript representation and make it easy for all other backends to behave the same way.

@hdgarrood
Contributor Author

I guess we should be using ByteString for representing a string literal in the AST, then; Text is no good because it won't give us control over this kind of thing.

What do you think, say, a Python 3 backend should do? As far as I'm aware their strings behave more like sequences of code points.

@hdgarrood
Contributor Author

(Also 👍, this all sounds sensible to me).

@hdgarrood
Contributor Author

Just checked this again and I think my terminal was actually doing the replacement; I think psc is already doing what we want it to do, since this program seems to work:

module Main where

import Prelude
import Data.String
import Control.Monad.Eff (Eff)
import Control.Monad.Eff.Console (CONSOLE, logShow)

main :: forall e. Eff (console :: CONSOLE | e) Unit
main = do
  logShow $ length str
  logShow $ charCodeAt 0 str
  logShow $ charCodeAt 1 str


str = "\xDFFF\xD800"

Outputs:

2
(Just 57343)
(Just 55296)
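This is also what the generated JS effectively evaluates: the lone surrogates are kept as-is in the string, mirroring the PureScript output above.

```javascript
// JS equivalent of the PureScript program above: lone surrogates pass
// through untouched as two UTF-16 code units.
const str = "\uDFFF\uD800";
console.log(str.length);        // 2
console.log(str.charCodeAt(0)); // 57343 (0xDFFF)
console.log(str.charCodeAt(1)); // 55296 (0xD800)
```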

@hdgarrood
Contributor Author

Ok, so I guess we should probably add a test for invalid surrogate pairs / lone surrogates, but other than that I don't think there's anything else we really need to do?

@hdgarrood
Contributor Author

Oh oops, of course there's also this:

I think we should move the rendering of Unicode code points to UTF-16 surrogate pairs from the JS code generator to the internal representation of strings in the compiler. That will officially codify that representation as the PureScript representation and make it easy for all other backends to behave the same way.

@michaelficarra
Contributor

Yeah, I think that's the only action item here. And it'll automatically fix #2438. I'm still going to think about the interop with non-JS languages for a bit before making a PR.

michaelficarra added a commit to michaelficarra/purescript that referenced this issue Nov 27, 2016
michaelficarra added a commit to michaelficarra/purescript that referenced this issue Nov 28, 2016
michaelficarra added a commit to michaelficarra/purescript that referenced this issue Nov 29, 2016
@hdgarrood
Contributor Author

Come to think of it, we should probably ask for input from other backend maintainers if this is what we want to do. Will it be awkward for the C++/Erlang backends if PureScript specifies that the String type should represent a sequence of UTF-16 code units like this? @andyarvanitis @nwolverson

@andyarvanitis
Contributor

andyarvanitis commented Nov 29, 2016

I tend to favor UTF-8 (for various reasons), and was more-or-less supporting unicode via it in the C++ backend, but I would be ok with switching to UTF-16 if we want to standardize on the semantics on the PureScript side (sounds like we do). And I think operations being based on code units rather than points will be ok too. As an interop bonus, some popular native frameworks use this scheme anyway (Apple NSString, Qt QString, and possibly more).

@andyarvanitis
Contributor

Actually, I'm going to think about this one some more (for C++), but feel free to proceed with what you already have in mind for now.

@nwolverson
Contributor

Same here. I'm currently using lists of Unicode code points, and was thinking of moving to UTF-8 binaries (which I think is what sane libraries use); UTF-16 is possible, but I'm not sure about lone surrogates. Regardless, it's better to make this change so that the JS representation is precise, and other backends can either manage to comply with it or document a caveat...

paf31 pushed a commit that referenced this issue Dec 10, 2016

* fixes #2434: fixes #2438: PureScript chars must be UTF-16 code units

* add descriptions to StringEscape assertions; add one more assertion

* fix regressions introduced by #2418

* disable failing Unicode replacement character test for now
@nwolverson
Contributor

I've merged these changes and I have an implementation representing strings as sequences of UTF-16 code units, but I'm reluctant to proceed with it as I think it will give an unpleasant experience, requiring conversion to/from UTF-8 constantly and yet not behaving quite the same as JS where lone surrogates are concerned. I wonder how strongly people feel about consistency here?

@hdgarrood
Contributor Author

How come it doesn't quite behave the same as JS? I don't feel strongly; I could be persuaded either way at this point. Consistency is of course important, but if that ends up with us having a separate string type for every backend then clearly something has gone wrong.

@nwolverson
Contributor

@hdgarrood I can use a "binary" containing UTF-16 code units, but all (Unicode-aware) library functions expect UTF-8. Conversion from valid UTF-16 is fine, but in the case of lone surrogates (or anything otherwise invalid) it will of course fail, and that failure must be handled; if the U+FFFD replacement character is desired, the substitution has to be done by hand at runtime (e.g. for console output just now I'm using a quick truncation of the string before any invalid character). So basically all FFI functions involving strings will need to go via a runtime support function, and straightforward native library bindings will use some UTF8String type everywhere instead.

(As a counterpoint concatenation of separate halves of a surrogate pair would be fine. And PSString is perfectly usable)

@hdgarrood
Contributor Author

Ah right, I see.

I'd actually like to reconsider another option that we previously decided against in this issue: change the builtin String type so that it is guaranteed to be valid Unicode text with an unspecified encoding, and provide a JSString type in a library for the cases where people really do need the lone-surrogate behaviour that matches JS (which I suspect are very rare). In the JS backend we would probably still represent String as a JS string at runtime. We would then have to change a lot of the strings library, since many of the functions there operate at the code unit level: for example, splitAt would have to change so that it can't cut a surrogate pair in half, length would treat surrogate pairs as one character, and so on. We'd probably also have to change the Char type so that it could contain any Unicode code point, not just UTF-16 code units.

@andyarvanitis
Contributor

For efficiency, in the core strings library, you could still have all your index-based operations be based on code units, but just not specify what width those units are. So for example searching for a char/substring gets you an index, and then you can use that index to get to it with another function, but just don't expect it to be a "full character". Even code points have this problem, since you can have a letter followed by a number of accent marks to be applied to it.

Length/size functions should always return bytes, in my opinion, for related reasons.
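For comparison, here is how a byte-based length diverges from JS's code-unit length, using Node's Buffer to measure encoded sizes. This is only illustrative; the C++ backend would of course use its own UTF-8 machinery rather than Node.

```javascript
// One code point outside the BMP: three different "lengths".
const s = "\u{1D306}";

console.log(s.length);                        // 2 UTF-16 code units
console.log([...s].length);                   // 1 code point
console.log(Buffer.byteLength(s, "utf8"));    // 4 UTF-8 bytes
console.log(Buffer.byteLength(s, "utf16le")); // 4 UTF-16 bytes
```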

This is just if we went with a more generic unicode approach -- I can't speak to the javascript-specific needs.

Also, I've merged the PSString changes, and they're working fine for me too.

@andyarvanitis
Contributor

I now have a working version of purescript-strings for the C++ target that I'm testing out, and it is internally using utf8 (no changes to the *.purs files, as usual). All indexing is based on (effectively unspecified) code units -- which are bytes, in this case. As for Char, I'm now trying it out as a utf32 code point, even though literals are limited on the PS side. As @hdgarrood mentioned, maybe Char should be changed to that officially? Would it matter to javascript, since it doesn't really have a standalone char type?
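For what it's worth, JavaScript already exposes both views of a "character" even without a standalone char type: charCodeAt returns UTF-16 code units, while codePointAt returns full code points, which is what a UTF-32 Char as floated above would hold.

```javascript
// Code-unit view vs code-point view of the same position in a string.
const s = "\u{1D306}";
console.log(s.charCodeAt(0).toString(16));  // d834 (high surrogate unit)
console.log(s.codePointAt(0).toString(16)); // 1d306 (whole code point)
```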
