
Treatment of surrogate pairs in string literals #2434

Closed
hdgarrood opened this issue Nov 17, 2016 · 25 comments

Comments

@hdgarrood
Contributor

After looking at #2418 I realised that we allow invalid surrogate pairs in string literals. For example:

module Main where

import Prelude
import Control.Monad.Eff (Eff)
import Control.Monad.Eff.Console (CONSOLE, log)

main :: forall e. Eff (console :: CONSOLE | e) Unit
main = do
  log "\xDFFF\xD800"

Produces the following:

��

That is, two copies of U+FFFD, the unicode replacement character. At the moment, valid surrogate pairs come out how you would expect them to (although I expect backends other than JS will handle this less gracefully).

I think we should almost certainly throw an error in the parser if a string literal contains an invalid surrogate pair. We should also consider throwing a parser error for any escape sequence in the range reserved for surrogates, U+D800 to U+DFFF; that is, we should consider erroring even on valid surrogate pairs.

Ideally I think string literals in PureScript source code would be represented in the compiler as the precise values they refer to, e.g. if you wrote "\x1F4A1" in a PureScript source file, the compiler would represent that internally as that actual code point (using some encoding which I don't think is relevant to this discussion). Then whichever backend is being used would be responsible for writing that value out as a string literal which would be interpreted correctly by whatever language implementation will end up reading it (i.e. usually JS but potentially many others).

The reason I would vote for erroring even on valid surrogate pairs is that allowing them suggests that strings are always going to be encoded as UTF-16 at runtime, which is true for JS, but probably not for other backends. It also makes the implementation of the lexer simpler. I also think it makes more sense to put any awkward code for handling surrogate pairs in the JS code generator (as opposed to the lexer), as dealing with this kind of encoding issue in generated code will be specific to the backend you're targeting.

Relevant background info from Wikipedia:

U+D800 to U+DFFF

The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points.

However UCS-2, UTF-8, and UTF-32 can encode these code points in trivial and obvious ways, and large amounts of software does so even though the standard states that such arrangements should be treated as encoding errors. It is possible to unambiguously encode them in UTF-16 by using a code unit equal to the code point, as long as no sequence of two code units can be interpreted as a legal surrogate pair (that is, as long as a high surrogate is never followed by a low surrogate). The majority of UTF-16 encoder and decoder implementations translate between encodings as though this were the case[citation needed] and Windows allows such sequences in filenames.

/cc @kritzcreek @michaelficarra @andyarvanitis

@kritzcreek
Member

awkward code for handling surrogate pairs in the JS code generator

refers to this bit I think:

encodeChar c | fromEnum c > 0xFFFF =
  "\\u" ++ showHex highSurrogate ("\\u" ++ showHex lowSurrogate "")
  where
    (h, l) = divMod (fromEnum c - 0x10000) 0x400
    highSurrogate = h + 0xD800
    lowSurrogate = l + 0xDC00

I think we shouldn't allow surrogate pairs in PureScript string literals, but I've just read into the topic for the last two days and have never dealt with them before so my opinion doesn't weigh much 🤷

@hdgarrood
Contributor Author

Ah great, so I guess we already have that to deal with the case where people write string literals which include characters that must be encoded as surrogate pairs in UTF-16, but are written in the normal way as one code point, e.g. the RHS in that test that's causing problems in your Text PR:

surrogatePair = "\xD834\xDF06" == "\x1D306"
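The arithmetic the code generator performs for such characters can be sketched in JavaScript as follows. This is a minimal sketch of the standard UTF-16 encoding formula, not compiler code; `toSurrogatePair` is a hypothetical helper name.

```javascript
// Sketch of the surrogate-pair arithmetic for code points above the BMP.
// Mirrors the divMod-based encodeChar logic quoted above.
function toSurrogatePair(codePoint) {
  if (codePoint <= 0xFFFF) throw new RangeError("BMP code point needs no pair");
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low  = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

const [hi, lo] = toSurrogatePair(0x1D306);
console.log(hi.toString(16), lo.toString(16)); // d834 df06
console.log(String.fromCharCode(hi, lo) === "\u{1D306}"); // true
```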

@michaelficarra
Contributor

tl;dr I don't think we can get away with language independence and full JavaScript interoperability simultaneously. But I do think we should make a clear decision about PureScript string semantics.

I have a feeling this is going to be a long discussion. JavaScript strings are a bit strange. If you want to have full interoperability with JS programs (which we do), you need to be able to represent what we call "lone surrogates". See this excerpt from §6.1.4:

Some operations interpret String contents as UTF-16 encoded Unicode code points. In that case the interpretation is:

  • A code unit in the range 0 to 0xD7FF or in the range 0xE000 to 0xFFFF is interpreted as a code point with the same value.
  • A sequence of two code units, where the first code unit c1 is in the range 0xD800 to 0xDBFF and the second code unit c2 is in the range 0xDC00 to 0xDFFF, is a surrogate pair and is interpreted as a code point with the value (c1 - 0xD800) × 0x400 + (c2 - 0xDC00) + 0x10000. (See 10.1.2)
  • A code unit that is in the range 0xD800 to 0xDFFF, but is not part of a surrogate pair, is interpreted as a code point with the same value.

These are valid code points, but there would be no way to represent them using UTF-16 encoding. When attempting to do so, you get those U+FFFD replacement characters.
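A quick Node session illustrates this. Lone surrogates live happily in a JS string (which is just a sequence of UTF-16 code units), but they cannot survive a round trip through a real UTF encoding; Node's Buffer, like the WHATWG TextEncoder, substitutes U+FFFD for each unpaired surrogate.

```javascript
// Lone surrogates are representable in a JS string in memory...
const lone = "\uDFFF\uD800"; // the literal from this issue
console.log(lone.length);                     // 2 code units
console.log(lone.charCodeAt(0).toString(16)); // dfff

// ...but encoding to UTF-8 and back replaces each one with U+FFFD.
const roundTripped = Buffer.from(lone, "utf8").toString("utf8");
console.log(roundTripped === "\uFFFD\uFFFD"); // true
```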

I think the first thing we should talk about is whether PureScript strings are sequences of code points or UTF-16 code units. I think we can all agree that we would ideally have PureScript strings be a sequence of code points, not code units. But it appears that the purescript-strings library currently treats strings as sequences of code units. In order to maintain JS interoperability, we have to be able to represent the code points that would form lone surrogates. If we can do that, we can combine strings with lone surrogates, forming surrogate pairs. So in this theoretical world of PureScript strings as sequences of code points, string concatenation will surprise some who are not familiar with the relationship between code points and code units.

let a = "\xD834"
let b = "\xDF06"
length a -- 1
length b -- 1
let c = a <> b -- "\x1D306"
length c -- 1
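The same surprise is observable in plain JavaScript today: `.length` counts UTF-16 code units, while string iteration counts code points, which is the semantics the hypothetical code-point `String` would expose.

```javascript
const a = "\uD834"; // high surrogate alone
const b = "\uDF06"; // low surrogate alone
const c = a + b;    // concatenation forms a well-formed pair: U+1D306

// .length counts UTF-16 code units...
console.log(a.length, b.length, c.length); // 1 1 2
// ...while iteration counts code points.
console.log([...a].length, [...b].length, [...c].length); // 1 1 1
console.log(c === "\u{1D306}"); // true
```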

@hdgarrood
Contributor Author

Oh wow, I had no idea, thanks for this! In a theoretical world where PureScript strings are sequences of code points, would we want to disallow people from constructing strings like a and b in your example?

@michaelficarra
Contributor

Sure, but then there goes our JavaScript interop.

@michaelficarra
Contributor

Oh man, I just realised another surprising consequence. This is no longer necessarily true.

length (a <> b) == length (b <> a)
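The asymmetry can be checked directly in JavaScript by counting code points with iteration: a high surrogate followed by a low one fuses into a single code point, while the reverse order leaves two lone surrogates.

```javascript
const a = "\uD834"; // high surrogate
const b = "\uDF06"; // low surrogate

// Code-point count, i.e. what strings-as-code-points would call length.
const codePoints = (s) => [...s].length;

console.log(codePoints(a + b)); // 1 - high then low forms a surrogate pair
console.log(codePoints(b + a)); // 2 - low then high stays two lone surrogates
```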

@paf31 I see you've marked this as approved/bug. I didn't realise that we've identified an action item yet. Mind clarifying those labels?

@paf31 paf31 modified the milestones: Discussion, Approved Nov 18, 2016
@paf31
Contributor

paf31 commented Nov 18, 2016

Good point, I've moved it to discussion.

@paf31 paf31 removed the type: bug label Nov 18, 2016
@hdgarrood
Contributor Author

I suppose everyone is going to expect <> and the functions in Data.String to work the same way as + and the JS string functions of the same names, and that PureScript strings are the same as JS strings (for interop). In order to achieve that, I guess we need to allow any sequence of code units in literals, and pass them all through unchanged?

@michaelficarra
Contributor

@hdgarrood I think you're right; we really need seamless JS FFI for string values. In that case, I think we should move the rendering of Unicode code points to UTF-16 surrogate pairs from the JS code generator to the internal representation of strings in the compiler. That will officially codify that representation as the PureScript representation and make it easy for all other backends to behave the same way.

@hdgarrood
Contributor Author

I guess we should be using ByteString for representing a string literal in the AST, then; Text is no good because it won't give us control over this kind of thing.

What do you think, say, a Python 3 backend should do? As far as I'm aware their strings behave more like sequences of code points.

@hdgarrood
Contributor Author

(Also 👍, this all sounds sensible to me).

@hdgarrood
Contributor Author

Just checked this again and I think my terminal was actually doing the replacement; I think psc is already doing what we want it to do, since this program seems to work:

module Main where

import Prelude
import Data.String
import Control.Monad.Eff (Eff)
import Control.Monad.Eff.Console (CONSOLE, logShow)

main :: forall e. Eff (console :: CONSOLE | e) Unit
main = do
  logShow $ length str
  logShow $ charCodeAt 0 str
  logShow $ charCodeAt 1 str


str = "\xDFFF\xD800"

Outputs:

2
(Just 57343)
(Just 55296)
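This is also what the generated JS effectively evaluates: the lone surrogates are kept as-is in the string, mirroring the PureScript output above.

```javascript
// JS equivalent of the PureScript program above: lone surrogates pass
// through untouched as two UTF-16 code units.
const str = "\uDFFF\uD800";
console.log(str.length);        // 2
console.log(str.charCodeAt(0)); // 57343 (0xDFFF)
console.log(str.charCodeAt(1)); // 55296 (0xD800)
```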

@hdgarrood
Contributor Author

Ok, so I guess we should probably add a test for invalid surrogate pairs / lone surrogates, but other than that I don't think there's anything else we really need to do?

@hdgarrood
Contributor Author

Oh oops, of course there's also this:

I think we should move the rendering of Unicode code points to UTF-16 surrogate pairs from the JS code generator to the internal representation of strings in the compiler. That will officially codify that representation as the PureScript representation and make it easy for all other backends to behave the same way.

@michaelficarra
Contributor

Yeah, I think that's the only action item here. And it'll automatically fix #2438. I'm still going to think about the interop with non-JS languages for a bit before making a PR.

michaelficarra added a commit to michaelficarra/purescript that referenced this issue Nov 27, 2016
michaelficarra added a commit to michaelficarra/purescript that referenced this issue Nov 28, 2016
michaelficarra added a commit to michaelficarra/purescript that referenced this issue Nov 29, 2016
@hdgarrood
Contributor Author

Come to think of it, we should probably ask for input from other backend maintainers if this is what we want to do. Will it be awkward for the C++/Erlang backends if PureScript specifies that the String type should represent a sequence of UTF-16 code units like this? @andyarvanitis @nwolverson

@andyarvanitis
Contributor

andyarvanitis commented Nov 29, 2016

I tend to favor UTF-8 (for various reasons), and was more-or-less supporting unicode via it in the C++ backend, but I would be ok with switching to UTF-16 if we want to standardize on the semantics on the PureScript side (sounds like we do). And I think operations being based on code units rather than points will be ok too. As an interop bonus, some popular native frameworks use this scheme anyway (Apple NSString, Qt QString, and possibly more).

@andyarvanitis
Contributor

Actually, I'm going to think about this one some more (for C++), but feel free to proceed with what you already have in mind for now.

@nwolverson
Contributor

Same here. I'm currently using lists of Unicode code points, and was thinking of moving to UTF-8 binaries (which I think is what sane libraries use); UTF-16 is possible, but I'm not sure about lone surrogates. Regardless, it's better to make this change so that the JS representation is precise, and other backends can either manage to comply with it or document a caveat...

paf31 pushed a commit that referenced this issue Dec 10, 2016

* fixes #2434: fixes #2438: PureScript chars must be UTF-16 code units

* add descriptions to StringEscape assertions; add one more assertion

* fix regressions introduced by #2418

* disable failing Unicode replacement character test for now
@nwolverson
Contributor

I've merged these changes and I have an implementation representing strings as sequences of UTF-16 code units, but I'm reluctant to proceed with it as I think it will give an unpleasant experience, requiring conversion to/from UTF-8 constantly and yet not behaving quite the same as JS where lone surrogates are concerned. I wonder how strongly people feel about consistency here?

@hdgarrood
Contributor Author

How come it doesn't quite behave the same as JS? I don't feel strongly; I could be persuaded either way at this point. Consistency is of course important, but if that ends up with us having a separate string type for every backend then clearly something has gone wrong.

@nwolverson
Contributor

@hdgarrood I can use a "binary" containing UTF-16 code units, but all (Unicode-aware) library functions expect UTF-8. Conversion from valid UTF-16 is fine, but in the case of lone surrogates (or anything otherwise invalid) it will of course fail, and that failure must be handled; if the U+FFFD replacement character is desired, the substitution has to be done by hand at runtime (e.g. for console output just now I'm using a quick truncation of the string before any invalid character). So basically all FFI functions involving strings will need to go via a runtime support function, and straightforward native library bindings will use some UTF8String type everywhere instead.

(As a counterpoint concatenation of separate halves of a surrogate pair would be fine. And PSString is perfectly usable)

@hdgarrood
Contributor Author

Ah right, I see.

I'd actually like to reconsider another option that we previously decided against in this issue: change the builtin String type so that it is guaranteed to be valid Unicode text with an unspecified encoding, and provide a JSString type in a library for the cases where people really do need the lone-surrogate behaviour that matches JS (which I suspect are very rare). In the JS backend we would probably still represent String as a JS string at runtime. We would then have to change a lot of the strings library, since many of the functions there operate at the code unit level: for example, splitAt would have to change so that it can't cut a surrogate pair in half, length would treat surrogate pairs as one character, and so on. We'd probably also have to change the Char type so that it could contain any Unicode code point, not just UTF-16 code units.

@andyarvanitis
Contributor

For efficiency, in the core strings library, you could still have all your index-based operations be based on code units, but just not specify what width those units are. So for example searching for a char/substring gets you an index, and then you can use that index to get to it with another function, but just don't expect it to be a "full character". Even code points have this problem, since you can have a letter followed by a number of accent marks to be applied to it.

Length/size functions should always return bytes, in my opinion, for related reasons.
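For comparison, here is how a byte-based length diverges from JS's code-unit length, using Node's Buffer to measure encoded sizes. This is only illustrative; the C++ backend would of course use its own UTF-8 machinery rather than Node.

```javascript
// One code point outside the BMP: three different "lengths".
const s = "\u{1D306}";

console.log(s.length);                        // 2 UTF-16 code units
console.log([...s].length);                   // 1 code point
console.log(Buffer.byteLength(s, "utf8"));    // 4 UTF-8 bytes
console.log(Buffer.byteLength(s, "utf16le")); // 4 UTF-16 bytes
```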

This is just if we went with a more generic unicode approach -- I can't speak to the javascript-specific needs.

Also, I've merged the PSString changes, and they're working fine for me too.

@andyarvanitis
Contributor

I now have a working version of purescript-strings for the C++ target that I'm testing out, and it is internally using utf8 (no changes to the *.purs files, as usual). All indexing is based on (effectively unspecified) code units -- which are bytes, in this case. As for Char, I'm now trying it out as a utf32 code point, even though literals are limited on the PS side. As @hdgarrood mentioned, maybe Char should be changed to that officially? Would it matter to javascript, since it doesn't really have a standalone char type?
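For what it's worth, JavaScript already exposes both views of a "character" even without a standalone char type: charCodeAt returns UTF-16 code units, while codePointAt returns full code points, which is what a UTF-32 Char as floated above would hold.

```javascript
// Code-unit view vs code-point view of the same position in a string.
const s = "\u{1D306}";
console.log(s.charCodeAt(0).toString(16));  // d834 (high surrogate unit)
console.log(s.codePointAt(0).toString(16)); // 1d306 (whole code point)
```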
