Treatment of surrogate pairs in string literals #2434
Refers to this bit, I think: `purescript/src/Language/PureScript/Pretty/JS.hs`, lines 158 to 162 at commit ebd7c3c.
I think we shouldn't allow surrogate pairs in PureScript string literals, but I've only read into the topic for the last two days and had never dealt with them before, so my opinion doesn't weigh much 🤷
Ah great, so I guess we already have that to deal with the case where people write string literals which include characters that must be encoded as surrogate pairs in UTF-16, but are written in the normal way as one code point, e.g. the RHS in that test that's causing problems in your Text PR: `surrogatePair = "\xD834\xDF06" == "\x1D306"`
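For reference, the standard UTF-16 decoding arithmetic shows why those two literals denote the same string. A minimal Haskell sketch (the function name is invented for illustration, not compiler code):

```haskell
import Data.Bits (shiftL)
import Data.Char (chr)

-- Combine a high surrogate (0xD800-0xDBFF) and a low surrogate
-- (0xDC00-0xDFFF) into the astral code point they encode.
combineSurrogates :: Int -> Int -> Char
combineSurrogates hi lo =
  chr (0x10000 + ((hi - 0xD800) `shiftL` 10) + (lo - 0xDC00))

-- combineSurrogates 0xD834 0xDF06 == '\x1D306'
```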
tl;dr: I don't think we can get away with language independence and full JavaScript interoperability simultaneously. But I do think we should make a clear decision about PureScript string semantics. I have a feeling this is going to be a long discussion. JavaScript strings are a bit strange. If you want to have full interoperability with JS programs (which we do), you need to be able to represent what we call "lone surrogates". See §6.1.4 of the ECMAScript spec, which defines the String type as arbitrary sequences of 16-bit unsigned integer values, with no requirement that those sequences form well-formed UTF-16.
Code points in the surrogate range, U+D800 to U+DFFF, are valid code points, but there would be no way to represent them using the UTF-16 encoding; when attempting to do so, you get those lone surrogates. I think the first thing we should talk about is whether PureScript strings are sequences of code points or of UTF-16 code units. I think we can all agree that we would ideally have PureScript strings be a sequence of code points, not code units. But it appears that the code point interpretation has surprising consequences once lone surrogates can be observed:

```purescript
let a = "\xD834"
let b = "\xDF06"
length a       -- 1
length b       -- 1
let c = a <> b -- "\x1D306"
length c       -- 1
```
Oh wow, I had no idea, thanks for this! In a theoretical world where PureScript strings are sequences of code points, would we want to disallow people from constructing strings like `a` and `b` above?
Sure, but then there goes our JavaScript interop.
Oh man, I just realised another surprising consequence: `length (a <> b) == length (b <> a)` is no longer necessarily true. @paf31 I see you've marked this as approved/bug. I didn't realise we'd identified an action item yet. Mind clarifying those labels?
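To make the asymmetry concrete, here is a small Haskell sketch (the function name is invented for illustration) of counting code points over UTF-16 code units, pairing surrogates wherever they happen to line up:

```haskell
import Data.Word (Word16)

-- Count code points in a sequence of UTF-16 code units, merging a
-- high surrogate immediately followed by a low surrogate into one
-- code point.
codePointLength :: [Word16] -> Int
codePointLength (hi : lo : rest)
  | hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF
  = 1 + codePointLength rest
codePointLength (_ : rest) = 1 + codePointLength rest
codePointLength []         = 0

-- With a = [0xD834] and b = [0xDF06]:
--   codePointLength (a ++ b) == 1  -- the surrogates pair up
--   codePointLength (b ++ a) == 2  -- reversed, both stay lone
```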
Good point, I've moved it to discussion. |
I suppose everyone is going to expect …
@hdgarrood I think you're right; we really need seamless JS FFI for string values. In that case, I think we should move the rendering of Unicode code points to UTF-16 surrogate pairs from the JS code generator into the compiler's internal representation of strings. That will officially codify that representation as the PureScript representation, and make it easy for all other backends to behave the same way.
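The encoding direction is the inverse arithmetic. A rough Haskell sketch of what such an internal rendering step might look like (assumed names, not the compiler's actual implementation):

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- Render one code point as UTF-16 code units: code points above
-- 0xFFFF become a surrogate pair, everything else a single unit.
toCodeUnits :: Char -> [Word16]
toCodeUnits c
  | n <= 0xFFFF = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))
                  , fromIntegral (0xDC00 + (m .&. 0x3FF))
                  ]
  where
    n = ord c
    m = n - 0x10000

-- toCodeUnits '\x1D306' == [0xD834, 0xDF06]
```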
I guess we should be using … What do you think, say, a Python 3 backend should do? As far as I'm aware, their strings behave more like sequences of code points.
(Also 👍, this all sounds sensible to me.)
Just checked this again, and I think my terminal was actually doing the replacement:

```purescript
module Main where

import Prelude
import Data.String
import Control.Monad.Eff (Eff)
import Control.Monad.Eff.Console (CONSOLE, logShow)

str :: String
str = "\xDFFF\xD800"

main :: forall e. Eff (console :: CONSOLE | e) Unit
main = do
  logShow $ length str
  logShow $ charCodeAt 0 str
  logShow $ charCodeAt 1 str
```

Outputs:
2
(Just 57343)
(Just 55296)
Ok, so I guess we should probably add a test for invalid surrogate pairs / lone surrogates, but other than that I don't think there's anything else we really need to do?
Oh oops, of course there's also this: …
Yeah, I think that's the only action item here. And it'll automatically fix #2438. I'm still going to think about the interop with non-JS languages for a bit before making a PR.
Referenced commits: "…e UTF-16 code units" (three commits)
Come to think of it, we should probably ask for input from other backend maintainers if this is what we want to do. Will it be awkward for the C++/Erlang backends if PureScript specifies that the `String` type is a sequence of UTF-16 code units?
I tend to favor UTF-8 (for various reasons), and was more or less supporting Unicode via it in the C++ backend, but I would be ok with switching to UTF-16 if we want to standardize the semantics on the PureScript side (it sounds like we do). And I think operations being based on code units rather than code points will be ok too. As an interop bonus, some popular native frameworks use this scheme anyway (Apple's `NSString`, for instance).
Actually, I'm going to think about this one some more (for C++), but feel free to proceed with what you already have in mind for now.
Same here; I'm currently using lists of Unicode code points, and was thinking of moving to UTF-8 binaries (which I think is what sane libraries use); UTF-16 is possible, but I'm not sure about lone surrogates. Regardless, it's better to make this change so that the JS representation is made precise; other backends can either manage to comply with it or document a caveat...
I've merged these changes, and I have an implementation representing strings as sequences of UTF-16 code units, but I'm reluctant to proceed with it, as I think it will give an unpleasant experience: it requires conversion to/from UTF-8 constantly, and yet doesn't behave quite the same as JS where lone surrogates are concerned. I wonder how strongly people feel about consistency here?
How come it doesn't quite behave the same as JS? I don't feel strongly; I could be persuaded either way at this point. Consistency is of course important, but if we end up with a separate string type for every backend then clearly something has gone wrong.
@hdgarrood I can use a "binary" containing UTF-16 code units, but all the (Unicode-aware) library functions expect UTF-8. Conversion from valid UTF-16 is fine, but in the case of lone surrogates (or anything otherwise invalid) it will of course fail, and that failure must be handled somehow. (As a counterpoint, concatenating the separate halves of a surrogate pair would be fine. And PSString is perfectly usable.)
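For illustration, a hedged Haskell sketch of the kind of lenient decoding this would require, replacing lone surrogates with U+FFFD instead of failing (helper names are invented):

```haskell
import Data.Bits (shiftL)
import Data.Char (chr)
import Data.Word (Word16)

isHigh, isLow :: Word16 -> Bool
isHigh u = u >= 0xD800 && u <= 0xDBFF
isLow  u = u >= 0xDC00 && u <= 0xDFFF

-- Decode UTF-16 code units into code points, substituting U+FFFD
-- for any lone surrogate instead of failing outright.
decodeLenient :: [Word16] -> String
decodeLenient (hi : lo : rest)
  | isHigh hi && isLow lo =
      chr (0x10000 + ((fromIntegral hi - 0xD800) `shiftL` 10)
                   + (fromIntegral lo - 0xDC00)) : decodeLenient rest
decodeLenient (u : rest)
  | isHigh u || isLow u = '\xFFFD' : decodeLenient rest
  | otherwise           = chr (fromIntegral u) : decodeLenient rest
decodeLenient [] = []

-- decodeLenient [0xD834, 0xDF06] == "\x1D306"
-- decodeLenient [0xDF06, 0xD834] == "\xFFFD\xFFFD"
```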
Ah right, I see. I'd actually like to reconsider another option that we previously decided against in this issue, where we would change the builtin …
For efficiency, in the core strings library, you could still have all your index-based operations be based on code units, just without specifying what width those units are. So, for example, searching for a char/substring gets you an index, and then you can use that index to get to it with another function; just don't expect it to be a "full character". Even code points have this problem, since you can have a letter followed by a number of accent marks to be applied to it. Length/size functions should always return bytes, in my opinion, for related reasons. This is just if we went with a more generic Unicode approach; I can't speak to the JavaScript-specific needs. Also, I've merged the …
I now have a working version of …
After looking at #2418 I realised that we allow invalid surrogate pairs in string literals. For example, the literal `"\xDFFF\xD800"` produces the following: `"��"`. That is, two copies of U+FFFD, the Unicode replacement character. At the moment, valid surrogate pairs come out how you would expect them to (although I expect backends other than JS will handle this less gracefully).
I think we should almost certainly throw an error in the parser if a string literal contains an invalid surrogate pair, and also consider throwing an error if a string literal contains any escape sequence for a code point in the range reserved for surrogates, U+D800 to U+DFFF; that is, we should also consider throwing errors even for valid surrogate pairs.
Ideally, I think string literals in PureScript source code would be represented in the compiler as the precise values they refer to; e.g. if you wrote `"\x1F4A1"` in a PureScript source file, the compiler would represent it internally as that actual code point (using some encoding which I don't think is relevant to this discussion). Then whichever backend is being used would be responsible for writing that value out as a string literal which would be interpreted correctly by whatever language implementation ends up reading it (usually JS, but potentially many others). The reason I would vote for erroring even on valid surrogate pairs is that allowing them suggests that strings are always going to be encoded as UTF-16 at runtime, which is true for JS, but probably not for other backends. It also makes the implementation of the lexer simpler. And I think it makes more sense to put any awkward code for handling surrogate pairs in the JS code generator (as opposed to the lexer), as dealing with this kind of encoding issue in generated code is specific to the backend you're targeting.
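A minimal sketch of the check proposed above, as it might look in Haskell (hypothetical function, not the actual lexer):

```haskell
import Data.Char (chr)

-- Validate the code point denoted by an \x escape sequence,
-- rejecting the range reserved for surrogates outright.
checkEscape :: Int -> Either String Char
checkEscape n
  | n >= 0xD800 && n <= 0xDFFF = Left "surrogate code point in string literal"
  | n > 0x10FFFF               = Left "escape sequence out of range"
  | otherwise                  = Right (chr n)
```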
Relevant background info from Wikipedia: …
/cc @kritzcreek @michaelficarra @andyarvanitis