Represent PureScript strings internally as sequence of Word16 instead of Text #2488

Merged

5 commits merged into purescript:master from michaelficarra:string-newtypes, Jan 4, 2017

Conversation

5 participants
@michaelficarra (Contributor) commented Dec 18, 2016

This is a WIP. I've done all the work of replacing Text with PSString where a string value was being represented and Label where a record label was being represented. I still need to change out the underlying representation of a label to similarly use a sequence of Word16, but that will be very easy.

The reason I'm opening this PR in this failing state is that I've run into an issue that I can't figure out. Most member accesses are rendering just fine, except module."exports" and $foreign."warn"(Data_Show."show"(dictShow)(a));. These must be generated in some other way. I'm guessing it's bundler related. If so, @hdgarrood might be able to help?

@hdgarrood (Contributor) commented Dec 18, 2016

I will gladly review this but I won't be able to do so for a few days, as I'm busy with exams right now.

@hdgarrood

I'm not really sure about the member access issue you've run into. The tests all appear to be running fine apart from the string escapes one, which presumably means that the case you've described isn't being hit by the compiler test suite. Do you know why that might be the case? That is, do you know the circumstances that lead to bad member accesses like that?

src/Language/PureScript/Parser/Lexer.hs
@@ -269,10 +257,10 @@ parseToken = P.choice
}
parseStringLiteral :: Lexer u Text
- parseStringLiteral = blockString <|> T.pack <$> concatMap expandAstralCodePointToUTF16Surrogates <$> PT.stringLiteral tokenParser
+ parseStringLiteral = T.pack <$> (blockString <|> PT.stringLiteral tokenParser)

@hdgarrood (Contributor) commented Dec 21, 2016

AFAICT this will suffer from the same issue as before, as T.pack is still being used. I think parseStringLiteral should ideally be Lexer u PSString, or if that's not workable for whatever reason, Lexer u String and then convert it to a PSString later. It's only by attempting to convert to Text that we lose information, I think.

The other uses of mkString in the parser make me a bit nervous too, for the same reason; is it possible to get rid of those?

src/Language/PureScript/Terms.hs
+mkString = fromString . T.unpack
+
+-- TODO: remove me, replace with toLabel + possibly others
+codePoints :: PSString -> Text

@hdgarrood (Contributor) commented Dec 21, 2016

Does it make sense to have this return a Maybe Text, since it attempts to convert a PSString into a Text, but not all PSString values can be represented as Text (specifically, PSStrings which contain lone surrogates)?

@paf31 changed the title from "represent PureScript strings internally as sequence of Word16 instead of Text" to "Represent PureScript strings internally as sequence of Word16 instead of Text" on Dec 23, 2016

@michaelficarra (Contributor) commented Dec 29, 2016

@hdgarrood Looks like I was having some sort of caching issue. Blowing away the repo and re-cloning got it building properly again. Oddly, deleting .stack-work and ~/.stack was not enough. I've addressed most of your concerns, and I'm pretty comfortable with the state of things. Two things remain:

  1. I'd like to get rid of pretty much all the usages of codePoints besides the one in Label's Show implementation. I'm not sure how possible that will be.

  2. I'm getting MissingFFIModule errors in some of the tests. I'm not really sure how to debug it. You can check it out in the CI failure. It's reporting

    The foreign module implementation for module Data.Show is missing.

edit: Never mind, these failures were again only on my machine and could be resolved by deleting the .stack-work directory and rebuilding.

@andyarvanitis (Contributor) commented Dec 29, 2016

Your changes currently provide it, but if you can keep codePoints or equivalent available for use in the (backends') codegen, that would be very helpful.

@michaelficarra (Contributor) commented Dec 29, 2016

@andyarvanitis toUTF16CodeUnits is probably what you want to actually use. Centralising that function so all the backends could use it for their codegen was one of the major goals of this effort.

@andyarvanitis (Contributor) commented Dec 29, 2016

Well, I've been looking into it a bit recently, and I might need to continue to render to UTF-8 (and individual chars to UTF-32), so a convenient way to get code points would be good to have. Worst case, I could certainly work with UTF-16 units (and convert to points myself), but other backends might need to do the same, so a common function would be nice.
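
For concreteness, here is a standalone sketch of the conversion a backend could do on its own if only toUTF16CodeUnits is provided. The name toCodePoints and the U+FFFD substitution policy are illustrative assumptions, not part of this PR; a real backend might prefer to reject lone surrogates instead.

import Data.Char (chr)
import Data.Word (Word16)

-- Fold UTF-16 code units back into code points, substituting U+FFFD for any
-- lone surrogate that cannot be paired.
toCodePoints :: [Word16] -> String
toCodePoints (h : l : rest)
  | h >= 0xD800 && h <= 0xDBFF && l >= 0xDC00 && l <= 0xDFFF =
      chr (0x10000 + (fromIntegral h - 0xD800) * 0x400 + (fromIntegral l - 0xDC00)) : toCodePoints rest
toCodePoints (w : rest)
  | w >= 0xD800 && w <= 0xDFFF = '\xFFFD' : toCodePoints rest  -- lone surrogate
  | otherwise                  = chr (fromIntegral w) : toCodePoints rest
toCodePoints [] = []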

@andyarvanitis (Contributor) commented Dec 29, 2016

Also, CC'ing @nwolverson, since I think the Erlang world often deals in UTF-8.

@michaelficarra (Contributor) commented Dec 30, 2016

@andyarvanitis Do your Data.String implementations treat strings as lists of UTF-16 code units? If not, that would be an incompatibility with the JS backend. If so, you must be doing that re-encoding at runtime, right?

@andyarvanitis (Contributor) commented Dec 30, 2016

@michaelficarra, they don't actually exist yet (no C++ purescript-strings yet), but I'm aware of and in agreement with the current approach, and my plan is to keep the "standard" operations in that package (and related ones) based on UTF-16 code units. However, there are places – for most widely compatible C/C++ – where UTF-8 will be best, and may need to be the default in its "core" codegen. Additionally, there's no saying you couldn't have code-points-based operations in non-standard (or even the standard) strings packages, even if you have UTF-16(ish) javascript code-unit ones there too. For C++, I do somewhat predict separate strings packages for UTF-8 and UTF-16, to accommodate different frameworks.

But really, I'm just looking for flexibility and convenience here, for now, since I think there's still some work to be done for C++/native: unlike JS, there really is no Unicode encoding standard there (though UTF-8 seems to work best for most "vanilla" things). Again, I can probably work around it, but this could affect other backends in the same way.

@michaelficarra (Contributor) commented Dec 30, 2016

@andyarvanitis How would you want to codegen lone surrogates? They don't really have a meaning when you consider a string to be a sequence of code points. They aren't just the code point with the same index. They only have meaning as UTF-16 code units.

@andyarvanitis (Contributor) commented Dec 30, 2016

Actually, if I understand correctly, they aren't even valid UTF-16 code units, right? I don't know how often they get manually constructed, but if they're valid pairs by the time they need to get rendered, any Unicode encoding will do. But it's true that if they need to get passed all the way through as lone surrogates, then I think the backend will need to decide how to render them, with replacement chars being one option. Ultimately, the backend needs to handle rendering to escape codes too, since they can be different formats in different backend languages.

@michaelficarra (Contributor) commented Dec 30, 2016

They're valid code units. UTF-16 code units are numbers in the range 0 through 0xFFFF, inclusive. You're thinking of how lone surrogates would be malformed UTF-16 encoded code points, which is true. But that's why they're not sequences of code points encoded using UTF-16. They're sequences of numbers in the range 0 to 0xFFFF. Those sequences usually are valid UTF-16 encodings of a sequence of code points, but not necessarily, as is the case with lone surrogates. That's why this whole re-encoding thing doesn't really make sense. They really need to be backed by something that is isomorphic to a sequence of UTF-16 code units.

It's a bit inconvenient. I really wish we could have strings just be sequences of Unicode code points, but we would lose full interop with arbitrary JS strings.
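
To make the code unit / code point distinction concrete, here is a minimal sketch of the standard surrogate-pair arithmetic. It is independent of the PR's own encodeUTF16 helper; the name and shape are for illustration only.

import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- Encode one code point as UTF-16 code units: BMP characters take a single
-- unit, astral characters take a lead/trail surrogate pair.
utf16Units :: Char -> [Word16]
utf16Units c
  | n <= 0xFFFF = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))  -- lead surrogate
                  , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]  -- trail surrogate
  where
    n = ord c
    m = n - 0x10000

-- utf16Units '\x1D306' == [0xD834, 0xDF06]; the single unit 0xD834 is a legal
-- Word16, but on its own it is not the encoding of any code point.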

@andyarvanitis (Contributor) commented Dec 30, 2016

Ah, ok, that's right. Well, I think my own last comments point back to each backend having to make some of those decisions anyway, so there really isn't a good reason to try to keep a common "to-points" function either way.

@michaelficarra (Contributor) commented Dec 30, 2016

Agreed. I should add a cautionary comment to the function that says that lone surrogates are represented in the result as the reserved code point with that index.

src/Language/PureScript/PSString.hs
+ toJSON = A.toJSON . map toChar . toUTF16CodeUnits
+
+renderJSON :: PSString -> Text
+renderJSON s = T.pack $ "\"" <> concatMap encodeChar (toUTF16CodeUnits s) <> "\""

@hdgarrood (Contributor) commented Dec 30, 2016

I think it's preferable to avoid String as much as possible, and I think we can almost avoid it entirely here by switching concatMap for T.concatMap, changing the type signature of encodeChar to Word16 -> Text, and switching the pure to T.singleton. I'm not aware of a Text equivalent of showHex so I think we might also need to define something like showHex = T.pack . Numeric.showHex.

@michaelficarra (Contributor) commented Dec 30, 2016

Unfortunately, T.concatMap takes a Text as its second parameter, so I'll have to keep this the way it is.

@hdgarrood (Contributor) commented Dec 30, 2016

Ok, in that case foldMap should work, right?

@michaelficarra (Contributor) commented Dec 30, 2016

Done.
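
For reference, a sketch of the Text-based shape agreed on here, assuming a small showHex wrapper like the one suggested above. This is an approximation of the idea, not the exact code that was committed, and it escapes every code unit rather than only the ones that need it.

{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as T
import Data.Word (Word16)
import qualified Numeric

showHex :: Word16 -> Text
showHex x = T.pack (Numeric.showHex x "")

-- Render a list of UTF-16 code units (standing in for toUTF16CodeUnits of a
-- PSString) as a double-quoted JSON string, escaping every unit as \uXXXX.
renderJSONUnits :: [Word16] -> Text
renderJSONUnits units = "\"" <> foldMap encodeChar units <> "\""
  where
    encodeChar :: Word16 -> Text
    encodeChar c = "\\u" <> T.justifyRight 4 '0' (showHex c)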

src/Language/PureScript/PSString.hs
+ encodeUTF16 c = [toWord $ fromEnum c]
+
+instance A.ToJSON PSString where
+ toJSON = A.toJSON . map toChar . toUTF16CodeUnits

@hdgarrood (Contributor) commented Dec 30, 2016

I wonder if it might be safer to use A.toJSON . toUTF16CodeUnits here, since JSON parsers (especially those written in languages other than JavaScript) might assume UTF-8?

@hdgarrood (Contributor) commented Dec 30, 2016

Also, if the ToJSON is being defined manually, I think the FromJSON should be, too, otherwise we risk them not behaving in the same way. In fact I think the derived FromJSON instance is going to expect PSStrings to be encoded as if with A.toJSON . toUTF16CodeUnits; with this code I think encoding a PSString to JSON and then trying to parse it again is not going to work.
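
A self-contained sketch of the symmetric pair of instances being asked for; the newtype and field name here are stand-ins rather than the PR's actual module. Encoding a string as its UTF-16 code units means decoding gives back exactly the same value, lone surrogates included.

import qualified Data.Aeson as A
import Data.Word (Word16)

newtype UnitsString = UnitsString { unitsOf :: [Word16] }
  deriving (Eq, Show)

instance A.ToJSON UnitsString where
  toJSON = A.toJSON . unitsOf

instance A.FromJSON UnitsString where
  parseJSON = fmap UnitsString . A.parseJSON

-- Round trip, including a lone surrogate:
--   A.decode (A.encode (UnitsString [0xD800, 0x61])) == Just (UnitsString [0xD800, 0x61])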

_ <- C.indented *> equals
value <- C.indented *> parseValue
return (name, value)
parseAccessor :: Expr -> TokenParser Expr
parseAccessor (Constructor _) = P.unexpected "constructor"
-parseAccessor obj = P.try $ Accessor <$> (C.indented *> dot *> C.indented *> (lname <|> stringLiteral)) <*> pure obj
+parseAccessor obj = P.try $ Accessor <$> (C.indented *> dot *> C.indented *> ((mkString <$> lname) <|> stringLiteral)) <*> pure obj

@hdgarrood (Contributor) commented Dec 30, 2016

((mkString <$> lname) <|> stringLiteral) appears in a few places; does it make sense to give that a name and put it in Parser.Common?

src/Language/PureScript/Pretty/JS.hs
- objectPropertyToString s | identNeedsEscaping s = string s
- | otherwise = emit s
+ objectPropertyToString :: (Emit gen) => PSString -> gen
+ objectPropertyToString s | identNeedsEscaping $ T.pack $ codePoints s = string s

@hdgarrood (Contributor) commented Dec 30, 2016

While this implementation will do the right thing as far as I can tell, it makes me a bit uneasy since T.pack . codePoints can introduce replacement characters. If I understand correctly, I think the following is a bit closer to how I might describe what should happen here:

-- | Attempt to decode a PSString as UTF-16 encoded text. This will fail if e.g. the string contains lone surrogates.
codePoints :: PSString -> Maybe Text

objectPropertyToString s =
  case codePoints s of
    Just s' | not (identNeedsEscaping s') ->
      emit s'
    _ ->
      string s

(perhaps if we were to do it this way, decode or decodeText might be a better name for codePoints).

That is, the only case where we can safely emit an object property without using a string literal is when a) the string does represent valid UTF-16 encoded text and b) that text is a valid JS identifier (i.e. doesn't contain primes or anything like that). In all other cases, we need to emit it as a string literal.

I can't remember if I mentioned this already but exporting codePoints without using Maybe in the return type makes me uneasy because it's not always possible to consider a PSString as a sequence of code points (as you've already pointed out elsewhere). Is it feasible to use Maybe?

@michaelficarra (Contributor) commented Dec 30, 2016

@hdgarrood

I can't remember if I mentioned this already but exporting codePoints without using Maybe in the return type makes me uneasy because it's not always possible to consider a PSString as a sequence of code points (as you've already pointed out elsewhere).

That's correct, but this implementation represents lone surrogates in the PSString input as the reserved code point with the same index in the String output. So the function is total, though a bit opaque. As discussed with @andyarvanitis above, I'll add an explanatory comment. I don't think a Maybe makes the function any better.

How do you feel about adding another exposed function called hasLoneSurrogates or isValidUTF16? Then the guard for objectPropertyToString could be even more readable.
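
A standalone sketch of the kind of predicate being proposed here; the name isValidUTF16 and operating directly on [Word16] rather than PSString are assumptions. A sequence of code units is valid UTF-16 exactly when every lead surrogate is immediately followed by a trail surrogate and no trail surrogate appears on its own.

import Data.Word (Word16)

isLead, isTrail :: Word16 -> Bool
isLead  w = w >= 0xD800 && w <= 0xDBFF
isTrail w = w >= 0xDC00 && w <= 0xDFFF

-- True iff the code units form a well-formed UTF-16 sequence.
isValidUTF16 :: [Word16] -> Bool
isValidUTF16 []       = True
isValidUTF16 (w : ws)
  | isLead w  = case ws of
      (w' : ws') | isTrail w' -> isValidUTF16 ws'
      _                       -> False
  | isTrail w = False
  | otherwise = isValidUTF16 ws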

@hdgarrood (Contributor) commented Dec 30, 2016

I am not sure that we should have any function that treats lone surrogates in that way, as it's so easy to convert them to U+FFFD by then doing Data.Text.pack on the result (and also we should be avoiding String). I think a Maybe variant would make it much easier to avoid accidentally losing information in this way.

As far as I can tell, there are no more than three things we might need to do with a PSString:

  • Render it as a JS string with appropriate escapes in the JS backend
  • Give it to an alternate backend in the form of UTF-16 code units
  • Render it in an error message produced by the compiler

As far as I can tell, having decodeText :: PSString -> Maybe Text and renderJSON covers the first, and toUTF16CodeUnits covers the second. For the third I think we just want to obtain a PS string literal that would parse to the same PSString value, which I think means we want a function that's similar to renderJSON except which uses PS escapes.

Given all this, and assuming I haven't missed anything, I don't see a need for this codePoints function.

@michaelficarra (Contributor) commented Dec 30, 2016

As far as I can tell, there are no more than three things we might need to do with a PSString:

We also

  • rewrite the program to use the string as a label
  • derive identifier names from the string
  • pretty-print PureScript code

@hdgarrood (Contributor) commented Dec 30, 2016

Ok, cool. Is there any part of that where we need the current behaviour of codePoints, i.e. representing lone surrogates in that way?

@@ -694,11 +697,14 @@ mkVarMn mn = Var . Qualified mn
mkVar :: Ident -> Expr
mkVar = mkVarMn Nothing
+labelToIdent :: Label -> Ident
+labelToIdent (Label l) = Ident $ T.pack $ codePoints l

@hdgarrood (Contributor) commented Dec 30, 2016

Similarly to my review comments in the JS pretty-printer, I worry that this function is partial in some sense. In particular I think there's a risk of it introducing Unicode replacement characters in derived Generic instances for record types with awkward labels. I think we should probably add a test for deriving Generic with records with awkward labels, i.e. labels which are not valid JS identifiers and also labels which are not valid UTF-16, if we don't already have one.

src/Language/PureScript/Label.hs
+ deriving (Eq, Ord, IsString, Monoid, A.ToJSON, A.FromJSON)
+
+instance Show Label where
+ show (Label s) = either (const $ show s) (const cs) $ runParser lname $ T.pack cs

@hdgarrood (Contributor) commented Dec 30, 2016

Why not use a derived Show instance for Label?

@michaelficarra (Contributor) commented Dec 30, 2016

I was trying to have it be suitable for pretty-printing PureScript code, like above. I'll change it to use deriving and export a separate function for pretty-printing.

src/Language/PureScript/PSString.hs
+ deriving (Eq, Ord, A.FromJSON, Monoid)
+
+instance Show PSString where
+ show = show . codePoints

@hdgarrood (Contributor) commented Dec 30, 2016

Have you considered defining show as T.unpack . renderJSON instead?

@michaelficarra (Contributor) commented Dec 30, 2016

renderJSON uses JSON/JS escape sequences. show uses Haskell/PureScript escape sequences. This is supposed to be used in error reporting and pretty-printing PureScript code. Using renderJSON will kind of work for the former, but definitely not the latter.

@hdgarrood (Contributor) commented Dec 30, 2016

Ah, I see. My response would be: please let's not use Show for either of those things. We used to do that for types like Ident and it caused a few bugs - I remember a conscious effort to remove all uses of show for that kind of thing a while back. I think the Show instances for types we define in the compiler like this one should only be there so that we can work with things easily in the REPL (perhaps @paf31 or @garyb could confirm that this is the approach we want to take?)

@michaelficarra (Contributor) commented Dec 30, 2016

I would love to do that. Incorrect assumptions about show are the source of so many bugs.

src/Language/PureScript/Pretty/Common.hs
+prettyPrintObjectKey = prettyPrintLabel . Label
+
+prettyPrintLabel :: Label -> Text
+prettyPrintLabel = T.pack . show

@hdgarrood (Contributor) commented Dec 30, 2016

Can we avoid show here? I tend to prefer monomorphic versions of functions where possible. In particular, the old code might have been forced to use show as that might have been the only way to render the Text in the desired manner, but now, renderJSON on the underlying PSString ought to do the right thing, right?

Edit: To clarify, I tend to prefer monomorphic versions of functions when you're using dodgy (lawless) type classes like Show. We've been bitten in the compiler by using show in 'real' code before; ideally I think Show should only be used for displaying things in the REPL.

src/Language/PureScript/Pretty/JS.hs
@@ -182,14 +161,13 @@ conditional = mkPattern match
accessor :: (Emit gen) => Pattern PrinterState JS (gen, JS)
accessor = mkPattern match
where
- match (JSAccessor _ prop val) = Just (emit prop, val)
+ match (JSAccessor _ prop val) = Just (emit $ T.pack $ codePoints prop, val)

@hdgarrood (Contributor) commented Dec 30, 2016

Should this be using the same objectPropertyToString function as above?

@michaelficarra (Contributor) commented Dec 30, 2016

No, I don't think so. I believe JSAccessor represents static member access (a.b). There's another constructor for computed member access (a[b]) called JSIndexer or something. Should we make JSAccessor accept any property name and automatically render it using computed member access if static member access would be syntactically invalid?

@hdgarrood (Contributor) commented Dec 30, 2016

I'm actually not too familiar with this part of the compiler so I'm not sure. Perhaps this field should remain as Text, not PSString, if there's no need to accommodate strings which aren't valid UTF-16 (which presumably there isn't with static property access like this)?

@michaelficarra (Contributor) commented Dec 30, 2016

I confirmed that my suspicions were correct. Instead of switching JSAccessor to use Text, I think we should remove it entirely. Then when the Expr in JSIndexer is a StringLiteral that matches ECMAScript's IdentifierName grammar production, render it using static member access. This has the side benefit of using static member access where an unnecessary computed member access would have been generated previously. I'll open an issue and send a separate PR for it.
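
A simplified sketch of the rendering rule being described. The predicate below only approximates ECMAScript's IdentifierName production (it ignores reserved words, Unicode escapes, and non-ASCII identifier characters), and show stands in for proper JS string escaping.

{-# LANGUAGE OverloadedStrings #-}
import Data.Char (isAlpha, isAlphaNum)
import Data.Text (Text)
import qualified Data.Text as T

isSimpleIdentifierName :: Text -> Bool
isSimpleIdentifierName t = case T.uncons t of
  Just (c, rest) -> (isAlpha c || c == '_' || c == '$')
                    && T.all (\x -> isAlphaNum x || x == '_' || x == '$') rest
  Nothing        -> False

-- Prefer static member access (obj.prop) when the property name is a plain
-- identifier; otherwise fall back to computed access (obj["prop"]).
renderAccessor :: Text -> Text -> Text
renderAccessor obj prop
  | isSimpleIdentifierName prop = obj <> "." <> prop
  | otherwise                   = obj <> "[" <> T.pack (show prop) <> "]"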

@michaelficarra (Contributor) commented Dec 30, 2016

@hdgarrood I believe I've addressed all of your comments now, and I've replaced codePoints with a version that returns an Either. I think it's a lot better now. CI should pass and this should be ready to merge as soon as it gets a positive review and a squash.

edit: Of course, right after posting that, I remembered I wanted to add some tests for labels with astral code points and with lone lead/trail surrogates.

@hdgarrood (Contributor) commented Dec 30, 2016

I came up with this test for deriving old-style generics and it does fail in the way that I was worried it might, based on seeing the use of T.pack in labelToIdent: https://gist.github.com/hdgarrood/bee8bd41f59576aaf3b9ae88dbd4b946. I'm fairly sure the problem is that the labelToIdent function needs to be injective and currently isn't, because of the replacement of invalid scalar values that Data.Text.pack is doing. We might consider just dropping support for old-style generics instead, though, rather than worrying about fixing this.

The Either version is definitely an improvement but I still think it's too dangerous, i.e. it's too tempting to call Data.Text.pack on the resulting Left String obtained from codePoints. I now see why the current behaviour of codePoints is useful in defining the Show instance for PSString and for pretty-printing PSString values, but outside of those two uses I don't see any reason to have it, and so I would rather codePoints were not exported from the module that defines it. If I haven't been able to make myself clear I'd be happy to have a crack at this myself; perhaps it will be easier to express as code rather than prose.

@hdgarrood (Contributor) commented Dec 30, 2016

Also, I think we should change the ToJSON instance to use A.toJSON . toUTF16CodeUnits (i.e. use the derived ToJSON and FromJSON instances), since although it might be permissible according to the JSON spec to have lone surrogates in JSON strings, I expect that many JSON parsers won't be able to deal with them. Haskell's aeson is one such example:

λ: decode ("{\"\\u1000\": 0}" :: Data.ByteString.Lazy.ByteString) :: Maybe Value
Just (Object (fromList [("\4096",Number 0.0)]))

λ: decode ("{\"\\ud800\": 0}" :: Data.ByteString.Lazy.ByteString) :: Maybe Value
Nothing

I also think the array of integers format is likely to be easier for alternative backends to deal with correctly.

@michaelficarra (Contributor) commented Jan 2, 2017

@hdgarrood I've added your test and updated the ToJSON/FromJSON instances for PSString as you suggested. I've also given you access to my fork so feel free to push additional commits to this branch as you like.

Tidying up: represent PureScript strings as sequence of Word16
Changes:

* Expand tests for string edge-cases
* Remove use of `show` while printing errors
* Remove `codePoints` from the export list of Language.PureScript.PSString
* Fix an issue with derived Generic instances where colliding Idents
  were being generated
* Change CoreFn/ToJSON so that invalid JSON strings (i.e. invalid
  UTF-16) will not be generated, since relatively few JSON parsers can
  cope with them (aeson, for example, cannot)
* Various function renaming and rearranging to better match
  existing conventions inside the compiler.

Unfortunately we are forced to break the CoreFn JSON format with this
change, as there is no way of generating strings that reliably parse to
the value we want if strings are allowed to include invalid UTF-16.

The CoreFn JSON changes in the following ways:

* String literals are now generated as arrays of integers, where each
  integer is between 0 and 0xFFFF and represents one UTF-16 code unit
  (previously they were generated as JSON strings).
* Record literals are now generated as an array of pairs (two-element
  arrays), where the first element is the key, generated as an array of
  code units just like string literals, and the second element is the
  value.
@hdgarrood (Contributor) commented Jan 3, 2017

I'm happy with this now. @garyb @paf31, could one of you please take a look?

src/Language/PureScript/Errors.hs
renderSimpleErrorMessage (AdditionalProperty prop) =
- line $ "Type of expression contains additional label " <> markCode prop <> "."
+ line $ "Type of expression contains additional label " <> prettyPrintLabel prop <> "."

@hdgarrood (Contributor) commented Jan 3, 2017

Oh actually, perhaps this should be markCode (prettyPrintLabel prop)?

@paf31 (Member) commented Jan 4, 2017

Yes please

@@ -733,11 +737,22 @@ mkVarMn mn = Var . Qualified mn
mkVar :: Ident -> Expr
mkVar = mkVarMn Nothing
+-- This function may seem a little obtuse, but it's only this way to ensure
+-- that it is injective. Injectivity is important here; without it, we can end
+-- up with accidental variable shadowing in the generated code.

@hdgarrood (Contributor) commented Jan 3, 2017

Perhaps I've made an incorrect assumption about how Ident generation works, actually; I'm not completely sure whether this comment is correct. If it really worked how I am imagining then I think lamNull above should break for the same reason that the previous code did?

hdgarrood added some commits Jan 3, 2017

Fix psci tests
* Stop accidentally including tests from support modules
* Update support module list
@michaelficarra (Contributor) commented Jan 3, 2017

Thanks @hdgarrood. The changes look great.

@hdgarrood (Contributor) commented Jan 3, 2017

I think you deserve most of the credit, since you did most of the work! :)

@paf31 (Member) commented Jan 4, 2017

This looks good to me, and is very easy to read, so thank you both!

@andyarvanitis @nwolverson Do you have any additional comments on this from the perspective of alternative backends?

@paf31 approved these changes Jan 4, 2017

@hdgarrood (Contributor) commented Jan 4, 2017

Another potential approach has just occurred to me: we might be able to say that the String type is UTF-16 encoded text, that the functions in Data.String operate on the code-unit level, and that the behaviour (as far as the language is concerned) is undefined when the code units in a String are not valid UTF-16. That effectively leaves it up to each individual backend to decide what to do; some backends could refuse to compile code which contained lone surrogates in string literals, for instance. We could then say that the JS backend handles invalid UTF-16 the same way as JS does, i.e. it's fine, preserving JS compatibility for people who need it. This ought to leave alternate backends free to choose any Unicode encoding for their String representations, and therefore I think it would help avoid a situation where any code that involves strings is basically unportable between different backends.

Thankfully this PR is pretty much exactly what we would want to do in either case: this new potential approach or what we agreed on earlier. :)

@hdgarrood merged commit 086bbe6 into purescript:master Jan 4, 2017

1 of 2 checks passed

continuous-integration/travis-ci/pr: The Travis CI build is in progress
continuous-integration/appveyor/pr: AppVeyor build succeeded
@andyarvanitis (Contributor) commented Jan 4, 2017

Looks good to me, thanks! I also like @hdgarrood's idea. Another thing I've thought about and alluded to earlier is that it might be good to have functions that operate on code points (and something like size, in bytes) in addition to ones that operate on code units, in the standard strings package. Regardless of the underlying encoding, they would be useful and portable. It's obviously a topic for a separate discussion, though.

@michaelficarra deleted the michaelficarra:string-newtypes branch Jan 4, 2017

@michaelficarra (Contributor) commented Jan 4, 2017

@hdgarrood Though that seems convenient, I'd rather avoid any undefined behaviour in the language. I wanted to leave open the ability to do compile-time PSString evaluation, such as concatenation. I would want the program "\xD834" <> "\xDF06" to be simplified to "\x1D306" or "𝌆". In fact, that behaviour is already defined today in PSString's Monoid implementation.

@andyarvanitis Yes, I'd love to see codepoint-based operations added to the standard string libraries. Please include me. I'll help out however I can.
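
A small standalone illustration of that point, with plain Word16 lists standing in for PSString: appending a string that ends in a lone lead surrogate to one that begins with the matching trail surrogate yields the surrogate pair, so the concatenation denotes U+1D306 even though neither operand is valid UTF-16 on its own.

import Data.Word (Word16)

leadOnly, trailOnly, pair :: [Word16]
leadOnly  = [0xD834]          -- "\xD834"
trailOnly = [0xDF06]          -- "\xDF06"
pair      = [0xD834, 0xDF06]  -- "\x1D306"

-- Concatenation at the code unit level already gives the desired result.
check :: Bool
check = leadOnly <> trailOnly == pair  -- True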

@hdgarrood (Contributor) commented Jan 4, 2017

Ok, good point. On second thoughts it's a bit early to try to guess the risk of a proliferation of nonportable string code right now anyway. I'd also love to see more code-point based stuff in strings, btw.

@kRITZCREEK (Member) commented Jan 4, 2017

Just dropping in to say thanks to @michaelficarra and @hdgarrood! A colossal effort and a clean solution 👍
