Implement C FFI for purescript-strings #34
We're at 239/266 tests passing! A few SIGSEGVs here (likely stack overflows) and a few SIGABRTs there (likely failed asserts), a missing implementation for `ObjectUpdate`, a few missing FFIs for the tests themselves, as well as missing implementations for purescript-arrays (#36) and purescript-strings (#34).
Also, JS engines apparently encode strings in UTF-16 or UCS-2. I see no need or appeal in matching JS just for the sake of similarity, but I wonder what the rationale is and whether we should do the same or not.
Also interesting: https://blog.golang.org/strings
I think rust-lang's string handling is awesome, and a good way to learn Unicode and the different encodings. At the risk of saying stuff that's already known (note that below I use javascript and ecmascript interchangeably):

In the beginning there was ASCII, where a character is an 8-bit integer (since there's no maths involved, signed or unsigned doesn't matter). Bit patterns 0x00 to 0x7f were the standard ASCII character set, and bit patterns 0x80 to 0xff were a second, implementation-defined page that could be used for, say, É in French, or characters for drawing windows in DOS. This became a bit of a mess, with lots of encodings using different second pages. There were extensions that I don't understand, but in the end the great and the good got together and developed the Unicode standard, where each atom (a code point) is 4 bytes wide. This address space should be big enough to contain all possible characters in all languages.

However, if most of the text you write is ASCII, you are wasting 3/4 of your space with zero bytes if you store every character as a raw 4-byte code point, so in general folks use some kind of encoding scheme that compresses this data. The most popular is UTF-8, where the first 128 characters are just their ASCII representation, and any wider characters use continuation bytes (I think this also means UTF-8 has no byte-order issues, though I may be wrong). This means that, for example, the string "test" takes 4 bytes, the string "tÉst" takes 5, and so on.

Well, the last sentence was a bit of a lie. The character É could be made of 1 or 2 Unicode code points. The character itself is a code point, but it could also be produced with the sequence 'E' and U+0301, the Unicode combining acute accent modifier. Just check out the Wikipedia page on the acute accent to get a feel for how many esoteric Unicode code points there are. When there is more than one way to represent the same physical glyph(s), sometimes there is a canonical representation, and rewriting all representations to that one is called normalization. Groups of code points that represent a single character or other unit are sometimes called grapheme clusters, but there are other names too, I believe.

I don't think C knows anything about any of this. In Rust, strings work the same as a vector, in that their structure on the stack is a (pointer, length, capacity) triple, with the UTF-8 bytes on the heap.

Hope this is useful in any case.

PS: UTF-16 is less compact and less standard than UTF-8, but it is standard on Windows, and some ECMAScript strings (though not all, I think) use it.

PPS: (see USVString) it appears that strings in javascript are always UTF-16, but there are some opaque types that may encode the string differently. They will be converted to UTF-16 if they are accessed as Strings in ECMAScript.
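A minimal C sketch of the byte counts described above, assuming well-formed UTF-8 input (the helper name is made up for illustration): counting code points just means skipping continuation bytes, which all match the bit pattern `10xxxxxx`.

```c
#include <stdio.h>
#include <string.h>

/* Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
   Assumes the input is well-formed UTF-8. */
static size_t utf8_codepoint_count(const char *s) {
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) /* not a continuation byte */
            count++;
    }
    return count;
}

int main(void) {
    const char *plain     = "test";          /* 4 bytes, 4 code points */
    const char *accented  = "t\xC3\x89st";   /* "tÉst", precomposed É (U+00C9) */
    const char *combining = "tE\xCC\x81st";  /* "tÉst" as 'E' + U+0301 */
    printf("%zu bytes, %zu code points\n", strlen(plain),     utf8_codepoint_count(plain));     /* 4, 4 */
    printf("%zu bytes, %zu code points\n", strlen(accented),  utf8_codepoint_count(accented));  /* 5, 4 */
    printf("%zu bytes, %zu code points\n", strlen(combining), utf8_codepoint_count(combining)); /* 6, 5 */
    return 0;
}
```

Note how the precomposed and combining forms render the same glyphs but differ in both byte and code point counts; that is exactly the gap normalization closes.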
@derekdreery That was extremely useful, thank you for taking the time to write this up. I originally took up utf8.h as it was easy to get started with; however, it seems my understanding was incomplete. The current type representing strings is effectively a NUL-terminated, UTF-8 encoded `char *`.
If we put (1) aside for a moment, and assuming that immutability means (2) is not an issue, it would seem that we can safely move ahead with the current representation? How does UTF-16 relate to all of this, and will it have an impact on what representation we should choose?
Sounds good re. 1 and 2. I'm not an expert on UTF-16, although it may be that some code points that are valid in UTF-8 are not valid in UTF-16, or the other way round; I'm not sure, it's complicated :(
Since strings are immutable, you could probably use some complex data structure to save space and cut down on copying, but that's probably not part of the MVP.
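As a sketch of what such a structure might look like (purely hypothetical, not purec's actual type): because immutable strings never change underneath you, substrings can share one buffer instead of copying it.

```c
#include <stddef.h>

/* Hypothetical immutable string slice: a (pointer, length) view into a
   shared UTF-8 buffer. Illustrative only, not purec's actual type. */
typedef struct {
    const char *data; /* UTF-8 bytes; not necessarily NUL-terminated */
    size_t len;       /* length in bytes, not code points */
} str_slice;

/* Taking a substring is O(1): share the buffer, adjust the view.
   `start` and `n` are byte offsets, assumed in bounds and on code
   point boundaries. */
static str_slice str_slice_sub(str_slice s, size_t start, size_t n) {
    str_slice out = { s.data + start, n };
    return out;
}
```

Tracking an explicit length also turns out to matter for the NUL-byte issue raised later in this thread.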
I guess from here on out it's a matter of forking purescript-strings and seeing how far we can get. Are you interested at all in giving this a go? I've got my eyes set on getting purescript-httpure going in PureC, using the Aff port and libuv bindings in purec-uv. I think purescript-strings is an obvious missing piece, and it would be a great use case/demonstration of PureC.
I had a go at this using utf8.h. The issue is that, as you already mentioned, we are not able to represent all the characters that PureScript parses, since it uses UTF-16 😢. Unfortunately I don't see any other way than to switch to UTF-16 strings in purec as well. What do you think we should do @derekdreery @felixSchl? Do you know of any lightweight UTF-16 libraries for C?
@andyarvanitis purescript-native uses UTF-8 for strings, but I also saw that purescript-strings uses UTF-16 for strings and UTF-32 for chars. Not sure I understand this. Is there a conversion somewhere between UTF-8 and UTF-16, or is it all just UTF-8?
@anttih For purescript-native, the plan is just to use UTF-8. I've yet to implement its FFI functions for the current runtime, though (the older pure11 implementation was also UTF-8). BTW, I hope to get to this quite soon. Interestingly enough, I had also started looking at utf8.h, since the C++ `std::codecvt` stuff is deprecated in C++17. But there shouldn't be any valid Unicode code points not representable in UTF-8, unless you're talking about the (UTF-16-invalid) lone surrogates that JavaScript supports. I just didn't support them in pure11 and planned to do the same for purescript-native. Were there any other Unicode characters/entities you encountered that were a problem?
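For reference, this is what the surrogates are about: UTF-16 encodes code points above U+FFFF as a pair of 16-bit units, and a lone surrogate is one half of such a pair appearing on its own, which JS strings allow but which has no valid UTF-8 encoding. A small sketch (the function name is made up):

```c
#include <stdint.h>
#include <stdio.h>

/* Split a code point above U+FFFF into a UTF-16 surrogate pair.
   Assumes 0x10000 <= cp <= 0x10FFFF. */
static void utf16_surrogate_pair(uint32_t cp, uint16_t *hi, uint16_t *lo) {
    cp -= 0x10000;
    *hi = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate: D800..DBFF */
    *lo = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate:  DC00..DFFF */
}

int main(void) {
    uint16_t hi, lo;
    utf16_surrogate_pair(0x1F600, &hi, &lo);      /* U+1F600, the 😀 emoji */
    printf("U+1F600 -> 0x%04X 0x%04X\n", hi, lo); /* 0xD83D 0xDE00 */
    return 0;
}
```

A JS string may contain, say, 0xD83D with no following low surrogate; that unit alone names no code point, so there is nothing for UTF-8 to encode.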
Yeah, I read some more and it seems the only problem is with the lone surrogates. I think we can go with UTF-8 then :)
There's another big question here, I think, if we use UTF-8: what to do with the APIs of the two string modules, `Data.String.CodeUnits` and `Data.String.CodePoints`. For example, the `CodeUnits` functions index and measure strings by UTF-16 code unit, while the `CodePoints` functions do so by Unicode code point. To be compatible we should implement both APIs, but this makes the API a bit confusing. Do we have any other choice than to just implement both APIs? Edit: here's a nice explanation of the
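To make the two semantics concrete, here is a sketch of the two `length` flavours over a UTF-8 buffer, assuming (per the discussion above) that purec strings are UTF-8; the function names are illustrative, not the real FFI symbols:

```c
#include <stddef.h>

/* Data.String.CodePoints-style length: count Unicode code points by
   skipping UTF-8 continuation bytes (bit pattern 10xxxxxx). */
static size_t length_code_points(const unsigned char *s, size_t len) {
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if ((s[i] & 0xC0) != 0x80) n++;
    return n;
}

/* Data.String.CodeUnits-style length: count UTF-16 code units. Code
   points above U+FFFF take a surrogate pair (two units); in UTF-8 those
   are exactly the 4-byte sequences, whose lead byte is 11110xxx. */
static size_t length_code_units(const unsigned char *s, size_t len) {
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) n++; /* one unit per code point...  */
        if ((s[i] & 0xF8) == 0xF0) n++; /* ...plus one for 4-byte seqs */
    }
    return n;
}
```

For "😀" (U+1F600, four UTF-8 bytes) the first returns 1 and the second returns 2, which is exactly where the two PureScript APIs disagree.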
Right now, a PureScript `String` is a sequence of UTF-16 code units, and it isn't guaranteed to be valid UTF-16, so lone surrogates can occur.
Implementing the Data.String.CodeUnits API will probably be a bit awkward, as you'll probably end up having to wrap every function implementation with a conversion between the UTF-8 representation and UTF-16 code unit indexing.

I think there is perhaps a case to be made for having the builtin string type not commit to any particular encoding.
It does mean though that we'd be diverging from how the language currently defines strings.
Yes, although I would be in favour of changing the language so that the underlying string representation is left unspecified.
@hdgarrood Should we create a compiler ticket? I agree, and I would like it if the language were representation-agnostic.
Sounds good to me 👍
So either we

a) change purec to use UTF-16 strings, or
b) keep UTF-8 strings and implement the UTF-16-oriented `Data.String.CodeUnits` API on top of them.

a) doesn't sound appealing to me. It might be a lot of work for not really being better in any way, except being compatible with the current language. I think there has to be a middle ground here. Maybe we could solve some of the issues not by changing the language, but by not using `Data.String.CodeUnits`.
Another challenge here is that we need to support the NUL byte to be Unicode compliant, so just using NUL-terminated strings won't work. I think we have to switch to a string "slice" where we track the length. Edit: looks like we could read in
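A two-line illustration of the problem: U+0000 is a valid code point whose UTF-8 encoding is the 0x00 byte itself, which C's string functions treat as the terminator.

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    /* "a", U+0000, "b": three meaningful bytes of valid UTF-8. */
    const char buf[] = {'a', '\0', 'b', '\0'};
    printf("strlen sees %zu byte(s)\n", strlen(buf)); /* prints 1 */
    /* An explicit length field, as in the slice idea above, keeps all
       three bytes visible. */
    return 0;
}
```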
Required by #12