Implement C FFI for purescript-strings #34

felixSchl · 2018-10-03T19:33:43Z

Required by #12

We're 239/266 tests passing! A few sigsegvs here (likely stack overflows) and a few sigabrt there (likely failed asserts), a missing implementation for `ObjectUpdate`, a few missing FFIs for tests themselves, as well as a missing implementations for purescript-arrays (#36), purescript-strings (#34).

felixSchl · 2018-11-01T01:27:18Z

@Lupino let's explore what we have to do get this going. Also note that I am pretty sure that the library is not required by #12 after all, so we could really avoid implementing it altogether.

felixSchl · 2018-11-01T03:39:33Z

Also, JS engines apparently encode strings in UTF-16 or UCS-2. I see no need or appeal to match JS in any way for the sake of being similar, but I wonder what the rationale is and whether we should do the same or not.

felixSchl · 2018-11-01T03:45:53Z

Also interesting: https://blog.golang.org/strings

richard-uk1 · 2018-11-10T10:14:20Z

I think rust-lang's string handling is awesome, and a good way to learn unicode/different encodings. At the risk of saying stuff that's already known: (note below I use javascript and ecmascript interchangeably)

In the beginning there was ascii, where a character is an 8-bit integer (since there's no maths involved, either unsigned or signed it doesn't matter). Bit patterns 0x00 to 0x7f were the standard ascii character set, and bit patterns 0x80 to 0xff were a second implementation-defined page, that could be used for, say, É in french, or characters for drawing windows in DOS.

This became a bit of a mess, with lots of encodings using different second pages. There were extensions that I don't understand, but in the end the great and good got together and developed the Unicode standard, where each atom is 4 bytes wide. This address space should be big enough to contain all possible characters in all languages.

However if most of the text you write is ascii, you are wasting 3/4 of your space with 00s if you encode every character as a Unicode code-point, so in general folks use some kind of encoding scheme that compresses this data. The most popular is utf-8 where the first 127 characters are just their ascii representation, and any wider characters use a continuation bit (I think this means that utf-8 is always in little-endian, though I may be wrong). This means that for example the string "test" takes 4 bytes, the string "tÉst" takes 5, and so on.

Well, the last sentence was a bit of a lie. The character É could be made of 1 or 2 Unicode code points. The character itself is a code point, but it could also be produced with the sequence 'E' and U+0301, the Unicode combining, accent acute modifier. Just check out the acute Wikipedia page to get a feel for how many esoteric Unicode code points there are. When there is more than 1 way to represent the same physical glyph(s), sometimes there is a canonical representation, and changing all representations to this one is called normalization.

Groups of code points that represent a single character or other unit are sometimes called grapheme clusters, but there are other names too I believe.

I don't think C knows anything about any of this. In rust, strings work the same as a vector, in that their structure on the stack is [ pointer, length, capacity ], where pointer points to a single unstructured block of data on the heap, length is the current length of the string, and capacity is the length of the allocated memory. In C, a string is just a pointer, where it is up to the user to know how much allocated memory there is. The length is determined by the first occurrence of the null (\0, 0x00) character. In rust, when working with C FFI it's generally easier to convert to a [pointer, length] or [pointer, length, capacity] format for ease of use, unless that causes unacceptable performance problems.

Hope this is useful in any case.

PS utf-16 is less compact and less standard that utf-8, but it is standard on windows and some ECMAScript strings (though not all I think) use it.

PPS (see USVString) it appears that strings in javascript are always utf-16, but there are some opaque types that may encode the string differently. They will be converted to utf-16 if they are accessed as Strings in ECMAScript.

felixSchl · 2018-11-13T20:51:13Z

@derekdreery That was extremely useful, thank you for taking the time to write this up. I originally took up utf8.h as it was easy to get started with, however it seems my understanding was incomplete. The current type representing strings is effectively void* assuming utf8 encoded contents and '\0' terminating byte. There are a few things to note about with the current representation:

We require the managed_t wrapper and a GC_register_finalizer because we assume a non-GC allocated string (due to vasprintf). This is largely unrelated to the discussion, but if we can clean this up we might as well do it now.
We do not track the capacity as we assume that length always equals capacity-1. I think this should be OK, though, because strings are immutable? In other words, I don't think we'll be shrinking or growing strings under the same pointer.

If we put (1) aside for a moment, and assuming that immutability means (2) is not an issue, it would seem that we can safely move ahead with the current representation? How does Utf-16 relate to all of this, will it have an impact on what representation we should choose?

richard-uk1 · 2018-11-13T22:47:44Z

Sounds good re. 1 and 2

I'm not an expert on utf16, although it may be that some code points that are valid in utf8 are not in utf16, or the other way round, I'm not sure - it's complicated :(

richard-uk1 · 2018-11-13T22:49:27Z

Since strings are immutable, you could probably use some complex data structure to save space, and cut down on copying, but probably not part of the MVP.

felixSchl · 2018-11-13T23:20:46Z

I guess from here on out it's a matter of forking purescript-strings and see how far we can get. Are you interested at all in giving this a go? I've got my eyes set on getting purescript-httpure going in PureC, using the Aff port and libuv bindings in purec-uv - I think purescript-strings is an obvious missing piece, and it would be a great usecase/demonstration of PureC.

anttih · 2019-05-28T03:58:09Z

I had a go with this using utf8.h. The issue is that, as you already mentioned, we are not able to represent all the characters that purescript parses since it uses UTF-16 😢.

Unfortunately I don't see any other way than to switch to UTF-16 strings in purec as well. What do you think we should do @derekdreery @felixSchl ? Do you know of any light UTF-16 libraries for C?

anttih · 2019-05-28T19:07:03Z

@andyarvanitis purescript-native uses UTF-8 for strings, but I also saw that purescript-strings uses UTF-16 and UTF-32 for chars. Not sure I understand this. Is there a conversion somewhere between UTF-8 and UTF-16, or is it all just UTF-8?

andyarvanitis · 2019-05-30T03:32:28Z

@anttih For purescript-native, the plan is just to use UTF-8. I've yet to implement its FFI functions for the current runtime, though (the older pure11 implementation was also UTF-8). BTW, I hope to get to this quite soon. Interestingly enough, I had also started looking at utf8.h, since the C++ std codecvt stuff is deprecated in C++17. But there shouldn't be any valid unicode code points not representable in UTF-8, unless you're talking about the (UTF-16-invalid) lone surrogates that javascript supports. I just didn't support them in pure11 and planned to do the same for purescript-native. Were there any other unicode characters/entities you encountered that were a problem?

anttih · 2019-05-31T03:17:24Z

Yeah, I read some more and it seems the only problem is with the lone surrogates. I think we can go with UTF-8 then :)

anttih · 2019-06-01T11:29:47Z

There's another big question here I think if we use UTF-8: what to do with the API for the CodePoints and the CodeUnits modules? In PureScript's JS backend, a Char is a UTF-16 code unit, whereas right now in purec a Char is a 32-bit UTF-8 code point. I think what purec does is better, but now the CodePoints API doesn't make that much sense since we can represent all code points as proper Chars.

For example, the CodePoints module has:

codePointAt :: Int -> String -> Maybe CodePoint

while the CodeUnits module has:

charAt :: Int -> String -> Maybe Char

For purec the CodePoint version behaves the same way as the Char one.

To be compatible we should implement both APIs, but this makes the API a bit confusing. Do we have any other choice than to just implement both APIs?

Edit: here's a nice explanation of the strings API by @hdgarrood

hdgarrood · 2019-06-01T14:53:20Z

Right now, a PureScript String is defined to be any sequence of UTF-16 code units (including those which don't form valid Unicode strings). You're free to use whatever encoding / layout in memory, but if you can't implement a bijection String -> Array Word16, your backend is technically incorrect. Having the internal representation of strings in this backend be UTF-8 doesn't necessarily mean you can't conform, though; see UTF-16 (Wikipedia):

U+D800 to U+DFFF

The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points.

However UCS-2, UTF-8, and UTF-32 can encode these code points in trivial and obvious ways, and large amounts of software does so even though the standard states that such arrangements should be treated as encoding errors.

It is possible to unambiguously encode an unpaired surrogate (a high surrogate code point not followed by a low one, or a low one not preceded by a high one) in UTF-16 by using a code unit equal to the code point. The majority of UTF-16 encoder and decoder implementations do this then when translating between encodings.[citation needed] Windows allows unpaired surrogates in filenames and other places, which generally means they have to be supported by software in spite of their exclusion from the Unicode standard

Implementing the Data.String.CodeUnits API will probably be a bit awkward, as you'll probably end up having to wrap every function implementation with an encodeUtf16 on the way in and/or a decodeUtf16 on the way out, but I think it's doable, provided that the string libraries you're using don't mind about encoding the code points U+D800 to U+DFFF in UTF-8 (which they probably don't; see above).

I think there is perhaps a case to be made for having the builtin String type represent a sequence of Unicode code points, where we don't define the encoding, but this would require careful planning and a fairly big breaking change for the purescript-strings library (we'd probably want to rip out Data.String.CodeUnits and put it in a separate library which would be marked only for use with js, and only where you specifically need to handle lone surrogates.

anttih · 2019-06-01T17:03:20Z

You're free to use whatever encoding / layout in memory, but if you can't implement a bijection String -> Array Word16, your backend is technically incorrect. Having the internal representation of strings in this backend be UTF-8 doesn't necessarily mean you can't conform, though; see UTF-16 (Wikipedia):

It does mean though that Char cannot represent a code point, but has to be a Word16, right?

hdgarrood · 2019-06-01T17:09:45Z

Yes, although I would be in favour of changing the language so that Char is actually a code point rather than a UTF-16 code unit. I think having the UTF-16 representation leak through there is a bit of a wart; it's sort of inconsistent with the approach we've taken in the strings library, which is that we provide functions you can use if speed is really necessary / you know you need to operate with UTF-16 code units, but the functions you get if you just import Data.String are the ones which operate on code points and which don't expose details of how the strings are actually encoded in memory.

natefaubion · 2019-06-01T17:14:56Z

@hdgarrood Should we create a compiler ticket? I agree, and I would like it if the language was representation agnostic.

hdgarrood · 2019-06-01T17:20:04Z

Sounds good to me 👍

anttih · 2019-06-01T18:19:54Z

So either we

a) change Char in purec to be a UTF-16 code unit, and find a way to encode/decode from utf-8 encoded String to Char for the use in CodeUnits (I didn't find anything that implements this by a quick googling)

b) have Char be a UTF-8 code point and implement CodeUnits by actually using code points underneath 😅

a) doesn't sound appealing to me. It might be a lot of work for not really being better in any way except being compatible with the current language. I think there has to be a middle ground here. Maybe we could solve some of the issues not by changing the language but by not using Char all that much, as suggested here.

anttih · 2019-06-14T03:55:23Z

Another challenge here is that we need to support the NULL byte to be unicode compliant, so just using NULL teminated strings won't work. I think we have to switch to a string "slice" where we track the length.

Edit: looks like we could read in \0 as two bytes: 0xC0 0x80 https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

felixSchl added the libraries label Oct 3, 2018

felixSchl mentioned this issue Oct 11, 2018

Run against all passing tests in PureScript compiler test suite #12

Closed

3 tasks

felixSchl mentioned this issue Nov 3, 2018

Compile pure-c using pure-c #29

Open

natefaubion mentioned this issue Jun 1, 2019

Consider changing Char to represent a code point rather than a UTF-16 code unit purescript/purescript#3662

Open

felixSchl closed this as completed Feb 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement C FFI for purescript-strings #34

Implement C FFI for purescript-strings #34

felixSchl commented Oct 3, 2018

felixSchl commented Nov 1, 2018

felixSchl commented Nov 1, 2018

felixSchl commented Nov 1, 2018

richard-uk1 commented Nov 10, 2018 •

edited

Loading

felixSchl commented Nov 13, 2018

richard-uk1 commented Nov 13, 2018

richard-uk1 commented Nov 13, 2018

felixSchl commented Nov 13, 2018 •

edited

Loading

anttih commented May 28, 2019

anttih commented May 28, 2019

andyarvanitis commented May 30, 2019

anttih commented May 31, 2019

anttih commented Jun 1, 2019 •

edited

Loading

hdgarrood commented Jun 1, 2019

U+D800 to U+DFFF

anttih commented Jun 1, 2019

hdgarrood commented Jun 1, 2019

natefaubion commented Jun 1, 2019

hdgarrood commented Jun 1, 2019

anttih commented Jun 1, 2019 •

edited

Loading

anttih commented Jun 14, 2019 •

edited

Loading

Implement C FFI for purescript-strings #34

Implement C FFI for purescript-strings #34

Comments

felixSchl commented Oct 3, 2018

felixSchl commented Nov 1, 2018

felixSchl commented Nov 1, 2018

felixSchl commented Nov 1, 2018

richard-uk1 commented Nov 10, 2018 • edited Loading

felixSchl commented Nov 13, 2018

richard-uk1 commented Nov 13, 2018

richard-uk1 commented Nov 13, 2018

felixSchl commented Nov 13, 2018 • edited Loading

anttih commented May 28, 2019

anttih commented May 28, 2019

andyarvanitis commented May 30, 2019

anttih commented May 31, 2019

anttih commented Jun 1, 2019 • edited Loading

hdgarrood commented Jun 1, 2019

U+D800 to U+DFFF

anttih commented Jun 1, 2019

hdgarrood commented Jun 1, 2019

natefaubion commented Jun 1, 2019

hdgarrood commented Jun 1, 2019

anttih commented Jun 1, 2019 • edited Loading

anttih commented Jun 14, 2019 • edited Loading

richard-uk1 commented Nov 10, 2018 •

edited

Loading

felixSchl commented Nov 13, 2018 •

edited

Loading

anttih commented Jun 1, 2019 •

edited

Loading

anttih commented Jun 1, 2019 •

edited

Loading

anttih commented Jun 14, 2019 •

edited

Loading