[RFC] Buffer and string encoding stuff in luvit #262

Closed
hnakamur opened this Issue Jun 26, 2012 · 2 comments


Member

Here we should do better than JavaScript.

According to ECMA-262, a JavaScript String is a sequence of zero or more unsigned 16-bit values, and each 16-bit value is treated as a UTF-16 code unit.
In this representation, a character outside the Basic Multilingual Plane takes two 16-bit values (a surrogate pair), and every other character takes one 16-bit value.
(For the record, the choice of UTF-16 as the encoding of JavaScript strings was made before surrogate pairs were defined in Unicode.)
So we must constantly convert back and forth between UTF-16 and UTF-8 to communicate with anything outside a JavaScript VM, which I think is just a waste of CPU time and RAM.
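
To make the one-or-two-units rule concrete, here is a minimal sketch in plain Lua of the standard UTF-16 arithmetic. The function name `utf16_units` is made up for illustration and is not part of luvit or any library; the sketch does not validate its input (for example, it does not reject code points in the surrogate range).

```lua
-- Hypothetical helper: return the UTF-16 code units for one code point.
local function utf16_units(cp)
  if cp < 0x10000 then
    -- Basic Multilingual Plane: a single 16-bit value.
    return { cp }
  end
  -- Outside the BMP: split the 20-bit offset into a surrogate pair.
  local v = cp - 0x10000
  local high = 0xD800 + math.floor(v / 0x400) -- top 10 bits
  local low = 0xDC00 + (v % 0x400)            -- bottom 10 bits
  return { high, low }
end

local bmp = utf16_units(0x00A2)     -- U+00A2 CENT SIGN
local astral = utf16_units(0x1F600) -- a code point outside the BMP
print(#bmp, #astral)                                    --> 1   2
print(string.format("%04X %04X", astral[1], astral[2])) --> D83D DE00
```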

On the other hand, according to the Lua 5.0 Reference Manual, strings in Lua may contain any 8-bit value, including embedded zeros, which can be specified as '\0'.
The encoding of strings is not specified, so we can have strings in any encoding:
for example, UTF-8, ISO 8859-1 (Latin-1), or Shift_JIS. We can even have binary data in strings.
I think this is an important feature because it lets us avoid unnecessary encoding conversions.
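
A small sketch of what this means in practice, using only the standard `string` library: Lua counts bytes, not characters, and never interprets them. The variable names are illustrative only.

```lua
-- The same character (U+00A2, cent sign) in two encodings, plus binary data.
local cent_utf8 = "\194\162" -- two bytes in UTF-8
local cent_latin1 = "\162"   -- one byte in ISO 8859-1
local binary = "a\0b"        -- embedded zero is kept as an ordinary byte

-- The length operator reports bytes, whatever the encoding is.
print(#cent_utf8, #cent_latin1, #binary) --> 2   1   3
print(string.byte(binary, 1, -1))        --> 97  0   98
```

Nothing in the string records which encoding it holds, so the program (or an API convention) has to keep track of that.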

(Alternatively, we could restrict the encoding of strings to UTF-8 and use Buffer instances for text or binary data in other encodings.
In that case, luvit's I/O APIs must accept Buffer instances as well as strings.)

A Buffer instance has a fixed size (length), so a buffer can end in the middle of a multi-byte character sequence.
For example, a buffer may end with '\194', the first byte of '\194\162' (U+00A2 in UTF-8). We must handle these cases.
I wrote an experimental module as one possible implementation: github.com/hnakamur/luvit-charset
Please see README.md for code samples.
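
Independently of that module (this is not its API), here is a minimal sketch of one way to detect the problem in plain Lua. The hypothetical function `split_incomplete_utf8` scans back from the end of a chunk and returns the complete part plus any trailing incomplete sequence. It assumes the data is otherwise well-formed UTF-8 and does no validation.

```lua
-- Hypothetical helper: split a chunk into (complete, rest), where rest is
-- a trailing incomplete UTF-8 sequence, or "" if the chunk ends cleanly.
local function split_incomplete_utf8(chunk)
  local len = #chunk
  -- A UTF-8 sequence is at most 4 bytes, so inspect at most the last 4.
  for back = 1, math.min(4, len) do
    local pos = len - back + 1
    local b = string.byte(chunk, pos)
    if b < 0x80 then
      -- ASCII byte: nothing after it can be a pending lead byte.
      return chunk, ""
    elseif b >= 0xC0 then
      -- Lead byte: work out how many bytes the sequence needs in total.
      local need = (b >= 0xF0 and 4) or (b >= 0xE0 and 3) or 2
      if need > back then
        return string.sub(chunk, 1, pos - 1), string.sub(chunk, pos)
      end
      return chunk, ""
    end
    -- Otherwise b is a continuation byte (0x80-0xBF): keep scanning back.
  end
  return chunk, ""
end

local complete, rest = split_incomplete_utf8("a\194")
print(#complete, #rest) --> 1   1
```

The caller would keep `rest` and prepend it to the next buffer before decoding.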

However, I suppose there may be better ways than this, so comments are welcome.



All Stream-related activity should use Buffers. Everything else should use Strings.

re: Encodings
We should probably include encoding support into cBuffers #336

Owner

I don't want to bake encoding into luvit core. Most people use utf-8 or ascii and there is nothing special luvit's APIs need to do to handle either.

If you want to process stream chunks as Unicode data, you need a filter anyway that rechunks the UTF-8 bytes so that partial characters aren't split across two packets.
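
A rough sketch of what such a filter could look like as a stateful closure in plain Lua. The name `new_utf8_rechunker` is hypothetical and this is not a luvit API; it assumes well-formed UTF-8 input and has no flush step for bytes still pending when the stream ends.

```lua
-- Hypothetical filter: returns a function that takes raw chunks and returns
-- only whole UTF-8 characters, holding back a partial trailing sequence.
local function new_utf8_rechunker()
  local pending = "" -- bytes of an unfinished character from the last chunk
  return function(chunk)
    local data = pending .. chunk
    -- Look in the last 4 bytes for a lead byte followed only by
    -- continuation bytes up to the end of the data.
    local init = math.max(1, #data - 3)
    local start, lead = string.match(data, "()([\192-\255])[\128-\191]*$", init)
    if start then
      local b = string.byte(lead)
      local need = (b >= 0xF0 and 4) or (b >= 0xE0 and 3) or 2
      local have = #data - start + 1
      if have < need then
        -- Hold the partial character back until the next chunk arrives.
        pending = string.sub(data, start)
        return string.sub(data, 1, start - 1)
      end
    end
    pending = ""
    return data
  end
end

local feed = new_utf8_rechunker()
print(#feed("a\194"))  --> 1  (the lone lead byte is held back)
print(#feed("\162b"))  --> 3  (the completed character plus "b")
```

Since a filter like this lives entirely in user code, it fits the position above that nothing in luvit core needs to know about encodings.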

@creationix creationix closed this Mar 26, 2015