In this implementation, we must use two 16-bit values (a surrogate pair) to represent a Unicode character outside the Basic Multilingual Plane, and one 16-bit value for every other character.
On the other hand, according to the Lua 5.0 Reference Manual, strings in Lua may contain any 8-bit value, including embedded zeros, which can be specified as ‘\0’.
The encoding of strings is not specified, so a string can hold text in any encoding,
for example UTF-8, ISO 8859-1 (Latin-1), or Shift_JIS. A string can even hold binary data.
I think this is an important feature because it avoids unnecessary encoding conversions.
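To make the point concrete, here is a minimal sketch in plain Lua (no luvit APIs involved). The variable names are only for illustration. It shows that a string is just a sequence of bytes and that the length operator counts bytes, not characters.

```lua
-- Lua strings are plain byte sequences; the language attaches no encoding.
local with_zero = "a\0b"        -- contains an embedded zero byte
local cent_utf8 = "\194\162"    -- U+00A2 (cent sign) encoded as UTF-8
local cent_latin1 = "\162"      -- the same character in ISO 8859-1

print(#with_zero)               -- 3: the zero byte is counted like any other
print(#cent_utf8)               -- 2: the length is in bytes, not characters
print(#cent_latin1)             -- 1
```

The same character therefore has different byte lengths depending on the encoding, and Lua itself does not know or care which one is in use.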
(Alternatively, we could restrict the encoding of strings to UTF-8 and use Buffer instances for text or binary data in other encodings.
In that case, luvit IO APIs would have to accept Buffer instances as well as strings.)
A Buffer instance has a fixed size (length), so a buffer can end in the middle of a character's byte sequence.
For example, a buffer may end with '\194', the first byte of '\194\162' in UTF-8. We must handle these cases.
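One way to detect this case is sketched below. This is not the code from luvit-charset; `incomplete_tail_length` is a hypothetical helper, it works on a plain Lua string rather than a Buffer, and it assumes the data is well-formed UTF-8 apart from a possibly truncated last character.

```lua
-- Returns how many trailing bytes of `s` belong to an incomplete UTF-8
-- sequence, or 0 if `s` ends on a character boundary.
local function incomplete_tail_length(s)
  local len = #s
  -- A UTF-8 character is at most 4 bytes, so look back at most 3 bytes.
  for back = 1, math.min(3, len) do
    local b = string.byte(s, len - back + 1)
    if b >= 0xC0 then
      -- Lead byte: derive the expected length of the sequence.
      local need = (b >= 0xF0 and 4) or (b >= 0xE0 and 3) or 2
      if need > back then return back end
      return 0
    elseif b < 0x80 then
      return 0  -- ASCII byte: nothing is truncated
    end
    -- Otherwise b is a continuation byte (0x80..0xBF); keep looking back.
  end
  return 0
end

local chunk = "price: 5\194"   -- ends with the first byte of "\194\162"
local n = incomplete_tail_length(chunk)
local complete = string.sub(chunk, 1, #chunk - n)  -- safe to decode now
local pending = string.sub(chunk, #chunk - n + 1)  -- keep for the next chunk
```

Here `complete` holds the bytes that can be processed immediately, and `pending` must be prepended to whatever arrives next.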
I wrote an experimental module for one possible implementation: github.com/hnakamur/luvit-charset
Please see README.md for code samples.
However, there may well be better ways to do this,
so comments are welcome.
All Stream-related activity should use Buffers; everything else should use Strings.
We should probably include encoding support into cBuffers #336
I don't want to bake encoding into luvit core. Most people use utf-8 or ascii and there is nothing special luvit's APIs need to do to handle either.
If you want to process stream chunks as Unicode data, you need a filter anyway that rechunks the UTF-8 bytes so that partial characters aren't chopped between two packets.
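A rough sketch of such a filter, kept outside of core, might look like the following. `new_utf8_rechunker` is a hypothetical name, not an existing luvit API, and the sketch assumes chunks arrive as Lua strings containing otherwise well-formed UTF-8.

```lua
-- Returns a function that accepts raw chunks and returns only complete
-- UTF-8 characters, holding back a truncated trailing sequence until the
-- next chunk arrives.
local function new_utf8_rechunker()
  local pending = ""
  return function(chunk)
    local data = pending .. chunk
    local len = #data
    local cut = len  -- index of the last byte that is safe to emit
    for back = 1, math.min(3, len) do
      local b = string.byte(data, len - back + 1)
      if b >= 0xC0 then
        -- Lead byte: hold it back if its sequence is not complete yet.
        local need = (b >= 0xF0 and 4) or (b >= 0xE0 and 3) or 2
        if need > back then cut = len - back end
        break
      elseif b < 0x80 then
        break  -- ASCII byte: the data ends on a character boundary
      end
    end
    pending = string.sub(data, cut + 1)
    return string.sub(data, 1, cut)
  end
end

local feed = new_utf8_rechunker()
local first = feed("5\194")    -- "\194" is held back
local second = feed("\162!")   -- the held byte is emitted with its partner
```

Each stream would need its own rechunker instance, since the pending bytes are per-connection state.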