unicode (multibyte) characters are corrupted #21
Comments
That's what I thought too. I wanted to create a failing test and tried this (based on the StringDecoder API example):
var StringDecoder = require('string_decoder').StringDecoder;
var decoder = new StringDecoder('utf8');
var cent = new Buffer([0xC2, 0xA2]);
console.log(decoder.write(cent));
var euro = new Buffer([0xE2, 0x82, 0xAC]);
console.log(decoder.write(euro));
console.log('v1='+decoder.write(new Buffer([0xE2, 0x82])) + decoder.write(new Buffer([0xAC])));
console.log('v2='+decoder.write(new Buffer([0xE2])) + decoder.write(new Buffer([0x82, 0xAC])));
Looking at the output, there seems to be no problem. |
I think you're right. I reran this against the library I used to replace it (line-by-line), and eventually determined that the corrupt characters appear when reader.nextLine is called more than once before it returns. |
@scharf Huh. I take this to mean that JS strings can store part of a multi-byte character? |
So the following couldn't occur with line-reader's buffering, but this doesn't work:
|
Looking at the code - it's not safe to decode buffers to strings except at separator boundaries - they may end part-way through a multi-byte character, hence the issue.
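To make the point above concrete, here is a minimal sketch of the failure mode. It does not use line-reader's actual code; the chunk split and the variable names are assumptions chosen for illustration. It splits the three bytes of the euro sign across two chunks, then compares decoding each chunk on its own with two ways of deferring the decode until the bytes are complete.

```javascript
const { StringDecoder } = require('string_decoder');

// The euro sign is three bytes in UTF-8: E2 82 AC.
const euro = Buffer.from('\u20AC', 'utf8');

// Hypothetical read boundary that falls in the middle of the character.
const chunks = [euro.subarray(0, 2), euro.subarray(2)];

// Unsafe: each chunk is decoded independently, so the incomplete
// sequences are replaced with U+FFFD and the original bytes are lost.
const naive = chunks.map((chunk) => chunk.toString('utf8')).join('');

// Safer option 1: StringDecoder holds back incomplete trailing bytes
// until the next write() supplies the rest of the character.
const decoder = new StringDecoder('utf8');
const decoded = chunks.map((chunk) => decoder.write(chunk)).join('') + decoder.end();

// Safer option 2: keep the data as bytes, join the buffers, and decode
// once (for a line reader, once a full line up to the separator is buffered).
const joined = Buffer.concat(chunks).toString('utf8');

console.log('naive corrupted:', naive.includes('\uFFFD'));
console.log('StringDecoder ok:', decoded === '\u20AC');
console.log('concat ok:', joined === '\u20AC');
```

This also bears on the question earlier in the thread: a JS string does not keep half of a UTF-8 sequence; the decoder substitutes a replacement character, which is why the damage cannot be undone by concatenating the strings afterwards.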