unicode (multibyte) characters are corrupted #21

lfdoherty · 2015-01-09T09:35:58Z

Looking at the code: it's not safe to decode buffers to strings except at separator boundaries. A buffer may end partway through a multi-byte character, which would explain the corruption.
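
To illustrate the concern, here is a minimal sketch of decoding each chunk on its own versus passing the chunks through a StringDecoder. This is not line-reader's actual code: the chunk split is made up for the example, and Buffer.from assumes a newer Node.js than the `new Buffer` calls used elsewhere in this thread.

```javascript
const { StringDecoder } = require('string_decoder');

// "€\n" in UTF-8 is E2 82 AC 0A; split it mid-character, as a fixed-size read might.
const chunks = [Buffer.from([0xE2, 0x82]), Buffer.from([0xAC, 0x0A])];

// Naive approach: decode each chunk independently, then concatenate.
const naive = chunks.map((c) => c.toString('utf8')).join('');

// StringDecoder carries an incomplete sequence over to the next write.
const decoder = new StringDecoder('utf8');
const decoded = chunks.map((c) => decoder.write(c)).join('') + decoder.end();

console.log(JSON.stringify(naive));
console.log(JSON.stringify(decoded));
```

The naive result should contain U+FFFD replacement characters where the euro sign was split, while the StringDecoder result should round-trip the original text.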

scharf · 2015-01-28T14:36:44Z

That's what I thought too. I wanted to create a failing test and tried this (based on the StringDecoder API example):

var StringDecoder = require('string_decoder').StringDecoder;
var decoder = new StringDecoder('utf8');

var cent = new Buffer([0xC2, 0xA2]);
console.log(decoder.write(cent));

var euro = new Buffer([0xE2, 0x82, 0xAC]);
console.log(decoder.write(euro));

console.log('v1='+decoder.write(new Buffer([0xE2, 0x82])) + decoder.write(new Buffer([0xAC])));
console.log('v2='+decoder.write(new Buffer([0xE2])) + decoder.write(new Buffer([0x82, 0xAC])));

If you look at the output, there seems to be no problem.

lfdoherty · 2015-01-29T00:43:53Z

I think you're right. I reran this against the library I used to replace it (line-by-line), and eventually determined that the corrupt characters appear when reader.nextLine is called more than once before it returns.

jedwards1211 · 2015-08-30T22:24:03Z

@scharf Huh. I take this to mean that JS strings can store part of a multi-byte character?

jedwards1211 · 2015-08-30T22:26:55Z

So the following couldn't occur with line-reader's buffering, but this doesn't work:

var v2 = decoder.write(new Buffer([0xE2])) + decoder.write(new Buffer([0x82])) + decoder.write(new Buffer([0xAC]));
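
A quick way to check the one-byte-per-write case on a given Node.js version is to log what each write returns. This is a sketch only: the result may depend on the Node.js version, Buffer.from assumes a newer release than the `new Buffer` calls above, and the name v3 is made up here.

```javascript
const { StringDecoder } = require('string_decoder');

const decoder = new StringDecoder('utf8');

// Feed the three bytes of "€" (E2 82 AC) one write at a time.
const parts = [0xE2, 0x82, 0xAC].map((byte) => decoder.write(Buffer.from([byte])));
const v3 = parts.join('');

// Show what each write returned and whether the joined result is the euro sign.
console.log(parts.map((p) => JSON.stringify(p)).join(' + '));
console.log('matches €:', v3 === '€');
```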

lfdoherty closed this as completed Jan 29, 2015

lfdoherty mentioned this issue Jan 29, 2015

Reading same line multiple times #15

Closed
