
Breaks multi-byte UTF-8 characters when parsing in Node-style #908

Open
jdesboeufs opened this issue Jan 4, 2022 · 4 comments


jdesboeufs commented Jan 4, 2022

PapaParse breaks multi-byte UTF-8 characters when they are split across different Buffer chunks.
For example, ç becomes ��.

To reproduce:

    const Papa = require('papaparse')
    const {PassThrough} = require('stream')

    const csvFileString = 'first_name,last_name\nFrançois,Mitterrand\n'

    const input = new PassThrough()
    const parser = Papa.parse(Papa.NODE_STREAM_INPUT, {header: true})

    input.pipe(parser)

    parser.on('data', row => console.log(row))

    // split the buffer in the middle of the two-byte "ç"
    input.write(Buffer.from(csvFileString).slice(0, 26))
    input.write(Buffer.from(csvFileString).slice(26))
    input.end()

Output:

    { first_name: 'Fran��ois', last_name: 'Mitterrand' }

A workaround is to ensure UTF-8 decoding beforehand with string_decoder (an internal Node module), the WHATWG TextDecoder, or iconv-lite (a user-land dependency).
But a better fix would be to use string_decoder or TextDecoder inside PapaParse, in place of chunk.toString().

Related to #751


jdesboeufs commented Jan 4, 2022

Option 1: using string_decoder in PapaParse

Pros: straightforward bugfix
Cons: would depend on a polyfill when using the Node stream syntax in the browser; possibly a breaking change

Option 2: using the WHATWG TextDecoder in PapaParse

Pros: future-proof and universal bugfix
Cons: requires Node.js 8.3+, Firefox 19/20, Chrome 38+
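For illustration, a small sketch of how TextDecoder would handle the split sequence: the `{ stream: true }` option carries incomplete bytes over to the next `decode()` call instead of emitting U+FFFD.

```javascript
// TextDecoder with { stream: true } buffers an incomplete multi-byte
// sequence between calls rather than emitting a replacement character.
const decoder = new TextDecoder('utf-8');

const bytes = new TextEncoder().encode('ç'); // Uint8Array [0xC3, 0xA7]
const partial = decoder.decode(bytes.subarray(0, 1), { stream: true }); // ''
const rest = decoder.decode(bytes.subarray(1), { stream: true });       // 'ç'
```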

Option 3: deprecating or forbidding the use of PapaParse with a stream of Buffers

Throw an error when used with a stream of Buffers => forces users to decode the stream on their own (add an example with iconv-lite)

Pros: keeps PapaParse simple
Cons: breaking change if there is no deprecation period


Narretz commented Jan 24, 2022

I'm trying to proactively use the iconv-lite option. Can you check whether this pseudo-implementation is correct? It could also be added to the docs after clean-up.
It does work, but I haven't tested all edge cases.
I assume iconv-lite guarantees that multi-byte UTF-8 characters are kept together?

And is there a way to get the "meta" field in the streaming API? on('data') only gives you the data part of the result.
See

PapaParse/papaparse.js, lines 917 to 918 in 1f2c733:

    var data = results.data;
    if (!stream.push(data) && !this._handle.paused()) {

I assume that's intentional?

    import { parse, NODE_STREAM_INPUT } from 'papaparse';
    import { decodeStream } from 'iconv-lite';
    import { pipeline } from 'stream';
    import { createReadStream } from 'fs';

    const stream = createReadStream('input.csv'); // some ReadStream (hypothetical path)

    // decodeStream keeps multi-byte sequences together across chunks
    const converterStream = decodeStream('utf8');

    const csvStream = parse(NODE_STREAM_INPUT, {
      header: true,
    });

    csvStream.on('data', (data) => {
      console.log('do something with the data');
      // can I get the meta info here?
    });

    // the 'end' event does not receive any arguments
    csvStream.on('end', () => {});

    pipeline(stream, converterStream, csvStream, (err) => {
      console.log('stream complete', err);
    });


Kaiido commented Feb 4, 2022

If I may, I believe the WHATWG TextDecoder option would be your best move here.
As already said, it is future-proof and can be polyfilled if needed.
It would also fix a bug in browsers with the chunk option: https://jsfiddle.net/3zypkqtg/ (not sure if yet another report is needed for that).

@kaligrafy

Any news on this? I tried every method to fix it, but either it doesn't work or it takes forever to read the stream. What should be done in the meantime to be able to read multi-byte UTF-8 characters when streaming to Papa?
