Reopen big (400MB .csv) file with encoding #8

pinguin999 · 2020-09-17T08:00:40Z

VSCode Version: 1.50.0-insider (user setup)
Commit: 9e505675670d65138405321a60b0df4ddec28799
Date: 2020-09-16T06:49:37.816Z
Electron: 9.3.0
Chrome: 83.0.4103.122
Node.js: 12.14.1
V8: 8.3.110.13-electron.0
OS: Windows_NT x64 10.0.19041

Steps to Reproduce:

Open a large (for example a 400MB .csv) Windows-1252 file, click on "UTF-8" at the bottom right, and choose "Reopen with Encoding".
RAM usage grows and finally Code crashes with OOM :)

I guess it's the encoding guessing algorithm.

Does this issue occur when all extensions are disabled?: Yes

bpasero · 2020-09-17T09:18:06Z

We should limit the data we give to jschardet to a small value to prevent this. Up for a PR if someone wants to chime in.

pinguin999 · 2020-09-21T09:40:51Z

Hi, I looked into the code, and at first glance the problem may not be jschardet.
There is something like this:

// ensure to limit buffer for guessing due to https://github.com/aadsm/jschardet/issues/53
const limitedBuffer = buffer.slice(0, AUTO_ENCODING_GUESS_MAX_BYTES);

// before guessing jschardet calls toString('binary') on input if it is a Buffer,
// since we are using it inside browser environment as well we do conversion ourselves
// https://github.com/aadsm/jschardet/blob/v2.1.1/src/index.js#L36-L40
const binaryString = encodeLatin1(limitedBuffer.buffer);

const guessed = jschardet.detect(binaryString);

@bpasero if you point me to a tutorial on how to build and debug vscode, I can try to locate the problem.
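
For readers who want to see the capping step in isolation, here is a minimal sketch of the idea in the snippet quoted above: take only a bounded prefix of the file and turn it into a "binary string" (one character per byte) before handing it to a detector. The names `GUESS_MAX_BYTES`, `toBinaryString` and `sampleForGuessing` are hypothetical and are not VS Code's `AUTO_ENCODING_GUESS_MAX_BYTES` or `encodeLatin1`; the 64 KiB value is an assumption for illustration only.

```typescript
// Hypothetical cap for illustration; not the value VS Code actually uses.
const GUESS_MAX_BYTES = 64 * 1024;

// Convert bytes to a "binary string": each byte (0..255) becomes one UTF-16 code unit.
function toBinaryString(bytes: Uint8Array): string {
  let result = "";
  for (let i = 0; i < bytes.length; i++) {
    result += String.fromCharCode(bytes[i]);
  }
  return result;
}

// Only the first `maxBytes` bytes are sampled, so the cost of guessing
// stays bounded no matter how large the file is.
function sampleForGuessing(
  bytes: Uint8Array,
  maxBytes: number = GUESS_MAX_BYTES
): string {
  // subarray creates a view on the same memory; the prefix is not copied.
  return toBinaryString(bytes.subarray(0, maxBytes));
}
```

With a cap like this in place, the guessing step sees at most `maxBytes` bytes, which is why the discussion below moves on to the decoding step.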

bpasero · 2020-09-21T12:11:35Z

@pinguin999 you can refer to https://github.com/microsoft/vscode/wiki/How-to-Contribute for how to run VSCode.

Thanks for finding that location. However, the upper bound seems to be only 65kb, so I wonder if that is already sufficient to cause the exception. Maybe you could try lowering it.

bpasero · 2020-09-22T16:17:05Z

Also, could you attach a file that triggers the crash as a zip, or send it?

pinguin999 · 2020-09-23T13:42:59Z

@bpasero Sorry, no, it's a file with private customer data.

bpasero · 2020-09-29T15:16:35Z

Without a repro for me, I cannot be of help, sorry. Will leave this open for 7 days until a repro is found, otherwise this will close.

pinguin999 · 2020-10-02T07:41:47Z

It looks like jschardet is not the problem.

I can track it down to these lines:
const decoded = decoder.write(VSBuffer.concat(bufferedChunks).buffer);

	write(buffer: Uint8Array): string {
		return this.iconvLiteDecoder.write(buffer);
	}

Buffer size is: Buffer(380293716)

It uses the iconvLiteDecoder from the node module "iconv-lite-umd"

Does it help?
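
To illustrate the alternative to a single `write` over a 380MB buffer, here is a minimal sketch that decodes the same bytes in bounded slices. It uses the standard `TextDecoder` with `stream: true` rather than iconv-lite, so it is a swapped-in technique and not what VS Code or iconv-lite-umd does; `decodeInChunks` and its parameters are hypothetical names. Which encoding labels (for example `windows-1252`) are accepted depends on the runtime.

```typescript
// Hypothetical helper: decode `bytes` slice by slice and hand each decoded
// piece to `onText`, so no single decode call sees the whole buffer.
function decodeInChunks(
  bytes: Uint8Array,
  encoding: string,
  onText: (text: string) => void,
  chunkSize: number = 64 * 1024
): void {
  if (chunkSize <= 0) {
    throw new RangeError("chunkSize must be positive");
  }
  const decoder = new TextDecoder(encoding);
  for (let offset = 0; offset < bytes.length; offset += chunkSize) {
    // subarray is a view, so slicing does not copy the input.
    const slice = bytes.subarray(offset, offset + chunkSize);
    // stream: true makes the decoder keep an incomplete multi-byte
    // sequence at the end of a slice and finish it on the next call.
    const text = decoder.decode(slice, { stream: true });
    if (text.length > 0) {
      onText(text);
    }
  }
  // A final call without input flushes whatever the decoder still holds.
  const tail = decoder.decode();
  if (tail.length > 0) {
    onText(tail);
  }
}
```

A caller would pass a callback that appends each piece to the editor model or another sink. Note that this only bounds the work done per call; if the source buffer itself is already fully in memory, that cost remains.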

bpasero · 2020-10-02T07:52:33Z

@pinguin999 ok, in the end that really is iconv-lite, but we can move the issue to iconv-lite-umd.

bpasero · 2020-10-02T07:53:44Z

Ideally we can create a reproducible case and report against https://github.com/ashtuchkin/iconv-lite.

//cc @gyzerok

jeanp413 · 2020-10-02T16:26:22Z

@bpasero
I got a bit curious about this and took a look. I was able to reproduce with any 400MB file. In my case I used this wikimedia dump (900MB uncompressed) and split it in two (the first time you open the file, run "Save with Encoding" and select windows1252).
This is what I found:

1. When running the ChangeEncodingAction, it was reading all the file contents in one go into a string using this.textFileService.read, which internally passes that string to iconv-lite, so I changed it to read the file as a stream with this.textFileService.readStream.
   https://github.com/microsoft/vscode/blob/0ecb64a2c8945dd1193967019f0734af539ca9c3/src/vs/workbench/browser/parts/editor/editorStatus.ts#L1346
2. Even after (1) the memory kept growing. I did some profiling and narrowed it down to the write method in sbcs-codec.js of iconv-lite. For some reason the GC isn't freeing the memory used between write() calls.
   https://github.com/ashtuchkin/iconv-lite/blob/efbad0a92edf1b09c111278abb104d935c6c0482/encodings/sbcs-codec.js#L59-L68
3. iconv-lite uses the Buffer object for some operations. Not sure if this is expected, but I found that when running vscode on Electron it uses the web version of Buffer (https://github.com/feross/buffer), and there is also the issue "Use TextDecoder for Buffer.toString?" (feross/buffer#268), which suggests that the problem may be in Buffer.toString.
4. The latest version of iconv-lite ("0.7.0-pre", not sure if it's ready for release) now uses typed arrays and TextDecoder for its web version (which vscode always uses): https://github.com/ashtuchkin/iconv-lite/blob/master/backends/web.js. I forked iconv-lite-umd using the latest version of iconv-lite, tested it in vscode, and it fixes the memory issue.
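
The first finding above (reading the whole file into one string versus reading it incrementally) can be sketched outside of VS Code as follows. This is only an illustration under stated assumptions: it uses synchronous Node.js `fs` reads into one reusable buffer for brevity instead of a real stream, and the standard `TextDecoder` instead of iconv-lite. `decodeFileInChunks` is a hypothetical name and has nothing to do with the actual `textFileService.readStream` implementation.

```typescript
import * as fs from "fs";

// Hypothetical helper: read a file in fixed-size chunks and decode each chunk
// as it arrives, so the full file content is never held as one buffer or string.
function decodeFileInChunks(
  path: string,
  encoding: string,
  onText: (text: string) => void,
  chunkSize: number = 64 * 1024
): void {
  const decoder = new TextDecoder(encoding);
  const chunk = new Uint8Array(chunkSize); // reused for every read
  const fd = fs.openSync(path, "r");
  try {
    let bytesRead = fs.readSync(fd, chunk, 0, chunkSize, null);
    while (bytesRead > 0) {
      // Decode only the bytes actually read; stream: true carries
      // incomplete multi-byte sequences over to the next chunk.
      const text = decoder.decode(chunk.subarray(0, bytesRead), { stream: true });
      if (text.length > 0) {
        onText(text);
      }
      bytesRead = fs.readSync(fd, chunk, 0, chunkSize, null);
    }
    const tail = decoder.decode(); // flush any remaining decoder state
    if (tail.length > 0) {
      onText(tail);
    }
  } finally {
    fs.closeSync(fd);
  }
}
```

The design point is the same one made in the list: peak memory is tied to the chunk size and to whatever the consumer keeps, not to the file size. As the second finding notes, this alone did not stop memory growth in VS Code, because the decoder itself was also retaining memory.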

bpasero · 2020-10-03T06:12:04Z

@jeanp413 thanks for the analysis

> When running the ChangeEncodingAction, it was reading all the file contents in one go inside a string

Yes indeed, want to open a PR to change this to readStream?

> I forked iconv-lite-umd using the latest version of iconv-lite, tested it in vscode and it fixes the memory issue

I would like @gyzerok and maybe @ashtuchkin to chime in and let us know when we can expect a new stable version of iconv-lite to then consume here for VSCode

jeanp413 · 2020-10-03T13:02:31Z

> Yes indeed, want to open a PR to change this to readStream?

Sure, I'll create a PR with that fix.

pinguin999 · 2020-10-10T15:48:02Z

Wow, you are so awesome! I updated to the latest insider build and it looks like it's fixed!

jeanp413 · 2020-10-10T17:16:30Z

That's great, although, for the record, back when I was testing it I think it could still crash if you change the encoding more than once.

gyzerok · 2020-10-12T12:22:31Z

@bpasero

> I would like @gyzerok and maybe @ashtuchkin to chime in and let us know when we can expect a new stable version of iconv-lite to then consume here for VSCode

We made great progress during August on moving iconv-lite away from using buffer. However, there are still a couple of things left that need to be migrated. My current problem is that I've been moving around the country and am now sort of establishing my life in a new city. That's why I do not currently have any energy to devote to the library. However, I am hoping that the situation is going to change soon.

Not sure how Alex is doing though. Seems like he didn't have time to progress forward either.

bpasero · 2020-10-12T12:24:12Z

No worries, thanks for the update and good luck with your new start 👍

aeschli assigned bpasero Sep 17, 2020

bpasero transferred this issue from microsoft/vscode Oct 2, 2020

jeanp413 mentioned this issue Oct 4, 2020

Read file contents as stream in ChangeEncodingAction microsoft/vscode#108052

Merged

bpasero closed this as completed Oct 10, 2020
