Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reopen big (400MB .csv) file with encoding #8

Closed
pinguin999 opened this issue Sep 17, 2020 · 16 comments
Closed

Reopen big (400MB .csv) file with encoding #8

pinguin999 opened this issue Sep 17, 2020 · 16 comments
Assignees

Comments

@pinguin999
Copy link

  • VSCode Version: 1.50.0-insider (user setup)
    Commit: 9e505675670d65138405321a60b0df4ddec28799
    Date: 2020-09-16T06:49:37.816Z
    Electron: 9.3.0
    Chrome: 83.0.4103.122
    Node.js: 12.14.1
    V8: 8.3.110.13-electron.0
    OS: Windows_NT x64 10.0.19041

Steps to Reproduce:

  1. Open a large (400MB -csv for example) windows 1252 file and click on the UTF-8 at the bottom right. And on reopen with encoding.
  2. Ram grows and finally code crashes with oom :)

I guess it's the encoding guessing algorithm

Does this issue occur when all extensions are disabled?: Yes

@bpasero
Copy link
Member

bpasero commented Sep 17, 2020

We should limit the data we give to jschardet to a small value to prevent this. Up for a PR if someone wants to chime in.

@pinguin999
Copy link
Author

Hi I looked into the code and on the first look the problem may be not jschardet.
There is something like this:

// ensure to limit buffer for guessing due to https://github.com/aadsm/jschardet/issues/53
const limitedBuffer = buffer.slice(0, AUTO_ENCODING_GUESS_MAX_BYTES);

// before guessing jschardet calls toString('binary') on input if it is a Buffer,
// since we are using it inside browser environment as well we do conversion ourselves
// https://github.com/aadsm/jschardet/blob/v2.1.1/src/index.js#L36-L40
const binaryString = encodeLatin1(limitedBuffer.buffer);

const guessed = jschardet.detect(binaryString);

@bpasero if you point me to a tutorial how to build and debug vscode I can try to locate the problem.

@bpasero
Copy link
Member

bpasero commented Sep 21, 2020

@pinguin999 you can refer to https://github.com/microsoft/vscode/wiki/How-to-Contribute for how to run VSCode.

Thanks for finding that location, however the upper bound seems to be only 65kb, so I wonder if that is already sufficient to cause the exception. Maybe you try to lower it.

@bpasero
Copy link
Member

bpasero commented Sep 22, 2020

Also could you attach such a file that crashes as zip? or send it?

@pinguin999
Copy link
Author

@bpasero Sorry no, it's a file with private costumer data.

@bpasero
Copy link
Member

bpasero commented Sep 29, 2020

Without a repro for me, I cannot be of help, sorry. Will leave this open for 7 days until a repro is found, otherwise this will close.

@pinguin999
Copy link
Author

It's looks like jschardet is not the problem.

I can track it down to this lines:
const decoded = decoder.write(VSBuffer.concat(bufferedChunks).buffer);

	write(buffer: Uint8Array): string {
		return this.iconvLiteDecoder.write(buffer);
	}

Buffer size is: Buffer(380293716)

It uses the iconvLiteDecoder from the node module "iconv-lite-umd"

Does it help?

@bpasero
Copy link
Member

bpasero commented Oct 2, 2020

@pinguin999 ok, that really in the end is iconv-lite, but we can move the issue to iconv-lite-umd-

@bpasero bpasero transferred this issue from microsoft/vscode Oct 2, 2020
@bpasero
Copy link
Member

bpasero commented Oct 2, 2020

Ideally we can create a reproducible case and report against https://github.com/ashtuchkin/iconv-lite.

//cc @gyzerok

@jeanp413
Copy link

jeanp413 commented Oct 2, 2020

@bpasero
I got a bit curious about this and took a look. I was able to reproduce with any 400MB file, in my case a use this wikimedia dump (900mb uncompressed) and split it in two (the first time you open the file run save with encoding and select windows1252).
This is what I found:

  1. When running the ChangeEncodingAction, it was reading all the file contents in one go inside a string using this.textFileService.read which internally is passing that string to iconv-lite, so I change it to read it as a stream this.textFileService.readStream.
    https://github.com/microsoft/vscode/blob/0ecb64a2c8945dd1193967019f0734af539ca9c3/src/vs/workbench/browser/parts/editor/editorStatus.ts#L1346
  2. Even after (1) the memory kept growing , did some profiling and narrow it down to the write method in sbcs-codec.js of iconv-lite. For some reason the GC isn't freeing the memory used between write() method calls
    https://github.com/ashtuchkin/iconv-lite/blob/efbad0a92edf1b09c111278abb104d935c6c0482/encodings/sbcs-codec.js#L59-L68
  3. iconv-lite uses Buffer object to do some operations, not sure if this is expected but I found that when running vscode on electron it uses the buffer web version (https://github.com/feross/buffer) and also there is this issue Use TextDecoder for Buffer.toString? feross/buffer#268 which says that maybe the problem is in Buffer.toString
  4. The latest version of iconv-lite ("0.7.0-pre" not sure if it's ready for release) now uses typed arrays and TextDecoder for the it's web version (which vscode always uses) https://github.com/ashtuchkin/iconv-lite/blob/master/backends/web.js. I forked iconv-lite-umd using the latest version of iconv-lite, tested it in vscode and it fixes the memory issue

@bpasero
Copy link
Member

bpasero commented Oct 3, 2020

@jeanp413 thanks for the analysis

When running the ChangeEncodingAction, it was reading all the file contents in one go inside a string

Yes indeed, want to open a PR to change this to readStream?

I forked iconv-lite-umd using the latest version of iconv-lite, tested it in vscode and it fixes the memory issue

I would like @gyzerok and maybe @ashtuchkin to chime in and let us know when we can expect a new stable version of iconv-lite to then consume here for VSCode

@jeanp413
Copy link

jeanp413 commented Oct 3, 2020

Yes indeed, want to open a PR to change this to readStream?

Sure I'll create a PR with that fix,

@pinguin999
Copy link
Author

Wow you are so awesome! I updated to the latest insider build and it looks like its fixed!

@bpasero bpasero closed this as completed Oct 10, 2020
@jeanp413
Copy link

That's great, although, for the record, back when I was testing it I think it could still crash if you change the encoding more than once.

@gyzerok
Copy link
Contributor

gyzerok commented Oct 12, 2020

@bpasero

I would like @gyzerok and maybe @ashtuchkin to chime in and let us know when we can expect a new stable version of iconv-lite to then consume here for VSCode

We've made great progress during August around moving iconv-lite away from using buffer. However there still couple of things left that need to be migrated. My current problem is - I've been moving around the country and now am sort of establishing my life in a new city. That's why I do not have any energy currently to devote to the library. However I am hoping that soon the situation is going to change.

Not sure how Alex is doing though. Seems like he didn't have time to progress forward either.

@bpasero
Copy link
Member

bpasero commented Oct 12, 2020

No worries, thanks for the update and good luck with your new start 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants