fs.readFileSync can't return a string for a big file #9489

Closed
vsemozhetbyt opened this issue Nov 6, 2016 · 17 comments
Labels
fs Issues and PRs related to the fs subsystem / file system.

Comments

@vsemozhetbyt
Contributor

  • Version: 7.0.0
  • Platform: Windows 7 x64
  • Subsystem: fs, buffer

If I try to read a big file (582,170,692 bytes, ~ 555 MB) into a buffer, it is OK. If I add an encoding and try to get a string, I get an error.

> require('fs').readFileSync('ru-ru_Wiki-2007-01-03.dsl').length
582170692
> require('fs').readFileSync('ru-ru_Wiki-2007-01-03.dsl', 'utf16le').length
Error: "toString()" failed
    at Buffer.toString (buffer.js:513:11)
    at Object.fs.readFileSync (fs.js:511:41)
    at repl:1:15
    at sigintHandlersWrap (vm.js:22:35)
    at sigintHandlersWrap (vm.js:96:12)
    at ContextifyScript.Script.runInThisContext (vm.js:21:12)
    at REPLServer.defaultEval (repl.js:313:29)
    at bound (domain.js:280:14)
    at REPLServer.runBound [as eval] (domain.js:293:12)
    at REPLServer.<anonymous> (repl.js:513:10)

It seems the string does not exceed the spec limit. Are there any other undocumented limits (or limits documented elsewhere) for fs.readFileSync() or Buffer.toString()?

@vsemozhetbyt
Contributor Author

vsemozhetbyt commented Nov 6, 2016

I've found the de facto limit for the current V8: 268,435,440 characters (Math.pow(2, 28) - 16), i.e. 536,870,880 bytes in UTF-16.

This test code is OK:

const fs = require('fs');

fs.writeFileSync('bigfile.txt', `\uFEFF${'*'.repeat(Math.pow(2, 28) - 16 - 1)}`, 'utf16le');

console.log(fs.readFileSync('bigfile.txt', 'utf16le').length);

If I add just one character, it throws the error. Should it be documented somewhere?
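The numbers in this comment also explain the failure in the original report. A minimal sketch of the arithmetic, assuming the file is read as UTF-16LE (2 bytes per code unit); the constant names are made up for illustration:

```javascript
// Limit reported above for the V8 version shipped with Node.js 7.0.0.
const V8_LIMIT_2016 = Math.pow(2, 28) - 16; // in UTF-16 code units

// File size from the original report, in bytes.
const reportedFileBytes = 582170692;

// UTF-16LE stores each code unit in 2 bytes.
const reportedCodeUnits = reportedFileBytes / 2;

console.log(V8_LIMIT_2016);                     // 268435440
console.log(V8_LIMIT_2016 * 2);                 // 536870880 bytes of UTF-16LE
console.log(reportedCodeUnits);                 // 291085346
console.log(reportedCodeUnits > V8_LIMIT_2016); // true: the string would be too long
```

So the file stays under the limit set by the language spec, but its decoded form is longer than what this V8 version allows.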

mscdex added the fs label Nov 6, 2016
@mscdex
Contributor

mscdex commented Nov 6, 2016

FWIW the limit comes from here. ChakraCore uses a much different value that depends on the value of INT_MAX (on my system that would be 2147483646, which is ~8x larger than V8's static limit). With that in mind, I'm not sure how useful it is to document a VM-specific limit like this...

@Fishrock123
Contributor

I think the docs recommendation should be (if it is not already) to use raw Buffers for "any very large data".

@Fishrock123
Contributor

(iirc toString() on that size is not exactly trivial?)
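One way to follow the "raw Buffers" advice while still working with text is to decode the Buffer piece by piece. This is only a sketch, assuming the data can be processed incrementally; `decodeInChunks` and `countCodeUnits` are hypothetical helper names, while `StringDecoder` is the core `string_decoder` class.

```javascript
const { StringDecoder } = require('string_decoder');

// Hypothetical helper: decode `buf` piece by piece and pass each decoded
// piece to `onText`, so no single string has to hold the whole file.
function decodeInChunks(buf, encoding, chunkBytes, onText) {
  const decoder = new StringDecoder(encoding);
  for (let offset = 0; offset < buf.length; offset += chunkBytes) {
    const piece = buf.subarray(offset, offset + chunkBytes);
    // write() holds back an incomplete trailing character until the next call.
    const text = decoder.write(piece);
    if (text.length > 0) onText(text);
  }
  // Flush whatever the decoder is still holding.
  const rest = decoder.end();
  if (rest.length > 0) onText(rest);
}

// Usage: count UTF-16 code units without building one giant string.
function countCodeUnits(buf, encoding, chunkBytes = 1 << 20) {
  let total = 0;
  decodeInChunks(buf, encoding, chunkBytes, (text) => { total += text.length; });
  return total;
}
```

For the file from the report, something like `countCodeUnits(fs.readFileSync(path), 'utf16le')` should return the length that the failing call could not, since each intermediate string stays small.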

targos added a commit to targos/node that referenced this issue Jan 8, 2017
buffer.toString throws an Error when the resulting string would be
bigger than `2^28 - 16`.

Fixes: nodejs#9489
@AlJohri

AlJohri commented Feb 15, 2017

Just in case anyone else finds themselves at this issue from Google, I ran into this while trying to synchronously (no readline, streams, etc.) read a 400 MB JSON file line by line.

As suggested, I used raw buffers to solve this aided by the buffer-split package.

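const fs = require('fs');
const bsplit = require('buffer-split');

function readLineJSON(path) {
  const buf = fs.readFileSync(path); // omitting encoding returns a Buffer
  const delim = Buffer.from('\n');
  const result = bsplit(buf, delim); // array of Buffers, one per line
  return result
    .map(x => x.toString()) // decode each line separately (UTF-8 by default)
    .filter(x => x !== '')
    .map(x => JSON.parse(x)); // wrap so map's index is not passed as a reviver
}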
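The same idea can be sketched with core APIs only, for anyone who would rather not add a dependency. This is an assumption-laden variant, not a drop-in replacement: it assumes UTF-8 input with one JSON value per line, and `parseJSONLines` and `readLineJSONCore` are hypothetical names.

```javascript
const fs = require('fs');

// Hypothetical helper: parse newline-delimited JSON straight from a Buffer.
function parseJSONLines(buf) {
  const records = [];
  let start = 0;
  while (start < buf.length) {
    let end = buf.indexOf(0x0a, start); // position of the next '\n' byte
    if (end === -1) end = buf.length;   // last line has no trailing newline
    // Decode only this line; trim() also drops a '\r' left by CRLF endings.
    const line = buf.toString('utf8', start, end).trim();
    if (line !== '') records.push(JSON.parse(line));
    start = end + 1;
  }
  return records;
}

function readLineJSONCore(path) {
  return parseJSONLines(fs.readFileSync(path)); // no encoding: returns a Buffer
}
```

Splitting on the byte 0x0a is safe for UTF-8 because that byte never occurs inside a multi-byte sequence. Only one line at a time is ever turned into a string.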

@addaleax
Member

@vsemozhetbyt … is there anything here you’d like to see? Would you want to open a docs PR yourself?

@vsemozhetbyt
Contributor Author

vsemozhetbyt commented Apr 29, 2017

I have no definite opinion on what should be added or how. There seems to be no consensus on whether we should document engine-specific limits, so feel free to close this until a new decision is made.

@tniessen
Member

tniessen commented Jun 2, 2017

We should certainly improve the error messages:

#define SB_STRING_TOO_LONG_ERROR \
  v8::Exception::Error(OneByteString(isolate, "\"toString()\" failed"))

Edit: @addaleax Just noticed your comment in the code. I could not find an open issue for this, is there? Any specific reason this has not been changed yet?

@addaleax
Member

addaleax commented Jun 2, 2017

“I could not find an open issue for this, is there? Any specific reason this has not been changed yet?”

@tniessen No, not beyond the discussion in #12765. The reason this has not been changed yet is that since it’s semver-major it would target Node 9, which gives plenty of time, and the fact that at some point we’re going to have to go through our native errors anyway to upgrade them to the new error code system. (Also, most of the ToDos from that PR might be suitable for first-time contributions from people with a C++ background.)

@Extarys

Extarys commented Aug 12, 2018

In which version of Node should we expect huge files to be supported?

@vsemozhetbyt
Contributor Author

@Extarys As per this blog post, the max string length was increased in V8 6.2, i.e. the latest Node.js LTS version (8.11.3) already supports larger strings.

@vsemozhetbyt
Contributor Author

vsemozhetbyt commented Aug 12, 2018

To be more exact: the new limit is mentioned in the "Increased max string length" section, i.e. 2**30 - 25 on 64-bit platforms. That is 1,073,741,799 code units, or nearly 1 GB in ASCII or nearly 2 GB in UTF-16 LE (the UTF-8 limit is less predictable, but a file of about 1 GB should be OK at least).
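A rough sketch of that arithmetic, taking the 2**30 - 25 figure from the comment above as given. `utf8BytesPerCodeUnit` is a hypothetical helper that shows why the UTF-8 figure is less predictable: the number of bytes per code unit depends on the text.

```javascript
// Limit quoted above for 64-bit platforms, in UTF-16 code units.
const MAX_UNITS = 2 ** 30 - 25;

// Encodings with a fixed number of bytes per code unit.
const approxMaxBytes = {
  latin1: MAX_UNITS,      // 1 byte per code unit
  utf16le: MAX_UNITS * 2, // 2 bytes per code unit
};

// Hypothetical helper: average UTF-8 bytes per UTF-16 code unit for a sample.
function utf8BytesPerCodeUnit(sample) {
  return Buffer.byteLength(sample, 'utf8') / sample.length;
}

console.log(MAX_UNITS);                         // 1073741799
console.log(approxMaxBytes.utf16le);            // 2147483598
console.log(utf8BytesPerCodeUnit('abc'));       // 1 (ASCII)
console.log(utf8BytesPerCodeUnit('\u043f\u0440\u0438')); // 2 (Cyrillic letters)
console.log(utf8BytesPerCodeUnit('\u20ac'));    // 3 (euro sign)
```

Since every code unit takes at least one byte in UTF-8, a UTF-8 file of N bytes decodes to at most N code units, which is why a file of about 1 GB is on the safe side.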

@Extarys

Extarys commented Aug 14, 2018

Thanks for this update! I love that when importing big logs or something.

@loganpowell

Sorry to ping on a closed thread, but I can't Google my way out of asking this: How do I set the buffer.constants.MAX_STRING_LENGTH to the new maximum? The docs say:

<integer> The largest length allowed for a single string instance.
Represents the largest length that a string primitive can have, counted in UTF-16 code units.

This value may depend on the JS engine that is being used.

but I'm not familiar with UTF-16 code units or how to use them. Do I just write:

buffer.constants.MAX_STRING_LENGTH=2**30 - 25

I found this blog post, which uses the 2**30... syntax.

@vsemozhetbyt
Contributor Author

vsemozhetbyt commented Aug 22, 2018

You can't change the value of buffer.constants.MAX_STRING_LENGTH; it is read-only. You can only use it to find out the limit, which is set by the JS engine.

@loganpowell

Ah, thank you for clarifying. Just console.log?

@vsemozhetbyt
Contributor Author

Or ===, >, < etc comparisons.
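As an illustration of such a comparison, here is a minimal sketch that checks a Buffer's size against `buffer.constants.MAX_STRING_LENGTH` before decoding. The helper names are hypothetical, and the size estimate is an assumption that only covers utf8, latin1, ascii and utf16le/ucs2 (hex and base64 produce more characters than bytes and are not handled).

```javascript
const fs = require('fs');
const { constants } = require('buffer');

// Hypothetical helper: upper bound on the decoded length, in UTF-16 code units.
function maxCodeUnits(byteLength, encoding) {
  // utf16le/ucs2 use 2 bytes per code unit; for utf8, latin1 and ascii
  // one byte yields at most one code unit.
  return (encoding === 'utf16le' || encoding === 'ucs2')
    ? Math.floor(byteLength / 2)
    : byteLength;
}

// Read-only comparison against the engine's limit.
function fitsInString(byteLength, encoding) {
  return maxCodeUnits(byteLength, encoding) <= constants.MAX_STRING_LENGTH;
}

// Return a string when it is safe to decode, otherwise the raw Buffer.
function readTextOrBuffer(path, encoding) {
  const buf = fs.readFileSync(path);
  return fitsInString(buf.length, encoding) ? buf.toString(encoding) : buf;
}

console.log(constants.MAX_STRING_LENGTH); // value depends on the JS engine
```

For UTF-8 the check is conservative: a file slightly over the limit in bytes may still decode to a short enough string, but it is treated as too large here.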


8 participants