pack backpressure #114

Merged: 3 commits, Mar 10, 2020
Conversation

@missinglink (Contributor) commented Mar 10, 2020

Heya,

I'm using this amazing lib to pack GBs of geographic data and then pipe it on to bzip2 for compression.
Unfortunately bzip2 is sloooow, so the GBs I packed end up buffered in Node.js memory, waiting for bzip2 to ask for more.

The source of my memory woes seems to be that the return value from this.push() isn't being checked, which lets pack._readableState.buffer.length grow unbounded until I run out of RAM 😭

Looking at the source code, there is already a this._drain variable, which is perfect for implementing backpressure:

diff --git a/pack.js b/pack.js
index ba4eece..f1da3b7 100644
--- a/pack.js
+++ b/pack.js
@@ -128,9 +128,10 @@ Pack.prototype.entry = function (header, buffer, callback) {
   if (Buffer.isBuffer(buffer)) {
     header.size = buffer.length
     this._encode(header)
-    this.push(buffer)
+    var ok = this.push(buffer)
     overflow(self, header.size)
-    process.nextTick(callback)
+    if (ok) process.nextTick(callback)
+    else this._drain = callback
     return new Void()
   }

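For context, a rough sketch of the pipeline this is meant to help (not my actual code; the file names are made up), using the buffer form of pack.entry() and piping into bzip2:

```js
var fs = require('fs')
var spawn = require('child_process').spawn
var tar = require('tar-stream')

var pack = tar.pack()
var bzip2 = spawn('bzip2', ['-c']) // reads stdin, writes compressed bytes to stdout

// backpressure from bzip2's slow stdin should now reach the entry
// callbacks instead of piling chunks up in pack's readable buffer
pack.pipe(bzip2.stdin)
bzip2.stdout.pipe(fs.createWriteStream('out.tar.bz2'))

var body = Buffer.from(JSON.stringify({ type: 'Feature' /* ... */ }))
pack.entry({ name: 'feature-0001.geojson', size: body.length }, body, function (err) {
  if (err) throw err
  // only queue the next entry once this one has been accepted
})
pack.finalize()
```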
I've added a simple test case, which I'm happy to clean up if this PR is acceptable.

The differences you'll notice in how the test behaves are (a rough sketch of its shape follows the list):

  • without this PR, all 24 items are queued in memory
  • with this PR, only the first 16 are initially queued, and the remaining items are added only as requested.
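A rough sketch of that shape (not the actual test file; the exact count depends on the payload size and the stream's highWaterMark), chaining entries via the entry callback while nothing reads:

```js
var tar = require('tar-stream')

var pack = tar.pack()
var queued = 0

function addNext () {
  if (queued === 24) return pack.finalize()
  queued++
  var body = Buffer.alloc(1024, 'x')
  // with this PR the callback is parked in _drain once push() returns
  // false, so no further entries are queued while nothing is reading
  pack.entry({ name: 'item-' + queued + '.txt' }, body, addNext)
}
addNext()

setTimeout(function () {
  console.log('entries queued before any read:', queued)
  pack.resume() // start consuming; the remaining entries are added on demand
}, 100)
```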

Please let me know if you think this is something you would consider including 🙇

@mafintosh (Owner) commented

PR makes sense. Will merge when I'm at my office. Curious: why aren't you using the streaming API instead of passing the full buffer?

@missinglink (Contributor, Author) commented

I'm doing this at the end of a series of object streams:

object_stream() -> object_stream() -> tar_stream() -> shell_stream()

When it comes time to pack files, I extract some fields from the object to generate the header and then call JSON.stringify() on the whole object to serialize it to a JSON string (it's GeoJSON).

At this point I already have the data as a string, so it seemed appropriate to use the buffer interface.
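Roughly, the glue between the object streams and tar-stream looks like this (a sketch; the upstream streams and entry names are placeholders):

```js
var stream = require('stream')
var tar = require('tar-stream')

var pack = tar.pack()
var count = 0

var packer = new stream.Writable({
  objectMode: true,
  write: function (feature, enc, next) {
    var body = Buffer.from(JSON.stringify(feature))
    // passing next as the entry callback defers the upstream object
    // streams until pack has room, so backpressure propagates all the way
    pack.entry({ name: 'feature-' + (count++) + '.geojson', size: body.length }, body, next)
  },
  final: function (done) {
    pack.finalize()
    done()
  }
})

// objectStream().pipe(packer)   // upstream object streams (hypothetical)
// pack.pipe(shellStream)        // e.g. bzip2's stdin
```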

@missinglink (Contributor, Author) commented

I also think the requirement to specify a byte size up front makes the streaming interface less attractive than the buffer interface.

I'm guessing that this is due to a requirement of the TAR format, presumably a leading byte which indicates the content length?

@mafintosh (Owner) commented

Yes, tar requires you to declare the size upfront. Use case makes sense.
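For reference, that's why the streaming form of pack.entry() takes the size in the header before any data is written (a sketch; totalBytes and source are placeholders):

```js
// the tar header is written first and contains the size field, so the
// byte count has to be known up front when using the streaming API
var entry = pack.entry({ name: 'data.geojson', size: totalBytes }, function (err) {
  if (err) throw err
})
source.pipe(entry)
```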

@mafintosh merged commit 5dfddf7 into mafintosh:master on Mar 10, 2020
@mafintosh (Owner) commented

Out in 2.1.2, thanks!

@piranna (Collaborator) commented Mar 10, 2020

> I'm guessing that this is due to a requirement of the TAR format, presumably a leading byte which indicates the content length?

What about making it optional and trying to find the actual length if it's not defined? Or would that lead to problems like memory usage?

@mafintosh (Owner) commented

@piranna that would require buffering the entire stream, so in that case just use the API this PR improves

@piranna (Collaborator) commented Mar 10, 2020

> @piranna that would require buffering the entire stream, so in that case just use the API this PR improves

Yes, I suppose, but in that case, is it possible to do it automatically? :-)

@mafintosh (Owner) commented

You could magically do it on the stream, yes, but I think that'd lead to serious bugs for people not realising it buffers. Better docs for this are prob the way to go if we wanna improve it.

@missinglink (Contributor, Author) commented

I like the options developers have; they cater to different use cases:

I have a stream-like thing and I know the length up front (e.g. a file)
-> use the streams API

I have some bytes in memory already
-> use the buffer API

I have a stream-like thing and I don't know the length up front (e.g. an object stream)

...and I am constrained by memory but am willing to spend CPU
-> read through the stream once and count the bytes, then use the streams API (see the sketch below)

...and I am not worried about memory usage
-> convert the stream to a string/buffer and use the buffer API
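A sketch of the count-then-stream option above, assuming the source can be produced twice (makeStream() is a hypothetical factory, e.g. re-running a query or reopening a file):

```js
function packUnknownSize (pack, name, makeStream, callback) {
  var size = 0
  makeStream()
    .on('data', function (chunk) { size += chunk.length }) // first pass: count bytes only
    .on('error', callback)
    .on('end', function () {
      var entry = pack.entry({ name: name, size: size }, callback)
      makeStream().pipe(entry) // second pass: write the actual bytes
    })
}
```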

@piranna (Collaborator) commented Mar 10, 2020

> You could magically do it on the stream, yes, but I think that'd lead to serious bugs for people not realising it buffers. Better docs for this are prob the way to go if we wanna improve it.

My idea was to do something like this under the hood:

> I have a stream-like thing and I don't know the length up-front (eg. an object stream)
>
> ...and I am constrained by memory but am willing to spend CPU
> -> read through the stream once and count the bytes, use the streams API

But you are right, explaining the use cases in the docs seems like a better alternative; the developer probably knows it beforehand :-)
