pack backpressure #114

Merged: 3 commits, Mar 10, 2020
Conversation

@missinglink (Contributor) commented Mar 10, 2020

Heya,

I'm using this amazing lib to pack GBs of geographic data and then pipe it on to bzip2 for compression.
Unfortunately bzip2 is sloooow, so the GBs I packed end up buffered in Node.js memory, waiting for bzip2 to ask for more.

The source of my memory woes seems to be that the return value from this.push() isn't being checked, which lets pack._readableState.buffer.length grow unbounded until I run out of RAM 😭

Looking at the source code, there is already a this._drain variable, which is perfect for implementing backpressure:

diff --git a/pack.js b/pack.js
index ba4eece..f1da3b7 100644
--- a/pack.js
+++ b/pack.js
@@ -128,9 +128,10 @@ Pack.prototype.entry = function (header, buffer, callback) {
   if (Buffer.isBuffer(buffer)) {
     header.size = buffer.length
     this._encode(header)
-    this.push(buffer)
+    var ok = this.push(buffer)
     overflow(self, header.size)
-    process.nextTick(callback)
+    if (ok) process.nextTick(callback)
+    else this._drain = callback
     return new Void()
   }

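For context, a rough sketch of the pipeline this is meant to help (not my actual code; the file names are made up), using the buffer form of pack.entry() and piping into bzip2:

```js
var fs = require('fs')
var spawn = require('child_process').spawn
var tar = require('tar-stream')

var pack = tar.pack()
var bzip2 = spawn('bzip2', ['-c']) // reads stdin, writes compressed bytes to stdout

// backpressure from bzip2's slow stdin should now reach the entry
// callbacks instead of piling chunks up in pack's readable buffer
pack.pipe(bzip2.stdin)
bzip2.stdout.pipe(fs.createWriteStream('out.tar.bz2'))

var body = Buffer.from(JSON.stringify({ type: 'Feature' /* ... */ }))
pack.entry({ name: 'feature-0001.geojson', size: body.length }, body, function (err) {
  if (err) throw err
  // only queue the next entry once this one has been accepted
})
pack.finalize()
```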
I've added a simple test case, which I'm happy to clean up if this PR is acceptable.

The differences you'll notice in how the test behaves are (a rough sketch of its shape follows the list):

  • without this PR, all 24 items are queued in memory
  • with this PR, only the first 16 are initially queued, and the remaining items are added only as requested.
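A rough sketch of that shape (not the actual test file; the exact count depends on the payload size and the stream's highWaterMark), chaining entries via the entry callback while nothing reads:

```js
var tar = require('tar-stream')

var pack = tar.pack()
var queued = 0

function addNext () {
  if (queued === 24) return pack.finalize()
  queued++
  var body = Buffer.alloc(1024, 'x')
  // with this PR the callback is parked in _drain once push() returns
  // false, so no further entries are queued while nothing is reading
  pack.entry({ name: 'item-' + queued + '.txt' }, body, addNext)
}
addNext()

setTimeout(function () {
  console.log('entries queued before any read:', queued)
  pack.resume() // start consuming; the remaining entries are added on demand
}, 100)
```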

Please let me know if you think this is something you would consider including 🙇

@mafintosh (Owner) commented

PR makes sense. Will merge when I'm at my office. Curious: why aren't you using the streaming API instead of passing the full buffer?

@missinglink (Contributor, Author) commented

I'm doing this at the end of a series of object streams:

object_stream() -> object_stream() -> tar_stream() -> shell_stream()

When it comes time to pack files, I extract some fields from the object to generate the header and then call JSON.stringify() on the whole object to serialize it to a JSON string (it's GeoJSON).

At this point I already have the data as a string, so it seemed appropriate to use the buffer interface.
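Roughly, the glue between the object streams and tar-stream looks like this (a sketch; the upstream streams and entry names are placeholders):

```js
var stream = require('stream')
var tar = require('tar-stream')

var pack = tar.pack()
var count = 0

var packer = new stream.Writable({
  objectMode: true,
  write: function (feature, enc, next) {
    var body = Buffer.from(JSON.stringify(feature))
    // passing next as the entry callback defers the upstream object
    // streams until pack has room, so backpressure propagates all the way
    pack.entry({ name: 'feature-' + (count++) + '.geojson', size: body.length }, body, next)
  },
  final: function (done) {
    pack.finalize()
    done()
  }
})

// objectStream().pipe(packer)   // upstream object streams (hypothetical)
// pack.pipe(shellStream)        // e.g. bzip2's stdin
```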

@missinglink (Contributor, Author) commented

I also think the requirement to specify a byte size up front makes the streaming interface less attractive than the buffer interface.

I'm guessing that this is due to a requirement of the TAR format, presumably a leading byte which indicates the content length?

@mafintosh (Owner) commented

Yes, tar requires you to declare the size upfront. Use case makes sense.
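For reference, that's why the streaming form of pack.entry() takes the size in the header before any data is written (a sketch; totalBytes and source are placeholders):

```js
// the tar header is written first and contains the size field, so the
// byte count has to be known up front when using the streaming API
var entry = pack.entry({ name: 'data.geojson', size: totalBytes }, function (err) {
  if (err) throw err
})
source.pipe(entry)
```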

@mafintosh merged commit 5dfddf7 into mafintosh:master on Mar 10, 2020
@mafintosh (Owner) commented

Out in 2.1.2, thanks!

@piranna (Collaborator) commented Mar 10, 2020

> I'm guessing that this is due to a requirement of the TAR format, presumably a leading byte which indicates the content length?

What about making it optional and trying to find the actual length if it's not defined? Or would that lead to problems like memory usage?

@mafintosh (Owner) commented

@piranna that would require buffering the entire stream, so in that case just use the API this PR improves

@piranna (Collaborator) commented Mar 10, 2020

> @piranna that would require buffering the entire stream, so in that case just use the API this PR improves

Yes, I suppose, but in that case, is it possible to do it automatically? :-)

@mafintosh (Owner) commented

You could magically do it on the stream, yes, but I think that'd lead to serious bugs for people not realising it buffers. Better docs for this are prob the way to go if we wanna improve it.

@missinglink (Contributor, Author) commented

I like the options developers have; they cater to different use cases:

I have a stream-like thing and I know the length up front (e.g. a file)
-> use the streams API

I have some bytes in memory already
-> use the buffer API

I have a stream-like thing and I don't know the length up front (e.g. an object stream)

...and I am constrained by memory but am willing to spend CPU
-> read through the stream once and count the bytes, then use the streams API (see the sketch below)

...and I am not worried about memory usage
-> convert the stream to a string/buffer and use the buffer API
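A sketch of the count-then-stream option above, assuming the source can be produced twice (makeStream() is a hypothetical factory, e.g. re-running a query or reopening a file):

```js
function packUnknownSize (pack, name, makeStream, callback) {
  var size = 0
  makeStream()
    .on('data', function (chunk) { size += chunk.length }) // first pass: count bytes only
    .on('error', callback)
    .on('end', function () {
      var entry = pack.entry({ name: name, size: size }, callback)
      makeStream().pipe(entry) // second pass: write the actual bytes
    })
}
```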

@piranna (Collaborator) commented Mar 10, 2020

> You could magically do it on the stream, yes, but I think that'd lead to serious bugs for people not realising it buffers. Better docs for this are prob the way to go if we wanna improve it.

My idea was to do something like this under the hood:

> I have a stream-like thing and I don't know the length up-front (eg. an object stream)
>
> ...and I am constrained by memory but am willing to spend CPU
> -> read through the stream once and count the bytes, use the streams API

But you are right, explaining the use cases in the docs seems like a better alternative; the developer probably knows it beforehand :-)
