Crippling performance impact of PassThrough, Transform on object stream pipelines #5429
Piping an infinite Readable to an infinite Writable (using setImmediate to keep the event loop alive), throughput collapses as PassThrough or Transform streams are inserted between them.
I got these numbers running benchmark/object-stream-throughput.js under Node v0.10.5 on a recent MacBook Pro.
(Note: the script defaults to
I was hoping to be able to process millions of objects per second by arranging object stream transforms in a pipeline. I figure I should find another approach.
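For reference, the shape of the benchmark is roughly this (a simplified sketch, not the actual benchmark/object-stream-throughput.js):

```js
// Simplified sketch of the setup described above: an endless
// object-mode Readable piped through N PassThrough streams into a
// counting Writable, with setImmediate every 10 objects to keep the
// event loop breathing. Parameters are placeholders.
var stream = require('stream');

var nIntermediaries = Number(process.argv[2] || 0);
var DURATION_MS = 5000;

var produced = 0;
var source = new stream.Readable({ objectMode: true });
source._read = function () {
  if (++produced % 10 === 0) {
    var self = this;
    setImmediate(function () { self.push({}); });
  } else {
    this.push({});
  }
};

var received = 0;
var sink = new stream.Writable({ objectMode: true });
sink._write = function (obj, enc, cb) { received++; cb(); };

var tail = source;
for (var i = 0; i < nIntermediaries; i++)
  tail = tail.pipe(new stream.PassThrough({ objectMode: true }));
tail.pipe(sink);

setTimeout(function () {
  var perSec = Math.round(received / (DURATION_MS / 1000));
  console.log(nIntermediaries + ' intermediaries: ' + perSec + ' objects/sec');
  process.exit(0);
}, DURATION_MS);
```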
Every intermediary adds a bunch of function calls and event emission. It's not free.
You aren't going to get millions of objects per second if you're calling setImmediate every 10 objects anyway. Streams are primarily for I/O, and designed to slow down throughput so as not to overwhelm slow consumers.
It sounds like what you really want is just a bunch of functions that you call on an object to mutate it or something, not an IO focused system designed to communicate backpressure.
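Concretely, that might look like plain function composition instead of a chain of Transform streams (a sketch; the step names are invented for illustration):

```js
// Sketch of the suggested alternative: a bunch of functions called on
// each object directly, no stream machinery per step.
function parse(obj)  { /* mutate or derive */ return obj; }
function enrich(obj) { /* mutate or derive */ return obj; }
function render(obj) { /* mutate or derive */ return obj; }

var steps = [parse, enrich, render];

function processObject(obj) {
  for (var i = 0; i < steps.length; i++)
    obj = steps[i](obj);
  return obj;
}
```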
My millions of objects are parsed from an I/O stream. After processing, they get rendered into an I/O stream. I designed my pipeline around streams precisely because I need to communicate backpressure.
Do buffer and string streams suffer the same 85% performance drop with three PassThrough or Transform streams, and 99% with four? I don't think we can write that kind of degradation off as "not free". We should document it: 25K buffers/sec at 1500-byte reads is ~35MB/s. That'd drop to 25MB/s, then 6MB/s (the 85% drop, at three transforms), then ~0MB/s (the 99% drop, at four). 4K buffers would start at roughly 1Gbps wire speed but drop to nothing at the same rate.
If buffer and string streams don't behave like this, I think we need to document that object streams have markedly different performance characteristics in pipelines and that transforms on them should be avoided.
Even if the changes are to documentation rather than library code, I think it's worth tracking.
I'll hook my example up to something stream-based (perhaps dumping everything through a named pipe) to keep the event loop running, process objects in chunks sized to the MTU of the origin data, and see whether that changes the result. I'll also run the same check on buffer and string streams.
I'm not sure I understand the issue with the benchmark. First, your duration is too small. Check the following:
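For example, something along these lines (a sketch of the kind of longer-run measurement I mean; the parameters are arbitrary):

```js
// Measure over many seconds instead of a short burst; short runs are
// dominated by startup cost and GC noise.
var stream = require('stream');

var pt = new stream.PassThrough({ objectMode: true });
var count = 0;
pt.on('data', function () { count++; });

var start = Date.now();
(function writeBatch() {
  for (var i = 0; i < 1000; i++)
    pt.write({ i: i });
  if (Date.now() - start < 10000) {
    setImmediate(writeBatch);
  } else {
    var secs = (Date.now() - start) / 1000;
    console.log(Math.round(count / secs) + ' objects/sec');
  }
})();
```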
Now assuming that number is the total after
Also, here's a sample of taking
And now the output of
From the above we can see that the overhead from node is relatively small. The majority of the run-time degradation comes from needing to do internal work, like copying around the data.
Remember that a Transform is both a Writable and a Readable, so every hop pays both halves of the stream machinery.
Dammit. I stuffed up the spreadsheet.
TL;DR: we still have a problem with object stream performance that we should consider documenting. Even if you stop punching yourself in the head, object streams suffer a per-transform penalty that we don't see on non-object streams.
I've fixed the benchmark; it now properly simulates parsing objects out of a network stream. Involving the network keeps the event loop running without the setImmediate hackery. I threw in streams of normal Buffers as a control.
Endlessly writing 64KB buffers from the other side of the socket, I can pipe ~650MB/s through anywhere from 0 to 8 Transform streams. That's close to Transform streams being "free" for buffers, to use Isaac's term. Woot!
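The control case has roughly this shape (a sketch; the port, chunk size, and identity transform are placeholder assumptions, not the actual script):

```js
// Sketch of the buffer control: a socket endlessly serving 64KB
// buffers, piped through N identity Transforms, with throughput
// measured at the far end.
var net = require('net');
var stream = require('stream');

var CHUNK = new Buffer(65536);   // v0.10-era Buffer API
var N = Number(process.argv[2] || 0);

net.createServer(function (sock) {
  (function write() {
    while (sock.write(CHUNK)) {}   // write until backpressure
    sock.once('drain', write);
  })();
}).listen(9999, function () {
  var tail = net.connect(9999);
  for (var i = 0; i < N; i++) {
    var t = new stream.Transform();
    t._transform = function (chunk, enc, cb) { cb(null, chunk); };
    tail = tail.pipe(t);
  }
  var bytes = 0;
  var start = Date.now();
  tail.on('data', function (c) { bytes += c.length; });
  setTimeout(function () {
    var secs = (Date.now() - start) / 1000;
    console.log((bytes / 1048576 / secs).toFixed(1) +
                ' MB/s through ' + N + ' transforms');
    process.exit(0);
  }, 5000);
});
```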
Transforming at 12 objects/KB to match my data means 768 objects pushed per 64KB chunk. I get ~900K objects/sec through that initial Transform, which caps my input throughput at ~73MB/s. Adding one more Transform drops throughput by another 50%; two by 70%; four by 80%; eight by 90%. Ouch.
Chunking the output objects into 768-element arrays (i.e. not punching myself in the head) essentially eliminates the bottleneck at the initial transform. Eight additional transforms drop performance by ~15%, but I'm willing to treat that as an acceptable "not free", and, if documented, an expected one.
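For the record, the difference between the two strategies is just where the per-object pushes happen (a sketch; the object shape is invented, and 768 comes from the 12 objects/KB figure above):

```js
var stream = require('stream');

// Strategy 1: one push per object, i.e. 768 pushes per 64KB chunk.
var perObject = new stream.Transform({ objectMode: true });
perObject._transform = function (chunk, enc, cb) {
  for (var i = 0; i < 768; i++)
    this.push({ seq: i });         // hypothetical parsed object
  cb();
};

// Strategy 2: one push per chunk; the 768 objects travel together as
// a single array, so downstream stages see 1/768th the traffic.
var perArray = new stream.Transform({ objectMode: true });
perArray._transform = function (chunk, enc, cb) {
  var batch = new Array(768);
  for (var i = 0; i < 768; i++)
    batch[i] = { seq: i };
  this.push(batch);
  cb();
};
```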