This repository has been archived by the owner. It is now read-only.

Crippling performance impact of PassThrough, Transform on object stream pipelines #5429

garthk opened this issue May 8, 2013 · 5 comments



@garthk garthk commented May 8, 2013

Piping an infinite Readable to an infinite Writable and using setImmediate to yield to the event loop every time, I get:

  • ~25K objects/second when piping directly
  • ~15K objects/second when piping via one PassThrough or do-nothing Transform
  • ~4K objects/second when piping via two intermediaries as above
  • <1 object/second when piping via three intermediaries as above

I got these numbers running benchmark/object-stream-throughput.js under Node v0.10.5 on a recent MacBook Pro.

(Note: the script defaults to setImmediate every ten items, which is ~8X faster for direct pipes but rapidly loses its advantage as we hook up more intermediaries.)

I was hoping to be able to process millions of objects per second by arranging object stream transforms in a pipeline. I figure I should find another approach.


@isaacs isaacs commented May 8, 2013

Every intermediary adds a bunch of function calls and event emission. It's not free.

You aren't going to get millions of objects per second if you have setImmediate every 10 objects, anyway. Streams are primarily for IO, and designed to slow down the throughput so as not to overwhelm slow consumers.

It sounds like what you really want is just a bunch of functions that you call on an object to mutate it or something, not an IO focused system designed to communicate backpressure.
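The alternative isaacs is describing, sketched with illustrative names: compose plain functions over each object instead of routing every object through the stream machinery.

```javascript
// Illustrative sketch, not code from this thread: a pipeline of plain
// per-object functions, with no stream or event-emitter overhead.
function compose(...fns) {
  return (obj) => fns.reduce((acc, fn) => fn(acc), obj);
}

// Example per-object transforms (hypothetical).
const addOne = (o) => ({ ...o, n: o.n + 1 });
const double = (o) => ({ ...o, n: o.n * 2 });

const transform = compose(addOne, double);
console.log(transform({ n: 20 })); // { n: 42 }
```

Such a composed function can still sit inside a single Transform stream at the edges, so backpressure is communicated once per chunk rather than once per processing step.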

@isaacs isaacs closed this May 8, 2013

@garthk garthk commented May 8, 2013

My millions of objects are parsed from an I/O stream. After processing, they get rendered into an I/O stream. I designed my pipeline around streams precisely because I need to communicate backpressure.

Do buffer and string streams suffer the same 85% performance drop with three PassThrough or Transform streams, and 99% with four? I don't think we can write that kind of degradation off as "not free". We should document it: 25K buffers/sec (1500-byte reads) is 35MB/s. That'd drop to 25MB/s, 6MB/s, and ~0MB/s as you add transforms. 4K buffers/sec would start at 1Gbps wire speed but drop to nothing at the same rate.

If buffer and string streams don't behave like this, I think we need to document that object streams have markedly different performance characteristics in pipelines and that transforms should be avoided.

Even if the changes are to documentation rather than library code, I think it's worth tracking.

I'll hook my example up to something stream-based (perhaps dumping stuff through a named pipe) to keep the event loop running, process objects in MTU-of-origin-data sized chunks, and see if that affects the result. I'll also check for buffer and string streams.


@trevnorris trevnorris commented May 9, 2013

I'm not sure I understand the issue with the benchmark. First, your duration is too small. Check the following:

object-stream-throughput.js dur=1 steps=4 type=Transform: 3.1865
object-stream-throughput.js dur=10 steps=4 type=Transform: 14223

Now, assuming that number is the total after dur seconds, the latter run achieved a much higher rate than when dur=1.

Also, here's a sample of taking --prof of the test at dur=5 and steps=4:

Statistical profiling result from v8.log, (5044 ticks, 21 unaccounted, 0 excluded).

 [Shared libraries]:
   ticks  total  nonlib   name
   2129   42.2%    0.0%  /lib/x86_64-linux-gnu/

 [JavaScript]:
   ticks  total  nonlib   name
    100    2.0%    3.4%  LazyCompile: *readableAddChunk _stream_readable.js:121
     98    1.9%    3.4%  LazyCompile: *onwrite _stream_writable.js:233
     92    1.8%    3.2%  LazyCompile: *Writable.write _stream_writable.js:155
     62    1.2%    2.1%  LazyCompile: * _stream_readable.js:236
     57    1.1%    2.0%  LazyCompile: *Transform._read _stream_transform.js:174

 [C++]:
   ticks  total  nonlib   name
    213    4.2%    7.3%  v8::internal::FlexibleBodyVisitor<v8::internal::IncrementalMarkingMarkingVisitor, v8::internal::FixedArray::BodyDescriptor, void>::Visit(v8::internal::Map*, v8::inter
     99    2.0%    3.4%  v8::internal::IncrementalMarking::Step(long, v8::internal::IncrementalMarking::CompletionAction)

 [GC]:
   ticks  total  nonlib   name
    453    9.0%

 [Bottom up (heavy) profile]:
  Note: percentage shows a share of a particular caller in the total
  amount of its parent calls.
  Callers occupying less than 2.0% are not shown.

   ticks parent  name
   2129   42.2%  /lib/x86_64-linux-gnu/
   1148   53.9%    LazyCompile: *onwrite _stream_writable.js:233
   1132   98.6%      LazyCompile: *WritableState.onwrite _stream_writable.js:91
   1132  100.0%        LazyCompile: *afterTransform _stream_transform.js:83
   1132  100.0%          LazyCompile: *TransformState.afterTransform _stream_transform.js:73
   1132  100.0%            LazyCompile: *NullTransform._transform /var/projects/node/benchmark/object-stream-throughput.js:94
    659   31.0%    LazyCompile: clearBuffer _stream_writable.js:279
    616   93.5%      LazyCompile: *onwrite _stream_writable.js:233
    616  100.0%        LazyCompile: *WritableState.onwrite _stream_writable.js:91
    605   98.2%          LazyCompile: *afterTransform _stream_transform.js:83
    605  100.0%            LazyCompile: *TransformState.afterTransform _stream_transform.js:73

And now the output of perf:

 65.98%  node         [.] __memcpy_ssse3_back                                                                                                                                    
  5.45%  node       [.] 0x00001ad3fbe06140                                                                                                                                     
  5.12%  node       [.] 0x00001387a708d4bf                                                                                                                                     
  2.17%  node  node                 [.] v8::internal::FlexibleBodyVisitor<v8::internal::IncrementalMarkingMarkingVisitor, v8::internal::FixedArray::BodyDescriptor, void>::Visit(v8::internal::
  2.12%  node  node                 [.] v8::internal::IncrementalMarking::Step(long, v8::internal::IncrementalMarking::CompletionAction)                                                       
  1.22%  node         [.] __memset_sse2

From the above we can see that the overhead from node is relatively small. The majority of the run-time degradation comes from needing to do internal work, like copying around the data.

Remember that a Buffer is external memory, but whenever you bring it in and parse it as a string, the data has to be copied into the V8 heap. Then when you pass it back out as a Buffer before going through the stream again, it has to be copied out to external memory again. These are expensive operations.
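The round trip trevnorris describes can be made concrete (illustrative sketch using the modern Buffer API; the copies are what show up as __memcpy time in the perf output above):

```javascript
// Each Buffer -> string -> Buffer round trip copies the bytes between
// external memory and the V8 heap, and back out again.
const buf = Buffer.from('x'.repeat(64 * 1024)); // bytes in external memory
const str = buf.toString('utf8');               // copy into the V8 heap
const out = Buffer.from(str, 'utf8');           // copy back out to external memory

console.log(out.length === buf.length); // true: same data, copied twice
```

Keeping the data as a Buffer end to end, or parsing it only once at the edge of the pipeline, avoids paying for these copies at every step.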


@garthk garthk commented May 9, 2013

TL;DR: wipe-out degradation disappears when using a network-based data source instead of setImmediate.

More in a subsequent post; I'm posting this only to head off anyone doing extra work on the old benchmark. (Sorry I didn't win the race, Trev.)


@garthk garthk commented May 9, 2013

Damnit. Stuffed up with the sheet.

TL;DR: we still have a problem with object stream performance that we should consider documenting. Even if you stop punching yourself in the head, object streams suffer a performance problem with transforms that we don't see on non-object streams.

I've fixed the benchmark; it now properly simulates parsing objects out of a network stream. Involving the network keeps the event loop running without the setImmediate hackery. I threw in streams of normal Buffers as a control.

Endlessly writing 64KB buffers from the other side of the socket, I can pipe ~650MB/s with 0…8 Transform streams. That's close to Transform streams for buffers being "free", to use Isaac's term. Woot!

Transforming to 12 objects/KB to match my data pushes 768 objects per 64KB chunk. I get ~900K objects/sec through that initial Transform, reducing my input throughput to 73MB/s. Adding one more Transform drops throughput by another 50%; two by 70%; four by 80%; eight by 90%. Ouch.

Chunking up the output objects into 768-length arrays (i.e. not punching self in head) ~eliminates the bottleneck of the initial transform. Eight additional transforms drop performance by ~15%, but I'm willing to treat that as an acceptable "not free" — and, if documented, an expected "not free".
