Node.js Streams vs pull-streams vs ReactiveJS #362

dignifiedquire · 2016-07-22T13:45:46Z

Reasons for consideration

In the recent months of developing js-ipfs we had some issues with various parts of the different versions of node-streams. Some of the major ones were

Hard to control error handling, especially as we pipe and wrap streams many times from module to module
Issues with flushing data, write streams don't give guarantees about when data has been written to the underlying data source, for example disk. This resulted in some very problematic race conditions (ref Error: ENOENT: no such file or directory #351)
Behaviour different depending on the module. Given the complicated history it is not always clear when combining streams from core and multiple modules in which mode they are currently and why they are not behaving as expected.

In light of these issues we are exploring different alternatives to replace the stream based code inside js-ipfs

Currently I have identified two possible alternatives

1. Pull Streams

Example code

[WIP] Refactor to use pull streams libp2p/js-libp2p-secio#8

Resources

2. RxJS

Example code

Resources

The text was updated successfully, but these errors were encountered:

dignifiedquire · 2016-08-01T16:43:58Z

The only viable other alternative to node-streams I found so far was https://pull-stream.github.io. They are quite nice and there is an example of secio implemented here: libp2p/js-libp2p-secio#8

The main barrier I hit with them was integration with node-streams, so if we were to move the whole code base to use pull-streams this might solve those issues.

daviddias · 2016-08-04T11:50:31Z

First, thank you @dignifiedquire for chasing this streams issue and for researching possible solutions by doing experiments with actual implementations, really 5 🌟 👌.

To give a bit more emphasis to the issues we've realised (through the hard way), the Node.js Streams, also known as Streams 2, (problem 1) don't propagate errors, even (specially) when piped (.pipe) together. Adding to this, (problem 2) the fs writable streams don't guarantee that the data was flushed to disk when the 'end' event is emitted, given the chance for some race conditions when we try to read faster than the syscal has time to flush. The last issue (problem 3), doesn't come from the Node.js streams, but more from the streams ecosystem as a whole, where not all libraries that implement streams actually implement the full interface/expectations, giving room for some errors and requiring some monkey patching.

I believe, after having gone through and also talked with several people, that we shouldn't treat these 3 problems as one, and falling into the trap of patching the 3 pain points with a large bandaid, which will bring a lot more than what we need, increasing the complexity and also hitting performance and the js blob size.

Just to put in context, a normal communication between two IPFS peers is something like:

Candidate solutions

pump

pump is a small module that offers correct error propagation and stream destruction. It can be seen as .pipe done right.

pump is part of a collection of utility modules for Node.js Streams, called mississippi.

rxjs

rxjs is a very large project supported by lots of groups and companies, its documentation is in a very advanced state and exists for more languages than Node.js.

It offers a lot of nice APIs to operate streams of data, or 'arrays over time', however, (from what I can understand), doesn't offer any backpressure mechanism, essential for networking protocols or operations over large quantities of data. In some sense, it is similar to Node.js Streams 1 (Streams Classic), but with fancier APIs

pull-streams

pull-streams are super interesting, I really like how the ecosystem is growing and they result from similar frustrations that Node.js developers had in the past.

I was initially confused of how to ensure that pull-streams offer backpressure mechanism, specially compatible with the underlying transports (push streams), but @yoshuawuyts appeared in the IPFS workshop for the rescue and gave me a really good explanation.

Essentially pull-streams leverage the underlying push-streams highWatermark to do the network backpressure for them, as seen in the following image:

Problem 1 - propagation of errors

This is a requirement for js-ipfs, as each part of the code works isolated from the remaining services, but they all mount on libp2p to open the connections. As seen in Figure 1, an error in a network socket will cause several parts of the code to have to handle that error and react to it.

It seems to me that any of the proposed solutions would solve this problem really well. Then it will be up to the developers to understand how to handle the network errors if they open a connection themselves.

Problem 2 - ensure data gets flushed

It doesn't seem to me that any of the proposed solutions offers a fix for this problem, because in the end, we are always using Node.js stream to write to disk.

What we can (and should do) is solve this issue at the ipfs-repo level, making sure that data is flushed before we let the end event be emitted, and while we are at it, add a LRU cache to make reads faster, mitigating also the problem of the false negatives.

Problem 3 - streams ecosystem

This feels a bit like lying to ourselves, it is true that not all stream modules implement Node.js streams as we could expect, but that doesn't mean that whatever other modules we get for pull-streams or rxjs will do everything just right.

This has to be solved by having very comprehensive tests, unit, integration and chaos tests, to make sure we don't miss anything.

Important to note that, modules like spdy, won't have a rxjs or a pull-streams complete version (at least soon).

Other concerns

As we can see in Figure 1, we do a lot of plumbing between the several pieces of the code sometimes through read X bytes, others by directing the communication (.pipe), others by wrapping the stream into a new stream (duplexify). I don't see a clean and easy way to do that either with rxjs or pull-streams, without having to do even more wrapping and patching. This will make the code more complex and therefore, error prone.
rxjs requires some very specific wrapping of the transports to work (see https://github.com/dignifiedquire/rxjs.node/blob/master/src/socket-subject.js and https://github.com/dignifiedquire/rxjs.node/blob/master/src/stream-subject.js), this still doesn't feel right to me. To buy in into this interface, we would have to expose it on the libp2p-{tcp, websockets, webrtc, etc} transports too.
I strongly recommend and vote that we keep exposing natural (language constructs based) API, such as callbacks, promises an streams. Requiring a user to buy into a library to be able to use js-ipfs creates a lot of friction.
Each time we add a library, we add a considerable amount of code that gets bundled into the project and as you may know, currently js-ipfs is close to 4MB of JavaScript. If that library because a dependency of how all the other modules get attached together, than it gets very hard to phase it out one day.

Other thoughts

We have to be cautious when proposing/moving to other ecosystems. Every single one of them has its qualms and moving back and forth will always add noise in the migration. There are plenty of libraries and frameworks that optimize the developers experience, however, since they might require a huge paradigm shift on the code structure, they lack to get adoption. In the end, it is the people interest that will make a good lib/framework, a good framework. One good example is the async module, that although not a language construct, it is a well known module for the Node.js community and its patterns are wildly known.

This takes me to my next point which is, in order to maximize adoption and contribution, language API are typically the best bet.

Summing up

In the end, independently of the solution we decide to go for, we should understand and make sure we don't add more complexity and problems for the long term. Bringing in a framework to solve control flow will certainly add a considerable amount of overhead (the large bandaid).

We probably can get away with pump and the repo that knows how to cache and flush properly.

pull-streams look pretty exciting with their low footprint and convenience, but they only solve issues partially (error propagation) and we would need to expose regular streams anyhow, which they do pretty conveniently, same doesn't happen for rxjs.

Let's make sure to chat Friday and make a decision and move on, please do comment in this thread before hand, so that we make the meeting very efficient.

dignifiedquire · 2016-08-04T12:15:24Z

Thank you very much @diasdavid for this comprehensive post, below are a couple of notes just for clarification.

As we can see in Figure 1, we do a lot of plumbing between the several pieces of the code sometimes through read X bytes, others by directing the communication (.pipe), others by wrapping the stream into a new stream (duplexify). I don't see a clean and easy way to do that either with rxjs or pull-streams, without having to do even more wrapping and patching. This will make the code more complex and therefore, error prone.

This is very nicely solved in pull-streams. Wrapping and things like reading from streams per byte is super elegant, as can be seen for example here: https://github.com/libp2p/js-libp2p-secio/blob/pull/src/handshake/finish.js#L22-L32

This takes me to my next point which is, in order to maximize adoption and contribution, language API are typically the best bet.

Yes, but I don't think we should sacrifice our productivity just for adoption as a library. We don't want to clash with everything the community does, but this shouldn't stop us from trying to write as efficient as possible the most excellent code that we can.

We probably can get away with pump and the repo that knows how to cache and flush properly.

I'm afraid that we can't get away with this, as the data flushing issues are more serious than just the fs part. The most pressing issue, the one which lead me to investigating other stream solutions was the one I encountered when trying to implement secio. It encounters the issue that wrapping node-streams and trying to properly manage things like pause and resume data gets lost if you do not attach to it quick enough. As far as my understanding goes pump does not solve this issue in any way.

Let's make sure to chat Friday and make a decision and move on, please do comment in this thread before hand, so that we make the meeting very efficient.

I don't know how your schedule looks like but I would appreciate if we could actually have this chat today, given that it is currently blocking 90% of the work I want to get a move on.

daviddias · 2016-08-04T12:18:12Z

Ok, let's do it today then :) 4pm UTC sounds good? http://www.worldtimebuddy.com/?qm=1&lid=2267057,100,308,12&h=100&date=2016-8-4&sln=16-17

daviddias · 2016-08-04T12:34:04Z

Pinging @haadcode @nginnever @xicombd @RichardLitt @victorbjelkholm @shamb0t and anyone else following this, you are welcome to join the chat, don't feel obligated to though :)

yoshuawuyts · 2016-08-04T12:52:43Z

It doesn't seem to me that any of the proposed solutions offers a fix for
this problem, because in the end, we are always using Node.js stream to write
to disk.

https://github.com/dominictarr/pull-fs uses fs.open() to create a file
descriptor and then uses the cursor to write to disk in chunks. It sidesteps
all of Node streams, so it's actually possible to ditch all, if not most of
node streams. Probably the only place where that's currently not being done is
net / http as they rely on the stdlib - but there's wrappers available that
abstract away the internals and only expose the interface (such as
pull-http-server). There should be no need to ever have node streams in the
browser

Important to note that, modules like spdy, won't have a rxjs or a
pull-streams complete version (at least soon)

Yeah, chances are slim that will happen soon. But that's fine because in Node
land wrapping the Node streams API isn't the worst. It supports backpressure so
any other paradigm can me mapped on it easily, and perf should be comparable.
Here's an example pull-streams http wrapper

I strongly recommend and vote that we keep exposing natural (language
constructs based) API, such as callbacks, promises an streams. Requiring a
user to buy into a library to be able to use js-ipfs creates a lot of
friction.

Probably the only native abstraction capable of doing streaming stuff well
would be to have an explicit OO API to interface with the streaming state
machine, like Node's fs file descriptor methods.
All other abstractions could then be built on top of that.

I'd be super wary of exposing multiple interfaces as JS, the language™ is a
moving target. E.g. what will the story be for async / await, event emitters, promise cancellables. I reckon versioning and updating
interface-specific packages built around a single core API would be the most
sane approach, keeping everyone happy.

An example OO API:

const ipfs = require('ipfs')

ipfs('/my-cool/repo-name', (err, node) => {
  if (err) throw err
  const cat = node.files.cat(function (err, data) {
    if (err) return console.log('error: ' + err)
    console.log('chunk: ' + String(data))
  })

  cat.start()
  cat.pause()
  cat.stop()

  node.stop()
})

For native abstractions you could totes create a plugin system to expose
abstractions, if anything my playground
repo has shown that wrapping the API isn't that hard:

// pull-stream
const ipfsPull = require('ipfs-pull-stream')
const pull = require('pull-stream')
const ipfs = require('ipfs')

ipfs('/my-cool/repo-name', (err, node) => {
  if (err) throw err

  node.use(ipfsPull())

  const cat$ = node.files.cat('<hash>')
  pull(cat$, pull.log())
})

// node streams
const ipfsNodeStreams = require('ipfs-node-streams')
const ipfs = require('ipfs')
const pump = require('pump')

ipfs('/my-cool/repo-name', (err, node) => {
  if (err) throw err

  node.use(ipfsNodeStreams())

  const cat$ = node.files.cat('<hash>')
  pump(cat$, process.stdout)
})

Does this make sense?

whyrusleeping · 2016-08-04T15:38:25Z

If you guys don't make a solid decision during this meeting i'm going to make the decision for you.

That is a threat I intend to follow through on.

hackergrrl · 2016-08-04T15:56:04Z

Really awesome work @dignifiedquire in concretely identifying the pain points in dealing with node streams. I can't make the hangout, but here are my thoughts:

For me, pull streams' # 1 superpower is their simplicity (in the Rich Hickey sense of the word): anyone can write a full pull stream implementation from scratch in a few minutes, from first principles. This is not true of node streams in the slightest. Simplicity brings transparency with it, meaning debugging and reasoning about implementation gets easier. I think this is the biggest argument I can make for pull streams. In this sense, I bet js-ipfs would benefit tremendously.

That said, pull streams' lack mainstream awareness. Despite being simple, grokking pull streams is certainly not easy. It took me a while to wrap my head around the big ideas, and I still don't wield them as intuitively as node streams. Pushing IPFS contributors to grok pull streams could hurt the project's approachability.

Summary: deviating from node streams in 2016 seems like a high-risk move re raising the barrier to contribution, despite potential gains. As @diasdavid said, piece-wise solutions to each problem (like pump) is probably the safest approach.

daviddias · 2016-08-04T18:10:25Z

haadcode · 2016-08-05T07:23:33Z

we decided to give pull-streams a spin, however this is not a full buy in, because we still do not know if it will solve the entire problem for libp2p (specially secio) and ipfs-repo.

Can you give insights to what lead you to this decision? What were the pros and cons for pull-streams vs. RxJS?

daviddias · 2016-08-05T11:30:00Z

@haadcode The most important reasons are described in my first post, here: #362 (comment)

In a tl;dr:

Moving back and forward between Node.js streams is complicated (while pull-streams is literally one function)
No notion of backpressure
Larger dependency
Higher cost of entrance. It requires more than learning how a lib works, but actually how the whole reactive programming paradigm should work.

ekroon · 2016-08-05T13:40:48Z

As an alternative to RX, with backpressure possibilities, there is also the Reactive Streams specification (http://www.reactive-streams.org/). It is still not something that feels very natural compared to streams in nodejs (compare https://github.com/reactive-streams/reactive-streams-jvm/blob/v1.0.0/README.md).

dignifiedquire · 2016-08-05T13:45:53Z

@ekroon I saw them mentioned in the discussions around backpressure but it seems there is no implementation in JavaScript currently. Implementing something like this would add a lot of work for us without being sure that it's worth it (https://github.com/reactive-streams/reactive-streams-js is empty :()

whyrusleeping · 2016-08-05T15:10:01Z

reactive streams also looks like its its own network protocol.

ekroon · 2016-08-05T19:55:22Z

@dignifiedquire @whyrusleeping in the simplest form reactive streams is only a set of interfaces that make it possible to subscribe to a stream and be notified of errors and the end of the stream (which is also in RX). In addition you can effectively switch between pull and push as there is an extra call on the subscription (normally only a cancel or unsubscribe), namely request(n); to request a maximum of new items. If the receiving end is limiting the flow by doing request(1) after a message, it would be effectively pull and when the sending end is limiting via request(bignumber); than it is effectively (unbounded) push.

Looking at node.js streams and pull-streams specifically, it looks to me that you can almost do the same with it. I am not completely sure if you can specify a lot of data and that way be a bit more pushy (I get a bit stuck when reading javascript ;)); and that reactive streams is (still) more a JVM party and not very wide spread, but that the interface itself would be mappable onto a pull stream implementation. (RxJs is pretty popular I thought nowadays for UIs, so there is hope for a reactive streams implementation)

As for a network protocol: it is not, but it is meant to be used within an application and crossing network boundaries. E.g. if I have an photo resize service split up in a resizer + stream chunker (creating JPEG or something) and a sender over the network (over some protocol providing flow control, TCP / SPDY ...), than I would be able to rate limit the sender by requesting only 1 new photo from the stream chunker which in turn could rate limit the TCP connection a little bit if needed.

dominictarr · 2016-08-06T23:47:59Z

for converting between node streams and pull streams:
https://github.com/pull-stream/stream-to-pull-stream
https://github.com/pull-stream/pull-stream-to-stream

stream-to-pull-stream has had a lot of battle testing and has been very reliable.
in fact, it's worked so well that it's probably slowed down efforts to write lower level pull-streams
which go directly to the libuv bindings. but those are popping up.

dignifiedquire · 2016-09-12T10:52:16Z

Closing as we have merged and released our first version of js-ipfs using pull-streams internally :)

dignifiedquire added the status/in-progress In progress label Jul 22, 2016

dignifiedquire changed the title ~~[technical exploration] RxJS~~ [technical exploration] Life beyond node-streams Aug 1, 2016

daviddias mentioned this issue Aug 1, 2016

Sprint: Aug 1 → Aug 8 ipfs/team-mgmt#130

Closed

dignifiedquire added the exploration label Aug 4, 2016

dignifiedquire mentioned this issue Aug 9, 2016

refactor: ⎆ Migration to pull-streams (from Node.js Streams) #403

Closed

28 tasks

daviddias mentioned this issue Aug 13, 2016

🚣 captain.log - IPFS JavaScript implementation 🌟 #30

Closed

daviddias mentioned this issue Aug 24, 2016

Sprint: Aug 22 → Aug 28 ipfs/team-mgmt#146

Closed

dignifiedquire closed this as completed Sep 12, 2016

dignifiedquire removed the status/in-progress In progress label Sep 12, 2016

daviddias changed the title ~~[technical exploration] Life beyond node-streams~~ Node.js Streams vs pull-streams vs ReactiveJS Oct 30, 2016

daviddias mentioned this issue Oct 30, 2016

Exposed Interfaces (Callbacks & Promises, node.js streams and pull-streams) #557

Closed

daviddias mentioned this issue Jan 31, 2018

feat(breaking change): stats.bwReadableStream and stats.bwPullStream ipfs-inactive/interface-js-ipfs-core#211

Merged

Zerim mentioned this issue Apr 25, 2018

Update ETL.js mapping example to use Streams graphprotocol/research#10

Closed

daviddias mentioned this issue Jun 25, 2018

Adopt async/await and promises instead of callbacks #1311

Closed

dryajov mentioned this issue Mar 25, 2019

Wasm: add a transport that binds to a JS libp2p transport libp2p/rust-libp2p#974

Closed

This was referenced Jan 20, 2022

[Snyk] Fix for 2 vulnerabilities awaisab172/js-ipfs#39

Open

[Snyk] Fix for 2 vulnerabilities thexdesk/js-ipfs#10

Open

[Snyk] Fix for 2 vulnerabilities rtpro/js-ipfs#49

Open

zi3lak mentioned this issue Oct 30, 2022

[Snyk] Security upgrade libp2p-webrtc-star from 0.20.8 to 0.24.0 zi3lak/js-ipfs#179

Open

thexdesk mentioned this issue Oct 31, 2022

[Snyk] Security upgrade libp2p-webrtc-star from 0.16.1 to 0.24.0 thexdesk/js-ipfs#27

Open

This was referenced Dec 6, 2023

[Snyk] Fix for 1 vulnerabilities zi3lak/js-ipfs#273

Open

[Snyk] Security upgrade libp2p-webrtc-star from 0.20.8 to 0.22.4 zi3lak/js-ipfs#294

Open

awaisab172 mentioned this issue Feb 11, 2024

[Snyk] Fix for 1 vulnerabilities awaisab172/js-ipfs#64

Open

rtpro mentioned this issue Feb 11, 2024

[Snyk] Fix for 1 vulnerabilities rtpro/js-ipfs#76

Open

thexdesk mentioned this issue Feb 11, 2024

[Snyk] Fix for 1 vulnerabilities thexdesk/js-ipfs#44

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Node.js Streams vs pull-streams vs ReactiveJS #362

Node.js Streams vs pull-streams vs ReactiveJS #362

dignifiedquire commented Jul 22, 2016 •

edited

Loading

dignifiedquire commented Aug 1, 2016

daviddias commented Aug 4, 2016

dignifiedquire commented Aug 4, 2016

daviddias commented Aug 4, 2016 •

edited

Loading

daviddias commented Aug 4, 2016

yoshuawuyts commented Aug 4, 2016 •

edited

Loading

whyrusleeping commented Aug 4, 2016

hackergrrl commented Aug 4, 2016 •

edited

Loading

daviddias commented Aug 4, 2016 •

edited

Loading

haadcode commented Aug 5, 2016 •

edited

Loading

daviddias commented Aug 5, 2016

ekroon commented Aug 5, 2016 •

edited

Loading

dignifiedquire commented Aug 5, 2016 •

edited

Loading

whyrusleeping commented Aug 5, 2016

ekroon commented Aug 5, 2016

dominictarr commented Aug 6, 2016

dignifiedquire commented Sep 12, 2016

Node.js Streams vs pull-streams vs ReactiveJS #362

Node.js Streams vs pull-streams vs ReactiveJS #362

Comments

dignifiedquire commented Jul 22, 2016 • edited Loading

Reasons for consideration

1. Pull Streams

Example code

Resources

2. RxJS

Example code

Resources

dignifiedquire commented Aug 1, 2016

daviddias commented Aug 4, 2016

Candidate solutions

pump

rxjs

pull-streams

Problem 1 - propagation of errors

Problem 2 - ensure data gets flushed

Problem 3 - streams ecosystem

Other concerns

Other thoughts

Summing up

dignifiedquire commented Aug 4, 2016

daviddias commented Aug 4, 2016 • edited Loading

daviddias commented Aug 4, 2016

yoshuawuyts commented Aug 4, 2016 • edited Loading

whyrusleeping commented Aug 4, 2016

hackergrrl commented Aug 4, 2016 • edited Loading

daviddias commented Aug 4, 2016 • edited Loading

The plan

Test pull-streams in libp2p

IPFS Repo

Acknowledgements

haadcode commented Aug 5, 2016 • edited Loading

daviddias commented Aug 5, 2016

ekroon commented Aug 5, 2016 • edited Loading

dignifiedquire commented Aug 5, 2016 • edited Loading

whyrusleeping commented Aug 5, 2016

ekroon commented Aug 5, 2016

dominictarr commented Aug 6, 2016

dignifiedquire commented Sep 12, 2016

dignifiedquire commented Jul 22, 2016 •

edited

Loading

daviddias commented Aug 4, 2016 •

edited

Loading

yoshuawuyts commented Aug 4, 2016 •

edited

Loading

hackergrrl commented Aug 4, 2016 •

edited

Loading

daviddias commented Aug 4, 2016 •

edited

Loading

haadcode commented Aug 5, 2016 •

edited

Loading

ekroon commented Aug 5, 2016 •

edited

Loading

dignifiedquire commented Aug 5, 2016 •

edited

Loading