fix: revert to serialized pubsub operations #319

alanshaw · 2018-07-02T10:17:59Z

During the refactor (#290) I took the opportunity to parallelise some pubsub operations that didn't explicitly depend on each other. This worked perfectly while testing locally, but on CI it is a different story. I found that tests for js-ipfs-api (which are run against go-ipfs) failed for seemingly random reasons.

After much investigation I finally tried re-serialising the operations I had refactored to be parallel and the tests started to pass. It seems that the pubsub implementation in go-ipfs has some issues with concurrency.

I also found two intermittent issues with swarm.connect in go-ipfs (seen significantly more often on CI):

Issuing two calls to this function from the same node might end up in the second not actually creating a connection and no error message reported to the user
Even after the response to the user it takes a few milliseconds for a connection to actually be connected

I intend to open issues on go-ipfs and write examples demonstrating these problems.

I created a utility function connect to temporarily mitigate these issues in one place. The utility serialises calls from a single node to another and pauses after each to allow the connection to properly establish.

Merging this PR will bring us significantly closer to merging ipfs-inactive/js-ipfs-http-client#785

During the refactor I took the opportunity to parallelise some pubsub operations that didn't explicitly depend on each other. This worked perfectly while testing locally, but on CI it is a different story. I found that tests for js-ipfs-api (which are run against go-ipfs) failed for seemingly random reasons. After much investigation I finally tried re-serialising the operations I had refactored to be parallel and the tests started to pass. It seems that the pubsub implementation in go-ipfs has some issues with concurrency. I also found two intermittent issues with `swarm.connect` in go-ipfs (seen significantly more often on CI): 1. Issuing two calls to this function from the same node might end up in the second not actually creating a connection and no error message reported to the user 2. Even after the response to the user it takes a few milliseconds for a connection to actually be connected I intend to open issues on go-ipfs and write examples demonstrating these problems. I created a utility function `connect` to temporarily mitigate these issues in one place. The utility serialises calls from a single node to another and pauses after each to allow the connection to properly establish. License: MIT Signed-off-by: Alan Shaw <alan@tableflip.io>

License: MIT Signed-off-by: Alan Shaw <alan@tableflip.io>

daviddias · 2018-07-02T11:42:03Z

It seems that the pubsub implementation in go-ipfs has some issues with concurrency.

@richardschneider hit this many times in the past and got documented somewhere. Can you make that there is a issue on go-ipfs? //cc @Stebalien

alanshaw · 2018-07-02T13:46:22Z

Can you make that there is a issue on go-ipfs?

I will triple check but all signs seem to be indicating that this is the case. I've seen these tests pass on js-ipfs on CI with the concurrency i.e. without this PR.

Like I said in the PR description:

I intend to open issues on go-ipfs and write examples demonstrating these problems.

😉

Stebalien · 2018-07-06T20:09:14Z

Which version of go-ipfs were you using?

I also found two intermittent issues with swarm.connect in go-ipfs (seen significantly more often on CI):

Issuing two calls to this function from the same node might end up in the second not actually creating a connection and no error message reported to the user

If you call ipfs swarm connect on the same node twice, we explicitly try to create only one connection. Was no connection created?

Even after the response to the user it takes a few milliseconds for a connection to actually be connected.

One side will generally learn about the connection before the other. Is that what's happening here?

alanshaw · 2018-07-09T14:42:41Z

If you call ipfs swarm connect on the same node twice, we explicitly try to create only one connection. Was no connection created?

Sorry I should have been more explicit - two calls to connect but with different addresses

One side will generally learn about the connection before the other. Is that what's happening here?

hmm that's an interesting one, does the response to connect come before both nodes have been connected together?

I'd have to double check the code but in think in general when we need node A to know about B we'll connect A to B rather than B to A before issuing commands.

Stebalien · 2018-07-20T22:42:11Z

hmm that's an interesting one, does the response to connect come before both nodes have been connected together?

No, it's just that one node learns that the connection is ready before the other. We could delay until the other learns but that'd cost us a round trip. For example, take TCP. In a TCP connection, the client always learns that the connection is ready before the server and can start sending data immediately after sending it's ACK (but the server needs to wait for the ACK before sending data).

I'd have to double check the code but in think in general when we need node A to know about B we'll connect A to B rather than B to A before issuing commands.

That's the right way to do this.

Sorry I should have been more explicit - two calls to connect but with different addresses

For two different nodes? That's definitely not supposed to happen. I'm trying to reproduce but I can't (connecting and disconnecting from two nodes in a loop).

Is the failing connect call returning no response or just no error? That is, is it returning a success response (the peer ID of the target peer)?

alanshaw added 2 commits July 2, 2018 09:42

fix: revert timout increase

ad3e9d2

License: MIT Signed-off-by: Alan Shaw <alan@tableflip.io>

alanshaw requested review from travisperson and daviddias July 2, 2018 10:17

alanshaw requested a review from victorb as a code owner July 2, 2018 10:18

ghost assigned alanshaw Jul 2, 2018

ghost added the in progress label Jul 2, 2018

This was referenced Jul 2, 2018

feat: modular interface tests ipfs-inactive/js-ipfs-http-client#785

Merged

feat: modular interface tests ipfs/js-ipfs#1389

Merged

daviddias approved these changes Jul 3, 2018

View reviewed changes

Update swarm.js

3c0778c

ghost assigned daviddias Jul 3, 2018

daviddias merged commit 4b5534e into master Jul 3, 2018

daviddias deleted the fix/re-serialise branch July 3, 2018 15:02

ghost removed the in progress label Jul 3, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: revert to serialized pubsub operations #319

fix: revert to serialized pubsub operations #319

alanshaw commented Jul 2, 2018 •

edited

daviddias commented Jul 2, 2018

alanshaw commented Jul 2, 2018

Stebalien commented Jul 6, 2018

alanshaw commented Jul 9, 2018

Stebalien commented Jul 20, 2018

fix: revert to serialized pubsub operations #319

fix: revert to serialized pubsub operations #319

Conversation

alanshaw commented Jul 2, 2018 • edited

daviddias commented Jul 2, 2018

alanshaw commented Jul 2, 2018

Stebalien commented Jul 6, 2018

alanshaw commented Jul 9, 2018

Stebalien commented Jul 20, 2018

alanshaw commented Jul 2, 2018 •

edited