PeerManager's lock is highly contended #352

Closed
Stebalien opened this issue Apr 16, 2020 · 13 comments · Fixed by #356

@Stebalien
Member

The global lock in PeerManager is highly contended and is causing the preloader node to crash very quickly. The main sources of contention seem to be:

  1. Session removals.
  2. Broadcasts. I believe this might be what's backing everything up.
  3. New connections. We take this lock in the "Connected" event handler, which is likely blocking other parts of the system.

I don't believe we have a deadlock, but we definitely need to get more efficient here.

ipfs-profile-preload-packet-sjc-001-2020-04-15T23_27_36+0000.tar.gz
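
For illustration, here is a minimal sketch of the locking pattern described in the list above. Names, types, and structure are illustrative rather than the actual go-bitswap code, and strings stand in for peer IDs and CIDs:

```go
package sketch

import "sync"

// peerQueue stands in for the per-peer message queue; it has its own
// internal lock, which may be held for a while during message extraction.
type peerQueue struct{}

func (pq *peerQueue) AddBroadcastWantHaves(wants []string) {
	// May block on the queue's own lock while an outgoing message
	// is being extracted for this peer.
}

// peerManager stands in for PeerManager: one global mutex guards the
// peer table, and broadcasts, session removals, and the "Connected"
// handler all serialize on it.
type peerManager struct {
	lk    sync.Mutex
	peers map[string]*peerQueue
}

// BroadcastWantHaves touches every peer queue while holding the global
// lock, so anything else that needs pm.lk waits for the slowest queue.
func (pm *peerManager) BroadcastWantHaves(wants []string) {
	pm.lk.Lock()
	defer pm.lk.Unlock()
	for _, pq := range pm.peers {
		pq.AddBroadcastWantHaves(wants)
	}
}

// Connected runs in the network's "Connected" event handler and
// contends on the same global lock.
func (pm *peerManager) Connected(p string) {
	pm.lk.Lock()
	defer pm.lk.Unlock()
	if _, ok := pm.peers[p]; !ok {
		pm.peers[p] = &peerQueue{}
	}
}
```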

@Stebalien
Member Author

I think I've found part of the issue:

  1. extractOutgoingMessage locks a message queue.
  2. Some other thread tries to add broadcast wants, takes the PeerManager lock, then tries to take the peer queue lock being held by extractOutgoingMessage.
  3. Now nothing can continue.

I've also noticed that the mutex profile suggests we're spending a lot of time in RemoveSession walking through wants. We may want to keep a reverse index (session -> want) so we don't have to walk through all wants.
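
For what it's worth, a possible shape for that reverse index. This is a rough sketch with illustrative names, not the actual go-bitswap types: strings stand in for CIDs and uint64s for session IDs.

```go
package sketch

// wantIndex is an illustrative two-way index between sessions and the
// wants (CIDs, here just strings) they have registered. With the
// reverse index, RemoveSession only touches that session's wants
// instead of scanning every want in the table.
type wantIndex struct {
	byWant    map[string]map[uint64]struct{} // want -> interested sessions
	bySession map[uint64]map[string]struct{} // session -> its wants (the reverse index)
}

func newWantIndex() *wantIndex {
	return &wantIndex{
		byWant:    make(map[string]map[uint64]struct{}),
		bySession: make(map[uint64]map[string]struct{}),
	}
}

// Add records that session ses wants CID c.
func (wi *wantIndex) Add(ses uint64, c string) {
	if wi.byWant[c] == nil {
		wi.byWant[c] = make(map[uint64]struct{})
	}
	wi.byWant[c][ses] = struct{}{}
	if wi.bySession[ses] == nil {
		wi.bySession[ses] = make(map[string]struct{})
	}
	wi.bySession[ses][c] = struct{}{}
}

// RemoveSession drops all of a session's wants and returns the ones no
// other session is still interested in (candidates for sending cancels).
func (wi *wantIndex) RemoveSession(ses uint64) []string {
	var orphaned []string
	for c := range wi.bySession[ses] {
		delete(wi.byWant[c], ses)
		if len(wi.byWant[c]) == 0 {
			delete(wi.byWant, c)
			orphaned = append(orphaned, c)
		}
	}
	delete(wi.bySession, ses)
	return orphaned
}
```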

@dirkmc
Contributor

dirkmc commented Apr 17, 2020

We need to take the MessageQueue lock during extractOutgoingMessage() so that we can move wants from pending to sent and remove items from the cancel queue.

One way we've discussed to make extractOutgoingMessage() run faster would be to get all wants from the wantlists in unsorted order. Currently we call SortedEntries(), which is slow. Skipping the sort would be fine most of the time, when the number of wants is less than the size of a message, but some of the time we would be sacrificing correctness (the order in which wants are sent) for performance.

We could also check the size of the cancels + wantlists, and if there are few enough entries that they're likely to come in under the message size, just get them in unsorted order.
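
A rough sketch of that size check, with hypothetical names; entry counts stand in for the real byte-size limit of a message, and the real code would also account for the cancels' share of the message:

```go
package sketch

import "sort"

// entry is an illustrative want entry; the real one carries a CID,
// priority, and want type.
type entry struct {
	cid      string
	priority int32
}

// extractEntries returns up to maxEntries wants for the next message.
// Priority order only matters when everything doesn't fit in a single
// message, so skip the sort when the pending set is small enough.
// (maxEntries stands in for the real message size limit in bytes.)
func extractEntries(pending []entry, cancelCount, maxEntries int) []entry {
	if len(pending)+cancelCount <= maxEntries {
		// Everything fits in one message: the send order doesn't
		// change which wants go out, so skip the expensive sort.
		return pending
	}
	sorted := append([]entry(nil), pending...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].priority > sorted[j].priority
	})
	if len(sorted) > maxEntries {
		sorted = sorted[:maxEntries]
	}
	return sorted
}
```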

@Stebalien
Member Author

I'm not seeing this issue when testing with #351.

I believe we may have seen this issue due to repeatedly trying to re-send messages to unresponsive peers. Every time we tried to send a broadcast want, we'd try to re-send all pending messages to these peers. This theory matches the CPU profile (where a lot of time was spent extracting messages).
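
As an aside, one common way to keep a burst of broadcast triggers from repeatedly rebuilding the outgoing message for a slow or unresponsive peer is to coalesce wake-ups into a single pending signal for the sender goroutine. A minimal sketch of that pattern, with illustrative names and not necessarily what #351 changes:

```go
package sketch

// messageQueue is illustrative: repeated signalWork calls while the
// sender is busy collapse into at most one pending wake-up, so many
// broadcast wants don't trigger many re-extractions for one peer.
type messageQueue struct {
	outgoingWork chan struct{} // buffered with capacity 1
}

func newMessageQueue() *messageQueue {
	return &messageQueue{outgoingWork: make(chan struct{}, 1)}
}

func (mq *messageQueue) signalWork() {
	select {
	case mq.outgoingWork <- struct{}{}:
	default: // a wake-up is already pending; nothing to do
	}
}

func (mq *messageQueue) runSender() {
	for range mq.outgoingWork {
		// extract the next outgoing message and send it here
	}
}
```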

@Stebalien
Member Author

Hm. I spoke too soon. It's better, but still pretty bad. I believe the issue now is that every time we connect to a peer, we send them our broadcast wantlist. We do that under a global lock.

Worse, we do it under the same lock we need to take when processing connectEventManager.OnMessage.

@dirkmc
Contributor

dirkmc commented Apr 20, 2020

I will modify connectEventManager to lock processing of the PeerConnected / PeerDisconnected events per-peer.
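
A rough sketch of what per-peer locking in connectEventManager could look like. This is illustrative only, not necessarily how the actual change implements it, and cleanup of per-peer locks on disconnect is omitted:

```go
package sketch

import "sync"

// connectEventManager sketch: a small lock guards only the lock table,
// and each peer gets its own mutex, so a slow PeerConnected handler for
// one peer no longer blocks connect/disconnect events for other peers.
type connectEventManager struct {
	lk        sync.Mutex
	peerLocks map[string]*sync.Mutex // peer ID -> per-peer lock
	connmgr   interface {
		Connected(p string)
		Disconnected(p string)
	}
}

func (c *connectEventManager) lockForPeer(p string) *sync.Mutex {
	c.lk.Lock()
	defer c.lk.Unlock()
	pl, ok := c.peerLocks[p]
	if !ok {
		pl = &sync.Mutex{}
		c.peerLocks[p] = pl
	}
	return pl
}

func (c *connectEventManager) Connected(p string) {
	pl := c.lockForPeer(p)
	pl.Lock()
	defer pl.Unlock()
	c.connmgr.Connected(p)
}

func (c *connectEventManager) Disconnected(p string) {
	pl := c.lockForPeer(p)
	pl.Lock()
	defer pl.Unlock()
	c.connmgr.Disconnected(p)
}
```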

@dirkmc
Contributor

dirkmc commented Apr 20, 2020

Also I think this will help contention in general:
#356

@Stebalien
Member Author

Newer profiles: ipfs-profile-preload-packet-sjc-001-2020-04-18T00_43_39+0000.tar.gz

I'm going to try to make a new build with the new extractMessage optimizations.

@dirkmc
Contributor

dirkmc commented Apr 21, 2020

In any case, that looks better: it seems like extractMessage() is running faster.

@Stebalien
Member Author

With these optimizations: ipfs-profile-preload-packet-sjc-001-2020-04-21T16_04_50+0000.tar.gz

We're still falling behind, but it's getting better. The remaining CPU drag is canceling.

@dirkmc
Contributor

dirkmc commented Apr 21, 2020

@Stebalien
Member Author

It looks like it's blocking on the wantlist lock.

@Stebalien
Member Author

Well, I'm not sure if it's blocking, or just spending a lot of time because we're sending a ton of cancels.

@Stebalien
Member Author

I'm not seeing this anymore since we fixed session usage in the ipfs refs command. TL;DR: constantly broadcasting to many peers slows everything down.

I'd still like to rework connection management a bit, but I think we can close this. We put a lot of effort into barking up the wrong tree, but I'm happy we spent the time optimizing this anyway.
