PeerManager's lock is highly contended #352

Closed
Stebalien opened this issue Apr 16, 2020 · 13 comments · Fixed by #356

@Stebalien
Member

The global lock in PeerManager is highly contended and is causing the preloader node to crash very quickly. The main sources of contention seem to be:

  1. Session removals.
  2. Broadcasts. I believe this might be what's backing everything up.
  3. New connections. We take this lock in the "Connected" event handler, which is likely blocking other parts of the system.

I don't believe we have a deadlock, but we definitely need to get more efficient here.

ipfs-profile-preload-packet-sjc-001-2020-04-15T23_27_36+0000.tar.gz
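
For illustration, here is a minimal sketch of the locking pattern described in the list above. Names, types, and structure are illustrative rather than the actual go-bitswap code, and strings stand in for peer IDs and CIDs:

```go
package sketch

import "sync"

// peerQueue stands in for the per-peer message queue; it has its own
// internal lock, which may be held for a while during message extraction.
type peerQueue struct{}

func (pq *peerQueue) AddBroadcastWantHaves(wants []string) {
	// May block on the queue's own lock while an outgoing message
	// is being extracted for this peer.
}

// peerManager stands in for PeerManager: one global mutex guards the
// peer table, and broadcasts, session removals, and the "Connected"
// handler all serialize on it.
type peerManager struct {
	lk    sync.Mutex
	peers map[string]*peerQueue
}

// BroadcastWantHaves touches every peer queue while holding the global
// lock, so anything else that needs pm.lk waits for the slowest queue.
func (pm *peerManager) BroadcastWantHaves(wants []string) {
	pm.lk.Lock()
	defer pm.lk.Unlock()
	for _, pq := range pm.peers {
		pq.AddBroadcastWantHaves(wants)
	}
}

// Connected runs in the network's "Connected" event handler and
// contends on the same global lock.
func (pm *peerManager) Connected(p string) {
	pm.lk.Lock()
	defer pm.lk.Unlock()
	if _, ok := pm.peers[p]; !ok {
		pm.peers[p] = &peerQueue{}
	}
}
```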

@Stebalien
Member Author

I think I've found part of the issue:

  1. extractOutgoingMessage locks a message queue.
  2. Some other thread tries to add broadcast wants, takes the PeerManager lock, then tries to take the peer queue lock being held by extractOutgoingMessage.
  3. Now nothing can continue.

I've also noticed that the mutex profile suggests we're spending a lot of time in RemoveSession walking through wants. We may want to keep a reverse index (session -> want) so we don't have to walk through all wants.
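
For what it's worth, a possible shape for that reverse index. This is a rough sketch with illustrative names, not the actual go-bitswap types: strings stand in for CIDs and uint64s for session IDs.

```go
package sketch

// wantIndex is an illustrative two-way index between sessions and the
// wants (CIDs, here just strings) they have registered. With the
// reverse index, RemoveSession only touches that session's wants
// instead of scanning every want in the table.
type wantIndex struct {
	byWant    map[string]map[uint64]struct{} // want -> interested sessions
	bySession map[uint64]map[string]struct{} // session -> its wants (the reverse index)
}

func newWantIndex() *wantIndex {
	return &wantIndex{
		byWant:    make(map[string]map[uint64]struct{}),
		bySession: make(map[uint64]map[string]struct{}),
	}
}

// Add records that session ses wants CID c.
func (wi *wantIndex) Add(ses uint64, c string) {
	if wi.byWant[c] == nil {
		wi.byWant[c] = make(map[uint64]struct{})
	}
	wi.byWant[c][ses] = struct{}{}
	if wi.bySession[ses] == nil {
		wi.bySession[ses] = make(map[string]struct{})
	}
	wi.bySession[ses][c] = struct{}{}
}

// RemoveSession drops all of a session's wants and returns the ones no
// other session is still interested in (candidates for sending cancels).
func (wi *wantIndex) RemoveSession(ses uint64) []string {
	var orphaned []string
	for c := range wi.bySession[ses] {
		delete(wi.byWant[c], ses)
		if len(wi.byWant[c]) == 0 {
			delete(wi.byWant, c)
			orphaned = append(orphaned, c)
		}
	}
	delete(wi.bySession, ses)
	return orphaned
}
```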

@dirkmc
Contributor

dirkmc commented Apr 17, 2020

We need to take the MessageQueue lock during extractOutgoingMessage() so that we can move wants from pending to sent and remove items from the cancel queue.

One way we've discussed to make extractOutgoingMessage() run faster would be to get all wants from the wantlists in unsorted order. Currently we call SortedEntries(), which is slow. Skipping the sort would be fine most of the time, when the number of wants is less than the size of a message, but some of the time we would be sacrificing correctness (the order in which wants are sent) for performance.

We could also check the size of the cancels + wantlists, and if there are few enough entries that they're likely to come in under the message size, just get them in unsorted order.
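
A rough sketch of that size check, with hypothetical names; entry counts stand in for the real byte-size limit of a message, and the real code would also account for the cancels' share of the message:

```go
package sketch

import "sort"

// entry is an illustrative want entry; the real one carries a CID,
// priority, and want type.
type entry struct {
	cid      string
	priority int32
}

// extractEntries returns up to maxEntries wants for the next message.
// Priority order only matters when everything doesn't fit in a single
// message, so skip the sort when the pending set is small enough.
// (maxEntries stands in for the real message size limit in bytes.)
func extractEntries(pending []entry, cancelCount, maxEntries int) []entry {
	if len(pending)+cancelCount <= maxEntries {
		// Everything fits in one message: the send order doesn't
		// change which wants go out, so skip the expensive sort.
		return pending
	}
	sorted := append([]entry(nil), pending...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].priority > sorted[j].priority
	})
	if len(sorted) > maxEntries {
		sorted = sorted[:maxEntries]
	}
	return sorted
}
```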

@Stebalien
Member Author

I'm not seeing this issue when testing with #351.

I believe we may have seen this issue due to repeatedly trying to re-send messages to unresponsive peers. Every time we tried to send a broadcast want, we'd try to re-send all pending messages to these peers. This theory matches the CPU profile (where a lot of time was spent extracting messages).
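
As an aside, one common way to keep a burst of broadcast triggers from repeatedly rebuilding the outgoing message for a slow or unresponsive peer is to coalesce wake-ups into a single pending signal for the sender goroutine. A minimal sketch of that pattern, with illustrative names and not necessarily what #351 changes:

```go
package sketch

// messageQueue is illustrative: repeated signalWork calls while the
// sender is busy collapse into at most one pending wake-up, so many
// broadcast wants don't trigger many re-extractions for one peer.
type messageQueue struct {
	outgoingWork chan struct{} // buffered with capacity 1
}

func newMessageQueue() *messageQueue {
	return &messageQueue{outgoingWork: make(chan struct{}, 1)}
}

func (mq *messageQueue) signalWork() {
	select {
	case mq.outgoingWork <- struct{}{}:
	default: // a wake-up is already pending; nothing to do
	}
}

func (mq *messageQueue) runSender() {
	for range mq.outgoingWork {
		// extract the next outgoing message and send it here
	}
}
```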

@Stebalien
Member Author

Hm. I spoke too soon. It's better, but still pretty bad. I believe the issue now is that every time we connect to a peer, we send them our broadcast wantlist. We do that under a global lock.

Worse, we do it under the same lock we need to take when processing connectEventManager.OnMessage.

@dirkmc
Contributor

dirkmc commented Apr 20, 2020

I will modify connectEventManager to lock processing of the PeerConnected / PeerDisconnected events per-peer.
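
A rough sketch of what per-peer locking in connectEventManager could look like. This is illustrative only, not necessarily how the actual change implements it, and cleanup of per-peer locks on disconnect is omitted:

```go
package sketch

import "sync"

// connectEventManager sketch: a small lock guards only the lock table,
// and each peer gets its own mutex, so a slow PeerConnected handler for
// one peer no longer blocks connect/disconnect events for other peers.
type connectEventManager struct {
	lk        sync.Mutex
	peerLocks map[string]*sync.Mutex // peer ID -> per-peer lock
	connmgr   interface {
		Connected(p string)
		Disconnected(p string)
	}
}

func (c *connectEventManager) lockForPeer(p string) *sync.Mutex {
	c.lk.Lock()
	defer c.lk.Unlock()
	pl, ok := c.peerLocks[p]
	if !ok {
		pl = &sync.Mutex{}
		c.peerLocks[p] = pl
	}
	return pl
}

func (c *connectEventManager) Connected(p string) {
	pl := c.lockForPeer(p)
	pl.Lock()
	defer pl.Unlock()
	c.connmgr.Connected(p)
}

func (c *connectEventManager) Disconnected(p string) {
	pl := c.lockForPeer(p)
	pl.Lock()
	defer pl.Unlock()
	c.connmgr.Disconnected(p)
}
```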

@dirkmc
Contributor

dirkmc commented Apr 20, 2020

Also I think this will help contention in general:
#356

@Stebalien
Member Author

Newer profiles: ipfs-profile-preload-packet-sjc-001-2020-04-18T00_43_39+0000.tar.gz

I'm going to try to make a new build with the new extractMessage optimizations.

@dirkmc
Contributor

dirkmc commented Apr 21, 2020

In any case, that looks better: it seems like extractMessage() is running faster.

@Stebalien
Member Author

With these optimizations: ipfs-profile-preload-packet-sjc-001-2020-04-21T16_04_50+0000.tar.gz

We're still falling behind, but it's getting better. The remaining CPU drag is canceling.

@dirkmc
Contributor

dirkmc commented Apr 21, 2020

@Stebalien
Member Author

It looks like it's blocking on the wantlist lock.

@Stebalien
Member Author

Well, I'm not sure if it's blocking, or just spending a lot of time because we're sending a ton of cancels.

@Stebalien
Member Author

I'm not seeing this anymore since we fixed session usage in the ipfs refs command. TL;DR: constantly broadcasting to many peers slows everything down.

I'd still like to rework connection management a bit, but I think we can close this. We put a lot of effort into barking up the wrong tree, but I'm happy we spent the time optimizing this anyway.
