PeerManager's lock is highly contended #352

The global lock in PeerManager is highly contended and is causing the preloader node to crash very quickly. The main contention seems to be: …

I don't believe we have a deadlock, but we definitely need to get more efficient here.

ipfs-profile-preload-packet-sjc-001-2020-04-15T23_27_36+0000.tar.gz

Comments
I think I've found part of the issue: …
I've also noticed that the mutex profile seems to think we're spending a lot of time in RemoveSession, walking through wants. We may want to keep a reverse index (session -> want) to avoid walking through all wants.
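A minimal sketch of that reverse index, assuming hypothetical types (session IDs as uint64, wants keyed by CID string, a `WantIndex` wrapper); go-bitswap's real structures differ:

```go
package main

import "fmt"

// WantIndex pairs the existing want -> sessions map with a reverse
// session -> wants map so RemoveSession no longer scans every want.
type WantIndex struct {
	wantSessions map[string]map[uint64]struct{} // want -> interested sessions
	sessionWants map[uint64]map[string]struct{} // reverse index: session -> its wants
}

func NewWantIndex() *WantIndex {
	return &WantIndex{
		wantSessions: make(map[string]map[uint64]struct{}),
		sessionWants: make(map[uint64]map[string]struct{}),
	}
}

// Add records that a session wants a CID, updating both indexes.
func (ix *WantIndex) Add(ses uint64, want string) {
	if ix.wantSessions[want] == nil {
		ix.wantSessions[want] = make(map[uint64]struct{})
	}
	ix.wantSessions[want][ses] = struct{}{}
	if ix.sessionWants[ses] == nil {
		ix.sessionWants[ses] = make(map[string]struct{})
	}
	ix.sessionWants[ses][want] = struct{}{}
}

// RemoveSession touches only the wants this session actually holds,
// instead of walking the entire want table.
func (ix *WantIndex) RemoveSession(ses uint64) {
	for want := range ix.sessionWants[ses] {
		delete(ix.wantSessions[want], ses)
		if len(ix.wantSessions[want]) == 0 {
			delete(ix.wantSessions, want)
		}
	}
	delete(ix.sessionWants, ses)
}

func main() {
	ix := NewWantIndex()
	ix.Add(1, "QmFoo")
	ix.Add(1, "QmBar")
	ix.Add(2, "QmBar")
	ix.RemoveSession(1)
	fmt.Println(len(ix.wantSessions)) // 1: only QmBar survives, held by session 2
}
```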
We need to take the MessageQueue lock during …. One way we've discussed to make …. We could also check the size of the cancels + wantlists, and if there are few enough entries that they're likely to come in under the message size limit, we just take them in unsorted order.
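A sketch of that size check, with assumed numbers and names (`entrySize`, `maxMessageSize`, and the `entry`/`messageQueue` types are illustrative, not go-bitswap's actual API); the real message queue tracks sizes more precisely:

```go
package main

import (
	"fmt"
	"sort"
)

const (
	entrySize      = 64       // assumed upper bound on one serialized entry, in bytes
	maxMessageSize = 16 << 10 // assumed per-message size limit
)

type entry struct {
	cid      string
	priority int
}

type messageQueue struct {
	pendingWants   []entry
	pendingCancels []entry
}

// extract picks the entries for the next message. When everything is
// likely to fit in a single message, it skips the priority sort.
func (mq *messageQueue) extract() []entry {
	n := len(mq.pendingWants) + len(mq.pendingCancels)
	if n*entrySize <= maxMessageSize {
		out := make([]entry, 0, n)
		out = append(out, mq.pendingWants...)
		return append(out, mq.pendingCancels...)
	}
	// Too big for one message: sort by descending priority and take as
	// many entries as fit.
	all := append(append([]entry(nil), mq.pendingWants...), mq.pendingCancels...)
	sort.Slice(all, func(i, j int) bool { return all[i].priority > all[j].priority })
	if max := maxMessageSize / entrySize; len(all) > max {
		all = all[:max]
	}
	return all
}

func main() {
	mq := &messageQueue{
		pendingWants:   []entry{{cid: "QmFoo", priority: 1}},
		pendingCancels: []entry{{cid: "QmBar"}},
	}
	fmt.Println(len(mq.extract())) // 2, taken unsorted via the fast path
}
```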
I'm not seeing this issue when testing with #351. I believe we may have seen this issue due to repeatedly trying to re-send messages to unresponsive peers: every time we tried to send a broadcast want, we'd try to re-send all pending messages to these peers. This theory matches the CPU profile, where a lot of time was spent extracting messages.
Hm. I spoke too soon. It's better, but still pretty bad. I believe the issue now is that every time we connect to a peer, we send them our broadcast wantlist. We do that under a global lock. Worse, we do it under the same lock we need to take when processing ….
I will modify ….
Also, I think this will help contention in general: …
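A sketch of the lock-scope reduction implied by the previous comments: snapshot the broadcast wantlist while holding the PeerManager lock, then do the per-peer send after releasing it, so connecting to a peer no longer serializes against everything else that needs the global lock. All names here are hypothetical stand-ins for go-bitswap's actual types:

```go
package main

import "sync"

type peerID string

// msgQueue stands in for a per-peer message queue.
type msgQueue struct{}

func (q *msgQueue) AddBroadcastWants(wants []string) { /* enqueue for sending, elided */ }

type PeerManager struct {
	lk             sync.Mutex
	queues         map[peerID]*msgQueue
	broadcastWants map[string]struct{}
}

// Connected copies the broadcast wantlist under the lock, then sends
// outside it.
func (pm *PeerManager) Connected(p peerID) {
	pm.lk.Lock()
	q, ok := pm.queues[p]
	if !ok {
		q = &msgQueue{}
		pm.queues[p] = q
	}
	wants := make([]string, 0, len(pm.broadcastWants))
	for w := range pm.broadcastWants {
		wants = append(wants, w)
	}
	pm.lk.Unlock()

	// Potentially slow work now runs without the global lock held.
	q.AddBroadcastWants(wants)
}

func main() {
	pm := &PeerManager{
		queues:         make(map[peerID]*msgQueue),
		broadcastWants: map[string]struct{}{"QmFoo": {}},
	}
	pm.Connected("peer-1")
}
```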
Newer profiles (ipfs-profile-preload-packet-sjc-001-2020-04-18T00_43_39+0000.tar.gz). I'm going to try to make a new build with the new ….
In any case, that looks better; it seems like ….
With these optimizations (ipfs-profile-preload-packet-sjc-001-2020-04-21T16_04_50+0000.tar.gz), we're still falling behind, but it's getting better. The remaining CPU drag is canceling.
It looks like it's blocking on the wantlist lock.
Well, I'm not sure if it's blocking, or just spending a lot of time because we're sending a ton of cancels.
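Not something this thread settled on, but one generic way to cut per-cancel overhead is to coalesce cancels into a pending set and flush them as a batch, so duplicate cancels collapse and the downstream wantlist lock is taken once per flush rather than once per cancel. A sketch with hypothetical names:

```go
package main

import "sync"

// cancelBatcher coalesces individual cancels; duplicates for the same
// CID collapse in the map.
type cancelBatcher struct {
	mu      sync.Mutex
	pending map[string]struct{}
}

func newCancelBatcher() *cancelBatcher {
	return &cancelBatcher{pending: make(map[string]struct{})}
}

func (b *cancelBatcher) Cancel(c string) {
	b.mu.Lock()
	b.pending[c] = struct{}{}
	b.mu.Unlock()
}

// Flush hands the accumulated cancels to send in a single call, so a
// downstream lock (e.g. a wantlist lock) is acquired once per batch.
func (b *cancelBatcher) Flush(send func([]string)) {
	b.mu.Lock()
	batch := make([]string, 0, len(b.pending))
	for c := range b.pending {
		batch = append(batch, c)
	}
	b.pending = make(map[string]struct{})
	b.mu.Unlock()
	if len(batch) > 0 {
		send(batch)
	}
}

func main() {
	b := newCancelBatcher()
	b.Cancel("QmFoo")
	b.Cancel("QmFoo") // duplicate, collapses
	b.Cancel("QmBar")
	b.Flush(func(batch []string) { _ = batch /* send one wantlist update */ })
}
```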
I'm not seeing this anymore since we fixed session usage in the …. I'd still like to rework connection management a bit, but I think we can close this. We put a lot of effort into barking up the wrong tree, but I'm happy we spent the time optimizing this anyway.