Goroutine count, Node Performance, and correlation with peer count #5420

Open
bonedaddy opened this issue Sep 2, 2018 · 8 comments
Labels: kind/bug A bug in existing code (including security flaws) · topic/perf Performance

Comments

@bonedaddy
Contributor

Version information:

go-ipfs version: 0.4.17-
Repo version: 7
System version: amd64/linux
Golang version: go1.10.3

Note that I'm also running an IPFS Cluster daemon on the same node which is connected to one other peer:

ipfs-cluster-service version 0.5.0

Type:

Possible goroutine leak or other bug

Description:

The number of goroutines running on my IPFS nodes appears to be highly correlated with the number of peers my node is connected to. In the past I would normally have between 600 and 1,000 peers, and I noticed overall "slow" performance from my node. I also noticed that my nodes constantly had a significantly high number of goroutines running (10K+).

I suspected the poor performance was due to slow DHT querying with the large number of peers, so I lowered my peer count range (200 -> 500). Subsequently, my node's performance was significantly better than before, and the goroutine count was lower. Today, while checking my monitoring system, I noticed an extremely interesting pattern: the peer count on one of my nodes dropped sharply, and as the peer count dropped, so did the number of goroutines running on that node.

I'm unsure what the underlying issue would be, but it appears that the more peers you are connected to, the more goroutines you are running, which on the surface looks like what was causing the poor performance of my nodes.
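
For anyone who wants to reproduce the correlation, here's a minimal Go sketch (assumptions: the daemon's HTTP API is on the default 127.0.0.1:5001, the swarm/peers endpoint accepts a plain POST, and the pprof handlers are mounted on the API server as described in the debug guide) that periodically samples the swarm peer count and the daemon's goroutine count:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Assumes the go-ipfs HTTP API is listening on the default address.
const api = "http://127.0.0.1:5001"

// peerCount asks the daemon how many peers it is currently connected to.
func peerCount() (int, error) {
	resp, err := http.Post(api+"/api/v0/swarm/peers", "", nil)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	var out struct{ Peers []json.RawMessage }
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return 0, err
	}
	return len(out.Peers), nil
}

// goroutineCount reads the total from the daemon's pprof goroutine dump,
// whose first line looks like "goroutine profile: total N".
func goroutineCount() (int, error) {
	resp, err := http.Get(api + "/debug/pprof/goroutine?debug=1")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	var n int
	_, err = fmt.Fscanf(resp.Body, "goroutine profile: total %d", &n)
	return n, err
}

func main() {
	// Print one sample every 30 seconds; graph the two columns to see
	// whether peer count and goroutine count track each other.
	for range time.Tick(30 * time.Second) {
		p, _ := peerCount()
		g, _ := goroutineCount()
		fmt.Printf("%s peers=%d goroutines=%d\n",
			time.Now().Format(time.RFC3339), p, g)
	}
}
```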

System Specs:

CPU: E5-2680 v2 - 12 cores
Memory: 16GB DDR3
Disk: 750GB 10K RPM
Disk Format: ext4
IPFS Repo: BadgerDS
@hsanjuan
Contributor

hsanjuan commented Sep 4, 2018

@postables my first impression is that this is perfectly normal. The more peers you're connected to, the more communication streams need to be handled, and each of them needs a goroutine (or several).
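
To illustrate the point, a generic Go sketch (not libp2p's actual code) of the one-goroutine-per-connection pattern a network stack like this uses:

```go
package main

import (
	"bufio"
	"log"
	"net"
)

// Generic illustration only: every accepted connection gets (at least)
// one goroutine that blocks reading from it, so the number of live
// goroutines grows linearly with the number of connected peers.
func serve(l net.Listener) {
	for {
		conn, err := l.Accept()
		if err != nil {
			return
		}
		go handle(conn) // one reader goroutine per peer connection
	}
}

func handle(conn net.Conn) {
	defer conn.Close()
	scanner := bufio.NewScanner(conn)
	for scanner.Scan() {
		// Process one message; in libp2p each muxed stream would
		// typically get its own goroutine on top of this as well.
		log.Printf("%s: %s", conn.RemoteAddr(), scanner.Text())
	}
}

func main() {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	serve(l) // blocks, spawning a goroutine per incoming connection
}
```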

Now perhaps you can elaborate on "poor performance". Is go-ipfs causing abnormal load on your 12 cores? Is the disk IO super slow? Or is it DHT query/resolving?

@bonedaddy
Contributor Author

bonedaddy commented Sep 4, 2018

@hsanjuan Ah okay, never mind about that then. The disk IO isn't what's slow; it's the DHT querying/resolving. From everything I can tell, our disk IO is fine and, if anything, under-loaded.

@Stebalien
Member

DHT querying/resolving shouldn't slow down when you have more connections. That's really interesting.

Stebalien added the kind/bug label on Sep 5, 2018
@Stebalien
Member

How about the CPU usage? If it's high, could you try dumping a CPU profile (https://github.com/ipfs/go-ipfs/blob/master/docs/debug-guide.md#beginning)?
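
For reference, the equivalent of the debug guide's curl command as a small Go sketch (assuming the default API address); the resulting file can be inspected with `go tool pprof`:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

// Capture a 30-second CPU profile from the go-ipfs daemon's pprof
// endpoint (assumes the API is on the default 127.0.0.1:5001), then
// inspect it with `go tool pprof ipfs.cpuprof`.
func main() {
	resp, err := http.Get("http://127.0.0.1:5001/debug/pprof/profile?seconds=30")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	f, err := os.Create("ipfs.cpuprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := io.Copy(f, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("wrote ipfs.cpuprof")
}
```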

@bonedaddy
Contributor Author

Interesting, why shouldn't it be slower? Would it be because, as hsanjuan pointed out, there are more communication streams, allowing data to be fetched from multiple sources faster? As for CPU usage, it's quite low in general; I'm actually surprised at how low it is.

I don't have much of an explanation for why my performance increased when I decreased my connection count range. The only other change I made within the same time frame was that any time my nodes "add" files to IPFS, they now do so without pinning, and the pinning occurs in a separate process after the file has been successfully uploaded.

So perhaps the root issue was the concurrent add/pin behavior, and I'm just mixing up which change resulted in the higher performance?

@Stebalien
Member

> Interesting, why shouldn't it be slower? Would it be because, as hsanjuan pointed out, there are more communication streams, allowing data to be fetched from multiple sources faster? As for CPU usage, it's quite low in general; I'm actually surprised at how low it is.

DHT lookups should be faster because we're already likely to be connected to DHT servers with the information we need. However, bitswap may be slower, as it currently asks all connected peers for the objects we're looking for.
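
To make the scaling difference concrete, an illustrative sketch (not the actual go-bitswap code) comparing a broadcast want against a session-scoped want:

```go
package main

import "fmt"

type peerID string

// Illustrative only: a broadcast want is sent to every connected peer,
// so message count per lookup grows with the connection count.
func broadcastWant(connected []peerID, block string) int {
	return len(connected) // one WANT message per connected peer
}

// A session-style want only goes to peers that have already supplied
// related blocks, so the cost is bounded by the session, not the swarm.
func sessionWant(sessionPeers []peerID, block string) int {
	return len(sessionPeers)
}

func main() {
	swarm := make([]peerID, 500) // e.g. 500 connected peers
	session := swarm[:5]         // a handful of peers that proved useful

	fmt.Println("broadcast messages:", broadcastWant(swarm, "QmSomeBlock"))
	fmt.Println("session messages:  ", sessionWant(session, "QmSomeBlock"))
}
```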

But this is interesting. We should look into this more to make sure we aren't making any incorrect assumptions.

@bonedaddy
Contributor Author

bonedaddy commented Sep 5, 2018

Ah okay, perhaps it was bitswap then. Agreed! It's been very interesting and kind of fun to see what tricks work to increase node performance.

I should hopefully have time around the weekend to test this out. Are there any particular metrics that would help resolve this? I've got Zabbix, Grafana, and Prometheus at my disposal for metric collection, and could possibly, at least temporarily, implement another tool if it would help get additional information that isn't available otherwise.
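
For reference, go-ipfs also exposes Prometheus-format metrics on the API port (assuming defaults, http://127.0.0.1:5001/debug/metrics/prometheus), so the standard go_goroutines gauge can be scraped directly by Prometheus/Grafana. A quick Go sketch to check that gauge:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

// Sketch: scrape the daemon's Prometheus endpoint (assumes the default
// API address) and print the go_goroutines gauge. A real setup would
// just point Prometheus at this URL and graph the series in Grafana.
func main() {
	resp, err := http.Get("http://127.0.0.1:5001/debug/metrics/prometheus")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "go_goroutines ") {
			fmt.Println(line) // e.g. "go_goroutines 10423"
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```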

@Stebalien
Member

Update: We've reduced bitswap chattiness through better use of sessions. This should, in general, reduce the impact of having many connections. However:

  1. The number of goroutines is still linear in the number of peers. Unfortunately, this is unavoidable, as we need a goroutine listening on every stream/connection to our peers.
  2. Bitswap is still pretty chatty.
