Huge amount of context switches, growing over time #1036
So, the next time you see this happening, if I could get the output of
I can get a core dump of the running process with
A 1-2 minute strace summary:
A short system call trace of one of the threads:
if you send the process. But that's a lot of threads, and quite a large number of calls to futex, epoll_wait and select... Looks to me like a goroutine leak.
I'll push a changeset that adds an option to disable reuseport. Would be good to see the numbers then.
Thank you for merging. I'm running the master branch now with
Ok, my reasoning was, hmm, just wrong. Sorry. Since I am not able to catch the log message for some reason, I just changed the setting. Still, I find the amount of system calls the daemon emits a bit alarming:
@okket that is pretty concerning. The futex calls are goroutine scheduling going on, which implies we probably have too many goroutines. epoll_wait is typically in the network stack; I know that the REUSEPORT code makes a large number of calls to it, and go/net probably does as well. At any rate, I think some profiling needs to be done, and event logs need to be turned off.
I forgot to mention: Because Fedora does not include P-224 (for patent reasons, I guess) I had to remove support for it:

```diff
diff --git a/p2p/crypto/key.go b/p2p/crypto/key.go
index 7f0a13a..31dfeb3 100644
--- a/p2p/crypto/key.go
+++ b/p2p/crypto/key.go
@@ -102,8 +102,8 @@ func GenerateEKeyPair(curveName string) ([]byte, GenSharedKey, error) {
 	var curve elliptic.Curve
 	switch curveName {
-	case "P-224":
-		curve = elliptic.P224()
+	// case "P-224":
+	// 	curve = elliptic.P224()
 	case "P-256":
 		curve = elliptic.P256()
 	case "P-384":
diff --git a/p2p/crypto/secio/al.go b/p2p/crypto/secio/al.go
index 9a0cadf..9e0ffce 100644
--- a/p2p/crypto/secio/al.go
+++ b/p2p/crypto/secio/al.go
@@ -18,7 +18,7 @@ import (
 )
 
 // List of supported ECDH curves
-var SupportedExchanges = "P-256,P-224,P-384,P-521"
+var SupportedExchanges = "P-256,P-384,P-521"
 
 // List of supported Ciphers
 var SupportedCiphers = "AES-256,AES-128,Blowfish"
```

I hope this change does not, in some crazy way I cannot possibly imagine, create a feedback loop or trigger another thing that is responsible for the growing context-switch problem.
I don't see why Fedora not having it means you have to remove it from ipfs; P-224 is part of the Go standard library.
Also, I have two Fedora boxes running ipfs without modification just fine. You may be running an older version of Go?
Nope:

```
$ yum info golang
Loaded plugins: langpacks
Installed Packages
Name        : golang
Arch        : x86_64
Version     : 1.4.2
Release     : 2.fc21
Size        : 10 M
Repo        : installed
From repo   : updates
Summary     : The Go Programming Language
URL         : http://golang.org/
License     : BSD
Description : The Go Programming Language.
```

The build fails with P-224 support:

```
$ make build
cd cmd/ipfs && go build -i
# github.com/ipfs/go-ipfs/p2p/crypto
../../../gocode/src/github.com/ipfs/go-ipfs/p2p/crypto/key.go:106: undefined: elliptic.P224
Makefile:27: recipe for target 'build' failed
make: *** [build] Error 2
```
Can you also run
```
$ go version
go version go1.4.2 linux/amd64
```

Support for P-224 was removed in 2013 (1); Fedora 20 does not ship with it, see http://koji.fedoraproject.org/koji/rpminfo?rpmID=6091693

Are you sure you use the system-installed Go like me? No?

```
$ which go
/bin/go
```

(1) https://lists.fedoraproject.org/pipermail/golang/2013-December/000596.html
I'll install the official binary go distribution locally to get this side issue out of the way. |
Ah, I always install from source. That is really weird that Fedora ships a patched Go.
It appears that the Fedora maintainers are paranoid, Bruce Schneier quote incoming:
https://www.schneier.com/blog/archives/2013/09/the_nsa_is_brea.html#c1675929
@whyrusleeping agreed. This should be a change in Go.
Maybe a hint: If you look at these graphs and numbers, you see that the timer interrupts (green) are growing linearly (on a logarithmic scale!) up to the maximum of 716, which corresponds to 10k context switches. Nothing else on my server uses this many timers, but the really scary fact (and why I keep pressing this) is the linear growth. This is not natural, most likely will not stop until the server crashes, and can (IMHO) only be explained by some kind of 'leak'. I will look into this with debug ideas from http://elinux.org/Kernel_Timer_Systems after my system has finished
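For correlating the graphs with the daemon itself, the kernel keeps per-process context-switch counters that can be sampled cheaply. A Linux-only sketch (the helper name is mine, not from ipfs):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// ctxtSwitches reads the voluntary and nonvoluntary context-switch
// counters the Linux kernel keeps for this process in /proc/self/status.
func ctxtSwitches() (voluntary, nonvoluntary int64, err error) {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) != 2 {
			continue
		}
		n, _ := strconv.ParseInt(fields[1], 10, 64)
		switch fields[0] {
		case "voluntary_ctxt_switches:":
			voluntary = n
		case "nonvoluntary_ctxt_switches:":
			nonvoluntary = n
		}
	}
	return voluntary, nonvoluntary, s.Err()
}

func main() {
	v, nv, err := ctxtSwitches()
	fmt.Println("voluntary:", v, "nonvoluntary:", nv, "err:", err)
}
```

Sampling this once a second for the ipfs pid (via /proc/PID/status) would show directly whether the growing system-wide numbers come from the daemon.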
okket, could you run ipfs with
Also, thanks a ton for helping us debug this issue!! |
Please ignore my last posting above. The interrupt count on my system is not reliable:

(Sadly there is no BIOS update available.) I doubt that this causes the high context-switch problem; the system is very reliable otherwise. I'll insert a complete SIGQUIT dump when it reaches 10k or so.
(also, thanks for the heads up on ZoL 0.6.4) |
~12k context switches per second again, growing linearly as usual. 60 second strace:

Now killed; the dump is in ipfs.out
Is this with master?
It was the master from two days ago. Dang, I should have noted the commit id. Idea: ipfs should hardcode the current git commit id into the binary and output it during startup, along with the official version number.
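That idea can be done with the Go linker's `-X` flag, which overwrites a package variable at build time. A minimal sketch (variable and version string are illustrative; on Go 1.4 the flag syntax is `-X main.CurrentCommit <value>` with a space instead of `=`):

```go
package main

import "fmt"

// CurrentCommit is injected at build time, e.g.:
//   go build -ldflags "-X main.CurrentCommit=$(git rev-parse --short HEAD)"
// If the flag is omitted, the placeholder below is used.
var CurrentCommit = "unknown"

func main() {
	// Print the commit alongside the release version at startup.
	fmt.Printf("ipfs version 0.3.x-%s\n", CurrentCommit)
}
```

Binaries built this way identify exactly which commit produced any given profile or crash dump.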
The stack trace looks fairly sane. 60 open socket connections, a few streams. Nothing huge by any means... this is bizarre.
@okket do you still have the binary that you ran for those profiles? pprof really wants to have the same binary the profiles were generated with.
No, the binary was deleted. I'll keep it for the next round(s). |
Here is the link to the data files from the server restart today: http://gateway.ipfs.io/ipfs/QmUz7ij4KaM7Z5h9A4o7w1oahQUDwT2CiYpXA43HTGQYy7 |
This time the daemon crashed with "runtime: out of memory" http://gateway.ipfs.io/ipfs/QmSSvfWum6QDFxX38m8YT5vHTpT9pxgWywZURHPcVf44yY |
@okket that's what I get on mars every night, lol. Thank you for the logs, I'll let you know if I find anything incriminating in them.
@okket So, I've seen this exact trace hundreds of times, but for some reason I decided to look at yours a little more closely. And I found the out-of-memory issue. This line in the panic goroutine:

Looks like it's pretty normal, a function taking some pointers to some things... except that first argument isn't a pointer. It's the allocation size. Looking deeper, it appears that we ask msgio what the next message size is, it tells us this giant number, we say "okay, let's make some memory" and blindly allocate it. I believe we need to set an upper limit on these messages, and/or find out who is sending them.
@jbenet o/ PING |
wow, good find! |
finally :) |
👏 |
thanks @okket @whyrusleeping -- this is great -- 👏 👏 |
When I let `ipfs daemon` run for about 48 hours, my system starts to show latency issues due to a huge amount of context switches. They slowly grow over time and are directly attributable to ipfs (the problem does not exist without running the daemon). Here is a nice graph to illustrate:

Every other metric, like packets per second (about 300), connections (about 50), memory usage, etc., does not grow out of bounds.

Guesswork: With `vmstat 1` I see almost exactly every 10 seconds a huge spike in context switches, currently about 20k. I am somewhat sure that this is the daemon.

I'll update this issue tomorrow with fresh numbers and hopefully a dump of the daemon.