Huge amount of context switches, growing over time #1036

Closed
okket opened this issue Apr 8, 2015 · 45 comments

@okket

okket commented Apr 8, 2015

When I let the ipfs daemon run for about 48 hours, my system starts to show latency issues due to a huge number of context switches. They slowly grow over time and are directly attributable to ipfs (the problem does not exist without the daemon running).

Here is a nice graph to illustrate:
[graph: interrupts-week]

Every other metric, such as packets per second (about 300), connections (about 50), and memory usage, stays within bounds.

Guesswork: with vmstat 1 I see a huge spike in context switches almost exactly every 10 seconds, currently about 20k. I am fairly sure this is the daemon.

I'll update this issue tomorrow with fresh numbers and hopefully a dump of the daemon.

@whyrusleeping
Member

So, the next time you see this happening, could I get the output of ps -eLf | grep ipfs and also the stack dump from pressing Ctrl + \?

@okket
Author

okket commented Apr 8, 2015

ps -eLf | grep ipfs yields:

ipfs     10438 29363 10438  0    1 05:02 pts/0    00:00:00 ps -eLf
ipfs     10439 29363 10439  0    1 05:02 pts/0    00:00:00 grep --color=auto -i ipfs
ipfs     22072     1 22072  0    1 Apr07 ?        00:00:01 tmux
ipfs     22073 22072 22073  0    1 Apr07 pts/1    00:00:00 -bash
ipfs     22167 22073 22167  0  117 Apr07 pts/1    00:00:00 ipfs daemon
[...]
ipfs     22167 22073 22317  0  117 Apr07 pts/1    00:00:00 ipfs daemon

I can get a core dump of the running process with gcore -o <file> <pid>. Does this help? Or do you need the dump of the terminated process? The latter is a bit trickier: core dumps were not enabled in the environment when the process was started. The best idea I have is to somehow halt the process (with a signal?) and then gcore it.

@okket
Author

okket commented Apr 8, 2015

A 1-2 minute strace summary:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 72.33    9.250829          96     96790     31186 futex
 22.64    2.895659         119     24235           epoll_wait
  3.81    0.487598          14     35409           select
  0.63    0.080998           3     30416      2570 read
  0.43    0.054642           3     16720           write
  0.11    0.013771        6886         2         1 restart_syscall
  0.05    0.006987           2      3154           sched_yield
------ ----------- ----------- --------- --------- ----------------
100.00   12.790484                206726     33757 total

A short system call trace of one of the threads:

22168      0.000047 restart_syscall(<... resuming interrupted call ...> <unfinished ...>
22168      0.000041 <... restart_syscall resumed> ) = 0
22168      0.000041 select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
22168      0.000026 <... select resumed> ) = 0 (Timeout)
22168      0.000026 select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
22168      0.000106 <... select resumed> ) = 0 (Timeout)
22168      0.000042 select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
22168      0.000051 <... select resumed> ) = 0 (Timeout)
22168      0.000041 futex(0x1083a40, FUTEX_WAIT, 0, {60, 0} <unfinished ...>
22168      0.000047 <... futex resumed> ) = 0
22168      0.000047 select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
22168      0.000057 <... select resumed> ) = 0 (Timeout)
[...]

@whyrusleeping
Member

If you send the process SIGQUIT it will dump all the goroutine stacks.

But that's a lot of threads, and quite a large number of calls to futex, epoll_wait and select... Looks to me like a goroutine leak.
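
For reference, the dump that SIGQUIT triggers can also be produced programmatically with the standard runtime/pprof package; a generic sketch (not go-ipfs code) that prints the goroutine count and all goroutine stacks without killing the process:

package main

import (
	"os"
	"os/signal"
	"runtime"
	"runtime/pprof"
	"syscall"
)

// dumpGoroutines writes all goroutine stacks to stderr, similar to what
// the Go runtime prints when it receives SIGQUIT.
func dumpGoroutines() {
	pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}

func main() {
	// Print the current goroutine count and a full stack dump whenever
	// the process receives SIGUSR1, without terminating it.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGUSR1)
	go func() {
		for range sig {
			println("goroutines:", runtime.NumGoroutine())
			dumpGoroutines()
		}
	}()
	select {} // stand-in for the daemon's real work
}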

@jbenet
Member

jbenet commented Apr 8, 2015

I'll push a changeset that adds an option to disable reuseport. Would be good to see the numbers then.
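
Not the actual patch, but a minimal sketch of what such an environment toggle could look like (the function name and the accepted values are assumptions, not the real go-ipfs code):

package main

import (
	"fmt"
	"os"
	"strings"
)

// reuseportEnabled reports whether the SO_REUSEPORT code path should be
// used. Setting IPFS_REUSEPORT=false (or 0) disables it, which is the
// behaviour the proposed option aims for.
func reuseportEnabled() bool {
	switch strings.ToLower(os.Getenv("IPFS_REUSEPORT")) {
	case "false", "0":
		return false
	default:
		return true
	}
}

func main() {
	fmt.Println("reuseport enabled:", reuseportEnabled())
}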

@okket
Author

okket commented Apr 8, 2015

Sounds great. I'll gladly try that changeset and report.

I also managed to get the last 2k lines of the kill -SIGQUIT output out of the tmux session. The IPFS hash for the file is QmYGvyL3ZUQaQqwk9Zc4uUrpxWG42gh7X86UhRPeeU92U4

Here is an updated graph which shows the daemon restart:

[graph: interrupts-week 2]

@jbenet
Member

jbenet commented Apr 8, 2015

@okket changeset available in the branch for #1038 -- it will be merged soon too.

@okket
Author

okket commented Apr 8, 2015

Thank you for merging. I'm running the master branch now with IPFS_REUSEPORT=false ipfs daemon, but I am not sure the new code gets called. The small typo (log.Infof, see #1042) did not error. Consequently I do not see the confirmation message in the logs.

@okket
Author

okket commented Apr 8, 2015

Ok, my reasoning was, hmm, just wrong. Sorry. Since I am not able to catch the log message for some reason, I just changed the log.Infof into a print in my test setup. The code is called, and hopefully the daemon now does not eat up resources over time. I'll report in a day or two.

Still, I find the number of system calls the daemon makes a bit alarming:

$ timeout 60 strace -c -f -p $(pidof ipfs)
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 72.08    3.705090         103     36124      3984 futex
 19.80    1.017752          38     26701           epoll_wait
  6.25    0.321034          18     17715           select
  1.03    0.053204           2     25965      2644 read
  0.75    0.038429           3     15155           write
  0.08    0.004328           3      1562           sched_yield
  0.01    0.000512          73         7           madvise
  0.00    0.000167          84         2         1 restart_syscall
------ ----------- ----------- --------- --------- ----------------
100.00    5.140516                123231      6629 total

@whyrusleeping
Member

@okket that is pretty concerning. The futex calls are goroutine scheduling going on, and they imply we probably have too many goroutines. epoll_wait is typically in the network stack; I know the REUSEPORT code makes a large number of calls to it, and go/net probably does as well.

At any rate, I think some profiling needs to be done, and eventlogs need to be turned off.
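
One low-friction way to profile a long-running Go daemon is the standard net/http/pprof handler; a generic sketch (whether go-ipfs exposes this on its API port is not assumed here):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/ handlers on the default mux
)

func main() {
	// Expose profiling endpoints on localhost only; then for example:
	//   go tool pprof http://localhost:6060/debug/pprof/profile   (30s CPU profile)
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	//   curl http://localhost:6060/debug/pprof/goroutine?debug=2   (all goroutine stacks)
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}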

@okket
Author

okket commented Apr 9, 2015

Preliminary results indicate that current master with IPFS_REUSEPORT=false ipfs daemon is still generating more context switches over time:

[graph: interrupts-week 3]

@okket
Author

okket commented Apr 9, 2015

I forgot to mention: Because Fedora does not include P-224 (for patent reasons I guess) I had to remove support from ipfs:

diff --git a/p2p/crypto/key.go b/p2p/crypto/key.go
index 7f0a13a..31dfeb3 100644
--- a/p2p/crypto/key.go
+++ b/p2p/crypto/key.go
@@ -102,8 +102,8 @@ func GenerateEKeyPair(curveName string) ([]byte, GenSharedKey, error) {
        var curve elliptic.Curve

        switch curveName {
-       case "P-224":
-               curve = elliptic.P224()
+//     case "P-224":
+//             curve = elliptic.P224()
        case "P-256":
                curve = elliptic.P256()
        case "P-384":
diff --git a/p2p/crypto/secio/al.go b/p2p/crypto/secio/al.go
index 9a0cadf..9e0ffce 100644
--- a/p2p/crypto/secio/al.go
+++ b/p2p/crypto/secio/al.go
@@ -18,7 +18,7 @@ import (
 )

 // List of supported ECDH curves
-var SupportedExchanges = "P-256,P-224,P-384,P-521"
+var SupportedExchanges = "P-256,P-384,P-521"

 // List of supported Ciphers
 var SupportedCiphers = "AES-256,AES-128,Blowfish"

I hope this change does not, in some way I cannot possibly imagine, create a feedback loop or trigger something else that is responsible for the growing context switch problem.

@whyrusleeping
Member

I don't see why Fedora not having it means you have to remove it from ipfs; P-224 is part of the Go standard library.

@whyrusleeping
Member

Also, I have two Fedora boxes running ipfs without modification just fine. Are you perhaps running an older version of Go?

@okket
Author

okket commented Apr 9, 2015

Nope:

$ yum info golang
Loaded plugins: langpacks
Installed Packages
Name        : golang
Arch        : x86_64
Version     : 1.4.2
Release     : 2.fc21
Size        : 10 M
Repo        : installed
From repo   : updates
Summary     : The Go Programming Language
URL         : http://golang.org/
License     : BSD
Description : The Go Programming Language.

The build fails with P-224 support:

$ make build
cd cmd/ipfs && go build -i
# github.com/ipfs/go-ipfs/p2p/crypto
../../../gocode/src/github.com/ipfs/go-ipfs/p2p/crypto/key.go:106: undefined: elliptic.P224
Makefile:27: recipe for target 'build' failed
make: *** [build] Error 2

@whyrusleeping
Member

Can you also run go version for me? I get the following:

[jeromy@localhost ~]$ lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch
Distributor ID: Fedora
Description:    Fedora release 20 (Heisenbug)
Release:    20
Codename:   Heisenbug
[jeromy@localhost ~]$ uname -a
Linux localhost.localdomain 3.17.8-200.fc20.x86_64 #1 SMP Thu Jan 8 23:26:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[jeromy@localhost ~]$ go version
go version go1.4.2 linux/amd64
[jeromy@localhost go-ipfs]$ make build
cd cmd/ipfs && go build -i
[jeromy@localhost go-ipfs]$ 

@okket
Author

okket commented Apr 9, 2015

$ go version
go version go1.4.2 linux/amd64

Support for P-224 was removed in 2013 (1); Fedora 20 does not ship with it. See the golang-1.2-remove-ECC-p224.patch here:

http://koji.fedoraproject.org/koji/rpminfo?rpmID=6091693

Are you sure you use the system-installed Go like me? No $GOROOT? Also:

$ which go
/bin/go

(1) https://lists.fedoraproject.org/pipermail/golang/2013-December/000596.html

@okket
Author

okket commented Apr 9, 2015

I'll install the official Go binary distribution locally to get this side issue out of the way.

@whyrusleeping
Member

Ah, I always install from source. It is really weird that Fedora ships a patched Go.

@whyrusleeping
Member

It appears that the Fedora maintainers are paranoid; Bruce Schneier quote incoming:

"On the crypto bits in your guardian piece, I found especially interesting that you suggest classic discrete log crypto over ecc. I want to ask if you could elaborate more on that."

I no longer trust the constants. I believe the NSA has manipulated them through their relationships with industry.

https://www.schneier.com/blog/archives/2013/09/the_nsa_is_brea.html#c1675929

@whyrusleeping
Member

@jbenet
Member

jbenet commented Apr 10, 2015

@whyrusleeping agreed. This should be a change in Go.

@okket
Author

okket commented Apr 10, 2015

Unchanged master, official go binaries, started with IPFS_REUSEPORT=false ipfs daemon, still growing:

[graph: interrupts-week 4]

@okket
Author

okket commented Apr 10, 2015

Maybe a hint: if you look at these graphs and numbers, you see that the timer interrupts (green) are growing linearly (on a logarithmic scale!) up to a maximum of 716, which corresponds to 10k context switches. Nothing else on my server uses this many timers, but the really scary fact (and why I keep pressing this) is the linear growth. This is not natural, will most likely not stop until the server crashes, and can (IMHO) only be explained by some kind of 'leak'.

I will look into this with debug ideas from http://elinux.org/Kernel_Timer_Systems after my system has finished zfs scrub (hello ZoL 0.6.4).

[graph: irqstats-week]

@whyrusleeping
Member

@okket, could you run ipfs with IPFS_PROF set to true, then Ctrl+C the daemon when context switches get really high? I want both the cpuprofile and memprofile it outputs.
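
For context, producing those two files from inside a Go program typically looks like the sketch below (generic runtime/pprof usage; the exact wiring behind IPFS_PROF is an assumption here, only the output file names match the ones mentioned later in the thread):

package main

import (
	"log"
	"os"
	"os/signal"
	"runtime/pprof"
)

func main() {
	// Start a CPU profile when the daemon starts...
	cpuf, err := os.Create("ipfs.cpuprof")
	if err != nil {
		log.Fatal(err)
	}
	if err := pprof.StartCPUProfile(cpuf); err != nil {
		log.Fatal(err)
	}

	// ...then flush both profiles when the process is interrupted (Ctrl+C).
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)
	<-sig

	pprof.StopCPUProfile()
	cpuf.Close()

	memf, err := os.Create("ipfs.memprof")
	if err != nil {
		log.Fatal(err)
	}
	if err := pprof.WriteHeapProfile(memf); err != nil {
		log.Fatal(err)
	}
	memf.Close()
}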

@whyrusleeping
Member

Also, thanks a ton for helping us debug this issue!!

@okket
Author

okket commented Apr 11, 2015

Please ignore my last posting above. The interrupt count on my system is not reliable:

This system BIOS has enabled interrupt remapping on a chipset that contains an erratum
making that feature unstable. To maintain system stability interrupt remapping is being
disabled.  Please contact your BIOS vendor for an update

(Sadly there is no BIOS update available)

I doubt that this causes the high context switch problem. The system is very reliable otherwise. I'll insert a complete SIGQUIT dump when it reaches 10k or so.

@whyrusleeping
Member

(also, thanks for the heads up on ZoL 0.6.4)

@okket
Author

okket commented Apr 13, 2015

~12k context switches per second again, growing linearly as usual:

[graph: interrupts-week 3]

60 second strace:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 79.78   18.854636          67    282022     96950 futex
 16.43    3.884003          82     47561         1 epoll_wait
  2.86    0.675084           7     91441        40 select
  0.43    0.101768           2     50400      4397 read
  0.33    0.077317           3     27993           write
  0.13    0.029989           1     20670           sched_yield
  0.04    0.009583         266        36        24 restart_syscall
  0.00    0.000285           1       563        70 rt_sigreturn
------ ----------- ----------- --------- --------- ----------------
100.00   23.632665                520686    101482 total

Now killed with SIGQUIT, here are the debug files:

ipfs.out QmPh7Srs3W1ChrB9Bwfaz9cfxKDfbk7r2enGHwZ5etKChJ
ipfs.cpuprof QmePD76ngy2cLJYm4FhzDqY5HXo4WY7Voxzb8nTq8xsn7t
ipfs.memprof QmS6Qb94DEEwH3USaJdUn1BoSzYvpf74envHGyEpnyNWt1

@jbenet
Member

jbenet commented Apr 13, 2015

Is this with master?

@okket
Author

okket commented Apr 13, 2015

It was the master from two days ago. Dang, I should have noted the commit id.

Idea: IPFS should hardcode the current git commit id into the binary and output it during startup, along with the official version number.

@jbenet
Member

jbenet commented Apr 13, 2015

Idea: IPFS should hardcode the current git commit id into the binary and output it during startup, along with the official version number.

Agreed. see #691 #693

The build process could do this, but it wouldn't be go-gettable, as modifying the binaries doesn't happen on go get.
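
For reference, the usual way to stamp a commit id into a Go binary is a linker flag at build time; a minimal sketch (the variable name main.CurrentCommit is illustrative, not necessarily what go-ipfs ended up using):

package main

import "fmt"

// CurrentCommit is meant to be overridden at build time, e.g.
//   go build -ldflags "-X main.CurrentCommit=$(git rev-parse --short HEAD)"
// (Go 1.4 and earlier used a space instead of '=': -X main.CurrentCommit abc1234)
var CurrentCommit = "unknown"

func main() {
	fmt.Println("commit:", CurrentCommit)
}

The go get caveat above still applies: anyone building with a plain go get would see the default value, since nothing injects the linker flag for them.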

@whyrusleeping
Member

The stack trace looks fairly sane: 60 open socket connections, a few streams. Nothing huge by any means... this is bizarre.

@whyrusleeping
Member

@okket do you still have the binary you ran for those profiles? pprof really wants the same binary the profiles were generated with.

@jbenet jbenet mentioned this issue Apr 13, 2015
@okket
Author

okket commented Apr 13, 2015

No, the binary was deleted. I'll keep it for the next round(s).

@whyrusleeping whyrusleeping self-assigned this Apr 15, 2015
@okket
Author

okket commented Apr 20, 2015

Here is the link to the data files from the server restart today:

http://gateway.ipfs.io/ipfs/QmUz7ij4KaM7Z5h9A4o7w1oahQUDwT2CiYpXA43HTGQYy7

@jbenet jbenet mentioned this issue Apr 20, 2015
@okket
Author

okket commented Apr 23, 2015

@okket
Author

okket commented Apr 30, 2015

This time the daemon crashed with "runtime: out of memory"

http://gateway.ipfs.io/ipfs/QmSSvfWum6QDFxX38m8YT5vHTpT9pxgWywZURHPcVf44yY

@whyrusleeping
Member

@okket that's what I get on mars every night, lol. Thank you for the logs, I'll let you know if I find anything incriminating in them.

@whyrusleeping
Member

@okket So, I've seen this exact trace hundreds of times, but for some reason I decided to look at yours a little more closely. And I found the out of memory issue:

This line in the panic goroutine:

runtime.mallocgc(0xb1b99a33, 0xa8a320, 0xc200000001, 0xc208fa9c20)

It looks pretty normal, a function taking some pointers to some things... except that the first argument isn't a pointer. It's the allocation size. Looking deeper, it appears that we ask msgio what the next message size is, it tells us this giant number, we say "okay, let's make some memory" and blindly allocate it. I believe we need to set an upper limit on these messages, and/or find out who is sending them.
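
A rough sketch of the kind of guard being suggested: read the length prefix first and refuse to allocate when the announced size is above a sane cap (the 8 MiB limit and the 4-byte big-endian framing here are illustrative, not msgio's actual wire format or API):

package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// maxMessageSize is an illustrative cap; a peer announcing anything larger
// is treated as misbehaving instead of being allocated for.
const maxMessageSize = 8 << 20 // 8 MiB

var errMsgTooLarge = errors.New("message size exceeds limit")

// readMsg reads a 4-byte big-endian length prefix followed by the payload,
// rejecting the message before allocation if the announced size is absurd
// (like the ~3 GB request, 0xb1b99a33 bytes, seen in the crash above).
func readMsg(r io.Reader) ([]byte, error) {
	var hdr [4]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return nil, err
	}
	size := binary.BigEndian.Uint32(hdr[:])
	if size > maxMessageSize {
		return nil, errMsgTooLarge
	}
	buf := make([]byte, size)
	_, err := io.ReadFull(r, buf)
	return buf, err
}

func main() {
	// Tiny demonstration with an in-memory "wire".
	var wire bytes.Buffer
	payload := []byte("hello")
	binary.Write(&wire, binary.BigEndian, uint32(len(payload)))
	wire.Write(payload)

	msg, err := readMsg(&wire)
	fmt.Println(string(msg), err)
}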

@whyrusleeping
Member

@jbenet o/ PING

@jbenet
Member

jbenet commented May 1, 2015

wow, good find!

@okket
Author

okket commented May 2, 2015

finally :)

@okket okket closed this as completed May 2, 2015
@jbenet jbenet added backlog and removed backlog labels May 2, 2015
@whyrusleeping
Member

👏

@jbenet
Member

jbenet commented May 2, 2015

thanks @okket @whyrusleeping -- this is great -- 👏 👏
