
runtime: high CPU usage with high timer adjustments #41699

Open · dobegor opened this issue Sep 29, 2020 · 15 comments

Comments

@dobegor commented Sep 29, 2020

What version of Go are you using (go version)?

$ go version
go version go1.15.2 linux/amd64

Does this issue reproduce with the latest release?

Yes. This performance regression reproduces only on Go 1.14+. Go 1.13.15 is unaffected.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOHOSTOS="linux"
GOARCH="amd64"

What did you do?

package main

import (
	"time"
)

func main() {
	// Start 50,000 goroutines that each repeatedly reshuffle a timer.
	for i := 0; i < 50000; i++ {
		go func() {
			t := time.NewTimer(5 * time.Second)

			for {
				// After the first iteration the timer has ~10s left, so
				// Reset(5s) moves it to an earlier point (the expensive
				// case for the runtime) and Reset(10s) moves it later again.
				t.Reset(5 * time.Second)
				t.Reset(10 * time.Second)
				time.Sleep(20 * time.Millisecond)
			}
		}()
	}

	// Block forever.
	<-make(chan struct{})
}

Running on an 8-core Google Cloud machine.

What did you expect to see?

The same or lower CPU usage on Go 1.15.2 as on Go 1.13.15.

What did you see instead?

~70% CPU usage on Go 1.13.15 and 100% CPU usage on Go 1.15.2.

The real story behind this issue

We have a "worker" application that connects to email accounts via IMAP and listens to them. There's one goroutine per connection, with a total of ~160k connections per worker node (a 4-core machine on GCP).

Sometimes a connection's deadline is moved (via conn.SetReadDeadline) to an earlier point than was previously set; application logic requires this. Moving deadlines earlier causes runtime.adjusttimers to do a lot of work.
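
For illustration, a minimal sketch of that pattern; this is not the actual worker code, and the durations and the net.Pipe setup are assumptions made for the example:

package main

import (
	"net"
	"time"
)

// shortenDeadline shows the pattern described above: the connection
// already has a long read deadline, and application logic then moves
// it to an earlier point. On Go 1.14/1.15 these earlier adjustments
// are what runtime.adjusttimers has to clean up.
func shortenDeadline(conn net.Conn) error {
	// Initial, generous idle deadline.
	if err := conn.SetReadDeadline(time.Now().Add(10 * time.Minute)); err != nil {
		return err
	}
	// ... later, the logic decides it needs an answer sooner and pulls
	// the deadline in, rescheduling the underlying runtime timer.
	return conn.SetReadDeadline(time.Now().Add(5 * time.Second))
}

func main() {
	// net.Pipe gives an in-memory net.Conn pair so the sketch runs
	// without a real network.
	c1, c2 := net.Pipe()
	defer c1.Close()
	defer c2.Close()
	if err := shortenDeadline(c1); err != nil {
		panic(err)
	}
}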

For example, here is the pprof profile with production workload: https://gist.github.com/dobegor/80134e0d4d997cf03fa8f429da72d753

Upgrading from Go 1.13 to 1.14/1.15.2 raised CPU usage from ~80% to 100% and caused a huge degradation of our business metrics. We were unable to move off Go 1.13, because the same code took far more resources to run, until we implemented workarounds that avoid moving a deadline to a sooner point.

/cc @ianlancetaylor @ChrisHines -- thanks for reaching out to help find the root cause of this issue.

@ianlancetaylor (Contributor) commented Sep 30, 2020

Thanks for the test case. I wrote https://golang.org/cl/258303, which helps with that test case. It would be interesting to know whether it helps with your real code. Thanks.

@gopherbot commented Sep 30, 2020

Change https://golang.org/cl/258303 mentions this issue: runtime: don't always adjust timers

@dobegor (Author) commented Oct 1, 2020

@ianlancetaylor Unfortunately that did not help much. CPU usage is still at 95-100%, but there's no runtime.adjusttimers in the pprof top output anymore.
https://gist.github.com/dobegor/e64c5d70a59d6b24b4d70d9394e1de55

It's certainly lower than without the CL, but not nearly as good as on Go 1.13.15.

I'd add that this issue is no longer a problem for us, since we've re-implemented our IMAP idling logic so that it no longer involves moving connection deadlines to a sooner point. These pprof results are from an older version of our worker application, deployed specifically to test the CL. Still, there may be valid use cases for moving deadlines this way, so I wouldn't close this.

@ianlancetaylor (Contributor) commented Oct 1, 2020

@dobegor Thanks for testing it. Unfortunately that profile doesn't tell me anything useful. If you are interested in pursuing this, it would be very helpful to have a sample program that still shows the CPU regression. I do not see the CPU regression in the sample program that you already provided. That leads me to believe that there must be some relevant difference in your real application, but I don't know what it might be.

@rs (Contributor) commented Oct 2, 2020

We are experiencing the exact same issue. Upgrading from 1.13 to 1.15 makes our software (an HTTP reverse proxy) take 40% more CPU, with runtime.adjusttimers being the top function.

With your patch, the CPU consumption does not change, but the top function becomes runtime.timeSleepUntil instead.

@ianlancetaylor (Contributor) commented Oct 2, 2020

@rs Thanks for the info. I've updated https://golang.org/cl/258303. Would it be possible for you to give it another try? Thanks.

@rs (Contributor) commented Oct 3, 2020

With this patch, the delta is down to a 10% increase instead of 40%. Nothing really stands out anymore in the top output. The next "higher than before" functions are runtime.scanobject and runtime.lock2 (see screenshot):

[screenshot: top output, 2020-10-03]

If I compare flame graphs, runtime.checkTimers seems to be the one adding the 3% on the selected pair (top is control, bottom is canary):

[image: flame graph comparison, control vs. canary]

Here is a zoom on runtime.checkTimers:

[image: runtime.checkTimers flame graph zoom]

@ianlancetaylor (Contributor) commented Oct 3, 2020

Thanks for the detailed examination. I've adjusted checkTimers to take advantage of the new field. It might be worth trying https://golang.org/cl/258303 again, at your convenience.

@rs (Contributor) commented Oct 4, 2020

We are down to an average 4% CPU increase, though still up to 10-20% on some canary pairs.

@ianlancetaylor (Contributor) commented Oct 5, 2020

@rs Thanks. In the absence of a test case I think I'm going to leave it there. We know that the new timer code is significantly better for a good range of real code; it's OK if it's a little worse for some code.

@dobegor I would still very much like to hear whether https://golang.org/cl/258303 helps at all with your real code. Thanks.

@rs (Contributor) commented Oct 6, 2020

@ianlancetaylor For us, 4% is still pretty high and costly given our scale. The code we are running is not particularly special; it's basically a simple reverse proxy built on top of net/http and x/net/http2. I'm surprised this issue does not seem to hit more people.

What do you suggest to make our code less affected by this issue? We do not update timers, and we don't use many of them apart from a deadline context to expire client requests, which is almost never reached.
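
For context, the timer usage described here is essentially the standard per-request deadline pattern. A minimal sketch follows; the backend URL, timeout value, and port are made-up values, not rs's actual configuration:

package main

import (
	"context"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

func main() {
	// Hypothetical backend; the real proxy setup is not shown in the issue.
	target, err := url.Parse("http://backend.internal:8080")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// One runtime timer per request: the deadline context. The
		// timeout is almost never reached; cancel() stops the timer
		// when the request completes.
		ctx, cancel := context.WithTimeout(r.Context(), 30*time.Second)
		defer cancel()
		proxy.ServeHTTP(w, r.WithContext(ctx))
	})

	log.Fatal(http.ListenAndServe(":8081", handler))
}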

@ianlancetaylor (Contributor) commented Oct 6, 2020

@rs I agree that 4% is higher than I would like, but there is only so much I can do without a test case. I'm essentially guessing. So what I suggest is: if at all possible, give me a test case that I can run to see the results that you are seeing. Thanks.

@dobegor (Author) commented Oct 6, 2020

@ianlancetaylor

> whether https://golang.org/cl/258303 helps at all with your real code

Will do that this week and get back to you with the results.

@rs (Contributor) commented Oct 6, 2020

> @rs I agree that 4% is higher than I would like, but there is only so much I can do without a test case. I'm essentially guessing. So what I suggest is: if at all possible, give me a test case that I can run to see the results that you are seeing. Thanks.

OK, I'll try to come up with a test case.

@rs (Contributor) commented Oct 19, 2020

@ianlancetaylor: I can't find a way to isolate the issue. In all the synthetic tests I've done trying to simulate what we do in production, the more timers I add, the better Go 1.15 behaves compared to 1.13 (as you would expect). Our use of timers isn't unreasonable anyway; the only timer we have in the critical path is a context deadline for each client request.
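
A minimal sketch of the kind of synthetic test being described; the goroutine count, iteration count, and durations are arbitrary assumptions, since the actual test code is not included in the issue:

package main

import (
	"context"
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				// One timer per simulated request, cancelled well before
				// its 30s deadline, mimicking a proxy whose per-request
				// timeout is almost never reached.
				ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
				_ = ctx
				time.Sleep(time.Millisecond) // simulated request handling
				cancel()
			}
		}()
	}
	wg.Wait()
}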

Is there any info I could provide that would help you understand what's going on?
