
runtime: high CPU usage with high timer adjustments #41699

Closed

dobegor opened this issue Sep 29, 2020 · 27 comments

@dobegor commented Sep 29, 2020

What version of Go are you using (go version)?

$ go version
go version go1.15.2 linux/amd64

Does this issue reproduce with the latest release?

Yes. This performance regression reproduces only on Go 1.14+. Go 1.13.15 is unaffected.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOHOSTOS="linux"
GOARCH="amd64"

What did you do?

package main

import (
	"time"
)

func main() {
	// Spawn many goroutines that each repeatedly reset a timer: first to an
	// earlier deadline, then to a later one. Each reset forces the runtime
	// to adjust a timer that is already pending, which is the pattern that
	// triggers the regression.
	for i := 0; i < 50000; i++ {
		go func() {
			t := time.NewTimer(5 * time.Second)

			for {
				t.Reset(5 * time.Second)
				t.Reset(10 * time.Second)
				time.Sleep(20 * time.Millisecond)
			}
		}()
	}

	// Block forever.
	<-make(chan struct{})
}

Running on an 8-core Google Cloud machine.

What did you expect to see?

The same or lower CPU usage on Go 1.15.2 as on Go 1.13.15.

What did you see instead?

~70% CPU usage on Go 1.13.15 and 100% CPU usage on Go 1.15.2.

Actual story of this issue

We have a "worker" application that connects to email accounts via IMAP and listens to them. There's one goroutine per connection, with a total of ~160k connections per worker node (a 4-core machine on GCP).

Sometimes a connection's deadline is moved (conn.SetReadDeadline) to an earlier point than was previously set. This is a requirement of the application logic, and it causes runtime.adjusttimers to do a lot of work.

For example, here is the pprof profile with production workload: https://gist.github.com/dobegor/80134e0d4d997cf03fa8f429da72d753

Upgrading from Go 1.13 to 1.14/1.15.2 raised CPU usage from ~80% to 100% and caused a huge degradation in our business metrics. We were unable to move off Go 1.13, because the same code takes far more resources on newer releases, until we implemented workarounds to avoid moving deadlines to an earlier point.

/cc @ianlancetaylor @ChrisHines -- thanks for reaching out to help find the root cause of this issue.

@ianlancetaylor (Contributor) commented Sep 30, 2020

Thanks for the test case. I wrote https://golang.org/cl/258303, which helps with that test case. It would be interesting to know whether it helps with your real code. Thanks.

@gopherbot commented Sep 30, 2020

Change https://golang.org/cl/258303 mentions this issue: runtime: don't always adjust timers

@dobegor (Author) commented Oct 1, 2020

@ianlancetaylor Unfortunately that did not help much. CPU usage is still at 95-100%, but there's no runtime.adjusttimers in pprof top output anymore.
https://gist.github.com/dobegor/e64c5d70a59d6b24b4d70d9394e1de55

It's certainly lower than without the CL, but not nearly as good as on Go 1.13.15.

I'd add that this issue is no longer a problem for us, since we've re-implemented our IMAP idling logic so that it no longer involves moving connection deadlines to an earlier point. These pprof results are from an older version of our worker application deployed specifically to test the CL. Still, there may be valid cases for moving deadlines like this, so I'd rather not close this.

@ianlancetaylor (Contributor) commented Oct 1, 2020

@dobegor Thanks for testing it. Unfortunately that profile doesn't tell me anything useful. If you are interested in pursuing this, it would be very helpful to have a sample program that still shows the CPU regression. I do not see the CPU regression in the sample program that you already provided. That leads me to believe that there must be some relevant difference in your real application, but I don't know what it might be.

@rs (Contributor) commented Oct 2, 2020

We are experiencing the exact same issue. Upgrading from 1.13 to 1.15 makes our software (an HTTP reverse proxy) take 40% more CPU, with runtime.adjusttimers being the top function.

When testing your patch, it doesn't change the CPU consumption, but the top function becomes runtime.timeSleepUntil instead.

@ianlancetaylor (Contributor) commented Oct 2, 2020

@rs Thanks for the info. I've updated https://golang.org/cl/258303. Would it be possible for you to give it another try? Thanks.

@rs (Contributor) commented Oct 3, 2020

With this patch, the delta is down to a 10% increase instead of 40%. Nothing really stands out in the top anymore. The next "higher than before" functions are runtime.scanobject and runtime.lock2 (see screenshot):

[screenshot: pprof top, 2020-10-03]

If I compare flamegraphs, runtime.checkTimers seems to be the one adding 3% on the selected pair (top is control, bottom is canary):

[image: flamegraph comparison, control vs canary]

Here is a zoom on runtime.checkTimers:

[image: flamegraph zoom on runtime.checkTimers]

@ianlancetaylor (Contributor) commented Oct 3, 2020

Thanks for the detailed examination. I've adjusted checkTimers to take advantage of the new field. It might be worth trying https://golang.org/cl/258303 again, at your convenience.

@rs (Contributor) commented Oct 4, 2020

We are down to an average 4% CPU increase, with still up to a 10-20% increase on some canary pairs.

@ianlancetaylor (Contributor) commented Oct 5, 2020

@rs Thanks. In the absence of a test case I think I'm going to leave it there. We know that the new timer code is significantly better for a good range of real code; it's OK if it's a little worse for some code.

@dobegor I would still very much like to hear whether https://golang.org/cl/258303 helps at all with your real code. Thanks.

@rs (Contributor) commented Oct 6, 2020

@ianlancetaylor For us, 4% is still pretty high and costly given our scale. The code we are running is not particularly special; it's basically a simple reverse proxy built on top of net/http and x/net/http2. I'm surprised this issue doesn't seem to hit more people.

What do you suggest to make our code less affected by this issue? We don't update timers, and we don't use many of them apart from a deadline context to expire client requests, which is almost never reached.

@ianlancetaylor (Contributor) commented Oct 6, 2020

@rs I agree that 4% is higher than I would like, but there is only so much I can do without a test case. I'm essentially guessing. So what I suggest is: if at all possible, give me a test case that I can run to see the results that you are seeing. Thanks.

@dobegor (Author) commented Oct 6, 2020

@ianlancetaylor

whether https://golang.org/cl/258303 helps at all with your real code

Will do that this week and get back to you with the results.

@rs (Contributor) commented Oct 6, 2020

> @rs I agree that 4% is higher than I would like, but there is only so much I can do without a test case. I'm essentially guessing. So what I suggest is: if at all possible, give me a test case that I can run to see the results that you are seeing. Thanks.

Ok I'll try to come up with a test case.

@rs (Contributor) commented Oct 19, 2020

@ianlancetaylor: I can't find a way to isolate the issue. In all the synthetic tests I've done trying to simulate what we do in production, the more timers I add, the better Go 1.15 behaves compared to 1.13 (as you would expect). We don't make unreasonable use of timers anyway; the only timer we have in the critical path is a context deadline for each client request.

Is there any info I could provide that would help you understand what's going on?

@ianlancetaylor (Contributor) commented Oct 20, 2020

@rs I suppose updated profiles with the most recent version of the CL might help. Basically: where is the time going? Also, what can you tell me about how your application uses timers?

For that matter, are we sure that timers are the reason for the slowdown from Go 1.13 to Go 1.15? There were, of course, other changes.

@dobegor Have you had any chance to try the later version of https://golang.org/cl/258303? Thanks.

@rs (Contributor) commented Oct 20, 2020

@ianlancetaylor here it is.

Go 1.15 (right):
[image: profile]

Go 1.15 (down):
[image: profile]

Edit: fixed the control (it wasn't the right code version)

@ianlancetaylor (Contributor) commented Oct 20, 2020

Thanks. I don't see anything obviously timer related there. The only possible timer related change I see there is a possible slowdown in findrunnable, which will likely be addressed by https://golang.org/cl/232298.

@rs (Contributor) commented Oct 20, 2020

I just edited the comment, the control wasn't right. Sorry about that.

@ianlancetaylor (Contributor) commented Oct 22, 2020

I suppose I still don't see anything obviously timer related.

gopherbot closed this in b4b0144 on Oct 27, 2020

@dobegor (Author) commented Oct 27, 2020

@ianlancetaylor I'm sorry for delaying the testing of the CL on our use case. I will post the test results ASAP, as I've just got back from sick leave; this COVID thing is no joke.

@ianlancetaylor (Contributor) commented Oct 27, 2020

Sorry to hear that, feel better soon.

No rush on testing, I went ahead and submitted the change anyhow. It should be an improvement even if it doesn't fix the whole problem.

@agirbal commented Jan 28, 2021

Hey @ianlancetaylor, I'm following up on @rs's report, as we are still experiencing this with Go 1.15.7. My initial digging showed a large amount of CPU used by timeSleepUntil in sysmon with Go 1.15, which led me to try delaying sysmon (as reported in #18236).

The graph below shows canary results:

  • baseline is always vanilla Go 1.13
  • beginning of the graph until 11:00 has vanilla Go 1.15 as canary, resulting in a fairly massive ~15% increase; our busiest sites see over a 100% increase
  • 11:00 to 16:00 uses Go 1.15 with sysmon patched to 10ms; we see a ~10% increase, so this gets back about 5%
  • 16:00 onward uses Go 1.13 with sysmon patched to 10ms; at this point we see a gain of 3%

[image: canary CPU graph]

A better view of some of our busiest sites shows the trend: a large CPU increase from Go 1.15, helped somewhat by the sysmon change:

[image: per-site CPU graphs]

From the profiler, the main difference we now see is runtime.adjusttimers taking a lot of CPU:

[image: Go 1.15 profile]

Compared to Go 1.13:

[image: Go 1.13 profile]

Thank you for any pointers! We're stuck on Go 1.13 :(
PS: we run on FreeBSD.

@prattmic (Member) commented Jan 28, 2021

@agirbal could you test with Go 1.16 as well (rc1 should be released today!)? 1.16 contains http://golang.org/cl/258303, which is specifically intended to reduce time in adjusttimers.

Also, could you clarify which configuration the profile containing adjusttimers is from? Is that 1.15 with the sysmon frequency reduced, or baseline 1.15?

@agirbal commented Jan 28, 2021

@prattmic ok I just tried 1.16rc1 and the issue seems gone! That is amazing news, we can get back on top of our upgrades, thank you.

In graph below:

  • baseline is vanilla Go 1.13
  • until 11:00 the canary is Go 1.13 with the sysmon reduction, showing about 4% savings
  • 11:00 to 14:00 the canary is Go 1.16 with the sysmon reduction, showing about 4% savings
  • from 14:00 the canary is vanilla Go 1.16, which seems to have a slight regression of less than 1%

[image: canary CPU graph]

I think for our purposes the <1% increase is acceptable, and we won't have much time to dig further. The sysmon part is interesting though; maybe we can discuss it in #18236.

@dobegor (Author) commented Jan 29, 2021

I can confirm that using 1.16rc1, with the latest version of CL 258303, showed no measurable performance hit compared to the Go 1.13.15 baseline on our production workloads.

@prattmic (Member) commented Jan 29, 2021

Great! I think we are done here, and agreed that sysmon discussion can go on #18236.
