-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve counter performance with per-CPU sharded values once they are available in Go #677
Comments
Approaches like this were considered. However, I couldn't find a good general purpose interface for that. Very few users will actually need this kind of speedup. I don't want to make the interface much more complicated for the normal use case. The few users that need to do a buffered counting can easily implement it on top of the current primitives. I'd be all ears if you have good ideas for an interface that would cater for all use cases. My preliminary conclusion is that we need Go to give access to core-local variables. Interestingly, the Go runtime is already using that for some stuff in the sync package. So there is hope this will be made available normally one day. |
This rabbit hole can go deep. There are tons of discussions and (denied) proposals. For those interested, there is an outstanding proposal which is supposed to supersede all those other proposals to address this. More specifically, lots of people have exactly our use case: how can we count without blowing cache lines? For those interested, here it is: golang/go#18802 Also, there are a few hacks/workarounds shown there. One of them directly linked to internal I'm not proposing anything here (for now), merely pointing to the fact this is a pain point in Go. |
These are great pointers. Thanks a lot. I'm especially glad that finally some movement seems to happen on the topic. |
I don't know how much we should be excited that the proposal exists. It's been around since 26 Jan 2017 (almost two years now), and is on hold. The reason being is that apparently people cannot agree what problem / use case this should target. "I wat to count fast on 64-core machines" is one use case. But there are other. And the group is "undecided". Is it worth one of Prometheus maintainers to voice our need on that thread to prod it in the right direction? In the meantime, what if we assume the feature does exist already? I think we can hack it using something along lines of fastcount.go. How would we implement this? I think I have an idea, I can give it a stab in the next few days as time permits and do a demo PR for people to have a look. Is that an OK approach? If people like it enough, maybe "hacky" can be surfaced up to the users as "experimental, use at your own risk" feature. But this is too early to discuss. |
First of all, I'd like to see how relevant use cases are where we need the hack. As you have already demonstrated, you can work around with "buffering" increments in the local goroutine and then do a batched increment every now and then. If there are like five people in the world that have to do that to solve their use case, and the other 8807 packages importing are happy with the current state, I wouldn't do anything. This package is already on the side of too many moving parts. I have worked relentlessly to bring that down a bit over the past years. If at all, I would offer hacky and experimental solutions in a separate package. |
Yes that's fair enough. Sounds reasonable. With regards to finding if there are use cases here. I guess we just leave this issue open and see if anyone expresses interest? Unless there are better ideas. One possibility is that everyone who had the use case solved it for themselves already either with buffering or "hacks" :) Still might do the "hacky" version out of pure experiment and curiosity. |
I see my benchmark dropping from around 20k requests per second down to 13k requests per second simply after adding the go default collector. Could this be related to the atomic counting? |
TL;DR: Prometheus counters use atomic primitives, which is slow. If I didn't do something stupid, the buffered counter implementation gives potenial speedup of 200x in hot loops across multiple goroutines, and up to 6x speedup for "local" use. The buffered version approaches "you can't do it faster" version.
Also may apply to other things like gauges.
Full version plus benchmarks: https://github.com/ppanyukov/go-bench/tree/master/counter
The text was updated successfully, but these errors were encountered: