
Improve counter performance with per-CPU sharded values once they are available in Go #677

Open
ppanyukov opened this issue Nov 19, 2019 · 8 comments

Comments

@ppanyukov

TL;DR: Prometheus counters use atomic primitives, which are slow under contention. Unless I did something stupid, the buffered counter implementation gives a potential speedup of up to 200x in hot loops across multiple goroutines, and up to 6x for "local" use. The buffered version approaches the "you can't do it any faster" baseline.

This may also apply to other metric types, such as gauges.

Full version plus benchmarks: https://github.com/ppanyukov/go-bench/tree/master/counter
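
For a flavour of what the linked benchmarks compare, here is a minimal sketch (illustrative, not the actual code from that repo): every goroutine hammering one shared atomic word contends on a single cache line, whereas goroutine-local counting with a single flush at the end does not.

```go
package counter_test

import (
	"sync/atomic"
	"testing"
)

var shared uint64

func BenchmarkSharedAtomic(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			// Every core fights over the same cache line.
			atomic.AddUint64(&shared, 1)
		}
	})
}

func BenchmarkLocalThenAdd(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		var local uint64
		for pb.Next() {
			local++ // goroutine-local, no contention
		}
		// One atomic operation per goroutine instead of one per increment.
		atomic.AddUint64(&shared, local)
	})
}
```

On a multi-core machine the second benchmark typically runs orders of magnitude faster, which is where the speedups quoted above come from.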

@beorn7
Member

beorn7 commented Nov 20, 2019

Approaches like this have been considered. However, I couldn't find a good general-purpose interface for them. Very few users will actually need this kind of speedup, and I don't want to make the interface much more complicated for the normal use case.

The few users that need buffered counting can easily implement it on top of the current primitives, as sketched below.
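
For illustration, buffered counting on top of the current primitives might look like this minimal sketch (bufferedCounter and its flush threshold are made-up names, not part of this library):

```go
package counterutil

import "github.com/prometheus/client_golang/prometheus"

// bufferedCounter accumulates increments goroutine-locally and flushes them
// to the shared Prometheus counter in batches, so the hot path performs no
// atomic operations. It must not be shared between goroutines, and pending
// increments are invisible to scrapes until Flush is called.
type bufferedCounter struct {
	c       prometheus.Counter
	pending float64
	limit   float64 // flush once this many increments have accumulated
}

func (b *bufferedCounter) Inc() {
	b.pending++
	if b.pending >= b.limit {
		b.Flush()
	}
}

// Flush pushes the buffered count to the underlying counter. Call it when
// the goroutine finishes its hot loop so no increments are lost.
func (b *bufferedCounter) Flush() {
	if b.pending > 0 {
		b.c.Add(b.pending)
		b.pending = 0
	}
}
```

The trade-off is staleness: a scrape that happens between flushes undercounts by up to the buffered amount per goroutine.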

I'd be all ears if you have good ideas for an interface that would cater for all use cases.

My preliminary conclusion is that we need Go to give us access to core-local variables. Interestingly, the Go runtime already uses them internally for some things in the sync package. So there is hope that this will become generally available one day.

@ppanyukov
Author

> core-local variables

This rabbit hole goes deep. There are tons of discussions and (declined) proposals. There is one outstanding proposal that is supposed to supersede all the others and address this. More to the point, lots of people have exactly our use case: how can we count without thrashing cache lines?

For those interested, here it is: golang/go#18802

Also, a few hacks/workarounds are shown there. One of them links directly to the internal runtime.procPin() and runtime.procUnpin() functions and uses some very hacky asm, but hey! :) See here: https://github.com/cespare/misc/blob/master/fastcount/fastcount.go (a rough sketch of the idea follows below).
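
To make the trick concrete, here is a rough sketch of the per-P sharding it enables. Everything about it is hacky and assumed: go:linkname-ing the internal runtime.procPin()/runtime.procUnpin() is unsupported and can break with any Go release, and the fastcount package name is borrowed from the linked file, not a real import.

```go
package fastcount

import (
	"runtime"
	"sync/atomic"
	_ "unsafe" // required for go:linkname
)

// Unsupported hack: link against internal runtime functions. procPin
// returns the id of the current P (in [0, GOMAXPROCS)) and disables
// preemption until procUnpin is called.
//
//go:linkname procPin runtime.procPin
func procPin() int

//go:linkname procUnpin runtime.procUnpin
func procUnpin()

// shard pads each count to its own cache line to avoid false sharing.
type shard struct {
	n uint64
	_ [56]byte
}

// Counter shards its value per P. Note it does not handle GOMAXPROCS
// growing after construction.
type Counter struct {
	shards []shard
}

func New() *Counter {
	return &Counter{shards: make([]shard, runtime.GOMAXPROCS(0))}
}

func (c *Counter) Inc() {
	i := procPin()
	atomic.AddUint64(&c.shards[i].n, 1) // uncontended in the common case
	procUnpin()
}

// Value sums all shards; it may race with concurrent Incs, which is fine
// for a monotonic counter read at scrape time.
func (c *Counter) Value() uint64 {
	var total uint64
	for i := range c.shards {
		total += atomic.LoadUint64(&c.shards[i].n)
	}
	return total
}
```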

I'm not proposing anything here (for now), merely pointing out that this is a pain point in Go.

@beorn7
Member

beorn7 commented Nov 23, 2019

These are great pointers, thanks a lot. I'm especially glad that some movement finally seems to be happening on the topic.
I'll leave this issue open and change the title to reflect what we'll do.

@beorn7 beorn7 changed the title Improve counter performance with regards to atomic Improve counter performance with per-CPU sharded values once they are available in Go Nov 23, 2019
@beorn7 beorn7 self-assigned this Nov 23, 2019
@ppanyukov
Author

I don't know how excited we should be that the proposal exists. It's been around since 26 Jan 2017 (almost three years now) and is on hold, apparently because people cannot agree on which problem / use case it should target. "I want to count fast on 64-core machines" is one use case, but there are others, and the group is "undecided".

Would it be worth having one of the Prometheus maintainers voice our need on that thread to nudge it in the right direction?

In the meantime, what if we assume the feature already exists? I think we can hack it together along the lines of fastcount.go. How would we implement this? I think I have an idea; I can give it a stab in the next few days as time permits and open a demo PR for people to look at. Is that an OK approach?
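
One possible shape for such a demo, sketched under assumptions: expose the aggregated value of the hypothetical sharded counter from the earlier sketch through the existing prometheus.NewCounterFunc API, so the library itself needs no changes (example.com/fastcount is a placeholder import path):

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"

	"example.com/fastcount" // hypothetical package from the earlier sketch
)

func main() {
	fast := fastcount.New()

	// The CounterFunc reads the aggregated shard total only at scrape
	// time, so the hot path never touches a shared atomic.
	prometheus.MustRegister(prometheus.NewCounterFunc(
		prometheus.CounterOpts{
			Name: "demo_ops_total",
			Help: "Operations counted via a per-P sharded counter (demo).",
		},
		func() float64 { return float64(fast.Value()) },
	))

	for i := 0; i < 1000000; i++ {
		fast.Inc()
	}
}
```

NewCounterFunc requires the function to be monotonically non-decreasing, which the shard sum is as long as the shard slice stays fixed.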

If people like it enough, maybe the "hacky" version can be surfaced to users as an "experimental, use at your own risk" feature. But it is too early to discuss that.

@beorn7
Member

beorn7 commented Nov 27, 2019

First of all, I'd like to see how relevant the use cases are that actually need the hack. As you have already demonstrated, you can work around it by "buffering" increments in the local goroutine and then doing a batched increment every now and then. If there are like five people in the world that have to do that to solve their use case, and the other 8807 packages importing it are happy with the current state, I wouldn't do anything.

This package already errs on the side of too many moving parts, and I have worked relentlessly to bring that down a bit over the past years. If anything, I would offer hacky and experimental solutions in a separate package.

@ppanyukov
Author

Yes, that's fair enough. Sounds reasonable.

As for finding out whether there are use cases here: I guess we just leave this issue open and see if anyone expresses interest? Unless there are better ideas. One possibility is that everyone who had this use case has already solved it for themselves, either with buffering or with "hacks" :)

I might still do the "hacky" version purely out of experimentation and curiosity.

@beorn7
Member

beorn7 commented Jun 2, 2021

I mostly assigned this to myself as a “default” because I used to be the maintainer of this repo. Therefore, I'm un-assigning myself now. New maintainers @bwplotka & @kakkoyun, please deal with this as you see fit.

@beorn7 beorn7 removed their assignment Jun 2, 2021
@timo-klarshift

I see my benchmark drop from around 20k requests per second to 13k requests per second simply after adding the Go default collector. Could this be related to the atomic counting?
