Optimise alert ingest path #1201
Need to run the benchmarks again for exact numbers. I remember that the channel that alerts are sent to in the dispatcher had the longest wait times. Increasing the channel buffer improved this a bit, but obviously only pushes out the issue. A question we need to ask ourselves as well is: what kind of load are we expecting Alertmanager to handle? (Nonetheless, the limit should be resource-bound, not a technical limitation.)
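As an illustration of the pattern being discussed, here is a minimal sketch (not Alertmanager's actual dispatcher code): alerts are pushed into a channel that a single consumer drains, so enlarging the buffer only absorbs bursts and delays the point at which senders start blocking.

```go
package main

import (
	"fmt"
	"time"
)

// Alert is a stand-in for Alertmanager's alert type in this sketch.
type Alert struct{ Name string }

func main() {
	// A buffered channel absorbs short bursts, but once the consumer falls
	// behind, senders block on ch <- a just as they would with an unbuffered
	// channel — the buffer only postpones the backpressure.
	ch := make(chan *Alert, 200)

	// Dispatcher-like consumer: deliberately slower than the producer.
	go func() {
		for a := range ch {
			time.Sleep(time.Millisecond) // simulate grouping/routing work
			_ = a
		}
	}()

	start := time.Now()
	for i := 0; i < 1000; i++ {
		ch <- &Alert{Name: fmt.Sprintf("alert-%d", i)} // blocks once the buffer is full
	}
	close(ch)
	fmt.Printf("enqueued 1000 alerts in %s\n", time.Since(start))
}
```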
I think a million active alerts is a reasonable starting point. See also prometheus/prometheus#2585.
Hey! I am a graduate student who wants to apply for GSoC this summer. I have some prior Go experience and I am looking for a Go performance optimization related project. Is the problem here that the dispatcher cannot send out alert messages to all kinds of clients efficiently, or that data cannot be written into the dispatcher efficiently?
It's everything before the dispatcher that needs optimisation.
What sort of request rate are you expecting for these 1 million alerts? In what little testing I've done, AM had no issue maintaining several thousand active alerts (< 50,000), but had issues processing incoming alert batches when the rate hit ~30 requests/sec (the issue could occur at a lower rate; this was just the rate at which I noticed AM locking up).
With a 10s eval interval that'd be 100k alerts/s, which with a batch size of 64 would be ~1.5k requests/s. prometheus/prometheus#2585 can bring that down to ~250/s.
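For anyone following along, the arithmetic behind those figures works out as below; the numbers are the assumed targets from this thread, not benchmark results.

```go
package main

import "fmt"

func main() {
	// Assumed figures from the discussion above, not measurements.
	const (
		activeAlerts = 1_000_000.0 // target number of active alerts
		evalInterval = 10.0        // seconds, Prometheus rule evaluation interval
		batchSize    = 64.0        // alerts per request sent to Alertmanager
	)

	alertsPerSec := activeAlerts / evalInterval // 100,000 alerts/s
	requestsPerSec := alertsPerSec / batchSize  // ~1,562 requests/s, i.e. ~1.5k
	fmt.Printf("%.0f alerts/s -> %.0f requests/s\n", alertsPerSec, requestsPerSec)
	// Larger batches (see prometheus/prometheus#2585) would push the request
	// rate further down, to roughly 250/s per the comment above.
}
```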
Hey, I tried using Go's pprof to profile Alertmanager, but it seems this can only help us locate some inefficient function implementations. The whole workflow of Alertmanager includes deduplicating, grouping, inhibition, and routing; if we want to locate which stage is the bottleneck under high concurrency, it seems we need to manually write code to track and time functions? @brancz Do you have some benchmark code to share?
@starsdeep I had previously built and used this; I remember having taken mutex profiles and seen a lot of lock contention starting at the dispatcher.
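For reproducing that kind of measurement, here is a minimal, self-contained sketch of turning on Go's block and mutex profiles and exposing them over HTTP; the port and sampling rates are illustrative choices, not Alertmanager's actual configuration.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
	"runtime"
)

func main() {
	// Record every blocking event (channel sends/receives, mutex waits, etc.).
	// A rate of 1 is expensive; larger values sample less aggressively.
	runtime.SetBlockProfileRate(1)

	// Sample 1 out of every N mutex contention events; 1 records them all.
	runtime.SetMutexProfileFraction(1)

	// Serve the pprof endpoints. With this running, profiles can be pulled with:
	//   go tool pprof http://localhost:6060/debug/pprof/block
	//   go tool pprof http://localhost:6060/debug/pprof/mutex
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```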
@brancz I tried to use the ambench code to launch a benchmark. There are some goroutine block events, though there are no mutex contention events. PS: I launch the Alertmanager instances using "goreman start" with the "DEBUG=true" environment variable, and run the load test with
what does the block profile look like? 🙂
I would look closer at that Dispatcher.
@stuartnelson3 Hi, about "AM had no issue maintaining several thousand active alerts (< 50,000)" — do you have the corresponding benchmark data?
From @brancz:
While benchmarking & profiling Alertmanager, it quickly became obvious that the ingest path blocks a lot.
Can you share some numbers about lock wait times?