Optimise alert ingest path #1201

Open
gouthamve opened this issue Jan 16, 2018 · 13 comments

Comments

@gouthamve (Member) commented Jan 16, 2018

From @brancz:
While benchmarking & profiling Alertmanager, it quickly became obvious that the ingest path blocks a lot.

Can you share some numbers about lock wait times?

@brancz (Member) commented Jan 16, 2018

I need to run the benchmarks again for exact numbers. I remember that the channel alerts are sent to in the dispatcher had the longest wait times. Increasing the channel buffer improved this a bit, but obviously only pushes the problem further out.
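To make the shape of that hand-off concrete, here is a minimal sketch (illustrative only, not Alertmanager's actual dispatcher code): producers on the ingest path block on the channel whenever the consumer is busy, and a larger buffer only absorbs bursts without removing the underlying contention.

```go
// Illustrative sketch of a dispatcher-style channel hand-off, not Alertmanager code.
package main

import (
	"fmt"
	"time"
)

type alert struct{ name string }

func main() {
	// ch := make(chan alert)   // unbuffered: every send waits for the consumer
	ch := make(chan alert, 512) // buffered: sends only block once the buffer fills up

	done := make(chan struct{})
	go func() {
		defer close(done)
		for range ch {
			time.Sleep(time.Millisecond) // stand-in for grouping/inhibition/routing work
		}
	}()

	start := time.Now()
	for i := 0; i < 1024; i++ {
		ch <- alert{name: fmt.Sprintf("alert-%d", i)} // this is where the ingest path waits
	}
	fmt.Println("time spent enqueueing:", time.Since(start))
	close(ch)
	<-done
}
```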

A question we also need to ask ourselves is what kind of load we expect Alertmanager to handle. (Regardless, the limit should be resource-bound, not a technical limitation.)

@brian-brazil (Contributor)

I think a million active alerts is a reasonable starting point. See also prometheus/prometheus#2585

@starsdeep

Hey! I am a graduate student who wants to apply for GSoC this summer. I have some prior Go experience and I am looking for a Go performance-optimization project.

Is the problem here that the dispatcher cannot send alert messages out to all kinds of clients efficiently, or that data cannot be written into the dispatcher efficiently?

@brian-brazil (Contributor)

It's everything before the dispatcher that needs optimisation.

@stuartnelson3 (Contributor)

I think a million active alerts is a reasonable starting point.

What sort of request rate are you expecting for these 1 million alerts? In what little testing I've done, AM had no issue maintaining several thousand active alerts (< 50,000), but it had issues processing incoming alert batches when the rate hit ~30 requests/sec (the issue could occur at a lower rate; this was just the rate at which I noticed AM locking up).

@brian-brazil (Contributor)

With a 10s eval interval it'd be 100k alerts/s, which with a batch size of 64 would be ~1.5k requests/s. prometheus/prometheus#2585 can bring that down to ~250/s.
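Spelling out that arithmetic (a back-of-the-envelope sketch; the exact batch size after prometheus/prometheus#2585 is left as a comment rather than a hard number):

```go
// Back-of-the-envelope load estimate for the numbers discussed above.
package main

import "fmt"

func main() {
	const (
		activeAlerts = 1000000.0 // target order of magnitude from this thread
		evalInterval = 10.0      // seconds
		batchSize    = 64.0      // Prometheus' alert batch size at the time
	)

	alertsPerSec := activeAlerts / evalInterval // 100,000 alerts/s
	requestsPerSec := alertsPerSec / batchSize  // ~1,562 requests/s
	fmt.Printf("%.0f alerts/s -> ~%.0f requests/s at batch size %.0f\n",
		alertsPerSec, requestsPerSec, batchSize)
	// prometheus/prometheus#2585 increases the batch size, which is what brings
	// the request rate down towards the ~250/s mentioned above.
}
```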

@starsdeep commented Mar 3, 2018

Hey, I tried to use Go pprof to profile Alertmanager, but that seems to only help locate individual inefficient functions. The overall Alertmanager workflow includes deduplicating, grouping, inhibition, and routing; if we want to find out which stage is the bottleneck under high concurrency, it seems we would need to manually write code to track and time each stage?
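For what it's worth, here is a minimal standalone sketch (not Alertmanager's actual setup) of enabling Go's block and mutex profilers alongside the usual net/http/pprof endpoints; contention between the stages then shows up in the profiles without hand-written timing code:

```go
// Minimal sketch: expose block and mutex profiles over HTTP for pprof.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on DefaultServeMux
	"runtime"
)

func main() {
	runtime.SetBlockProfileRate(1)     // record every blocking event (costly; fine for a benchmark run)
	runtime.SetMutexProfileFraction(1) // record every mutex contention event

	// Inspect with, e.g.:
	//   go tool pprof http://localhost:6060/debug/pprof/block
	//   go tool pprof http://localhost:6060/debug/pprof/mutex
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```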

[pprof graph screenshot: pprof005]

@brancz Do you have some benchmark code to share?

@brancz (Member) commented Mar 5, 2018

@starsdeep I had previously built and used this; I remember taking mutex profiles and seeing a lot of lock contention starting at the dispatcher.

@starsdeep commented Mar 6, 2018

@brancz I tried to use the ambench code to run a benchmark. There are some goroutine block events, but no mutex contention events:

[profile screenshot]

PS: I launched the Alertmanager instances with "goreman start" and the "DEBUG=true" environment variable, and ran the load test with ./ambench -alertmanagers=http://localhost:9093,http://localhost:9094,http://localhost:9095
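For anyone reproducing this, a stripped-down sketch of the kind of batched POST such a load test sends (the batch size, labels, and target URL here are illustrative; the v1 API accepts a JSON array of alerts on /api/v1/alerts):

```go
// Stripped-down sketch of a batched alert POST, in the spirit of ambench.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// alert mirrors the JSON shape accepted by Alertmanager's v1 alerts endpoint.
type alert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations,omitempty"`
	StartsAt    time.Time         `json:"startsAt"`
	EndsAt      time.Time         `json:"endsAt"`
}

func main() {
	// Build one batch of 64 firing alerts with distinct label sets.
	batch := make([]alert, 0, 64)
	now := time.Now()
	for i := 0; i < 64; i++ {
		batch = append(batch, alert{
			Labels:   map[string]string{"alertname": "BenchAlert", "instance": fmt.Sprintf("host-%d", i)},
			StartsAt: now,
			EndsAt:   now.Add(5 * time.Minute),
		})
	}

	body, err := json.Marshal(batch)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.Post("http://localhost:9093/api/v1/alerts", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```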

@brancz (Member) commented Mar 7, 2018

What does the block profile look like? 🙂

@starsdeep

@brancz

[profile screenshots: 001, 002]

@brancz (Member) commented Mar 8, 2018

I would look more closely at that 90s attributed to the Dispatcher; that's in the ingest path of alerts being added through the API.

@glidea commented Sep 1, 2022

I think a million active alerts is a reasonable starting point.

What sort of request rate are you expecting for these 1 million alerts? In what little testing I've done, AM had no issue maintaining several thousand active alerts (< 50,000), but it had issues processing incoming alert batches when the rate hit ~30 requests/sec (the issue could occur at a lower rate; this was just the rate at which I noticed AM locking up).

@stuartnelson3 Hi, regarding "AM had no issue maintaining several thousand active alerts (< 50,000)": do you have the corresponding benchmark data?
