Optimise alert ingest path #1201

Open
gouthamve opened this issue Jan 16, 2018 · 13 comments

Comments

@gouthamve (Member) commented Jan 16, 2018

From @brancz:
While benchmarking & profiling Alertmanager, it quickly became obvious that the ingest path blocks a lot.

Can you share some numbers about lock wait times?

@brancz (Member) commented Jan 16, 2018

I need to run the benchmarks again for exact numbers. I remember that the channel alerts are sent to in the dispatcher had the longest wait times. Increasing the channel buffer improved this a bit, but obviously only pushes the problem further out.
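To make the shape of that hand-off concrete, here is a minimal sketch (illustrative only, not Alertmanager's actual dispatcher code): producers on the ingest path block on the channel whenever the consumer is busy, and a larger buffer only absorbs bursts without removing the underlying contention.

```go
// Illustrative sketch of a dispatcher-style channel hand-off, not Alertmanager code.
package main

import (
	"fmt"
	"time"
)

type alert struct{ name string }

func main() {
	// ch := make(chan alert)   // unbuffered: every send waits for the consumer
	ch := make(chan alert, 512) // buffered: sends only block once the buffer fills up

	done := make(chan struct{})
	go func() {
		defer close(done)
		for range ch {
			time.Sleep(time.Millisecond) // stand-in for grouping/inhibition/routing work
		}
	}()

	start := time.Now()
	for i := 0; i < 1024; i++ {
		ch <- alert{name: fmt.Sprintf("alert-%d", i)} // this is where the ingest path waits
	}
	fmt.Println("time spent enqueueing:", time.Since(start))
	close(ch)
	<-done
}
```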

A question we also need to ask ourselves is what kind of load we expect Alertmanager to handle. (Regardless, the limit should be resource-bound, not a technical limitation.)

@brian-brazil (Contributor)

I think a million active alerts is a reasonable starting point. See also prometheus/prometheus#2585

@starsdeep

Hey! I am a graduate student who wants to apply for GSoC this summer. I have some prior Go experience and I am looking for a Go performance-optimization project.

Is the problem here that the dispatcher cannot send alert messages out to all kinds of clients efficiently, or that data cannot be written into the dispatcher efficiently?

@brian-brazil (Contributor)

It's everything before the dispatcher that needs optimisation.

@stuartnelson3 (Contributor)

I think a million active alerts is a reasonable starting point.

What sort of request rate are you expecting for these 1 million alerts? In what little testing I've done, AM had no issue maintaining several thousand active alerts (< 50,000), but it had issues processing incoming alert batches when the rate hit ~30 requests/sec (the issue could occur at a lower rate; this was just the rate at which I noticed AM locking up).

@brian-brazil (Contributor)

With a 10s eval interval it'd be 100k alerts/s, which with a batch size of 64 would be ~1.5k requests/s. prometheus/prometheus#2585 can bring that down to ~250/s.
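Spelling out that arithmetic (a back-of-the-envelope sketch; the exact batch size after prometheus/prometheus#2585 is left as a comment rather than a hard number):

```go
// Back-of-the-envelope load estimate for the numbers discussed above.
package main

import "fmt"

func main() {
	const (
		activeAlerts = 1000000.0 // target order of magnitude from this thread
		evalInterval = 10.0      // seconds
		batchSize    = 64.0      // Prometheus' alert batch size at the time
	)

	alertsPerSec := activeAlerts / evalInterval // 100,000 alerts/s
	requestsPerSec := alertsPerSec / batchSize  // ~1,562 requests/s
	fmt.Printf("%.0f alerts/s -> ~%.0f requests/s at batch size %.0f\n",
		alertsPerSec, requestsPerSec, batchSize)
	// prometheus/prometheus#2585 increases the batch size, which is what brings
	// the request rate down towards the ~250/s mentioned above.
}
```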

@starsdeep commented Mar 3, 2018

Hey, I tried to use Go pprof to profile Alertmanager, but that seems to only help locate individual inefficient functions. The overall Alertmanager workflow includes deduplicating, grouping, inhibition, and routing; if we want to find out which stage is the bottleneck under high concurrency, it seems we would need to manually write code to track and time each stage?
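For what it's worth, here is a minimal standalone sketch (not Alertmanager's actual setup) of enabling Go's block and mutex profilers alongside the usual net/http/pprof endpoints; contention between the stages then shows up in the profiles without hand-written timing code:

```go
// Minimal sketch: expose block and mutex profiles over HTTP for pprof.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on DefaultServeMux
	"runtime"
)

func main() {
	runtime.SetBlockProfileRate(1)     // record every blocking event (costly; fine for a benchmark run)
	runtime.SetMutexProfileFraction(1) // record every mutex contention event

	// Inspect with, e.g.:
	//   go tool pprof http://localhost:6060/debug/pprof/block
	//   go tool pprof http://localhost:6060/debug/pprof/mutex
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```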

[pprof graph screenshot: pprof005]

@brancz Do you have some benchmark code to share?

@brancz (Member) commented Mar 5, 2018

@starsdeep I had previously built and used this; I remember taking mutex profiles and seeing a lot of lock contention starting at the dispatcher.

@starsdeep commented Mar 6, 2018

@brancz I tried to use the ambench code to run a benchmark. There are some goroutine block events, but no mutex contention events:

[profile screenshot]

PS: I launched the Alertmanager instances with "goreman start" and the "DEBUG=true" environment variable, and ran the load test with ./ambench -alertmanagers=http://localhost:9093,http://localhost:9094,http://localhost:9095
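For anyone reproducing this, a stripped-down sketch of the kind of batched POST such a load test sends (the batch size, labels, and target URL here are illustrative; the v1 API accepts a JSON array of alerts on /api/v1/alerts):

```go
// Stripped-down sketch of a batched alert POST, in the spirit of ambench.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// alert mirrors the JSON shape accepted by Alertmanager's v1 alerts endpoint.
type alert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations,omitempty"`
	StartsAt    time.Time         `json:"startsAt"`
	EndsAt      time.Time         `json:"endsAt"`
}

func main() {
	// Build one batch of 64 firing alerts with distinct label sets.
	batch := make([]alert, 0, 64)
	now := time.Now()
	for i := 0; i < 64; i++ {
		batch = append(batch, alert{
			Labels:   map[string]string{"alertname": "BenchAlert", "instance": fmt.Sprintf("host-%d", i)},
			StartsAt: now,
			EndsAt:   now.Add(5 * time.Minute),
		})
	}

	body, err := json.Marshal(batch)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.Post("http://localhost:9093/api/v1/alerts", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```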

@brancz (Member) commented Mar 7, 2018

What does the block profile look like? 🙂

@starsdeep

@brancz

[profile screenshots: 001, 002]

@brancz (Member) commented Mar 8, 2018

I would look more closely at that 90s attributed to the Dispatcher; that's in the ingest path of alerts being added through the API.

@glidea commented Sep 1, 2022

I think a million active alerts is a reasonable starting point.

What sort of request rate are you expecting for these 1 million alerts? In what little testing I've done, AM had no issue maintaining several thousand active alerts (< 50,000), but it had issues processing incoming alert batches when the rate hit ~30 requests/sec (the issue could occur at a lower rate; this was just the rate at which I noticed AM locking up).

@stuartnelson3 Hi, regarding "AM had no issue maintaining several thousand active alerts (< 50,000)": do you have the corresponding benchmark data?
