improve message sending when there's an alert surge situation #8

ahmadalli · 2020-08-31T07:54:21Z

currently, if there's a surge of alerts, the notifier sends them all regardless of the time passed since the alert was dispatched, and we see alerts from a few days ago in the bot. this could be improved by adding a timeout (e.g. 1 min, should be configurable) to the message sending, since alertmanager itself resends the message to the receiver if the issue persists
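
A minimal sketch of the timeout idea, not the notifier's actual code. The names `send_with_deadline`, `send`, `max_age_seconds` and `retry_delay` are hypothetical. It retries a failed send only until a configurable deadline has passed, then gives up, on the assumption that alertmanager will resend if the alert is still firing:

```python
import time


def send_with_deadline(send, message, max_age_seconds=60.0, retry_delay=2.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Try to deliver `message` until `max_age_seconds` have elapsed.

    `send` is any callable that returns True on success and False on failure
    (a stand-in for the real Telegram call). Returns True if the message was
    delivered, False if the deadline passed first.
    """
    deadline = clock() + max_age_seconds
    while True:
        if send(message):
            return True
        # Stop if waiting for the next attempt would overshoot the deadline.
        if clock() + retry_delay > deadline:
            return False
        sleep(retry_delay)
```

`clock` and `sleep` are injectable only so the behavior can be checked without real waiting.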

ahmadalli · 2020-08-31T07:55:53Z

I think this is related to this code:
https://github.com/ix-ai/notifiers/blob/15af75cda8bb8ae09afef27e66d6b95365279529/ix_notifiers/telegram_notifier.py#L65

tlex · 2020-08-31T08:51:13Z

Thanks for the report.

I'm tempted to change this in such a way that alertmanager-notifier doesn't retry at all, but returns a 5xx code to alertmanager. This way, alertmanager handles the retries. With an added timeout, like you suggest.

ahmadalli · 2020-08-31T08:55:34Z

I’m not sure about alertmanager’s behavior if it receives a 5xx code from the webhook (it might just log an error), but generally this is a good idea. (if there’s a server breakdown, telegram shouldn’t be your source of truth, so if there are some missing alerts that would be fine)

See ix-ai/alertmanager-notifier#8

In effect, this disables the telegram retry in case of timeout, relying on the alertmanager exponential back-off retry for HTTP `5xx` response for webhooks. See also #8

tlex · 2020-08-31T11:51:55Z

So, looking at the code (good catch btw, it was exactly the exception handling that you've linked), I've noticed that the only retry happens in case of timeout. Now, I've read up on the behavior of alertmanager and, if it receives a 5xx from the webhook, it will actually retry at increasingly longer intervals (see also https://github.com/prometheus/alertmanager/pull/2290 - marked as code so as not to pollute the MR there).

This behavior is then consistent with having alertmanager as the source of truth about the notifications that went out.
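
As a rough illustration of the "no local retry, report a 5xx" behavior described above, here is a minimal sketch. `handle_alert` and `send` are hypothetical names, and the choice of 500 is an assumption (the thread only says "5xx"):

```python
from http import HTTPStatus


def webhook_status(delivered):
    """Map the outcome of a single send attempt to an HTTP status code."""
    # Assumption: 500 is used here; any 5xx code signals a failed delivery.
    return HTTPStatus.OK if delivered else HTTPStatus.INTERNAL_SERVER_ERROR


def handle_alert(send, message):
    """Attempt delivery exactly once and report the result as a status code."""
    try:
        delivered = bool(send(message))
    except Exception:
        # No local retry: any error is surfaced to alertmanager as a 5xx.
        delivered = False
    return int(webhook_status(delivered))
```

With this shape the notifier holds no retry state of its own; whether and when to try again is left entirely to the caller.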

The changes are now merged to master - until I schedule a release, you can use ixdotai/alertmanager-notifier:dev-master to try it out.

ahmadalli · 2020-08-31T11:59:03Z

thanks :) is it possible to have an option to disable the 5xx error on the receiver side? because the main issue still persists, since an error surge would result in alertmanager retrying for days

tlex · 2020-08-31T14:38:21Z

I've just added `TELEGRAM_RETRY_ON_FAILURE`. If this flag is set to `no`, the notifier always sends a 200 OK back to alertmanager, even if the Telegram notification wasn't successful. The default is `yes`.

(Commit: a7603aa)

ixdotai/alertmanager-notifier:dev-master is already built, so you can take it for a spin.

ahmadalli · 2020-08-31T14:40:25Z

great! thanks 👍

ahmadalli · 2020-10-11T08:38:38Z

I've been testing it for a while and it was a huge improvement over the previous version (no alerts from days ago showing up in the channel), but now all but one of the alerts that are sent at the same time get ignored, probably because telegram has a cooldown time between messages.

I think it's possible to improve it by retrying message sending (with an added random duration) but returning 200 if the process failed at the end (so that alertmanager wouldn't try sending it forever)

tlex · 2020-10-12T09:09:20Z

Currently, there is an ON / OFF switch for retries. If I understand correctly, you would like the following behavior:

- Add a fixed number of retries before giving up
- In such a case, return 200

If this is what you mean, I'll look into adding variables for both.

The retry duration is already based on what the Telegram API responds (with an added 0.5s) - you can see it here: https://github.com/ix-ai/notifiers/blob/82609e3092e7dfa7feb537280f3e2afc30fb4826/ix_notifiers/telegram_notifier.py#L70. If it wasn't the rate limiter, but a timeout, then the retry happens after 2 seconds.
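
The delay rule described above could be sketched roughly as follows. This is not the code from the linked file: `RateLimited` and `SendTimeout` are stand-in exception classes defined here for illustration, and `retry_delay` is a hypothetical helper name:

```python
class RateLimited(Exception):
    """Stand-in for a rate-limit error that carries the API's suggested wait."""

    def __init__(self, retry_after):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


class SendTimeout(Exception):
    """Stand-in for a network timeout while sending."""


def retry_delay(error):
    """Return how long to wait before the next attempt, or None to give up."""
    if isinstance(error, RateLimited):
        # Wait as long as the API asked, plus the 0.5 s margin.
        return error.retry_after + 0.5
    if isinstance(error, SendTimeout):
        # Timeouts carry no hint, so fall back to a fixed 2 s pause.
        return 2.0
    return None
```

For example, `retry_delay(RateLimited(3))` gives `3.5`, while any error that is neither a rate limit nor a timeout yields `None`.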

ahmadalli · 2020-10-13T06:57:58Z

yeah. I think it's better to have two separate switches: one for retries and one for whether to return 500 on failure

Drop `retry`. See ix-ai/alertmanager-notifier#8

tlex · 2020-10-21T12:11:46Z

I've built a new ixdotai/alertmanager-notifier:dev-master with the following (relevant) changes:

- Add `TELEGRAM_MAX_RETRIES` and `TELEGRAM_ALWAYS_SUCCEED`
- Drop `TELEGRAM_RETRY_ON_FAILURE`

The two new environment variables are described in README.md.
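
For readers skimming the thread, here is a hedged sketch of how the two switches might interact. It is not taken from the commit: the `deliver` function, the accepted boolean spellings, the defaults (3 retries, always-succeed off), the 500 status, and the reading of "max retries" as attempts after the first one are all assumptions. The README remains the authoritative description.

```python
def _as_bool(value):
    """Interpret common truthy spellings; the accepted set is an assumption."""
    return str(value).strip().lower() in ("1", "true", "yes", "on")


def deliver(send, message, env):
    """Return the HTTP status to report after a bounded number of attempts.

    `env` is a mapping such as os.environ. `send` is any callable returning
    True on success and False on failure.
    """
    max_retries = int(env.get("TELEGRAM_MAX_RETRIES", "3"))
    always_succeed = _as_bool(env.get("TELEGRAM_ALWAYS_SUCCEED", "no"))

    # One initial attempt plus `max_retries` further attempts.
    for _ in range(max_retries + 1):
        if send(message):
            return 200
    # All attempts failed: hide or report the failure depending on the flag.
    return 200 if always_succeed else 500
```

Keeping the two settings independent matches the request earlier in the thread: one bounds the local retries, the other decides whether alertmanager ever sees a failure.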

If it works well, I'll create a release in a week.

Relevant commit: ff5150e

ahmadalli · 2020-10-21T18:54:37Z

I've tested it and it's working great :) please close this after the next release

tlex · 2020-10-25T10:34:53Z

Thanks for the feedback. v0.3.0 released

tlex self-assigned this Aug 31, 2020

ix-ai-bot pushed a commit to ix-ai/notifiers that referenced this issue Aug 31, 2020

Adds optional param retry for Telegram

82609e3

See ix-ai/alertmanager-notifier#8

tlex added the enhancement label Aug 31, 2020

ix-ai-bot pushed a commit to ix-ai/notifiers that referenced this issue Oct 21, 2020

Add max_retries and always_succeed

2a9174e

Drop `retry`. See ix-ai/alertmanager-notifier#8

tlex closed this as completed Oct 25, 2020
