Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve message sending when there's an alert surge situation #8

Closed
ahmadalli opened this issue Aug 31, 2020 · 13 comments
Closed

improve message sending when there's an alert surge situation #8

ahmadalli opened this issue Aug 31, 2020 · 13 comments
Assignees
Labels
enhancement New feature or request

Comments

@ahmadalli
Copy link
Contributor

currently if there's a surge of alerts, the notifier would send them all regardless of the time passed since the dispatch of the alert and we see alerts from a few days ago in the bot. this could be improved by adding a timeout (e.g. 1 min, should be configurable) to the message sending. since alertmanager itself would resends the message to the receiver if the issue persists

@ahmadalli
Copy link
Contributor Author

ahmadalli commented Aug 31, 2020

@tlex
Copy link
Member

tlex commented Aug 31, 2020

Thanks for the report.

I'm tempted to change this in such a way that alertmanager-notifier doesn't retry at all, but returns a 5xx code to alertmanager. This way, alertmanager handles the retries. With an added timeout, like you suggest.

@tlex tlex self-assigned this Aug 31, 2020
@ahmadalli
Copy link
Contributor Author

ahmadalli commented Aug 31, 2020 via email

ix-ai-bot pushed a commit to ix-ai/notifiers that referenced this issue Aug 31, 2020
ix-ai-bot pushed a commit that referenced this issue Aug 31, 2020
In effect, this disables the telegram retry in case of timeout, relying
on the alertmanager exponential back-off retry for HTTP `5xx` response
for webhooks. See also #8
@tlex
Copy link
Member

tlex commented Aug 31, 2020

So, looking at the code (good catch btw, it was exactly the exception handling that you've linked), I've noticed that the only retry happens in case of timeout. Now, I've read up on the behavior of alertmanager and, if it receives a 5xx from the webhook, it will actually retry in intervals that are increasingly longer (see also https://github.com/prometheus/alertmanager/pull/2290 - marked as code as to not pollute the MR there).

This behavior is then consistent to having alertmanager as a source of truth about the notifications that went out.

The changes are now merged to master - until I schedule a release, you can use ixdotai/alertmanager-notifier:dev-master to try it out.

@ahmadalli
Copy link
Contributor Author

thanks :) is it possible to have an option to disable the 5xx error on the receiver side? because the main issue still persists since and error surge would results in alertmanager retrying for days

@tlex
Copy link
Member

tlex commented Aug 31, 2020

I've just added TELEGRAM_RETRY_ON_FAILURE - This flag, if set to no, always sends a 200 OK back to alertmanager,
even if the Telegram notification wasn't successful. The default is yes.

(Commit: a7603aa)

ixdotai/alertmanager-notifier:dev-master is already built, so you can take it for a spin.

@ahmadalli
Copy link
Contributor Author

great! thanks 👍

@tlex tlex added the enhancement New feature or request label Aug 31, 2020
@ahmadalli
Copy link
Contributor Author

I've been testing it for a while and it was a huge improvement over the previous version (no alerts from days age showing up in the channel) but now, all but one of the alerts which have been sent at same time would be ignored; probably because telegram has a cooldown time between messages.

I think it's possible to improve it by retrying message sending (with an added random duration) but returning 200 if the process failed at the end (so that alertmanager wouldn't try sending it forever)

@tlex
Copy link
Member

tlex commented Oct 12, 2020

Currently, there is an ON / OFF switch for retries. If I understand correctly, you would like the following behavior:

  • Add a fixed number of retries before giving up
  • In such a case, return 200

If this is what you mean, I'll look into adding variables for both.

The retry duration is already based on what the Telegram API responds (with an added 0.5s) - you can see it here: https://github.com/ix-ai/notifiers/blob/82609e3092e7dfa7feb537280f3e2afc30fb4826/ix_notifiers/telegram_notifier.py#L70. If it wasn't the rate limiter, but a timeout, then the retry happens after 2 seconds.

@ahmadalli
Copy link
Contributor Author

ahmadalli commented Oct 13, 2020

yeah. I think it's better to have two separate switches: one for retries and one for whether return 500 or not on failure

ix-ai-bot pushed a commit to ix-ai/notifiers that referenced this issue Oct 21, 2020
@tlex
Copy link
Member

tlex commented Oct 21, 2020

I've built a new ixdotai/alertmanager-notifier:dev-master with the following (relevant) changes:

  • Add TELEGRAM_MAX_RETRIES and TELEGRAM_ALWAYS_SUCCEED
  • Drop TELEGRAM_RETRY_ON_FAILURE

The two new environment variables are described in README.md.

If it works well, I'll create a release in a week.

Relevant commit: ff5150e

@ahmadalli
Copy link
Contributor Author

I've tested it and it's working great :) please close this after the next release

@tlex
Copy link
Member

tlex commented Oct 25, 2020

Thanks for the feedback. v0.3.0 released

@tlex tlex closed this as completed Oct 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants