improve message sending when there's an alert surge situation #8
Comments
I think this is related to this code: |
Thanks for the report. I'm tempted to change this in such a way that alertmanager-notifier doesn't retry at all, but returns a |
I’m not sure how alertmanager behaves when it receives a 5xx code from the webhook (it might just log an error), but generally this is a good idea. (If there’s a server breakdown, Telegram shouldn’t be your source of truth, so a few missing alerts would be fine.)
|
In effect, this disables the telegram retry in case of timeout, relying on the alertmanager exponential back-off retry for HTTP `5xx` response for webhooks. See also #8
So, looking at the code (good catch btw, it was exactly the exception handling that you've linked), I've noticed that the only retry happens in case of a timeout. Now, I've read up on the behavior of alertmanager and, if it receives a This behavior is then consistent with treating alertmanager as the source of truth about the notifications that went out. The changes are now merged to master - until I schedule a release, you can use |
thanks :) is it possible to have an option to disable the |
I've just added (Commit: a7603aa)
|
great! thanks 👍 |
I've been testing it for a while and it is a huge improvement over the previous version (no alerts from days ago showing up in the channel). However, when several alerts are sent at the same time, all but one of them are now dropped, probably because Telegram enforces a cooldown between messages. I think this could be improved by retrying the send (with an added random delay) but returning 200 if it still fails in the end, so that alertmanager doesn't keep retrying forever. |
Currently, there is an
If this is what you mean, I'll look into adding variables for both. The retry duration is already based on what the Telegram API responds (with an added 0.5s) - you can see it here: https://github.com/ix-ai/notifiers/blob/82609e3092e7dfa7feb537280f3e2afc30fb4826/ix_notifiers/telegram_notifier.py#L70. If it wasn't the rate limiter, but a timeout, then the retry happens after 2 seconds. |
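The delay rule described in this comment can be sketched roughly as follows. This is only an illustration, not the code from `ix_notifiers`: the exception classes `RateLimited` and `SendTimeout` and the function `retry_delay` are hypothetical stand-ins for whatever the Telegram client actually raises.

```python
from typing import Optional


class RateLimited(Exception):
    """Hypothetical: the API asked the caller to wait `retry_after` seconds."""

    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


class SendTimeout(Exception):
    """Hypothetical: the request to the API timed out."""


def retry_delay(error: Exception) -> Optional[float]:
    """Return the number of seconds to wait before retrying, or None for no retry."""
    if isinstance(error, RateLimited):
        # Wait as long as the API asked for, plus the 0.5s margin mentioned above.
        return error.retry_after + 0.5
    if isinstance(error, SendTimeout):
        # A plain timeout is retried after a fixed 2 seconds.
        return 2.0
    # Any other error is not retried in this sketch.
    return None
```

Keeping the delay calculation separate from the send loop makes it easy to test without sleeping or touching the network.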
yeah. I think it's better to have two separate switches: one for retries and one for whether to return |
Drop `retry`. See ix-ai/alertmanager-notifier#8
I've built a new
The two new environment variables are described in README.md. If it works well, I'll create a release in a week. Relevant commit: ff5150e |
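For readers following the thread, here is a minimal sketch of how two such switches could interact. The variable names `EXAMPLE_MAX_RETRIES` and `EXAMPLE_ALWAYS_SUCCEED` are made up for illustration (the real names are the ones in README.md), `handle_alert` is a hypothetical helper, and `500` is just one arbitrary choice of `5xx` status.

```python
import os
from typing import Callable, Mapping, Tuple


def load_settings(env: Mapping[str, str] = os.environ) -> Tuple[int, bool]:
    """Read the two hypothetical switches from the environment."""
    retries = int(env.get("EXAMPLE_MAX_RETRIES", "0"))
    always_ok = env.get("EXAMPLE_ALWAYS_SUCCEED", "no").lower() in ("1", "true", "yes")
    return retries, always_ok


def handle_alert(send: Callable[[], bool], retries: int, always_ok: bool) -> int:
    """Try to send once plus `retries` more times; return the webhook HTTP status."""
    for _ in range(retries + 1):
        if send():
            return 200
    # Every attempt failed: either hide the failure from alertmanager (200)
    # or report it (5xx) so that alertmanager retries with its own back-off.
    return 200 if always_ok else 500
```

With retries set to 0 and the second switch off, this reduces to the earlier behavior in the thread: no retry in the notifier, and a `5xx` that leaves the retrying to alertmanager.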
I've tested it and it's working great :) please close this after the next release |
Thanks for the feedback. v0.3.0 released |
Currently, if there's a surge of alerts, the notifier sends them all regardless of how much time has passed since the alert was dispatched, so we see alerts from a few days ago in the bot. This could be improved by adding a timeout (e.g. 1 min, configurable) to message sending, since alertmanager itself resends the message to the receiver if the issue persists.
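The timeout proposed here could look something like the sketch below. It only illustrates the idea as worded in this report: the names `MAX_AGE` and `should_send` are hypothetical, and how the notifier would obtain the dispatch time (for example, by recording when the webhook request arrived) is an assumption.

```python
from datetime import datetime, timedelta

# Hypothetical default; the proposal asks for this to be configurable.
MAX_AGE = timedelta(minutes=1)


def should_send(dispatched_at: datetime, now: datetime,
                max_age: timedelta = MAX_AGE) -> bool:
    """Return True while the alert is still fresh enough to be forwarded."""
    return now - dispatched_at <= max_age
```

An alert that has been waiting longer than `max_age` would be skipped, on the expectation that alertmanager sends it again if the problem is still present.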