Unified Alerting: Set `max_attempts` to 1 by default #79095

gotjosh · 2023-12-05T16:14:05Z

The retry logic for unified alerting has been broken since at least v9.4.x. Rather than fixing it in one go and causing a headache for users whose rules would put extra load on their data sources, I think a better approach is to set the default to 1 and let users change it.

I see two cons with this approach:

- Configuration for legacy to unified alerting cannot be ported over automatically; users will have to manually set `max_attempts` to 3 when migrating.
- Users expecting to get any sort of retrying (as with legacy alerting) will not have it out of the box and will have to manually edit the configuration.

I think this is the best compromise we can make with the introduction of #79037

Signed-off-by: gotjosh <josue.abreu@gmail.com>

alexweav · 2023-12-05T16:27:52Z

LGTM as it appears this config option has no effect currently.

JacobsonMT

LGTM

gotjosh · 2023-12-05T17:43:12Z

We discussed this offline extensively and all agreed that it was a fine change to merge.

Please find the conversation in: https://raintank-corp.slack.com/archives/C028MCV4R7C/p1701445758479449

grafana-delivery-bot · 2023-12-05T17:43:49Z

The backport to v9.4.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-79095-to-v9.4.x origin/v9.4.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x 0c9356a3c78b83b8bb5abf99263053b10de84d48

When the conflicts are resolved, stage and commit the changes:

git add . && git cherry-pick --continue

If you have the GitHub CLI installed:

# Push the branch to GitHub:
git push --set-upstream origin backport-79095-to-v9.4.x
# Create the PR body template
PR_BODY=$(gh pr view 79095 --json body --template 'Backport 0c9356a3c78b83b8bb5abf99263053b10de84d48 from #79095{{ "\n\n---\n\n" }}{{ index . "body" }}')
# Create the PR on GitHub
echo "${PR_BODY}" | gh pr create --title "[v9.4.x] Unified Alerting: Set \`max_attempts\` to 1 by default" --body-file - --label "type/docs" --label "area/backend" --label "add to changelog" --label "product-approved" --label "backport" --base v9.4.x --milestone 9.4.x --web

Or, if you don't have the GitHub CLI installed (we recommend you install it!):

# Push the branch to GitHub:
git push --set-upstream origin backport-79095-to-v9.4.x

# Create a pull request where the `base` branch is `v9.4.x` and the `compare`/`head` branch is `backport-79095-to-v9.4.x`.

# Remove the local backport branch
git switch main
git branch -D backport-79095-to-v9.4.x

grafana-delivery-bot · 2023-12-05T17:43:51Z

The backport to v9.5.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-79095-to-v9.5.x origin/v9.5.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x 0c9356a3c78b83b8bb5abf99263053b10de84d48

When the conflicts are resolved, stage and commit the changes:

git add . && git cherry-pick --continue

If you have the GitHub CLI installed:

# Push the branch to GitHub:
git push --set-upstream origin backport-79095-to-v9.5.x
# Create the PR body template
PR_BODY=$(gh pr view 79095 --json body --template 'Backport 0c9356a3c78b83b8bb5abf99263053b10de84d48 from #79095{{ "\n\n---\n\n" }}{{ index . "body" }}')
# Create the PR on GitHub
echo "${PR_BODY}" | gh pr create --title "[v9.5.x] Unified Alerting: Set \`max_attempts\` to 1 by default" --body-file - --label "type/docs" --label "area/backend" --label "add to changelog" --label "product-approved" --label "backport" --base v9.5.x --milestone 9.5.x --web

Or, if you don't have the GitHub CLI installed (we recommend you install it!):

# Push the branch to GitHub:
git push --set-upstream origin backport-79095-to-v9.5.x

# Create a pull request where the `base` branch is `v9.5.x` and the `compare`/`head` branch is `backport-79095-to-v9.5.x`.

# Remove the local backport branch
git switch main
git branch -D backport-79095-to-v9.5.x

Unified Alerting: Set `max_attempts` to 1 by default (#79095)

* Unified Alerting: Set `max_attempts` to 1 by default

The retry logic for unified alerting has been broken since at least v9.4.x. Rather than fixing it in one go and causing a headache for users whose rules would put extra load on their data sources, I think a better approach is to set the default to 1 and let users change it.

I see two cons with this approach:

- Configuration for legacy to unified alerting cannot be ported over automatically; users will have to manually set `max_attempts` to 3 when migrating.
- Users expecting to get any sort of retrying (as with legacy alerting) will not have it out of the box and will have to manually edit the configuration.

Signed-off-by: gotjosh <josue.abreu@gmail.com>

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
(cherry picked from commit 0c9356a)
Co-authored-by: gotjosh <josue.abreu@gmail.com>

Retrying has been broken for a good while now (at least since version 9.4). This change attempts to reintroduce retries in their simplest and safest possible form.

I first introduced #79095 to make sure we don't disrupt or put additional load on our customers' data sources with this change in a patch release. Paired with this change, retries can now work as expected.

There are two small differences between how retries work now and how they used to work in legacy alerting:

- Retries only occur for valid alert definitions. If we suspect that the error comes from a malformed alert definition, we skip retrying.
- We have added a constant backoff of 1s between retries.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
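For readers who want the retry behaviour described above in concrete terms, here is a minimal sketch in Python. It is not Grafana's implementation (which is written in Go); `evaluate_with_retries`, `NonRetryableError`, and the injectable `sleep` parameter are hypothetical names made up for illustration. It only models the three points from this thread: `max_attempts` defaults to 1 (so no retries out of the box), errors attributed to a malformed alert definition are not retried, and a constant backoff separates attempts.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class NonRetryableError(Exception):
    """Hypothetical marker for errors caused by a malformed alert definition."""


def evaluate_with_retries(
    evaluate: Callable[[], T],
    max_attempts: int = 1,          # default of 1 means a single try, no retries
    backoff_seconds: float = 1.0,   # constant backoff between attempts
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> T:
    """Call `evaluate` up to `max_attempts` times with a constant backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return evaluate()
        except NonRetryableError:
            # Retrying cannot fix a malformed definition, so fail immediately.
            raise
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            sleep(backoff_seconds)

    raise AssertionError("unreachable")
```

With the default `max_attempts=1` the function behaves like a plain call, which mirrors the new default in this PR. Passing `max_attempts=3` restores retry behaviour closer to what legacy alerting users would expect, at the cost of up to two extra queries per failing evaluation.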

Alerting: Attempt to retry retryable errors (#79161)

* Alerting: Attempt to retry retryable errors

Retrying has been broken for a good while now (at least since version 9.4). This change attempts to reintroduce retries in their simplest and safest possible form.

I first introduced #79095 to make sure we don't disrupt or put additional load on our customers' data sources with this change in a patch release. Paired with this change, retries can now work as expected.

There are two small differences between how retries work now and how they used to work in legacy alerting:

- Retries only occur for valid alert definitions. If we suspect that the error comes from a malformed alert definition, we skip retrying.
- We have added a constant backoff of 1s between retries.

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
(cherry picked from commit c631261)
Co-authored-by: gotjosh <josue.abreu@gmail.com>

gotjosh requested review from torkelo, a team and chri2547 as code owners December 5, 2023 16:14

gotjosh requested review from mildwonkey, undef1nd and suntala and removed request for a team December 5, 2023 16:14

grafana-delivery-bot bot added this to the 10.3.x milestone Dec 5, 2023

grafana-pr-automation bot added area/backend type/docs labels Dec 5, 2023

fix tests

97ee905

Signed-off-by: gotjosh <josue.abreu@gmail.com>

alexweav approved these changes Dec 5, 2023

fix more tests

20b1ff9

Signed-off-by: gotjosh <josue.abreu@gmail.com>

JacobsonMT approved these changes Dec 5, 2023

gotjosh merged commit 0c9356a into main Dec 5, 2023
14 checks passed

gotjosh deleted the unified-alerting-set-max-attempts-to-1 branch December 5, 2023 17:42

grafana-delivery-bot bot added the backport-failed label Dec 5, 2023

grafana-delivery-bot bot mentioned this pull request Dec 5, 2023

[v10.0.x] Unified Alerting: Set max_attempts to 1 by default #79101

Merged

gotjosh mentioned this pull request Dec 5, 2023

[v9.5.x] Unified Alerting: Set `max_attempts` to 1 by default #79109

Merged

gotjosh mentioned this pull request Dec 6, 2023

Alerting: Attempt to retry retryable errors #79037

Merged

3 tasks

grafana-delivery-bot bot mentioned this pull request Dec 6, 2023

[v10.2.x] Alerting: Attempt to retry retryable errors #79152

Closed

3 tasks

gotjosh mentioned this pull request Dec 6, 2023

Alerting: Attempt to retry retryable errors #79161

Merged

3 tasks

grafana-delivery-bot bot mentioned this pull request Dec 6, 2023

[v10.2.x] Alerting: Attempt to retry retryable errors #79175

Merged

3 tasks

gotjosh mentioned this pull request Dec 6, 2023

[v9.4.x] Alerting: Attempt to retry retryable errors #79180

Merged

3 tasks

gotjosh mentioned this pull request Dec 7, 2023

[v9.5.x] Alerting: Attempt to retry retryable errors #79209

Merged

3 tasks

gotjosh mentioned this pull request Dec 7, 2023

[v10.0.x] Alerting: Attempt to retry retryable errors #79210

Merged

3 tasks

gotjosh mentioned this pull request Dec 7, 2023

[v10.1.x] Alerting: Attempt to retry retryable errors #79211

Merged

3 tasks

aangelisc modified the milestones: 10.3.x, 10.2.3 Dec 21, 2023
