Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unified Alerting: Set max_attempts to 1 by default #79095

Merged
merged 3 commits into from
Dec 5, 2023

Conversation

gotjosh
Copy link
Contributor

@gotjosh gotjosh commented Dec 5, 2023

The retry logic for unified alerting has been broken as far as v9.4.x, rather than fixing it in one go and causing a headache to our users with rules putting extra load on their datasources - I think a better approach is to simply set 1 as a default and then let our users change it.

I see two cons with this approach:

  • Configuration for legacy to unified alerting cannot be ported over automatically, users will have to manually set max_attempts to 3 when migrating.
  • Users expecting to get any sort of retrying (as with legacy alerting) will not have it out of the box and will have to manually edit the configuration.

I think this is the best compromise we can make with the introduction of #79037

Signed-off-by: gotjosh josue.abreu@gmail.com

The retry logic for unified alerting has been broken as far as v9.4.x, rather than fixing it in one go and causing a headache to our users with rules putting extra load on their datasources - I think a better approach is to simply set 1 as a default and then let our users change it.

I see two cons with this approach:

- Configuration for legacy to unified alerting cannot be ported over automatically, users will have to manually set `max_attempts` to 3 when migrating.
- Users expecting to get any sort of retrying (as with legacy alerting) will not have it out of the box and will have to manually edit the configuration.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
@gotjosh gotjosh added add to changelog product-approved Pull requests that are approved by product/managers and are allowed to be backported backport v9.4.x Mark PR for automatic backport to v9.4.x backport v9.5.x Bot will automatically open backport PR backport v10.0.x backport v10.1.x backport v10.2.x labels Dec 5, 2023
@gotjosh gotjosh requested review from torkelo, a team and chri2547 as code owners December 5, 2023 16:14
@gotjosh gotjosh requested review from mildwonkey, undef1nd and suntala and removed request for a team December 5, 2023 16:14
@grafana-delivery-bot grafana-delivery-bot bot added this to the 10.3.x milestone Dec 5, 2023
Signed-off-by: gotjosh <josue.abreu@gmail.com>
@alexweav
Copy link
Contributor

alexweav commented Dec 5, 2023

LGTM as it appears this config option has no effect currently.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
Copy link
Member

@JacobsonMT JacobsonMT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gotjosh gotjosh merged commit 0c9356a into main Dec 5, 2023
14 checks passed
@gotjosh gotjosh deleted the unified-alerting-set-max-attempts-to-1 branch December 5, 2023 17:42
@gotjosh
Copy link
Contributor Author

gotjosh commented Dec 5, 2023

We discussed this offline extensively and all agreed that it was a fine change to merge.

Please find the conversation in: https://raintank-corp.slack.com/archives/C028MCV4R7C/p1701445758479449

Copy link
Contributor

The backport to v9.4.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-79095-to-v9.4.x origin/v9.4.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x 0c9356a3c78b83b8bb5abf99263053b10de84d48

When the conflicts are resolved, stage and commit the changes:

git add . && git cherry-pick --continue

If you have the GitHub CLI installed:

# Push the branch to GitHub:
git push --set-upstream origin backport-79095-to-v9.4.x
# Create the PR body template
PR_BODY=$(gh pr view 79095 --json body --template 'Backport 0c9356a3c78b83b8bb5abf99263053b10de84d48 from #79095{{ "\n\n---\n\n" }}{{ index . "body" }}')
# Create the PR on GitHub
echo "${PR_BODY}" | gh pr create --title "[v9.4.x] Unified Alerting: Set `max_attempts` to 1 by default" --body-file - --label "type/docs" --label "area/backend" --label "add to changelog" --label "product-approved" --label "backport" --base v9.4.x --milestone 9.4.x --web

Or, if you don't have the GitHub CLI installed (we recommend you install it!):

# Push the branch to GitHub:
git push --set-upstream origin backport-79095-to-v9.4.x

# Create a pull request where the `base` branch is `v9.4.x` and the `compare`/`head` branch is `backport-79095-to-v9.4.x`.

# Remove the local backport branch
git switch main
git branch -D backport-79095-to-v9.4.x

@grafana-delivery-bot grafana-delivery-bot bot added the backport-failed Failed to generate backport PR. Please resolve conflicts and create one manually. label Dec 5, 2023
Copy link
Contributor

The backport to v9.5.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-79095-to-v9.5.x origin/v9.5.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x 0c9356a3c78b83b8bb5abf99263053b10de84d48

When the conflicts are resolved, stage and commit the changes:

git add . && git cherry-pick --continue

If you have the GitHub CLI installed:

# Push the branch to GitHub:
git push --set-upstream origin backport-79095-to-v9.5.x
# Create the PR body template
PR_BODY=$(gh pr view 79095 --json body --template 'Backport 0c9356a3c78b83b8bb5abf99263053b10de84d48 from #79095{{ "\n\n---\n\n" }}{{ index . "body" }}')
# Create the PR on GitHub
echo "${PR_BODY}" | gh pr create --title "[v9.5.x] Unified Alerting: Set `max_attempts` to 1 by default" --body-file - --label "type/docs" --label "area/backend" --label "add to changelog" --label "product-approved" --label "backport" --base v9.5.x --milestone 9.5.x --web

Or, if you don't have the GitHub CLI installed (we recommend you install it!):

# Push the branch to GitHub:
git push --set-upstream origin backport-79095-to-v9.5.x

# Create a pull request where the `base` branch is `v9.5.x` and the `compare`/`head` branch is `backport-79095-to-v9.5.x`.

# Remove the local backport branch
git switch main
git branch -D backport-79095-to-v9.5.x

gotjosh added a commit that referenced this pull request Dec 5, 2023
Unified Alerting: Set `max_attempts` to 1 by default (#79095)

* Unified Alerting: Set `max_attempts` to 1 by default

The retry logic for unified alerting has been broken as far as v9.4.x, rather than fixing it in one go and causing a headache to our users with rules putting extra load on their datasources - I think a better approach is to simply set 1 as a default and then let our users change it.

I see two cons with this approach:

- Configuration for legacy to unified alerting cannot be ported over automatically, users will have to manually set `max_attempts` to 3 when migrating.
- Users expecting to get any sort of retrying (as with legacy alerting) will not have it out of the box and will have to manually edit the configuration.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
(cherry picked from commit 0c9356a)

Co-authored-by: gotjosh <josue.abreu@gmail.com>
gotjosh added a commit that referenced this pull request Dec 5, 2023
Unified Alerting: Set `max_attempts` to 1 by default (#79095)

* Unified Alerting: Set `max_attempts` to 1 by default

The retry logic for unified alerting has been broken as far as v9.4.x, rather than fixing it in one go and causing a headache to our users with rules putting extra load on their datasources - I think a better approach is to simply set 1 as a default and then let our users change it.

I see two cons with this approach:

- Configuration for legacy to unified alerting cannot be ported over automatically, users will have to manually set `max_attempts` to 3 when migrating.
- Users expecting to get any sort of retrying (as with legacy alerting) will not have it out of the box and will have to manually edit the configuration.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
(cherry picked from commit 0c9356a)

Co-authored-by: gotjosh <josue.abreu@gmail.com>
gotjosh added a commit that referenced this pull request Dec 5, 2023
Unified Alerting: Set `max_attempts` to 1 by default (#79095)

* Unified Alerting: Set `max_attempts` to 1 by default

The retry logic for unified alerting has been broken as far as v9.4.x, rather than fixing it in one go and causing a headache to our users with rules putting extra load on their datasources - I think a better approach is to simply set 1 as a default and then let our users change it.

I see two cons with this approach:

- Configuration for legacy to unified alerting cannot be ported over automatically, users will have to manually set `max_attempts` to 3 when migrating.
- Users expecting to get any sort of retrying (as with legacy alerting) will not have it out of the box and will have to manually edit the configuration.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
(cherry picked from commit 0c9356a)

Co-authored-by: gotjosh <josue.abreu@gmail.com>
gotjosh added a commit that referenced this pull request Dec 5, 2023
Unified Alerting: Set `max_attempts` to 1 by default (#79095)

* Unified Alerting: Set `max_attempts` to 1 by default

The retry logic for unified alerting has been broken as far as v9.4.x, rather than fixing it in one go and causing a headache to our users with rules putting extra load on their datasources - I think a better approach is to simply set 1 as a default and then let our users change it.

I see two cons with this approach:

- Configuration for legacy to unified alerting cannot be ported over automatically, users will have to manually set `max_attempts` to 3 when migrating.
- Users expecting to get any sort of retrying (as with legacy alerting) will not have it out of the box and will have to manually edit the configuration.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
(cherry picked from commit 0c9356a)
gotjosh added a commit that referenced this pull request Dec 6, 2023
Unified Alerting: Set `max_attempts` to 1 by default (#79095)

* Unified Alerting: Set `max_attempts` to 1 by default

The retry logic for unified alerting has been broken as far as v9.4.x, rather than fixing it in one go and causing a headache to our users with rules putting extra load on their datasources - I think a better approach is to simply set 1 as a default and then let our users change it.

I see two cons with this approach:

- Configuration for legacy to unified alerting cannot be ported over automatically, users will have to manually set `max_attempts` to 3 when migrating.
- Users expecting to get any sort of retrying (as with legacy alerting) will not have it out of the box and will have to manually edit the configuration.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
(cherry picked from commit 0c9356a)
gotjosh added a commit that referenced this pull request Dec 6, 2023
Retrying has been broken for a good while now (at least since version 9.4) - this change attempts to re-introduce them in their simplest and safest form possible.

I first introduced #79095 to make sure we don't disrupt or put additional load on our customer's data sources with this change in a patch release. Paired with this change, retries can now work as expected.

There's two small differences between how retries work now and how they used to work in legacy alerting.

Retries only occur for valid alert definitions - if we suspect that that error comes from a malformed alert definition we skip retrying.
We have added a constant backoff of 1s in between retries.

Signed-off-by: gotjosh <josue.abreu@gmail.com>
gotjosh added a commit that referenced this pull request Dec 6, 2023
* Alerting: Attempt to retry retryable errors

Retrying has been broken for a good while now (at least since version 9.4) - this change attempts to re-introduce them in their simplest and safest form possible.

I first introduced #79095 to make sure we don't disrupt or put additional load on our customer's data sources with this change in a patch release. Paired with this change, retries can now work as expected.

There's two small differences between how retries work now and how they used to work in legacy alerting.

Retries only occur for valid alert definitions - if we suspect that that error comes from a malformed alert definition we skip retrying.
We have added a constant backoff of 1s in between retries.

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
grafana-delivery-bot bot pushed a commit that referenced this pull request Dec 6, 2023
* Alerting: Attempt to retry retryable errors

Retrying has been broken for a good while now (at least since version 9.4) - this change attempts to re-introduce them in their simplest and safest form possible.

I first introduced #79095 to make sure we don't disrupt or put additional load on our customer's data sources with this change in a patch release. Paired with this change, retries can now work as expected.

There's two small differences between how retries work now and how they used to work in legacy alerting.

Retries only occur for valid alert definitions - if we suspect that that error comes from a malformed alert definition we skip retrying.
We have added a constant backoff of 1s in between retries.

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
(cherry picked from commit c631261)
gotjosh added a commit that referenced this pull request Dec 6, 2023
* Alerting: Attempt to retry retryable errors

Retrying has been broken for a good while now (at least since version 9.4) - this change attempts to re-introduce them in their simplest and safest form possible.

I first introduced #79095 to make sure we don't disrupt or put additional load on our customer's data sources with this change in a patch release. Paired with this change, retries can now work as expected.

There's two small differences between how retries work now and how they used to work in legacy alerting.

Retries only occur for valid alert definitions - if we suspect that that error comes from a malformed alert definition we skip retrying.
We have added a constant backoff of 1s in between retries.

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
(cherry picked from commit c631261)
gotjosh added a commit that referenced this pull request Dec 6, 2023
* Alerting: Attempt to retry retryable errors

Retrying has been broken for a good while now (at least since version 9.4) - this change attempts to re-introduce them in their simplest and safest form possible.

I first introduced #79095 to make sure we don't disrupt or put additional load on our customer's data sources with this change in a patch release. Paired with this change, retries can now work as expected.

There's two small differences between how retries work now and how they used to work in legacy alerting.

Retries only occur for valid alert definitions - if we suspect that that error comes from a malformed alert definition we skip retrying.
We have added a constant backoff of 1s in between retries.

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
(cherry picked from commit c631261)
Signed-off-by: gotjosh <josue.abreu@gmail.com>
gotjosh added a commit that referenced this pull request Dec 6, 2023
Alerting: Attempt to retry retryable errors (#79161)

* Alerting: Attempt to retry retryable errors

Retrying has been broken for a good while now (at least since version 9.4) - this change attempts to re-introduce them in their simplest and safest form possible.

I first introduced #79095 to make sure we don't disrupt or put additional load on our customer's data sources with this change in a patch release. Paired with this change, retries can now work as expected.

There's two small differences between how retries work now and how they used to work in legacy alerting.

Retries only occur for valid alert definitions - if we suspect that that error comes from a malformed alert definition we skip retrying.
We have added a constant backoff of 1s in between retries.

---------


(cherry picked from commit c631261)

Signed-off-by: gotjosh <josue.abreu@gmail.com>
gotjosh added a commit that referenced this pull request Dec 7, 2023
Alerting: Attempt to retry retryable errors (#79161)

* Alerting: Attempt to retry retryable errors

Retrying has been broken for a good while now (at least since version 9.4) - this change attempts to re-introduce them in their simplest and safest form possible.

I first introduced #79095 to make sure we don't disrupt or put additional load on our customer's data sources with this change in a patch release. Paired with this change, retries can now work as expected.

There's two small differences between how retries work now and how they used to work in legacy alerting.

Retries only occur for valid alert definitions - if we suspect that that error comes from a malformed alert definition we skip retrying.
We have added a constant backoff of 1s in between retries.

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
(cherry picked from commit c631261)

Co-authored-by: gotjosh <josue.abreu@gmail.com>
gotjosh added a commit that referenced this pull request Dec 7, 2023
* Alerting: Attempt to retry retryable errors

Retrying has been broken for a good while now (at least since version 9.4) - this change attempts to re-introduce them in their simplest and safest form possible.

I first introduced #79095 to make sure we don't disrupt or put additional load on our customer's data sources with this change in a patch release. Paired with this change, retries can now work as expected.

There's two small differences between how retries work now and how they used to work in legacy alerting.

Retries only occur for valid alert definitions - if we suspect that that error comes from a malformed alert definition we skip retrying.
We have added a constant backoff of 1s in between retries.

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
(cherry picked from commit c631261)
gotjosh added a commit that referenced this pull request Dec 7, 2023
* Alerting: Attempt to retry retryable errors

Retrying has been broken for a good while now (at least since version 9.4) - this change attempts to re-introduce them in their simplest and safest form possible.

I first introduced #79095 to make sure we don't disrupt or put additional load on our customer's data sources with this change in a patch release. Paired with this change, retries can now work as expected.

There's two small differences between how retries work now and how they used to work in legacy alerting.

Retries only occur for valid alert definitions - if we suspect that that error comes from a malformed alert definition we skip retrying.
We have added a constant backoff of 1s in between retries.

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
(cherry picked from commit c631261)
Signed-off-by: gotjosh <josue.abreu@gmail.com>
gotjosh added a commit that referenced this pull request Dec 7, 2023
* Alerting: Attempt to retry retryable errors

Retrying has been broken for a good while now (at least since version 9.4) - this change attempts to re-introduce them in their simplest and safest form possible.

I first introduced #79095 to make sure we don't disrupt or put additional load on our customer's data sources with this change in a patch release. Paired with this change, retries can now work as expected.

There's two small differences between how retries work now and how they used to work in legacy alerting.

Retries only occur for valid alert definitions - if we suspect that that error comes from a malformed alert definition we skip retrying.
We have added a constant backoff of 1s in between retries.

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
(cherry picked from commit c631261)
gotjosh added a commit that referenced this pull request Dec 7, 2023
* Alerting: Attempt to retry retryable errors

Retrying has been broken for a good while now (at least since version 9.4) - this change attempts to re-introduce them in their simplest and safest form possible.

I first introduced #79095 to make sure we don't disrupt or put additional load on our customer's data sources with this change in a patch release. Paired with this change, retries can now work as expected.

There's two small differences between how retries work now and how they used to work in legacy alerting.

Retries only occur for valid alert definitions - if we suspect that that error comes from a malformed alert definition we skip retrying.
We have added a constant backoff of 1s in between retries.

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
(cherry picked from commit c631261)
gotjosh added a commit that referenced this pull request Dec 7, 2023
Alerting: Attempt to retry retryable errors (#79161)

* Alerting: Attempt to retry retryable errors

Retrying has been broken for a good while now (at least since version 9.4) - this change attempts to re-introduce them in their simplest and safest form possible.

I first introduced #79095 to make sure we don't disrupt or put additional load on our customer's data sources with this change in a patch release. Paired with this change, retries can now work as expected.

There's two small differences between how retries work now and how they used to work in legacy alerting.

Retries only occur for valid alert definitions - if we suspect that that error comes from a malformed alert definition we skip retrying.
We have added a constant backoff of 1s in between retries.

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
(cherry picked from commit c631261)
gotjosh added a commit that referenced this pull request Dec 7, 2023
Alerting: Attempt to retry retryable errors (#79161)

* Alerting: Attempt to retry retryable errors

Retrying has been broken for a good while now (at least since version 9.4) - this change attempts to re-introduce them in their simplest and safest form possible.

I first introduced #79095 to make sure we don't disrupt or put additional load on our customer's data sources with this change in a patch release. Paired with this change, retries can now work as expected.

There's two small differences between how retries work now and how they used to work in legacy alerting.

Retries only occur for valid alert definitions - if we suspect that that error comes from a malformed alert definition we skip retrying.
We have added a constant backoff of 1s in between retries.

---------


(cherry picked from commit c631261)

Signed-off-by: gotjosh <josue.abreu@gmail.com>
gotjosh added a commit that referenced this pull request Dec 7, 2023
Alerting: Attempt to retry retryable errors (#79161)

* Alerting: Attempt to retry retryable errors

Retrying has been broken for a good while now (at least since version 9.4) - this change attempts to re-introduce them in their simplest and safest form possible.

I first introduced #79095 to make sure we don't disrupt or put additional load on our customer's data sources with this change in a patch release. Paired with this change, retries can now work as expected.

There's two small differences between how retries work now and how they used to work in legacy alerting.

Retries only occur for valid alert definitions - if we suspect that that error comes from a malformed alert definition we skip retrying.
We have added a constant backoff of 1s in between retries.

---------

Signed-off-by: gotjosh <josue.abreu@gmail.com>
(cherry picked from commit c631261)
@aangelisc aangelisc modified the milestones: 10.3.x, 10.2.3 Dec 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
add to changelog area/backend backport v9.4.x Mark PR for automatic backport to v9.4.x backport v9.5.x Bot will automatically open backport PR backport v10.0.x backport v10.1.x backport v10.2.x backport-failed Failed to generate backport PR. Please resolve conflicts and create one manually. product-approved Pull requests that are approved by product/managers and are allowed to be backported type/docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants