Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alerting: Fix migration edge-case race condition for silences #81206

Merged
merged 1 commit into from Feb 16, 2024

Conversation

JacobsonMT
Copy link
Member

What is this feature?

This change opts to bypass the unnecessary step of writing the silences to disk during the migration and instead write them directly to the kvstore.

Why do we need this feature?

If the db already has an entry in the kvstore for the silences of an alertmanager before the migration has taken place, then it's possible that the active alertmanager will overwrite the silence file created by the migration before it has a chance to load it into memory.

This should not happen normally but is possible in edge-cases.

The change avoids the race condition entirely and is more correct as we treat the database as the source of truth for AM state.

Special notes for your reviewer:

Can reproduce by adding an empty silence entry to the kvstore before migrating alerts that would normally create silences (such as alerts with NoData/Error set to keep_state).

@JacobsonMT JacobsonMT added type/bug area/alerting Grafana Alerting area/backend add to changelog area/alerting/migration Issues relating to legacy alerting migration labels Jan 24, 2024
@JacobsonMT JacobsonMT added this to the 10.4.x milestone Jan 24, 2024
@JacobsonMT JacobsonMT requested a review from a team as a code owner January 24, 2024 21:08
@JacobsonMT JacobsonMT requested review from rwwiv, yuri-tceretian and grobinson-grafana and removed request for a team January 24, 2024 21:08
@grobinson-grafana
Copy link
Contributor

This can happen because we moved the migration into the ngalert package where it runs at a different time? Or has this issue existed for a long time?

@JacobsonMT
Copy link
Member Author

This can happen because we moved the migration into the ngalert package where it runs at a different time? Or has this issue existed for a long time?

Good question, I'm not 100% sure but I don't see why the issue wouldn't have existed previously. I don't see anything special in ngalert.init that would cause the silence file to take precedence over an existing kvstore entry.

If the db already has an entry in the kvstore for the silences of an
alertmanager before the migration has taken place, then it's possible that the
active alertmanager will overwrite the silence file created by the migration
before it has a chance to load it into memory.

This should not happen normally but is possible in edge-cases.

This change opts to bypass the unnecessary step of writing the silences to disk
during the migration and instead write them directly to the kvstore. This avoids
 the race condition entirely and is more correct as we treat the database as the
  source of truth for AM state.
@JacobsonMT JacobsonMT force-pushed the jacobsonmt/fix-migration-silence-race branch from 9502a79 to 8a96f5c Compare February 16, 2024 15:19
Copy link
Contributor

@rwwiv rwwiv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Considering this likely existed already we should keep in mind we may need to backport it.

@JacobsonMT JacobsonMT merged commit e7c6e9c into main Feb 16, 2024
12 checks passed
@JacobsonMT JacobsonMT deleted the jacobsonmt/fix-migration-silence-race branch February 16, 2024 15:47
@aangelisc aangelisc modified the milestones: 10.4.x, 10.4.0 Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
add to changelog area/alerting/migration Issues relating to legacy alerting migration area/alerting Grafana Alerting area/backend type/bug
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet

4 participants