Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alerting: Fetch alerts from a remote Alertmanager #75844

Merged

Conversation

santihernandezc
Copy link
Contributor

@santihernandezc santihernandezc commented Oct 2, 2023

This PR enables the proxy Alertmanager struct to retrieve alerts from the remote Alertmanager. The methods for enabling this feature are part of the Alertmanager interface:

  • GetAlerts()
  • GetAlertGroups()

As we need to communicate to an Alertmanager via HTTP requests, these two methods now take a context.Context as the first argument.

The PutAlerts() method needs to be implemented in the way described in #76692

@@ -381,7 +381,7 @@ func (am *alertmanager) buildReceiverIntegrations(receiver *alertingNotify.APIRe
}

// PutAlerts receives the alerts and then sends them through the corresponding route based on whenever the alert has a receiver embedded or not
func (am *alertmanager) PutAlerts(postableAlerts apimodels.PostableAlerts) error {
func (am *alertmanager) PutAlerts(_ context.Context, postableAlerts apimodels.PostableAlerts) error {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use the context.Context in the internal Alertmanager, but these methods need to accept one to implement the Alertmanager interface and be able to swap Alertmanager structs in the future.

@@ -103,12 +110,7 @@ func (am *externalAlertmanager) GetSilence(ctx context.Context, silenceID string
return apimodels.GettableSilence{}, err
}

if res != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would need to add this guard to almost every method, I think it's overkill. I'd say it's safe to assume that the go-swagger package won't introduce a breaking change like returning a nil value when no error occurred.

@@ -36,7 +36,7 @@ type ScheduleService interface {
//
//go:generate mockery --name AlertsSender --structname AlertsSenderMock --inpackage --filename alerts_sender_mock.go --with-expecter
type AlertsSender interface {
Send(key ngmodels.AlertRuleKey, alerts definitions.PostableAlerts)
Send(ctx context.Context, key ngmodels.AlertRuleKey, alerts definitions.PostableAlerts)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only change here is adding a context.Context to the interface and the .Send() calls.
I also updated the mock for this interface.

pkg/services/ngalert/schedule/schedule_unit_test.go Outdated Show resolved Hide resolved
@@ -59,7 +59,7 @@ func TestProcessTicks(t *testing.T) {
mockedClock := clock.NewMock()

notifier := &AlertsSenderMock{}
notifier.EXPECT().Send(mock.Anything, mock.Anything).Return()
notifier.EXPECT().Send(mock.Anything, mock.Anything, mock.Anything).Return()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes in this file are:

  • Adding a first argument corresponding to the context.Context
  • Checking that the third (not second) argument is of type definitions.PostableAlerts

@santihernandezc santihernandezc marked this pull request as ready for review October 16, 2023 15:39
@santihernandezc santihernandezc requested review from a team, rwwiv, JacobsonMT, yuri-tceretian and grobinson-grafana and removed request for a team October 16, 2023 15:39
…ernandezc/forward_alerts_to_remote_alertmanager
@santihernandezc santihernandezc force-pushed the santihernandezc/forward_alerts_to_remote_alertmanager branch from 0872944 to 5c85df3 Compare October 16, 2023 20:38
@santihernandezc santihernandezc force-pushed the santihernandezc/forward_alerts_to_remote_alertmanager branch from be72f88 to 0f3224b Compare October 16, 2023 21:12
Copy link
Contributor

@gotjosh gotjosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, but please take a look at my comments.

Comment on lines +177 to +179
if !ok {
t.Skip("No Alertmanager URL provided")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is potentially dangerous - if someone makes a mistake and renames the CI variable we wouldn't be able to tell if this test is running or not. I'd rather abort the test and tell the user the devev command they to run to make sure the test runs correctly.

Copy link
Contributor Author

@santihernandezc santihernandezc Oct 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the same thing and started by not doing this but the CI failed in normal integration tests.
I then checked how we handle tests for e.g. Memcached, Redis storage, Redis cache, Postgres and MySQL which are the other integration tests that use docker containers. All of them use this pattern.

Copy link
Contributor

@gotjosh gotjosh Oct 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, I don't love this approach but if there's so much precedence I take it it might not be a simple problem. Can you file an issue and attach it in a separate task list to the epic so that can investigate later? There are ways to simply skip these tests if we don't have the right setup in the CI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue created: #76798

Comment on lines +159 to +171
alerts := make(alertingNotify.PostableAlerts, 0, len(postableAlerts.PostableAlerts))
for _, pa := range postableAlerts.PostableAlerts {
alerts = append(alerts, &alertingNotify.PostableAlert{
Annotations: pa.Annotations,
EndsAt: pa.EndsAt,
StartsAt: pa.StartsAt,
Alert: pa.Alert,
})
}

params := amalert.NewPostAlertsParamsWithContext(ctx).WithAlerts(alerts)
_, err := am.amClient.Alert.PostAlerts(params)
return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the implementation that we want here, I would prefer if we leave it out of this PR and just let this PR handle reads.

I'll leave you with the details so that you can file an issue and add it as a follow-up.

If you look closely at how prometheus handles sending of alerts to the alertmanager, you'll realise it's a combination of two things:

This is important because the notifier uses an internal queue to make sure alerts are delivered in batches when possible.

The good news is that we already have this implementation in Grafana, the bad news is that we'll need to adapt it so that it works with or in the externalAlertmanager. Please note that the way that "remote URLs" are configured in the notifier is not straightforward and require some level of finesse to implement as they rely on service discovery, to top that off, the notifier is used in other parts of Grafana (hence the public API).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue created #76692

}

func (am *externalAlertmanager) GetAlertGroups(active, silenced, inhibited bool, filter []string, receiver string) (apimodels.AlertGroups, error) {
return apimodels.AlertGroups{}, nil
func (am *externalAlertmanager) PutAlerts(ctx context.Context, postableAlerts apimodels.PostableAlerts) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add a recover here as PostAlerts can panic.

@@ -121,30 +123,62 @@ func (am *externalAlertmanager) ListSilences(ctx context.Context, filter []strin
return res.Payload, nil
}

func (am *externalAlertmanager) GetStatus() apimodels.GettableStatus {
return apimodels.GettableStatus{}
func (am *externalAlertmanager) GetAlerts(ctx context.Context, active, silenced, inhibited bool, filter []string, receiver string) (apimodels.GettableAlerts, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add a recover here as GetAlerts can panic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where exactly can it panic? The res.Payload bit?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetAlerts(params) panics

}

func (am *externalAlertmanager) GetAlerts(active, silenced, inhibited bool, filter []string, receiver string) (apimodels.GettableAlerts, error) {
return apimodels.GettableAlerts{}, nil
func (am *externalAlertmanager) GetAlertGroups(ctx context.Context, active, silenced, inhibited bool, filter []string, receiver string) (apimodels.AlertGroups, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add a recover here as GetAlertGroups can panic.

@santihernandezc santihernandezc changed the title Alerting: Post alerts to a remote Alertmanager and fetch them Alerting: Fetch alerts from a remote Alertmanager Oct 17, 2023
@gotjosh
Copy link
Contributor

gotjosh commented Oct 17, 2023

@grobinson-grafana can you elaborate under which conditions one would expect the panic? From reading the client code it seems to be there as a safeguard for code generation issues on upstream - with the amount of coverage we have on this client (all of our integrations test use it) I'm not sure it's worth it the recover.

But perhaps this is customary of using openapi clients?

Copy link
Contributor

@gotjosh gotjosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - but I'm dubious on wether we need the whole recover as part of the client calls.

@santihernandezc santihernandezc merged commit 61cb267 into main Oct 19, 2023
15 checks passed
@santihernandezc santihernandezc deleted the santihernandezc/forward_alerts_to_remote_alertmanager branch October 19, 2023 09:27
@grafana-delivery-bot grafana-delivery-bot bot modified the milestones: 10.2.x, 10.3.x Oct 19, 2023
@grobinson-grafana
Copy link
Contributor

@grobinson-grafana can you elaborate under which conditions one would expect the panic? From reading the client code it seems to be there as a safeguard for code generation issues on upstream - with the amount of coverage we have on this client (all of our integrations test use it) I'm not sure it's worth it the recover.

If you have 100% confidence in the codegen I'd be very happy to remove it. I figured anywhere there is an explicit panic, codegen or otherwise, better to be defensive and recover then risk one bug bringing down the entire process.

@aangelisc aangelisc modified the milestones: 10.3.x, 10.2.3 Dec 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
add to changelog area/backend kata:am_unify no-backport Skip backport of PR type/build-packaging type/ci Tasks related to Continuous Integration workflow
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

4 participants