Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cloudwatch Logs: Set Alerting timeout to datasource config's logsTimeout (#72611) #72611

Merged

Conversation

idastambuk
Copy link
Contributor

@idastambuk idastambuk commented Jul 31, 2023

What is this feature?

When getting results for CW Logs queries, the client needs to get the ID of the query first, then periodically poll the results with GetQueryResults. The Status of the query we get from AWS is "Running". Once the request resolves with Status "Done", it will contain complete results. More info here

In CloudwatchLogsQueryRunner for dashboards and explore, there is a timeout set in the ConfigEditor (default: 30 minutes) to retry query results for Cloudwatch Logs queries with GetQueryResults.

After some users noticed they were getting "failed to query data: fetching of query results exceeded max number of attempts" errors in their alerts, we decided to make this option configurable (only for alerts). However, in the process we discovered that the 30 minute configuration wasn't being used on the BE for alert queries. Instead, CW datasource plugin polled the results for 8 seconds maximum, which might be too short for some queries.

So instead of making this configurable, we decided to first use this setting from ConfigEditor and make it 30 minutes.
Note that this means that some queries will timeout with the alerting runner's error "context cancelled", since the default value when running queries in this context is 30s (https://github.com/grafana/grafana/blob/main/conf/defaults.ini#L1207) but this can be increased per instance and doesn't need any change in the code

The next step in improving this flow will be to not poll the results every second, but to instead do something like in async-query-data, which will increase wait time with every request that returns still "Running".
After this, we could implement a query option input that will give the user more control over how often the results are polled per-query, but it might not be necessary after ☝🏻

Some additional things in this PR:

  1. Made the tooltip in ConfigEditor more clear that this is not a retry after error, but a retry after "not yet done" response.
  2. Defined the default 30 minute timeout in executor struct, in order to be able to pass a shorter timeout for tests (otherwise the test with default value would run for 30 min 👀)
  3. Moved the comment about logs queries closer to where the requests for GetQueryResults were happening (from datasource.go to log_sync_query.go, since I felt it was easy to miss where it was. Lmk if there's a better place.

Who is this feature for?

[Add information on what kind of user the feature is for.]

Which issue(s) does this PR fix?:

Fixes #63305

Special notes for your reviewer:
When this is merged, I will comment here and here to inform the user and customer support about the change.
TODO: Add to what's new

Please check that:

  • It works as expected from a user's perspective.
  • If this is a pre-GA feature, it is behind a feature toggle.
  • The docs are updated, and if this is a notable improvement, it's added to our What's New doc.

@idastambuk idastambuk force-pushed the 63305-cloudwatch-logs-make-alertmaxattempts-configurable branch from 70aac6a to 63d30bd Compare July 31, 2023 16:02
@idastambuk idastambuk marked this pull request as ready for review July 31, 2023 16:15
@idastambuk idastambuk requested a review from a team as a code owner July 31, 2023 16:15
@idastambuk idastambuk requested review from fridgepoet and kevinwcyu and removed request for a team July 31, 2023 16:15
Copy link
Member

@fridgepoet fridgepoet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, Ida!

I am fairly certain executor does not need to have the timeout. You ought to be able to inject your custom timeout for tests through the Settings instead. This will simplify the code a little. Let me know if I have missed something, though!

pkg/tsdb/cloudwatch/cloudwatch.go Outdated Show resolved Hide resolved
pkg/tsdb/cloudwatch/models/settings.go Outdated Show resolved Hide resolved
pkg/tsdb/cloudwatch/cloudwatch.go Outdated Show resolved Hide resolved
pkg/tsdb/cloudwatch/log_sync_query.go Outdated Show resolved Hide resolved
pkg/tsdb/cloudwatch/log_sync_query_test.go Outdated Show resolved Hide resolved
@fridgepoet
Copy link
Member

Great Go code, these are purely functional questions/suggestions and not a remark on the Go code!

@idastambuk idastambuk changed the title Cloudwatch Logs: Set alert timeout from datasource config Cloudwatch Logs: Use alert timeout from datasource config Aug 2, 2023
@idastambuk idastambuk changed the title Cloudwatch Logs: Use alert timeout from datasource config Cloudwatch Logs: Set Alerting timeout to datasource config's logsTimeout (#72611) Aug 3, 2023
@idastambuk idastambuk merged commit abff6e2 into main Aug 3, 2023
16 checks passed
@idastambuk idastambuk deleted the 63305-cloudwatch-logs-make-alertmaxattempts-configurable branch August 3, 2023 17:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CloudWatch Logs: Make alertMaxAttempts configurable
3 participants