-
Notifications
You must be signed in to change notification settings - Fork 11.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cloudwatch Logs: Set Alerting timeout to datasource config's logsTimeout (#72611) #72611
Cloudwatch Logs: Set Alerting timeout to datasource config's logsTimeout (#72611) #72611
Conversation
70aac6a
to
63d30bd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work, Ida!
I am fairly certain executor does not need to have the timeout. You ought to be able to inject your custom timeout for tests through the Settings
instead. This will simplify the code a little. Let me know if I have missed something, though!
Great Go code, these are purely functional questions/suggestions and not a remark on the Go code! |
What is this feature?
When getting results for CW Logs queries, the client needs to get the ID of the query first, then periodically poll the results with GetQueryResults. The Status of the query we get from AWS is "Running". Once the request resolves with Status "Done", it will contain complete results. More info here
In CloudwatchLogsQueryRunner for dashboards and explore, there is a timeout set in the ConfigEditor (default: 30 minutes) to retry query results for Cloudwatch Logs queries with GetQueryResults.
After some users noticed they were getting
"failed to query data: fetching of query results exceeded max number of attempts"
errors in their alerts, we decided to make this option configurable (only for alerts). However, in the process we discovered that the 30 minute configuration wasn't being used on the BE for alert queries. Instead, CW datasource plugin polled the results for 8 seconds maximum, which might be too short for some queries.So instead of making this configurable, we decided to first use this setting from ConfigEditor and make it 30 minutes.
Note that this means that some queries will timeout with the alerting runner's error "context cancelled", since the default value when running queries in this context is 30s (https://github.com/grafana/grafana/blob/main/conf/defaults.ini#L1207) but this can be increased per instance and doesn't need any change in the code
The next step in improving this flow will be to not poll the results every second, but to instead do something like in async-query-data, which will increase wait time with every request that returns still "Running".
After this, we could implement a query option input that will give the user more control over how often the results are polled per-query, but it might not be necessary after ☝🏻
Some additional things in this PR:
Who is this feature for?
[Add information on what kind of user the feature is for.]
Which issue(s) does this PR fix?:
Fixes #63305
Special notes for your reviewer:
When this is merged, I will comment here and here to inform the user and customer support about the change.
TODO: Add to what's new
Please check that: