Cloudwatch Metrics: Adjust error handling #79911
Conversation
@@ -121,6 +122,7 @@ func (e *cloudWatchExecutor) executeTimeSeriesQuery(ctx context.Context, logger
			Error: fmt.Errorf("metric request error: %q", err),
		}
		resultChan <- &responseWrapper{
			RefId: getQueryRefIdFromErrorString(err.Error(), requestQueries),
This is more of an improvement. We don't really need the refId in the error in order to display the error in the top-left corner of the panel. However, we do need it to display the error in the query editor inside the panel.
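To illustrate the idea behind `getQueryRefIdFromErrorString` (the actual implementation is in Go; this TypeScript sketch, including the assumption that the AWS error message contains the metric query's id, is hypothetical):

```typescript
// Hypothetical sketch: map an AWS error message back to the query refId it
// belongs to, so the error can be shown in the right query editor row.
interface RequestQuery {
  refId: string;
  id: string; // the metric query id referenced in AWS error messages (assumed)
}

function getQueryRefIdFromErrorString(
  errorMessage: string,
  queries: RequestQuery[]
): string | undefined {
  // Find the first query whose id appears in the error text.
  return queries.find((q) => errorMessage.includes(q.id))?.refId;
}
```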
I tried it and it worked! A couple of questions about the tests.
public/app/plugins/datasource/cloudwatch/query-runner/CloudWatchMetricsQueryRunner.test.ts
Hmm, I tested this locally and I don't think it works? When I run an incorrect query like "nonsense + 2" (the one linked in the issue), I get a brief flash of an error message, but then it disappears. Let me know if I'm missing something!
Screen.Recording.2024-01-05.at.10.08.32.AM.mov
dispatch: jest.fn(),
});
beforeEach(() => {
redux.setStore({
Do we use Redux in CloudWatch? Why is this necessary again? Sorry, I know you didn't change this, I'm just seeing it now lol
We have a ticket to remove this here: #80151
return { data: [] };
}
const lastError = findLast(res.data, (v) => !!v.error);
What's the deal again with "last error"? I feel like there was some intentionality about showing the last error, but I don't remember why?
I assume it was in order to only display one error. I guess at some point errors were passed in res.data, but they're not anymore: they're passed as a separate errors array from the DatasourceWithBackend srv, in response.errors. As far as I can see, there isn't really a way for one metric query to return multiple errors from AWS and our BE.
public/app/plugins/datasource/cloudwatch/query-runner/CloudWatchMetricsQueryRunner.ts
(refId && !failedRefIds.includes(refId)) || res.includes(region) ? res : [...res, region],
[]
);
regionsAffected.forEach((region) => { |
Can you talk through why there's logic here about affected regions? Is that only needed for throttling or something? Are there other errors we want to alert on?
Yeah, regions were only checked for throttling in the old error handling, so I kept it. Regarding other errors, some of the reasoning about this is in the description. TBH I originally wanted to remove all toaster alerts and just depend on the regular Grafana error UI, so it's definitely something we can discuss.
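The "regions affected" reduction being discussed can be sketched as a standalone function (shapes assumed; the real code operates on the request's queries and the throttled refIds):

```typescript
// Sketch: collect the distinct regions of queries whose refId failed with a
// throttling error, so one alert can be raised per affected region rather than
// per failed query.
interface MetricQuery {
  refId: string;
  region: string;
}

function getThrottledRegions(queries: MetricQuery[], failedRefIds: string[]): string[] {
  return queries.reduce<string[]>(
    (res, { refId, region }) =>
      !failedRefIds.includes(refId) || res.includes(region) ? res : [...res, region],
    []
  );
}
```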
if (!isFrameError && err.data && err.data.message === 'Metric request error' && err.data.error) {
err.message = err.data.error;
return throwError(() => err);
catchError((err: unknown) => { |
Sorry, why did we move our error handling out of catchError? Does it not get down here? Under what circumstances would catchError fire here?
There's a bit more info in the description about what caused this; basically it's because DatasourceWithBackend, which we use now, returns data rather than an error.
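For context, the old catchError logic unwrapped a "Metric request error" wrapper from a thrown error object. A minimal sketch of that unwrap as a plain function (error shape assumed from the diff above):

```typescript
// Sketch: replace the generic wrapper message with the underlying AWS error,
// mirroring the check that used to live inside catchError. Since
// DatasourceWithBackend emits errors as data, an equivalent check now has to
// run while mapping the response instead of in catchError.
interface WrappedError {
  message?: string;
  data?: { message?: string; error?: string };
}

function unwrapMetricRequestError(err: WrappedError): WrappedError {
  if (err.data?.message === 'Metric request error' && err.data.error) {
    return { ...err, message: err.data.error };
  }
  return err;
}
```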
This PR was probably missing from the branch: #79943; I just merged main into it. Can you pull and try again?
Looks like there's a lint error to fix, but I manually tested and it works for me!
What is this feature?
Partly fixes this:
The problem here was that we had a catchError block in the MetricsRunner that was getting skipped because we're using DataSourceWithBackend's query method now. There, the response from the CloudWatch BE is processed before it goes into the Metrics Runner. The way errors are processed is that the response is turned into an error object and transformed into an observable. This response doesn't get caught in the catch block of the CWMetricsQueryRunner, so the errors weren't being propagated to the panel.

The biggest change here is that I removed toast alerts for errors in CloudWatch. These alerts could be disruptive to the user, and we display an error icon with every panel anyway:
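Roughly, the processing described above looks like this (the function name and shapes here are hypothetical, for illustration only; Grafana's actual code in DataSourceWithBackend differs):

```typescript
// Sketch: per-query errors from the backend are folded into the response's
// errors list and the response is then emitted as a value (e.g. via rxjs of()),
// not thrown -- which is why a downstream catchError never fires for them.
interface BackendQueryResult {
  refId: string;
  error?: string;
}

interface BuiltResponse {
  data: unknown[];
  errors: { refId: string; message: string }[];
}

function buildResponseWithErrors(results: BackendQueryResult[]): BuiltResponse {
  const errors = results
    .filter((r) => !!r.error)
    .map((r) => ({ refId: r.refId, message: r.error as string }));
  return { data: [], errors };
}
```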
I looked for throwError usage elsewhere in Grafana and it's used in Loki, Prometheus, and Graphite, but AFAIK in none of the AWS datasources.
However, throttling errors were handled separately before, i.e. errors concerning CW limits: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_limits.html
I kept this handling and just added "Rate exceeded" (https://aws.amazon.com/blogs/mt/managing-monitoring-api-throttling-in-workloads/). I'm not sure if this should be removed in favor of just displaying the errors in the top-left corner of the panel, but it could be argued that a user with a dashboard of many CW panels would want to know immediately upon opening the dashboard whether they hit any quotas.
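The throttling detection amounts to a substring match against known quota-related messages; a minimal sketch (the exact list of matched strings is an assumption, with "Rate exceeded" taken from the AWS blog post linked above):

```typescript
// Sketch: detect CloudWatch throttling by matching known quota-related
// error messages, including the "Rate exceeded" variant added in this PR.
const throttlingMessages = ['Throttling: Rate exceeded', 'Rate exceeded'];

function isThrottlingError(message: string): boolean {
  return throttlingMessages.some((m) => message.includes(m));
}
```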
Which issue(s) does this PR fix?:
Fixes #
Special notes for your reviewer:
This should be done for Logs queries as well, since they're broken too, but that will be done in another PR.
Please check that: