
Cloudwatch Metrics: Adjust error handling #79911

Conversation

idastambuk (Contributor):

What is this feature?
Partly fixes this:

The only way I see the error is in the network tab of my browser, which is not a great experience as a user, as it seems to imply it was a valid query that just had no data associated with it:

The problem here was that we had a catchError block in MetricsRunner that was getting skipped because we're using DataSourceWithBackend's query method now. There, the response from the CloudWatch backend is processed before it reaches the MetricsRunner. Errors are processed by turning the response into an error object and transforming it into an observable:

catchError((err) => {
  return of(toDataQueryResponse(err));
})

This response doesn't get caught in the catch block of the CWMetricsQueryRunner, so the errors weren't being propagated to the panel.
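The behavior described above can be sketched as follows. This is a simplified, hypothetical model (no rxjs, and the types are stand-ins for the real ones in @grafana/runtime): the point is that a failed request comes back as a normal response value carrying an errors array, so a downstream catchError-style handler never fires.

```typescript
// Simplified sketch (hypothetical shapes): DataSourceWithBackend converts a
// failed HTTP request into a *normal* response value with an `errors` array,
// so a downstream catchError in the metrics runner never sees a thrown error.

interface DataQueryError {
  refId?: string;
  message: string;
}

interface DataQueryResponse {
  data: unknown[];
  errors?: DataQueryError[];
}

// Stand-in for toDataQueryResponse(err): the error becomes plain data.
function toDataQueryResponse(err: { status: number; data: { message: string } }): DataQueryResponse {
  return { data: [], errors: [{ message: err.data.message }] };
}

// The runner therefore has to inspect response.errors instead of catching.
function handleResponse(res: DataQueryResponse): string {
  if (res.errors?.length) {
    return `query failed: ${res.errors[0].message}`; // surfaced on the panel
  }
  return `got ${res.data.length} frames`;
}
```

In this model, only an explicit check on response.errors can surface the failure, which is why the old catch block was dead code.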

The biggest change here is that I removed toast alerts for errors in CloudWatch. These alerts could be disruptive to the user, as we already display an error icon on every affected panel anyway:

[Screenshot 2023-12-28 at 10.50.18: panel showing the error icon]

I looked for throwError usage elsewhere in Grafana; it's used in Loki, Prometheus, and Graphite, but AFAIK in none of the AWS data sources.

However, throttling errors, i.e. errors concerning CloudWatch limits (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_limits.html), were handled separately before.
I kept this handling and just added "Rate exceeded" (https://aws.amazon.com/blogs/mt/managing-monitoring-api-throttling-in-workloads/). I'm not sure if this should be removed in favor of just displaying the errors in the top left corner of the panel, but it could be argued that a user with a dashboard of many CloudWatch panels would want to know immediately upon opening the dashboard if they hit any quotas.
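The kept throttling check can be sketched roughly like this. The pattern list here is illustrative only, not the exact one in the code; the idea is that only quota/throttling messages (now including "Rate exceeded") still trigger a toast, while everything else relies on the regular panel error UI.

```typescript
// Illustrative sketch: substring match against known throttling wordings.
// The actual pattern set in the datasource may differ.
const throttlingPatterns = ['Throttling', 'LimitExceeded', 'Rate exceeded'];

function isThrottlingError(message: string): boolean {
  return throttlingPatterns.some((pattern) => message.includes(pattern));
}
```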

Which issue(s) does this PR fix?:

Fixes #

Special notes for your reviewer:
This should be done for Logs queries as well, since they're broken too, but that will be done in another PR.

Please check that:

  • It works as expected from a user's perspective.
  • If this is a pre-GA feature, it is behind a feature toggle.
  • The docs are updated, and if this is a notable improvement, it's added to our What's New doc.

@fridgepoet fridgepoet requested review from a team and kevinwcyu and removed request for fridgepoet and a team January 1, 2024 09:15
@@ -121,6 +122,7 @@ func (e *cloudWatchExecutor) executeTimeSeriesQuery(ctx context.Context, logger
Error: fmt.Errorf("metric request error: %q", err),
}
resultChan <- &responseWrapper{
RefId: getQueryRefIdFromErrorString(err.Error(), requestQueries),
Contributor Author (idastambuk):
This is more of an improvement. We don't really need the refId in the error in order to display the error in the panel's top left corner. However, we do need it to display the error in the query editor inside the panel.

@iwysiu (Contributor) left a comment:
I tried it and it worked! A couple of questions about the tests.

@sarahzinger (Member) left a comment:
Hmm tested this locally and I don't think it works? When I run an incorrect query like "nonsense + 2" (that was linked in the issue) I get this weird flash of an error message but then it disappears? Let me know if I'm missing something!

Screen.Recording.2024-01-05.at.10.08.32.AM.mov

dispatch: jest.fn(),
});
beforeEach(() => {
redux.setStore({
Member:

Do we use redux in CloudWatch? Why is this necessary again? Sorry, I know you didn't change this, just seeing it now lol

Contributor Author (idastambuk):

We have a ticket to remove this here: #80151

return { data: [] };
}

const lastError = findLast(res.data, (v) => !!v.error);
Member:

what's the deal again with "last error"? I feel like there was some kind of intentionality about showing the last error but I don't remember why?

Contributor Author (idastambuk):

I assume it was in order to only get one error to display. I guess at some point errors were passed in res.data, but they're not anymore; they're passed as a separate errors array (response.errors) from the DataSourceWithBackend service. As far as I can see, there isn't really a way for one metric query to return multiple errors from AWS and our backend.
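The shape change described here can be sketched with hypothetical, simplified types: errors used to be attached to individual frames in res.data (hence the findLast over the data), but they now arrive in a separate errors array on the response.

```typescript
// Hypothetical, simplified response shapes illustrating the change.
interface Frame {
  refId: string;
  error?: string; // old location; effectively always empty now
}

interface QueryResponse {
  data: Frame[];
  errors?: Array<{ refId?: string; message: string }>;
}

// With one metric query yielding at most one error, reading the first
// entry of res.errors replaces the old findLast over res.data.
function firstError(res: QueryResponse): string | undefined {
  return res.errors?.[0]?.message;
}
```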

(refId && !failedRefIds.includes(refId)) || res.includes(region) ? res : [...res, region],
[]
);
regionsAffected.forEach((region) => {
Member:

Can you talk through why there's logic here about affected regions? Is that only needed for throttling or something? Are there other errors we want to alert on?

Contributor Author (idastambuk):

Yeah, regions were only checked for throttling in the old error handling, so I kept it. Regarding other errors, some of the reasoning about this is in the description. TBH I originally wanted to remove all toaster alerts and just depend on the regular Grafana error UI, so it's definitely something we can discuss.
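The region collection in the diff snippet above can be read as the following sketch (the query shape here is hypothetical): a region is kept only if one of its queries' refIds failed with throttling, and duplicates are skipped, so the toast fires once per affected region.

```typescript
// Sketch of the reduce in the snippet above (hypothetical query shape):
// build a de-duplicated list of regions whose queries were throttled.
interface RequestQuery {
  refId: string;
  region: string;
}

function regionsAffected(queries: RequestQuery[], failedRefIds: string[]): string[] {
  return queries.reduce<string[]>(
    (res, { refId, region }) =>
      !failedRefIds.includes(refId) || res.includes(region) ? res : [...res, region],
    []
  );
}
```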

if (!isFrameError && err.data && err.data.message === 'Metric request error' && err.data.error) {
err.message = err.data.error;
return throwError(() => err);
catchError((err: unknown) => {
Member:

Sorry, why did we move our error handling out of catchError? Does it not get down here? Under what circumstances would catchError happen here?

Contributor Author (idastambuk):

There's a bit more info in the description about what caused this; basically, the DataSourceWithBackend query method that we use now returns data rather than throwing an error.

@idastambuk idastambuk requested a review from a team as a code owner January 11, 2024 12:14
@idastambuk idastambuk requested review from axelavargas and kaydelaney and removed request for a team January 11, 2024 12:14
@idastambuk (Contributor Author):

> Hmm tested this locally and I don't think it works? When I run an incorrect query like "nonsense + 2" (that was linked in the issue) I get this weird flash of an error message but then it disappears? Let me know if I'm missing something!
>
> Screen.Recording.2024-01-05.at.10.08.32.AM.mov

This PR was probably missing from the branch: #79943. I just merged main into it. Can you pull and try again?

@sarahzinger (Member) left a comment:

Looks like there's a lint error to fix, but I manually tested and it works for me!

@idastambuk idastambuk merged commit d3a89a2 into main Jan 15, 2024
15 checks passed
@idastambuk idastambuk deleted the 78819-cloudwatch-errors-do-not-always-get-surfaced-to-user-metrics branch January 15, 2024 16:19
@grafana-delivery-bot grafana-delivery-bot bot modified the milestones: 10.3.x, 10.4.x Jan 15, 2024
s0lesurviv0r pushed a commit to s0lesurviv0r/grafana that referenced this pull request Feb 3, 2024
@aangelisc aangelisc modified the milestones: 10.4.x, 10.4.0 Mar 6, 2024

Successfully merging this pull request may close these issues.

Cloudwatch Errors do not always get surfaced to user and instead show No Data
4 participants