
Cloudwatch Metrics: Adjust error handling #79911

Conversation

idastambuk (Contributor):

What is this feature?
Partly fixes this:

The only way I see the error is in the network tab of my browser, which is not a great experience as a user, as it seems to imply it was a valid query that just had no data associated with it:

The problem here was that we had a catchError block in MetricsRunner that was getting skipped because we're using DataSourceWithBackend's query method now. There, the response from the CloudWatch backend is processed before it reaches the MetricsRunner. Errors are processed by turning the response into an error object and transforming it into an observable:

catchError((err) => {
  return of(toDataQueryResponse(err));
})

This response doesn't get caught in the catch block of the CWMetricsQueryRunner, so the errors weren't being propagated to the panel.
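The behavior described above can be sketched as follows. This is a simplified, hypothetical model (no rxjs, and the types are stand-ins for the real ones in @grafana/runtime): the point is that a failed request comes back as a normal response value carrying an errors array, so a downstream catchError-style handler never fires.

```typescript
// Simplified sketch (hypothetical shapes): DataSourceWithBackend converts a
// failed HTTP request into a *normal* response value with an `errors` array,
// so a downstream catchError in the metrics runner never sees a thrown error.

interface DataQueryError {
  refId?: string;
  message: string;
}

interface DataQueryResponse {
  data: unknown[];
  errors?: DataQueryError[];
}

// Stand-in for toDataQueryResponse(err): the error becomes plain data.
function toDataQueryResponse(err: { status: number; data: { message: string } }): DataQueryResponse {
  return { data: [], errors: [{ message: err.data.message }] };
}

// The runner therefore has to inspect response.errors instead of catching.
function handleResponse(res: DataQueryResponse): string {
  if (res.errors?.length) {
    return `query failed: ${res.errors[0].message}`; // surfaced on the panel
  }
  return `got ${res.data.length} frames`;
}
```

In this model, only an explicit check on response.errors can surface the failure, which is why the old catch block was dead code.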

The biggest change here is that I removed toast alerts for errors in CloudWatch. These alerts could be disruptive to the user, as we already display an error icon on every affected panel anyway:

[Screenshot 2023-12-28 at 10.50.18: panel showing the error icon]

I looked for throwError usage elsewhere in Grafana; it's used in Loki, Prometheus, and Graphite, but AFAIK in none of the AWS data sources.

However, throttling errors, i.e. errors concerning CloudWatch limits (https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_limits.html), were handled separately before.
I kept this handling and just added "Rate exceeded" (https://aws.amazon.com/blogs/mt/managing-monitoring-api-throttling-in-workloads/). I'm not sure if this should be removed in favor of just displaying the errors in the top left corner of the panel, but it could be argued that a user with a dashboard of many CloudWatch panels would want to know immediately upon opening the dashboard if they hit any quotas.
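The kept throttling check can be sketched roughly like this. The pattern list here is illustrative only, not the exact one in the code; the idea is that only quota/throttling messages (now including "Rate exceeded") still trigger a toast, while everything else relies on the regular panel error UI.

```typescript
// Illustrative sketch: substring match against known throttling wordings.
// The actual pattern set in the datasource may differ.
const throttlingPatterns = ['Throttling', 'LimitExceeded', 'Rate exceeded'];

function isThrottlingError(message: string): boolean {
  return throttlingPatterns.some((pattern) => message.includes(pattern));
}
```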

Which issue(s) does this PR fix?:

Fixes #

Special notes for your reviewer:
This should be done for Logs queries as well, since they're broken too, but that will be done in another PR.

Please check that:

  • It works as expected from a user's perspective.
  • If this is a pre-GA feature, it is behind a feature toggle.
  • The docs are updated, and if this is a notable improvement, it's added to our What's New doc.

@fridgepoet fridgepoet requested review from a team and kevinwcyu and removed request for fridgepoet and a team January 1, 2024 09:15
@@ -121,6 +122,7 @@ func (e *cloudWatchExecutor) executeTimeSeriesQuery(ctx context.Context, logger
Error: fmt.Errorf("metric request error: %q", err),
}
resultChan <- &responseWrapper{
RefId: getQueryRefIdFromErrorString(err.Error(), requestQueries),
Contributor Author (idastambuk):
This is more of an improvement. We don't really need the refId in the error in order to display the error in the panel's top left corner. However, we do need it to display the error in the query editor inside the panel.

@iwysiu (Contributor) left a comment:
I tried it and it worked! A couple of questions about the tests.

@sarahzinger (Member) left a comment:
Hmm tested this locally and I don't think it works? When I run an incorrect query like "nonsense + 2" (that was linked in the issue) I get this weird flash of an error message but then it disappears? Let me know if I'm missing something!

Screen.Recording.2024-01-05.at.10.08.32.AM.mov

dispatch: jest.fn(),
});
beforeEach(() => {
redux.setStore({
Member:

Do we use redux in CloudWatch? Why is this necessary again? Sorry, I know you didn't change this, just seeing it now lol

Contributor Author (idastambuk):

We have a ticket to remove this here: #80151

return { data: [] };
}

const lastError = findLast(res.data, (v) => !!v.error);
Member:

what's the deal again with "last error"? I feel like there was some kind of intentionality about showing the last error but I don't remember why?

Contributor Author (idastambuk):

I assume it was in order to only get one error to display. I guess at some point errors were passed in res.data, but they're not anymore; they're passed as a separate errors array (response.errors) from the DataSourceWithBackend service. As far as I can see, there isn't really a way for one metric query to return multiple errors from AWS and our backend.
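The shape change described here can be sketched with hypothetical, simplified types: errors used to be attached to individual frames in res.data (hence the findLast over the data), but they now arrive in a separate errors array on the response.

```typescript
// Hypothetical, simplified response shapes illustrating the change.
interface Frame {
  refId: string;
  error?: string; // old location; effectively always empty now
}

interface QueryResponse {
  data: Frame[];
  errors?: Array<{ refId?: string; message: string }>;
}

// With one metric query yielding at most one error, reading the first
// entry of res.errors replaces the old findLast over res.data.
function firstError(res: QueryResponse): string | undefined {
  return res.errors?.[0]?.message;
}
```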

(refId && !failedRefIds.includes(refId)) || res.includes(region) ? res : [...res, region],
[]
);
regionsAffected.forEach((region) => {
Member:

Can you talk through why there's logic here about affected regions? Is that only needed for throttling or something? Are there other errors we want to alert on?

Contributor Author (idastambuk):

Yeah, regions were only checked for throttling in the old error handling, so I kept it. Regarding other errors, some of the reasoning about this is in the description. TBH I originally wanted to remove all toaster alerts and just depend on the regular Grafana error UI, so it's definitely something we can discuss.
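The region collection in the diff snippet above can be read as the following sketch (the query shape here is hypothetical): a region is kept only if one of its queries' refIds failed with throttling, and duplicates are skipped, so the toast fires once per affected region.

```typescript
// Sketch of the reduce in the snippet above (hypothetical query shape):
// build a de-duplicated list of regions whose queries were throttled.
interface RequestQuery {
  refId: string;
  region: string;
}

function regionsAffected(queries: RequestQuery[], failedRefIds: string[]): string[] {
  return queries.reduce<string[]>(
    (res, { refId, region }) =>
      !failedRefIds.includes(refId) || res.includes(region) ? res : [...res, region],
    []
  );
}
```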

if (!isFrameError && err.data && err.data.message === 'Metric request error' && err.data.error) {
err.message = err.data.error;
return throwError(() => err);
catchError((err: unknown) => {
Member:

Sorry, why did we move our error handling out of catchError? Does it not get down here? Under what circumstances would catchError happen here?

Contributor Author (idastambuk):

There's a bit more info in the description about what caused this; basically, the DataSourceWithBackend query method that we use now returns data rather than throwing an error.

@idastambuk idastambuk requested a review from a team as a code owner January 11, 2024 12:14
@idastambuk idastambuk requested review from axelavargas and kaydelaney and removed request for a team January 11, 2024 12:14
@idastambuk (Contributor Author):

> Hmm tested this locally and I don't think it works? When I run an incorrect query like "nonsense + 2" (that was linked in the issue) I get this weird flash of an error message but then it disappears? Let me know if I'm missing something!
>
> Screen.Recording.2024-01-05.at.10.08.32.AM.mov

This PR was probably missing from the branch: #79943. I just merged main into it. Can you pull and try again?

@sarahzinger (Member) left a comment:

Looks like there's a lint error to fix, but I manually tested and it works for me!

@idastambuk idastambuk merged commit d3a89a2 into main Jan 15, 2024
15 checks passed
@idastambuk idastambuk deleted the 78819-cloudwatch-errors-do-not-always-get-surfaced-to-user-metrics branch January 15, 2024 16:19
@grafana-delivery-bot grafana-delivery-bot bot modified the milestones: 10.3.x, 10.4.x Jan 15, 2024
s0lesurviv0r pushed a commit to s0lesurviv0r/grafana that referenced this pull request Feb 3, 2024
@aangelisc aangelisc modified the milestones: 10.4.x, 10.4.0 Mar 6, 2024

Successfully merging this pull request may close these issues.

Cloudwatch Errors do not always get surfaced to user and instead show No Data
4 participants