Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CloudWatch: Correctly add dimension values to labels #74847

Merged
merged 13 commits into from
Sep 27, 2023
Merged

CloudWatch: Correctly add dimension values to labels #74847

merged 13 commits into from
Sep 27, 2023

Conversation

iwysiu
Copy link
Contributor

@iwysiu iwysiu commented Sep 13, 2023

What is this feature?

Fixes how dimension values are added to frame labels by fetching the values for any query dimension using a * and matchExact set to false.
I looked into it, and we're not getting passed the values as is on the return value from the AWS API, so this PR helps out the code in responseparser.go that iterates over all the dimension values and sees if any are contained in the returned label (which, if not set by the user, contains all the differentiating dimension values separated by spaces).
This has the same problem it would have previously had, which is if a dimension value for dimension A is a subset of a value for dimension B it could get mislabeled, but I don't think there's a way for us to work around that.

Which issue(s) does this PR fix?:

Fixes #72172

Special notes for your reviewer:
Should this be behind a feature toggle? I was checking pricing and the additional list metrics call for every query could start costing money if they burn through their 1,000,000 free api calls. Or should we look into caching these values?

Please check that:

  • It works as expected from a user's perspective.
  • If this is a pre-GA feature, it is behind a feature toggle.
  • The docs are updated, and if this is a notable improvement, it's added to our What's New doc.

@grafana-delivery-bot grafana-delivery-bot bot added this to the 10.2.x milestone Sep 13, 2023
@iwysiu iwysiu added add to changelog no-backport Skip backport of PR labels Sep 13, 2023
@iwysiu iwysiu marked this pull request as ready for review September 13, 2023 18:37
@iwysiu iwysiu requested a review from a team as a code owner September 13, 2023 18:37
@iwysiu iwysiu requested review from fridgepoet and idastambuk and removed request for a team September 13, 2023 18:37
@iwysiu iwysiu changed the title CloudWatch: correctly add dimension values to labels CloudWatch: Correctly add dimension values to labels Sep 13, 2023
@iwysiu iwysiu marked this pull request as draft September 14, 2023 21:02
@iwysiu iwysiu marked this pull request as ready for review September 14, 2023 21:30
@fridgepoet
Copy link
Member

fridgepoet commented Sep 18, 2023

Woohoo! This looks good!

This has the same problem it would have previously had, which is if a dimension value for dimension A is a subset of a value for dimension B it could get mislabeled, but I don't think there's a way for us to work around that.

Would you illustrate the above situation with a test case maybe? Didn't quite follow but will be useful to keep in mind by documenting it with a test case.

Should this be behind a feature toggle? I was checking pricing and the additional list metrics call for every query could start costing money if they burn through their 1,000,000 free api calls. Or should we look into caching these values?

How many people do we expect are falling into this condition?
I support the idea of caching these values, though!


I've just added some suggestions to harmonize our mock usage and move towards using testify/mock for our mocks.

pkg/tsdb/cloudwatch/mocks/cloudwatch_metric_api.go Outdated Show resolved Hide resolved
pkg/tsdb/cloudwatch/mocks/cloudwatch_metric_api.go Outdated Show resolved Hide resolved
pkg/tsdb/cloudwatch/get_dimension_values_test.go Outdated Show resolved Hide resolved
pkg/tsdb/cloudwatch/get_dimension_values_test.go Outdated Show resolved Hide resolved
pkg/tsdb/cloudwatch/time_series_query.go Outdated Show resolved Hide resolved
pkg/tsdb/cloudwatch/time_series_query_test.go Show resolved Hide resolved
pkg/tsdb/cloudwatch/get_dimension_values_test.go Outdated Show resolved Hide resolved
pkg/tsdb/cloudwatch/get_dimension_values_test.go Outdated Show resolved Hide resolved
pkg/tsdb/cloudwatch/get_dimension_values_test.go Outdated Show resolved Hide resolved
pkg/tsdb/cloudwatch/get_dimension_values_test.go Outdated Show resolved Hide resolved
Copy link
Contributor Author

@iwysiu iwysiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you illustrate the above situation with a test case maybe? Didn't quite follow but will be useful to keep in mind by documenting it with a test case.

I'm kind of against adding a test that confirms we're doing things wrong, but an example would be:
ex. Dimension 1 has values ["A", "A B"] and Dimension 2 has values ["B", and "C"]. For dimension 1="A" and dimension 2="B" AWS will by default produce the label "A B". When we find the labels, its going to get the labels as 1:"A B" and 2:"B".

How many people do we expect are falling into this condition?

I'm not really sure. the reason I'm hesitant is because we would be doing this for every dimension every time a query is called, which depending on how often a dashboard refreshes could be a lot.

pkg/tsdb/cloudwatch/get_dimension_values_test.go Outdated Show resolved Hide resolved
pkg/tsdb/cloudwatch/time_series_query.go Outdated Show resolved Hide resolved
pkg/tsdb/cloudwatch/time_series_query_test.go Outdated Show resolved Hide resolved
pkg/tsdb/cloudwatch/time_series_query_test.go Show resolved Hide resolved
@fridgepoet
Copy link
Member

I'm kind of against adding a test that confirms we're doing things wrong, but an example would be: ex. Dimension 1 has values ["A", "A B"] and Dimension 2 has values ["B", and "C"]. For dimension 1="A" and dimension 2="B" AWS will by default produce the label "A B". When we find the labels, its going to get the labels as 1:"A B" and 2:"B".

What would be the "right" way to do it?

I'm not really sure. the reason I'm hesitant is because we would be doing this for every dimension every time a query is called, which depending on how often a dashboard refreshes could be a lot.

Just every time a query is called with a wildcard and matchExact false, right?
What would a cache solution be like?

Copy link
Contributor Author

@iwysiu iwysiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the "right" way to do it?

The best way to do it would be for AWS to pass us the dimensions as a map/array as part of the result instead of smushed into a string (or smushed into a string with a distinct separator character that can't be in a label value), this is the best way within our power

Just every time a query is called with a wildcard and matchExact false, right?
What would a cache solution be like?

Yeah, but the free api request limit includes basically everything that isn't a "GetMetricsData" request.
Doing the math, if you have a query refreshing every 5 seconds (on the high end) with 1 wildcard dimension, you have 306024*30=1296000 listMetricsPages requests a month, which costs $2.96 for the listMetricsPages requests over the limit. And any additional wildcard dimension refreshing at the same rate is $12.96 a month.
Having done the math I'm going to look into implementing a simple cache that expires (after a day maybe?) to cut those down, because that's kind of ridiculous considering that we're not passing a time range parameter.

@iwysiu
Copy link
Contributor Author

iwysiu commented Sep 21, 2023

@fridgepoet if you would like to take another look now that it caches, the last 2 commits are the changes since you approved it

Copy link
Member

@fridgepoet fridgepoet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cache looks really nice!

continue
}

cacheKey := fmt.Sprintf("%s-%s-%s-%s", region, query.Namespace, query.MetricName, dimensionKey)
Copy link
Member

@fridgepoet fridgepoet Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering about edge cases, though I don't know everything about the uniqueness here.
Is there any risk for fetching the wrong data with the same key?
For example if there are custom Namespaces that happen to be the same region and same name, what happens there?

Are there any security issues?
To what extent could any of these be private to the account?
Will there be anything secure that we should not store in-memory too long?

How often can these values change, by the way? Will 24 hours be "too long"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call on the security, I moved the cache to be by instance so that you have to have the related permissions to access the cache. I should have done that anyway to avoid causing future multi-tenancy issues.

I don't think labels should change frequently. 24 hours of expiration gets us to 30 queries a month per dimension, and we could up it a bit. I'm not sure what the best expiration would be, but 1 hour or 720 queries per dimension probably wouldn't be crazy out of the 1 million. We could make it configurable? They could resave the datasource to clear the cache if they need, though I don't know if all users would have access to that.

api.On("ListMetricsPages").Return(nil)
_, err := executor.getDimensionValuesForWildcards(context.Background(), api, "us-east-1", []*models.CloudWatchQuery{query})
assert.Nil(t, err)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this test does not assert quite what we think.
The query input is mutated by the results from the first call to ListMetricsPages, so what happens during the second call is that the dimension is no longer a wildcard. The loop continues.
The assertions still pass, however, since the result is that ListMetricsPages doesn't get called again.

I'm aware I'm stepping back here, but would it be possible to not mutate the queries []*models.CloudWatchQuery in getDimensionValuesForWildcards? We are already using a function signature which returns the result type we want. I think it may be misleading to mutate the input as well as provide the output in the function signature. Let's make some copies internally and return that, what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to not mutate the queries.

Copy link
Contributor Author

@iwysiu iwysiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated it to cache by instance!

Sorry about the rebase, I needed to get the fix for the instance manager and forgot that it would mess up the history of the pr.

I started filling out a hosted grafana readiness doc so we can be more reassured about the safety of it, and we can run it by them once our team approves the pr.

api.On("ListMetricsPages").Return(nil)
_, err := executor.getDimensionValuesForWildcards(context.Background(), api, "us-east-1", []*models.CloudWatchQuery{query})
assert.Nil(t, err)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to not mutate the queries.

continue
}

cacheKey := fmt.Sprintf("%s-%s-%s-%s", region, query.Namespace, query.MetricName, dimensionKey)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call on the security, I moved the cache to be by instance so that you have to have the related permissions to access the cache. I should have done that anyway to avoid causing future multi-tenancy issues.

I don't think labels should change frequently. 24 hours of expiration gets us to 30 queries a month per dimension, and we could up it a bit. I'm not sure what the best expiration would be, but 1 hour or 720 queries per dimension probably wouldn't be crazy out of the 1 million. We could make it configurable? They could resave the datasource to clear the cache if they need, though I don't know if all users would have access to that.

@iwysiu
Copy link
Contributor Author

iwysiu commented Sep 26, 2023

(Made another commit to only cache when values are returned so users can't maliciously spam fake dimension keys)

@iwysiu iwysiu merged commit 06a35f5 into main Sep 27, 2023
19 checks passed
@iwysiu iwysiu deleted the iwysiu/72172 branch September 27, 2023 14:41
otilor pushed a commit to otilor/grafana that referenced this pull request Sep 28, 2023
Co-authored-by: Shirley <4163034+fridgepoet@users.noreply.github.com>
rwwiv pushed a commit that referenced this pull request Oct 2, 2023
Co-authored-by: Shirley <4163034+fridgepoet@users.noreply.github.com>
mildwonkey pushed a commit that referenced this pull request Oct 4, 2023
Co-authored-by: Shirley <4163034+fridgepoet@users.noreply.github.com>
@zerok zerok modified the milestones: 10.2.x, 10.2.0 Oct 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Issue with Dimension Values in Alerting with CloudWatch Datasource
3 participants