-
Notifications
You must be signed in to change notification settings - Fork 11.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CloudWatch: Correctly add dimension values to labels #74847
Conversation
5e97c93
to
6bedced
Compare
Woohoo! This looks good!
Would you illustrate the above situation with a test case maybe? Didn't quite follow but will be useful to keep in mind by documenting it with a test case.
How many people do we expect are falling into this condition? I've just added some suggestions to harmonize our mock usage and move towards using testify/mock for our mocks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you illustrate the above situation with a test case maybe? Didn't quite follow but will be useful to keep in mind by documenting it with a test case.
I'm kind of against adding a test that confirms we're doing things wrong, but an example would be:
ex. Dimension 1 has values ["A", "A B"] and Dimension 2 has values ["B", and "C"]. For dimension 1="A" and dimension 2="B" AWS will by default produce the label "A B". When we find the labels, its going to get the labels as 1:"A B" and 2:"B".
How many people do we expect are falling into this condition?
I'm not really sure. the reason I'm hesitant is because we would be doing this for every dimension every time a query is called, which depending on how often a dashboard refreshes could be a lot.
What would be the "right" way to do it?
Just every time a query is called with a wildcard and matchExact false, right? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be the "right" way to do it?
The best way to do it would be for AWS to pass us the dimensions as a map/array as part of the result instead of smushed into a string (or smushed into a string with a distinct separator character that can't be in a label value), this is the best way within our power
Just every time a query is called with a wildcard and matchExact false, right?
What would a cache solution be like?
Yeah, but the free api request limit includes basically everything that isn't a "GetMetricsData" request.
Doing the math, if you have a query refreshing every 5 seconds (on the high end) with 1 wildcard dimension, you have 306024*30=1296000 listMetricsPages requests a month, which costs $2.96 for the listMetricsPages requests over the limit. And any additional wildcard dimension refreshing at the same rate is $12.96 a month.
Having done the math I'm going to look into implementing a simple cache that expires (after a day maybe?) to cut those down, because that's kind of ridiculous considering that we're not passing a time range parameter.
0da4ca9
to
d9a8bcf
Compare
@fridgepoet if you would like to take another look now that it caches, the last 2 commits are the changes since you approved it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cache looks really nice!
continue | ||
} | ||
|
||
cacheKey := fmt.Sprintf("%s-%s-%s-%s", region, query.Namespace, query.MetricName, dimensionKey) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering about edge cases, though I don't know everything about the uniqueness here.
Is there any risk for fetching the wrong data with the same key?
For example if there are custom Namespaces that happen to be the same region and same name, what happens there?
Are there any security issues?
To what extent could any of these be private to the account?
Will there be anything secure that we should not store in-memory too long?
How often can these values change, by the way? Will 24 hours be "too long"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call on the security, I moved the cache to be by instance so that you have to have the related permissions to access the cache. I should have done that anyway to avoid causing future multi-tenancy issues.
I don't think labels should change frequently. 24 hours of expiration gets us to 30 queries a month per dimension, and we could up it a bit. I'm not sure what the best expiration would be, but 1 hour or 720 queries per dimension probably wouldn't be crazy out of the 1 million. We could make it configurable? They could resave the datasource to clear the cache if they need, though I don't know if all users would have access to that.
api.On("ListMetricsPages").Return(nil) | ||
_, err := executor.getDimensionValuesForWildcards(context.Background(), api, "us-east-1", []*models.CloudWatchQuery{query}) | ||
assert.Nil(t, err) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately this test does not assert quite what we think.
The query
input is mutated by the results from the first call to ListMetricsPages
, so what happens during the second call is that the dimension is no longer a wildcard. The loop continue
s.
The assertions still pass, however, since the result is that ListMetricsPages
doesn't get called again.
I'm aware I'm stepping back here, but would it be possible to not mutate the queries []*models.CloudWatchQuery
in getDimensionValuesForWildcards
? We are already using a function signature which returns the result type we want. I think it may be misleading to mutate the input as well as provide the output in the function signature. Let's make some copies internally and return that, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to not mutate the queries.
Co-authored-by: Shirley <4163034+fridgepoet@users.noreply.github.com>
19a9d1b
to
b825ef1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated it to cache by instance!
Sorry about the rebase, I needed to get the fix for the instance manager and forgot that it would mess up the history of the pr.
I started filling out a hosted grafana readiness doc so we can be more reassured about the safety of it, and we can run it by them once our team approves the pr.
api.On("ListMetricsPages").Return(nil) | ||
_, err := executor.getDimensionValuesForWildcards(context.Background(), api, "us-east-1", []*models.CloudWatchQuery{query}) | ||
assert.Nil(t, err) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to not mutate the queries.
continue | ||
} | ||
|
||
cacheKey := fmt.Sprintf("%s-%s-%s-%s", region, query.Namespace, query.MetricName, dimensionKey) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call on the security, I moved the cache to be by instance so that you have to have the related permissions to access the cache. I should have done that anyway to avoid causing future multi-tenancy issues.
I don't think labels should change frequently. 24 hours of expiration gets us to 30 queries a month per dimension, and we could up it a bit. I'm not sure what the best expiration would be, but 1 hour or 720 queries per dimension probably wouldn't be crazy out of the 1 million. We could make it configurable? They could resave the datasource to clear the cache if they need, though I don't know if all users would have access to that.
(Made another commit to only cache when values are returned so users can't maliciously spam fake dimension keys) |
Co-authored-by: Shirley <4163034+fridgepoet@users.noreply.github.com>
Co-authored-by: Shirley <4163034+fridgepoet@users.noreply.github.com>
Co-authored-by: Shirley <4163034+fridgepoet@users.noreply.github.com>
What is this feature?
Fixes how dimension values are added to frame labels by fetching the values for any query dimension using a
*
andmatchExact
set to false.I looked into it, and we're not getting passed the values as is on the return value from the AWS API, so this PR helps out the code in responseparser.go that iterates over all the dimension values and sees if any are contained in the returned label (which, if not set by the user, contains all the differentiating dimension values separated by spaces).
This has the same problem it would have previously had, which is if a dimension value for dimension
A
is a subset of a value for dimensionB
it could get mislabeled, but I don't think there's a way for us to work around that.Which issue(s) does this PR fix?:
Fixes #72172
Special notes for your reviewer:
Should this be behind a feature toggle? I was checking pricing and the additional list metrics call for every query could start costing money if they burn through their 1,000,000 free api calls. Or should we look into caching these values?
Please check that: