Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optionally use GetMetricData instead of GetMetricStatistics #414

Conversation

or-shachar
Copy link
Contributor

@or-shachar or-shachar commented Apr 14, 2022

Trying to resolve #134 in an opt-in way, leaving the previous GetMetricStatistics option still as default.

Strategy:

Add per rule / global config called use_get_metric_data (boolean) to allow selection between getMetricStatistics or the new GetMetricData

Implementation:

  1. Extracted the metric-data retrieval operation to a separate interface (see docs).
  2. Extracted the existing behavior into an implementation called GetMetricStatisticsDataGetter
  3. By default CloudWatchCollector will use it. This is why all existing UT pass.
  4. In case use_get_metric_data it set to true - we expect no calls to be made to cloudwatchClient.getMetricStatistics (added UT for that).
  5. Created alternative GetMetricDataDataGetter implementation.

Additional changes

  1. Instead of managing separate list of samples - there are two maps of Statistic -> Samples and ExtendedValue -> Samples and we populate those lists accordingly.
  2. [Testing] I moved all matches to new file called "RequestsMatchers" to keep the test file a little more cleaned.

I did not implement all tests using the GetMetricData but selection of them

TODO:

  • Support edge cases for the getMetricData calls (error handling etc...)
  • Add more UT to cover critical areas in case the new class is used
  • In case we see > 500 data queries in one call - we need to break it to multiple smaller calls.

@or-shachar or-shachar force-pushed the or-shachar/getMetricStatistics-GetMetricData branch 2 times, most recently from 7512e68 to 2530e48 Compare April 19, 2022 07:34
@or-shachar or-shachar force-pushed the or-shachar/getMetricStatistics-GetMetricData branch from 2530e48 to d696fd9 Compare April 19, 2022 07:49
@matthiasr
Copy link
Contributor

It's great to see this worked on! Please let me know if you need guidance or a deeper review.

One question I have is what factors would make someone choose the existing implementation? When would this implementation be fit to become the default or only one?

@or-shachar
Copy link
Contributor Author

or-shachar commented Apr 20, 2022

Hey @matthiasr 👋

The original issue mentioned a possible gain in costs, but then later it was ruled out.

This official post is recommending using GetMetricData to reduce the number of API calls you make.

For instance - if you ask for ApproximateNumberOfMessagesVisible for all SQS queues, and you have 300 of them - you'd have to make 300 calls to MakeMetricStatistics but only a single call to GetMetricData.

@matthiasr
Copy link
Contributor

So there is no reason to keep GetMetricStatistics?

@or-shachar
Copy link
Contributor Author

In general - seems like there isn't.

Looking YACE - it seems like they chose to keep using GetMetricStatistic if the dimensions are statically defined - though I think they intend to remove it completely and an issue was opened about it a few months ago.

I can think of few reasons to keep having both options available first:

  1. The first method, though slow and inefficient, is battle tested. We want users to opt-in to new the method at their pace. Let's say that this would be supported in version v14.x and in version v15.x we will swap the default, and on version v16.x we will remove the old method.
  2. My current implementation does not support pagination. The default page size is 500 metrics. If for intsance - the list of dimetions list is bigger than that - it's safer to use GetMetricStatistics until we will introduce such support. (You can, of course, request that the support in pagination would be part of this PR - I'll work on it later on).

@matthiasr
Copy link
Contributor

matthiasr commented Apr 20, 2022 via email

@or-shachar or-shachar force-pushed the or-shachar/getMetricStatistics-GetMetricData branch from 1da4760 to d861f66 Compare April 21, 2022 11:12
@or-shachar
Copy link
Contributor Author

I think I handled the batching. No need to support pagination actually because we only care about single datapoint.

IIUC - single call can retrieve up to 500 metrics - so I partitioned the queries to batches of up to 500 and generated requests accordingly.

How do we generally handle runtime errors here?

@matthiasr
Copy link
Contributor

A runtime error should cause the scrape to fail (return a non-2xx HTTP status).

@or-shachar
Copy link
Contributor Author

@matthiasr - I think my editor formatted the CloudwatchCollector using a different styling sheet. What do we use here? Google Style Guide? Do we have a convention?

The easiest for me would be to configure the right formatter and re-format but if there isn't any convention I'll try to change it so that the file won't include formatting changes.

@matthiasr
Copy link
Contributor

There isn't a consistent style so far. Feel free to format the file, but please do so in a separate commit so the actual changes can be reviewed separately

@or-shachar or-shachar force-pushed the or-shachar/getMetricStatistics-GetMetricData branch 15 times, most recently from e40e7fa to c106448 Compare April 25, 2022 14:26
@or-shachar or-shachar force-pushed the or-shachar/getMetricStatistics-GetMetricData branch 3 times, most recently from 8be4ba5 to cd4d9ef Compare April 25, 2022 15:03
@or-shachar
Copy link
Contributor Author

K - I think I cleaned the unrelated changes. I think it's even ready for the first code review :-)
I'll update the description to explain a little the mechanism

@or-shachar or-shachar marked this pull request as ready for review April 25, 2022 15:24
@or-shachar
Copy link
Contributor Author

One thing that I thought of is that we might want to add another separate internal metric to count the amount metrics requested from Cloudwatch. When we use GetMetricStatistic it is simpler because it is 1:1 ratio but in a single GetMetricData you can ask for up to 500 metrics.
Since the pricing of Cloudwatch is not per request but per metric -I guess some users would appreciate having this exposed.

Copy link
Contributor

@matthiasr matthiasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thank you for organizing the code cleanly!

I agree that we should have a metric for metrics requested that maps to the expected cost.

Additionally, could you add a description of the setting to the README, with the trade offs between the two variants? Something along the lines of "same cost, faster, less tested, will become the default in the future".

.gitignore Outdated Show resolved Hide resolved
Signed-off-by: or-shachar <or.shachar@wiz.io>
Signed-off-by: or-shachar <or.shachar@wiz.io>
Signed-off-by: or-shachar <or.shachar@wiz.io>
Signed-off-by: or-shachar <or.shachar@wiz.io>
Signed-off-by: or-shachar <or.shachar@wiz.io>
@or-shachar or-shachar force-pushed the or-shachar/getMetricStatistics-GetMetricData branch 2 times, most recently from a86517d to e2027e2 Compare April 28, 2022 21:44
@or-shachar
Copy link
Contributor Author

or-shachar commented Apr 28, 2022

Hey @matthiasr ! made another commit, and rebased on top of the current master.

  • Added cloudwatch_metrics_requsted_total metric (please see that you agree with the labels I chose).
  • Added a whole README section with a table describing the tradeoff.

Feel free to correct me on anything :-)
Naming and documenting is hard and I want this to be comprehensive and agreeable.

@or-shachar
Copy link
Contributor Author

Also - if you happen to merge it - I prefer the squash and merge of course :)

Signed-off-by: or-shachar <or.shachar@wiz.io>
@or-shachar or-shachar force-pushed the or-shachar/getMetricStatistics-GetMetricData branch from e2027e2 to 3caa750 Compare April 30, 2022 11:18
@or-shachar
Copy link
Contributor Author

@matthiasr - do you think you'll get to it by the end of this week? :-) (no pressure... I'm just excited for it to be merged)

@matthiasr
Copy link
Contributor

Awesome, thanks a lot!

@matthiasr matthiasr merged commit d87d281 into prometheus:master May 6, 2022
@or-shachar
Copy link
Contributor Author

Thanks for the latest release.
I installed it on our testing env and overall it works well! (~70% faster response time!)

I did open #422 which I don't think is critical and I have another edit to do after #421 is merged but I don't think they're critical as well.

@SuperQ
Copy link
Member

SuperQ commented May 10, 2022

FYI, I rolled this out to my exporters. It was a massive improvement to our ingestion. Previous gathering that was mostly failing is now consistently returning in a stable scrape duration with 10s of thousands of metrics per scrape.

@george-angel
Copy link

Great improvement for us, much faster scrapes and order of magnitude less calls:

2022-05-11-190331_2525x773_scrot
2022-05-11-190342_2521x732_scrot

@Sk1v
Copy link

Sk1v commented Feb 24, 2023

Please let me know if it is possible to go to GetMetricStatistics? Perhaps I do not understand how to enable this in the exporter config

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reduce number of api operations by using GetMetricData API instead of GetMetricStatistics API
5 participants