PrometheusMetricSampler query window time range #1717

Open
rkferreira opened this issue Oct 20, 2021 · 2 comments
Labels
robustness Makes the project tolerate or handle perturbations.

Comments

@rkferreira

rkferreira commented Oct 20, 2021

Hi,

While setting up Cruise Control with "PrometheusMetricSampler", I ran into an issue collecting data from an AWS MSK cluster.

CC VERSION="2.5.42"

DefaultPrometheusQuerySupplier.java defines the "BROKER_CPU_UTIL" query with a range window of 1 minute, but MSK reports CPU metrics over a 5-minute window.

As a result, samples are always discarded because the BROKER_CPU_UTIL metric is missing:

#SamplingUtils.java
    if (brokerLoad == null || !brokerLoad.brokerMetricAvailable(BROKER_CPU_UTIL)) {
      // Broker load or its BROKER_CPU_UTIL metric is not available.
      LOG.debug("{}partition {} because {} metric for broker {} is unavailable.", SKIP_BUILDING_SAMPLE_PREFIX,
                tpDotNotHandled, BROKER_CPU_UTIL, leaderId);
      return true;
    }

My current fix:

--- cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/monitor/sampling/prometheus/DefaultPrometheusQuerySupplier.java	2021-10-04 15:27:09.000000000 -0300
+++ cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/monitor/sampling/prometheus/DefaultPrometheusQuerySupplier_aws.java	2021-10-19 22:18:19.000000000 -0300
@@ -23,7 +23,7 @@
     static {
         // broker metrics
         TYPE_TO_QUERY.put(BROKER_CPU_UTIL,
-            "1 - avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[1m]))");
+            "1 - avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m]))");
         TYPE_TO_QUERY.put(ALL_TOPIC_BYTES_IN,
             "kafka_server_BrokerTopicMetrics_OneMinuteRate{name=\"BytesInPerSec\",topic=\"\"}");
         TYPE_TO_QUERY.put(ALL_TOPIC_BYTES_OUT,

An improvement would be to make the Prometheus query window configurable.
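A configurable window could be sketched roughly as follows. This is only an illustration, not Cruise Control code: the class `CpuUtilQueryBuilder` and the method `cpuUtilQuery` are hypothetical names, and the sketch assumes the window is passed in as a `java.time.Duration` with whole-second granularity. The PromQL text itself is the one from the diff above, with the range selector parameterized.

```java
import java.time.Duration;

// Hypothetical helper (not part of Cruise Control): builds the BROKER_CPU_UTIL
// PromQL query with a caller-supplied range window instead of a hard-coded [1m].
class CpuUtilQueryBuilder {

    static String cpuUtilQuery(Duration window) {
        long seconds = window.getSeconds();
        if (seconds <= 0) {
            // Sub-second and non-positive windows are rejected in this sketch.
            throw new IllegalArgumentException("window must be at least one second: " + window);
        }
        // PromQL accepts durations expressed in seconds, e.g. [300s].
        return String.format(
            "1 - avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[%ds]))",
            seconds);
    }

    public static void main(String[] args) {
        // prints: 1 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[300s]))
        System.out.println(cpuUtilQuery(Duration.ofMinutes(5)));
    }
}
```

The returned string could then be used in place of the literal passed to `TYPE_TO_QUERY.put(BROKER_CPU_UTIL, ...)`; how the window value would be read from configuration is left open here.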

Thanks,
Rodrigo Kellermann Ferreira

@efeg
Collaborator

efeg commented Nov 6, 2021

Hi @rkferreira Thanks for reporting this issue!
Would you like to contribute your fix and/or the proposed improvement?

@efeg added the robustness label on Nov 6, 2021
@mohitpali
Contributor

The metrics are being discarded because your scrape interval is greater than 30 seconds, and 1 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[1m])) is a Prometheus irate query, which requires at least two data points within the window to calculate the rate.

However, I agree that there should be a way to change the duration based on your scrape interval.
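The two-data-point requirement can be turned into a quick sanity check. The sketch below is an assumption-laden illustration, not Cruise Control or Prometheus code: `IrateWindowCheck`, `minimumWindow`, and `coversTwoSamples` are hypothetical names, and the rule used (window at least twice the scrape interval) assumes evenly spaced scrapes with no jitter or failed scrapes. The 45-second interval in `main` is an arbitrary example value.

```java
import java.time.Duration;

// Hypothetical helper: checks whether a range window is wide enough for irate(),
// which needs at least two samples inside the window.
class IrateWindowCheck {

    // Smallest window that always contains two evenly spaced samples
    // (ignores scrape jitter and missed scrapes).
    static Duration minimumWindow(Duration scrapeInterval) {
        return scrapeInterval.multipliedBy(2);
    }

    static boolean coversTwoSamples(Duration scrapeInterval, Duration window) {
        return window.compareTo(minimumWindow(scrapeInterval)) >= 0;
    }

    public static void main(String[] args) {
        Duration scrape = Duration.ofSeconds(45); // example value, above the 30 s threshold
        System.out.println(coversTwoSamples(scrape, Duration.ofMinutes(1))); // prints: false
        System.out.println(coversTwoSamples(scrape, Duration.ofMinutes(5))); // prints: true
    }
}
```

In practice a wider margin than the bare minimum is likely safer, since real scrapes are not perfectly regular.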
