PrometheusMetricSampler query window time range #1717

rkferreira · 2021-10-20T13:21:09Z

Hi,

While setting up Cruise Control with "PrometheusMetricSampler", I ran into an issue collecting data from an AWS MSK cluster.

CC VERSION="2.5.42"

DefaultPrometheusQuerySupplier.java defines the "BROKER_CPU_UTIL" query with a 1-minute range window, but MSK reports CPU metrics over a 5-minute window.

As a result, samples are always discarded because the BROKER_CPU_UTIL metric is missing:

#SamplingUtils.java
    if (brokerLoad == null || !brokerLoad.brokerMetricAvailable(BROKER_CPU_UTIL)) {
      // Broker load or its BROKER_CPU_UTIL metric is not available.
      LOG.debug("{}partition {} because {} metric for broker {} is unavailable.", SKIP_BUILDING_SAMPLE_PREFIX,
                tpDotNotHandled, BROKER_CPU_UTIL, leaderId);
      return true;

My current fix:

--- cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/monitor/sampling/prometheus/DefaultPrometheusQuerySupplier.java	2021-10-04 15:27:09.000000000 -0300
+++ cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/monitor/sampling/prometheus/DefaultPrometheusQuerySupplier_aws.java	2021-10-19 22:18:19.000000000 -0300
@@ -23,7 +23,7 @@
     static {
         // broker metrics
         TYPE_TO_QUERY.put(BROKER_CPU_UTIL,
-            "1 - avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[1m]))");
+            "1 - avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m]))");
         TYPE_TO_QUERY.put(ALL_TOPIC_BYTES_IN,
             "kafka_server_BrokerTopicMetrics_OneMinuteRate{name=\"BytesInPerSec\",topic=\"\"}");
         TYPE_TO_QUERY.put(ALL_TOPIC_BYTES_OUT,

An improvement would be to make the Prometheus query window configurable.
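As a rough illustration of what a configurable window could look like, here is a minimal sketch that builds the same BROKER_CPU_UTIL query string with the range window passed in as a parameter. The class `CpuQueryWindowSketch` and the method `brokerCpuUtilQuery` are hypothetical names made up for this example; they are not part of Cruise Control, and this is not necessarily how the project addressed it.

```java
import java.util.Locale;

// Hypothetical helper, not part of Cruise Control: builds the BROKER_CPU_UTIL
// query with a caller-supplied range window instead of the hard-coded [1m].
final class CpuQueryWindowSketch {
    private CpuQueryWindowSketch() { }

    // windowMinutes is an assumed parameter; it could come from a config value.
    static String brokerCpuUtilQuery(int windowMinutes) {
        if (windowMinutes < 1) {
            throw new IllegalArgumentException("windowMinutes must be >= 1, got " + windowMinutes);
        }
        // Same query text as in the patch above, with only the window substituted.
        return String.format(Locale.ROOT,
            "1 - avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[%dm]))",
            windowMinutes);
    }
}
```

Calling `CpuQueryWindowSketch.brokerCpuUtilQuery(5)` yields the query from the `+` line of the patch, and `brokerCpuUtilQuery(1)` yields the original one.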

Thanks,
Rodrigo Kellermann Ferreira

efeg · 2021-11-06T01:54:14Z

Hi @rkferreira, thanks for reporting this issue!
Would you like to contribute your fix and/or the proposed improvement?

mohitpali · 2022-05-17T17:45:13Z

The metrics are being discarded because your scraping interval is > 30 seconds, and 1 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[1m])) is a Prometheus irate query, which requires at least 2 data points within the range window to calculate the rate.

I do agree, however, that there should be a provision to change the duration based on your scraping interval.
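To make the relationship between the scraping interval and the range window concrete, here is a small sketch. It assumes a conservative rule of thumb that the window should span at least two scrape intervals so that `irate` reliably sees two samples; the exact boundary depends on scrape timing. The class `IrateWindowSketch` and its methods are hypothetical and exist only for illustration.

```java
// Hypothetical helper for illustration only; not part of Cruise Control or Prometheus.
final class IrateWindowSketch {
    private IrateWindowSketch() { }

    // Assumed rule of thumb: a window of at least 2x the scrape interval
    // should contain the two samples that irate needs.
    static long minWindowSeconds(long scrapeIntervalSeconds) {
        if (scrapeIntervalSeconds <= 0) {
            throw new IllegalArgumentException("scrapeIntervalSeconds must be positive");
        }
        return 2 * scrapeIntervalSeconds;
    }

    // True if the given window satisfies the rule of thumb above.
    static boolean windowCoversTwoScrapes(long windowSeconds, long scrapeIntervalSeconds) {
        return windowSeconds >= minWindowSeconds(scrapeIntervalSeconds);
    }
}
```

Under this assumption, the default `[1m]` window passes the check only for scrape intervals of 30 seconds or less, which is consistent with the explanation above.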

efeg added the robustness label ("Makes the project tolerate or handle perturbations.") · Nov 6, 2021

mohitpali mentioned this issue Jul 21, 2022

Make Prometheus broker cpu metric query configurable #1867

Merged
