Cloudwatch input changes for 1.20 are breaking #10027

Closed
cthiel42 opened this issue Oct 29, 2021 · 10 comments · Fixed by #10112
Labels
area/aws: AWS plugins including cloudwatch, ecs, kinesis
bug: unexpected problem or unintended behavior

Comments

@cthiel42
Contributor

Relevant telegraf.conf

[agent]
    interval = "1m"

[[outputs.prometheus_client]]
    listen = ":9273"
    metric_version = 2
    path = "/metrics"
    expiration_interval = "30s"
    export_timestamp = true

[[inputs.cloudwatch]]
    region = "us-east-1"
    period = "1m"
    delay = "5m"
    interval = "5m"
    namespace = "AWS/RDS"
    statistic_include = [ "average", "sum", "minimum", "maximum", "sample_count" ]
    [[inputs.cloudwatch.metrics]]
        names = ["CPUUtilization","DatabaseConnections","FreeStorageSpace"]
        [[inputs.cloudwatch.metrics.dimensions]]
            name = "DBInstanceIdentifier"
            value = "*"

System info

1.20.3 Docker Image

Docker

No response

Steps to reproduce

  1. Ensure you're using telegraf 1.20.3.
  2. Make a config file similar to the one above; in my experience it broke for all metrics.
  3. Ensure the location you're running the telegraf agent from has the necessary IAM permissions from AWS to query CloudWatch metrics (see the sketch below).
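
For step 3, here is a minimal sketch of how credentials can be supplied to the plugin if you are not relying on an instance profile; the option names follow the plugin's standard AWS credential settings, all values are placeholders, and by default the plugin falls back to the usual AWS credential chain. The read-only actions it needs are typically cloudwatch:ListMetrics and cloudwatch:GetMetricData.

[[inputs.cloudwatch]]
    region = "us-east-1"
    ## Optional explicit credentials; omit these to use the default AWS credential chain
    # access_key = "AKIA..."
    # secret_key = "..."
    ## Or assume a role with CloudWatch read access (placeholder ARN)
    # role_arn = "arn:aws:iam::123456789012:role/telegraf-cloudwatch-read"
    # profile = "default"
    period = "1m"
    delay = "5m"
    interval = "5m"
    namespace = "AWS/RDS"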

Expected behavior

The cloudwatch input should query and return all the available metrics and their corresponding time series values based on your conditions.

Actual behavior

All of the time series are returned like normal, except they all contain a value of 0.

Additional info

After doing a docker pull for the latest telegraf image and refreshing the container, all cloudwatch metrics went to 0. I noticed in the change log that there were some changes to the cloudwatch input recently. I refreshed telegraf last on October 12th, so the only changes to the cloudwatch input according to the changelog would be as a result of #9647.

I resolved my issue by specifying version 1.19 in order to avoid all the cloudwatch input changes over the past month or two. I attached a screenshot of a panel from one of our dashboards. Note that the metric is still returned; its value is just 0. This happened for all cloudwatch metrics collected by telegraf.

[Screenshot (Screen Shot 2021-10-29 at 10.14.57 AM): dashboard panel showing the metric value flatlining at 0]

cthiel42 added the bug label Oct 29, 2021
telegraf-tiger bot added the area/aws label Oct 29, 2021
@powersj
Contributor

powersj commented Oct 29, 2021

@cthiel42 Was this also an issue with 1.20.1 and 1.20.2? Are you doing anything else with the Dockerfile? Note that in 1.20.3 we stopped running telegraf as the root user.

@cthiel42
Contributor Author

@powersj Given that 1.20.2 was released on October 7th and I did my last deployment on October 12th with no problems, it appears that the issue was introduced in 1.20.3. And I'm not doing anything with a Dockerfile here; I'm just pulling and using your latest image.

I also wouldn't think this would be a user issue, as the plugin is able to communicate with AWS; it returns the available time series, so it should be getting the correct values from AWS. But I also don't know what else goes on behind the scenes with this plugin.

@powersj
Contributor

powersj commented Oct 29, 2021

@cthiel42 OK, can you run this in debug mode (e.g. --debug) to get us some logs, please?
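
For reference, debug logging can be enabled either by passing the --debug flag on the command line or by setting it in the agent section of the config; a minimal sketch:

[agent]
    interval = "1m"
    debug = true
    quiet = false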

@cthiel42
Contributor Author

@powersj Not much substance to the debug logs. I did confirm that I was getting the same issue while this was running.

2021-10-29T18:10:44Z I! Starting Telegraf 1.20.3
2021-10-29T18:10:44Z I! Using config file: /etc/telegraf/telegraf.conf
2021-10-29T18:10:44Z I! Loaded inputs: cloudwatch (3x)
2021-10-29T18:10:44Z I! Loaded aggregators: 
2021-10-29T18:10:44Z I! Loaded processors: 
2021-10-29T18:10:44Z I! Loaded outputs: prometheus_client
2021-10-29T18:10:44Z I! Tags enabled: host=ue1prometheus
2021-10-29T18:10:44Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ue1prometheus", Flush Interval:30s
2021-10-29T18:10:44Z D! [agent] Initializing plugins
2021-10-29T18:10:44Z D! [agent] Connecting outputs
2021-10-29T18:10:44Z D! [agent] Attempting connection to [outputs.prometheus_client]
2021-10-29T18:10:44Z I! [outputs.prometheus_client] Listening on http://[::]:9280/metrics
2021-10-29T18:10:44Z D! [agent] Successfully connected to outputs.prometheus_client
2021-10-29T18:10:44Z D! [agent] Starting service inputs
2021-10-29T18:11:14Z D! [outputs.prometheus_client] Wrote batch of 16 metrics in 153.911µs
2021-10-29T18:11:14Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2021-10-29T18:11:44Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2021-10-29T18:12:14Z D! [outputs.prometheus_client] Wrote batch of 16 metrics in 107.94µs
2021-10-29T18:12:14Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2021-10-29T18:12:44Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2021-10-29T18:13:14Z D! [outputs.prometheus_client] Wrote batch of 16 metrics in 107.159µs
2021-10-29T18:13:14Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2021-10-29T18:13:44Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2021-10-29T18:14:14Z D! [outputs.prometheus_client] Wrote batch of 16 metrics in 106.694µs
2021-10-29T18:14:14Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2021-10-29T18:14:44Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2021-10-29T18:15:14Z D! [outputs.prometheus_client] Wrote batch of 17 metrics in 224.662µs
2021-10-29T18:15:14Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2021-10-29T18:15:44Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2021-10-29T18:16:14Z D! [outputs.prometheus_client] Wrote batch of 16 metrics in 97.437µs
2021-10-29T18:16:14Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2021-10-29T18:16:44Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics

And here's the exact config I was using for this test

[agent]
    interval = "1m"
    round_interval = true
    metric_buffer_limit = 10000
    flush_buffer_when_full = true
    collection_jitter = "0s"
    flush_interval = "30s"
    flush_jitter = "0s"
    debug = true
    quiet = false
    logfile=""
    logfile_rotation_interval = "24h"
    logfile_rotation_max_archives = 5
    hostname = ""
[[outputs.prometheus_client]]
    listen = ":9280"
    metric_version = 2
    # basic_username = "Foo"
    # basic_password = "Bar"
    path = "/metrics"
    expiration_interval = "30s"
    # collectors_exclude = ["gocollector", "process"]
    export_timestamp = true
[[inputs.cloudwatch]]
    region = "us-east-1"
    period = "1m"
    delay = "5m"
    interval = "5m"
    namespace = "AWS/RDS"
    statistic_include = [ "average", "sum", "minimum", "maximum", "sample_count" ]
    [[inputs.cloudwatch.metrics]]

        names = ["CPUUtilization","DatabaseConnections","FreeableMemory","DiskQueueDepth","FreeStorageSpace","ReadLatency","WriteLatency","WriteIOPS","ReadIOPS","ReplicaLag"]

        [[inputs.cloudwatch.metrics.dimensions]]
            name = "DBInstanceIdentifier"
            value = "*"

[[inputs.cloudwatch]]
    region = "us-east-1"
    period = "1m"
    delay = "5m"
    interval = "1m"
    namespace = "AWS/FSx"
    statistic_include = [ "average", "sum", "minimum", "maximum", "sample_count" ]
    [[inputs.cloudwatch.metrics]]

        names = ["DataWriteBytes","FreeStorageCapacity","DataReadBytes"]

        [[inputs.cloudwatch.metrics.dimensions]]
            name = "FileSystemId"
            value = "*"

[[inputs.cloudwatch]]
    region = "us-east-1"
    period = "1m"
    delay = "5m"
    interval = "1m"
    namespace = "AWS/EC2"
    statistic_include = [ "average" ]
    [[inputs.cloudwatch.metrics]]

        names = ["StatusCheckFailed"]

        [[inputs.cloudwatch.metrics.dimensions]]
            name = "InstanceId"
            value = "*"

@matthiassb

I am seeing this issue as well - it doesn't appear to be zeroed out; rather, all values for each metric are the same (which of course can be zero).

I reverted to 1.20.2 and the issue is not present on that version; it seems to be related to #9647.

@lbatalha

lbatalha commented Nov 3, 2021

Same for us: all metric values seem to be the same after the restart. Example screenshot from our dashboards:
[Screenshot: dashboard graph of CPU utilization]
The point where the CPU dropped from the steady high load is when Telegraf was restarted after the update.
Reverting from 1.20.3 to 1.20.2 returns it to working order.

@powersj
Contributor

powersj commented Nov 3, 2021

Hi,

I have been able to reproduce this locally, I think. I have a PR up that adds some debugging code and also updates the AWS-related dependencies a bit further. With that I started to get metrics.

Could someone please try the artifacts linked in the comment on PR #10051?

Thanks!

@lbatalha

lbatalha commented Nov 5, 2021

Could someone please try the artifacts linked in the comment on PR #10051?

I have tested the .deb artifact (1.21.0~9c4dcf97-0_amd64.deb), and it produces the same issue: metrics go to 0. I did not see any particular debugging info in the journal.

@powersj
Contributor

powersj commented Nov 5, 2021

Can you confirm you were running with --debug?

@lbatalha

lbatalha commented Nov 5, 2021

Ah, it just took a while to scrape, my bad. Most metrics do seem to be zero:

2021-11-05T15:25:02Z D! [inputs.cloudwatch] forwarding_master_dml_throughput_minimum: found 1 values
2021-11-05T15:25:02Z D! [inputs.cloudwatch] recording metric: "cloudwatch_aws_rds" tags: "map[db_instance_identifier:stubbed-nft2-instance1 region:us-west-2]" fields: "forwarding_master_dml_throughput_minimum=%!s(float64=0)"
2021-11-05T15:25:02Z D! [inputs.cloudwatch] forwarding_master_dml_throughput_sum: found 1 values
2021-11-05T15:25:02Z D! [inputs.cloudwatch] recording metric: "cloudwatch_aws_rds" tags: "map[db_instance_identifier:stubbed-nft2-instance1 region:us-west-2]" fields: "forwarding_master_dml_throughput_sum=%!s(float64=0)"
2021-11-05T15:25:02Z D! [inputs.cloudwatch] forwarding_master_dml_throughput_sample_count: found 1 values
2021-11-05T15:25:02Z D! [inputs.cloudwatch] recording metric: "cloudwatch_aws_rds" tags: "map[db_instance_identifier:stubbed-nft2-instance1 region:us-west-2]" fields: "forwarding_master_dml_throughput_sample_count=%!s(float64=5)"
2021-11-05T15:25:02Z D! [inputs.cloudwatch] cpu_utilization_average: found 1 values
2021-11-05T15:25:02Z D! [inputs.cloudwatch] recording metric: "cloudwatch_aws_rds" tags: "map[db_cluster_identifier:stubbed-nft2 region:us-west-2 role:WRITER]" fields: "cpu_utilization_average=%!s(float64=0)"
2021-11-05T15:25:02Z D! [inputs.cloudwatch] cpu_utilization_maximum: found 1 values
2021-11-05T15:25:02Z D! [inputs.cloudwatch] recording metric: "cloudwatch_aws_rds" tags: "map[db_cluster_identifier:stubbed-nft2 region:us-west-2 role:WRITER]" fields: "cpu_utilization_maximum=%!s(float64=0)"
2021-11-05T15:25:02Z D! [inputs.cloudwatch] cpu_utilization_minimum: found 1 values
2021-11-05T15:25:02Z D! [inputs.cloudwatch] recording metric: "cloudwatch_aws_rds" tags: "map[db_cluster_identifier:stubbed-nft2 region:us-west-2 role:WRITER]" fields: "cpu_utilization_minimum=%!s(float64=0)"

There seem to be some "samples"; I'm unsure what that means exactly:

2021-11-05T15:25:02Z D! [inputs.cloudwatch] recording metric: "cloudwatch_aws_rds" tags: "map[database_class:db.r5.2xlarge region:us-west-2]" fields: "insert_latency_minimum=%!s(float64=0)"
2021-11-05T15:25:02Z D! [inputs.cloudwatch] insert_latency_sum: found 1 values
2021-11-05T15:25:02Z D! [inputs.cloudwatch] recording metric: "cloudwatch_aws_rds" tags: "map[database_class:db.r5.2xlarge region:us-west-2]" fields: "insert_latency_sum=%!s(float64=0)"
2021-11-05T15:25:02Z D! [inputs.cloudwatch] insert_latency_sample_count: found 1 values
2021-11-05T15:25:02Z D! [inputs.cloudwatch] recording metric: "cloudwatch_aws_rds" tags: "map[database_class:db.r5.2xlarge region:us-west-2]" fields: "insert_latency_sample_count=%!s(float64=5)"
2021-11-05T15:25:02Z D! [inputs.cloudwatch] forwarding_replica_select_throughput_average: found 1 values

Swapping between the artifact and 1.20.2, for query latency as above:
[Screenshot: query latency graph while swapping between the artifact and 1.20.2]

n2N8Z pushed a commit to n2N8Z/telegraf that referenced this issue Nov 16, 2021
powersj pushed a commit that referenced this issue Dec 1, 2021
phemmer added a commit to phemmer/telegraf that referenced this issue Feb 18, 2022