Cloudwatch input changes for 1.20 are breaking #10027

Closed
cthiel42 opened this issue Oct 29, 2021 · 10 comments · Fixed by #10112
Labels
area/aws: AWS plugins including cloudwatch, ecs, kinesis
bug: unexpected problem or unintended behavior

Comments

@cthiel42
Contributor

Relevant telegraf.conf

[agent]
    interval = "1m"

[[outputs.prometheus_client]]
    listen = ":9273"
    metric_version = 2
    path = "/metrics"
    expiration_interval = "30s"
    export_timestamp = true

[[inputs.cloudwatch]]
    region = "us-east-1"
    period = "1m"
    delay = "5m"
    interval = "5m"
    namespace = "AWS/RDS"
    statistic_include = [ "average", "sum", "minimum", "maximum", "sample_count" ]
    [[inputs.cloudwatch.metrics]]
        names = ["CPUUtilization","DatabaseConnections","FreeStorageSpace"]
        [[inputs.cloudwatch.metrics.dimensions]]
            name = "DBInstanceIdentifier"
            value = "*"

System info

1.20.3 Docker Image

Docker

No response

Steps to reproduce

  1. Ensure you're using telegraf 1.20.3.
  2. Make a config file similar to the one above; in my experience it broke for all metrics.
  3. Ensure the location you're running the telegraf agent from has the necessary IAM permissions from AWS to query CloudWatch metrics (see the sketch below).
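
For step 3, here is a minimal sketch of how credentials can be supplied to the plugin if you are not relying on an instance profile; the option names follow the plugin's standard AWS credential settings, all values are placeholders, and by default the plugin falls back to the usual AWS credential chain. The read-only actions it needs are typically cloudwatch:ListMetrics and cloudwatch:GetMetricData.

[[inputs.cloudwatch]]
    region = "us-east-1"
    ## Optional explicit credentials; omit these to use the default AWS credential chain
    # access_key = "AKIA..."
    # secret_key = "..."
    ## Or assume a role with CloudWatch read access (placeholder ARN)
    # role_arn = "arn:aws:iam::123456789012:role/telegraf-cloudwatch-read"
    # profile = "default"
    period = "1m"
    delay = "5m"
    interval = "5m"
    namespace = "AWS/RDS"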

Expected behavior

The cloudwatch input should query and return all the available metrics and their corresponding time series values based on your conditions.

Actual behavior

All of the time series are returned like normal, except they all contain a value of 0.

Additional info

After doing a docker pull for the latest telegraf image and refreshing the container, all cloudwatch metrics went to 0. I noticed in the change log that there were some changes to the cloudwatch input recently. I refreshed telegraf last on October 12th, so the only changes to the cloudwatch input according to the changelog would be as a result of #9647.

I resolved my issue by specifying version 1.19 in order to avoid all the cloudwatch input changes over the past month or two. I attached a screenshot of a panel from one of our dashboards. Note that the metric is still returned; its value is just 0. This happened for all cloudwatch metrics collected by telegraf.

[Screenshot (Screen Shot 2021-10-29 at 10.14.57 AM): dashboard panel showing the metric value flatlining at 0]

cthiel42 added the bug label Oct 29, 2021
telegraf-tiger bot added the area/aws label Oct 29, 2021
@powersj
Contributor

powersj commented Oct 29, 2021

@cthiel42 Was this also an issue with 1.20.1 and 1.20.2? Are you doing anything else with the Dockerfile? Note that in 1.20.3 we stopped running telegraf as the root user.

@cthiel42
Contributor Author

@powersj Given that 1.20.2 was released on October 7th and I did my last deployment on October 12th with no problems, it appears that the issue was introduced in 1.20.3. And I'm not doing anything with a Dockerfile here; I'm just pulling and using your latest image.

I also wouldn't think this would be a user issue, as the plugin is able to communicate with AWS; it returns the available time series, so it should be getting the correct values from AWS. But I also don't know what else goes on behind the scenes with this plugin.

@powersj
Contributor

powersj commented Oct 29, 2021

@cthiel42 OK, can you run this in debug mode (e.g. --debug) to get us some logs, please?
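
For reference, debug logging can be enabled either by passing the --debug flag on the command line or by setting it in the agent section of the config; a minimal sketch:

[agent]
    interval = "1m"
    debug = true
    quiet = false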

@cthiel42
Contributor Author

@powersj Not much substance to the debug logs. I did confirm that I was getting the same issue while this was running.

2021-10-29T18:10:44Z I! Starting Telegraf 1.20.3
2021-10-29T18:10:44Z I! Using config file: /etc/telegraf/telegraf.conf
2021-10-29T18:10:44Z I! Loaded inputs: cloudwatch (3x)
2021-10-29T18:10:44Z I! Loaded aggregators: 
2021-10-29T18:10:44Z I! Loaded processors: 
2021-10-29T18:10:44Z I! Loaded outputs: prometheus_client
2021-10-29T18:10:44Z I! Tags enabled: host=ue1prometheus
2021-10-29T18:10:44Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ue1prometheus", Flush Interval:30s
2021-10-29T18:10:44Z D! [agent] Initializing plugins
2021-10-29T18:10:44Z D! [agent] Connecting outputs
2021-10-29T18:10:44Z D! [agent] Attempting connection to [outputs.prometheus_client]
2021-10-29T18:10:44Z I! [outputs.prometheus_client] Listening on http://[::]:9280/metrics
2021-10-29T18:10:44Z D! [agent] Successfully connected to outputs.prometheus_client
2021-10-29T18:10:44Z D! [agent] Starting service inputs
2021-10-29T18:11:14Z D! [outputs.prometheus_client] Wrote batch of 16 metrics in 153.911µs
2021-10-29T18:11:14Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2021-10-29T18:11:44Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2021-10-29T18:12:14Z D! [outputs.prometheus_client] Wrote batch of 16 metrics in 107.94µs
2021-10-29T18:12:14Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2021-10-29T18:12:44Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2021-10-29T18:13:14Z D! [outputs.prometheus_client] Wrote batch of 16 metrics in 107.159µs
2021-10-29T18:13:14Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2021-10-29T18:13:44Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2021-10-29T18:14:14Z D! [outputs.prometheus_client] Wrote batch of 16 metrics in 106.694µs
2021-10-29T18:14:14Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2021-10-29T18:14:44Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2021-10-29T18:15:14Z D! [outputs.prometheus_client] Wrote batch of 17 metrics in 224.662µs
2021-10-29T18:15:14Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2021-10-29T18:15:44Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2021-10-29T18:16:14Z D! [outputs.prometheus_client] Wrote batch of 16 metrics in 97.437µs
2021-10-29T18:16:14Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
2021-10-29T18:16:44Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics

And here's the exact config I was using for this test

[agent]
    interval = "1m"
    round_interval = true
    metric_buffer_limit = 10000
    flush_buffer_when_full = true
    collection_jitter = "0s"
    flush_interval = "30s"
    flush_jitter = "0s"
    debug = true
    quiet = false
    logfile=""
    logfile_rotation_interval = "24h"
    logfile_rotation_max_archives = 5
    hostname = ""
[[outputs.prometheus_client]]
    listen = ":9280"
    metric_version = 2
    # basic_username = "Foo"
    # basic_password = "Bar"
    path = "/metrics"
    expiration_interval = "30s"
    # collectors_exclude = ["gocollector", "process"]
    export_timestamp = true
[[inputs.cloudwatch]]
    region = "us-east-1"
    period = "1m"
    delay = "5m"
    interval = "5m"
    namespace = "AWS/RDS"
    statistic_include = [ "average", "sum", "minimum", "maximum", "sample_count" ]
    [[inputs.cloudwatch.metrics]]

        names = ["CPUUtilization","DatabaseConnections","FreeableMemory","DiskQueueDepth","FreeStorageSpace","ReadLatency","WriteLatency","WriteIOPS","ReadIOPS","ReplicaLag"]

        [[inputs.cloudwatch.metrics.dimensions]]
            name = "DBInstanceIdentifier"
            value = "*"

[[inputs.cloudwatch]]
    region = "us-east-1"
    period = "1m"
    delay = "5m"
    interval = "1m"
    namespace = "AWS/FSx"
    statistic_include = [ "average", "sum", "minimum", "maximum", "sample_count" ]
    [[inputs.cloudwatch.metrics]]

        names = ["DataWriteBytes","FreeStorageCapacity","DataReadBytes"]

        [[inputs.cloudwatch.metrics.dimensions]]
            name = "FileSystemId"
            value = "*"

[[inputs.cloudwatch]]
    region = "us-east-1"
    period = "1m"
    delay = "5m"
    interval = "1m"
    namespace = "AWS/EC2"
    statistic_include = [ "average" ]
    [[inputs.cloudwatch.metrics]]

        names = ["StatusCheckFailed"]

        [[inputs.cloudwatch.metrics.dimensions]]
            name = "InstanceId"
            value = "*"

@matthiassb

I am seeing this issue as well - it doesn't appear to be zeroed out; rather, all values for each metric are the same (which of course can be zero).

I reverted to 1.20.2 and the issue is not present on that version; it seems to be related to #9647.

@lbatalha

lbatalha commented Nov 3, 2021

Same for us: all metric values seem to be the same after the restart. Example screenshot from our dashboards:
[Screenshot: dashboard graph of CPU utilization]
The point where the CPU dropped from the steady high load is when Telegraf was restarted after the update.
Reverting from 1.20.3 to 1.20.2 returns it to working order.

@powersj
Contributor

powersj commented Nov 3, 2021

Hi,

I have been able to reproduce this locally, I think. I have a PR up that adds some debugging code and also updates the AWS-related dependencies a bit further. With that I started to get metrics.

Could someone please try the artifacts linked in the comment on PR #10051?

Thanks!

@lbatalha

lbatalha commented Nov 5, 2021

Could someone please try the artifacts linked in the comment on PR #10051?

I have tested the .deb artifact (1.21.0~9c4dcf97-0_amd64.deb), and it produces the same issue: metrics go to 0. I did not see any particular debugging info in the journal.

@powersj
Contributor

powersj commented Nov 5, 2021

Can you confirm you were running with --debug?

@lbatalha

lbatalha commented Nov 5, 2021

Ah, it just took a while to scrape, my bad. Most metrics do seem to be zero:

2021-11-05T15:25:02Z D! [inputs.cloudwatch] forwarding_master_dml_throughput_minimum: found 1 values
2021-11-05T15:25:02Z D! [inputs.cloudwatch] recording metric: "cloudwatch_aws_rds" tags: "map[db_instance_identifier:stubbed-nft2-instance1 region:us-west-2]" fields: "forwarding_master_dml_throughput_minimum=%!s(float64=0)"
2021-11-05T15:25:02Z D! [inputs.cloudwatch] forwarding_master_dml_throughput_sum: found 1 values
2021-11-05T15:25:02Z D! [inputs.cloudwatch] recording metric: "cloudwatch_aws_rds" tags: "map[db_instance_identifier:stubbed-nft2-instance1 region:us-west-2]" fields: "forwarding_master_dml_throughput_sum=%!s(float64=0)"
2021-11-05T15:25:02Z D! [inputs.cloudwatch] forwarding_master_dml_throughput_sample_count: found 1 values
2021-11-05T15:25:02Z D! [inputs.cloudwatch] recording metric: "cloudwatch_aws_rds" tags: "map[db_instance_identifier:stubbed-nft2-instance1 region:us-west-2]" fields: "forwarding_master_dml_throughput_sample_count=%!s(float64=5)"
2021-11-05T15:25:02Z D! [inputs.cloudwatch] cpu_utilization_average: found 1 values
2021-11-05T15:25:02Z D! [inputs.cloudwatch] recording metric: "cloudwatch_aws_rds" tags: "map[db_cluster_identifier:stubbed-nft2 region:us-west-2 role:WRITER]" fields: "cpu_utilization_average=%!s(float64=0)"
2021-11-05T15:25:02Z D! [inputs.cloudwatch] cpu_utilization_maximum: found 1 values
2021-11-05T15:25:02Z D! [inputs.cloudwatch] recording metric: "cloudwatch_aws_rds" tags: "map[db_cluster_identifier:stubbed-nft2 region:us-west-2 role:WRITER]" fields: "cpu_utilization_maximum=%!s(float64=0)"
2021-11-05T15:25:02Z D! [inputs.cloudwatch] cpu_utilization_minimum: found 1 values
2021-11-05T15:25:02Z D! [inputs.cloudwatch] recording metric: "cloudwatch_aws_rds" tags: "map[db_cluster_identifier:stubbed-nft2 region:us-west-2 role:WRITER]" fields: "cpu_utilization_minimum=%!s(float64=0)"

There seem to be some "samples"; I'm unsure what that means exactly:

2021-11-05T15:25:02Z D! [inputs.cloudwatch] recording metric: "cloudwatch_aws_rds" tags: "map[database_class:db.r5.2xlarge region:us-west-2]" fields: "insert_latency_minimum=%!s(float64=0)"
2021-11-05T15:25:02Z D! [inputs.cloudwatch] insert_latency_sum: found 1 values
2021-11-05T15:25:02Z D! [inputs.cloudwatch] recording metric: "cloudwatch_aws_rds" tags: "map[database_class:db.r5.2xlarge region:us-west-2]" fields: "insert_latency_sum=%!s(float64=0)"
2021-11-05T15:25:02Z D! [inputs.cloudwatch] insert_latency_sample_count: found 1 values
2021-11-05T15:25:02Z D! [inputs.cloudwatch] recording metric: "cloudwatch_aws_rds" tags: "map[database_class:db.r5.2xlarge region:us-west-2]" fields: "insert_latency_sample_count=%!s(float64=5)"
2021-11-05T15:25:02Z D! [inputs.cloudwatch] forwarding_replica_select_throughput_average: found 1 values

Swapping between the artifact and 1.20.2, for query latency as above:
[Screenshot: query latency graph while swapping between the artifact and 1.20.2]

n2N8Z pushed a commit to n2N8Z/telegraf that referenced this issue Nov 16, 2021
powersj pushed a commit that referenced this issue Dec 1, 2021
phemmer added a commit to phemmer/telegraf that referenced this issue Feb 18, 2022