Datadog Metrics get multiplied by 10 for some unknown reason #10944

Closed
jrimmer-housecallpro opened this issue Apr 6, 2022 · 5 comments · Fixed by #10979
Labels
area/aws AWS plugins including cloudwatch, ecs, kinesis bug unexpected problem or unintended behavior

Comments

@jrimmer-housecallpro
Contributor

Relevant telegraf.conf

# Telegraf Configuration
#
# Telegraf is entirely plugin driven. All metrics are gathered from the
# declared inputs, and sent to the declared outputs.
#
# Plugins must be declared in here to be active.
# To deactivate a plugin, comment out the name and any variables.
#
# Use 'telegraf -config telegraf.conf -test' to see what metrics a config
# file would generate.
#
# Environment variables can be used anywhere in this config file, simply surround
# them with ${}. For strings the variable must be within quotes (ie, "${STR_VAR}"),
# for numbers and booleans they should be plain (ie, ${INT_VAR}, ${BOOL_VAR})

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "30s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 2500

  ## Maximum number of unwritten metrics per output.  Increasing this value
  ## allows for longer periods of output downtime without dropping metrics at the
  ## cost of higher maximum memory usage.
  metric_buffer_limit = 25000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  flush_interval = "30s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## By default or when set to "0s", precision will be set to the same
  ## timestamp order as the collection interval, with the maximum being 1s.
  ##   ie, when interval = "10s", precision will be "1s"
  ##       when interval = "250ms", precision will be "1ms"
  ## Precision will NOT be used for service inputs. It is up to each individual
  ## service input to set the timestamp at the appropriate precision.
  ## Valid time units are "ns", "us" (or "µs"), "ms", "s".
  precision = ""

  ## Log at debug level.
  # debug = false
  ## Log only error level messages.
  quiet = false

  ## Log target controls the destination for logs and can be one of "file",
  ## "stderr" or, on Windows, "eventlog".  When set to "file", the output file
  ## is determined by the "logfile" setting.
  # logtarget = "file"

  ## Name of the file to be logged to when using the "file" logtarget.  If set to
  ## the empty string then logs are written to stderr.
  # logfile = ""

  ## The logfile will be rotated after the time interval specified.  When set
  ## to 0 no time based rotation is performed.  Logs are rotated only when
  ## written to, if there is no log activity rotation may be delayed.
  # logfile_rotation_interval = "0d"

  ## The logfile will be rotated when it becomes larger than the specified
  ## size.  When set to 0 no size based rotation is performed.
  # logfile_rotation_max_size = "0MB"

  ## Maximum number of rotated archives to keep, any older logs are deleted.
  ## If set to -1, no archives are removed.
  # logfile_rotation_max_archives = 5

  ## Pick a timezone to use when logging or type 'local' for local time.
  ## Example: America/Chicago
  # log_with_timezone = ""

  ## Override default hostname, if empty use os.Hostname()
  hostname = ""
  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = false

###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################

# # Configuration for DataDog API to send metrics to.
[[outputs.datadog]]
  ## Datadog API key
  apikey = "${DD_API_KEY}"

  ## Connection timeout.
  # timeout = "5s"

  ## Write URL override; useful for debugging.
  # url = "https://app.datadoghq.com/api/v1/series"

  ## Set http_proxy (telegraf uses the system wide proxy settings if it isn't set)
  # http_proxy_url = "http://localhost:8888"


###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################


# Read metrics about cpu usage
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics
  collect_cpu_time = false
  ## If true, compute and report the sum of all non-idle CPU states
  report_active = false


# Read metrics about disk usage by mount point
[[inputs.disk]]
  ## By default stats will be gathered for all mount points.
  ## Setting mount_points will restrict the stats to only the specified mount points.
  # mount_points = ["/"]

  ## Ignore mount points by filesystem type.
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]


# Read metrics about disk IO by device
[[inputs.diskio]]
  ## By default, telegraf will gather stats for all devices including
  ## disk partitions.
  ## Setting devices will restrict the stats to the specified devices.
  # devices = ["sda", "sdb", "vd*"]
  ## Uncomment the following line if you need disk serial numbers.
  # skip_serial_number = false
  #
  ## On systems which support it, device metadata can be added in the form of
  ## tags.
  ## Currently only Linux is supported via udev properties. You can view
  ## available properties for a device by running:
  ## 'udevadm info -q property -n /dev/sda'
  ## Note: Most, but not all, udev properties can be accessed this way. Properties
  ## that are currently inaccessible include DEVTYPE, DEVNAME, and DEVPATH.
  # device_tags = ["ID_FS_TYPE", "ID_FS_USAGE"]
  #
  ## Using the same metadata source as device_tags, you can also customize the
  ## name of the device via templates.
  ## The 'name_templates' parameter is a list of templates to try and apply to
  ## the device. The template may contain variables in the form of '$PROPERTY' or
  ## '${PROPERTY}'. The first template which does not contain any variables not
  ## present for the device is used as the device name tag.
  ## The typical use case is for LVM volumes, to get the VG/LV name instead of
  ## the near-meaningless DM-0 name.
  # name_templates = ["$ID_FS_LABEL","$DM_VG_NAME/$DM_LV_NAME"]


# Get kernel statistics from /proc/stat
[[inputs.kernel]]
  # no configuration


# Read metrics about memory usage
[[inputs.mem]]
  # no configuration


# Get the number of processes and group them by status
[[inputs.processes]]
  # no configuration


# Read metrics about swap memory usage
[[inputs.swap]]
  # no configuration


# Read metrics about system load & uptime
[[inputs.system]]
  ## Uncomment to remove deprecated metrics.
  # fielddrop = ["uptime_format"]


# Read metrics about ECS containers
[[inputs.ecs]]
  ## ECS metadata url.
  ## Metadata v2 API is used if set explicitly. Otherwise,
  ## v3 metadata endpoint API is used if available.
  # endpoint_url = ""

  ## Containers to include and exclude. Globs accepted.
  ## Note that an empty array for both will include all containers
  # container_name_include = []
  # container_name_exclude = []

  ## Container states to include and exclude. Globs accepted.
  ## When empty only containers in the "RUNNING" state will be captured.
  ## Possible values are "NONE", "PULLED", "CREATED", "RUNNING",
  ## "RESOURCES_PROVISIONED", "STOPPED".
  # container_status_include = []
  # container_status_exclude = []

  ## ecs labels to include and exclude as tags.  Globs accepted.
  ## Note that an empty array for both will include all labels as tags
  ecs_label_include = [ "com.amazonaws.ecs.*" ]
  ecs_label_exclude = []

  ## Timeout for queries.
  # timeout = "5s"

###############################################################################
#                            service input plugins                            #
###############################################################################

# Statsd UDP/TCP Server
[[inputs.statsd]]
  ## Protocol, must be "tcp", "udp", "udp4" or "udp6" (default=udp)
  protocol = "udp"

  ## MaxTCPConnection - applicable when protocol is set to tcp (default=250)
  max_tcp_connections = 250

  ## Enable TCP keep alive probes (default=false)
  tcp_keep_alive = false

  ## Specifies the keep-alive period for an active network connection.
  ## Only applies to TCP sockets and will be ignored if tcp_keep_alive is false.
  ## Defaults to the OS configuration.
  # tcp_keep_alive_period = "2h"

  ## Address and port to host UDP listener on
  service_address = ":${TELEGRAF_AGENT_PORT}"

  ## The following configuration options control when telegraf clears its cache
  ## of previous values. If set to false, then telegraf will only clear its
  ## cache when the daemon is restarted.
  ## Reset gauges every interval (default=true)
  delete_gauges = true
  ## Reset counters every interval (default=true)
  delete_counters = true
  ## Reset sets every interval (default=true)
  delete_sets = true
  ## Reset timings & histograms every interval (default=true)
  delete_timings = true

  ## Percentiles to calculate for timing & histogram stats
  percentiles = [50.0, 90.0, 99.0, 99.9, 99.95, 100.0]

  ## separator to use between elements of a statsd metric
  ## DO NOT CHANGE THIS
  metric_separator = "."

  ## Parses tags in the datadog statsd format
  ## http://docs.datadoghq.com/guides/dogstatsd/
  parse_data_dog_tags = true

  ## Parses datadog extensions to the statsd format
  datadog_extensions = true

  ## Parses distributions metric as specified in the datadog statsd format
  ## https://docs.datadoghq.com/developers/metrics/types/?tab=distribution#definition
  datadog_distributions = true

  ## Statsd data translation templates, more info can be read here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/TEMPLATE_PATTERN.md
  # templates = [
  #     "cpu.* measurement*"
  # ]

  ## Number of UDP messages allowed to queue up, once filled,
  ## the statsd server will start dropping packets
  allowed_pending_messages = 10000

  ## Number of timing/histogram values to track per-measurement in the
  ## calculation of percentiles. Raising this limit increases the accuracy
  ## of percentiles but also increases the memory usage and cpu time.
  percentile_limit = 1000

  ## Max duration (TTL) for each metric to stay cached/reported without being updated.
  #max_ttl = "1000h"

# Global tags can be specified here in key="value" format.
[global_tags]
  # dc = "us-east-1" # will tag all metrics with dc=us-east-1
  # rack = "1a"
  ## Environment variables can be used as tags, and throughout the config file
  # user = "$USER"

Logs from Telegraf

2022-04-06T16:49:25.901-07:00	2022-04-06T23:49:25Z I! Starting Telegraf 1.21.4

2022-04-06T16:49:25.901-07:00	2022-04-06T23:49:25Z I! Using config file: /etc/telegraf/telegraf.conf

2022-04-06T16:49:25.902-07:00	2022-04-06T23:49:25Z I! Loaded inputs: cpu disk diskio ecs kernel mem processes statsd swap system

2022-04-06T16:49:25.902-07:00	2022-04-06T23:49:25Z I! Loaded aggregators:

2022-04-06T16:49:25.902-07:00	2022-04-06T23:49:25Z I! Loaded processors:

2022-04-06T16:49:25.902-07:00	2022-04-06T23:49:25Z I! Loaded outputs: datadog sumologic

2022-04-06T16:49:25.902-07:00	2022-04-06T23:49:25Z I! Tags enabled: host=baf01b4f1d4a

2022-04-06T16:49:25.902-07:00	2022-04-06T23:49:25Z I! [agent] Config: Interval:30s, Quiet:false, Hostname:"baf01b4f1d4a", Flush Interval:30s

2022-04-06T16:49:25.902-07:00	2022-04-06T23:49:25Z W! [inputs.statsd] 'parse_data_dog_tags' config option is deprecated, please use 'datadog_extensions' instead

2022-04-06T16:49:25.902-07:00	2022-04-06T23:49:25Z I! [inputs.statsd] UDP listening on "[::]:8126"

2022-04-06T16:49:25.902-07:00	2022-04-06T23:49:25Z I! [inputs.statsd] Started the statsd service on ":8126"

System info

Telegraf 1.21.4, Docker image 1.21-alpine, AWS ECS

Docker

Dockerfile:

FROM telegraf:1.21-alpine

COPY dist/telegraf.conf /etc/telegraf/telegraf.conf

RUN apk add --no-cache \
        curl \
        python3 \
        py3-pip \
    && pip3 install --upgrade pip \
    && pip3 install --no-cache-dir \
        awscli \
    && rm -rf /var/cache/apk/*

RUN aws --version

COPY custom_entrypoint.sh /custom_entrypoint.sh

COPY parse_tags.py /parse_tags.py

ENTRYPOINT ["/custom_entrypoint.sh"]

CMD ["telegraf"]


parse_tags.py:

#!/usr/bin/env python

import sys
import json

file = sys.argv[1]
f = open(file)
data = json.load(f)

for i in data['Tags']:
    if i['Key'] and i['Value']:
        output = "export EC2_TAG_" + i['Key'].upper() + '=' + i['Value']
        print(output.replace(':', '_'))

f.close()
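
For illustration, here is what the script emits for a hypothetical describe-tags response (the tag key and value below are made up; colons are replaced so the result stays a valid shell assignment):

# Hypothetical sample of the `aws ec2 describe-tags` output parsed above.
sample = {"Tags": [{"Key": "team", "Value": "infra:core"}]}
for tag in sample["Tags"]:
    line = "export EC2_TAG_" + tag["Key"].upper() + "=" + tag["Value"]
    print(line.replace(":", "_"))   # prints: export EC2_TAG_TEAM=infra_core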

custom_entrypoint.sh:

#!/bin/sh

set -e

output_file="/tmp/ec2_tags.json"

instance_id=$(curl http://169.254.169.254/latest/meta-data/instance-id)

aws ec2 describe-tags --filters "Name=resource-id,Values=$instance_id" --region=us-west-2 > $output_file

# Command substitution is needed so the printed `export` lines are applied
# to this shell rather than just echoed.
eval "$(python3 /parse_tags.py "$output_file")"

exec /entrypoint.sh "$@"

Steps to reproduce

  1. Install telegraf and configure it to send data to Datadog
  2. Send custom metrics (counters) via statsd to Datadog (see the sketch below)
  3. Watch as the values get multiplied by 10 for no apparent reason
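
A minimal sketch of step 2, assuming the statsd listener from the config above is reachable on localhost:8126 (the metric name and tag here are hypothetical):

#!/usr/bin/env python3
import socket

# Send a DogStatsD-format counter increment ("|c") three times over UDP;
# telegraf's statsd input aggregates these within one flush interval.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for _ in range(3):
    sock.sendto(b"example.counter:1|c|#env:test", ("127.0.0.1", 8126))
sock.close()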

Expected behavior

If a counter is incremented 3 times and transmitted, Datadog should show a value of "3".

Actual behavior

If a counter is incremented 3 times and transmitted, Datadog instead shows a value of "30".

Additional info

No response

@jrimmer-housecallpro jrimmer-housecallpro added the bug unexpected problem or unintended behavior label Apr 6, 2022
@telegraf-tiger telegraf-tiger bot added area/aws AWS plugins including cloudwatch, ecs, kinesis platform/windows labels Apr 6, 2022
@jrimmer-housecallpro
Contributor Author

☝️ I have no idea why the bot says "platform/windows" when this was Linux.

@jrimmer-housecallpro
Contributor Author

jrimmer-housecallpro commented Apr 7, 2022

Slight update: we've noticed that in Datadog, the metrics in question are "rates" with an interval set to 10. If we set the interval to 1, the numbers from telegraf look correct; however, this also divides the metrics collected by Datadog's agent by 10, making them wrong.

Ideally we should see the same values for both Telegraf and Datadog's agent.
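
A worked version of that trade-off, assuming Datadog renders a rate series back as value times interval (the 10 s figure is dogstatsd's default flush period):

count = 3
agent_value = count / 10      # 0.3: the agent pre-divides by its flush period
telegraf_value = count        # 3: telegraf submits the raw count

print(agent_value * 10, telegraf_value * 10)  # 3.0 30  (interval=10: telegraf 10x too high)
print(agent_value * 1, telegraf_value * 1)    # 0.3 3   (interval=1: agent 10x too low)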

@jrimmer-housecallpro
Contributor Author

OK, more information.

https://docs.datadoghq.com/metrics/dogstatsd_metrics_submission/#count

increment(<METRIC_NAME>, <SAMPLE_RATE>, <TAGS>)
Used to increment a COUNT metric. Stored as a RATE type in Datadog. Each value in the stored timeseries is a time-normalized delta of the metric's value over the StatsD flush period.

So what appears to be happening is that Datadog's agent is actually sending the data using the rate type: if you have a count time series of [10, 5, 20] and send it through datadog/statsd, the agent submits a rate time series of { interval=10; [1, 0.5, 2] }.
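
In Python terms, the conversion described above (assuming dogstatsd's default 10 s flush period):

flush_interval = 10                           # seconds, dogstatsd's default flush period
counts = [10, 5, 20]                          # counts accumulated per flush window
rates = [c / flush_interval for c in counts]
print(rates)                                  # [1.0, 0.5, 2.0], submitted with interval=10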

@jrimmer-housecallpro
Contributor Author

Sure enough:

https://docs.datadoghq.com/api/latest/metrics/#submit-metrics

There's a type field, which takes count, gauge, or rate, and an interval (int64) that is sent along with it.

I suspect that if we send counts properly typed, or if we send a value for interval (hard-coded to 1), or both, then the issue will go away. A sketch of what that submission might look like is below.
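
A minimal sketch of such a submission against the v1 series endpoint (the metric name, tag, and host are hypothetical; requests is a third-party HTTP client):

import time
import requests  # third-party: pip install requests

# Submit one point explicitly typed as a count with interval hard-coded to 1,
# so Datadog does not fall back on the metric's stored Rate/interval=10
# metadata. Payload shape per the submit-metrics docs linked above.
payload = {
    "series": [{
        "metric": "example.counter",           # hypothetical metric name
        "points": [[int(time.time()), 3]],     # [timestamp, value]
        "type": "count",
        "interval": 1,
        "host": "baf01b4f1d4a",
        "tags": ["env:test"],
    }]
}
resp = requests.post(
    "https://api.datadoghq.com/api/v1/series",
    headers={"DD-API-KEY": "<api key>"},       # placeholder
    json=payload,
)
resp.raise_for_status()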

Working on testing this and will PR if it works.

@jrimmer-housecallpro
Contributor Author

OK, we've figured out what we think is going on.

The datadog agent/dogstatsd takes any count type metric and converts it to a rate metric. See here: https://docs.datadoghq.com/metrics/dogstatsd_metrics_submission/#count

If you've created a metric/custom metric as a count using Datadog's agent/dogstatsd, it will have the type Rate with interval=10 in Datadog by default. Not only does the agent transmit the values this way, it actually divides the provided values by 10 in the data series -- https://docs.datadoghq.com/api/latest/metrics/#submit-metrics.

Since telegraf's Datadog output plugin does not type the data, Datadog then interprets that data as being of the previously-created type -- and multiplies everything by 10, assuming that the agent had already divided everything by 10 on its end.

If telegraf sends the typing and interval information, Datadog interprets the data correctly, as long as there are no other datadog/dogstatsd agents communicating with Datadog at the same time.
