
[0.11.1] Telegraf Not Flushing After flush_interval - Drops Metrics #914

Closed
s1m0 opened this issue Mar 22, 2016 · 7 comments
Labels
bug (unexpected problem or unintended behavior)

Comments

s1m0 commented Mar 22, 2016

I have a long-running exec input that generates 132 metrics a minute. After about 7.5 minutes (1000 / 132 ≈ 7.6 minutes) it fills up the buffer but keeps on accumulating metrics; when the script exits after 10 minutes, Telegraf issues a warning and then writes only the last 1000 metrics:-

2016/03/22 21:49:49 WARNING: overwriting cached metrics, you may want to increase the metric_buffer_limit setting in your [agent] config if you do not wish to overwrite metrics.
2016/03/22 21:49:49 Gathered metrics, (separate 10m0s interval), from exec in 10m1.36394225s
2016/03/22 21:49:50 Gathered metrics, (separate 5s interval), from exec in 72.754315ms
2016/03/22 21:49:50 Wrote 0 metrics to output influxdb in 3.912129ms
2016/03/22 21:49:50 Wrote 0 metrics to output influxdb in 1.222339ms
2016/03/22 21:49:50 Wrote 0 metrics to output influxdb in 5.916385ms
2016/03/22 21:49:50 Wrote 28 metrics to output influxdb in 8.505011ms
2016/03/22 21:49:50 Wrote 1000 metrics to output influxdb in 20.320401ms

I have the following settings in telegraf.conf:-

[agent]
  interval = "5s"
  round_interval = true
  metric_buffer_limit = 1000
  flush_buffer_when_full = true
  collection_jitter = "0s"
  flush_interval = "5s"
  flush_jitter = "0s"

I was expecting the accumulated metrics to be flushed every flush_interval (5 seconds in this case), but even the flush_buffer_when_full flush isn't happening.
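
For reference, a trimmed config along these lines should be enough to isolate the behaviour: the same agent settings, a single exec input pointed at a placeholder script, and a throwaway output (the script path and database name here are made up, not the production ones):

# Minimal sketch for reproducing the flush problem. The exec command is a
# placeholder for the real 10-minute script, which prints influx line protocol
# on stdout at roughly 132 metrics a minute.
[agent]
  interval = "5s"
  round_interval = true
  metric_buffer_limit = 1000
  flush_buffer_when_full = true
  flush_interval = "5s"

[[inputs.exec]]
  command = "/usr/local/bin/long_running_generator.sh"   # hypothetical generator script
  data_format = "influx"
  interval = "10m"

[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "buffer_test"   # throwaway database, not one of the production ones
  precision = "s"
  timeout = "5s"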

sparrc commented Mar 24, 2016

This seems very odd; can you post your full config & OS?

sparrc added the bug label Mar 24, 2016

sparrc commented Mar 24, 2016

@s1m0 Something is not quite adding up here; it's not possible for this message:

WARNING: overwriting cached metrics, you may want to increase the metric_buffer_limit setting in your [agent] config if you do not wish to overwrite metrics.

to print when you have flush_buffer_when_full = true set

s1m0 commented Mar 24, 2016

I noticed that in the code too. I increased the metric_buffer_limit to 2000 and got 1320 metrics written every 10 minutes.

2016/03/24 21:10:50 Gathered metrics, (separate 10m0s interval), from exec in 10m1.406705022s
2016/03/24 21:10:55 Wrote 1320 metrics to output influxdb in 33.198581ms

Here's the config file with the original 1000 setting, which I'm running with right now. We're back to 1000 metrics written every 10 minutes, but I can't find the overwrite message in the logfile any more.

2016/03/24 23:06:16 Gathered metrics, (separate 10m0s interval), from exec in 10m1.532282103s
2016/03/24 23:06:16 Wrote 1001 metrics to output influxdb in 46.972771ms

So there's about 2.5 minutes of data (roughly 320 metrics) missing.

The exec input that's generating that output is the last one in the config below; it's writing to our 3-node InfluxDB cluster (database = "FlexCache", user_agent = "OSS-FC").

# Telegraf configuration

# Telegraf is entirely plugin driven. All metrics are gathered from the
# declared inputs, and sent to the declared outputs.

# Plugins must be declared in here to be active.
# To deactivate a plugin, comment out the name and any variables.

# Use 'telegraf -config telegraf.conf -test' to see what metrics a config
# file would generate.

# Global tags can be specified here in key="value" format.
[global_tags]
  # dc = "us-east-1" # will tag all metrics with dc=us-east-1
  # rack = "1a"

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "5s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will cache metric_buffer_limit metrics for each output, and will
  ## flush this buffer on a successful write.
  metric_buffer_limit = 1000
  ## Flush the buffer whenever full, regardless of flush_interval.
  flush_buffer_when_full = true

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. You shouldn't set this below
  ## interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "5s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## Run telegraf in debug mode
  debug = true
  ## Run telegraf in quiet mode
  quiet = false
  ## Override default hostname, if empty use os.Hostname()
  hostname = " "
###############################################################################
#                                  OUTPUTS                                    #
###############################################################################
[[outputs.influxdb]]
  urls = ["http://1.1.1.1:8086"]
  database = "SASData"
  precision = "s"
  retention_policy = "sixmonths"
  timeout = "5s"
  user_agent = "OSS-SAS"
  namepass = ["Services", "ShopMonitor"]

[[outputs.influxdb]]
  urls = ["http://ifdbcl1h1:8086", "http://ifdbcl1h2:8086", "http://ifdbcl1h3:8086"]
  database = "SASData"
  precision = "s"
  retention_policy = "default"
  timeout = "5s"
  username = "xxxxxx"
  password = "xxxxxx"
  user_agent = "OSS-SAS"
  namepass = ["Services", "ShopMonitor"]

[[outputs.influxdb]]
  urls = ["http://ifdbcl1h1:8086", "http://ifdbcl1h2:8086", "http://ifdbcl1h3:8086"]
  database = "FlexCache"
  precision = "s"
  retention_policy = "default"
  timeout = "5s"
  username = "xxxxxx"
  password = "xxxxxx"
  user_agent = "OSS-FC"
  namepass = ["Db*", "Flex*"]

[[outputs.influxdb]]
  urls = ["http://ifdbcl1h1:8086", "http://ifdbcl1h2:8086", "http://ifdbcl1h3:8086"]
  database = "Origin"
  precision = "s"
  retention_policy = "default"
  timeout = "5s"
  username = "xxxxxx"
  password = "xxxxxx"
  user_agent = "OSS-O"
  namepass = ["Snap*", "FcStats*", "Stats*"]

[[outputs.influxdb]]
  urls = ["http://ifdbcl1h1:8086", "http://ifdbcl1h2:8086", "http://ifdbcl1h3:8086"]
  database = "CapacityLogger"
  precision = "s"
  retention_policy = "default"
  timeout = "5s"
  username = "xxxxxx"
  password = "xxxxxx"
  user_agent = "OSS-CL"
  namepass = ["CapLog"]

###############################################################################
#                              SERVICE INPUTS                                 #
###############################################################################
[[inputs.exec]]
  command = "runuser abcdef -c /usr/local/bin/IFDBVolSizINode_telegraf.sh"
  data_format = "influx"
  interval = "1m"

[[inputs.exec]]
  command = "/usr/local/bin/IFDBFlip_telegraf.sh"
  data_format = "influx"
  interval = "1m"

[[inputs.exec]]
  command = "/usr/local/bin/IFDBShopMon_telegraf.sh"
  data_format = "influx"
  interval = "30s"

[[inputs.exec]]
  command = "/usr/local/bin/IFDBServMon_telegraf.sh"
  data_format = "influx"
  interval = "30s"

[[inputs.exec]]
  command = "/usr/local/bin/IFDB_CapLog_telegraf.sh"
  data_format = "influx"
  interval = "5s"

[[inputs.exec]]
  command = "runuser abcdef -c /usr/local/bin/IFDB_FCnfsv3_telegraf.sh"
  data_format = "influx"
  interval = "10m"

/etc/default/telegraf:-

STDOUT=/var/log/telegraf/telegraf.log

sparrc commented Mar 25, 2016

So you're saying that the 10m exec script is generating 1320 metrics, but only 1000 metrics are getting written? Is there not any follow-up write that writes the remaining 320 metrics?

Or the problem is that your outputs are not flushing every 5s? (or both?)

s1m0 commented Mar 25, 2016

Both. Telegraf writes whatever is in the buffer after 10 minutes, so if the buffer is limited to 1000 metrics, I lose 2.5 minutes of data.
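
As a stopgap, the buffer-size bump from my earlier comment (metric_buffer_limit = 2000, which got all 1320 metrics written) sidesteps the overwrite, assuming the 10m script keeps producing around 1320 metrics per run. Roughly:

[agent]
  interval = "5s"
  round_interval = true
  ## sized above the ~1320 metrics the 10m exec produces in a single gather,
  ## so nothing is overwritten while waiting for the next flush
  metric_buffer_limit = 2000
  flush_buffer_when_full = true
  flush_interval = "5s"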

sparrc commented Mar 29, 2016

@s1m0 do you happen to have smaller configs where you see a similar problem? I haven't been able to reproduce your issue.

Below I am running with a 2s collection interval, 5s flush interval, and a 20-metric buffer. I am seeing that the agent flushes every 5 seconds and whenever the buffer is full:
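
The agent section for that run is roughly this (reconstructed from the numbers above, so treat it as a sketch rather than the exact file):

[agent]
  ## 2s collection, 5s flush, 20-metric buffer, matching the log below
  interval = "2s"
  round_interval = true
  metric_buffer_limit = 20
  flush_buffer_when_full = true
  flush_interval = "5s"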

% telegraf -config ~/ws/telegraf.conf -input-filter cpu:mem:disk:net:processes:swap -output-filter influxdb
2016/03/29 17:04:16 Starting Telegraf (version 0.11.1-61-g4ad551b)
2016/03/29 17:04:16 Loaded outputs: influxdb
2016/03/29 17:04:16 Loaded inputs: cpu net disk mem swap
2016/03/29 17:04:16 Tags enabled: dc=us-east-1 host=tars
2016/03/29 17:04:16 Agent Config: Interval:2s, Debug:false, Quiet:false, Hostname:"tars", Flush Interval:5s
2016/03/29 17:04:18 Gathered metrics, (2s interval), from 5 inputs in 5.815599ms
2016/03/29 17:04:20 Gathered metrics, (2s interval), from 5 inputs in 5.677863ms
2016/03/29 17:04:22 Gathered metrics, (2s interval), from 5 inputs in 6.004962ms
2016/03/29 17:04:22 Wrote 21 metrics to output influxdb in 4.737673ms <------ FULL BUFFER
2016/03/29 17:04:23 Wrote 5 metrics to output influxdb in 2.668493ms <-------- 5s INTERVAL
2016/03/29 17:04:24 Gathered metrics, (2s interval), from 5 inputs in 5.77265ms
2016/03/29 17:04:26 Gathered metrics, (2s interval), from 5 inputs in 5.788931ms
2016/03/29 17:04:28 Gathered metrics, (2s interval), from 5 inputs in 6.081894ms
2016/03/29 17:04:28 Wrote 21 metrics to output influxdb in 4.142238ms
2016/03/29 17:04:28 Wrote 6 metrics to output influxdb in 3.02619ms
2016/03/29 17:04:30 Gathered metrics, (2s interval), from 5 inputs in 6.79805ms
2016/03/29 17:04:32 Gathered metrics, (2s interval), from 5 inputs in 6.066254ms
2016/03/29 17:04:33 Wrote 18 metrics to output influxdb in 2.869484ms <-------- 5s INTERVAL
2016/03/29 17:04:34 Gathered metrics, (2s interval), from 5 inputs in 6.172455ms
2016/03/29 17:04:36 Gathered metrics, (2s interval), from 5 inputs in 4.640113ms
2016/03/29 17:04:38 Gathered metrics, (2s interval), from 5 inputs in 6.893187ms
2016/03/29 17:04:38 Wrote 21 metrics to output influxdb in 5.851ms <------ FULL BUFFER
2016/03/29 17:04:38 Wrote 6 metrics to output influxdb in 2.957888ms <-------- 5s INTERVAL
2016/03/29 17:04:40 Gathered metrics, (2s interval), from 5 inputs in 6.025213ms
2016/03/29 17:04:42 Gathered metrics, (2s interval), from 5 inputs in 5.695666ms
2016/03/29 17:04:43 Wrote 18 metrics to output influxdb in 3.31518ms
2016/03/29 17:04:44 Gathered metrics, (2s interval), from 5 inputs in 5.763199ms
2016/03/29 17:04:46 Gathered metrics, (2s interval), from 5 inputs in 6.770064ms
2016/03/29 17:04:48 Gathered metrics, (2s interval), from 5 inputs in 5.330773ms
2016/03/29 17:04:48 Wrote 21 metrics to output influxdb in 5.543403ms
2016/03/29 17:04:48 Wrote 6 metrics to output influxdb in 2.822488ms
2016/03/29 17:04:50 Gathered metrics, (2s interval), from 5 inputs in 4.901441ms
2016/03/29 17:04:52 Gathered metrics, (2s interval), from 5 inputs in 6.142643ms
^C2016/03/29 17:04:52 Hang on, flushing any cached metrics before shutdown
2016/03/29 17:04:52 Wrote 18 metrics to output influxdb in 3.347653ms

sparrc commented Apr 26, 2016

@s1m0 I believe I have reproduced your issue; I am getting some dropped metrics when processing over a few thousand per second.

From what I can tell, the issue has been fixed by @PierreF with this PR: #1087

I'll keep this issue open for now and close it with the buffer performance refactor I'm working on in #1096.

sparrc added 8 commits that referenced this issue Apr 26, 2016
sparrc closed this as completed in 4de75ce Apr 27, 2016