Ensure buffer is written to Influx even when there's no network connection #4963

natejgardner · 2018-11-06T03:19:11Z

Feature Request

Opening a feature request kicks off a discussion.

Proposal:

For IOT and mobile devices, connectivity is not guaranteed to be consistent. It would be great if Telegraf could robustly handle sending events gathered while offline as soon as the connection is restored.

Current behavior:

Telegraf only writes to InfluxDB if InfluxDB is reachable. Events gathered when InfluxDB is unreachable are discarded and not written to InfluxDB when the connection is restored.

Desired behavior:

Telegraf holds all messages in the buffer while InfluxDB is not reachable and only removes them from the buffer when InfluxDB has responded that the writes were successful (as long as the buffer hasn't filled).
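To make the request concrete, here is a minimal Python sketch of the behavior described above. It is not Telegraf's actual implementation; the class name `AckBuffer` and the `write` callback are hypothetical and exist only for illustration. Metrics leave the buffer only after the write callback reports success, and the oldest metrics are dropped once the limit is reached.

```python
from collections import deque
from typing import Callable, Deque, List


class AckBuffer:
    """Bounded buffer that removes a batch only after a confirmed write.

    Hypothetical illustration; not Telegraf's real buffer.
    """

    def __init__(self, limit: int) -> None:
        # deque(maxlen=...) silently discards the oldest item when full.
        self._items: Deque[str] = deque(maxlen=limit)

    def add(self, metric: str) -> None:
        self._items.append(metric)

    def __len__(self) -> int:
        return len(self._items)

    def flush(self, write: Callable[[List[str]], bool], batch_size: int) -> int:
        """Send up to batch_size metrics; return how many were acknowledged."""
        batch = list(self._items)[:batch_size]
        if not batch:
            return 0
        if not write(batch):
            # Write failed (e.g. server unreachable): keep everything.
            return 0
        for _ in batch:
            self._items.popleft()
        return len(batch)
```

With this shape, a failed `flush` leaves the buffer untouched, so the same metrics are retried on the next flush interval.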

Use case:

IoT devices, mobile devices, and basically everything that uses Wi-Fi or mobile networks deal with inconsistent connectivity. It'd be nice if that didn't imply losing all the data from those moments when the client is disconnected, especially when behavior while disconnected is what one is trying to analyze!

glinton · 2018-11-06T17:23:31Z

I'm not positive, but I believe the 1.9RC addresses this if you don't mind trying it out: 1.9.0-rc1

danielnelson · 2018-11-06T22:18:49Z

Metrics collected when Telegraf is offline are added to the metric buffer and sent when a connection is re-established. This is true in the current release (<=1.8.3) as well as in the latest release candidate.

I assume this is not what you are seeing. Can you please add reproduction steps so we can look into what might be causing your problem?

natejgardner · 2019-04-01T20:30:13Z

@danielnelson,

Nope, this isn't what I'm seeing. I'm not really sure what repro steps to provide. No matter what configuration I use for Telegraf, when the device is disconnected from the InfluxDB server, no data is collected. I've tried increasing the buffer size, modifying the default buffer flush, etc. It appears that as soon as Telegraf attempts to send data to InfluxDB, the data is removed from the buffer. If the request fails, it would appear the data is not returned to the buffer to be sent later. The only repro steps I can recommend are running Telegraf (e.g. to collect CPU data), disconnecting the device, reconnecting a minute or two later, and then checking InfluxDB to see the gap in data during the downtime.

Since Telegraf can consume quite a bit of data and is intended for use on lots of devices, including single-board computers, this feature would possibly require writing the buffer to disk until the connection is reestablished, due to the small amount of memory available on many devices. I've never seen Telegraf consume more memory or perform disk I/O when unable to reach InfluxDB. If this is the intended behavior, I'm not sure it works on a clean install with default config files.

danielnelson · 2019-04-01T20:39:10Z

Can you show your configuration file and let me know what version of Telegraf you are using? Then please run your repro steps with debug = true in the agent config and attach the logfile from the test.

Telegraf keeps all metrics in memory and never saves them to disk, so it does use additional memory when the output cannot send, but only up to the metric_buffer_limit.

natejgardner · 2019-04-01T20:41:22Z

What are the units of metric_buffer_limit? Is it number of individual records stored?

danielnelson · 2019-04-01T20:45:28Z

Yes, each record is called a metric in Telegraf and it is defined as a measurement name + set of tags + one or more fields and a timestamp. It more or less corresponds to a single line of InfluxDB line protocol.
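As a rough illustration of that correspondence, the Python sketch below renders one metric (measurement name, tags, fields, timestamp) as a single line of InfluxDB line protocol. The helper names `format_field` and `to_line` are hypothetical, and the sketch assumes measurement names, tag keys, tag values, and field keys contain no spaces, commas, or equals signs, so it skips the escaping those would need.

```python
from typing import Dict, Union

FieldValue = Union[float, int, str, bool]


def format_field(value: FieldValue) -> str:
    """Render one field value in line-protocol syntax."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    # Strings are double-quoted; backslashes and quotes are escaped.
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_line(
    measurement: str,
    tags: Dict[str, str],
    fields: Dict[str, FieldValue],
    timestamp_ns: int,
) -> str:
    """Build one line: measurement[,tag=value...] field=value[,...] timestamp."""
    if not fields:
        raise ValueError("a metric needs at least one field")
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={format_field(v)}" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"
```

For example, `to_line("cpu", {"host": "pi"}, {"usage_idle": 97.5}, 1554150000000000000)` produces one line, which counts as one metric against metric_buffer_limit.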

natejgardner · 2019-04-01T20:50:56Z

Got it. It's possible the buffer simply overflows too quickly for me to notice. I've set it pretty high (around 1 million) but still apparently missed all the data from the offline time window. I'll experiment with the buffer limit and confirm, while also running with the debug flag to generate the log. Will Telegraf gracefully flush the buffer if it runs out of memory?
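A quick back-of-the-envelope check of the overflow theory, as a Python sketch. The function name and the example collection rate are assumptions for illustration; the real rate depends on which inputs are enabled and on the collection interval.

```python
def outage_coverage_seconds(
    buffer_limit: int, metrics_per_interval: int, interval_seconds: float
) -> float:
    """Estimate how long an outage the buffer can absorb before it overflows.

    buffer_limit         -- metric_buffer_limit, counted in metrics
    metrics_per_interval -- metrics gathered per collection interval (assumed)
    interval_seconds     -- collection interval in seconds (assumed)
    """
    if metrics_per_interval <= 0 or interval_seconds <= 0:
        raise ValueError("rate inputs must be positive")
    metrics_per_second = metrics_per_interval / interval_seconds
    return buffer_limit / metrics_per_second


# Hypothetical workload: 50 metrics every 10 seconds, buffer of 1,000,000.
seconds = outage_coverage_seconds(1_000_000, 50, 10)
print(f"{seconds / 3600:.1f} hours of outage covered")
```

Under those assumed numbers the buffer would cover about 55 hours, so a one or two minute disconnect should not overflow it unless the collection rate is several orders of magnitude higher.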

danielnelson · 2019-04-01T21:39:36Z

> Will Telegraf gracefully flush the buffer if it runs out of memory?

No, either the oom_killer terminates the process, which cannot be handled by a process at all, or the process panics and exits. We don't try to handle the panic, as it is usually impossible to write data without the ability to allocate memory.

phgogo · 2020-07-20T09:58:08Z

I would really appreciate a feature that allows for longer caching (on disk).

My current workaround is using the "exec" output plugin with a custom Python script that tries to transmit the data and caches it locally in the event of a connection loss.
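A minimal sketch of what such a script could look like, not the actual script used here. It assumes the exec output hands the script line-protocol text on stdin, and the spool path and write URL are hypothetical placeholders to adapt. The send function is passed in so the caching logic can be exercised without a network.

```python
import sys
import urllib.request
from pathlib import Path
from typing import Callable

SPOOL = Path("/var/tmp/telegraf-spool.lp")  # hypothetical cache file
WRITE_URL = "http://localhost:8086/write?db=telegraf"  # hypothetical endpoint


def http_send(payload: str) -> bool:
    """POST line protocol; True only on a 2xx response."""
    req = urllib.request.Request(
        WRITE_URL, data=payload.encode("utf-8"), method="POST"
    )
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False


def deliver(new_lines: str, spool: Path, send: Callable[[str], bool]) -> bool:
    """Send cached lines plus new lines; cache everything if sending fails."""
    if new_lines and not new_lines.endswith("\n"):
        new_lines += "\n"
    backlog = spool.read_text() if spool.exists() else ""
    payload = backlog + new_lines
    if not payload:
        return True
    if send(payload):
        spool.unlink(missing_ok=True)  # backlog delivered, clear the cache
        return True
    spool.write_text(payload)  # keep old and new lines for the next run
    return False


def main() -> None:
    deliver(sys.stdin.read(), SPOOL, http_send)
```

A real script would call `main()` at the bottom and would also want a cap on the spool file size, since this sketch lets it grow without bound during a long outage.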

yonas124 · 2020-10-04T13:14:20Z

@natejgardner is the buffer working then?

jrc · 2021-05-24T12:27:38Z

Relating #802

glinton added the feature request label Nov 6, 2018

danielnelson added the need more info label Nov 6, 2018

danielnelson closed this as completed Mar 5, 2019
