Ensure buffer is written to Influx even when there's no network connection #4963

Closed
natejgardner opened this issue Nov 6, 2018 · 11 comments
Labels
feature request Requests for new plugin and for new features to existing plugins

Comments

@natejgardner

Feature Request

Opening a feature request kicks off a discussion.

Proposal:

For IoT and mobile devices, connectivity is not guaranteed to be consistent. It would be great if Telegraf could robustly handle sending events gathered while offline as soon as the connection is restored.

Current behavior:

Telegraf only writes to InfluxDB if InfluxDB is reachable. Events gathered when InfluxDB is unreachable are discarded and not written to InfluxDB when the connection is restored.

Desired behavior:

Telegraf holds all messages in the buffer while InfluxDB is not reachable and only removes them from the buffer when InfluxDB has responded that the writes were successful (as long as the buffer hasn't filled).
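The behavior requested above could be sketched roughly as follows. This is only an illustration of "remove from the buffer after a confirmed write", not Telegraf's actual implementation; `RetryBuffer`, `write`, and `batch_size` are hypothetical names chosen for the example.

```python
from collections import deque
from itertools import islice
from typing import Callable, Deque, List


class RetryBuffer:
    """Bounded in-memory buffer that drops a metric only after a confirmed write."""

    def __init__(self, limit: int) -> None:
        # deque(maxlen=...) discards the oldest entry once the limit is reached.
        self._pending: Deque[str] = deque(maxlen=limit)

    def add(self, metric: str) -> None:
        self._pending.append(metric)

    def flush(self, write: Callable[[List[str]], bool], batch_size: int = 1000) -> int:
        """Send batches oldest-first and stop at the first failure.

        `write` is an assumed callback that returns True only when the
        server has acknowledged the batch. Returns the number of metrics written.
        """
        written = 0
        while self._pending:
            batch = list(islice(self._pending, batch_size))
            if not write(batch):
                break  # nothing is removed; the batch is retried on the next flush
            for _ in batch:
                self._pending.popleft()
            written += len(batch)
        return written

    def __len__(self) -> int:
        return len(self._pending)
```

With a limit of 3, adding four metrics keeps only the newest three, and a failed `flush` leaves all three in place for the next attempt.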

Use case: [Why is this important (helps with prioritizing requests)]

IoT devices, mobile devices, and basically everything that uses Wi-Fi or mobile networks deal with inconsistent connectivity. It'd be nice if that didn't imply losing all the data from those moments when the client is disconnected, especially when behavior while disconnected is what one is trying to analyze!

glinton added the feature request label Nov 6, 2018
@glinton
Contributor

glinton commented Nov 6, 2018

I'm not positive, but I believe the 1.9RC addresses this if you don't mind trying it out: 1.9.0-rc1

@danielnelson
Contributor

Metrics collected while Telegraf is offline are added to the metric buffer and sent when a connection is re-established. This is true in current releases (<=1.8.3) as well as in the latest release candidate.

I assume this is not what you are seeing. Can you please add reproduction steps so we can look into what might cause your problem?

@natejgardner
Author

@danielnelson,

Nope, this isn't what I'm seeing. I'm not really sure what repro steps to provide. No matter what configuration I use for Telegraf, when the device is disconnected from the InfluxDB server, no data is collected. I've tried increasing the buffer size, modifying the default buffer flush, etc. It appears that as soon as Telegraf attempts to send data to InfluxDB, the data is removed from the buffer. If the request fails, it would appear the data is not returned to the buffer to be sent later. Only repro steps for this I can recommend are running Telegraf, e.g. to collect CPU data, disconnecting the device, and reconnecting a minute or two later, then checking InfluxDB to see the gap in data during the downtime.
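To make the "check for the gap" step of that reproduction less subjective, one could export the timestamps of the collected points and look for holes larger than the collection interval. A minimal sketch, assuming the timestamps are already available as a list of seconds; `find_gaps` and its `tolerance` parameter are hypothetical names, not part of Telegraf or InfluxDB.

```python
from typing import List, Tuple


def find_gaps(
    timestamps: List[float],
    expected_interval: float,
    tolerance: float = 1.5,
) -> List[Tuple[float, float]]:
    """Return (last_seen, next_seen) pairs where points are missing.

    A gap is reported when two consecutive timestamps are further apart
    than expected_interval * tolerance.
    """
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > expected_interval * tolerance:
            gaps.append((earlier, later))
    return gaps
```

For points collected every 10 seconds, `find_gaps([0, 10, 20, 140, 150], 10)` reports a single gap between 20 and 140, which would correspond to the disconnected window.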

Since Telegraf can consume quite a bit of data and is intended for use on lots of devices, including single-board-computers, this feature would possibly require writing the buffer to disk until connection is reestablished due to the small amount of memory available on many devices. I've never seen Telegraf consume more memory or perform disk I/O when unable to reach InfluxDB. If this is the intended behavior, I'm not sure it works on a clean install with default config files.

@danielnelson
Contributor

Can you show your configuration file and let me know what version of Telegraf you are using? Then please run your repro steps with debug = true in the agent config and attach the logfile from the test.

Telegraf keeps all metrics in memory and never saves them to disk, so it does use additional memory when the output cannot send, but only up to the metric_buffer_limit.

@natejgardner
Author

What are the units of metric_buffer_limit? Is it number of individual records stored?

@danielnelson
Contributor

danielnelson commented Apr 1, 2019

Yes, each record is called a metric in Telegraf and it is defined as a measurement name + set of tags + one or more fields and a timestamp. It more or less corresponds to a single line of InfluxDB line protocol.
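As an illustration of that definition, the sketch below turns one metric (measurement name, tags, fields, timestamp) into a single line of InfluxDB line protocol. It is a simplified formatter written for this example, not Telegraf's serializer: escaping of spaces and commas in names and tag values is omitted, and `to_line` and `format_field` are hypothetical helper names.

```python
from typing import Dict, Union

FieldValue = Union[float, int, bool, str]


def format_field(value: FieldValue) -> str:
    """Render one field value using line protocol type conventions."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    # Strings are double-quoted; backslashes and quotes are escaped.
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_line(
    measurement: str,
    tags: Dict[str, str],
    fields: Dict[str, FieldValue],
    timestamp_ns: int,
) -> str:
    """Build one line: measurement,tag=value field=value timestamp."""
    if not fields:
        raise ValueError("a metric needs at least one field")
    tag_part = "".join(f",{key}={val}" for key, val in sorted(tags.items()))
    field_part = ",".join(f"{key}={format_field(val)}" for key, val in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"
```

Each such line counts as one metric against metric_buffer_limit, regardless of how many fields it carries.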

@natejgardner
Author

Got it. It's possible the buffer simply overflows too quickly for me to notice. I've set it pretty high (around 1 million) but still apparently missed all the data from the offline time window. I'll experiment with the buffer limit and confirm, while also running with the debug flag to generate the log. Will Telegraf gracefully flush the buffer if it runs out of memory?
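Whether the buffer can overflow during a short outage is easy to estimate with some arithmetic. The sketch below assumes a steady collection rate; the figures in the usage note (50 metrics every 10 seconds) are illustrative assumptions, not values taken from this thread, and `buffer_coverage_seconds` is a hypothetical helper.

```python
def buffer_coverage_seconds(
    metric_buffer_limit: int,
    metrics_per_interval: int,
    interval_seconds: float,
) -> float:
    """Estimate how long an outage the buffer can absorb before it is full."""
    if metrics_per_interval <= 0 or interval_seconds <= 0:
        raise ValueError("rate inputs must be positive")
    metrics_per_second = metrics_per_interval / interval_seconds
    return metric_buffer_limit / metrics_per_second
```

With a limit of 1,000,000 and an assumed 50 metrics gathered every 10 seconds, `buffer_coverage_seconds(1_000_000, 50, 10)` gives 200,000 seconds, or roughly 55 hours. At that rate a one or two minute disconnect would not come close to filling the buffer, so an overflow would only explain the missing data if the real collection rate were far higher.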

@danielnelson
Contributor

> Will Telegraf gracefully flush the buffer if it runs out of memory?

No. Either the oom_killer terminates the process, which the process cannot handle at all, or the process panics and exits. We don't try to handle the panic, as it is usually impossible to write data without the ability to allocate memory.

@phgogo

phgogo commented Jul 20, 2020

I would really appreciate a feature that allows for longer caching (on disk).

My current workaround is using the "exec" output plugin with a custom Python script that tries to transmit the data and caches it locally in the event of a connection loss.
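The script itself was not posted, so the following is only a guess at the general shape of such a workaround: append incoming lines to a spool file on disk, try to send the whole backlog, and clear the spool only when the send succeeds. `spool_and_send`, `spool_path`, and the `send` callback are hypothetical; in a real script the new lines would presumably be read from standard input, assuming the exec output hands metrics to the script that way.

```python
import os
from typing import Callable, List


def spool_and_send(
    new_lines: List[str],
    spool_path: str,
    send: Callable[[List[str]], bool],
) -> int:
    """Cache new lines on disk, then try to transmit the full backlog.

    `send` is an assumed callback returning True only when the remote
    side accepted every line. Returns the number of lines still cached.
    """
    # Persist first, so nothing is lost if the send attempt fails or crashes.
    with open(spool_path, "a", encoding="utf-8") as spool:
        for line in new_lines:
            spool.write(line.rstrip("\n") + "\n")

    with open(spool_path, "r", encoding="utf-8") as spool:
        backlog = [line.rstrip("\n") for line in spool if line.strip()]

    if backlog and send(backlog):
        os.remove(spool_path)  # everything was delivered; start fresh
        return 0
    return len(backlog)
```

Unlike the in-memory metric buffer, a spool like this survives restarts, but it grows without bound during a long outage, so a real version would need a size cap or rotation.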

@yonas124

yonas124 commented Oct 4, 2020

@natejgardner is the buffer working then?

@jrc

jrc commented May 24, 2021

Relating #802
