telegraf dropped/purged/truncated its output buffer on SIGHUP #2679

Closed
jasonkeller opened this issue Apr 17, 2017 · 13 comments
Labels
area/configuration; bug (unexpected problem or unintended behavior); waiting for response (waiting for response from contributor)

Comments

@jasonkeller

jasonkeller commented Apr 17, 2017

System info:

telegraf 1.2, RHEL7

Steps to reproduce:

  1. Have inputs (like SNMP) running
  2. pkill -1 telegraf or pkill -SIGHUP telegraf

Expected behavior:

Soft configuration reload

Actual behavior:

Truncated output buffer and soft configuration reload, causing huge derivative dips in Grafana.

Additional info:

https://community.influxdata.com/t/refreshing-telegraf-shows-dips-on-graphs-in-grafana/525/4
Opening this bug per daniel

First mentioned at the bottom of this bug by another user...
#69

And I think JZ even asked this on the forums but got no response...
https://community.influxdata.com/t/reload-config-telegraf-config-file-without-restarting-process/62

@danielnelson added the bug label (unexpected problem or unintended behavior) on Apr 19, 2017
@jasonkeller
Author

Any motion on this?

@danielnelson
Contributor

danielnelson commented Jul 6, 2017

This is somewhat tricky.

Each output currently has its own metric buffer and may not be able to flush at this time. The metric buffers for the outputs may have more differences among them than just a position, due to filtering options transforming the points.

It may be easiest to stop the world and allow the outputs a time period to flush everything, and then perform the reload. If the output could not complete in this time the buffered points would be lost for it.
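The stop-the-world idea above can be sketched roughly as follows. This is a minimal illustration in Python, not Telegraf's Go implementation: `Output`, `flush_until`, `reload_with_drain`, and `grace_seconds` are hypothetical names chosen for the example, and the write callback stands in for whatever the real output plugin does.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, List


class Output:
    """Hypothetical output with its own metric buffer (not Telegraf's real type)."""

    def __init__(self, name: str, write: Callable[[List[str]], bool],
                 batch_size: int = 2) -> None:
        self.name = name
        self.write = write            # returns True when the batch was accepted
        self.batch_size = batch_size
        self.buffer: Deque[str] = deque()

    def flush_until(self, deadline: float) -> int:
        """Write batches until the buffer is empty, a write fails, or time runs out.

        Returns the number of metrics still buffered; a reload would lose these.
        """
        while self.buffer and time.monotonic() < deadline:
            batch = list(self.buffer)[: self.batch_size]
            if not self.write(batch):
                break                 # destination unavailable; give up for now
            for _ in batch:
                self.buffer.popleft()
        return len(self.buffer)


def reload_with_drain(outputs: List[Output], grace_seconds: float,
                      reload: Callable[[], None]) -> Dict[str, int]:
    """Give all outputs one shared grace period to flush, then reload."""
    deadline = time.monotonic() + grace_seconds
    lost = {out.name: out.flush_until(deadline) for out in outputs}
    reload()
    return lost
```

In this sketch the outputs are drained one after another against a single deadline, so a slow output could use up the grace period for the ones behind it. Draining concurrently, or giving each output its own deadline, would be alternative choices.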

Another way we could potentially handle this is by moving to a shared output buffer and performing filtering on flush. This would use less memory when there are multiple outputs, but filtering would need to be done each time in case of failure, or only failures could be buffered per output.
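As a rough illustration of the shared-buffer alternative, the sketch below keeps one list of metrics and a per-output cursor, and applies each output's filter only when that output flushes. It is written in Python for brevity and is an assumption about how such a design might look, not a description of Telegraf's code; `SharedBuffer`, `register`, `flush`, and the `keep`/`write` callbacks are hypothetical.

```python
from typing import Callable, Dict, List

Metric = Dict[str, object]


class SharedBuffer:
    """Hypothetical single buffer shared by all outputs; each output has a cursor."""

    def __init__(self) -> None:
        self.metrics: List[Metric] = []
        self.base = 0                       # absolute index of self.metrics[0]
        self.cursors: Dict[str, int] = {}   # output name -> next absolute index

    def register(self, output: str) -> None:
        """Start a new output at the current end of the buffer."""
        self.cursors[output] = self.base + len(self.metrics)

    def add(self, metric: Metric) -> None:
        self.metrics.append(metric)

    def flush(self, output: str, keep: Callable[[Metric], bool],
              write: Callable[[List[Metric]], bool]) -> bool:
        """Filter the pending metrics for one output and try to write them."""
        start = self.cursors[output] - self.base
        pending = self.metrics[start:]
        batch = [m for m in pending if keep(m)]   # filtering happens at flush time
        if batch and not write(batch):
            return False          # cursor unchanged, so a retry filters again
        self.cursors[output] += len(pending)
        self._trim()
        return True

    def _trim(self) -> None:
        """Drop metrics that every registered output has already passed."""
        done = min(self.cursors.values()) - self.base
        if done > 0:
            del self.metrics[:done]
            self.base += done
```

The trade-off mentioned above is visible here: memory is shared across outputs, but a failed write leaves the cursor in place, so the filter runs again on the next attempt.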

I think I'll do the stop the world reload first, and perhaps do the shared output buffer at some later date.

@jasonkeller
Author

@danielnelson are there any updates to this? I keep running into this and getting weird dip/spikes on my graphs in grafana on derivative functions due to the missing datapoints.

If we don't have a good way to update the telegraf instance with new endpoints to poll without losing data, that really nerfs when we can realistically begin polling new devices.

@jasonkeller
Author

@danielnelson I'll back up a second and get this out there so people realize the other implications of restarting/refreshing the telegraf process.

So part of the issue is dropping data (which you can get around with careful timing if you flush more frequently than you poll), but another issue that may inevitably bite you is interval skew. If you don't restart at the same relative point in the interval as you did previously, telegraf will begin polling at a different point in your interval at the same cadence, leading to a frame-shift of points that will cause a spike/dip on your graph.
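The frame-shift described above can be illustrated with a small calculation. This is only a sketch of the effect in Python, under the assumption that an unaligned process collects every `interval` seconds counted from its own start time; `ticks` and `phase` are hypothetical helpers, and the numbers are made up for the example.

```python
from typing import List


def ticks(start: float, interval: float, count: int, aligned: bool) -> List[float]:
    """Return the first `count` collection times for a process started at `start`.

    aligned=False: collect every `interval` seconds counted from the start time.
    aligned=True:  collect on the next clock multiples of `interval`.
    """
    if aligned:
        first = (start // interval + 1) * interval
    else:
        first = start + interval
    return [first + i * interval for i in range(count)]


def phase(t: float, interval: float) -> float:
    """Offset of a collection time within its interval."""
    return t % interval


# Original process started at t=40s, restarted at t=700s, 300s interval.
before = ticks(40, 300, 2, aligned=False)    # phase 40 within each interval
after = ticks(700, 300, 2, aligned=False)    # phase 100: points shift by 60s
print(before, after, phase(before[0], 300), phase(after[0], 300))
```

With `aligned=True` both runs land on multiples of 300 seconds regardless of when the process started, which is the behavior one would expect from rounding collection times to the interval.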

@danielnelson
Contributor

I'm hoping to fix this as part of the configuration overhaul to support kv config stores. #272

I've heard about reload causing issues with plugin-specific collection intervals (#2839). Is it also happening with the global interval? Also, what is round_interval set to?

@jasonkeller
Author

jasonkeller commented Oct 23, 2017

Round interval is set to true, with default interval in our agent section set to 60s. All our probe intervals are set to 300s though. Does round_interval only interact with the global interval in the agent section?

#2839 sounds exactly like what has been happening. I wrote a shell script now to calculate and time process refresh/restart using 'at' to avoid further incident.
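The original shell script is not shown in the thread. As a hedged sketch of the kind of calculation such a workaround might perform, the Python below picks the next restart time that is a whole number of intervals after the previous start, so the new process would keep the same phase. `next_restart_time` is a hypothetical helper, and it assumes collections happen at the start time plus multiples of the interval.

```python
import math
from datetime import datetime, timedelta


def next_restart_time(previous_start: datetime, now: datetime,
                      interval: timedelta) -> datetime:
    """Earliest time at or after `now` that is N whole intervals after previous_start."""
    elapsed = (now - previous_start).total_seconds()
    periods = max(1, math.ceil(elapsed / interval.total_seconds()))
    return previous_start + periods * interval


previous_start = datetime(2017, 10, 23, 12, 0, 40)
now = datetime(2017, 10, 23, 12, 11, 0)
print(next_restart_time(previous_start, now, timedelta(seconds=300)))
```

The resulting timestamp could then be handed to a scheduler such as `at`. Any startup delay of the process itself is ignored here and would need to be accounted for separately.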

@danielnelson
Contributor

I haven't investigated the issue closely yet, but it is supposed to work in either case.

@rdxmb
Contributor

rdxmb commented Jan 2, 2018

similar problem here: https://community.influxdata.com/t/telegraf-should-reconnect-after-influxdb-timeouts/3550 . I guess this is the same issue.

@danielnelson
Contributor

@rdxmb That doesn't look like a similar problem to the one reported on this issue.

@voiprodrigo
Contributor

voiprodrigo commented Dec 8, 2018

@danielnelson Could this be an incentive to add support to persist buffers on disk? :)

@rdxmb
Contributor

rdxmb commented Dec 10, 2018

I think I confused this issue with another. I am sorry.

@srebhan
Member

srebhan commented Jul 26, 2023

@jasonkeller is this still an issue with the latest version of Telegraf? If so, is there any simple way to reproduce the issue?

@srebhan added the waiting for response label (waiting for response from contributor) on Jul 26, 2023
@telegraf-tiger
Contributor

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem. If not, please try posting this question in our Community Slack or Community Forums, or provide additional details in this issue and request that it be re-opened. Thank you!


5 participants