
Telegraf stops publishing metrics via wavefront output, all plugins take too long to collect - no meaningful error message #3710

Closed
pberlowski opened this issue Jan 22, 2018 · 7 comments

@pberlowski
Contributor

Bug report

After a certain amount of time under a load test (usually 5-10 minutes), the telegraf agent stops sending metrics via the wavefront output. In addition, the log fills up with "took too long to collect" messages from all plugins.

No relevant error message is available in the logs.

This issue seems similar to #3629; however, no aggregation plugin is in use.
There is a suspicion that the output may be too slow, even though the buffer doesn't seem to overfill and drop metrics.

Load test description

Measurements are submitted via http_listener from a load-generator host that replays a snapshot of data at selectable throughput. At 5000pps (fields), the error happens within minutes of starting the agent.

Relevant telegraf.conf:

http-proxy.conf.txt
telegraf.conf.txt

System info:

Telegraf version: 1.4.5
OS: CentOS 7.3

Expected behavior:

Telegraf logs a meaningful error describing the reason for the overload and keeps buffering metrics.

Actual behavior:

Telegraf silently ceases operation and then starts logging failures to collect on time.

Additional info:

Stacktrace:
telegraf.stacktrace.txt

Hardware and software metrics screenshots for the failure window:
telegraf failure - hardware
telegraf failure - metrics
telegraf failure - network

@danielnelson
Contributor

This sounds like the wavefront plugin is waiting on a network response without the proper timeout.

@puckpuck Can you take a look at this one? It is important that outputs return from the Write() function fairly quickly to avoid backing up the inputs, usually no more than 5s.
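
A minimal sketch of the pattern being suggested, assuming a hypothetical `slowSend` function that does the actual network I/O (none of these names come from the Telegraf codebase): the output bounds how long a single write attempt may take, so the flush loop regains control even if the peer never answers.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowSend stands in for the real network write; here it simply hangs
// longer than the caller is willing to wait.
func slowSend(payload []byte) error {
	time.Sleep(10 * time.Second)
	return nil
}

// writeBounded returns within roughly `timeout` even if slowSend hangs.
// The abandoned goroutine keeps running in the background, so real code
// should also make the underlying connection interruptible (deadlines,
// context-aware dialers), but the flush loop is no longer blocked.
func writeBounded(payload []byte, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	done := make(chan error, 1)
	go func() { done <- slowSend(payload) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return errors.New("write timed out; batch stays buffered for the next flush")
	}
}

func main() {
	err := writeBounded([]byte("example.metric 1 source=host1\n"), 5*time.Second)
	fmt.Println("Write returned after ~5s with:", err)
}
```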

@puckpuck
Contributor

@pberlowski Two things:

  1. You are using an earlier version of telegraf, likely provided by Wavefront, compiled with an older, pre-release version of the plugin. Can you try again with telegraf 1.5.x?
  2. I believe someone in your organization opened a similar ticket with Wavefront support, where recommendations were made to check the configuration and network. Can you please follow up on that support reply? If not, can you open a new support case with Wavefront and they will follow up with you accordingly.

@danielnelson Wavefront uses net.Dial("tcp", uri) with default timeouts. Other output plugins do something very similar, so I'm not sure whether this is the issue here; however, it would make sense to force a 3s (or 5s) timeout. Also, looking at the stacktrace, it seemed like the HTTP listener was more of the culprit here: it had about 250 goroutines tied to it. Happy to investigate further as needed.
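
For reference, a sketch of the change being proposed, with assumed names and an example proxy address (net.DialTimeout is the standard-library way to cap connection setup where a bare net.Dial would wait on the OS default):

```go
package main

import (
	"log"
	"net"
	"time"
)

// dialProxy is a sketch with assumed names, not the plugin's actual code:
// net.DialTimeout caps how long connection setup may block, whereas a bare
// net.Dial waits on the OS default, which can far exceed a flush interval.
func dialProxy(uri string, timeout time.Duration) (net.Conn, error) {
	return net.DialTimeout("tcp", uri, timeout)
}

func main() {
	// Example address only; 2878 is a common Wavefront proxy listener port.
	conn, err := dialProxy("wavefront-proxy:2878", 5*time.Second)
	if err != nil {
		// Returning an error from Write lets telegraf keep the batch
		// buffered and retry on the next flush.
		log.Fatal(err)
	}
	defer conn.Close()
}
```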

@danielnelson
Contributor

@puckpuck While the outputs are sending, telegraf stops reading messages from the Accumulator. Once the Accumulator's channel is full, inputs will block when adding new fields. This is most likely why there are so many goroutines from http_listener. There is more info about this, and the work we need to do to address it, in #2919. The best way to handle this now is to make sure the outputs return quickly, so I'd recommend adding tighter timeouts on all network operations.
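
A toy illustration (not Telegraf source) of the back-pressure described above: once a bounded channel is full, every additional producer goroutine blocks on its send, which is how a stalled output translates into hundreds of parked http_listener handler goroutines.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Small buffer standing in for a full accumulator channel.
	metrics := make(chan string, 3)

	// Each "handler" goroutine blocks as soon as the buffer is full,
	// because nothing is draining the channel.
	for i := 0; i < 10; i++ {
		go func(i int) {
			metrics <- fmt.Sprintf("metric-%d", i)
			fmt.Println("queued metric", i)
		}(i)
	}

	// Simulate an output stuck in Write(): no consumer runs at all.
	time.Sleep(2 * time.Second)
	fmt.Printf("%d metrics queued; the other goroutines are blocked senders\n", len(metrics))
}
```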

puckpuck mentioned this issue Jan 22, 2018
@puckpuck
Contributor

Added a 5s deadline to writes in the output plugin, so if Write ever takes more than 5s it will close the connection and resume. This may cause metrics to get re-written on a subsequent output plugin Write, since there is no way that I know of to tell telegraf which (if any) of the metrics within the batch were successfully written.
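
A hedged sketch of the kind of fix described (assumed helper and parameter names, not the actual patch): set a write deadline before sending the batch and drop the connection on any error, accepting that the whole batch may be re-sent on the next Write.

```go
package output

import (
	"io"
	"net"
	"time"
)

// sendBatch is a sketch with assumed names, not the actual patch: each
// write gets a 5s deadline, and any failure closes the connection so the
// next Write() reconnects. Because telegraf keeps the batch buffered after
// an error, points that did make it through may be written twice.
func sendBatch(conn net.Conn, lines []string) error {
	if err := conn.SetWriteDeadline(time.Now().Add(5 * time.Second)); err != nil {
		return err
	}
	for _, line := range lines {
		if _, err := io.WriteString(conn, line+"\n"); err != nil {
			conn.Close()
			return err
		}
	}
	return nil
}
```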

@pberlowski please still follow through with the steps I asked previously so we can track this internally as well.

@pberlowski
Contributor Author

pberlowski commented Jan 22, 2018

Hey guys, many thanks for a blazing fast response!

@puckpuck We're investigating the network configuration separately. The interaction between our telegraf and the Wavefront proxy was indeed ticketed with Wavefront, and we'll follow up as soon as we have the full picture. We'll also look into upgrading to 1.5.1 ASAP, as we consider the older version a likely root cause of this issue.

At the same time, I feel uneasy about reaching such a deadlock in the case of a network disruption or similar condition. As stated in the ticket, I feel the expected outcome here is a warning message in the telegraf log and standard point-buffering behavior to minimize the impact. Is this a correct assumption?

@danielnelson Is there a way to output a descriptive log message to communicate that the stoppage might be caused by a full accumulator channel? Or maybe just a "channel full" warning of some sort? This would help debug the issue and ease ticketing/communication around such a condition.
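
To make the request concrete, here is a purely illustrative sketch (not Telegraf's actual behavior, as the next comment explains) of how a "channel full" condition could be surfaced with a non-blocking send before falling back to the normal blocking one:

```go
package example

import "log"

// warnIfFull tries a non-blocking send first and logs a warning before
// falling back to the blocking send when the channel is already full.
// Illustrative only; Telegraf does not emit this message.
func warnIfFull(metricChan chan<- string, m string) {
	select {
	case metricChan <- m:
		// Fast path: the channel had room.
	default:
		log.Printf("W! accumulator channel full; input will block until the output catches up")
		metricChan <- m // still deliver, but the stall is now visible in the log
	}
}
```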

danielnelson added this to the 1.5.2 milestone Jan 25, 2018
danielnelson added the bug (unexpected problem or unintended behavior) label Jan 25, 2018
@danielnelson
Contributor

@pberlowski You should have also seen an error message like:

W! Skipping a scheduled flush because there is already a flush ongoing.

It isn't always a problem if the channel fills, so we wouldn't want to log it; however, we do have plans to address this issue by adding a system to ack messages on output. We will also likely change the Output interface in the future to pass in a Context, which should help plugin authors add timeouts.
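
Since no concrete design existed at the time, the following is only a hypothetical sketch of what a context-aware Output interface could look like, showing how a per-flush deadline would reach plugin code; every name here is an assumption.

```go
package example

import (
	"context"
	"time"
)

// Metric is a stand-in type; the interface change discussed above was not
// yet designed, so everything in this sketch is hypothetical.
type Metric interface{}

// ContextOutput shows how a context could carry a per-flush deadline down
// to the plugin, where it can be attached to dials and writes.
type ContextOutput interface {
	Write(ctx context.Context, metrics []Metric) error
}

// flush gives the output a bounded window for the whole batch.
func flush(o ContextOutput, metrics []Metric) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return o.Write(ctx, metrics)
}
```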

@pberlowski
Copy link
Contributor Author

@danielnelson The "skipping flush" warning message had not come up in the logs - we were going straight into "too long to collect" loop.

I'll be watching for any changes in this area. Thanks for the quick communication.
