
Telegraf stops publishing metrics via wavefront output, all plugins take too long to collect - no meaningful error message #3710

Closed
pberlowski opened this issue Jan 22, 2018 · 7 comments

@pberlowski
Contributor

Bug report

After a certain amount of time under a load test (usually 5-10 minutes), the telegraf agent stops sending metrics via the wavefront output. In addition, the log fills up with "took too long to collect" messages from all plugins.

No relevant error message is available in the logs.

This issue seems similar to #3629; however, no aggregation plugin is in use.
There is a suspicion that the output may be too slow, even though the buffer doesn't seem to overfill and drop metrics.

Load test description

Measurements are submitted via http_listener from a load-generator host that replays a snapshot of data at selectable throughput. At 5000pps (fields), the error happens within minutes of starting the agent.

Relevant telegraf.conf:

http-proxy.conf.txt
telegraf.conf.txt

System info:

Telegraf version: 1.4.5
OS: CentOS 7.3

Expected behavior:

Telegraf logs a meaningful error describing the reason for the overload and keeps buffering metrics.

Actual behavior:

Telegraf silently ceases operation and then starts logging failures to collect on time.

Additional info:

Stacktrace:
telegraf.stacktrace.txt

Hardware and software metrics screenshots for the failure window:
telegraf failure - hardware
telegraf failure - metrics
telegraf failure - network

@danielnelson
Contributor

This sounds like the wavefront plugin is waiting on a network response without the proper timeout.

@puckpuck Can you take a look at this one? It is important that outputs return from the Write() function fairly quickly to avoid backing up the inputs, usually no more than 5s.
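
A minimal sketch of the pattern being suggested, assuming a hypothetical `slowSend` function that does the actual network I/O (none of these names come from the Telegraf codebase): the output bounds how long a single write attempt may take, so the flush loop regains control even if the peer never answers.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowSend stands in for the real network write; here it simply hangs
// longer than the caller is willing to wait.
func slowSend(payload []byte) error {
	time.Sleep(10 * time.Second)
	return nil
}

// writeBounded returns within roughly `timeout` even if slowSend hangs.
// The abandoned goroutine keeps running in the background, so real code
// should also make the underlying connection interruptible (deadlines,
// context-aware dialers), but the flush loop is no longer blocked.
func writeBounded(payload []byte, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	done := make(chan error, 1)
	go func() { done <- slowSend(payload) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return errors.New("write timed out; batch stays buffered for the next flush")
	}
}

func main() {
	err := writeBounded([]byte("example.metric 1 source=host1\n"), 5*time.Second)
	fmt.Println("Write returned after ~5s with:", err)
}
```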

@puckpuck
Contributor

@pberlowski Two things:

  1. You are using an earlier version of telegraf, likely provided by Wavefront, compiled with an older, pre-release version of the plugin. Can you try again with telegraf 1.5.x?
  2. I believe someone in your organization opened a similar ticket with Wavefront support, where recommendations were made to check the configuration and network. Can you please follow up on that support reply? If not, can you open a new support case with Wavefront and they will follow up with you accordingly.

@danielnelson Wavefront uses net.Dial("tcp", uri) with default timeouts. Other output plugins do something very similar, so I'm not sure whether this is the issue here; however, it would make sense to force a 3s (or 5s) timeout. Also, looking at the stacktrace, it seemed like the HTTP listener was more of the culprit here: it had about 250 goroutines tied to it. Happy to investigate further as needed.
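
For reference, a sketch of the change being proposed, with assumed names and an example proxy address (net.DialTimeout is the standard-library way to cap connection setup where a bare net.Dial would wait on the OS default):

```go
package main

import (
	"log"
	"net"
	"time"
)

// dialProxy is a sketch with assumed names, not the plugin's actual code:
// net.DialTimeout caps how long connection setup may block, whereas a bare
// net.Dial waits on the OS default, which can far exceed a flush interval.
func dialProxy(uri string, timeout time.Duration) (net.Conn, error) {
	return net.DialTimeout("tcp", uri, timeout)
}

func main() {
	// Example address only; 2878 is a common Wavefront proxy listener port.
	conn, err := dialProxy("wavefront-proxy:2878", 5*time.Second)
	if err != nil {
		// Returning an error from Write lets telegraf keep the batch
		// buffered and retry on the next flush.
		log.Fatal(err)
	}
	defer conn.Close()
}
```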

@danielnelson
Contributor

@puckpuck While the outputs are sending, telegraf stops reading messages from the Accumulator. Once the Accumulator's channel is full, inputs will block when adding new fields. This is most likely why there are so many goroutines from http_listener. There is more info about this, and the work we need to do to address it, in #2919. The best way to handle this now is to make sure the outputs return quickly, so I'd recommend adding tighter timeouts on all network operations.
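
A toy illustration (not Telegraf source) of the back-pressure described above: once a bounded channel is full, every additional producer goroutine blocks on its send, which is how a stalled output translates into hundreds of parked http_listener handler goroutines.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Small buffer standing in for a full accumulator channel.
	metrics := make(chan string, 3)

	// Each "handler" goroutine blocks as soon as the buffer is full,
	// because nothing is draining the channel.
	for i := 0; i < 10; i++ {
		go func(i int) {
			metrics <- fmt.Sprintf("metric-%d", i)
			fmt.Println("queued metric", i)
		}(i)
	}

	// Simulate an output stuck in Write(): no consumer runs at all.
	time.Sleep(2 * time.Second)
	fmt.Printf("%d metrics queued; the other goroutines are blocked senders\n", len(metrics))
}
```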

puckpuck mentioned this issue Jan 22, 2018
@puckpuck
Contributor

Added a 5s deadline to writes in the output plugin, so if Write ever takes more than 5s it will close the connection and resume. This may cause metrics to get re-written on a subsequent output plugin Write, since there is no way that I know of to tell telegraf which (if any) of the metrics within the batch were successfully written.
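
A hedged sketch of the kind of fix described (assumed helper and parameter names, not the actual patch): set a write deadline before sending the batch and drop the connection on any error, accepting that the whole batch may be re-sent on the next Write.

```go
package output

import (
	"io"
	"net"
	"time"
)

// sendBatch is a sketch with assumed names, not the actual patch: each
// write gets a 5s deadline, and any failure closes the connection so the
// next Write() reconnects. Because telegraf keeps the batch buffered after
// an error, points that did make it through may be written twice.
func sendBatch(conn net.Conn, lines []string) error {
	if err := conn.SetWriteDeadline(time.Now().Add(5 * time.Second)); err != nil {
		return err
	}
	for _, line := range lines {
		if _, err := io.WriteString(conn, line+"\n"); err != nil {
			conn.Close()
			return err
		}
	}
	return nil
}
```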

@pberlowski please still follow through with the steps I asked previously so we can track this internally as well.

@pberlowski
Contributor Author

pberlowski commented Jan 22, 2018

Hey guys, many thanks for a blazing fast response!

@puckpuck We're investigating the network configuration separately. The interaction between our telegraf and the Wavefront proxy was indeed ticketed with Wavefront, and we'll follow up as soon as we have the full picture. We'll also look into upgrading to 1.5.1 ASAP, as we consider the older version a likely root cause of this issue.

At the same time, I feel uneasy about reaching such a deadlock in the case of a network disruption or similar condition. As stated in the ticket, I feel the expected outcome here is a warning message in the telegraf log and standard point-buffering behavior to minimize the impact. Is this a correct assumption?

@danielnelson Is there a way to output a descriptive log message to communicate that the stoppage might be caused by a full accumulator channel? Or maybe just a "channel full" warning of some sort? This would help debug the issue and ease ticketing/communication around such a condition.
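
To make the request concrete, here is a purely illustrative sketch (not Telegraf's actual behavior, as the next comment explains) of how a "channel full" condition could be surfaced with a non-blocking send before falling back to the normal blocking one:

```go
package example

import "log"

// warnIfFull tries a non-blocking send first and logs a warning before
// falling back to the blocking send when the channel is already full.
// Illustrative only; Telegraf does not emit this message.
func warnIfFull(metricChan chan<- string, m string) {
	select {
	case metricChan <- m:
		// Fast path: the channel had room.
	default:
		log.Printf("W! accumulator channel full; input will block until the output catches up")
		metricChan <- m // still deliver, but the stall is now visible in the log
	}
}
```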

danielnelson added this to the 1.5.2 milestone Jan 25, 2018
danielnelson added the bug (unexpected problem or unintended behavior) label Jan 25, 2018
@danielnelson
Contributor

@pberlowski You should have also seen an error message like:

W! Skipping a scheduled flush because there is already a flush ongoing.

It isn't always a problem if the channel fills, so we wouldn't want to log it; however, we do have plans to address this issue by adding a system to ack messages on output. We will also likely change the Output interface in the future to pass in a Context, which should help plugin authors add timeouts.
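
Since no concrete design existed at the time, the following is only a hypothetical sketch of what a context-aware Output interface could look like, showing how a per-flush deadline would reach plugin code; every name here is an assumption.

```go
package example

import (
	"context"
	"time"
)

// Metric is a stand-in type; the interface change discussed above was not
// yet designed, so everything in this sketch is hypothetical.
type Metric interface{}

// ContextOutput shows how a context could carry a per-flush deadline down
// to the plugin, where it can be attached to dials and writes.
type ContextOutput interface {
	Write(ctx context.Context, metrics []Metric) error
}

// flush gives the output a bounded window for the whole batch.
func flush(o ContextOutput, metrics []Metric) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return o.Write(ctx, metrics)
}
```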

@pberlowski
Copy link
Contributor Author

@danielnelson The "skipping flush" warning message had not come up in the logs - we were going straight into "too long to collect" loop.

I'll be watching for any changes in this area. Thanks for the quick communication.
