Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock when Telegraf is aligning aggregators #5612

Merged
merged 1 commit into from
Mar 20, 2019

Conversation

adrianlzt
Copy link
Contributor

If a Telegraf is asked to stop (SIGTERM, SIGINT) while it is aligning an
aggregator, internal.SleepContext will fail and the goroutine will
return without closing the aggregations channel getting stuck
forever in for metric := range aggregations

Required for all PRs:

  • Signed CLA.
  • Associated README.md updated.
  • Has appropriate unit tests.

Sample config file to test the problem:

[global_tags]
[agent]
  interval = "60s"
  debug = true

[[aggregators.basicstats]]
  period = "1800s"

[[inputs.mem]]

[[outputs.file]]
  files = [ "stdout" ]

Run telegraf:

telegraf -config telegraf.conf

Kill it before 60s:

pkill telegraf

It won't finish.

Here telegraf.log is the log of the execution mixed with the kill -ABRT used to check the status of goroutines.

goroutine 58 (agent.runAggregators) is the problematic one. It is stuck so it is not closing the channel used by the outputs.

goroutine 59 (agent.runOutputs) is waiting till the aggregator close the input channel.

goroutine 60 (agent.flush): this one should be stopped by runOutputs

goroutine 36 (ticker) is going to stop when agent.flushOnce finishes. Meanwhile is keeping one goroutine running so Go could not detect a deadlock.

goroutine 51 (opencensus) is strange. I have not configured nothing related with opencensus. Looking at the code, that lib have a init() function that starts its own goroutine with a time Ticker. At first glance do not look like a good pattern to start a goroutine in the init() function, in fact using the init() looks like is not a good pattern.
Removing this import does not solve the not-detected-deadlock problem.
Maybe could be a good idea to check for no init() functions in the vendor libs, but maybe too much.

If a Telegraf is asked to stop (SIGTERM, SIGINT) while it is aligning an
aggregator, ``internal.SleepContext`` will fail and the goroutine will
return without closing the ``aggregations`` channel getting stuck
forever in ``for metric := range aggregations``
@danielnelson danielnelson self-assigned this Mar 20, 2019
@danielnelson danielnelson added this to the 1.10.2 milestone Mar 20, 2019
@danielnelson danielnelson added the fix pr to fix corresponding bug label Mar 20, 2019
@danielnelson danielnelson merged commit 33ee309 into influxdata:master Mar 20, 2019
@danielnelson
Copy link
Contributor

Thanks for the fix and also for opening the issue with opencensus-go, I agree this should be a requirement going forward. We also need to be careful not to bring in any libraries that do excessive static initialization as well.

danielnelson pushed a commit that referenced this pull request Mar 20, 2019
@danielnelson
Copy link
Contributor

I should clarify that I think using init() is fine, but we shouldn't bring in libraries that start goroutines as a side effect of importing them and we need to be careful about libraries that are initializing large amounts of data.

dupondje pushed a commit to dupondje/telegraf that referenced this pull request Apr 22, 2019
hwaastad pushed a commit to hwaastad/telegraf that referenced this pull request Jun 13, 2019
bitcharmer pushed a commit to bitcharmer/telegraf that referenced this pull request Oct 18, 2019
athoune pushed a commit to bearstech/telegraf that referenced this pull request Apr 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix pr to fix corresponding bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants