Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RabbitMQ input reports forever that it did not complete in time after a single timeout #6814

Closed
jacquesh opened this issue Dec 19, 2019 · 1 comment · Fixed by #6819
Closed
Labels
area/rabbitmq bug unexpected problem or unintended behavior
Milestone

Comments

@jacquesh
Copy link
Contributor

Relevant telegraf.conf:

[agent]
  interval = "10s"
  logfile = "Z:/tick/telegraf/telegraf.log"
[[outputs.influxdb]]
  urls = ["http://127.0.0.1:8086"]
[[inputs.rabbitmq]]
  username = "guest"
  password = "guest"

System info:

Windows Server 2016 64-bit, Telegraf 1.13.0

Steps to reproduce:

  1. Run telegraf and collect RabbitMQ node stats.
  2. Somehow get the requestJSON call to fail when making a request to /api/healthchecks/node/<nodename>
  3. Watch the log file

I achieved step 2 by modifying the RabbitMQ source file such that on the 3rd gather only it changes the username, and then made requestJSON return an error when it saw that username (to mimic a network failure).

Expected behavior:

A single gather times out but future gathers succeed without error (as soon as the broker has recovered from whatever caused the original failure).

Actual behavior:

After a single failure collecting node stats, the gather call to the RabbitMQ plugin times out every time.

Additional info:

Here is an example of what my log file looks like:

2019-12-18T05:48:38Z I! Loaded inputs: win_perf_counters rabbitmq internal
2019-12-18T05:48:38Z I! Loaded aggregators: 
2019-12-18T05:48:38Z I! Loaded processors: 
2019-12-18T05:48:38Z I! Loaded outputs: influxdb
2019-12-18T05:48:38Z I! Tags enabled: dc=London host=sms-main-1
2019-12-18T05:48:38Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"sms-main-1", Flush Interval:10s
2019-12-19T00:03:03Z E! [inputs.rabbitmq] Error in plugin: Get http://localhost:15672/api/healthchecks/node/rabbit@sms-main-1: net/http: timeout awaiting response headers
2019-12-19T00:03:10Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:03:20Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:03:30Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:03:40Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:03:50Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:04:00Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:04:10Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:04:20Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:04:30Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:04:40Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-19T00:04:50Z W! [agent] [inputs.rabbitmq] did not complete within its interval
<etc>

Looking at the source for the Rabbit input, the gatherNodes function starts a goroutine for each node that it needs to run health checks against and then waits on a channel for the same number of results. If that health check call fails, it returns an error from inside the goroutine but never sends anything to the channel, so I expect the problem is that the initial failing call to gatherNodes() is stuck blocking forever waiting on that channel for a health check that it will never receive.

@thesimj
Copy link

thesimj commented Dec 23, 2019

I have the same issue, on Ubuntu Linux 18.04 and Telegraf 1.13.0.
Only telegraf restart help, but after some few hours it's freeze again.

.....
2019-12-23T14:11:20Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-23T14:11:30Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-23T14:11:40Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-23T14:11:50Z W! [agent] [inputs.rabbitmq] did not complete within its interval
2019-12-23T14:12:00Z W! [agent] [inputs.rabbitmq] did not complete within its interval

@danielnelson danielnelson added area/rabbitmq bug unexpected problem or unintended behavior labels Dec 23, 2019
@danielnelson danielnelson added this to the 1.13.1 milestone Dec 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/rabbitmq bug unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants