Allow agent to start when input or output cannot be connected to #3723
Comments
This was by design; however, I think we should provide a config/CLI option that tells Telegraf to continue if any ServiceInput or Output does not successfully connect. |
+1. Request that Telegraf print a warning but stay up and periodically retry (configurable retry setting?). I am seeing this with the OpenTSDB output plugin: OpenTSDB takes a while to start (booting everything in docker compose), so Telegraf gets a connection refused and quits:
|
+1 A good retry period would likely be the (flush_)interval. In the case of output plugins, it would also be ideal if points were still buffered. |
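The idea above (retry the connection once per flush interval while still buffering points) can be sketched roughly as follows. This is an illustrative Python sketch, not Telegraf code: `BufferedOutput`, `connect`, `write`, and `limit` are hypothetical names chosen for the example, and the real agent is written in Go.

```python
from collections import deque


class BufferedOutput:
    """Hypothetical output wrapper: buffers points and retries connecting on each flush."""

    def __init__(self, connect, write, limit=1000):
        self._connect = connect  # callable that raises ConnectionError on failure
        self._write = write      # callable that takes a list of points
        self._buffer = deque(maxlen=limit)  # oldest points are dropped once full
        self.connected = False

    def add(self, point):
        # Points are always accepted, even while the output is disconnected.
        self._buffer.append(point)

    def flush(self):
        """Called once per flush interval; returns the number of points written."""
        if not self.connected:
            try:
                self._connect()
                self.connected = True
            except ConnectionError as err:
                # Warn and keep running instead of exiting the agent.
                print(f"W! output not connected, will retry on next flush: {err}")
                return 0
        points = list(self._buffer)
        self._write(points)
        # Only clear after a successful write so points survive a failed write.
        self._buffer.clear()
        return len(points)
```

With this shape, a slow-starting backend only delays delivery: points collected while the connection is refused are written on the first flush after the connection succeeds, up to the buffer limit.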
After some more thought maybe we should change outputs so that Perhaps in the next version of the Output interface we should have both a |
hey @danielnelson was a new version of the output released? |
No this has not been worked on yet. |
This is actively an issue for me. Apparently we have some VMs come up when the network isn't quite ready. We rely on telegraf to relay important metrics throughout our infrastructure, upon which our alerting is based. When telegraf tries once and then just quits, it looks like the host has gone away. I definitely vote for a fix here. Further details--it seems like it is the Wavefront Output plugin that is the cause for us, and a temporary DNS resolution issue. |
Is there a way to work around this? There were some changes in Elasticsearch 7 which cause the output to fail. This unfortunately causes telegraf to restart continuously, which breaks my kafka output, which is otherwise working OK. For now I guess I'll need to comment out the Elasticsearch output. |
@sgreszcz regarding the elasticsearch 7, can you test out the nightly build, there were some changes that got merged for ES7 |
Plus one for this feature |
Any update on this? This becomes problematic if you want to (for example) update the Telegraf configuration during an outage of the Kafka cluster. |
There was some discussion on this on slack: https://influxcommunity.slack.com/archives/C019JDRJAE7/p1604309611104100 |
Hmm... +1, same issue for missing influxdb connections as well. Ideally, a flag to switch this on or off, with a retry count (-1 for infinite), would be nice. |
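A retry count where -1 means "retry forever" could look something like the sketch below. It is only an illustration of the requested semantics in Python; `connect_with_retries`, `max_retries`, `delay`, and `sleep` are hypothetical names, not existing Telegraf settings.

```python
import time


def connect_with_retries(connect, max_retries=-1, delay=0.0, sleep=time.sleep):
    """Call connect() until it succeeds and return the number of attempts made.

    max_retries counts retries after the first attempt; -1 retries forever.
    The last ConnectionError is re-raised once the retries are used up.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            connect()
            return attempt
        except ConnectionError:
            retries_used = attempt - 1
            if max_retries >= 0 and retries_used >= max_retries:
                raise
            sleep(delay)  # wait before the next attempt
```

Setting `max_retries=0` reproduces the current "try once and quit" behavior, so a flag like this would stay backward compatible.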
I got a similar problem. 2021-02-24T09:32:18Z E! [inputs.opcua] Error in plugin: Get Data Failed: Status not OK: Bad (0x800000000) How to solve this problem? |
I need the same feature but for a different reason. I use telegraf to collect metrics on a mobile node; if the node moves outside the coverage of the mobile network, telegraf eventually stops trying and never starts again until restarted. (Just to add another use case for this feature.) |
Seems like the same issue has been fixed specifically in I'd really like to see a more general solution as already proposed because we had many instances of missing metrics in InfluxDB because an unrelated (ElasticSearch) output from the same Telegraf wasn't working. |
I now have a similar problem when I would like to start a second telegraf instance with the same config. (The config has a service input that listens to a specific port) Being able to tell telegraf not to crash(!) when it cannot bind to the specified port would be so useful. |
There are 2 failures that we see in Telegraf 1.20-rc0 just in the Kafka plugin, despite #9051, which was supposed to fix this plugin:
1. If the Kafka backends are just down
Use this config to test:
Make sure the client can't talk to server[1-3]; we did ip route add x via 127.0.0.1 to null route it, but you could use a firewall or just point it to IPs that are not running Kafka.
What we expect:
What actually happens:
2. If the Kafka sasl_password is wrong and SASL auth is enabled
This is trivial to reproduce - just change the sasl_password for a working config.
What we expect:
What actually happens:
It would be really awesome to
|
@daviesalex Thanks for your report. Could you open a new issue specific to the problems you're having? Continuing to comment on this three-year-old issue isn't a good way to track what you're seeing and plan a fix. Please mention this issue and #9051 for context in the new issue you open.

Given your example I would expect the cpu data to appear on stdout. I would also expect the kafka output to retry 100 times since you have max_retry set to 100. After a quick look at the code, I see that the setting is passed to sarama, the kafka library telegraf uses. What you're seeing may be a problem with sarama. I'm not sure what retry values it allows.

This will not be fixed in 1.20.0 GA. That release was scheduled for Sept 15, so it is already two days late, and we are currently working on getting it officially released. Since there is also no fix ready for this issue, it is unreasonable to ask for 1.20.0 GA to be held up for this. 1.20.1 is the absolute earliest you could expect a fix to be in an official release. 1.20.1 is scheduled for Oct 6. You'll be able to test an alpha build as soon as someone is able to debug your issue and provide a PR that passes CI tests. |
Telegraf, as it is now, would attempt the output plugin's Connect function twice per To specifically allow the kafka output more connection attempts, exposing the following kafka client (sarama) config options would give some flexibility, but would still be limited to two attempts:
|
@reimda new issue submitted for this particular situation: #9778. I personally think that this goes to show the value of this issue (albeit it's 4 years old). Playing whack-a-mole with each plugin to catch every possible failure (with each one being treated as its own totally separate issue) is not optimal. This example shows that even InfluxData developers trying to fix a specific plugin failing in a specific and trivial-to-reproduce case find this difficult to get right. Telegraf could really benefit from an architectural change that prevents plugin A blocking plugin B, regardless of missed exception handling deep in the third-party dependencies of plugin A, because at scale you really don't want your CPU metrics to stop because some other third-party system (of the huge number that telegraf now has plugins for) started doing something odd. The alternative I guess is to run one telegraf per plugin, but the overhead of that for us would be enormous. |
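The "plugin A must not block plugin B" property asked for above can be illustrated with a small sketch. This is not how Telegraf is implemented; it is a minimal Python illustration, and `write_to_all` and the plugin names in the usage note are hypothetical.

```python
def write_to_all(outputs, points):
    """Write points to every output; return {name: error} for the ones that failed.

    outputs maps a plugin name to a callable that takes a list of points.
    """
    failures = {}
    for name, write in outputs.items():
        try:
            write(points)
        except Exception as err:
            # Deliberately broad: an unhandled error in one plugin (or its
            # third-party dependencies) must not stop the remaining outputs.
            failures[name] = err
    return failures
```

For example, if a hypothetical "kafka" output raises while a "stdout" output works, the CPU metrics still reach stdout and the caller only sees `{"kafka": <error>}` to log or retry.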
In #12111 a new config option was added to the kafka plugins to allow for retrying connections on failure. This means that the plugin can start even if the connection is not successful. While we will not add a global config option to let any and all plugins start on failure, we are more than happy to see plugin-by-plugin options that allow connection failures on start. If there is another plugin you are interested in seeing this for, please open a new issue (assuming one does not already exist) requesting something similar. As a result, I am going to go ahead and close this issue due to #12111 landing. Thanks! |
I have installed telegraf v1.4.4 via rpm and configured an input for kafka_consumer as follows:
It works well for gathering kafka metrics. Unfortunately, when the kafka broker goes down abnormally, telegraf fails to restart. The telegraf log shows this:
Expected behavior:
Telegraf restarts successfully regardless of internal errors in some inputs.
Actual behavior:
Telegraf fails to restart owing to an internal error in the kafka_consumer input plugin.