Allow agent to start when input or output cannot be connected to #3723

Closed
kelein opened this issue Jan 26, 2018 · 23 comments
Labels
area/agent, feature request (Requests for new plugin and for new features to existing plugins)

Comments

@kelein

kelein commented Jan 26, 2018

I have installed Telegraf v1.4.4 via RPM and configured an input for kafka_consumer as follows:

## Read metrics from Kafka topic(s)
[[inputs.kafka_consumer]]
  brokers = ["localhost:9092"]
  topics = ["telegraf"]

It works well for gathering Kafka metrics. Unfortunately, when the Kafka broker goes down abnormally, Telegraf fails to restart. The Telegraf log shows this:

$ service telegraf status
 
Redirecting to /bin/systemctl status telegraf.service
● telegraf.service - The plugin-driven server agent for reporting metrics into InfluxDB
   Loaded: loaded (/usr/lib/systemd/system/telegraf.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Fri 2018-01-26 20:18:16 CST; 45min ago
     Docs: https://github.com/influxdata/telegraf
  Process: 8954 ExecStart=/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d $TELEGRAF_OPTS (code=exited, status=0/SUCCESS)
 Main PID: 8954 (code=exited, status=0/SUCCESS)

/var/log/telegraf/telegraf.log

2018-01-26T10:33:56Z I! Starting Telegraf v1.4.4
2018-01-26T10:33:56Z I! Loaded outputs: influxdb opentsdb prometheus_client
2018-01-26T10:33:56Z I! Loaded inputs: inputs.kernel inputs.system inputs.filestat inputs.mongodb inputs.kafka_consumer inputs.diskio inputs.swap inputs.disk inputs.docker inputs.internal inputs.kernel_vmstat inputs.cpu inputs.mem inputs.processes inputs.net inputs.zookeeper inputs.logparser
2018-01-26T10:33:56Z I! Tags enabled: host=127.0.0.1 user=telegraf
2018-01-26T10:33:56Z I! Agent Config: Interval:10s, Quiet:false, Flush Interval:10s 
2018-01-26T10:33:57Z E! Error when creating Kafka Consumer, brokers: [localhost:9092], topics: [telegraf]
2018-01-26T10:33:57Z E! Service for input inputs.kafka_consumer failed to start, exiting
kafka: client has run out of available brokers to talk to (Is your cluster reachable?)

Expected behavior:

Telegraf restarts successfully regardless of an input plugin's internal error.

Actual behavior:

Telegraf fails to restart owing to an internal error in the kafka_consumer input plugin.

@danielnelson
Contributor

This was by design; however, I think we should provide a config/CLI option that tells Telegraf to continue if any ServiceInput or Output does not connect successfully.

@danielnelson danielnelson added the feature request Requests for new plugin and for new features to existing plugins label Jan 26, 2018
@danielnelson danielnelson changed the title Telegraf v1.4.4 failed to boot while inputs.kafka_consumer occurs error Allow agent to start when input or output cannot be connected to Jan 26, 2018
@HariSekhon

+1

Request that Telegraf print a warning but stay up and periodically retry (configurable retry setting?).

I am seeing this with the OpenTSDB output plugin. OpenTSDB takes a while to start (booting everything in Docker Compose), so Telegraf gets a connection refused and quits:

opentsdb-telegraf_1  | 2018-01-31T16:18:10Z E! Failed to connect to output opentsdb, retrying in 15s, error was 'OpenTSDB Telnet connect fail: dial tcp 172.30.0.2:4242: getsockopt: 
connection refused'
opentsdb-telegraf_1  | 2018-01-31T16:18:25Z E! OpenTSDB Telnet connect fail: dial tcp 172.30.0.2:4242: getsockopt: connection refused
docker_opentsdb-telegraf_1 exited with code 1

@PhoenixRion
Contributor

+1
Ran into a similar issue while testing with the Jolokia plugin when the agent wasn't started before Telegraf.

A good retry period would likely be the (flush_)interval. In the case of output plugins, it would also be ideal if points were still buffered.

@danielnelson
Contributor

After some more thought, maybe we should change outputs so that Connect does not return an error unless the plugin will never be able to start up. Outputs all need reconnection logic anyway; the idea of connecting once at startup is quite a large simplification, and the initial connection does not indicate that metrics will be sendable when the time comes. It is nice to catch misconfigured outputs that will never work, but I think that test shouldn't be run during normal startup.

Perhaps in the next version of the Output interface we should have both a Check() and a Start() function instead. Check would optionally connect or do some other network validation, but would only be run when specifically asked for. Start could be used to start needed background goroutines but would only error on fatal issues that cannot be recovered from.
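
Roughly, a hypothetical sketch of that split (illustrative names and semantics only, not the actual Telegraf plugin API):

package telegraf

// Hypothetical sketch of the proposed Output interface split; names and
// semantics follow the comment above, not the real Telegraf plugin API.
type Output interface {
	// Check optionally connects or performs other network validation.
	// It would only be run when specifically asked for (e.g. a test mode),
	// never during normal startup.
	Check() error

	// Start launches any background goroutines the plugin needs. It returns
	// an error only for fatal issues that cannot be recovered from; transient
	// connection failures are left to the plugin's own reconnection logic.
	Start() error

	// Write and Close keep their existing roles; Metric is the existing
	// telegraf.Metric type.
	Write(metrics []Metric) error
	Close() error
}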

@arnoldyahad

hey @danielnelson, was a new version of the Output interface released?

@danielnelson
Contributor

No, this has not been worked on yet.

@randallt

This is actively an issue for me. Apparently some of our VMs come up when the network isn't quite ready. We rely on Telegraf to relay important metrics throughout our infrastructure, upon which our alerting is based. When Telegraf tries once and then just quits, it looks like the host has gone away. I definitely vote for a fix here.

Further details: it seems the Wavefront output plugin is the cause for us, combined with a temporary DNS resolution issue.

@sgreszcz

sgreszcz commented Jul 5, 2019

Is there a way to work around this? There were some changes in Elasticsearch 7 which cause the output to fail. This unfortunately causes Telegraf to restart continuously, which breaks my Kafka output, which is working fine. For now I guess I'll need to comment out the Elasticsearch output.

@glinton
Contributor

glinton commented Jul 5, 2019

@sgreszcz regarding Elasticsearch 7, can you test out the nightly build? There were some changes merged for ES7.

@joe-holder

Plus one for this feature

@serrj-sv serrj-sv mentioned this issue Feb 7, 2020
@AtakanColak

For those who desperately need it, I made a risky workaround in agent.go:

[screenshot of the agent.go change; image not preserved]
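
Very roughly, the kind of change meant is probably along these lines: log a failed service-input start and keep going instead of returning the error that makes Telegraf exit. This is only an illustrative sketch with made-up names, not the actual agent.go code (the real ServiceInput.Start also takes an Accumulator):

package agent

import "log"

// ServiceInput mirrors the relevant part of Telegraf's plugin interface,
// simplified for illustration.
type ServiceInput interface {
	Start() error
}

// startServiceInputs sketches the "risky workaround": instead of returning
// the first Start error (which makes the whole agent exit), log it and skip
// that input so the agent still comes up.
func startServiceInputs(inputs []ServiceInput) {
	for _, in := range inputs {
		if err := in.Start(); err != nil {
			log.Printf("E! service input %T failed to start: %v (skipping)", in, err)
		}
	}
}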

@Hipska
Contributor

Hipska commented Nov 2, 2020

Any update on this? This becomes problematic if you want to (for example) update the Telegraf configuration during an outage of the Kafka cluster.

@ssoroka ssoroka assigned ssoroka and unassigned danielnelson Nov 2, 2020
@reimda
Contributor

reimda commented Nov 3, 2020

There was some discussion on this on slack: https://influxcommunity.slack.com/archives/C019JDRJAE7/p1604309611104100

@ssoroka ssoroka added this to the Planned milestone Nov 6, 2020
@vipinvkmenon
Contributor

vipinvkmenon commented Dec 15, 2020

Hmm... +1

Same issue for missing InfluxDB connections as well.

Ideally, a flag to switch this on or off, with a retry count (-1 for infinite), would be nice.

@cnjlq84

cnjlq84 commented Feb 24, 2021

I have a similar problem.
Communication errors with any OPC UA server can cause the whole [[inputs.opcua]] plugin to fail. The error message is like this:

2021-02-24T09:32:18Z E! [inputs.opcua] Error in plugin: Get Data Failed: Status not OK: Bad (0x800000000)

How to solve this problem?

@derpeter

> Hmm... +1
>
> Same issue for missing InfluxDB connections as well.
>
> Ideally, a flag to switch this on or off, with a retry count (-1 for infinite), would be nice.

I need the same feature but for a different reason: I use Telegraf to collect metrics on a mobile node. If the node moves outside the coverage of the mobile network, Telegraf eventually stops trying and never starts again until restarted. (Just to add another use case for this feature.)

@lesinigo

lesinigo commented Jul 2, 2021

Seems like the same issue has been fixed specifically in outputs.kafka with PR #9051, merged in Telegraf v1.19.0.

I'd really like to see a more general solution as already proposed, because we had many instances of missing metrics in InfluxDB when an unrelated (Elasticsearch) output from the same Telegraf instance wasn't working.

@Hipska
Contributor

Hipska commented Jul 2, 2021

I now have a similar problem when I want to start a second Telegraf instance with the same config. (The config has a service input that listens on a specific port.)

Being able to tell Telegraf not to crash(!) when it cannot bind to the specified port would be so useful.

@daviesalex

daviesalex commented Sep 17, 2021

There are two failures that we see in Telegraf 1.20-rc0 in the Kafka plugin alone, despite #9051, which was supposed to fix this plugin:

1. If the Kafka backends are just down

Use this config to test:

[agent]
  interval = "1s"
  flush_interval = "1s"
  omit_hostname = true
  collection_jitter = "0s"
  flush_jitter = "0s"

[[outputs.kafka]]
  brokers = ["server1:9092","server2:9092","server3:9092"]
  topic = "xx"
  client_id = "telegraf-metrics-foo"
  version = "2.4.0"
  routing_tag = "host"
  required_acks = 1
  max_retry = 100
  sasl_mechanism = "SCRAM-SHA-256"
  sasl_username = "foo"
  sasl_password = "bar"
  exclude_topic_tag = true
  compression_codec = 4
  data_format = "msgpack"

[[inputs.cpu]]

[[outputs.file]]
  files = ["stdout"]

Make sure the client can't talk to server[1-3]; we did ip route add x via 127.0.0.1 to null-route it, but you could use a firewall or just point it at IPs that are not running Kafka.

What we expect:

  • Kafka output fails and tries to reconnect 100 times
  • I can still see the CPU input plugin sending data to stdout
  • Once Kafka manages to connect then I see the data there as well

What actually happens:

  • Kafka tries to connect a couple of times
  • CPU input plugin data is never passed to stdout
  • After Kafka fails, Telegraf exits with an error

2. If the Kafka sasl_password is wrong and SASL auth enabled

This is trivial to reproduce: just change the sasl_password in a working config.

What we expect:

  • Kafka output fails and tries to reconnect X times
  • Everything else works fine

What actually happens:

  • Telegraf immediately fails to start with this error (process exits):
[root@x ~]# /usr/local/telegraf/bin/telegraf -config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/conf.d
2021-09-16T11:23:51Z I! Starting Telegraf build-50
...
2021-09-16T11:23:51Z E! [agent] Failed to connect to [outputs.kafka], retrying in 15s, error was 'kafka server: SASL Authentication failed.'

It would be really awesome to

  • Fix these two particular cases in Telegraf 1.20 before GA
  • Get some sort of generic change in so this can't happen (this game of whack-a-mole where we see these failures in production, Telegraf stops working, and we wait for a release to fix each case is not good)

@reimda
Contributor

reimda commented Sep 17, 2021

@daviesalex Thanks for your report. Could you open a new issue specific to the problems you're having? Continuing to comment on this three-year-old issue isn't a good way to track what you're seeing and plan a fix. Please mention this issue and #9051 for context in the new issue you open.

Given your example I would expect the CPU data to appear on stdout. I would also expect the kafka output to retry 100 times since you have max_retry set to 100. After a quick look at the code, I see that the setting is passed to sarama, the Kafka library Telegraf uses. What you're seeing may be a problem with sarama. I'm not sure what retry values it allows.

This will not be fixed in 1.20.0 GA. That release was scheduled for Sept 15, so it is already two days late, and we are currently working on getting it officially released. Since there is also no fix ready for this issue, it is unreasonable to ask for 1.20.0 GA to be held up for this.

1.20.1 is the absolute earliest you could expect a fix to be in an official release. 1.20.1 is scheduled for Oct 6. You'll be able to test an alpha build as soon as someone is able to debug your issue and provide a PR that passes CI tests.

@powersj
Contributor

powersj commented Sep 17, 2021

The max_retry kafka output config option is specific to the number of attempts Telegraf will make to send data, not attempts to connect to a system.

Telegraf, as it is now, attempts the output plugin's Connect function twice per connectOutput in agent.go. This means we currently attempt to connect to a Kafka system 5 times, with 250ms between each try; Telegraf then sleeps for 15 seconds and attempts to connect another 5 times, with 250ms between each try.

To specifically allow the kafka output more connection attempts, exposing the following Kafka client (sarama) config options would give some flexibility, but this would still be limited to two attempts:

Metadata.Retry.Max (default: 5)
Metadata.Retry.Backoff (default: 250ms)
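
For context, those are fields on sarama's client configuration. A minimal sketch of how a producer sets them (values shown are just sarama's defaults; import path as used by Telegraf at the time):

package main

import (
	"log"
	"time"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()

	// How many times sarama retries fetching broker metadata (the initial
	// "connect") and how long it waits between attempts; these are the two
	// settings that would need to be exposed to make connecting more patient.
	cfg.Metadata.Retry.Max = 5                          // sarama default: 5
	cfg.Metadata.Retry.Backoff = 250 * time.Millisecond // sarama default: 250ms

	producer, err := sarama.NewSyncProducer([]string{"server1:9092"}, cfg)
	if err != nil {
		log.Fatalf("kafka connect failed after retries: %v", err)
	}
	defer producer.Close()
}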

@daviesalex

@reimda new issue submitted for this particular situation: #9778

I personally think this goes to show the value of this issue (albeit it's 4 years old). Playing whack-a-mole with each plugin to catch every possible failure (with each one being treated as its own totally separate issue) is not optimal. This example shows that even InfluxData developers trying to fix a specific plugin failing in a specific and trivial-to-reproduce case find this difficult to get right.

Telegraf could really benefit from an architectural change that prevents plugin A from blocking plugin B, regardless of missed exception handling deep in plugin A's third-party dependencies, because at scale you really don't want your CPU metrics to stop because some other third-party system (of the huge number that Telegraf now has plugins for) started doing something odd. The alternative, I guess, is to run one Telegraf per plugin, but the overhead of that for us would be enormous.

@powersj
Contributor

powersj commented Jan 23, 2023

In #12111 a new config option was added to the kafka plugins to allow for retrying connections on failure. This means that the plugin can start even if the connection is not successful.

While we will not add a global config option to let any and all plugins start on failure, we are more than happy to see plugin-by-plugin options to allow connection failures on start. If there is another plugin you are interested in seeing this for, please open a new issue (assuming one does not already exist) requesting something similar.

As a result I am going to go ahead and close this issue due to #12111 landing. Thanks!
