
Feature to "pause" input message queue consumers while output(s) are down #2240

Closed
biker73 opened this issue Jan 9, 2017 · 7 comments

biker73 commented Jan 9, 2017

Bug report

Relevant telegraf.conf:

[global_tags]
env="UAT"

[agent]
interval = "5s"
round_interval = true
metric_batch_size = 5000
metric_buffer_limit = 20000
collection_jitter = "0s"
flush_jitter = "0s
precision = "ms"
debug = true
quiet = false
omit_hostname = false

[[outputs.influxdb]]
urls = ["host:port"]
database = "historical"
retention_policy = "events.10d"
write_consistency = "any"
timeout = "5s"
username = "userid"
password = "password"

System info:

[Include Telegraf version, operating system name, and other relevant details]

Steps to reproduce:

  1. Start telegraf with a kafka input and an influxdb output
  2. Ensure a stream of new data is sent to the kafka topic with a valid timestamp for the time the data was generated (i.e. do NOT use the influx auto-generated one)
  3. Verify messages are on the kafka queues and that they appear in influx via telegraf
  4. Stop influx while new messages are still being generated on the kafka topic being read
  5. After a 2-3 minute sample time, start influx again
  6. Verify there is a gap in the data in influx for the period it was stopped
  7. Confirm in the telegraf logs that metrics were dropped because no influxdb output was available

Expected behaviour:

Telegraf should retain the kafka offset of the last successful write. If the influxdb (or other) output is not available, it should stop reading data and pause polling until the output becomes available again. Once it is available, telegraf should resume reading from kafka from the stored offset of the last successful write.

Actual behaviour:

Messages are dropped / ignored. As a time series platform, telegraf / influx need to cope with outages; otherwise there are huge gaps in the data for the periods influx has been unavailable.

Use case:

To ensure no gaps or loss of data in a time series platform. Kafka stores the data, so telegraf should detect that influx is unavailable and stop pulling metrics it would only end up dropping. It should resume from the last good kafka offset when influx becomes available.


sparrc commented Jan 9, 2017

telegraf does buffer up to metric_buffer_limit messages.

It's true that Kafka in particular could be handled differently. Currently there is no notification system informing an input plugin of what has happened on the output end of telegraf; thus far we have designed telegraf inputs to be independent of the outputs, and implementing this feature would fundamentally change that.
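
As a stopgap under that design, the [agent] buffer can be sized to ride out short outages. A minimal sketch; the limits below are illustrative, not recommendations:

[agent]
## metrics are flushed to outputs in batches of up to this size
metric_batch_size = 5000
## unwritten metrics kept in memory, per output, while writes fail;
## the oldest metrics are dropped once this limit overflows
metric_buffer_limit = 100000

Because the buffer lives in memory, this realistically covers outages of minutes, not the hours-long or weekend-long outages described below.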

sparrc added this to the Future Milestone milestone Jan 9, 2017

biker73 commented Jan 9, 2017

Thanks for the quick response. I was unsure whether to log this as a bug or a feature; I now suspect it is more a feature. I understand that the buffer helps, but it is possibly more for handling momentary network glitches, latency changes, data bursts, etc., which helps for a few seconds / minutes. It is no good for an outage of hours, or a weekend, due to maintenance or an unexpected issue. Maybe this can be changed to a feature request or a longer term integration in some way? I am sure many others would benefit from this too. As a workaround I guess I can manually re-load data from a specific offset in kafka for now - I'll need to see if telegraf logs the offsets so I know where to load from.

Thanks

Jason


sparrc commented Jan 9, 2017

see also #802


biker73 commented Jan 9, 2017

I think #802 can be resolved using Kafka :) All we then need is for telegraf to auto-recover from where it left off in the event of a platform failure somewhere :)

sparrc changed the title from "Bug - telegraf drops input data if influx output becomes unavailable" to "Implement option of "pausing" telegraf message queue consumers while output(s) are down" Jan 13, 2017

sparrc commented Jan 13, 2017

I'm changing the title of this issue because I think it's a good general feature to have for all message queue input plugins.

Basically the idea would be that telegraf could signal to some input plugins (namely message queues, which have their own persistent storage) to stop accepting new messages until all output plugins are operational again.

My thoughts on this are that it would only apply to message queues, and not apply to plugins that don't have a clear datastore behind them, like mem, cpu, statsd, tcp_listener, etc.

See also #2265


biker73 commented Jan 13, 2017

This would also work very effectively (and is, I think, preferred). I raised #2265 because it could be an easier solution in terms of re-work / coding, leaving the effort on the implementor to manage; however, it would be for kafka only, whereas a pause mechanism would be universal.

sparrc changed the title from "Implement option of "pausing" telegraf message queue consumers while output(s) are down" to "Feature to "pause" telegraf message queue consumers while output(s) are down" Jan 13, 2017
sparrc changed the title from "Feature to "pause" telegraf message queue consumers while output(s) are down" to "Feature to "pause" input message queue consumers while output(s) are down" Jan 13, 2017
danielnelson removed this from the Future Milestone milestone Jun 14, 2017
danielnelson commented Nov 12, 2018

@biker73 I have added this functionality to 1.9 (currently in rc). The queue consumers, including kafka_consumer, have a new option max_undelivered_messages that limits how many messages will be pulled from the queue before sending. If you could try it out and let me know if you run into any issues, that would be really valuable.
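
For later readers, a minimal sketch of the new option in context (broker and topic values are placeholders; 1000 is the documented default):

[[inputs.kafka_consumer]]
brokers = ["localhost:9092"]
topics = ["telegraf"]
data_format = "influx"
## maximum number of messages read from the queue but not yet delivered
## to an output; the consumer pauses reading once this many are in
## flight, so unread messages stay in kafka instead of being dropped
max_undelivered_messages = 1000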

danielnelson added this to the 1.9.0 milestone Nov 12, 2018