AMQP Consumer Stops function with bad messages #5285

Closed
Esity opened this issue Jan 14, 2019 · 10 comments · Fixed by #5286
Labels
area/rabbitmq, bug (unexpected problem or unintended behavior), regression (something that used to work, but is now broken)
Milestone

Comments

@Esity

Esity commented Jan 14, 2019

System info:

RHEL 7
Telegraf 1.9.X

Steps to reproduce:

Set a reasonable max_undelivered_messages in the amqp_consumer input.
Set the output to InfluxDB.
Send bad metrics over RabbitMQ (RMQ).

Expected behavior:

Telegraf should either allow for a dead letter exchange from RabbitMQ or it should "consume" the message but drop the metric

Actual behavior:

After the AMQP consumer has grabbed X messages (X being either the prefetch or max_undelivered_messages), it stops functioning. Basically, Telegraf says "I have 50 messages and my max is 50", but the messages are bad, so it won't write them to InfluxDB, and Telegraf just stops doing anything.

Additional Context

This occurred in our environment once we upgraded our main Telegraf writers (amqp_consumer input and influx output) to 1.9.X.

We realized that with different groups using [[inputs.mysql]], some had metric_version not set, some had it set to 1 and some had it set to 2. This is what was causing the error buildup in the writers

@danielnelson
Contributor

This should be fixed in 1.9.2: #5170

@Esity
Author

Esity commented Jan 14, 2019

@danielnelson Is this different? That one appears to be about sending a single message that is empty. Just want to confirm, since #5170 doesn't have a ton of info.

@danielnelson
Contributor

Oh, I think I misunderstood the issue. It is stopped because it is trying to send to InfluxDB but is unable to make progress?

@Esity
Author

Esity commented Jan 14, 2019

Telegraf will throw an error in the log, then it won't ack the message or drop it but will just hang on to it. Telegraf as a service will continue to run, but once it has done that enough times, it will stop grabbing messages. If you look in RabbitMQ, the Unacked stat will equal either max_undelivered_messages or the prefetch, whichever is lower.

So if your prefetch is 50, you have 100 messages in the queue, and the first 50 are messages that will cause an error while writing, Telegraf will never pick up the next 50 messages because it will hang on to those first 50 that are bad.

Not sure if I am explaining it clearly
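
To make the scenario concrete, here is a rough toy model of the stall in Python. It is only a sketch: `ToyBroker`, `effective_window`, and `consume_without_settling_failures` are hypothetical names made up for illustration, not Telegraf or RabbitMQ APIs, and the value 1000 for max_undelivered_messages is an arbitrary assumption.

```python
from collections import deque


class ToyBroker:
    """Hypothetical in-memory stand-in for a queue with a prefetch limit."""

    def __init__(self, messages, prefetch):
        self.ready = deque(messages)  # messages still waiting in the queue
        self.unacked = set()          # delivered but neither acked nor rejected
        self.prefetch = prefetch

    def deliver(self):
        """Hand out the next message, or None if the window is full or empty."""
        if not self.ready or len(self.unacked) >= self.prefetch:
            return None
        msg = self.ready.popleft()
        self.unacked.add(msg)
        return msg

    def ack(self, msg):
        self.unacked.discard(msg)


def effective_window(prefetch, max_undelivered_messages):
    """The consumer stalls at whichever limit is lower."""
    return min(prefetch, max_undelivered_messages)


def consume_without_settling_failures(broker, is_good):
    """Ack good messages only; bad ones keep occupying a prefetch slot."""
    processed = []
    while (msg := broker.deliver()) is not None:
        if is_good(msg):
            broker.ack(msg)
            processed.append(msg)
        # Bad message: never acked or rejected, so its slot is never freed.
    return processed


# Messages 0..49 are "bad", 50..99 are "good".
window = effective_window(prefetch=50, max_undelivered_messages=1000)
broker = ToyBroker(range(100), prefetch=window)
processed = consume_without_settling_failures(broker, is_good=lambda m: m >= 50)

print(len(processed), len(broker.unacked), len(broker.ready))  # 0 50 50
```

In this model the 50 good messages are never delivered, because the 50 bad ones fill the window and nothing ever releases them.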

@Esity
Author

Esity commented Jan 14, 2019

I can also retest this against 1.9.2 to see if it is still an issue. I know this impacts 1.9.0 and 1.9.1 but doesn't impact <1.9, because there we ack messages immediately.

@danielnelson
Contributor

Can you show the log output?

@Esity
Author

Esity commented Jan 14, 2019

2019-01-07T17:44:26Z E! [inputs.amqp_consumer]: Error in plugin: metric parse error: expected field at offset 49167:

It is worth noting that these messages can be very large, as currently all Telegraf agents collecting metrics batch them into a single AMQP message before sending them.

danielnelson reopened this Jan 14, 2019
danielnelson added the bug, regression, and area/rabbitmq labels Jan 14, 2019
danielnelson added this to the 1.9.3 milestone Jan 14, 2019
@danielnelson
Contributor

Looks like we are hitting the prefetch limit because the message is neither acked nor rejected when a parse error occurs.
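
One way to picture a remedy is to settle every delivery, rejecting the ones that fail to parse so their prefetch slots are released. The Python below is a minimal sketch under that assumption, not the actual change made in Telegraf: `ToyBroker`, `parse`, and `consume` are hypothetical names, and the "parser" is a deliberately crude stand-in for real line protocol parsing.

```python
from collections import deque


class ToyBroker:
    """Hypothetical in-memory queue with a prefetch limit; not an AMQP client."""

    def __init__(self, messages, prefetch):
        self.ready = deque(messages)
        self.unacked = set()
        self.prefetch = prefetch
        self.dead_lettered = []

    def deliver(self):
        if not self.ready or len(self.unacked) >= self.prefetch:
            return None
        msg = self.ready.popleft()
        self.unacked.add(msg)
        return msg

    def ack(self, msg):
        self.unacked.discard(msg)

    def reject(self, msg):
        # Rejecting without requeue frees the slot. A real broker could route
        # the message to a dead letter exchange if one is configured.
        self.unacked.discard(msg)
        self.dead_lettered.append(msg)


def parse(msg):
    """Crude stand-in parser: a payload without a space has no field set."""
    if " " not in msg:
        raise ValueError(f"expected field in {msg!r}")
    return msg


def consume(broker):
    written = []
    while (msg := broker.deliver()) is not None:
        try:
            metric = parse(msg)
        except ValueError:
            broker.reject(msg)  # settle the delivery even though parsing failed
            continue
        written.append(metric)  # stand-in for a successful write to the output
        broker.ack(msg)
    return written


bad = [f"bad{i}" for i in range(50)]
good = [f"cpu value={i}" for i in range(50)]
broker = ToyBroker(bad + good, prefetch=50)
written = consume(broker)

print(len(written), len(broker.dead_lettered), len(broker.unacked))  # 50 50 0
```

With failures rejected, the 50 bad messages no longer pin the window, so the 50 good ones behind them get consumed.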

@Esity
Author

Esity commented Jan 14, 2019

@danielnelson Do we have an expected release date for 1.9.3?

@danielnelson
Contributor

Should be on the 22nd, I can get you a pre-release sooner though if it would be helpful.
