
Error starting telegraf for azure eventhub_consumer with persistence file #13322

AntonSigur opened this issue May 23, 2023 · 12 comments

@AntonSigur

I have successfully created a telegraf stream for device data from azure iot hub, using the eventhub_consumer.

However, to avoid re-reading and consuming all of the (millions of) messages in the stream's 7-day buffer, I opted to use file persistence rather than in-memory persistence only, as per the documentation.

After careful configuration, multiple tests, and a code review, I am confident there is a bug.

I get the following error in the log:

E! [telegraf] Error running agent: starting input inputs.eventhub_consumer: creating receiver for partition "1": open [FILEPATH]/[IOTHUB-NAME]-24996945-d1f3232547_hanp1iottest_influxdb_0: no such file or directory

(Tested with multiple file locations and permissions.)

The problem is that the persister does not create the initial persistence files, and then cannot open them. I am not sure where in the code the bug is, or whether it is the result of some sort of race condition.

I found a "silly" workaround: I created the files mentioned in the log, within the directory, with the content {} (an empty JSON object), and the files now persist state between restarts as expected. This could easily break when adding new partitions to the event hub, though, since you have to add a new file for each new partition.
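
A minimal sketch of that workaround in Go, assuming a hypothetical persistence directory, file-name prefix, and partition count (take the exact directory and file names from the "no such file or directory" errors in your own Telegraf log):

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	dir := "/var/lib/telegraf/eventhub-state" // hypothetical persistence directory
	prefix := "myhub-ns_mygroup"              // hypothetical prefix; copy it from the error log
	for partition := 0; partition < 4; partition++ { // one file per Event Hub partition
		name := filepath.Join(dir, fmt.Sprintf("%s_%d", prefix, partition))
		// Seed each checkpoint file with an empty JSON object so the
		// persister can open and parse it on startup.
		if err := os.WriteFile(name, []byte("{}"), 0o600); err != nil {
			panic(err)
		}
	}
}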

Using the latest Telegraf agent on Ubuntu 22: Telegraf 1.26.3 (git: HEAD@90f4eb29) @ Ubuntu 22.04.2 LTS.

@AntonSigur
Author

This is probably due to this breaking change in the upstream library: Azure/azure-event-hubs-go@2a12765.

Reading the persistence state before any has been written now results in an error, where it previously returned nil. ...

@srebhan
Member

srebhan commented May 24, 2023

Upstream bug report Azure/azure-event-hubs-go#280.

@srebhan added the bug, upstream, area/azure, and plugin/input labels on May 24, 2023
@NuMove-JonathanSchmidt

Given the length of time this report has been open on azure-event-hubs-go, would it be acceptable to filter this specific error and handle it in the plugin?
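
For instance, a minimal sketch of that idea, assuming the plugin reads checkpoints through the library's persist.CheckpointPersister interface (readCheckpoint is a hypothetical wrapper, not the plugin's actual code):

package main

import (
	"errors"
	"io/fs"

	"github.com/Azure/azure-event-hubs-go/v3/persist"
)

// readCheckpoint maps "file does not exist" to an empty checkpoint instead of
// letting it abort plugin startup, restoring the behavior from before the
// upstream change linked above.
func readCheckpoint(p persist.CheckpointPersister, namespace, name, group, partitionID string) (persist.Checkpoint, error) {
	cp, err := p.Read(namespace, name, group, partitionID)
	if errors.Is(err, fs.ErrNotExist) {
		return persist.NewCheckpointFromStartOfStream(), nil
	}
	return cp, err
}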

@powersj
Contributor

powersj commented Oct 23, 2023

There are at least two open issues around Azure Event Hubs that need to be sorted out:

  1. Your new issue from today: Event Hub output plugin does not reconnect after a link is closed because of transient credit issues #14162
  2. Error starting telegraf for azure eventhub_consumer with persistence file #13322 (this issue)

Both issues stem from the azure-event-hubs-go library and client. The library should handle these types of issues, or provide a way to handle them, so that our code does not need to special-case specific error conditions.

What we should focus on instead is migrating away from the azure-event-hubs-go library to the azeventhubs library, as the old library's readme recommends. I don't think we can expect support or updates for the existing library anyway.

I would be very happy to review a PR that first migrates to the new library. Then we can have you re-test to see whether these issues still exist; if they do, new issues should be opened against the new repo to gauge the response.

@NuMove-JonathanSchmidt

I agree with the root cause.

If I were fluent in Go, or had someone on my team who was, I'd have been very happy to provide such a PR. No such luck, however.

I assume this plugin isn't high on your priority list?

@powersj
Contributor

powersj commented Oct 24, 2023

I assume this plugin isn't high on your priority list?

The cloud services plugins are much more difficult to test and keep compatible, since we are not direct end-users of them. However, I know I can find users via these issues, and you seem to be able to test these changes out easily.

@NuMove-JonathanSchmidt

If you'd like a limited Event Hub sandbox with read and write keys to test things out, please don't hesitate to contact me directly; Influx sales has my contact info. Otherwise, I'd be happy to test things out in a production-like environment.

@NuMove-JonathanSchmidt

A further possibility: given that Event Hubs exposes a Kafka-compatible surface, do you think it would be worthwhile to switch to the Kafka output plugin instead?

@powersj
Contributor

powersj commented Nov 2, 2023

I was unaware of that. It might be an option worth trying, but I can't say I know enough about Event Hubs to judge the possible trade-offs.

@NuMove-IT

NuMove-IT commented Dec 4, 2023

Hi @powersj,

The trade-off should be minimal, as the behavior of the two services is quite close.

I was able to connect to Event Hubs as a producer with the following configuration:

[[outputs.kafka]]
brokers = ["<namespace>.servicebus.windows.net:9093"]
topic = "<topic-name>"
routing_tag = "host"
compression_codec = 0
required_acks = -1 # Set to 0 if occasional data loss is acceptable
max_retry = 3
max_message_bytes = 1000000
enable_tls = true
insecure_skip_verify = true
sasl_mechanism = "PLAIN"
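# "$$" escapes the "$" in Telegraf config files; the username Azure expects is the literal "$ConnectionString"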
sasl_username = "$$ConnectionString"
sasl_password = "{The actual connection string}"
sasl_version = 0
data_format = "influx"

And as a consumer with the following:

[[inputs.kafka_consumer]]
brokers = ["<namespace>.windows.net:9093"]
topics = ["<topic name>"]
version = "1.0.0"
sasl_mechanism = "PLAIN"
sasl_username = "$$ConnectionString"
sasl_password = "{The actual connection string}"
enable_tls = true
sasl_version = 0
consumer_group = "<consumer group name>"
compression_codec = 0
offset = "oldest"
connection_strategy = "startup"
max_message_len = 1000000
data_format = "influx"

All of the variables can either be inferred from the connection string or, like the consumer group, are also parameters of the Event Hub consumer plugin.

Assuming you want to move forward with a migration of the Event Hub consumer, it should be fairly straightforward to parse the connection string and replicate the functionality with a maintained library behind the plugin.
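
As a rough sketch of that parsing in Go, assuming Azure's documented connection-string format (Endpoint=sb://...;SharedAccessKeyName=...;SharedAccessKey=...;EntityPath=...), the mapping onto the Kafka plugin settings shown above could look like this:

package main

import (
	"fmt"
	"net/url"
	"strings"
)

// kafkaSettings mirrors the Kafka plugin options used in the configs above.
type kafkaSettings struct {
	Broker       string // "<namespace>.servicebus.windows.net:9093"
	Topic        string // the Event Hub name, taken from EntityPath
	SASLUsername string // always the literal "$ConnectionString"
	SASLPassword string // the full connection string itself
}

func fromConnectionString(cs string) (kafkaSettings, error) {
	s := kafkaSettings{SASLUsername: "$ConnectionString", SASLPassword: cs}
	for _, part := range strings.Split(cs, ";") {
		key, value, ok := strings.Cut(part, "=")
		if !ok {
			continue
		}
		switch key {
		case "Endpoint": // e.g. "sb://mynamespace.servicebus.windows.net/"
			u, err := url.Parse(value)
			if err != nil {
				return s, err
			}
			s.Broker = u.Hostname() + ":9093" // the Event Hubs Kafka endpoint port
		case "EntityPath":
			s.Topic = value
		}
	}
	if s.Broker == "" {
		return s, fmt.Errorf("connection string has no Endpoint")
	}
	return s, nil
}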

Alternatively, deprecating the Event Hub plugin and documenting the configuration needed to use it through the Kafka API would prevent people from running into the same issues.

The only outstanding question is with index persistence, as I don't know if that functionality is supported by the Kafka consumer input plugin.

Best regards,
Jonathan Schmidt

@NuMove-IT
Copy link

Persistence in this case is handled directly by the Kafka API on a per-consumer-group basis.

The one piece of functionality that can't be replicated is the ability to start consuming from an arbitrary datetime. When there is no persisted offset to fetch, Kafka is limited to starting from the oldest or newest offset.

@powersj
Contributor

powersj commented Dec 5, 2023

Thank you very, very much for digging into this! It is fantastic to see that a user can use the existing Kafka plugins. In either case, this is something we should document. Would you mind putting up a PR with a brief explanation?

The only outstanding question is with index persistence, as I don't know if that functionality is supported by the Kafka consumer input plugin.

I briefly looked at the Event Hubs docs, including their Kafka migration guide, and didn't see this called out. That doesn't mean it won't be problematic, though :\

Assuming you want to move forward with a migration of the Event Hub consumer, it should be fairly straightforward to parse the connection string and replicate the functionality with a maintained library behind the plugin.

I also looked at what the new azeventhubs library provides with respect to the Event Hubs output plugin. It looks like the new client can use azeventhubs.NewProducerClientFromConnectionString to create a producer client and generate batches to send. One difference is that the partition key is set per batch, not per metric, unless I missed something, so I assume we would need to do some grouping.
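
A short sketch of that constraint against azeventhubs' public producer API; sendGroup is a hypothetical helper that would be called once per partition-key group:

package main

import (
	"context"

	"github.com/Azure/azure-sdk-for-go/sdk/messaging/azeventhubs"
)

// sendGroup sends one batch of already-serialized metrics that share a
// partition key; the key applies to the whole batch, not to single events.
func sendGroup(ctx context.Context, client *azeventhubs.ProducerClient, partitionKey string, payloads [][]byte) error {
	batch, err := client.NewEventDataBatch(ctx, &azeventhubs.EventDataBatchOptions{
		PartitionKey: &partitionKey,
	})
	if err != nil {
		return err
	}
	for _, p := range payloads {
		if err := batch.AddEventData(&azeventhubs.EventData{Body: p}, nil); err != nil {
			// e.g. the batch is full; a real implementation would send and start a new batch
			return err
		}
	}
	return client.SendEventDataBatch(ctx, batch, nil)
}

The client itself would come from azeventhubs.NewProducerClientFromConnectionString(connectionString, hubName, nil).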

For the input plugin, I have not looked into it in detail, and the number of clients we create is a bit more involved. However, the key points would be to verify that we:

  • Can create a consumer client with the same options (e.g. persistence, agent, offset)
  • Can parse event data for similar values to set tags

I do think we should try to migrate, even if we have to deprecate or ignore some options.
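
For reference, a rough consumer-side sketch against the same library's ConsumerClient/PartitionClient API; the start position is hardcoded to earliest here, where the real plugin would derive it from its persistence and offset options:

package main

import (
	"context"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/messaging/azeventhubs"
)

func readPartition(ctx context.Context, connStr, hub, group, partitionID string) error {
	client, err := azeventhubs.NewConsumerClientFromConnectionString(connStr, hub, group, nil)
	if err != nil {
		return err
	}
	defer client.Close(ctx)

	earliest := true
	pc, err := client.NewPartitionClient(partitionID, &azeventhubs.PartitionClientOptions{
		StartPosition: azeventhubs.StartPosition{Earliest: &earliest},
	})
	if err != nil {
		return err
	}
	defer pc.Close(ctx)

	// Wait up to 30 seconds for a batch of up to 100 events.
	recvCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	events, err := pc.ReceiveEvents(recvCtx, 100, nil)
	if err != nil {
		return err
	}
	for _, ev := range events {
		_ = ev.Body // parse event data into metric fields and tags here
	}
	return nil
}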
