Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(inputs.mqtt_consumer): Implement startup error behaviors #15486

Merged
merged 14 commits into from
Jun 17, 2024

Conversation

srebhan
Copy link
Member

@srebhan srebhan commented Jun 11, 2024

Summary

This PR allows to use the startup-error-behavior options error, retry and ignore. It furthermore provided integration test for the different options and one for a "normal" startup.

Checklist

  • No AI generated code was used in this PR

Related issues

resolves #10694

@telegraf-tiger telegraf-tiger bot added feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin plugin/input 1. Request for new input plugins 2. Issues/PRs that are related to input plugins labels Jun 11, 2024
@srebhan srebhan self-assigned this Jun 11, 2024
@srebhan srebhan added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Jun 13, 2024
@srebhan srebhan assigned powersj and DStrand1 and unassigned srebhan Jun 13, 2024
@da-phil
Copy link

da-phil commented Jun 13, 2024

Cool feature, looks like it could fix the issue I'm experiencing, which I described here:
https://community.influxdata.com/t/mqtt-consumer-connection-timeout-setting-not-working/34455/1

@da-phil
Copy link

da-phil commented Jun 13, 2024

Cool feature, looks like it could fix the issue I'm experiencing, which I described here: https://community.influxdata.com/t/mqtt-consumer-connection-timeout-setting-not-working/34455/1

I just did a quick test with the build artifacts above and additionally setting

  keep_alive = "5m"
  ping_timeout = "50s"

but I'm still getting the following log msgs:

[inputs.mqtt_consumer] Error in plugin: connection lost: EOF
[inputs.mqtt_consumer]  Disconnected [mqtt://127.0.0.1:1883]
[outputs.influxdb_v2]  Wrote batch of 4 metrics in 8.874236ms
[outputs.influxdb_v2]  Buffer fullness: 0 / 10000 metrics
[inputs.mqtt_consumer]  Connecting [mqtt://127.0.0.1:1883]
[inputs.mqtt_consumer] Error in plugin: network Error : dial tcp 127.0.0.1:1883: connect: connection refused

How are the changes in this PR supposed to change the behaviour in case the connection to the MQTT broker gets interrupted for a while?

@srebhan
Copy link
Member Author

srebhan commented Jun 13, 2024

@da-phil the initial connection is influenced by the startup_error_behavior setting. For reconnection, the connection loss needs to be detected first. This is done by sending keep-alive messages and check if they are answered. In your configuration, a keep-alive is sent every 5 minutes and then the server has 50 seconds to answer that ping, so in total the connection loss can only be detected after around 6 minutes... I'm pretty sure that's not what you intent...

supports options for specifying the behavior when experiencing startup errors
using the `startup_error_behavior` setting. Available values are:

- `error`: Telegraf with stop and exit in case of startup errors. This is the
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- `error`: Telegraf with stop and exit in case of startup errors. This is the
- `error`: Telegraf will stop and exit in case of startup errors. This is the

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@da-phil could you please put up a PR against docs/includes/startup_error_behavior.md as this is included here and run make docs in the telegraf root dir!?

Comment on lines 33 to 39
## Interval and ping timeout for keep-alive messages
## The sum of those options defines when a connection loss is detected.
## Note: The keep-alive interval needs to be in second granularity e.g. 1m
## but not 100ms.
# keep_alive = "60s"
# ping_timeout = "10s"

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think startup_error_behavior should be also mentioned in this example config.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I don't think so. We do not mention global options but rather link to them in the README. This is to not run into cases where we do change/add global settings and then have to check and update all 300+ plugins...

@da-phil
Copy link

da-phil commented Jun 14, 2024

@da-phil the initial connection is influenced by the startup_error_behavior setting. For reconnection, the connection loss needs to be detected first. This is done by sending keep-alive messages and check if they are answered. In your configuration, a keep-alive is sent every 5 minutes and then the server has 50 seconds to answer that ping, so in total the connection loss can only be detected after around 6 minutes... I'm pretty sure that's not what you intent...

@srebhan gotcha, so I set startup_error_behavior = "retry" now to make sure the plug-in will keep trying to connect no matter what. But this did not change the behavior I'm expecting.

Let me again explain the issue I'm facing:
In a nutshell I want to make sure that persisting MQTT message traffic in an influxdb database is as robust as possible.
Therefore my IoT nodes start to queue their MQTT messages until the network or MQTT broker is available again.
Now some MQTT clients are very fast to re-establish their connection to the MQTT broker once the connection is availale again, but Telegraf needs at least 6-7s until it re-connects to the MQTT broker and might have missed a couple of messages which were pushed out of the queues of the IoT nodes in the meantime.
In case of the mosquitto_sub client, it never misses any message and re-connects immediately. How can I do that with Telegraf? Does this new feature make it possible?
I'm not super familiar with network sockets, but somehow I have the feeling that mosquitto_sub uses a blocking socket.

@srebhan
Copy link
Member Author

srebhan commented Jun 17, 2024

@da-phil could we please get this PR merged first and not overload it with additional requests? Unfortunately this PR touches too much code already... Please open an issue stating what you need and I can take a look after this is merged. I guess we need to experiment with the options and e.g. use auto-reconnect...

@da-phil
Copy link

da-phil commented Jun 17, 2024

@srebhan sure, I'm not blocking the PR, just wanted to give some feedback to behaviour which might be related to the changes in this PR.

Then you guys can merge the PR and I'll create an issue once I'm back from vacation.

@powersj powersj removed their assignment Jun 17, 2024
@telegraf-tiger
Copy link
Contributor

Download PR build artifacts for linux_amd64.tar.gz, darwin_arm64.tar.gz, and windows_amd64.zip.
Downloads for additional architectures and packages are available below.

🥳 This pull request decreases the Telegraf binary size by -1.55 % for linux amd64 (new size: 239.8 MB, nightly size 243.5 MB)

📦 Click here to get additional PR build artifacts

Artifact URLs

DEB RPM TAR GZ ZIP
amd64.deb aarch64.rpm darwin_amd64.tar.gz windows_amd64.zip
arm64.deb armel.rpm darwin_arm64.tar.gz windows_arm64.zip
armel.deb armv6hl.rpm freebsd_amd64.tar.gz windows_i386.zip
armhf.deb i386.rpm freebsd_armv7.tar.gz
i386.deb ppc64le.rpm freebsd_i386.tar.gz
mips.deb riscv64.rpm linux_amd64.tar.gz
mipsel.deb s390x.rpm linux_arm64.tar.gz
ppc64el.deb x86_64.rpm linux_armel.tar.gz
riscv64.deb linux_armhf.tar.gz
s390x.deb linux_i386.tar.gz
linux_mips.tar.gz
linux_mipsel.tar.gz
linux_ppc64le.tar.gz
linux_riscv64.tar.gz
linux_s390x.tar.gz

@DStrand1 DStrand1 merged commit 986b385 into influxdata:master Jun 17, 2024
27 checks passed
@github-actions github-actions bot added this to the v1.32.0 milestone Jun 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/mqtt feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin plugin/input 1. Request for new input plugins 2. Issues/PRs that are related to input plugins ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

inputs.mqtt/mqtt_consumer: allow connection errors on start
4 participants