fix(service): Change systemd's KillMode to "mixed" #13849

Merged
merged 1 commit into from
Sep 5, 2023

Contributor

@knollet knollet commented Aug 31, 2023

Currently, all processes created by Telegraf are killed as soon as the service is stopped. This triggers errors in Telegraf because its child processes die unexpectedly. Changing the KillMode to mixed gives Telegraf a chance to shut down its child processes cleanly, while keeping the guarantee of a clean exit if it fails to do so.

fixes #13842
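
For readers unfamiliar with the directive, here is a minimal sketch of the `[Service]` settings this change is about. It is not the actual contents of Telegraf's unit file, which is not reproduced in this thread; only the two directives named in the conversation are shown, and the comments paraphrase the systemd documentation.

```ini
# Illustrative fragment only, not the full telegraf.service shipped by the project.
[Service]
# Previous behavior (KillMode=control-group): on stop, SIGTERM is sent to every
# process in the unit's control group, including Telegraf's child processes.
# With "mixed", SIGTERM goes only to the main process; any processes still
# alive once the stop timeout expires receive SIGKILL.
KillMode=mixed
# Stop timeout set by this change, as discussed further down in the thread.
TimeoutStopSec=5
```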

@telegraf-tiger telegraf-tiger bot added the fix pr to fix corresponding bug label Aug 31, 2023
@srebhan srebhan changed the title fix: change the systemd KillMode for telegraf from control-group to m… fix(service): Change systemd's KillMode to "mixed" Aug 31, 2023
Contributor

@srebhan srebhan left a comment

Looks good to me. Thanks for the quick fix @knollet!

@srebhan srebhan added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Aug 31, 2023
Contributor

@powersj powersj left a comment

Thank you!

@powersj powersj added this to the v1.28.0 milestone Sep 5, 2023
@powersj powersj merged commit b39ea2e into influxdata:master Sep 5, 2023
25 checks passed
athornton pushed a commit to lsst-sqre/telegraf that referenced this pull request Sep 8, 2023
Contributor

Hipska commented Apr 16, 2024

I was wondering for a while why telegraf wasn't shutting down gracefully anymore.

It appears to be because of the added TimeoutStopSec parameter, which gives telegraf only 5 seconds to shut down. This is problematic with inputs that take longer than that to complete their interval (e.g. snmp against slow or very big machines can take more than a minute). So if such an input is still running when telegraf is reloaded or stopped, systemd gives telegraf only 5 seconds to finish gathering and flush the outputs, and then kills it, taking all unsent metrics with it.

What is the reason behind this 5s value?

Contributor Author

knollet commented Apr 16, 2024

I am good with a larger value.

Contributor

Hipska commented Apr 16, 2024

Is the systemd default not good enough? Why was the directive added?

Contributor Author

knollet commented Apr 16, 2024

KillMode=mixed was important to me, else systemd kills telegraf's subprocesses.
The systemd documentation references the TimeoutStopSec as the timeout after which it, again, goes about killing telegraf's subprocesses, so I wanted to define a value.
But I think that can be omitted. The default seems to be 90s.

Contributor

powersj commented Apr 16, 2024

@knollet was there a reason you set TimeoutStopSec=5 rather than keeping the default? Would you be good with us removing that option and keeping the default?

Contributor Author

knollet commented Apr 16, 2024

I am good with the default, yes. I added it because the documentation referenced it. But having a closer look, it seems not to be necessary to explicitly set it.
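
Until the packaged unit changes, a drop-in override is one possible local workaround. This is a sketch under assumptions: the unit name `telegraf.service`, the drop-in path, and the 120-second value are illustrative and not taken from this thread.

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/telegraf.service.d/override.conf
# (path and unit name are assumptions; adjust to the local installation).
[Service]
# Give slow inputs (such as the long snmp gathers mentioned above) time to
# finish and flush before systemd escalates to SIGKILL. 120 is an arbitrary
# example value; the systemd default mentioned in this thread is 90s.
TimeoutStopSec=120
```

`systemctl edit telegraf` is one way to create such a drop-in; if the file is written by hand, `systemctl daemon-reload` is needed for it to take effect.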

Labels
fix pr to fix corresponding bug ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Systemd seems to be causing telegraf to log errors on restart, but should not.
4 participants