Potential fix for #12427 #12616
Conversation
s/fix/workaround/, as this is possibly a runtime bug.
I have not checked the whole code; is that the only place where children are created? If so, perhaps there is no need to handle SIGCHLD, as the subprocess module should clean up everything after a process exits. If there are other places where children can be created, then a reaping process is indeed needed.
If SIGCHLD is unhandled, zombie processes are created when the service reloads.
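For illustration, here is a minimal sketch (not code from this PR) of the reaping approach being discussed: a SIGCHLD handler that calls waitpid in a loop so exited children do not linger as zombies. It assumes a POSIX platform with os.fork available.

```python
import os
import signal
import time

def reap_children(signum, frame):
    """Reap any exited children so they don't remain as zombies."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left at all
        if pid == 0:
            break  # children exist, but none have exited yet

signal.signal(signal.SIGCHLD, reap_children)

# Demo: fork a child that exits immediately; the handler reaps it,
# so no <defunct> entry is left behind.
pid = os.fork()
if pid == 0:
    os._exit(0)
time.sleep(0.2)  # give SIGCHLD time to be delivered and handled
```

Without the handler (or an explicit wait), the child above would stay in the process table as a zombie until the parent exits.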
@rkojedzinszky @TheMysteriousX any consensus on a fix? I am not knowledgeable enough to judge.
After finding this doc bug (https://bugs.python.org/issue32985) I've cracked the issue that required the detailed signal handling in the first place. When forking off the subprocess, signals are received for its children as well; I used psutil to filter these out. With the commit I'm about to push, subprocesses should be fully and completely isolated, all deadlocks eliminated without the need for the custom reap method, and the process should restart cleanly with no zombies left around from previous invocations. Unfortunately, the fix is only good until Python 3.13 at the latest, as per https://bugs.python.org/issue38435, but it seems the Python devs are aware of the issue.
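The isolation described above can be sketched with subprocess's start_new_session flag. This is a hedged illustration, not the PR's actual call sites; the command and arguments are placeholders.

```python
import subprocess

# start_new_session=True makes the child call setsid() after fork(),
# placing it in a new session and process group.  Terminal-generated
# signals (e.g. Ctrl-C -> SIGINT to the foreground group) sent to the
# parent's group then never reach the child, and vice versa.
result = subprocess.run(
    ["echo", "poller"],       # placeholder command
    start_new_session=True,
    capture_output=True,
    text=True,
)
print(result.stdout.strip())
```

The flag is available on POSIX systems in Python 3.2+.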
Then perhaps the reload logic should be refactored. It is usually not recommended to mix threads and exec() calls.
Can you provide test scripts for these? I suspect that resetting signals affects the child processes, not their parents, and does not prevent the parents from receiving SIGCHLD when a child exits. I still don't understand what kind of deadlocks may arise.
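The suspicion above can be checked with a short script. This is a sketch assuming CPython's documented restore_signals behaviour, not a test script from the PR:

```python
import signal
import subprocess

# Give the parent a non-default SIGCHLD disposition.
signal.signal(signal.SIGCHLD, signal.SIG_IGN)

# restore_signals=True (the default) resets ignored signals back to
# SIG_DFL in the *child* only, between fork() and exec(); the parent's
# own signal dispositions are left untouched.
subprocess.run(["true"], restore_signals=True)

# The parent still has the disposition it set before spawning.
parent_disposition = signal.getsignal(signal.SIGCHLD)
```

In other words, restore_signals only sanitises the child's inherited dispositions; it has no effect on whether the parent receives SIGCHLD.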
Hmm, I have a thought: what if we just let systemd restart it? However, this does mean that if you run the script by hand, it will just exit when maintenance runs.
We could strip out even more code. The initial implementation of the service just got restarted by the process supervisor, but that was decided to be undesirable as it'd abort polls/discovers in flight. It'd just be a case of stripping out the remaining signal code, removing the args from check_call that prevent signal propagation, and removing parts of the restart action and any subsequent dead code, if you want to do that instead.
At least in the short term, it would be nice to have a solution merged. I can say we are affected, and it seems there are other installations too where pollers stop randomly. My really short patch seems to solve the issue, but of course, in the long term, exec() should be avoided and some process manager should do this job.
Merged #12593 for the release. I think we still need to look at this.
My thoughts:
That'd work, but you'll need to add 'KillMode=process' to the unit file, otherwise systemd cleans up the entire cgroup regardless of parentage. I think it should also work with other inits; Docker may need some thought/changes, as I don't know how the poller has been implemented there (does it even use the service?). You can either implement the restart by relying on Restart=always in the unit and just have the poller terminate itself as it does now (not portable to other inits), or by using a timer (still not portable, but timers can be replaced with cron scripts). You'll still need 'start_new_session=True' from this changeset, and to remove the same blocks of code (plus a bit more).
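As a hedged sketch of the unit-file changes suggested above (the service name, paths, and ExecStart line are assumptions for illustration, not taken from the repository):

```ini
# librenms-poller.service -- illustrative only
[Unit]
Description=LibreNMS poller service

[Service]
ExecStart=/opt/librenms/librenms-service.py
# Kill only the main process on stop/restart; forked pollers in the
# same cgroup are left to finish instead of being killed wholesale.
KillMode=process
# Let systemd handle restarts rather than the poller re-exec()ing itself.
Restart=always

[Install]
WantedBy=multi-user.target
```

With Restart=always, the poller can simply exit when maintenance runs and systemd will bring it back up.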
Is this all needed because of some automatic update mechanism? What if the actual update process is left to the user?
@rkojedzinszky that is already possible. Auto update is optional, but I would say a very large percentage of users have auto updates enabled.
First of all, thank you for your contribution! We're closing this due to inactivity; feel free to reopen if you find the time to continue on this!
When testing, check that:
If anyone is running a poller on Windows, macOS or BSD, it'd be interesting to see if this is portable.
Please note
Testers
If you would like to test this pull request then please run: ./scripts/github-apply <pr_id>, e.g. ./scripts/github-apply 5926
After you are done testing, you can remove the changes with ./scripts/github-remove. If there are schema changes, you can ask on Discord how to revert.