
Potential fix for #12427 (#12616)

Closed
wants to merge 3 commits

Conversation

TheMysteriousX
Contributor

When testing, check that:

  • the process restarts overnight with no polling gaps
  • the main process does not run at 100% load
  • there are no 'defunct' processes

If anyone is running a poller on Windows, macOS or BSD, it'd be interesting to see if this is portable.

Please note

Please read this information carefully. You can run ./lnms dev:check to check your code before submitting.

  • Have you followed our code guidelines?
  • If my pull request changes, fixes, or enhances anything in the WebUI, I have included a screenshot of it.

Testers

If you would like to test this pull request then please run: ./scripts/github-apply <pr_id>, e.g. ./scripts/github-apply 5926
After you are done testing, you can remove the changes with ./scripts/github-remove. If there are schema changes, you can ask on Discord how to revert them.

@TheMysteriousX
Contributor Author

s/fix/workaround/, as this is possibly a runtime bug.

@rkojedzinszky
Contributor

I did not check the code thoroughly; is this the only place where children are created? If so, then perhaps there is no need to handle SIGCHLD, as the subprocess module should clean up everything after a process exits. If there are other places where children can be created, then a reaping process is indeed needed.
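For context, the reaping being discussed is usually just a SIGCHLD handler that loops over waitpid() with WNOHANG; a minimal sketch (not code from this PR):

```python
import os
import signal

def reap_children(signum, frame):
    # Collect the exit status of any finished children so they do not
    # linger as zombie ('defunct') processes.
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children at all
        if pid == 0:
            break  # children exist, but none have exited yet

signal.signal(signal.SIGCHLD, reap_children)
```

The catch is that a blanket waitpid(-1, ...) handler can also collect exit statuses that subprocess itself expects to wait on, which is part of why this interaction is delicate.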

@TheMysteriousX
Contributor Author

If SIGCHLD is unhandled, zombie processes are created when the service reloads.

@murrant
Member

murrant commented Mar 15, 2021

@rkojedzinszky @TheMysteriousX any consensus on a fix? I am not knowledgeable enough to judge.

@TheMysteriousX
Contributor Author

TheMysteriousX commented Mar 15, 2021

After finding this documentation bug (https://bugs.python.org/issue32985) I've cracked the issue that required the detailed signal handling in the first place.

When forking off the subprocess, restore_signals (a default parameter) restores only the signals that the interpreter itself modified, i.e. SIGINT and SIGTERM, but not SIGCHLD.

This results in signals being received for those children; I used psutil to filter them out. With the commit I'm about to push, subprocesses should be fully isolated, all deadlocks should be eliminated without the need for the custom reap method, and the process should restart cleanly with no zombies left around from previous invocations.

Unfortunately, the fix is only good until 3.13 at the latest, as per https://bugs.python.org/issue38435, but it seems the Python devs are aware of the issue.
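For illustration, the kind of isolation described above (explicitly resetting SIGCHLD in the forked child just before exec and detaching it into its own session) might look roughly like this; a sketch with an assumed command line, not the actual commit:

```python
import signal
import subprocess

def _reset_sigchld():
    # Runs in the forked child just before exec(); restore_signals does not
    # cover SIGCHLD here, so put it back to its default disposition by hand.
    signal.signal(signal.SIGCHLD, signal.SIG_DFL)

proc = subprocess.Popen(
    ["/opt/librenms/poller.php", "-h", "1"],  # illustrative command only
    preexec_fn=_reset_sigchld,  # the mechanism bpo-38435 plans to deprecate
    start_new_session=True,     # run the child in its own session (setsid)
)
```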

@rkojedzinszky
Contributor

If SIGCHLD is unhandled, zombie processes are created when the service reloads.

Then perhaps the reload logic should be refactored. It is usually not recommended to mix threads and exec() calls.

Contributor

@rkojedzinszky rkojedzinszky left a comment


Can you provide test scripts for these? I suspect that resetting signals affects the child processes, not their parents, and does not prevent the parents from receiving SIGCHLD when a child exits.
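A minimal test of that claim (written for illustration, not part of this PR) could install a SIGCHLD handler in the parent, spawn a child with the default restore_signals, and check whether the parent is still notified:

```python
import signal
import subprocess
import time

got_sigchld = False

def on_sigchld(signum, frame):
    global got_sigchld
    got_sigchld = True

signal.signal(signal.SIGCHLD, on_sigchld)

child = subprocess.Popen(
    ["sleep", "1"],
    restore_signals=True,   # the default; only affects the child's dispositions
    start_new_session=True,
)
child.wait()
time.sleep(0.1)  # give the handler a moment to fire
print("parent received SIGCHLD:", got_sigchld)
```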

@rkojedzinszky
Contributor

After finding this documentation bug (https://bugs.python.org/issue32985) I've cracked the issue that required the detailed signal handling in the first place.

When forking off the subprocess, restore_signals (a default parameter) restores only the signals that the interpreter itself modified, i.e. SIGINT and SIGTERM, but not SIGCHLD.

This results in signals being received for those children; I used psutil to filter them out. With the commit I'm about to push, subprocesses should be fully isolated, all deadlocks should be eliminated without the need for the custom reap method, and the process should restart cleanly with no zombies left around from previous invocations.

I still don't understand what kind of deadlocks may arise.

Unfortunately, the fix is only good until 3.13 at the latest, as per https://bugs.python.org/issue38435, but it seems the Python devs are aware of the issue.

@murrant
Member

murrant commented Mar 15, 2021

Hmm, I have a thought: what if we just let systemd restart it? However, this does mean that if you run the script by hand, it will just exit when maintenance runs.

@TheMysteriousX
Contributor Author

TheMysteriousX commented Mar 15, 2021

We could strip out even more code.

The initial implementation of the service was simply restarted by the process supervisor, but that was deemed undesirable as it would abort polls/discoveries in flight.

It'd just be a case of stripping out the remaining signal code, removing the args from check_call that prevent signal propagation, and removing parts of the restart action and any subsequent dead code, if you want to do that instead.
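If that route were taken, the invocation would presumably reduce to something like this (command line assumed; not the current service code):

```python
import subprocess

# Let the supervisor own restarts: run the job with no preexec_fn and no
# custom signal plumbing, keeping only the session detach from this changeset.
subprocess.check_call(
    ["/opt/librenms/poller.php", "-h", "1"],  # illustrative command only
    start_new_session=True,
)
```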

@rkojedzinszky
Contributor

At least in the short term it would be nice to merge a solution. I can say that we are affected, and it seems there are other installations too where pollers stop randomly. My really short patch seems to solve the issue, but of course, in the long term, exec() should be avoided and a process manager should be left to do this job.

@murrant
Member

murrant commented Mar 18, 2021

Merged #12593 for the release. I think we still need to look at this.

@murrant
Member

murrant commented Mar 19, 2021

My thoughts:

  1. Move poller finishing into PHP code (lock release, etc.)
  2. Exit the parent process without waiting
  3. Children are orphaned onto pid 1
  4. The supervisor restarts the Python process and starts queuing jobs (or another master is still queuing jobs)
  5. Locks are not expired for polling processes that have not exited, so new jobs won't be queued
  6. Orphans finish polling (updating last_polled), release their locks, and exit (pid 1 reaps the orphans)
  7. Jobs are queued normally at the next interval
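A rough sketch of steps 2 and 3 (exit the parent without waiting, so running pollers are re-parented to pid 1), with assumed names:

```python
import subprocess
import sys

# Spawn the pollers detached, then exit without wait(); the still-running
# children are adopted and eventually reaped by pid 1, and they finish
# polling and release their locks on their own.
jobs = [
    subprocess.Popen(
        ["/opt/librenms/poller.php", "-h", str(device_id)],  # illustrative
        start_new_session=True,
    )
    for device_id in (101, 102, 103)  # hypothetical device ids
]
sys.exit(0)
```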

@TheMysteriousX
Contributor Author

TheMysteriousX commented Mar 19, 2021

That'd work; you'll need to add 'KillMode=process' to the unit file, otherwise systemd cleans up the entire cgroup regardless of parentage. I think it should also work with other inits; Docker may need some thought/changes, as I don't know how the poller has been implemented there (does it even use the service?).

You can either implement the restart by relying on Restart=always in the unit and just having the poller terminate itself as it does now (not portable to other inits), or by using a timer (still not portable, but timers can be replaced with cron scripts).

You'll still need 'start_new_session=True' from this changeset, and to remove the same blocks of code (plus a bit more).
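For reference, the unit-file side of that suggestion is just a couple of directives along these lines (illustrative excerpt, not the shipped unit):

```ini
[Service]
ExecStart=/opt/librenms/librenms-service.py
# Stop only the main process; leave in-flight pollers running in the cgroup.
KillMode=process
# Have systemd bring the service back up after the poller terminates itself.
Restart=always
```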

@rkojedzinszky
Contributor

My thoughts:

  1. Move poller finishing into PHP code (lock release, etc.)
  2. Exit the parent process without waiting
  3. Children are orphaned onto pid 1
  4. The supervisor restarts the Python process and starts queuing jobs (or another master is still queuing jobs)
  5. Locks are not expired for polling processes that have not exited, so new jobs won't be queued
  6. Orphans finish polling (updating last_polled), release their locks, and exit (pid 1 reaps the orphans)
  7. Jobs are queued normally at the next interval

Is this all needed because of some automatic update mechanism? What if the actual update process is left to the user?

@murrant
Member

murrant commented Apr 6, 2021

@rkojedzinszky that is already possible. Auto update is optional, but I would say a very large percentage of users have auto updates enabled.

@Jellyfrog
Member

First of all, thank you for your contribution!

We're closing this due to inactivity; feel free to reopen if you find the time to continue working on it!

@Jellyfrog Jellyfrog closed this Nov 2, 2021