
Potential fix for #12427 (#12616)

Closed
wants to merge 3 commits

Conversation

TheMysteriousX
Contributor

When testing, check that:

  • the process restarts overnight with no polling gaps
  • the main process does not run at 100% load
  • there are no 'defunct' processes

If anyone is running a poller on Windows, macOS or BSD, it'd be interesting to see if this is portable.

Please note

Please read this information carefully. You can run ./lnms dev:check to check your code before submitting.

  • Have you followed our code guidelines?
  • If my pull request changes, fixes, or enhances anything in the WebUI, I have included a screenshot of it.

Testers

If you would like to test this pull request then please run: ./scripts/github-apply <pr_id>, e.g. ./scripts/github-apply 5926
After you are done testing, you can remove the changes with ./scripts/github-remove. If there are schema changes, you can ask on Discord how to revert them.

@TheMysteriousX
Contributor Author

s/fix/workaround/, as this is possibly a runtime bug.

@rkojedzinszky
Contributor

I did not check the code thoroughly; is this the only place where children are created? If so, then perhaps there is no need to handle SIGCHLD, as the subprocess module should clean up everything after a process exits. If there are other places where children can be created, then a reaping process is indeed needed.
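For context, the reaping being discussed is usually just a SIGCHLD handler that loops over waitpid() with WNOHANG; a minimal sketch (not code from this PR):

```python
import os
import signal

def reap_children(signum, frame):
    # Collect the exit status of any finished children so they do not
    # linger as zombie ('defunct') processes.
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children at all
        if pid == 0:
            break  # children exist, but none have exited yet

signal.signal(signal.SIGCHLD, reap_children)
```

The catch is that a blanket waitpid(-1, ...) handler can also collect exit statuses that subprocess itself expects to wait on, which is part of why this interaction is delicate.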

@TheMysteriousX
Contributor Author

If SIGCHLD is unhandled, zombie processes are created when the service reloads.

@murrant
Member

murrant commented Mar 15, 2021

@rkojedzinszky @TheMysteriousX any consensus on a fix? I am not knowledgeable enough to judge.

@TheMysteriousX
Contributor Author

TheMysteriousX commented Mar 15, 2021

After finding this documentation bug (https://bugs.python.org/issue32985) I've cracked the issue that required the detailed signal handling in the first place.

When forking off the subprocess, restore_signals (a default parameter) restores only the signals that the interpreter itself modified, i.e. SIGINT and SIGTERM, but not SIGCHLD.

This results in signals being received for those children; I used psutil to filter them out. With the commit I'm about to push, subprocesses should be fully isolated, all deadlocks should be eliminated without the need for the custom reap method, and the process should restart cleanly with no zombies left around from previous invocations.

Unfortunately, the fix is only good until 3.13 at the latest, as per https://bugs.python.org/issue38435, but it seems the Python devs are aware of the issue.
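For illustration, the kind of isolation described above (explicitly resetting SIGCHLD in the forked child just before exec and detaching it into its own session) might look roughly like this; a sketch with an assumed command line, not the actual commit:

```python
import signal
import subprocess

def _reset_sigchld():
    # Runs in the forked child just before exec(); restore_signals does not
    # cover SIGCHLD here, so put it back to its default disposition by hand.
    signal.signal(signal.SIGCHLD, signal.SIG_DFL)

proc = subprocess.Popen(
    ["/opt/librenms/poller.php", "-h", "1"],  # illustrative command only
    preexec_fn=_reset_sigchld,  # the mechanism bpo-38435 plans to deprecate
    start_new_session=True,     # run the child in its own session (setsid)
)
```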

@rkojedzinszky
Contributor

If SIGCHLD is unhandled, zombie processes are created when the service reloads.

Then perhaps the reload logic should be refactored. It is usually not recommended to mix threads and exec() calls.

Contributor

@rkojedzinszky rkojedzinszky left a comment


Can you provide test scripts for these? I suspect that resetting signals affects the child processes, not their parents, and does not prevent the parents from receiving SIGCHLD when a child exits.
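A minimal test of that claim (written for illustration, not part of this PR) could install a SIGCHLD handler in the parent, spawn a child with the default restore_signals, and check whether the parent is still notified:

```python
import signal
import subprocess
import time

got_sigchld = False

def on_sigchld(signum, frame):
    global got_sigchld
    got_sigchld = True

signal.signal(signal.SIGCHLD, on_sigchld)

child = subprocess.Popen(
    ["sleep", "1"],
    restore_signals=True,   # the default; only affects the child's dispositions
    start_new_session=True,
)
child.wait()
time.sleep(0.1)  # give the handler a moment to fire
print("parent received SIGCHLD:", got_sigchld)
```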

@rkojedzinszky
Contributor

After finding this documentation bug (https://bugs.python.org/issue32985) I've cracked the issue that required the detailed signal handling in the first place.

When forking off the subprocess, restore_signals (a default parameter) restores only the signals that the interpreter itself modified, i.e. SIGINT and SIGTERM, but not SIGCHLD.

This results in signals being received for those children; I used psutil to filter them out. With the commit I'm about to push, subprocesses should be fully isolated, all deadlocks should be eliminated without the need for the custom reap method, and the process should restart cleanly with no zombies left around from previous invocations.

I still don't understand what kind of deadlocks may arise.

Unfortunately, the fix is only good until 3.13 at the latest, as per https://bugs.python.org/issue38435, but it seems the Python devs are aware of the issue.

@murrant
Member

murrant commented Mar 15, 2021

Hmm, I have a thought: what if we just let systemd restart it? However, this does mean that if you run the script by hand, it will just exit when maintenance runs.

@TheMysteriousX
Contributor Author

TheMysteriousX commented Mar 15, 2021

We could strip out even more code.

The initial implementation of the service was simply restarted by the process supervisor, but that was deemed undesirable as it would abort polls/discoveries in flight.

It'd just be a case of stripping out the remaining signal code, removing the args from check_call that prevent signal propagation, and removing parts of the restart action and any subsequent dead code, if you want to do that instead.
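If that route were taken, the invocation would presumably reduce to something like this (command line assumed; not the current service code):

```python
import subprocess

# Let the supervisor own restarts: run the job with no preexec_fn and no
# custom signal plumbing, keeping only the session detach from this changeset.
subprocess.check_call(
    ["/opt/librenms/poller.php", "-h", "1"],  # illustrative command only
    start_new_session=True,
)
```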

@rkojedzinszky
Contributor

At least in the short term it would be nice to merge a solution. I can say that we are affected, and it seems there are other installations too where pollers stop randomly. My really short patch seems to solve the issue, but of course, in the long term, exec() should be avoided and a process manager should be left to do this job.

@murrant
Member

murrant commented Mar 18, 2021

Merged #12593 for the release. I think we still need to look at this.

@murrant
Member

murrant commented Mar 19, 2021

My thoughts:

  1. Move poller finishing into PHP code (lock release, etc.)
  2. Exit the parent process without waiting
  3. Children are orphaned onto pid 1
  4. The supervisor restarts the Python process and starts queuing jobs (or another master is still queuing jobs)
  5. Locks are not expired for polling processes that have not exited, so new jobs won't be queued
  6. Orphans finish polling (updating last_polled), release their locks, and exit (pid 1 reaps the orphans)
  7. Jobs are queued normally at the next interval
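A rough sketch of steps 2 and 3 (exit the parent without waiting, so running pollers are re-parented to pid 1), with assumed names:

```python
import subprocess
import sys

# Spawn the pollers detached, then exit without wait(); the still-running
# children are adopted and eventually reaped by pid 1, and they finish
# polling and release their locks on their own.
jobs = [
    subprocess.Popen(
        ["/opt/librenms/poller.php", "-h", str(device_id)],  # illustrative
        start_new_session=True,
    )
    for device_id in (101, 102, 103)  # hypothetical device ids
]
sys.exit(0)
```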

@TheMysteriousX
Contributor Author

TheMysteriousX commented Mar 19, 2021

That'd work; you'll need to add 'KillMode=process' to the unit file, otherwise systemd cleans up the entire cgroup regardless of parentage. I think it should also work with other inits; Docker may need some thought/changes, as I don't know how the poller has been implemented there (does it even use the service?).

You can either implement the restart by relying on Restart=always in the unit and just having the poller terminate itself as it does now (not portable to other inits), or by using a timer (still not portable, but timers can be replaced with cron scripts).

You'll still need 'start_new_session=True' from this changeset, and to remove the same blocks of code (plus a bit more).
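For reference, the unit-file side of that suggestion is just a couple of directives along these lines (illustrative excerpt, not the shipped unit):

```ini
[Service]
ExecStart=/opt/librenms/librenms-service.py
# Stop only the main process; leave in-flight pollers running in the cgroup.
KillMode=process
# Have systemd bring the service back up after the poller terminates itself.
Restart=always
```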

@rkojedzinszky
Contributor

My thoughts:

  1. Move poller finishing into PHP code (lock release, etc.)
  2. Exit the parent process without waiting
  3. Children are orphaned onto pid 1
  4. The supervisor restarts the Python process and starts queuing jobs (or another master is still queuing jobs)
  5. Locks are not expired for polling processes that have not exited, so new jobs won't be queued
  6. Orphans finish polling (updating last_polled), release their locks, and exit (pid 1 reaps the orphans)
  7. Jobs are queued normally at the next interval

Is this all needed because of some automatic update mechanism? What if the actual update process is left to the user?

@murrant
Member

murrant commented Apr 6, 2021

@rkojedzinszky that is already possible. Auto update is optional, but I would say a very large percentage of users have auto updates enabled.

@Jellyfrog
Member

First of all, thank you for your contribution!

We're closing this due to inactivity; feel free to reopen if you find the time to continue working on it!

@Jellyfrog Jellyfrog closed this Nov 2, 2021