Catch orphaned "running" jobs from MIA workers #31

Open
dd32 opened this issue Dec 7, 2016 · 6 comments

Comments

@dd32
Contributor

dd32 commented Dec 7, 2016

Similar to #18, we ran into an issue on WordPress.org where one of the Cavalcade daemons was killed unexpectedly, which left a bunch of jobs in an unknown state.

The jobs were marked as running, but there were no workers to manage those jobs anymore. This resulted in jobs not running for a few hours/days until they were detected and restarted.

There should be some way to detect that a job is no longer running, or that its daemon is MIA.

In this case, I simply restarted the jobs: UPDATE wp_cavalcade_jobs SET status = 'waiting' WHERE status = 'running' AND nextrun <= '2016-12-06'

@larssn

larssn commented Dec 7, 2016

A daemon being killed off while running jobs is especially relevant in horizontally scaled setups. If a node is taken offline, then the jobs can get stuck in this state, and have to be restarted manually.

@rmccue
Member

rmccue commented Dec 7, 2016

@larssn Indeed, we've been working on solving this ourselves. It's tough to come up with a solution to it. Mostly, we've been focusing on ensuring the daemon safely shuts down the workers.

If anyone has better ideas, all ears. :)

@willmot
Member

willmot commented Dec 7, 2016

Could you make use of the flock side effect of clearing file locks when the PHP process dies? We've used this on BackUpWordPress (xibodevelopment/backupwordpress#1025) to detect when a long-running process has crashed or been killed, so we can update its status accordingly rather than having it forever show (incorrectly) as running.
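
A minimal sketch of that flock idea, written in Python rather than PHP purely for illustration. The function names and the lock-file path are hypothetical, and it assumes a POSIX host with a local filesystem. The worker takes an exclusive lock and keeps the handle open; a checker that can acquire the same lock knows the holder has exited, because the OS releases the lock when the holding process dies.

```python
import fcntl
import os
import tempfile


def hold_lock(path):
    """Worker side: take an exclusive, non-blocking lock and return the open handle.

    The caller must keep the handle open for as long as the job runs.
    The OS releases the lock when the handle is closed or the process dies.
    """
    handle = open(path, "w")
    fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    return handle


def holder_is_gone(path):
    """Checker side: return True if nobody currently holds the lock."""
    with open(path, "a") as probe:
        try:
            fcntl.flock(probe, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False  # someone still holds the lock, so the job is alive
        fcntl.flock(probe, fcntl.LOCK_UN)
        return True
```

For example, a checker could call `holder_is_gone("/tmp/job-123.lock")` (a made-up path) before resetting a job that is still marked as running.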

@rmccue
Member

rmccue commented Dec 7, 2016

Unfortunately not, due to the horizontal distribution. IIRC NFS doesn't support file locks either.

@dd32
Contributor Author

dd32 commented Dec 8, 2016

The "correct" approach here for horizontally distributed apps would be for each daemon to have a DB row listing its status and where it's running from, with a timestamp bumped every minute or so.
Jobs would then have to be listed as status = running, server = ID#4.

Other daemons would need to periodically check the table to see if any other daemons had started, accepted jobs, and gone away without marking themselves as shut down, and then initiate a cleanup.

It'd probably need to be implemented as a wp-cli job that each daemon fires off every ~5 minutes to perform the health checks (i.e. it can't be a listed job; it'd have to be a custom thing, as you want it on each server).

@rmccue
Member

rmccue commented Mar 27, 2024

See humanmade/Cavalcade-Runner#75 for a solution for system shutdown, where we will pass SIGTERM to the workers and SIGKILL if they fail to respond.
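
As a rough illustration of that SIGTERM-then-SIGKILL escalation (this is not the code from the linked pull request; the function name and the grace period are assumptions, and it is written in Python for a POSIX system):

```python
import signal
import subprocess
import sys


def stop_worker(proc, grace_seconds=10.0):
    """Ask a worker to exit with SIGTERM, then force it with SIGKILL if it ignores us.

    Returns the worker's exit code (negative signal number if killed by a signal).
    """
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX; cannot be caught or ignored
        proc.wait()
    return proc.returncode
```

This only covers an orderly shutdown of the daemon. If the daemon itself is killed outright, its workers never receive either signal, so jobs can still be left marked as running.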
