Catch orphaned "running" jobs from MIA workers #31

dd32 · 2016-12-07T01:14:48Z

Similar to #18, we ran into an issue on WordPress.org where one of the Cavalcade daemons was killed unexpectedly, which left a bunch of jobs in an unknown state.

The jobs were marked as running, but there were no workers to manage those jobs anymore. This resulted in jobs not running for a few hours/days until they were detected and restarted.

There should be some way to detect that a job is no longer running, or that its daemon is MIA.

In this case, I simply restarted the jobs: UPDATE wp_cavalcade_jobs SET status = 'waiting' WHERE status = 'running' AND nextrun <= '2016-12-06'


larssn · 2016-12-07T08:53:04Z

A daemon being killed off while running jobs is especially relevant in horizontally scaled setups. If a node is taken offline, then the jobs can get stuck in this state, and have to be restarted manually.

rmccue · 2016-12-07T09:42:14Z

@larssn Indeed, we've been working on solving this ourselves. It's tough to come up with a solution to it. Mostly, we've been focusing on ensuring the daemon safely shuts down the workers.

If anyone has better ideas, all ears. :)

willmot · 2016-12-07T11:44:39Z

Could you make use of the flock side-effect of clearing file locks when the PHP process dies? We've used this on BackUpWordPress (xibodevelopment/backupwordpress#1025) to detect when a long-running process has crashed or been killed, so we can update its status accordingly rather than having it forever show as (incorrectly) running.
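
A minimal sketch of the flock idea, written in Python rather than PHP and assuming a POSIX system where the checker and the worker share a local filesystem. The `is_running` helper, the `HOLDER` script and the lock file path are hypothetical names for illustration, not code from BackUpWordPress or Cavalcade. The worker holds an exclusive lock for its whole lifetime; if a non-blocking attempt to take the same lock succeeds, the worker must be gone.

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# Stand-in for a long-running worker: takes the lock, then sleeps.
HOLDER = """
import fcntl, sys, time
f = open(sys.argv[1], 'w')
fcntl.flock(f, fcntl.LOCK_EX)
print('locked', flush=True)
time.sleep(60)
"""


def is_running(lock_path):
    """Return True if a live process holds an exclusive flock on lock_path."""
    with open(lock_path, "a") as f:
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True  # someone else still holds the lock
        fcntl.flock(f, fcntl.LOCK_UN)
        return False


def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    proc = subprocess.Popen(
        [sys.executable, "-c", HOLDER, path], stdout=subprocess.PIPE
    )
    proc.stdout.readline()  # wait until the worker has taken the lock
    before = is_running(path)
    proc.kill()  # simulate the worker being killed unexpectedly
    proc.wait()
    after = is_running(path)  # the kernel released the lock when the process died
    proc.stdout.close()
    os.unlink(path)
    return before, after
```

No cleanup code has to run in the dying process for this to work, which is what makes it attractive for crash detection on a single host.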

rmccue · 2016-12-07T12:07:39Z

Unfortunately not, due to the horizontal distribution. IIRC NFS doesn't support file locks either.

dd32 · 2016-12-08T01:13:00Z

The "correct" approach here for horizontally distributed apps would be for each daemon to have a DB row listing its status and where it's running from, with a date stamp bumped every minute or so.
Jobs would then have to be listed as status = running, server = ID#4.

Other daemons would need to periodically check the table to see if any other daemons had started, accepted jobs, and gone away without marking themselves as shutdown, and initiate a cleanup.

It'd probably need to be implemented as a wp-cli job that performs the health checks, fired off by each daemon every ~5 mins (i.e. it can't be a listed job; it'd have to be a custom thing, as you want it on each server).
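
A rough sketch of that heartbeat-and-cleanup idea, using Python with an in-memory SQLite database as a stand-in for MySQL. The `daemons` and `jobs` tables, their columns, the `STALE_AFTER` threshold and the function names are all assumptions for illustration; none of this is Cavalcade's actual schema or code.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE daemons (id INTEGER PRIMARY KEY, host TEXT, last_seen INTEGER);
CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, daemon_id INTEGER);
"""

STALE_AFTER = 300  # seconds without a heartbeat before a daemon is presumed gone


def heartbeat(db, daemon_id, now):
    """Bump the daemon's timestamp; each daemon would call this every minute or so."""
    db.execute("UPDATE daemons SET last_seen = ? WHERE id = ?", (now, daemon_id))


def requeue_orphans(db, now):
    """Reset running jobs owned by stale daemons; return how many were reset."""
    cur = db.execute(
        """
        UPDATE jobs SET status = 'waiting', daemon_id = NULL
        WHERE status = 'running'
          AND daemon_id IN (SELECT id FROM daemons WHERE last_seen < ?)
        """,
        (now - STALE_AFTER,),
    )
    return cur.rowcount


def demo():
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany(
        "INSERT INTO daemons VALUES (?, ?, ?)",
        [(1, "web-1", 1000), (4, "web-4", 1000)],
    )
    db.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?)",
        [(10, "running", 1), (11, "running", 4), (12, "waiting", None)],
    )
    heartbeat(db, 1, 1400)  # daemon 1 is still alive at t=1400
    reset = requeue_orphans(db, 1400)  # daemon 4 was last seen 400s ago
    rows = db.execute("SELECT id, status, daemon_id FROM jobs ORDER BY id").fetchall()
    return reset, rows
```

Here only the job owned by the silent daemon (ID 4) goes back to `waiting`; the job on the daemon that is still sending heartbeats is left alone, unlike a blanket reset of everything marked `running`.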

rmccue · 2024-03-27T18:12:16Z

See humanmade/Cavalcade-Runner#75 for a solution for system shutdown, where we will pass SIGTERM to the workers and SIGKILL if they fail to respond.
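
For readers unfamiliar with the pattern, a generic sketch of "SIGTERM first, SIGKILL if there is no response" in Python on POSIX. This is not the code from the linked pull request; `stop_worker`, the `grace` timeout and the demo workers are hypothetical.

```python
import signal
import subprocess
import sys


def stop_worker(proc, grace=5.0):
    """Send SIGTERM, wait up to `grace` seconds, then SIGKILL. Returns the exit code."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; cannot be caught or ignored
        return proc.wait()


def demo():
    # A worker that ignores SIGTERM, forcing the escalation path.
    stubborn = subprocess.Popen(
        [
            sys.executable,
            "-c",
            "import signal, time;"
            "signal.signal(signal.SIGTERM, signal.SIG_IGN);"
            "print('ready', flush=True);"
            "time.sleep(60)",
        ],
        stdout=subprocess.PIPE,
    )
    stubborn.stdout.readline()  # wait until SIGTERM is actually being ignored
    # A worker with default signal handling, which exits on SIGTERM.
    polite = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    codes = (stop_worker(polite, grace=5.0), stop_worker(stubborn, grace=0.5))
    stubborn.stdout.close()
    return codes
```

On POSIX, `Popen.wait()` reports a negative return code equal to the signal number when the child was killed by a signal, so the two workers above can be told apart by which signal ended them.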
