If a worker is abruptly disconnected from Redis, it becomes a ghost.
How to reproduce: start a worker, then kill it abruptly (e.g. kill -9, or a machine failure). Its worker key remains in Redis indefinitely, so the dead worker still appears to be alive.
My approach for this would probably be to:
If this makes sense and if there are no objections, I could try producing a patch.
I like your approach. I have a few suggestions, though.
Workers will have to predict how long they will stay alive and announce that via heartbeats. When a job is received, that job's timeout value (180 by default) is a good estimate of how long the worker will still be alive. Therefore, let's set the worker's expiry time to the job's timeout plus a fixed safety margin of, say, 60 seconds.
When the worker is idle (i.e. blocking in BLPOP), we can use another expiry time—say your proposed 6 minutes (plus the same 60-second safety margin). In other words: when idle, set the expiry to 360 + 60 = 420 seconds; when busy, to job.timeout + 60.
In the main loop, right before invoking BLPOP we don't know whether we'll become idle or not, so this means that there can be multiple EXPIRE commands per loop iteration. For example:

    EXPIRE 420
    BLPOP ... TIMEOUT 360
    EXPIRE 240

(The EXPIRE 420 is the idle expiry; the EXPIRE 240, issued once a job is received, assumes the default job timeout of 180.)
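To make this concrete, here is a minimal Python sketch of such a main loop using redis-py; the heartbeat() helper, key names, and overall structure are illustrative assumptions, not RQ's actual implementation:

    import redis

    BLPOP_TIMEOUT = 360   # idle polling timeout (6 minutes)
    SAFETY_MARGIN = 60    # fixed period added to every expiry

    def heartbeat(conn, worker_key, ttl):
        # Refresh the worker key's TTL so the key disappears on its own
        # if the worker dies abruptly and stops heartbeating.
        conn.expire(worker_key, ttl)

    def main_loop(conn, worker_key, queue_key):
        while True:
            # About to block: we may stay idle for the full BLPOP timeout.
            heartbeat(conn, worker_key, BLPOP_TIMEOUT + SAFETY_MARGIN)  # EXPIRE 420
            result = conn.blpop([queue_key], timeout=BLPOP_TIMEOUT)
            if result is None:
                continue  # timed out while idle; refresh and block again
            _, job_id = result
            job_timeout = 180  # in reality, read from the dequeued job
            # Busy: we expect to be alive for at most job.timeout + margin.
            heartbeat(conn, worker_key, job_timeout + SAFETY_MARGIN)  # EXPIRE 240
            # ... fetch and execute the job ...

A crashed worker simply stops refreshing its key, so Redis expires it and the ghost cleans itself up.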
I'm not a fan of your point 6, though, as it requires a thorough understanding of RQ's implementation details, which should not be necessary for job implementors. By using the job's timeout + 60 seconds, I don't think this is necessary at all.
What do you think?
I agree with everything you said, except for the following:
job_default_timeout * 2
Isn't your problem that you'd want to run jobs without timeouts? Why wouldn't you specify a timeout of 9999999?
On the one hand, if my job becomes unresponsive, I want it to time out quickly and be killed. If my worker dies a tragic death (the whole machine fails), I want it to become known quickly that the worker is defunct. So I like short timeouts.
On the other hand, my jobs typically take a long time.
I think I quite like the simplicity of job.heartbeat(), although it might be purer to model it on the worker instance. We should therefore expose an API to get a handle on the current Worker, which does not currently exist. For a reference implementation, see how I implemented the get_current_job() function using context locals. Note: that isn't merged into master yet.
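For illustration, the context-local pattern could look roughly like this; threading.local stands in for whatever RQ actually uses, and get_current_worker() is a hypothetical name:

    import threading

    # Hypothetical context-local registry for the current worker,
    # mirroring the get_current_job() pattern; not RQ's actual code.
    _local = threading.local()

    def set_current_worker(worker):
        # The worker registers itself here before executing a job.
        _local.worker = worker

    def get_current_worker():
        # Return the Worker executing the current job, or None.
        return getattr(_local, 'worker', None)

With something like this in place, a job could call get_current_worker().heartbeat() without knowing anything else about RQ's internals.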
This is already venturing pretty far from the original issue's subject, but I would also like jobs to have the ability to "yield" intermediate results (think about deferred APIs with progress support, like jQuery has). Maybe heartbeats should really be empty yielded values? I'm not trying to derail the discussion, but I think it would be a shame to add worker.heartbeat() only to add job.progress() (or similar) shortly after.
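As a purely hypothetical illustration of that idea (no such RQ API exists), a job could be written as a generator whose yielded values report progress, with a bare yield doubling as a heartbeat:

    # Hypothetical: the worker would iterate the generator, treating each
    # yielded value as a progress update and an implicit heartbeat.
    def resize_images(paths):
        for i, path in enumerate(paths, 1):
            resize(path)                    # assumed helper doing the real work
            yield float(i) / len(paths)     # progress fraction; bare `yield` = heartbeat only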
@yaniv-aknin: Could you elaborate on the yielding of values in a separate GitHub issue? Do you have a proposal for an actual implementation? It probably won't make it into 0.4, but I'd love to have a bigger picture / roadmap for RQ.
See pull request 173.
(I wish there was a better way to attach pull requests to issues)
Yaniv, thanks a lot! I want you to know I saw this, and I will look at it with deeper scrutiny later this week. It might take me a few days. Sorry for that, but I will come back to this!
Any progress on the pull request perhaps?
I haven't forgotten, just have not had the time to look at it. Sorry, it's been a few crazy busy weeks for me :(
Sure thing, I'm getting much more than I pay for... I'd just hate to fork and run my own rq in my project if this is going to be merged in a week or two; I'll bug you again some other time. :)
Sorry for the delay! It's a great patch. I've added a few more commits on top of it and just merged it into master.
I'm still looking for the best way to deal with the current mess that RQ created everywhere: everybody potentially has ghost workers floating around. One solution I can think of is, upon worker startup, to loop over all worker keys and set an expiry on the ones that don't have one yet. This won't remove them immediately, but it would eventually clean up all the ghost workers, without RQ users having to resort to custom scripting. I cannot think of a case where this would actually do any harm.
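A sketch of that startup sweep, assuming redis-py and an rq:worker: key prefix; the constant and function name are made up for illustration:

    import redis

    IDLE_WORKER_TTL = 420  # the idle expiry discussed above (360 + 60)

    def expire_ghost_workers(conn):
        # Give every legacy worker key an expiry, so ghost workers
        # left behind by older versions eventually disappear.
        for key in conn.keys('rq:worker:*'):
            if conn.ttl(key) in (None, -1):  # no expiry set (return value varies by version)
                conn.expire(key, IDLE_WORKER_TTL)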
I did add this, so running rqworker will automatically clean up the legacy mess: 223e09f
Your solution looks good to me. A few versions from now I think we can make the cleanup optional. Thanks for merging!
Yes, I think we'll do that in 0.5 or 0.6 or so. Thanks for the patch.
I think I'll release a small version with the latest small patches as 0.3.6 on Monday.
Great, I'll start using it right away.
This issue is fixed.