worker becomes a ghost after being disconnected #138

Closed
mekza opened this Issue Sep 10, 2012 · 20 comments

mekza commented Sep 10, 2012

If the worker is abruptly disconnected from Redis, it becomes a ghost.

How to reproduce:

  1. SSH into your box
  2. start your worker ./worker.py (as a child of your shell)
  3. kill the SSH connection (Cmd+Q)
  4. the worker lingers as a ghost in rq-dashboard
Contributor

yaniv-aknin commented Nov 20, 2012

My approach for this would probably be to:

  1. Set a reasonable and configurable failsafe TTL on worker keys (say, one hour by default).
  2. Workers will refresh their TTL every time they take or complete a job.
  3. Workers will refresh their TTL every TTL/10 (so every six minutes by default). This means workers will no longer block forever in blpop, but rather use blpop with a timeout of TTL/10.
  4. Any code iterating rq:workers (e.g., rqinfo) will silently discard missing workers, much like missing jobs in queues are discarded.
  5. When queuing a task, it should be possible to specify a special TTL for the worker executing it.
  6. [optional] An API will be added so that task code can extend its worker's TTL. That way, cooperating tasks can use a small TTL but periodically signal that their worker is still alive.

If this makes sense and if there are no objections, I could try producing a patch.
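
A rough sketch of what that refresh loop could look like, written against plain redis-py (the key names, the work_loop/execute_job functions, and the one-hour default are illustrative, not RQ internals):

```python
import redis

conn = redis.Redis()

WORKER_TTL = 3600                     # failsafe TTL from point 1 (one hour)
REFRESH_INTERVAL = WORKER_TTL // 10   # refresh every TTL/10 (point 3)


def execute_job(job_id):
    """Placeholder for the real job execution."""


def work_loop(worker_key, queue_key):
    while True:
        # Refresh the worker key so it can never outlive a dead worker
        # by more than WORKER_TTL seconds.
        conn.expire(worker_key, WORKER_TTL)

        # Block for at most REFRESH_INTERVAL instead of blocking forever,
        # so the TTL is refreshed even while the queue stays empty.
        result = conn.blpop(queue_key, timeout=REFRESH_INTERVAL)
        if result is None:
            continue                  # no job arrived; refresh TTL and retry

        _, job_id = result
        execute_job(job_id)

        # Point 2: refresh again after completing a job.
        conn.expire(worker_key, WORKER_TTL)
```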

Owner

nvie commented Nov 20, 2012

I like your approach. I have a few suggestions, though.

Workers should announce how long they expect to stay alive by sending heartbeats. When a job is received, that job's timeout value (180 seconds by default) is a good estimate of how long the worker will still be alive. Therefore, let's set the worker's expiry time to the job's timeout plus a fixed margin of, say, 60 seconds.

When the worker is idle (i.e. blocking in BLPOP), we can use another expiry time, say your proposed 6 minutes (plus the same fixed 60-second safety margin).

So:

  • When idle: 360 + 60 = 420 seconds
  • When a job is received: job.timeout + 60 seconds

In the main loop, right before invoking BLPOP we don't know if we'll become idle or not, so this means that there can be multiple EXPIRE commands per loop iteration. For example:

  • When becoming idle, EXPIRE 420 -> BLPOP ... TIMEOUT 360 -> (no jobs, idling) -> next loop
  • When not idle, EXPIRE 420 -> BLPOP ... TIMEOUT 360 -> (job received) -> EXPIRE 240 (overwrites/tightens the previously set expiry time)

(The EXPIRE 240 here assumes the default job timeout.)
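
In rough Python against a bare redis-py connection, using only the numbers from above (main_loop, fetch_job, perform_job, and the Job tuple are illustrative placeholders, not RQ's actual code), the loop would look something like this:

```python
import collections
import redis

conn = redis.Redis()

IDLE_BLPOP_TIMEOUT = 360   # block for at most 6 minutes when no jobs arrive
SAFETY_MARGIN = 60         # fixed extra period added to every expiry

Job = collections.namedtuple('Job', ['id', 'timeout'])


def fetch_job(job_id):
    """Placeholder: load the job and its timeout (default 180) from Redis."""
    return Job(id=job_id, timeout=180)


def perform_job(job):
    """Placeholder for actually executing the job."""


def main_loop(worker_key, queue_key):
    while True:
        # We don't know yet whether we'll go idle, so set the idle expiry
        # (360 + 60 = 420 seconds) before blocking.
        conn.expire(worker_key, IDLE_BLPOP_TIMEOUT + SAFETY_MARGIN)

        result = conn.blpop(queue_key, timeout=IDLE_BLPOP_TIMEOUT)
        if result is None:
            continue      # stayed idle; the next iteration sets EXPIRE again

        _, job_id = result
        job = fetch_job(job_id)

        # A job arrived: tighten the expiry to job.timeout + 60
        # (240 seconds with the default timeout of 180).
        conn.expire(worker_key, job.timeout + SAFETY_MARGIN)
        perform_job(job)
```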

I'm not a fan of your point 6, though, as it requires a thorough understanding of RQ's implementation details, which should not be necessary for job implementors. With the job's timeout + 60 seconds in place, I don't think it's necessary at all.

What do you think?

Contributor

yaniv-aknin commented Nov 20, 2012

I agree with all of your corrections, except for the following:

  1. Assuming the idle timeout remains explicitly configurable, let's default it to job_default_timeout * 2. That keeps the default reasonable yet flexible.
  2. The problem I have with job.timeout + 60, which is usually a very good mechanism, is the case of very long jobs that might get stuck or fail miserably (which is my unfortunate use case). How about this: the worker timeout when a job is received stays job.timeout + 60, but we also add an API to extend a job's timeout while it is running (and, implicitly, the worker timeout). This mechanism doesn't have to expose much of RQ's implementation at all; job.heartbeat() should be enough (a rough sketch follows below).
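
For illustration only, a long-running job using the proposed job.heartbeat() could look like this. Note that heartbeat() does not exist yet, get_current_job() is the helper nvie mentions further down, and crunch_big_dataset/process are made-up names:

```python
from rq import get_current_job   # helper discussed below; not yet in master at this point


def process(chunk):
    """Placeholder for the real per-chunk work."""


def crunch_big_dataset(chunks):
    """A long-running job that periodically signals it is still alive."""
    job = get_current_job()
    for chunk in chunks:
        process(chunk)
        job.heartbeat()   # proposed API: push the job/worker expiry forward
```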

Your thoughts?

Owner

nvie commented Nov 20, 2012

Isn't your problem that you'd want to run jobs without timeouts? Why wouldn't you specify a timeout of 9999999?

Contributor

yaniv-aknin commented Nov 20, 2012

On the one hand, if my job becomes unresponsive, I want it to time out quickly and be killed. And if my worker dies a tragic death (the whole machine fails), I want it to become known quickly that the worker is defunct. So I like short timeouts.
On the other hand, my jobs typically take a long time.

Owner

nvie commented Nov 25, 2012

I think I quite like the simplicity of job.heartbeat(), although it might be purer to model it on the worker instance. We should therefore expose an API to get a handle on the current Worker, which currently does not exist. For a reference implementation, see how I implemented the get_current_job() function, which uses context locals. Note: that code isn't merged into master yet.

TODO's:

  • Expose a get_current_worker API function
  • Add the .heartbeat() method to the Worker class
  • Write tests
  • Write documentation
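
A minimal sketch of how those TODOs could fit together, assuming a simple module-level context local (the Worker attributes shown and the 420-second default are assumptions for illustration, not RQ's actual implementation):

```python
_current_worker = None   # set by the worker right before it executes a job


def get_current_worker():
    """Return the Worker executing the current job, or None outside a worker."""
    return _current_worker


class Worker(object):
    # Illustrative attributes: a redis connection and this worker's Redis key.
    def __init__(self, connection, key):
        self.connection = connection
        self.key = key

    def heartbeat(self, ttl=420):
        """Proposed: push this worker's key expiry ttl seconds into the future."""
        self.connection.expire(self.key, ttl)
```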
Contributor

yaniv-aknin commented Nov 25, 2012

This is already venturing pretty far from the original issue's subject, but I would also like jobs to have the ability to "yield" intermediate results (think about deferred APIs with progress support, like jQuery has). Maybe heartbeats should really be empty yielded values? I'm not trying to derail the discussion, but I think it would be a shame to add worker.heartbeat() only to add job.progress() (or similar) shortly after.
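
Purely as an illustration of the idea (none of these APIs exist in RQ, and transcode_video/do_one_step are made up), a generator-style job could yield progress values that double as heartbeats:

```python
def do_one_step(path, step):
    """Placeholder for one unit of work."""


def transcode_video(path, total_steps=100):
    """Illustrative generator job: every yield is an intermediate result
    that the worker could also treat as a heartbeat."""
    for step in range(total_steps):
        do_one_step(path, step)
        yield {'progress': (step + 1) / float(total_steps)}
        # A bare `yield` with no value could serve as a plain heartbeat.
```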

Your thoughts?

Owner

nvie commented Jan 23, 2013

@yaniv-aknin: Could you elaborate on the yielding of values in a separate GitHub issue? Do you have a proposal for an actual implementation of this? It probably won't go into 0.4, but I'd love to have a bigger picture / roadmap for RQ.

Contributor

yaniv-aknin commented Jan 26, 2013

See pull request 173.

(I wish there were a better way to attach pull requests to issues.)

Owner

nvie commented Jan 28, 2013

Yaniv, thanks a lot! I want you to know I saw this, and I will look at it more closely sometime later this week. It might just take me a few days. Sorry for that, but I will come back to this!

Contributor

yaniv-aknin commented Feb 12, 2013

Any progress on the pull request perhaps?

Owner

nvie commented Feb 13, 2013

I haven't forgotten, just have not had the time to look at it. Sorry, it's been a few crazy busy weeks for me :(

Contributor

yaniv-aknin commented Feb 13, 2013

Sure thing, I'm getting much more than what I pay for... I'd just hate to fork and use my own rq in my project if this ends up merged in a week or two anyway; I'll bug you again some other time. :)

Owner

nvie commented Feb 15, 2013

Sorry for the delay! It's a great patch. I've added a few more commits on top of it and just merged it into master.

I'm still looking for the best way to deal with the current mess that RQ created everywhere: everybody potentially has ghost workers floating around. One solution I can think of is, upon worker startup, to loop over all worker keys and expire the ones that have no expiry time set yet. This won't remove them immediately, but it would eventually have all the ghost workers cleaned up, without RQ users having to resort to custom scripting to clean up the mess. I cannot think of a case where this would actually do any harm.
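
A rough sketch of that startup cleanup, assuming the rq:workers set mentioned earlier holds the worker key names (clean_legacy_worker_keys and the 420-second default are illustrative; the actual change is the commit referenced in the next comment):

```python
import redis

conn = redis.Redis()


def clean_legacy_worker_keys(default_ttl=420):
    """Give every registered worker key without an expiry a TTL, so
    pre-existing ghost workers eventually disappear on their own."""
    for worker_key in conn.smembers('rq:workers'):
        if conn.ttl(worker_key) == -1:   # -1: the key exists but has no expiry
            conn.expire(worker_key, default_ttl)
```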

Owner

nvie commented Feb 15, 2013

I did add this, so running rqworker will automatically clean up the legacy mess: 223e09f

Contributor

yaniv-aknin commented Feb 16, 2013

Your solution looks good to me. A few versions from now I think we can make the cleanup optional. Thanks for merging!

Owner

nvie commented Feb 16, 2013

Yes, I think we'll do that in 0.5 or 0.6 or so. Thanks for the patch.

Owner

nvie commented Feb 16, 2013

I think I'll release a small version with the latest small patches as 0.3.6 on Monday.

Contributor

yaniv-aknin commented Feb 16, 2013

Great, I'll start using it right away.

Collaborator

selwin commented Apr 19, 2017

This issue is fixed.

selwin closed this Apr 19, 2017
