Timeout SIGALRM not enough to interrupt on some kind of worker #323
Comments
Another similar case and solution: |
Could we fix this by moving the SIGALRM code that is currently handled by the worker into the parent, and having the parent kill and clean up the child on timeout? I can probably get a PR together, but I'd like some input from the core devs on recommended approaches before I give it a shot. |
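For reference, the child-side mechanism being discussed works roughly like the sketch below: the work horse installs a SIGALRM handler that raises an exception and schedules the alarm with signal.alarm. This is a simplified, illustrative sketch of the idea behind rq.timeouts.UnixSignalDeathPenalty, not RQ's actual code; the class name and message here are made up:

import signal

class JobTimeoutException(Exception):
    """Raised inside the work horse when a job exceeds its timeout."""

class death_penalty:  # illustrative stand-in for rq.timeouts.UnixSignalDeathPenalty
    def __init__(self, timeout):
        self.timeout = timeout

    def handle_death_penalty(self, signum, frame):
        raise JobTimeoutException('Job exceeded maximum timeout (%d seconds)' % self.timeout)

    def __enter__(self):
        # Install the handler and ask the kernel for SIGALRM in `timeout` seconds.
        signal.signal(signal.SIGALRM, self.handle_death_penalty)
        signal.alarm(self.timeout)

    def __exit__(self, exc_type, exc_value, traceback):
        signal.alarm(0)  # job finished in time: cancel the pending alarm

# The work horse runs the job under this context manager, e.g.:
#     with death_penalty(job.timeout):
#         job.perform()
# If job.perform() is stuck inside native code, the Python-level
# handler never gets a chance to run and the exception is never raised.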
I'm also hitting this issue when the alarm should fire during a SQLAlchemy commit(). The job never detects the alarm signal (it is unclear whether the signal is even being fired at that point), and I end up with jobs running way past the timeout. |
Some libraries or tasks may catch SIGALRM, so it never reaches the worker's handler. I think having the parent process monitor the forked process is a decent approach. This way, tasks will still be killed on time even if they misbehave or become unresponsive. What do you think @nvie ? |
I've had luck overriding execute_job and replacing the call to waitpid with this function. I can submit a PR if we like this approach.
|
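The function referenced above wasn't included in the thread. A hypothetical sketch of what a non-blocking replacement for a plain os.waitpid call could look like (wait_for_horse and its parameters are illustrative, not RQ API): poll the child with os.WNOHANG against a deadline, and hard-kill it if the deadline passes.

import os
import signal
import time

def wait_for_horse(horse_pid, timeout, grace=10.0, poll_interval=0.5):
    """Hypothetical replacement for a blocking os.waitpid(pid, 0):
    reap the work horse normally, but SIGKILL it once it has run
    `grace` seconds past its `timeout`."""
    deadline = time.monotonic() + timeout + grace
    while True:
        pid, status = os.waitpid(horse_pid, os.WNOHANG)
        if pid == horse_pid:
            return status  # horse exited on its own
        if time.monotonic() >= deadline:
            # Last resort: no cleanup code runs in the child after SIGKILL.
            os.kill(horse_pid, signal.SIGKILL)
            _, status = os.waitpid(horse_pid, 0)  # reap the killed horse
            return status
        time.sleep(poll_interval)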
One thing we discovered with a little more research is that the current mechanism (using SIGALRM to raise an exception) ties in nicely with in-work-horse exception handling. An external SIGKILL will not allow any cleanup to run, which could introduce other wrinkles. The best solution for us, I think, would be a staged timeout, where the timeout is monitored in both the child and parent processes. The parent-process timeout would be greater than the child timeout and would be used as a failsafe if for some reason the child did not respond to the SIGALRM. That way, the interface presented (jobs have "live" control over timeout via try/except) remains as is, but the hole is closed. |
Agree with @aroberts' observation. We could configure the parent process to send a kill signal to the child process if the child is still alive, say, 10 seconds after the timeout. |
Hopefully this could be helpful to anyone else who stumbles onto this bug/feature. I've subclassed the worker and added a staged hard kill in case the forked process doesn't observe the timeout and/or gets really stuck. In my case, paramiko has an outstanding deadlock that has been reported, but not yet fixed.

class MyWorker(Worker):
    def monitor_work_horse(self, job):
        ...
        try:
            job.started_at = job.started_at or datetime.utcnow()
            with UnixSignalDeathPenalty(self.job_monitoring_interval, HorseMonitorTimeoutException):
                retpid, ret_val = os.waitpid(self._horse_pid, 0)
            break
        except HorseMonitorTimeoutException:
            # Horse has not exited yet and is still running.
            # Send a heartbeat to keep the worker alive.
            self.heartbeat(self.job_monitoring_interval + 5)
            # Kill the job from this side if something is really wrong (interpreter lock/etc).
            if (datetime.utcnow() - job.started_at).total_seconds() > (job.timeout * 1.1):
                os.kill(self._horse_pid, signal.SIGKILL)
                break
        ...

The only changes are:

job.started_at = job.started_at or datetime.utcnow()

This is needed because the job's started_at isn't filled in on the parent process.

# Kill the job from this side if something is really wrong (interpreter lock/etc).
if (datetime.utcnow() - job.started_at).total_seconds() > (job.timeout * 1.1):
    os.kill(self._horse_pid, signal.SIGKILL)
    break

By giving it 10% more time, I'm hoping that in most if not all cases, the forked process will time out first if it can. If not, the parent process will forcibly kill it, preventing the rq worker from hanging. |
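A usage note for readers trying a subclass like this: the rq CLI can load a custom worker class, so (assuming the class above lives in an importable module named, say, myworkers) something like rq worker --worker-class myworkers.MyWorker should pick it up.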
I think having the main worker process monitor the horse is a good fallback to have. Could you please open a PR for this so we can merge it into RQ proper?
|
@selwin Can this be closed? Looks like the fix was merged. |
closed :) |
Hi,
My RQ task calls into a native extension library that does not respect SIGALRM, so when the timeout is reached, the non-responding work horse hangs on long beyond the timeout I expect.
The parent rqworker process is still blocked waiting on the child, but since the timeout has been reached, that worker is expected to be free, not busy. Some weird states show up in RQ dashboard: the worker may disappear, or sit waiting on an empty queue.
Can we have an enhancement that lets the parent rqworker monitor the timeout as well? If the timeout is reached, it could try sending a signal to the child, or just kill it.
Best Regards,
V.E.O
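For anyone who wants to reproduce the underlying behavior outside RQ: CPython only runs Python-level signal handlers between bytecode instructions, so a single long call into native code delays the handler until that call returns. A minimal standalone demonstration, making no assumptions about RQ (hashlib.pbkdf2_hmac is just a convenient long-running C call; the iteration count is arbitrary and machine-dependent):

import hashlib
import signal
import time

def handler(signum, frame):
    raise TimeoutError("SIGALRM fired")

signal.signal(signal.SIGALRM, handler)
signal.alarm(1)  # ask the kernel to deliver SIGALRM in 1 second

start = time.monotonic()
try:
    # One long-running call into C. The kernel delivers SIGALRM at ~1s,
    # but the Python-level handler can only run once control returns to
    # the interpreter, i.e. after this call completes.
    hashlib.pbkdf2_hmac("sha256", b"password", b"salt", 20_000_000)
except TimeoutError:
    pass

print("handler ran after %.1fs" % (time.monotonic() - start))  # well past 1s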