Timeout SIGALRM not enough to interrupt on some kind of worker #323
Comments
Another similar case and solution: |
Could we fix this by moving the SIGALRM code that is currently handled by the worker into the parent, and having the parent kill and clean up the child on timeout? I can probably get a PR together, but I'd like some input from the core devs on recommended approaches before I give it a shot. |
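For reference, the child-side mechanism being discussed works roughly like the sketch below: the work horse installs a SIGALRM handler that raises an exception and schedules the alarm with signal.alarm. This is a simplified, illustrative sketch of the idea behind rq.timeouts.UnixSignalDeathPenalty, not RQ's actual code; the class name and message here are made up:

import signal

class JobTimeoutException(Exception):
    """Raised inside the work horse when a job exceeds its timeout."""

class death_penalty:  # illustrative stand-in for rq.timeouts.UnixSignalDeathPenalty
    def __init__(self, timeout):
        self.timeout = timeout

    def handle_death_penalty(self, signum, frame):
        raise JobTimeoutException('Job exceeded maximum timeout (%d seconds)' % self.timeout)

    def __enter__(self):
        # Install the handler and ask the kernel for SIGALRM in `timeout` seconds.
        signal.signal(signal.SIGALRM, self.handle_death_penalty)
        signal.alarm(self.timeout)

    def __exit__(self, exc_type, exc_value, traceback):
        signal.alarm(0)  # job finished in time: cancel the pending alarm

# The work horse runs the job under this context manager, e.g.:
#     with death_penalty(job.timeout):
#         job.perform()
# If job.perform() is stuck inside native code, the Python-level
# handler never gets a chance to run and the exception is never raised.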
I'm also hitting this issue when the alarm should fire during a SQLAlchemy commit(). The job never detects the alarm signal (it is unclear whether the signal is even being fired at that point), and I end up with jobs running way past the timeout. |
Some libraries or tasks may catch SIGALRM, so it never reaches the worker's handler. I think having the parent process monitor the forked process is a decent approach. This way, tasks will still be killed on time even if they misbehave or become unresponsive. What do you think @nvie ? |
I've had luck overriding execute_job and replacing the call to waitpid with this function. I can submit a PR if we like this approach.
|
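The function referenced above wasn't included in the thread. A hypothetical sketch of what a non-blocking replacement for a plain os.waitpid call could look like (wait_for_horse and its parameters are illustrative, not RQ API): poll the child with os.WNOHANG against a deadline, and hard-kill it if the deadline passes.

import os
import signal
import time

def wait_for_horse(horse_pid, timeout, grace=10.0, poll_interval=0.5):
    """Hypothetical replacement for a blocking os.waitpid(pid, 0):
    reap the work horse normally, but SIGKILL it once it has run
    `grace` seconds past its `timeout`."""
    deadline = time.monotonic() + timeout + grace
    while True:
        pid, status = os.waitpid(horse_pid, os.WNOHANG)
        if pid == horse_pid:
            return status  # horse exited on its own
        if time.monotonic() >= deadline:
            # Last resort: no cleanup code runs in the child after SIGKILL.
            os.kill(horse_pid, signal.SIGKILL)
            _, status = os.waitpid(horse_pid, 0)  # reap the killed horse
            return status
        time.sleep(poll_interval)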
One thing we discovered with a little more research is that the current mechanism (using SIGALRM to raise an exception) ties in nicely with in-work-horse exception handling. An external SIGKILL will not allow any cleanup to run, which could introduce other wrinkles. The best solution for us, I think, would be a staged timeout, where the timeout is monitored in both the child and parent processes. The parent-process timeout would be greater than the child timeout and would be used as a failsafe if for some reason the child did not respond to the SIGALRM. That way, the interface presented (jobs have "live" control over timeout via try/except) remains as is, but the hole is closed. |
Agree with @aroberts' observation. We could configure the parent process to send a kill signal to the child process if the child is still alive, say, 10 seconds after the timeout. |
Hopefully this could be helpful to anyone else who stumbles onto this bug/feature. I've subclassed the worker and added a staged hard kill in case the forked process doesn't observe the timeout and/or gets really stuck. In my case, paramiko has an outstanding deadlock that has been reported, but not yet fixed.

class MyWorker(Worker):
    def monitor_work_horse(self, job):
        ...
        try:
            job.started_at = job.started_at or datetime.utcnow()
            with UnixSignalDeathPenalty(self.job_monitoring_interval, HorseMonitorTimeoutException):
                retpid, ret_val = os.waitpid(self._horse_pid, 0)
            break
        except HorseMonitorTimeoutException:
            # Horse has not exited yet and is still running.
            # Send a heartbeat to keep the worker alive.
            self.heartbeat(self.job_monitoring_interval + 5)
            # Kill the job from this side if something is really wrong (interpreter lock/etc).
            if (datetime.utcnow() - job.started_at).total_seconds() > (job.timeout * 1.1):
                os.kill(self._horse_pid, signal.SIGKILL)
                break
        ...

The only changes are:

job.started_at = job.started_at or datetime.utcnow()

This is needed because the job's started_at isn't filled in on the parent process.

# Kill the job from this side if something is really wrong (interpreter lock/etc).
if (datetime.utcnow() - job.started_at).total_seconds() > (job.timeout * 1.1):
    os.kill(self._horse_pid, signal.SIGKILL)
    break

By giving it 10% more time, I'm hoping that in most if not all cases, the forked process will time out first if it can. If not, the parent process will forcibly kill it, preventing the rq worker from hanging. |
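A usage note for readers trying a subclass like this: the rq CLI can load a custom worker class, so (assuming the class above lives in an importable module named, say, myworkers) something like rq worker --worker-class myworkers.MyWorker should pick it up.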
I think having the main worker process monitor the horse is a good fallback to have. Could you please open a PR for this so we can merge it into RQ proper?
|
@selwin Can this be closed? Looks like the fix was merged. |
closed :) |
Hi,
My RQ task calls into a native extension library that does not respect SIGALRM, so when the timeout is reached, the non-responding work horse hangs on long beyond the timeout I expect.
The parent rqworker process is still blocked waiting on the child, but since the timeout has been reached, that worker is expected to be free, not busy. Some weird states show up in RQ dashboard: the worker may disappear, or sit waiting on an empty queue.
Can we have an enhancement that lets the parent rqworker monitor the timeout as well? If the timeout is reached, it could try sending a signal to the child, or just kill it.
Best Regards,
V.E.O
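For anyone who wants to reproduce the underlying behavior outside RQ: CPython only runs Python-level signal handlers between bytecode instructions, so a single long call into native code delays the handler until that call returns. A minimal standalone demonstration, making no assumptions about RQ (hashlib.pbkdf2_hmac is just a convenient long-running C call; the iteration count is arbitrary and machine-dependent):

import hashlib
import signal
import time

def handler(signum, frame):
    raise TimeoutError("SIGALRM fired")

signal.signal(signal.SIGALRM, handler)
signal.alarm(1)  # ask the kernel to deliver SIGALRM in 1 second

start = time.monotonic()
try:
    # One long-running call into C. The kernel delivers SIGALRM at ~1s,
    # but the Python-level handler can only run once control returns to
    # the interpreter, i.e. after this call completes.
    hashlib.pbkdf2_hmac("sha256", b"password", b"salt", 20_000_000)
except TimeoutError:
    pass

print("handler ran after %.1fs" % (time.monotonic() - start))  # well past 1s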