Timeout SIGALRM not enough to interrupt on some kind of worker #323

Closed · V-E-O opened this issue Mar 14, 2014 · 11 comments

Comments

@V-E-O commented Mar 14, 2014

Hi,

My RQ task calls into a native extension library that does not respect SIGALRM when the timeout is reached, so the non-responding work horse keeps running long past the timeout I expect.

The parent rqworker process is still waiting for the work horse to return even though the timeout has passed, so a worker that should be free stays busy. This also produces odd states in RQ dashboard: the worker may appear to disappear, or to be waiting on an empty queue.

Can we have an enhancement that lets the parent rqworker monitor the timeout? If the timeout is reached, it could send a signal to the child or simply kill it.
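
A minimal sketch of the failure mode (hashlib.pbkdf2_hmac below is only a stand-in for a long-running call into native code): the signal is delivered on time, but the Python-level handler cannot run until the C call returns, so the timeout fires far too late.

import hashlib
import signal
import time

class JobTimeout(Exception):
    pass

def on_alarm(signum, frame):
    raise JobTimeout("SIGALRM handler finally ran")

signal.signal(signal.SIGALRM, on_alarm)
signal.alarm(1)  # ask to be interrupted after 1 second

start = time.time()
try:
    # One long single call into C code: the pending SIGALRM is only handled
    # once this call returns, however long that takes.
    hashlib.pbkdf2_hmac("sha256", b"password", b"salt", 10_000_000)
except JobTimeout:
    pass
print("handler ran after %.1fs (alarm was set for 1s)" % (time.time() - start))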

Best Regards,
V.E.O

@V-E-O (Author) commented Mar 14, 2014

Another similar case and solution:
http://gavintech.blogspot.com/2008/02/python-sigalrm-oddness.html

@hhuuggoo commented Nov 5, 2014

Could we fix this by putting the SIGALRM code that is currently handled by the worker into the parent, and having the parent kill and clean up the child on timeout? I can probably get a PR together, but I'd like some input from the core devs on the recommended approach before I give it a shot.
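
For illustration, a rough sketch of that parent-side variant (the function name is made up, not an RQ internal; it assumes Python 3.5+, where os.waitpid is automatically retried after a signal handler runs):

import os
import signal

def wait_with_parent_alarm(pid, timeout):
    # The parent, not the work horse, arms SIGALRM and kills the child when
    # the alarm fires. `timeout` is in whole seconds.
    def on_alarm(signum, frame):
        os.kill(pid, signal.SIGKILL)

    previous = signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(timeout)
    try:
        # If the alarm fires first, the handler kills the child and the
        # retried waitpid then returns the killed child's status.
        return os.waitpid(pid, 0)
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, previous)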

@aroberts

I'm also hitting this issue when the alarm should fire during a SQLAlchemy commit(). The job never detects an alarm signal (it is unclear whether or not the signal is getting fired at this point), and I end up with jobs running way past the timeout.

@selwin (Collaborator) commented Mar 31, 2015

Some libraries or tasks may catch SIGALRM so it never reaches the worker. I think having the parent process monitor the forked process is a decent approach. This way, tasks will still be killed on time even if they misbehave or become unresponsive. What do you think @nvie ?

@hhuuggoo

I've had luck overriding execute_job and replacing the call to waitpid with this function. I can submit a PR if we like this approach:

import os
import signal
import threading
import time

def waitpid_timeout(pid, timeout=5.0):
    # Wait for the child in a helper thread so the parent can enforce a deadline.
    result = []

    def helper():
        result.append(os.waitpid(pid, 0))

    t = threading.Thread(target=helper)
    t.start()
    t.join(timeout=timeout)
    if t.is_alive():
        # The child is still running after `timeout` seconds: hard-kill it,
        # then let the helper thread reap it.
        os.kill(pid, signal.SIGKILL)
        t.join()
    return result[0]

if __name__ == "__main__":
    import subprocess

    # Finishes within the timeout: returns the normal exit status.
    p = subprocess.Popen("python -c 'import time; time.sleep(1)'", shell=True)
    st = time.time()
    print(waitpid_timeout(p.pid))
    print(time.time() - st)

    # Exceeds the timeout: the child is killed after ~5 seconds.
    p = subprocess.Popen("python -c 'import time; time.sleep(10)'", shell=True)
    st = time.time()
    print(waitpid_timeout(p.pid))
    print(time.time() - st)

@aroberts

One thing we discovered with a little more research is that the current approach (using SIGALRM to raise an exception) ties in nicely with in-workhorse exception handling. An external SIGKILL will not allow any cleanup to run, which could introduce other wrinkles.

The best solution for us, I think, would be a staged timeout, where the timeout is monitored both in the child and parent processes. The parent process timeout would be greater than the child timeout, and would be used as a failsafe if for some reason the child did not respond to the SIGALRM. That way, the interface presented (jobs have "live" control over timeout via try/except) remains as is but the hole is closed.
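
For concreteness, a rough sketch of that staged layout (the function name, grace period, and polling interval are illustrative, not RQ's actual API):

import os
import signal
import time

def wait_with_failsafe(pid, job_timeout, grace=10.0, poll=0.5):
    # The child still arms SIGALRM for `job_timeout` and handles it itself;
    # the parent only steps in `grace` seconds after that deadline has passed.
    deadline = time.time() + job_timeout + grace
    while True:
        finished_pid, status = os.waitpid(pid, os.WNOHANG)
        if finished_pid != 0:
            return status  # child exited, normally or via its own SIGALRM
        if time.time() > deadline:
            # Child blew through its own timeout; hard-kill it and reap it.
            os.kill(pid, signal.SIGKILL)
            return os.waitpid(pid, 0)[1]
        time.sleep(poll)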

@selwin (Collaborator) commented Mar 31, 2015

Agree with @aroberts' observation. We could configure the parent process to send a kill signal to the child process if the child process is still alive 10 seconds after timeout.

@bartonology (Contributor)

Hopefully this is helpful to anyone else who stumbles onto this bug/feature request. I've subclassed the worker and added a staged hard kill for cases where the forked process doesn't observe the timeout and/or gets really stuck. In my case, paramiko has an outstanding deadlock that has been reported but not yet fixed.

class MyWorker(Worker):
    def monitor_work_horse(self, job):
        ...
            try:
                job.started_at = job.started_at or datetime.utcnow()
                with UnixSignalDeathPenalty(self.job_monitoring_interval, HorseMonitorTimeoutException):
                    retpid, ret_val = os.waitpid(self._horse_pid, 0)
                break

            except HorseMonitorTimeoutException:
                # Horse has not exited yet and is still running.
                # Send a heartbeat to keep the worker alive.
                self.heartbeat(self.job_monitoring_interval + 5)

                # kill the job from this side if something is really wrong (interpreter lock/etc)
                if (datetime.utcnow() - job.started_at).total_seconds() > (job.timeout * 1.1):
                    os.kill(self._horse_pid, signal.SIGKILL)
                    break
        ...

The only changes are:

job.started_at = job.started_at or datetime.utcnow()

This is needed because the job's started_at isn't filled in on the parent process.

# kill the job from this side if something is really wrong (interpreter lock/etc)
if (datetime.utcnow() - job.started_at).total_seconds() > (job.timeout * 1.1):
    os.kill(self._horse_pid, signal.SIGKILL)
    break

By giving it 10% more time, I'm hoping that in most if not all cases the forked process will time out on its own if it can. If not, the parent process will forcibly kill it, preventing the RQ worker from hanging.
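
For anyone who wants to try something like this, such a subclass can be used by pointing the rq CLI at its dotted path (the module path below is just an example):

rq worker --worker-class mymodule.MyWorker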

@selwin (Collaborator) commented Dec 11, 2019 via email

@levrik (Contributor) commented Feb 3, 2020

@selwin Can this be closed? Looks like the fix was merged.

selwin closed this as completed Feb 3, 2020
@selwin (Collaborator) commented Feb 3, 2020

closed :)
