AsyncResult.wait(0) can hang waiting for the client to get results? #2215
Comments
Can you post code to reproduce this? I saw it a while ago, but can't reproduce it well enough to actually test and find a fix. I agree that relaying the timeout would be a decent band-aid, but there's a more serious bug causing this that I would like to actually find and fix.
requesting metadata (e.g. ar.data or ar.stdout) will result in flushing iopub if the outputs are incomplete, so separate wait(0) need not be called. This also applies the workaround discussed in ipython#2215
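A minimal sketch of that workaround, assuming a running cluster (ipcluster started) and a stand-in task function `double` (illustrative, not from the original thread): touching a metadata attribute such as ar.stdout flushes pending iopub output, so no explicit wait(0) is needed.

```python
from IPython.parallel import Client

rc = Client()  # connect to a running cluster (assumes ipcluster is up)
lv = rc.load_balanced_view()

def double(x):
    return x * 2

ar = lv.map_async(double, range(10))
ar.get()        # block until the results themselves have arrived
_ = ar.stdout   # touching metadata flushes pending iopub output,
                # so a separate ar.wait(0) is not needed
```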
So here's the problem with reproducing this. I have minimal code right now that will reproduce it in my configuration:

```python
import time
import numpy
from IPython.parallel import Client

rc = Client()
lv = rc.load_balanced_view()

def stupidfunction(r):
    import numpy.random
    return numpy.random.randn(100)  # You may need to increase this to see the problem.

a = lv.map_async(stupidfunction, numpy.zeros(10))
time.sleep(0.5)  # Run wait too quickly and the problem won't have time to arise.
a.wait(0)
```

However, this doesn't cause the problem on a local cluster; I don't think the problem is really noticeable in that case. I'm running my client at home over SSH through a VPN to the controller at work, which is in turn connected to the engines over fast SSH, so the link is not especially fast. I've not yet tried it from my office, which has a 100Mb line to the controller but still goes through SSH. If I run this with a local controller and engines, it's fine. This is why I believe it's a problem with results being transferred to either the controller or the client, probably the client. Actually, I've now also checked with the engines running on the same machine as the controller, and the problem still occurs, so it seems that the client-controller connection is the culprit. From what I can tell, self._ready is set to True by self._client.wait, but the 'outputs_ready' entries in self._metadata are still False, which causes the loop in _wait_for_outputs.
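A small diagnostic sketch of the state described above, reusing `a` from the reproduction code; _ready, _metadata, and the 'outputs_ready' key are internal attributes named in this comment, so treat this as illustrative rather than a stable API:

```python
a.wait(0)  # returns once results are in, possibly before outputs arrive
print(a._ready)                                      # True: results have arrived
print([md['outputs_ready'] for md in a._metadata])   # may still contain False
```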
I can also confirm that this does not happen now that #2255 has been merged, though I'm not sure whether that's just because of the timeout changes.
Thanks for the test case. It should behave the same as before if you do wait(10), but I will also try to reproduce the underlying issue myself by rolling back the fix. You can set client.debug=True to see all messages as they come through.
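For reference, a minimal sketch of enabling that debugging switch (attribute name taken from the comment above):

```python
from IPython.parallel import Client

rc = Client()
rc.debug = True  # show all messages as they come through
```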
Closed by #2255.
I have some code that runs multiple maps on a load balanced view. I run the code on a remote ipengine that I'm connected to via ssh; the client is running locally in an ipython notebook. The individual tasks take some time to run, perhaps a minute or two each.
It appears that AsyncResult instances can end up in a situation where all outstanding jobs for the result are finished, but the outputs are not actually ready for the client yet. Thus _ready is not set, and somehow wait(0) ends up at self._wait_for_outputs(10) (line 161 of asyncresult.py), which hangs for ten seconds owing to the ten-second timeout.
As a result, any function that uses wait(0) will hang for ten seconds before returning, including AsyncResult.progress.
While I'm not exactly sure why this is happening, it seems it can be solved reasonably easily by changing the hard-coded 10 to timeout, so that _wait_for_outputs honors the timeout given to wait.
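To make the proposed fix concrete, here is a self-contained toy model of the situation (not IPython's actual source; the class and attribute names are illustrative stand-ins) showing why passing timeout through to _wait_for_outputs removes the ten-second hang for wait(0):

```python
import time

# Toy model of the reported behavior; names mirror the discussion above,
# not IPython's real implementation.
class ToyAsyncResult:
    def __init__(self):
        self._ready = True            # the results themselves have arrived...
        self.outputs_ready = False    # ...but the displayed outputs have not

    def _wait_for_outputs(self, timeout):
        deadline = time.time() + timeout
        while not self.outputs_ready and time.time() < deadline:
            time.sleep(0.01)

    def wait(self, timeout):
        if self._ready:
            # Proposed fix: honor the caller's timeout here instead of
            # the hard-coded self._wait_for_outputs(10).
            self._wait_for_outputs(timeout)
        return self._ready

ar = ToyAsyncResult()
start = time.time()
ar.wait(0)   # returns immediately with the fix; with 10 it would hang
print("wait(0) took %.3f seconds" % (time.time() - start))
```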