Retries and Resubmit (#411 and #412) #413
I added # < start fresh Controller >
rc = Client()
# fetch all unfinished tasks (just IDs and timestamps)
unfinished = rc.db_query(dict(completed=None), keys=['msg_id', 'submitted'])
# restore submission order
reordered = sorted(unfinished, key=lambda d: d['submitted'])
# resubmit by msg_id
rc.resubmit([d['msg_id'] for d in reordered])

resubmit, like db_query, is a Client method, not a View method. I could easily be persuaded that Views should see this as well. Currently, this enforces that the resubmitted task is identical to the original, and only submitted via the load-balanced scheduler. It is feasible that one would want to make some kind of alteration to the header, or support resubmit of MUX operations, particularly with the goal of restoring cluster state. This should address #411
When I manually try to resubmit a task with the following code, I get an error message in ipcontroller. However, the task seems to be run again.

from IPython.parallel import Client
rc = Client()
rc.session.session = "job98"
rc.resubmit('99ef4117-dc9c-42b1-bad1-2ba44e1e9896')

[IPControllerApp] client::client '45e8113c-156a-42dc-9f35-71db2671ad84' requested 'resubmit_request'
[IPControllerApp] conflicting initial state for record: 99ef4117-dc9c-42b1-bad1-2ba44e1e9896:a762a146-a47b-4a18-9f49-8a383079cbf6 <> [u'45e8113c-156a-42dc-9f35-71db2671ad84']
[IPControllerApp] conflicting initial state for record: 99ef4117-dc9c-42b1-bad1-2ba44e1e9896:2011-05-05 00:50:30.171437 <> 2011-05-05 00:50:30.171000
[IPControllerApp] conflicting initial state for record: 99ef4117-dc9c-42b1-bad1-2ba44e1e9896:{"msg_ids":["99ef4117-dc9c-42b1-bad1-2ba44e1e9896"]} <> {}
Ah, that's what I get for only testing with SQLite. Should work with mongodb now.
I think I fixed that conflicting state issue just now, and a couple of small related issues. I see that you are doing the session override even on resubmits. Note that this will have no effect at all on the resubmitted tasks: they will be identical to the original in their headers, etc., which contain that information. The only information that changes is the result (potentially), the associated client_id, and information related to where/when it runs. If you are doing that to make sure it's the same as before, that's unnecessary, and if you want it to be different, that's not possible (I can make it possible, but it isn't currently).
Yes, you are right. Setting the session again makes no sense. I did some testing with my wrapper methods to create a DrQueue job (IPython session) and some tasks. See https://github.com/kaazoo/DrQueueIPython/blob/master/DrQueue/client.py for details.

First step: create the tasks of a job

python2.6 sendjob_ipython.py -s 1 -e 5 -b 1 -r blender -f /usr/local/drqueue/tmp/icetest.blend -n "job032" -o "{'rendertype':'animation'}" --owner "foobar"

Have a look at them. They are pending:

python2.6 listjobs_ipython.py
Tasks of job job032:
msg_id                                status   owner   completed at
0a348798-77ce-480e-a264-a726aa8d3c37  pending  foobar  2011-05-05 23:35:47
5e60e1ed-c531-4477-80e5-0ae7c760cc57  pending  foobar  2011-05-05 23:35:47
01915017-fcee-4b28-8e90-e50a364e8f96  pending  foobar  2011-05-05 23:35:47
f2158540-c58a-44b7-8e81-e47d6e828ece  pending  foobar  2011-05-05 23:35:47
4358e073-5641-49e4-b273-b58ed39e3d00  pending  foobar  2011-05-05 23:35:47

Wait a while. Now they are completed:

python2.6 listjobs_ipython.py
Tasks of job job032:
msg_id                                status   owner   completed at
0a348798-77ce-480e-a264-a726aa8d3c37  ok       foobar  2011-05-05 23:42:19
5e60e1ed-c531-4477-80e5-0ae7c760cc57  ok       foobar  2011-05-05 23:42:19
01915017-fcee-4b28-8e90-e50a364e8f96  ok       foobar  2011-05-05 23:42:45
f2158540-c58a-44b7-8e81-e47d6e828ece  ok       foobar  2011-05-05 23:42:45
4358e073-5641-49e4-b273-b58ed39e3d00  ok       foobar  2011-05-05 23:42:58

Second step: requeue all tasks of the job

python2.6 controljob_ipython.py -r -n job032
requeuing 0a348798-77ce-480e-a264-a726aa8d3c37
requeuing 5e60e1ed-c531-4477-80e5-0ae7c760cc57
requeuing 01915017-fcee-4b28-8e90-e50a364e8f96
requeuing f2158540-c58a-44b7-8e81-e47d6e828ece
requeuing 4358e073-5641-49e4-b273-b58ed39e3d00
Job job032 is running another time.

Have a look again. They are pending:

python2.6 listjobs_ipython.py
Tasks of job job032:
msg_id                                status   owner   completed at
0a348798-77ce-480e-a264-a726aa8d3c37  pending  foobar  2011-05-05 23:35:47
5e60e1ed-c531-4477-80e5-0ae7c760cc57  pending  foobar  2011-05-05 23:35:47
01915017-fcee-4b28-8e90-e50a364e8f96  pending  foobar  2011-05-05 23:35:47
f2158540-c58a-44b7-8e81-e47d6e828ece  pending  foobar  2011-05-05 23:35:47
4358e073-5641-49e4-b273-b58ed39e3d00  pending  foobar  2011-05-05 23:35:47

Wait a while. Hmm, one task hasn't finished, but the engines are idle:

python2.6 listjobs_ipython.py
Tasks of job job032:
msg_id                                status   owner   completed at
0a348798-77ce-480e-a264-a726aa8d3c37  ok       foobar  2011-05-05 23:45:33
5e60e1ed-c531-4477-80e5-0ae7c760cc57  ok       foobar  2011-05-05 23:45:33
01915017-fcee-4b28-8e90-e50a364e8f96  ok       foobar  2011-05-05 23:45:58
f2158540-c58a-44b7-8e81-e47d6e828ece  ok       foobar  2011-05-05 23:45:58
4358e073-5641-49e4-b273-b58ed39e3d00  pending  foobar  2011-05-05 23:45:58

Third step: requeue again

python2.6 controljob_ipython.py -r -n job032
requeuing 0a348798-77ce-480e-a264-a726aa8d3c37
requeuing 5e60e1ed-c531-4477-80e5-0ae7c760cc57
requeuing 01915017-fcee-4b28-8e90-e50a364e8f96
requeuing f2158540-c58a-44b7-8e81-e47d6e828ece
Traceback (most recent call last):
File "controljob_ipython.py", line 62, in <module>
main()
File "controljob_ipython.py", line 53, in main
client.job_rerun(options.name)
File "/Users/kaazoo/Documents/Entwicklung/drqueue-entwicklung/drqueue-zmq/DrQueue/client.py", line 217, in job_rerun
self.task_requeue(task['msg_id'])
File "/Users/kaazoo/Documents/Entwicklung/drqueue-entwicklung/drqueue-zmq/DrQueue/client.py", line 198, in task_requeue
self.ip_client.resubmit(task_id)
File "<string>", line 2, in resubmit
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/IPython/parallel/client/client.py", line 48, in spin_first
return f(self, *args, **kwargs)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/IPython/parallel/client/client.py", line 1098, in resubmit
raise self._unwrap_exception(content)
IPython.parallel.error.RemoteError: ValueError(Task u'4358e073-5641-49e4-b273-b58ed39e3d00' appears to be inflight)
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/IPython/parallel/controller/hub.py", line 1133, in resubmit_task
raise ValueError("Task %r appears to be inflight"%(msg_id))
ValueError: Task u'4358e073-5641-49e4-b273-b58ed39e3d00' appears to be inflight

What could be the cause of this? There's a pending task that can't be run by an engine and can't be requeued.
Is this consistently reproducible? The 'can't be run' and 'can't be requeued' are really one issue. If a job is listed as 'pending', it's not allowed to be resubmitted, because that would allow a race condition on the result, so if it's stuck in 'pending', it will stay that way, and it's a bug (probably in the Scheduler). You can specify a timeout on tasks, which should prevent it from getting stuck, at least. Does it have any dependencies? Can you do a db query on the task and post it here (excluding buffers)? Does the controller log show that it arrived on an engine?
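To make the "pending means no resubmit" rule concrete, here is a minimal toy model of that guard. It is only a sketch under stated assumptions: the in-memory `records` dict and the `resubmit_task` helper are hypothetical stand-ins, not the hub's actual code. Only the error message and the convention that unfinished tasks have completed=None are taken from this thread.

```python
# Toy model only: `records` and `resubmit_task` are hypothetical stand-ins
# for the hub's task database and its resubmit handler.

def resubmit_task(records, msg_id):
    """Mark a finished task as pending again; refuse if it is still inflight."""
    record = records[msg_id]
    if record['completed'] is None:
        # Still pending: a second copy could race with the first on the result.
        raise ValueError("Task %r appears to be inflight" % msg_id)
    record['completed'] = None  # the task is pending again
    return record


records = {
    'task-a': {'completed': '2011-05-05 23:45:33'},  # finished
    'task-b': {'completed': None},                   # stuck in pending
}

resubmit_task(records, 'task-a')      # accepted, task-a is pending again
try:
    resubmit_task(records, 'task-b')  # rejected, it never completed
    rejected = False
except ValueError:
    rejected = True
```

In this model a task that never completes can never be resubmitted, which matches the symptom reported above: the stuck task is both "not run" and "not requeueable" for the same reason.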
@kaazoo any updates on log output or patterns?
This allows other objects to call it, and build serialized messages without sending.
also add some lbv tests, and related fixes
closes ipythongh-412
closes ipythongh-411
* allow `content` in session.serialize to be a unicode object, because mongo+JSON cannot be relied upon to produce encoded bytes.
* use index on msg_id in mongodb backend (_table prevented some methods from working outside the session)
* purge_request improved to use fewer db calls
* mongodb testcase split into its own file
* Fix equality testing, NULL handling, in SQLiteDB backend
I tried it again today after pulling from https://github.com/ipython/ipython.git which already had your last commits in connection to this topic. The error situation as described above doesn't seem to happen anymore. Thanks.
Adds retries and resubmit logic to IPython.parallel. closes ipythongh-413
Add retries flag to LoadBalancedView, and resubmit method to Client.
Retry behavior is much the same as the previous version. If tasks fail, they will be retried on other engines up to a limit. The default limit is 0 (no retries).
retries is a flag like everything else, so it can be set by the View.retries attribute, View.temp_flags(), View.set_flags(), etc.
Will close #412
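The retry behavior described above (retry a failed task on other engines, up to a limit that defaults to 0) can be sketched in plain Python. This is a simplified illustration, not the scheduler's implementation: `run_with_retries`, the `flaky` task, and the integer engine IDs are all hypothetical names made up for the example.

```python
# Illustrative sketch only: none of these names come from IPython.parallel.

def run_with_retries(task, engines, retries=0):
    """Try `task` on successive engines, allowing `retries` re-runs after a failure.

    Returns (result, engines_tried). Re-raises the last error once the limit is hit.
    """
    if not engines:
        raise ValueError("no engines available")
    last_error = None
    tried = []
    # One initial attempt plus at most `retries` further attempts,
    # each on an engine that has not been tried yet.
    for engine in engines[:retries + 1]:
        tried.append(engine)
        try:
            return task(engine), tried
        except Exception as exc:
            last_error = exc
    raise last_error


def flaky(engine):
    """Hypothetical task that always fails on engine 0."""
    if engine == 0:
        raise RuntimeError("engine 0 failed")
    return "ok from %d" % engine


# With one retry, the failure on engine 0 is retried on engine 1.
result, tried = run_with_retries(flaky, [0, 1, 2], retries=1)

# With the default of 0 retries, the first failure is final.
try:
    run_with_retries(flaky, [0, 1, 2])
    failed_without_retries = False
except RuntimeError:
    failed_without_retries = True
```

The sketch never retries on an engine that already failed the task, mirroring the "retried on other engines" wording above.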