Retries and Resubmit (#411 and #412) #413

Closed

@minrk
IPython member

Add retries flag to LoadBalancedView, and resubmit method to Client.

Retry behavior is much the same as in the previous version: if a task fails, it is retried on other engines, up to a limit. The default limit is 0 (no retries). Like every other flag, retries can be set via the View.retries attribute, View.temp_flags(), View.set_flags(), etc.
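To make the retry rule concrete, here is a toy model in plain Python. It is not the scheduler's code: run_with_retries, the engine names, and the flaky task are all made up for illustration. It only mirrors the behavior described above, assuming one initial attempt followed by up to retries further attempts, each on a different engine.

```python
def run_with_retries(task, engines, retries=0):
    """Toy model: run `task` on one engine, moving to another on failure.

    Makes one initial attempt plus at most `retries` further attempts,
    each on a different engine. Returns (result, engines_tried).
    """
    attempts = []
    last_error = None
    # 1 initial attempt + `retries` retries, never reusing an engine
    for engine in engines[:retries + 1]:
        attempts.append(engine)
        try:
            return task(engine), attempts
        except Exception as exc:
            last_error = exc
    raise RuntimeError("task failed on %r: %r" % (attempts, last_error))


def flaky(engine):
    """Hypothetical task that only succeeds on engine-2."""
    if engine != "engine-2":
        raise ValueError("boom on %s" % engine)
    return "ok"


engines = ["engine-0", "engine-1", "engine-2"]
result, tried = run_with_retries(flaky, engines, retries=2)
print(result, tried)
```

With the default retries=0, the same call raises after the first failure, which corresponds to the "no retries" default.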

Will close #412

@minrk
IPython member

I added Client.resubmit([msg_ids]) for re-running previously finished, failed, or aborted tasks. This could also be used, to some degree, for resuming the queue of a crashed or shutdown cluster (assuming db backend in use):

# < start fresh Controller >
from IPython.parallel import Client
rc = Client()
# fetch all unfinished tasks (just IDs and timestamps)
unfinished = rc.db_query(dict(completed=None), keys=['msg_id', 'submitted'])
# restore submission order
reordered = sorted(unfinished, key=lambda d: d['submitted'])
# resubmit by msg_id
rc.resubmit([d['msg_id'] for d in reordered])

resubmit, like db_query, is a Client method, not a View method. I could easily be persuaded that Views should see this as well.

Currently, a resubmitted task must be identical to the original, and it can only be submitted via the load-balanced scheduler. It is conceivable that one would want to alter the header in some way, or to support resubmitting MUX operations, particularly with the goal of restoring cluster state.

This should address #411

@kaazoo

When I manually try to resubmit a task with the following code, I get an error message in ipcontroller. However, the task seems to be run again.

from IPython.parallel import Client
rc = Client()
rc.session.session = "job98"
rc.resubmit('99ef4117-dc9c-42b1-bad1-2ba44e1e9896')

[IPControllerApp] client::client '45e8113c-156a-42dc-9f35-71db2671ad84' requested 'resubmit_request'
[IPControllerApp] conflicting initial state for record: 99ef4117-dc9c-42b1-bad1-2ba44e1e9896:a762a146-a47b-4a18-9f49-8a383079cbf6 <> [u'45e8113c-156a-42dc-9f35-71db2671ad84']
[IPControllerApp] conflicting initial state for record: 99ef4117-dc9c-42b1-bad1-2ba44e1e9896:2011-05-05 00:50:30.171437 <> 2011-05-05 00:50:30.171000
[IPControllerApp] conflicting initial state for record: 99ef4117-dc9c-42b1-bad1-2ba44e1e9896:{"msg_ids":["99ef4117-dc9c-42b1-bad1-2ba44e1e9896"]} <> {}

@minrk
IPython member

Ah, that's what I get for only testing with SQLite. Should work with mongodb now.

@minrk
IPython member

I think I fixed that conflicting state issue just now, and a couple small related issues.

I see that you are doing the session override even on resubmits. Note that this will have no effect at all on the resubmitted tasks - they will be identical to the original in their headers, etc. which contain that information. The only information that changes is the result (potentially), the associated client_id, and information related to where/when it runs.

If you are doing that to make sure it's the same as before, that's unnecessary, and if you want it to be different, that's not possible (I can make it possible, but it isn't currently).

@kaazoo

Yes, you are right. Setting the session again makes no sense.

I did some testing with my wrapper methods to create a DrQueue job (IPython session) and some tasks. See https://github.com/kaazoo/DrQueueIPython/blob/master/DrQueue/client.py for details.

First step: create tasks of job

python2.6 sendjob_ipython.py -s 1 -e 5 -b 1 -r blender -f /usr/local/drqueue/tmp/icetest.blend -n "job032" -o "{'rendertype':'animation'}" --owner "foobar"

Have a look at them. They are pending:

python2.6 listjobs_ipython.py
Tasks of job job032:
msg_id                                 status    owner       completed at
0a348798-77ce-480e-a264-a726aa8d3c37   pending   foobar      2011-05-05 23:35:47
5e60e1ed-c531-4477-80e5-0ae7c760cc57   pending   foobar      2011-05-05 23:35:47
01915017-fcee-4b28-8e90-e50a364e8f96   pending   foobar      2011-05-05 23:35:47
f2158540-c58a-44b7-8e81-e47d6e828ece   pending   foobar      2011-05-05 23:35:47
4358e073-5641-49e4-b273-b58ed39e3d00   pending   foobar      2011-05-05 23:35:47

Wait a while. Now they are completed:

python2.6 listjobs_ipython.py
Tasks of job job032:
msg_id                                 status    owner       completed at
0a348798-77ce-480e-a264-a726aa8d3c37   ok        foobar      2011-05-05 23:42:19
5e60e1ed-c531-4477-80e5-0ae7c760cc57   ok        foobar      2011-05-05 23:42:19
01915017-fcee-4b28-8e90-e50a364e8f96   ok        foobar      2011-05-05 23:42:45
f2158540-c58a-44b7-8e81-e47d6e828ece   ok        foobar      2011-05-05 23:42:45
4358e073-5641-49e4-b273-b58ed39e3d00   ok        foobar      2011-05-05 23:42:58

Second step: requeue all tasks of job

python2.6 controljob_ipython.py -r -n job032
requeuing 0a348798-77ce-480e-a264-a726aa8d3c37
requeuing 5e60e1ed-c531-4477-80e5-0ae7c760cc57
requeuing 01915017-fcee-4b28-8e90-e50a364e8f96
requeuing f2158540-c58a-44b7-8e81-e47d6e828ece
requeuing 4358e073-5641-49e4-b273-b58ed39e3d00
Job job032 is running another time.

Have a look again. They are pending:

python2.6 listjobs_ipython.py
Tasks of job job032:
msg_id                                 status    owner       completed at
0a348798-77ce-480e-a264-a726aa8d3c37   pending   foobar      2011-05-05 23:35:47
5e60e1ed-c531-4477-80e5-0ae7c760cc57   pending   foobar      2011-05-05 23:35:47
01915017-fcee-4b28-8e90-e50a364e8f96   pending   foobar      2011-05-05 23:35:47
f2158540-c58a-44b7-8e81-e47d6e828ece   pending   foobar      2011-05-05 23:35:47
4358e073-5641-49e4-b273-b58ed39e3d00   pending   foobar      2011-05-05 23:35:47

Wait a while. Hmm, one task isn't finished, but the engines are idle:

python2.6 listjobs_ipython.py
Tasks of job job032:
msg_id                                 status    owner       completed at
0a348798-77ce-480e-a264-a726aa8d3c37   ok        foobar      2011-05-05 23:45:33
5e60e1ed-c531-4477-80e5-0ae7c760cc57   ok        foobar      2011-05-05 23:45:33
01915017-fcee-4b28-8e90-e50a364e8f96   ok        foobar      2011-05-05 23:45:58
f2158540-c58a-44b7-8e81-e47d6e828ece   ok        foobar      2011-05-05 23:45:58
4358e073-5641-49e4-b273-b58ed39e3d00   pending   foobar      2011-05-05 23:45:58

Third step: requeue again

python2.6 controljob_ipython.py -r -n job032
requeuing 0a348798-77ce-480e-a264-a726aa8d3c37
requeuing 5e60e1ed-c531-4477-80e5-0ae7c760cc57
requeuing 01915017-fcee-4b28-8e90-e50a364e8f96
requeuing f2158540-c58a-44b7-8e81-e47d6e828ece
Traceback (most recent call last):
  File "controljob_ipython.py", line 62, in <module>
    main()
  File "controljob_ipython.py", line 53, in main
    client.job_rerun(options.name)
  File "/Users/kaazoo/Documents/Entwicklung/drqueue-entwicklung/drqueue-zmq/DrQueue/client.py", line 217, in job_rerun
    self.task_requeue(task['msg_id'])
  File "/Users/kaazoo/Documents/Entwicklung/drqueue-entwicklung/drqueue-zmq/DrQueue/client.py", line 198, in task_requeue
    self.ip_client.resubmit(task_id)
  File "<string>", line 2, in resubmit
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/IPython/parallel/client/client.py", line 48, in spin_first
    return f(self, *args, **kwargs)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/IPython/parallel/client/client.py", line 1098, in resubmit
    raise self._unwrap_exception(content)
IPython.parallel.error.RemoteError: ValueError(Task u'4358e073-5641-49e4-b273-b58ed39e3d00' appears to be inflight)
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/IPython/parallel/controller/hub.py", line 1133, in resubmit_task
    raise ValueError("Task %r appears to be inflight"%(msg_id))
ValueError: Task u'4358e073-5641-49e4-b273-b58ed39e3d00' appears to be inflight

What could be the cause of this? There's a pending task that can't be run by an engine and can't be requeued.

@minrk
IPython member

Is this consistently reproducible?

The 'can't be run' and 'can't be requeued' symptoms are really one issue. A task listed as 'pending' is not allowed to be resubmitted, because that would allow a race condition on the result. So if it's stuck in 'pending', it will stay that way, and that's a bug (probably in the Scheduler). You can specify a timeout on tasks, which should at least prevent it from getting stuck.
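The guard can be pictured with a small stand-in for the controller's bookkeeping. This is a simplified sketch, not the Hub's implementation: ToyHub, its methods, and the record fields are hypothetical, and only the error message is taken from the traceback above. It also shows which parts of a record change on resubmit (result and client_id) and which do not (the header).

```python
class ToyHub(object):
    """Simplified stand-in for the controller's task records (not IPython code)."""

    def __init__(self):
        self.records = {}     # msg_id -> task record
        self.pending = set()  # msg_ids submitted but not yet finished

    def submit(self, msg_id, header, client_id):
        self.records[msg_id] = {
            "header": dict(header),
            "client_id": client_id,
            "result": None,
        }
        self.pending.add(msg_id)

    def finish(self, msg_id, result):
        self.records[msg_id]["result"] = result
        self.pending.discard(msg_id)

    def resubmit(self, msg_id, client_id):
        # Refuse while the task is pending: two runs could race on the result.
        if msg_id in self.pending:
            raise ValueError("Task %r appears to be inflight" % (msg_id,))
        record = self.records[msg_id]
        record["client_id"] = client_id  # the resubmitting client now owns it
        record["result"] = None          # result is produced again
        self.pending.add(msg_id)         # header is left untouched


hub = ToyHub()
hub.submit("task-1", {"session": "job032"}, client_id="client-a")
try:
    hub.resubmit("task-1", client_id="client-b")
except ValueError as exc:
    print("rejected:", exc)

hub.finish("task-1", result="ok")
hub.resubmit("task-1", client_id="client-b")
print(hub.records["task-1"])
```

In this model a task that never leaves the pending set can never be resubmitted, which is the stuck state described above.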

Does it have any dependencies? Can you do a db query on the task and post it here (excluding buffers)?
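A query along these lines should produce a record that is safe to paste. This is a sketch under assumptions: describe_task is a hypothetical helper, and apart from msg_id, submitted, and completed (which appear earlier in this thread), the key names are assumptions about the record schema. The only client call used is db_query(query, keys=...), in the same form as the recovery example above.

```python
# Keys to fetch; the buffer fields are deliberately left out.
# Names other than msg_id/submitted/completed are assumed, not confirmed.
TASK_KEYS = ["msg_id", "header", "submitted", "started", "completed", "engine_uuid"]


def describe_task(client, msg_id, keys=TASK_KEYS):
    """Return the stored record for one task without its binary buffers.

    `client` is anything with a db_query(query, keys=...) method,
    e.g. an IPython.parallel Client connected to a controller.
    """
    records = client.db_query({"msg_id": msg_id}, keys=list(keys))
    return records[0] if records else None
```

Calling describe_task(rc, '4358e073-5641-49e4-b273-b58ed39e3d00') against the running controller would return the stuck task's record, or None if no record matches.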

Does the controller log show that it arrived on an engine?

@minrk
IPython member

@kaazoo any updates on log output or patterns?

minrk added some commits May 4, 2011
@minrk minrk split serialize step of Session.send into separate method
This allows other objects to call it, and build serialized messages without sending.
21b0f4c
@minrk minrk add retries flag to LoadBalancedView
also add some lbv tests, and related fixes

closes gh-412
6549d09
@minrk minrk add Client.resubmit for re-running tasks
closes gh-411

* allow `content` in session.serialize to be a unicode object, because mongo+JSON cannot be relied upon to produce encoded bytes.
0c043a6
@minrk minrk various db backend fixes
* use index on msg_id in mongodb backend (_table prevented some methods from working outside the session)
* purge_request improved to use fewer db calls
* mongodb testcase split into its own file
* Fix equality testing, NULL handling, in SQLiteDB backend
ffe043d
@minrk minrk add db,resubmit/retries docs 4bb2eb4
@minrk minrk added a commit that referenced this pull request May 17, 2011
@minrk minrk Merge PR #413
Adds retries and resubmit logic to IPython.parallel.

closes gh-413
50deb54
@minrk minrk closed this in 50deb54 May 17, 2011
@kaazoo

I tried it again today after pulling from https://github.com/ipython/ipython.git, which already had your latest commits related to this topic. The error described above doesn't seem to happen anymore. Thanks.

@mattvonrocketstein mattvonrocketstein pushed a commit to mattvonrocketstein/ipython that referenced this pull request Nov 3, 2014
@minrk minrk Merge PR #413
Adds retries and resubmit logic to IPython.parallel.

closes gh-413
1bfd0f5