Retries and Resubmit (#411 and #412) #413

Closed


minrk
Member

@minrk minrk commented May 4, 2011

Add retries flag to LoadBalancedView, and resubmit method to Client.

Retry behavior is much the same as in the previous version. If tasks fail, they will be retried on other engines up to a limit. The default limit is 0 (no retries). retries is a flag like any other, so it can be set via the View.retries attribute, View.temp_flags(), View.set_flags(), etc.
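
As a rough illustration of the retry rule described above, here is a minimal pure-Python sketch. It is a toy model, not the scheduler code in this PR: `run_with_retries`, `flaky`, the engine names, and the `task(engine)` calling convention are all made up for the example. Only the rule itself (retry on a different engine, at most `retries` extra attempts, default 0) comes from the description.

```python
def run_with_retries(task, engines, retries=0):
    """Run task(engine) on successive engines, allowing `retries` extra attempts.

    Toy model only: each attempt goes to the next engine in `engines`, so a
    failed task is never retried on an engine that already failed it.
    Returns (result, engines_tried).
    """
    tried = []
    last_error = None
    for engine in engines:
        if len(tried) > retries:
            break  # the first attempt plus `retries` retries are used up
        tried.append(engine)
        try:
            return task(engine), tried
        except Exception as exc:
            last_error = exc
    raise RuntimeError("task failed on engines %r" % (tried,)) from last_error


def flaky(engine):
    # Hypothetical task: only the engine named "e2" can run it.
    if engine != "e2":
        raise ValueError("engine %s cannot run this task" % engine)
    return "done on %s" % engine
```

With `retries=2`, `run_with_retries(flaky, ["e0", "e1", "e2"], retries=2)` succeeds on the third engine; with the default of 0 the first failure is final and a `RuntimeError` is raised.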

Will close #412

@minrk
Member Author

minrk commented May 4, 2011

I added Client.resubmit([msg_ids]) for re-running previously finished, failed, or aborted tasks. This could also be used, to some degree, for resuming the queue of a crashed or shut-down cluster (assuming a db backend is in use):

from IPython.parallel import Client

# < start fresh Controller >
rc = Client()
# fetch all unfinished tasks (just IDs and timestamps)
unfinished = rc.db_query(dict(completed=None), keys=['msg_id', 'submitted'])
# restore submission order
reordered = sorted(unfinished, key=lambda d: d['submitted'])
# resubmit by msg_id
rc.resubmit([d['msg_id'] for d in reordered])
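
The reordering step in the snippet above can be tried without a running cluster. The sketch below uses made-up records in the shape that query asks for (a `msg_id` and a `submitted` timestamp); the helper name `ids_in_submission_order` and the sample data are assumptions for illustration only.

```python
from datetime import datetime


def ids_in_submission_order(records):
    """Return msg_ids sorted by their 'submitted' timestamp, oldest first."""
    ordered = sorted(records, key=lambda rec: rec["submitted"])
    return [rec["msg_id"] for rec in ordered]


# Made-up records standing in for the result of the db query.
unfinished = [
    {"msg_id": "task-c", "submitted": datetime(2011, 5, 4, 12, 0, 2)},
    {"msg_id": "task-a", "submitted": datetime(2011, 5, 4, 12, 0, 0)},
    {"msg_id": "task-b", "submitted": datetime(2011, 5, 4, 12, 0, 1)},
]

ordered_ids = ids_in_submission_order(unfinished)
print(ordered_ids)
```

The resulting list is what would be handed to the resubmit call, so tasks re-enter the queue in the order they were originally submitted.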

resubmit, like db_query, is a Client method, not a View method. I could easily be persuaded that Views should see this as well.

Currently, this enforces that the resubmitted task is identical to the original, and only submitted via the load-balanced scheduler. It is feasible that one would want to make some kind of alteration to the header, or support resubmit of MUX operations, particularly with the goal of restoring cluster state.

This should address #411

@kaazoo

kaazoo commented May 4, 2011

When I manually try to resubmit a task with the following code, I get an error message in ipcontroller. However, the task seems to be run again.

from IPython.parallel import Client
rc = Client()
rc.session.session = "job98"
rc.resubmit('99ef4117-dc9c-42b1-bad1-2ba44e1e9896')
[IPControllerApp] client::client '45e8113c-156a-42dc-9f35-71db2671ad84' requested 'resubmit_request'
[IPControllerApp] conflicting initial state for record: 99ef4117-dc9c-42b1-bad1-2ba44e1e9896:a762a146-a47b-4a18-9f49-8a383079cbf6 <> [u'45e8113c-156a-42dc-9f35-71db2671ad84']
[IPControllerApp] conflicting initial state for record: 99ef4117-dc9c-42b1-bad1-2ba44e1e9896:2011-05-05 00:50:30.171437 <> 2011-05-05 00:50:30.171000
[IPControllerApp] conflicting initial state for record: 99ef4117-dc9c-42b1-bad1-2ba44e1e9896:{"msg_ids":["99ef4117-dc9c-42b1-bad1-2ba44e1e9896"]} <> {}

@minrk
Member Author

minrk commented May 5, 2011

Ah, that's what I get for only testing with SQLite. Should work with mongodb now.

@minrk
Member Author

minrk commented May 5, 2011

I think I fixed that conflicting state issue just now, and a couple small related issues.

I see that you are doing the session override even on resubmits. Note that this will have no effect at all on the resubmitted tasks - they will be identical to the original in their headers, etc. which contain that information. The only information that changes is the result (potentially), the associated client_id, and information related to where/when it runs.

If you are doing that to make sure it's the same as before, that's unnecessary, and if you want it to be different, that's not possible (I can make it possible, but it isn't currently).
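
To make the point about identical headers concrete, here is a small toy model of a task record being resubmitted. The field names (`header`, `client_id`, `engine_id`, `started`, `completed`, `result`) and the helper `resubmitted_record` are hypothetical and chosen only for illustration; they are not taken from the Hub's actual record schema.

```python
import copy

# Hypothetical fields that describe one particular run of a task.
RESET_ON_RESUBMIT = ("engine_id", "started", "completed", "result")


def resubmitted_record(original, new_client_id):
    """Return a copy of `original` as it might look right after a resubmit.

    The header (which carries the session) is copied untouched; only the
    per-run fields and the associated client change.
    """
    record = copy.deepcopy(original)
    for key in RESET_ON_RESUBMIT:
        record[key] = None
    record["client_id"] = new_client_id
    return record
```

In this model, overriding the session on the client before resubmitting has nowhere to take effect: the header is copied from the stored record, not rebuilt from the client's current session.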

@kaazoo

kaazoo commented May 5, 2011

Yes, you are right. Setting the session again makes no sense.

I did some testing with my wrapper methods to create a DrQueue job (IPython session) and some tasks. See https://github.com/kaazoo/DrQueueIPython/blob/master/DrQueue/client.py for details.

First step: create tasks of job

python2.6 sendjob_ipython.py -s 1 -e 5 -b 1 -r blender -f /usr/local/drqueue/tmp/icetest.blend -n "job032" -o "{'rendertype':'animation'}" --owner "foobar"

Have a look at them. They are pending:

python2.6 listjobs_ipython.py
Tasks of job job032:
msg_id                                 status    owner       completed at
0a348798-77ce-480e-a264-a726aa8d3c37   pending   foobar      2011-05-05 23:35:47
5e60e1ed-c531-4477-80e5-0ae7c760cc57   pending   foobar      2011-05-05 23:35:47
01915017-fcee-4b28-8e90-e50a364e8f96   pending   foobar      2011-05-05 23:35:47
f2158540-c58a-44b7-8e81-e47d6e828ece   pending   foobar      2011-05-05 23:35:47
4358e073-5641-49e4-b273-b58ed39e3d00   pending   foobar      2011-05-05 23:35:47

Wait a while. Now they are completed:

python2.6 listjobs_ipython.py
Tasks of job job032:
msg_id                                 status    owner       completed at
0a348798-77ce-480e-a264-a726aa8d3c37   ok        foobar      2011-05-05 23:42:19
5e60e1ed-c531-4477-80e5-0ae7c760cc57   ok        foobar      2011-05-05 23:42:19
01915017-fcee-4b28-8e90-e50a364e8f96   ok        foobar      2011-05-05 23:42:45
f2158540-c58a-44b7-8e81-e47d6e828ece   ok        foobar      2011-05-05 23:42:45
4358e073-5641-49e4-b273-b58ed39e3d00   ok        foobar      2011-05-05 23:42:58

Second step: requeue all tasks of job

python2.6 controljob_ipython.py -r -n job032
requeuing 0a348798-77ce-480e-a264-a726aa8d3c37
requeuing 5e60e1ed-c531-4477-80e5-0ae7c760cc57
requeuing 01915017-fcee-4b28-8e90-e50a364e8f96
requeuing f2158540-c58a-44b7-8e81-e47d6e828ece
requeuing 4358e073-5641-49e4-b273-b58ed39e3d00
Job job032 is running another time.

Have a look again. They are pending:

python2.6 listjobs_ipython.py
Tasks of job job032:
msg_id                                 status    owner       completed at
0a348798-77ce-480e-a264-a726aa8d3c37   pending   foobar      2011-05-05 23:35:47
5e60e1ed-c531-4477-80e5-0ae7c760cc57   pending   foobar      2011-05-05 23:35:47
01915017-fcee-4b28-8e90-e50a364e8f96   pending   foobar      2011-05-05 23:35:47
f2158540-c58a-44b7-8e81-e47d6e828ece   pending   foobar      2011-05-05 23:35:47
4358e073-5641-49e4-b273-b58ed39e3d00   pending   foobar      2011-05-05 23:35:47

Wait a while. Hmm, one task isn't finished, but the engines are idle:

python2.6 listjobs_ipython.py
Tasks of job job032:
msg_id                                 status    owner       completed at
0a348798-77ce-480e-a264-a726aa8d3c37   ok        foobar      2011-05-05 23:45:33
5e60e1ed-c531-4477-80e5-0ae7c760cc57   ok        foobar      2011-05-05 23:45:33
01915017-fcee-4b28-8e90-e50a364e8f96   ok        foobar      2011-05-05 23:45:58
f2158540-c58a-44b7-8e81-e47d6e828ece   ok        foobar      2011-05-05 23:45:58
4358e073-5641-49e4-b273-b58ed39e3d00   pending   foobar      2011-05-05 23:45:58

Third step: requeue again

python2.6 controljob_ipython.py -r -n job032
requeuing 0a348798-77ce-480e-a264-a726aa8d3c37
requeuing 5e60e1ed-c531-4477-80e5-0ae7c760cc57
requeuing 01915017-fcee-4b28-8e90-e50a364e8f96
requeuing f2158540-c58a-44b7-8e81-e47d6e828ece
Traceback (most recent call last):
  File "controljob_ipython.py", line 62, in <module>
    main()
  File "controljob_ipython.py", line 53, in main
    client.job_rerun(options.name)
  File "/Users/kaazoo/Documents/Entwicklung/drqueue-entwicklung/drqueue-zmq/DrQueue/client.py", line 217, in job_rerun
    self.task_requeue(task['msg_id'])
  File "/Users/kaazoo/Documents/Entwicklung/drqueue-entwicklung/drqueue-zmq/DrQueue/client.py", line 198, in task_requeue
    self.ip_client.resubmit(task_id)
  File "<string>", line 2, in resubmit
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/IPython/parallel/client/client.py", line 48, in spin_first
    return f(self, *args, **kwargs)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/IPython/parallel/client/client.py", line 1098, in resubmit
    raise self._unwrap_exception(content)
IPython.parallel.error.RemoteError: ValueError(Task u'4358e073-5641-49e4-b273-b58ed39e3d00' appears to be inflight)
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/IPython/parallel/controller/hub.py", line 1133, in resubmit_task
    raise ValueError("Task %r appears to be inflight"%(msg_id))
ValueError: Task u'4358e073-5641-49e4-b273-b58ed39e3d00' appears to be inflight

What could be the cause of this? There's a pending task that can't be run by an engine and can't be requeued.

@minrk
Member Author

minrk commented May 6, 2011

Is this consistently reproducible?

The 'can't be run' and 'can't be requeued' are really one issue. If a job is listed as 'pending', it's not allowed to be resubmitted, because that would allow a race condition on the result. So if it's stuck in 'pending', it will stay that way, and that's a bug (probably in the Scheduler). You can specify a timeout on tasks, which should at least prevent them from getting stuck.
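
The guard described above can be modelled in a few lines. This is a toy sketch, not the Hub's code: the `ToyHub` class and its methods are invented for illustration, and only the error message wording is taken from the traceback earlier in this thread.

```python
class ToyHub:
    """Toy bookkeeping for pending tasks and their results."""

    def __init__(self):
        self.pending = set()   # msg_ids submitted but not yet finished
        self.results = {}      # msg_id -> result of the latest run

    def submit(self, msg_id):
        self.pending.add(msg_id)

    def finish(self, msg_id, result):
        self.pending.discard(msg_id)
        self.results[msg_id] = result

    def resubmit(self, msg_id):
        # Refuse while a run is still outstanding: two runs of the same
        # msg_id would race to write the same result slot.
        if msg_id in self.pending:
            raise ValueError("Task %r appears to be inflight" % msg_id)
        self.results.pop(msg_id, None)
        self.pending.add(msg_id)
```

In this model a task that never leaves `pending` can never be resubmitted, which is why a stuck 'pending' task shows up as both symptoms at once.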

Does it have any dependencies? Can you do a db query on the task and post it here (excluding buffers)?

Does the controller log show that it arrived on an engine?

@minrk
Member Author

minrk commented May 13, 2011

@kaazoo any updates on log output or patterns?

minrk added 5 commits May 17, 2011 14:27
This allows other objects to call it, and build serialized messages without sending.
also add some lbv tests, and related fixes

closes gh-412
closes gh-411

* allow `content` in session.serialize to be a unicode object, because mongo+JSON cannot be relied upon to produce encoded bytes.
* use index on msg_id in mongodb backend (_table prevented some methods from working outside the session)
* purge_request improved to use fewer db calls
* mongodb testcase split into its own file
* Fix equality testing, NULL handling, in SQLiteDB backend
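
The last bullet touches on a general SQL rule that matters for queries such as `completed=None`: comparing a column to NULL with `=` never matches, so such queries need `IS NULL`. The sketch below demonstrates that rule with the standard `sqlite3` module. It is not the SQLiteDB backend's code; the `tasks` table, its columns, and `find_by_completed` are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (msg_id TEXT PRIMARY KEY, completed TEXT)")
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?)",
    [("a", "2011-05-05 23:42:19"), ("b", None)],
)


def find_by_completed(conn, value):
    """Return msg_ids whose `completed` column equals `value` (None means NULL)."""
    if value is None:
        rows = conn.execute("SELECT msg_id FROM tasks WHERE completed IS NULL")
    else:
        rows = conn.execute(
            "SELECT msg_id FROM tasks WHERE completed = ?", (value,)
        )
    return [row[0] for row in rows]


# Naive version: binding None to "= ?" compares against NULL and matches nothing.
naive = [
    row[0]
    for row in conn.execute("SELECT msg_id FROM tasks WHERE completed = ?", (None,))
]
```

Here `naive` comes back empty even though task "b" has no completion time, while `find_by_completed(conn, None)` finds it.
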
@minrk minrk closed this in 50deb54 May 17, 2011
@kaazoo

kaazoo commented May 24, 2011

I tried it again today after pulling from https://github.com/ipython/ipython.git, which already had your last commits related to this topic. The error situation described above doesn't seem to happen anymore. Thanks.

mattvonrocketstein pushed a commit to mattvonrocketstein/ipython that referenced this pull request Nov 3, 2014
Adds retries and resubmit logic to IPython.parallel.

closes gh-413