parallel tasks never finish under heavy load #3303

Closed · akhmerov opened this issue May 10, 2013 · 15 comments · Fixed by #3304

@akhmerov (Member)

I run an IPython cluster and observe very strange behavior: when there are too many tasks, at some point the cluster stops completely. I was able to narrow the issue down to this:

from IPython.parallel import Client

c = Client()  # 250 engines
with c[:].sync_imports():
    from time import sleep
f = lambda x: sleep(.5)
lv = c.load_balanced_view()
res = lv.map_async(f, range(40000))  # takes horribly long due to issue #3106
res.progress  # in my current run this stops at 2000

This was done with the Python scheduler and hwm set to 0, to partially mitigate the effect of #3106. I see no engines unregistering and nothing out of the ordinary in the logs. The controller, kernel, and engine CPU loads stay minimal when this happens, so it cannot be just a slowdown. The issue occurs for short tasks, as in the example above, but also when a single task takes ~50 sec.
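(For reference, a minimal sketch of the scheduler setting mentioned above, assuming the standard ipcontroller_config.py of the active profile:)

# ipcontroller_config.py (sketch): hwm=0 means no per-engine limit on the
# number of outstanding tasks assigned by the Python task scheduler.
c = get_config()
c.TaskScheduler.hwm = 0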

I can investigate the issue more, but I don't quite know where to start.

@minrk (Member) commented May 10, 2013

Can you try to reproduce with master as well? (I assume mitigating #3106 means you are using 0.13, since it doesn't affect master.)

Also try the pure ZMQ scheduler, and see if it makes any difference.

@minrk (Member) commented May 10, 2013

Does the number of engines matter? Can you reproduce with a smaller number of engines?

@akhmerov (Member, Author)

I can reproduce the bug on 4 engines; however, the number of successful jobs goes up to 7000, which is significantly more successful jobs per engine. I cannot reproduce it on a cluster running locally on my laptop with 0.13.2. It will take me some more time to try to reproduce it on master or 0.13.2 on our cluster, where the issue was originally observed.

@akhmerov (Member, Author)

I encounter the following error when trying to use master (most likely an unrelated bug):

In [2]: c = Client('/home/akhmerov/.ipython/BqsoV7/security/ipcontroller-client.json')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-2-5fd9b03f6c17> in <module>()
----> 1 c = Client('/home/akhmerov/.ipython/BqsoV7/security/ipcontroller-client.json')

/clusterdata/python/env_testing/lib/python2.7/site-packages/IPython/parallel/client/client.pyc in __init__(self, url_file,
profile, profile_dir, ipython_dir, context, debug, sshserver, sshkey, password, paramiko, timeout, cluster_id,
**extra_args)
    410             cfg = json.load(f)
    411 
--> 412         self._task_scheme = cfg['task_scheme']
    413 
    414         # sync defaults from args, json:

KeyError: 'task_scheme'

@akhmerov (Member, Author)

v0.13.2, pure ZMQ scheduler:

In [31]: c.queue_status() # After the bug has occurred, so nothing changes anymore
Out[31]: 
{0: {u'completed': 3049, u'queue': 21020, u'tasks': 0},
 1: {u'completed': 7083, u'queue': 16981, u'tasks': 0},
 2: {u'completed': 7072, u'queue': 16989, u'tasks': 0},
 3: {u'completed': 7213, u'queue': 16850, u'tasks': 0},
 u'unassigned': 0}

In [32]: sum([1 for i in res if i.ready()])
Out[32]: 40

The controller log has a bunch of lines like:

2013-05-10 21:38:09.434 [IPControllerApp] client::client 'd5124c2c-9efe-49d0-9811-950e8ff39fbb' requested u'queue_request'
2013-05-10 21:38:10.939 [IPControllerApp] heartbeat::sending 762.007559061

The engine logs have:

2013-05-10 21:30:16.513 [IPEngineApp] 
*** MESSAGE TYPE:apply_request***
2013-05-10 21:30:16.513 [IPEngineApp]    Content: {}
--->

Queue status seems to be the thing that goes wrong: the engines report very different queue lengths each time the stop occurs.

@akhmerov (Member, Author)

Reproduced on master with 4 engines running locally on a laptop, with the same symptoms. By the way, the fix for #3106 does not seem to work anymore: the submission time of 40000 jobs is huge again.

@akhmerov (Member, Author)

A completely different problem appears with hwm set to 1: the calculation finishes, and the queue status shows the correct number of finished jobs, but only 28262 out of 40000 AsyncResults report that they are ready. Below is the metadata of one result that isn't ready after everything has finished:

{'after': None,
 'completed': None,
 'data': {},
 'engine_id': None,
 'engine_uuid': None,
 'follow': None,
 'msg_id': None,
 'outputs': [],
 'outputs_ready': False,
 'pyerr': None,
 'pyin': None,
 'pyout': None,
 'received': None,
 'started': None,
 'status': None,
 'stderr': '',
 'stdout': '',
 'submitted': datetime.datetime(2013, 5, 10, 16, 21, 57, 711630)}
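(A small sketch, not from the original report, of how such stuck results can be collected; it assumes the c and res objects from the earlier snippets and that Client.metadata maps msg_ids to metadata dicts like the one above:)

# Hypothetical inspection helper: list tasks whose metadata never got a status.
stuck = [msg_id for msg_id in res.msg_ids
         if c.metadata[msg_id]['status'] is None]
print("%d of %d tasks have no status" % (len(stuck), len(res.msg_ids)))
print(c.metadata[stuck[0]])  # e.g. the metadata dict shown above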

@minrk (Member) commented May 11, 2013

What libzmq version are you running (zmq.zmq_version())? I think we are running into libzmq's own HWM behavior here, and I think I have a fix in mind.
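(For reference, both the libzmq and pyzmq versions can be printed with the standard pyzmq helpers:)

import zmq
print(zmq.zmq_version())    # libzmq version, e.g. '3.2.2'
print(zmq.pyzmq_version())  # pyzmq binding version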

@akhmerov (Member, Author)

I only saw the bug in 3.2.2 (and three different IPython versions). It could well be ZMQ; however, note that the error also appeared when tasks lasted 50 seconds.

@minrk (Member) commented May 11, 2013

Does it appear when tasks last 50 seconds and you have TaskScheduler.hwm != 0? The task duration doesn't matter, as the zmq HWM drops messages during task assignment.

> I only saw the bug in 3.2.2

Excellent, then I think I am on the right track. zmq 3 changed the default HWM from infinite to 1000, so if you try to assign more than 1000 tasks to the engines as fast as you can, messages will be dropped.
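(An illustrative sketch, not the actual patch in #3304, of what restoring the old behavior looks like at the socket level; under libzmq 3, an HWM of 0 means unlimited:)

import zmq

ctx = zmq.Context.instance()
router = ctx.socket(zmq.ROUTER)
router.setsockopt(zmq.SNDHWM, 0)  # 0 restores the "no limit" behavior of libzmq 2
router.setsockopt(zmq.RCVHWM, 0)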

@akhmerov (Member, Author)

> Does it appear when tasks last 50 seconds and you have TaskScheduler.hwm != 0?

I didn't test it, and I'm not sure it would be possible with 0.13 due to #3106.

> Excellent, then I think I am on the right track.

That seems very reasonable; I wasn't able to reproduce it with 2.x.x. Does this also explain the issue with TaskScheduler.hwm=1 (4 engines, 4e4 tasks of 0.1 sec each) that I observed in dev?

@minrk (Member) commented May 11, 2013

Try #3304, and see how your tests fare.

@akhmerov (Member, Author)

#3304 seems to fix both the hwm=1 and hwm=0 problems.

I still see #2215 on master, and the submission of 40000 jobs now takes around a minute, so something has reinstated #3106. Unlike before, hwm=1 doesn't take 100% CPU, so things are better at least in that respect.

Is there a quick and dirty way to apply #3304 to 0.13?

I don't find the pure ZMQ scheduler a viable option right now, because the engines unregister every now and then.

@minrk (Member) commented May 11, 2013

> I still see #2215 on master, and the submission of 40000 jobs now takes around a minute, so something has reinstated #3106. Unlike before, hwm=1 doesn't take 100% CPU, so things are better at least in that respect.

#3106 has been fixed: submission of jobs is now linear in the number of jobs, and I see that it can submit about 1600 tasks/sec pretty consistently on my laptop, so there isn't a regression; it's just not as fast as it should be.
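(A rough way to measure the submission rate yourself, assuming the lv and f objects from the original snippet; map_async returns once all tasks have been submitted:)

import time

t0 = time.time()
amr = lv.map_async(f, range(40000))
print("%.0f tasks/sec submitted" % (40000 / (time.time() - t0)))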

I should note that it is probably a good idea to set chunksize to something greater than one, which bundles the tasks and submits them much faster, with minimal penalty to load balancing when there are lots of tasks:

amr = view.map_async(f, args, chunksize=8)  # each message carries 8 tasks instead of 1

> Is there a quick and dirty way to apply #3304 to 0.13?

You are welcome to try manual patches via git cherry-pick or git apply, but it will not be supported.

@akhmerov (Member, Author)

Sorry for the wrong claims. #3106 is indeed fixed, and what I saw was not #2215 but the issue fixed by #3308.

I understand that manual patching wouldn't be supported.

@minrk closed this as completed in d7f1ab1 on May 14, 2013.
mattvonrocketstein pushed a commit to mattvonrocketstein/ipython that referenced this issue on Nov 3, 2014:
set unlimited HWM for all relay devices

libzmq-3 changed the default HWM for sockets to 1000 from infinite, which can result in dropped messages on ROUTER or PUB sockets under high load.

This PR explicitly sets these back to infinite to avoid dropped messages.

closes ipython#3303