parallel tasks never finish under heavy load #3303

Closed · akhmerov opened this issue May 10, 2013 · 15 comments · Fixed by #3304

@akhmerov (Member)

I run an IPython cluster and observe very strange behavior: when there are too many tasks, at some point the cluster stops completely. I was able to narrow the issue down to this:

from IPython.parallel import Client

c = Client()  # 250 engines
with c[:].sync_imports():
    from time import sleep
f = lambda x: sleep(.5)
lv = c.load_balanced_view()
res = lv.map_async(f, range(40000))  # takes horribly long due to issue #3106
res.progress  # in my current run this stops at 2000

This was done with the Python scheduler and hwm set to 0, to partially mitigate the effect of #3106. I see no engines unregistering and nothing out of the ordinary in the logs. The controller, kernel, and engine CPU loads stay minimal when this happens, so it cannot be just a slowdown. The issue occurs for short tasks, as in the example above, but also when a single task takes ~50 sec.
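(For reference, a minimal sketch of the scheduler setting mentioned above, assuming the standard ipcontroller_config.py of the active profile:)

# ipcontroller_config.py (sketch): hwm=0 means no per-engine limit on the
# number of outstanding tasks assigned by the Python task scheduler.
c = get_config()
c.TaskScheduler.hwm = 0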

I can investigate the issue more, but I don't quite know where to start.

@minrk (Member) commented May 10, 2013

Can you try to reproduce with master as well? (I assume mitigating #3106 means you are using 0.13, since it doesn't affect master.)

Also try the pure ZMQ scheduler, and see if it makes any difference.

@minrk (Member) commented May 10, 2013

Does the number of engines matter? Can you reproduce with a smaller number of engines?

@akhmerov (Member, Author)

I can reproduce the bug on 4 engines; however, the number of successful jobs goes up to 7000, which is significantly more successful jobs per engine. I cannot reproduce it on a cluster running locally on my laptop with 0.13.2. It will take me some more time to try to reproduce it on master or 0.13.2 on our cluster, where the issue was originally observed.

@akhmerov (Member, Author)

I encounter the following error when trying to use master (most likely an unrelated bug):

In [2]: c = Client('/home/akhmerov/.ipython/BqsoV7/security/ipcontroller-client.json')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-2-5fd9b03f6c17> in <module>()
----> 1 c = Client('/home/akhmerov/.ipython/BqsoV7/security/ipcontroller-client.json')

/clusterdata/python/env_testing/lib/python2.7/site-packages/IPython/parallel/client/client.pyc in __init__(self, url_file,
profile, profile_dir, ipython_dir, context, debug, sshserver, sshkey, password, paramiko, timeout, cluster_id,
**extra_args)
    410             cfg = json.load(f)
    411 
--> 412         self._task_scheme = cfg['task_scheme']
    413 
    414         # sync defaults from args, json:

KeyError: 'task_scheme'

@akhmerov (Member, Author)

v0.13.2, pure ZMQ scheduler:

In [31]: c.queue_status() # After the bug has occurred, so nothing changes anymore
Out[31]: 
{0: {u'completed': 3049, u'queue': 21020, u'tasks': 0},
 1: {u'completed': 7083, u'queue': 16981, u'tasks': 0},
 2: {u'completed': 7072, u'queue': 16989, u'tasks': 0},
 3: {u'completed': 7213, u'queue': 16850, u'tasks': 0},
 u'unassigned': 0}

In [32]: sum([1 for i in res if i.ready()])
Out[32]: 40

The controller log has a bunch of lines like:

2013-05-10 21:38:09.434 [IPControllerApp] client::client 'd5124c2c-9efe-49d0-9811-950e8ff39fbb' requested u'queue_request'
2013-05-10 21:38:10.939 [IPControllerApp] heartbeat::sending 762.007559061

The engine logs have:

2013-05-10 21:30:16.513 [IPEngineApp] 
*** MESSAGE TYPE:apply_request***
2013-05-10 21:30:16.513 [IPEngineApp]    Content: {}
--->

Queue status seems to be the thing that goes wrong: the engines report very different queue lengths each time the stop occurs.

@akhmerov (Member, Author)

Reproduced on master with 4 engines running locally on a laptop, with the same symptoms. By the way, the fix for #3106 does not seem to work anymore: the submission time of 40000 jobs is huge again.

@akhmerov (Member, Author)

A completely different problem appears with hwm set to 1: the calculation finishes, and the queue status shows the correct number of finished jobs, but only 28262 out of 40000 AsyncResults report that they are ready. Below is the metadata of one result that isn't ready after everything has finished:

{'after': None,
 'completed': None,
 'data': {},
 'engine_id': None,
 'engine_uuid': None,
 'follow': None,
 'msg_id': None,
 'outputs': [],
 'outputs_ready': False,
 'pyerr': None,
 'pyin': None,
 'pyout': None,
 'received': None,
 'started': None,
 'status': None,
 'stderr': '',
 'stdout': '',
 'submitted': datetime.datetime(2013, 5, 10, 16, 21, 57, 711630)}
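(A small sketch, not from the original report, of how such stuck results can be collected; it assumes the c and res objects from the earlier snippets and that Client.metadata maps msg_ids to metadata dicts like the one above:)

# Hypothetical inspection helper: list tasks whose metadata never got a status.
stuck = [msg_id for msg_id in res.msg_ids
         if c.metadata[msg_id]['status'] is None]
print("%d of %d tasks have no status" % (len(stuck), len(res.msg_ids)))
print(c.metadata[stuck[0]])  # e.g. the metadata dict shown above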

@minrk (Member) commented May 11, 2013

What libzmq version are you running (zmq.zmq_version())? I think we are running into libzmq's own HWM behavior here, and I think I have a fix in mind.
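(For reference, both the libzmq and pyzmq versions can be printed with the standard pyzmq helpers:)

import zmq
print(zmq.zmq_version())    # libzmq version, e.g. '3.2.2'
print(zmq.pyzmq_version())  # pyzmq binding version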

@akhmerov (Member, Author)

I only saw the bug in 3.2.2 (and three different IPython versions). It could well be ZMQ; however, note that the error also appeared when tasks lasted 50 seconds.

@minrk (Member) commented May 11, 2013

Does it appear when tasks last 50 seconds and you have TaskScheduler.hwm != 0? The task duration doesn't matter, as the zmq HWM drops messages during task assignment.

> I only saw the bug in 3.2.2

Excellent, then I think I am on the right track. zmq 3 changed the default HWM from infinite to 1000, so if you try to assign more than 1000 tasks to the engines as fast as you can, messages will be dropped.
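(An illustrative sketch, not the actual patch in #3304, of what restoring the old behavior looks like at the socket level; under libzmq 3, an HWM of 0 means unlimited:)

import zmq

ctx = zmq.Context.instance()
router = ctx.socket(zmq.ROUTER)
router.setsockopt(zmq.SNDHWM, 0)  # 0 restores the "no limit" behavior of libzmq 2
router.setsockopt(zmq.RCVHWM, 0)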

@akhmerov (Member, Author)

> Does it appear when tasks last 50 seconds and you have TaskScheduler.hwm != 0?

I didn't test it, and I'm not sure it would be possible with 0.13 due to #3106.

> Excellent, then I think I am on the right track.

That seems very reasonable; I wasn't able to reproduce it with 2.x.x. Does this also explain the issue with TaskScheduler.hwm=1 (4 engines, 4e4 tasks of 0.1 sec each) that I observed in dev?

@minrk (Member) commented May 11, 2013

Try #3304, and see how your tests fare.

@akhmerov (Member, Author)

#3304 seems to fix both the hwm=1 and hwm=0 problems.

I still see #2215 on master, and the submission of 40000 jobs now takes around a minute, so something has reinstated #3106. Unlike before, hwm=1 doesn't take 100% CPU, so things are better at least in that respect.

Is there a quick and dirty way to apply #3304 to 0.13?

I don't find the pure ZMQ scheduler a viable option right now, because the engines unregister every now and then.

@minrk (Member) commented May 11, 2013

> I still see #2215 on master, and the submission of 40000 jobs now takes around a minute, so something has reinstated #3106. Unlike before, hwm=1 doesn't take 100% CPU, so things are better at least in that respect.

#3106 has been fixed: submission of jobs is now linear in the number of jobs, and I see that it can submit about 1600 tasks/sec pretty consistently on my laptop, so there isn't a regression; it's just not as fast as it should be.
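(A rough way to measure the submission rate yourself, assuming the lv and f objects from the original snippet; map_async returns once all tasks have been submitted:)

import time

t0 = time.time()
amr = lv.map_async(f, range(40000))
print("%.0f tasks/sec submitted" % (40000 / (time.time() - t0)))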

I should note that it is probably a good idea to set chunksize to something greater than one, which bundles the tasks and submits them much faster, with minimal penalty to load balancing when there are lots of tasks:

amr = view.map_async(f, args, chunksize=8)  # each message carries 8 tasks instead of 1

> Is there a quick and dirty way to apply #3304 to 0.13?

You are welcome to try manual patches via git cherry-pick or git apply, but it will not be supported.

@akhmerov (Member, Author)

Sorry for the wrong claims. #3106 is indeed fixed, and what I saw was not #2215 but the issue fixed by #3308.

I understand that manual patching wouldn't be supported.

@minrk closed this as completed in d7f1ab1 on May 14, 2013.
mattvonrocketstein pushed a commit to mattvonrocketstein/ipython that referenced this issue on Nov 3, 2014:
set unlimited HWM for all relay devices

libzmq-3 changed the default HWM for sockets to 1000 from infinite, which can result in dropped messages on ROUTER or PUB sockets under high load.

This PR explicitly sets these back to infinite to avoid dropped messages.

closes ipython#3303