Race condition when running lbview.apply() fast multiple times in loop #401
I'm intrigued by what you mean by 'not created'. What is the actual symptom? It must get a msg_id, I don't see how it's possible for that to fail. Are there any errors in the output of the Controller? We do have the option to wait for a message to actually get sent to the scheduler. Just set I don't actually see where there could be a race condition on a buffer here, though, so there's quite possibly a bug somewhere. I would like some more information on what particular step is failing to happen. To see if all your tasks were created/submitted, you can do:
If all of those lists are the same, then every task was submitted and finished. If there was a race condition, then the data in some tasks could have been clobbered by the following one. You can check on the message data to see if any of them are actually not what they should be.
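As a rough illustration of the comparison described above, the sketch below takes the list of submitted msg_ids and the records returned by a query, and reports which tasks were recorded, which arrived at an engine, which finished, and which are missing. The helper name `summarize_tasks` is hypothetical and not part of IPython.parallel. It assumes each record is a dict with a 'msg_id' key plus the 'engine_uuid' and 'completed' keys used in the test script in this thread.

```python
def summarize_tasks(history, records):
    """Compare submitted msg_ids against the task records that were found.

    `history` is a list of msg_ids (e.g. client.history); `records` is a
    list of dicts, each assumed to carry 'msg_id', 'engine_uuid' and
    'completed' keys. This helper is hypothetical, not a library function.
    """
    recorded = [rec['msg_id'] for rec in records]
    arrived = [rec['msg_id'] for rec in records
               if rec.get('engine_uuid') is not None]
    finished = [rec['msg_id'] for rec in records
                if rec.get('completed') is not None]
    recorded_set = set(recorded)
    # msg_ids that were submitted but have no record at all
    missing = [msg_id for msg_id in history if msg_id not in recorded_set]
    return {
        'recorded': recorded,
        'arrived': arrived,
        'finished': finished,
        'missing': missing,
    }


# Small stand-in data; real records would come from client.db_query(...)
history = ['a', 'b', 'c']
records = [
    {'msg_id': 'a', 'engine_uuid': 'engine-1', 'completed': '2011-04-20'},
    {'msg_id': 'b', 'engine_uuid': None, 'completed': None},
]
summary = summarize_tasks(history, records)
```

If `summary['missing']` is empty and the other three lists have the same length as `history`, every task was recorded, assigned and completed.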
Here's a stripped down script which creates tasks in a loop:

    from IPython.parallel import Client
    import time
    import os

    client = Client()
    lbview = client.load_balanced_view()

    def do_nothing(render_script, env_dict):
        os.listdir(".")

    num_tasks = 100
    task_frames = range(1, num_tasks + 1, 1)

    for x in task_frames:
        env_dict = {}
        render_script = "foo"
        ar = lbview.apply(do_nothing, render_script, env_dict)
        # avoid race condition
        #time.sleep(0.2)

    records = client.db_query({'msg_id' : {'$in' : client.history}}, keys=['header', 'completed', 'engine_uuid'])
    print "records: "+str(len(records))
    print "history: "+str(len(client.history))

    time.sleep(20)

    arrived = filter(lambda rec: rec['engine_uuid'] is not None, records)
    print "arrived tasks: "+str(len(arrived))
    finished = filter(lambda rec: rec['completed'] is not None, records)
    print "finished tasks: "+str(len(finished))

It gives different output, but you can see that not all tasks can be created:

    $ python2.6 job_create_race.py
    records: 56
    history: 100
    arrived tasks: 29
    finished tasks: 25
    $ python2.6 job_create_race.py
    records: 22
    history: 100
    arrived tasks: 10
    finished tasks: 6
    $ python2.6 job_create_race.py
    records: 8
    history: 100
    arrived tasks: 5
    finished tasks: 1

If you wait a short while after each task with "time.sleep(0.2)", then you seem to get all tasks:

    $ python2.6 job_create_race.py
    records: 100
    history: 100
    arrived tasks: 100
    finished tasks: 100
Ah, okay. I thought you were saying that the jobs weren't happening, but it's actually a problem in the record creation. Have you tried this with the Note that you are performing the I can reproduce this in master, but not in my
Yes, you are right. I modified my test script like you suggested.
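The exact change made to the test script is not shown in this thread. One possible approach, sketched here as an assumption rather than as the suggested fix, is to poll the query until the number of records matches the number of submitted tasks, instead of querying once straight after the loop. The helper `wait_for_records` and its parameters are hypothetical; it accepts any zero-argument callable, so it does not depend on a running cluster.

```python
import time


def wait_for_records(query, expected, timeout=10.0, interval=0.1):
    """Call `query()` repeatedly until it returns at least `expected`
    records or `timeout` seconds have passed, then return the last result.

    Hypothetical helper, not part of IPython.parallel. `query` is any
    zero-argument callable returning a list of records.
    """
    deadline = time.time() + timeout
    while True:
        records = query()
        if len(records) >= expected or time.time() >= deadline:
            return records
        time.sleep(interval)


# With the script from this thread, the call might look like this
# (left commented out because it needs a running controller):
#
# records = wait_for_records(
#     lambda: client.db_query({'msg_id': {'$in': client.history}},
#                             keys=['header', 'completed', 'engine_uuid']),
#     expected=len(client.history))
```

If the timeout expires first, the helper returns whatever records were available at that point, so the caller should still compare the record count against the history.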
There seems to be a kind of race condition when running lbview.apply() fast multiple times in a loop.
The following example code creates tasks. Sometimes tasks are not being created because the loop runs too fast.
A "time.sleep(0.5)" after the last line helps to avoid this. I think lbview.apply() should block at least for the time it takes to add a task to the queue (I don't mean blocking for task results).