
Bulk insert add wfs #260

Merged
merged 4 commits into materialsproject:master from montoyjh:bulk_insert_add_wfs on Feb 21, 2018

Conversation

@montoyjh (Contributor) commented Feb 8, 2018

I've found myself in need of a more efficient method of workflow addition to the launchpad lately (e.g. for adding thousands of workflows at a time), so I wrote a method that does it a bit more efficiently using some of mongo's bulk insertion functionality.

I'm not really sure if this is a good idea for the master branch (it seems less secure than adding workflows one at a time), but I figured I'd submit this PR to see if there's any feedback on how it might be adapted for general use, whether there's a better strategy, or whether there are any issues.

Also included a rudimentary unit test.

@coveralls commented Feb 8, 2018

Coverage increased (+0.05%) to 60.275% when pulling a96a407 on montoyjh:bulk_insert_add_wfs into 67e8dff on materialsproject:master.

@computron (Member)

Thanks!

This method is of course a bit more unsafe than usual, since there is no rollback (every once in a while, I really miss SQL).

Can you give an idea of how much speedup there is for inserting 1000 workflows? 2X, 10X, 100X, etc.? I am not looking for an official test, just your general experience. I am asking because others have wanted a bulk insertion speedup (see #240), so I would like to pull it, but I want to balance safety vs. performance gain. I don't want, for example, to propose that people sacrifice safety for a small performance boost.

@computron (Member)

Btw, you can probably make the "unused variable n" warning go away using something like:

for _ in range(1000):

since "_" is usually recognized as a dummy/unused var

@computron (Member)

@richardjgowers this PR is relevant to you based on our previous discussions, in case you have any comments / thoughts.

@richardjgowers (Contributor)

@computron thanks for the ping, I'll have a look at this over the weekend

@montoyjh (Contributor, Author) commented Feb 9, 2018

Yep. I'll benchmark and get back to you.
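(For reference, a minimal sketch of how such a benchmark might be set up. The connection settings, workflow sizes, and the method name bulk_add_wfs, which only appears in the final revision below, are assumptions; this is not the benchmark actually run for this PR.)

from time import perf_counter

from fireworks import Firework, LaunchPad, ScriptTask, Workflow


def make_wfs(n_wfs=1000, n_fws=3):
    # Build n_wfs small workflows of n_fws trivial fireworks each.
    task = ScriptTask.from_str('echo "lorem ipsum"')
    return [Workflow([Firework(task, name='lorem') for _ in range(n_fws)],
                     name='lorem workflow')
            for _ in range(n_wfs)]


lp = LaunchPad()  # defaults to a local MongoDB; adjust for your setup

# One-at-a-time insertion via the existing add_wf.
lp.reset('', require_password=False, max_reset_wo_password=10000)
start = perf_counter()
for wf in make_wfs():
    lp.add_wf(wf)
print('add_wf, one at a time: %.1f s' % (perf_counter() - start))

# Bulk insertion via the method added in this PR.
lp.reset('', require_password=False, max_reset_wo_password=10000)
start = perf_counter()
lp.bulk_add_wfs(make_wfs())
print('bulk_add_wfs: %.1f s' % (perf_counter() - start))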

@richardjgowers (Contributor) left a comment

My only other thought was that there are now two definitions of how to add a Workflow, so if someone "improves" add_wf it'll be easy to forget to also change this method and end up with weird issues. I'd change lp.add_wf to go through add_wfs too (just wrap whatever it's given in a list and pass it through), as in the sketch below.
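(A minimal sketch of that delegation, assuming the bulk method keeps the bulk_add_wfs name from the final revision; the bodies are illustrative, not the actual FireWorks implementation.)

class LaunchPad:
    def add_wf(self, wf):
        # Single-workflow insertion just wraps the workflow in a list and
        # reuses the bulk code path, so there is only one routine to maintain.
        return self.bulk_add_wfs([wf])

    def bulk_add_wfs(self, wfs):
        # Bulk insertion of many workflows at once (the method added in this PR).
        ...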

self.assertEqual(len(distinct_fw_ids), num_fws_total)
num_wfs_in_db = len(self.lp.get_wf_ids({"name": "lorem workflow"}))
self.assertEqual(num_wfs_in_db, len(wfs))
self.lp.reset('', require_password=False, max_reset_wo_password=1000)
@richardjgowers (Contributor)

This cleanup/teardown won't run if the test ever fails. This test class already has a tearDown method, so it should be fine without this.

@montoyjh (Contributor, Author)

It doesn't work without the max_reset_wo_password argument, but I'll just move that to the teardown method.
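(A minimal sketch of what that might look like; the class and attribute names here are assumed rather than taken from the actual test file.)

import unittest

from fireworks import LaunchPad


class BulkAddWfsTest(unittest.TestCase):
    def setUp(self):
        self.lp = LaunchPad()  # adjust connection settings for the test db

    def tearDown(self):
        # Cleanup lives here so it runs even when an assertion fails.
        # max_reset_wo_password must exceed the number of fireworks inserted
        # by the test, otherwise reset() refuses to wipe the test database.
        self.lp.reset('', require_password=False, max_reset_wo_password=10000)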

for n in range(100):
# create workflows with 3-10 simple fireworks
wfs.append(Workflow([Firework(ftask, name='lorem')
for n in range(0, randint(3, 10))],
@richardjgowers (Contributor)

I'm not sure what the randint adds, apart from making it harder to know how many FWs you've added and making the test a little harder to read.

@montoyjh (Contributor, Author)

I want to ensure that sets of workflows of different firework lengths get tested (it's where I'd imagine the associated methods would fail), but I'll do this deliberately, rather than randomly.
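(For instance, a deterministic version of the loop could cycle through fixed lengths; this is only an illustration, not necessarily the exact change made in the revision.)

from fireworks import Firework, ScriptTask, Workflow

ftask = ScriptTask.from_str('echo "lorem ipsum"')  # same trivial task as in the test
wfs = []
for i in range(100):
    # Workflow i gets 3 + (i % 8) fireworks, so lengths 3-10 all appear
    # and the total firework count is known in advance.
    n_fws = 3 + (i % 8)
    wfs.append(Workflow([Firework(ftask, name='lorem') for _ in range(n_fws)],
                        name='lorem workflow'))
num_fws_total = sum(3 + (i % 8) for i in range(100))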

# Insert all fws and wfs, do workflows first so fws don't
# get checked out prematurely
self.workflows.insert_many([wf.to_db_dict() for wf in wfs])
all_fws = chain.from_iterable([wf.fws for wf in wfs])
@richardjgowers (Contributor)

If you want to chase performance when this scales to thousands, you'll want to use generator expressions, not list comprehensions. So chain.from_iterable(wf.fws for wf in wfs), without the square brackets.

# get checked out prematurely
self.workflows.insert_many([wf.to_db_dict() for wf in wfs])
all_fws = chain.from_iterable([wf.fws for wf in wfs])
self.fireworks.insert_many([fw.to_db_dict() for fw in all_fws])
@richardjgowers (Contributor)

Similarly, pymongo accepts generator expressions here too, so you can pass this as (fw.to_db_dict() for fw in all_fws) and allow pymongo to lazily evaluate the FWs.
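(Putting the two suggestions together, the insertion block inside the bulk-add method would read something like the following sketch.)

from itertools import chain

# Generator expressions keep the to_db_dict() conversion lazy, and pymongo's
# insert_many accepts any iterable of documents, so nothing is materialized
# into an intermediate list.
self.workflows.insert_many(wf.to_db_dict() for wf in wfs)
all_fws = chain.from_iterable(wf.fws for wf in wfs)
self.fireworks.insert_many(fw.to_db_dict() for fw in all_fws)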

@computron (Member)

Thanks @richardjgowers for the helpful comments!

@montoyjh let me know when you have a revised version ready for re-review

@montoyjh (Contributor, Author)

Ok, a new revision is up; here are some timings. It looks like bulk addition gives about a factor of 10 speedup. I'm really not 100% confident this is all that robust, so I'm reluctant to merge the tried-and-true lpad.add_wf functionality with this one (in fact I renamed the function to bulk_add_wfs to signal the bulk write operation and perhaps to discourage accidental use). Up to you though.

[fws_timings: attached image of the timing comparison]

@computron (Member)

Ok, pulling as-is. I agree it would be better to have a single solution for both bulk and single insertion, but I guess this is better than the current iteration of the code.

@computron merged commit ad07276 into materialsproject:master on Feb 21, 2018