Bulk insert add wfs #260
Conversation
Thanks! This method is of course a bit more unsafe than usual since there is no rollback (every once in a while, I really miss SQL). Can you give an idea of how much speedup there is for inserting 1000 workflows? 2X, 10X, 100X, etc.? I am not looking for an official test, just your general experience. I am asking because others have wanted a bulk insertion speedup (see #240), so I would like to pull it, but I want to balance safety vs. performance gain. I don't want to propose that people sacrifice safety for a small performance boost, for example.
Btw, you can probably make the "unused variable n" warning go away using something like `for _ in range(1000):`, since `_` is usually recognized as a dummy/unused variable.
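A minimal sketch of the convention being suggested (the loop body here is purely illustrative):

```python
# The underscore name signals an intentionally unused loop variable;
# linters such as pyflakes will not flag "_" as unused.
total = 0
for _ in range(1000):
    total += 1  # the loop counter itself is never referenced

print(total)  # 1000
```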
@richardjgowers this PR is relevant to you based on our previous discussions, in case you have any comments/thoughts.
@computron thanks for the ping, I'll have a look at this over the weekend.
Yep. I'll benchmark and get back to you.
My only other thought was that there are now two definitions of how to add a Workflow, so if someone "improves" `add_wf` it'll be easy to forget to also change this method and end up with weird issues. I'd change `lp.add_wf` to go through `add_wfs` too (just pass through whatever it is passed in a list).
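A sketch of the delegation pattern being suggested, with a hypothetical stand-in class (the real `LaunchPad` methods would insert into MongoDB rather than a list):

```python
class LaunchPadSketch:
    """Hypothetical stand-in for LaunchPad, showing one code path for insertion."""

    def __init__(self):
        self.inserted = []

    def add_wfs(self, wfs):
        # All insertion logic lives here; improvements land in one place.
        for wf in wfs:
            self.inserted.append(wf)

    def add_wf(self, wf):
        # Single-workflow method just wraps its argument in a list.
        return self.add_wfs([wf])
```

With this shape, a future "improvement" to insertion automatically applies to both the single and bulk entry points.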
```python
self.assertEqual(len(distinct_fw_ids), num_fws_total)
num_wfs_in_db = len(self.lp.get_wf_ids({"name": "lorem workflow"}))
self.assertEqual(num_wfs_in_db, len(wfs))
self.lp.reset('', require_password=False, max_reset_wo_password=1000)
```
This cleanup/teardown won't run if the test ever fails. This test class already has a tearDown method, so it should be fine without this.
It doesn't work without the `max_reset_wo_password` argument, but I'll just move that to the tearDown method.
```python
for n in range(100):
    # create workflows with 3-10 simple fireworks
    wfs.append(Workflow([Firework(ftask, name='lorem')
                         for n in range(0, randint(3, 10))],
```
I'm not sure what the `randint` adds, apart from making it harder to know how many Fws you've added, which makes the test a little harder to read.
I want to ensure that sets of workflows of different firework lengths get tested (it's where I'd imagine the associated methods would fail), but I'll do this deliberately, rather than randomly.
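One way to get deliberate (rather than random) variation in workflow lengths is to cycle through a fixed spread of sizes, so the expected firework total stays predictable. This is a hypothetical sketch, not the PR's actual test code:

```python
# Fixed spread of workflow lengths replacing randint(3, 10); the
# expected total is now trivial to compute and assert against.
sizes = [3, 5, 7, 10]
wf_sizes = [sizes[i % len(sizes)] for i in range(100)]
num_fws_total = sum(wf_sizes)
print(num_fws_total)  # 625
```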
fireworks/core/launchpad.py
```python
# Insert all fws and wfs, do workflows first so fws don't
# get checked out prematurely
self.workflows.insert_many([wf.to_db_dict() for wf in wfs])
all_fws = chain.from_iterable([wf.fws for wf in wfs])
```
If you want to chase performance when this scales to 1,000s, you'll want to use generator expressions, not list comprehensions: `chain.from_iterable(wf.fws for wf in wfs)`, without the square brackets.
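The two forms produce the same flattened sequence; the difference is only that the generator expression avoids materialising the intermediate list. A toy illustration (plain lists stand in for `wf.fws`):

```python
from itertools import chain

# Toy stand-ins for workflows: each inner list plays the role of wf.fws.
wfs = [["fw1", "fw2"], ["fw3"], ["fw4", "fw5", "fw6"]]

# List comprehension: builds the intermediate list of fws-lists eagerly.
eager = list(chain.from_iterable([wf for wf in wfs]))

# Generator expression: yields each inner list on demand instead.
lazy = list(chain.from_iterable(wf for wf in wfs))

assert eager == lazy == ["fw1", "fw2", "fw3", "fw4", "fw5", "fw6"]
```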
fireworks/core/launchpad.py
```python
# get checked out prematurely
self.workflows.insert_many([wf.to_db_dict() for wf in wfs])
all_fws = chain.from_iterable([wf.fws for wf in wfs])
self.fireworks.insert_many([fw.to_db_dict() for fw in all_fws])
```
Similarly, pymongo accepts generator expressions here too, so you can pass this as `(fw.to_db_dict() for fw in all_fws)` and allow pymongo to lazily evaluate the Fws.
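Since `insert_many` just iterates whatever iterable of documents it receives, the lazy behaviour can be shown without a live MongoDB; `fake_insert_many` below is a hypothetical stand-in for the collection method:

```python
# Stand-in for collection.insert_many: like pymongo, it simply iterates
# the iterable of documents it is handed.
def fake_insert_many(docs):
    return [doc for doc in docs]

all_fws = [{"fw_id": i} for i in range(3)]

# A generator expression defers building each document dict until the
# consumer (here, the stand-in) actually pulls it.
inserted = fake_insert_many(dict(fw) for fw in all_fws)
assert inserted == all_fws
```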
Thanks @richardjgowers for the helpful comments! @montoyjh let me know when you have a revised version ready for re-review.
Ok, a new revision. Here are some timings; it seems like bulk addition gives about a factor of 10 speedup. I'm really not 100% confident this is all that robust, so I'm reluctant to merge the tried and true
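A skeleton of how such a timing comparison might be set up; `one_by_one` and `bulk` are hypothetical stand-ins using a plain list so the snippet runs here, whereas real numbers would come from timing `lp.add_wf` versus `lp.add_wfs` against MongoDB:

```python
import timeit

store = []

def one_by_one(wfs):
    # stand-in for calling lp.add_wf(wf) in a loop
    for wf in wfs:
        store.append(wf)

def bulk(wfs):
    # stand-in for a single lp.add_wfs(wfs) call
    store.extend(wfs)

wfs = list(range(1000))
t_single = timeit.timeit(lambda: one_by_one(wfs), number=10)
t_bulk = timeit.timeit(lambda: bulk(wfs), number=10)
print(f"one-by-one: {t_single:.4f}s  bulk: {t_bulk:.4f}s")
```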
Ok, pulling as-is. I agree it would be better to have a single solution for both bulk and single insertion, but I guess this is better than the current iteration of the code.
I've found myself in need of a more efficient method of workflow addition to the launchpad lately (e.g. for adding thousands of workflows at a time), so I wrote a method that does it a bit more efficiently using some of Mongo's bulk insertion functionality.
I'm not really sure if this is a good idea for the master branch (it seems less safe than adding workflows one at a time), but I figured I'd submit this PR to see if there's any feedback on how it might be adapted for general use (or if there's a better strategy), and whether there are any issues.
I've also included a rudimentary unit test.