
Bulk insert add wfs #260

Merged
merged 4 commits into materialsproject:master from montoyjh:bulk_insert_add_wfs on Feb 21, 2018

Conversation

@montoyjh (Contributor) commented Feb 8, 2018

I've found myself in need of a more efficient method of workflow addition to the launchpad lately (e.g. for adding thousands of workflows at a time), so I wrote a method that does it a bit more efficiently using some of mongo's bulk insertion functionality.

I'm not really sure if this is a good idea for the master branch (it seems less secure than adding workflows one at a time), but I figured I'd submit this PR to see if there's any feedback on how it might be adapted for general use, whether there's a better strategy, or whether there are any issues.

Also included a rudimentary unit test.

@coveralls commented Feb 8, 2018

Coverage increased (+0.05%) to 60.275% when pulling a96a407 on montoyjh:bulk_insert_add_wfs into 67e8dff on materialsproject:master.

@computron (Member)

Thanks!

This method is of course a bit more unsafe than usual, since there is no rollback (every once in a while, I really miss SQL).

Can you give an idea of how much speedup there is for inserting 1000 workflows? 2X, 10X, 100X, etc.? I am not looking for an official test, just your general experience. I am asking because others have wanted a bulk insertion speedup (see #240), so I would like to pull it, but I want to balance safety vs. performance gain. I don't want, for example, to propose that people sacrifice safety for a small performance boost.

@computron (Member)

Btw, you can probably make the "unused variable n" warning go away using something like:

for _ in range(1000):

since "_" is usually recognized as a dummy/unused var

@computron (Member)

@richardjgowers this PR is relevant to you based on our previous discussions, in case you have any comments / thoughts.

@richardjgowers (Contributor)

@computron thanks for the ping, I'll have a look at this over the weekend

@montoyjh (Contributor, Author) commented Feb 9, 2018

Yep. I'll benchmark and get back to you.
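(For reference, a minimal sketch of how such a benchmark might be set up. The connection settings, workflow sizes, and the method name bulk_add_wfs, which only appears in the final revision below, are assumptions; this is not the benchmark actually run for this PR.)

from time import perf_counter

from fireworks import Firework, LaunchPad, ScriptTask, Workflow


def make_wfs(n_wfs=1000, n_fws=3):
    # Build n_wfs small workflows of n_fws trivial fireworks each.
    task = ScriptTask.from_str('echo "lorem ipsum"')
    return [Workflow([Firework(task, name='lorem') for _ in range(n_fws)],
                     name='lorem workflow')
            for _ in range(n_wfs)]


lp = LaunchPad()  # defaults to a local MongoDB; adjust for your setup

# One-at-a-time insertion via the existing add_wf.
lp.reset('', require_password=False, max_reset_wo_password=10000)
start = perf_counter()
for wf in make_wfs():
    lp.add_wf(wf)
print('add_wf, one at a time: %.1f s' % (perf_counter() - start))

# Bulk insertion via the method added in this PR.
lp.reset('', require_password=False, max_reset_wo_password=10000)
start = perf_counter()
lp.bulk_add_wfs(make_wfs())
print('bulk_add_wfs: %.1f s' % (perf_counter() - start))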

@richardjgowers (Contributor) left a comment

My only other thought was that there are now two definitions of how to add a Workflow, so if someone "improves" add_wf it'll be easy to forget to also change this method and end up with weird issues. I'd change lp.add_wf to go through add_wfs too (just wrap whatever it's given in a list and pass it through), as in the sketch below.
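(A minimal sketch of that delegation, assuming the bulk method keeps the bulk_add_wfs name from the final revision; the bodies are illustrative, not the actual FireWorks implementation.)

class LaunchPad:
    def add_wf(self, wf):
        # Single-workflow insertion just wraps the workflow in a list and
        # reuses the bulk code path, so there is only one routine to maintain.
        return self.bulk_add_wfs([wf])

    def bulk_add_wfs(self, wfs):
        # Bulk insertion of many workflows at once (the method added in this PR).
        ...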

self.assertEqual(len(distinct_fw_ids), num_fws_total)
num_wfs_in_db = len(self.lp.get_wf_ids({"name": "lorem workflow"}))
self.assertEqual(num_wfs_in_db, len(wfs))
self.lp.reset('', require_password=False, max_reset_wo_password=1000)
@richardjgowers (Contributor)

This cleanup/teardown won't run if the test ever fails. This test class already has a tearDown method, so it should be fine without this.

@montoyjh (Contributor, Author)

It doesn't work without the max_reset_wo_password argument, but I'll just move that to the teardown method.
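(A minimal sketch of what that might look like; the class and attribute names here are assumed rather than taken from the actual test file.)

import unittest

from fireworks import LaunchPad


class BulkAddWfsTest(unittest.TestCase):
    def setUp(self):
        self.lp = LaunchPad()  # adjust connection settings for the test db

    def tearDown(self):
        # Cleanup lives here so it runs even when an assertion fails.
        # max_reset_wo_password must exceed the number of fireworks inserted
        # by the test, otherwise reset() refuses to wipe the test database.
        self.lp.reset('', require_password=False, max_reset_wo_password=10000)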

for n in range(100):
# create workflows with 3-10 simple fireworks
wfs.append(Workflow([Firework(ftask, name='lorem')
for n in range(0, randint(3, 10))],
@richardjgowers (Contributor)

I'm not sure what the randint adds, apart from making it harder to know how many FWs you've added and making the test a little harder to read.

@montoyjh (Contributor, Author)

I want to ensure that sets of workflows of different firework lengths get tested (it's where I'd imagine the associated methods would fail), but I'll do this deliberately, rather than randomly.
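(For instance, a deterministic version of the loop could cycle through fixed lengths; this is only an illustration, not necessarily the exact change made in the revision.)

from fireworks import Firework, ScriptTask, Workflow

ftask = ScriptTask.from_str('echo "lorem ipsum"')  # same trivial task as in the test
wfs = []
for i in range(100):
    # Workflow i gets 3 + (i % 8) fireworks, so lengths 3-10 all appear
    # and the total firework count is known in advance.
    n_fws = 3 + (i % 8)
    wfs.append(Workflow([Firework(ftask, name='lorem') for _ in range(n_fws)],
                        name='lorem workflow'))
num_fws_total = sum(3 + (i % 8) for i in range(100))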

# Insert all fws and wfs, do workflows first so fws don't
# get checked out prematurely
self.workflows.insert_many([wf.to_db_dict() for wf in wfs])
all_fws = chain.from_iterable([wf.fws for wf in wfs])
@richardjgowers (Contributor)

If you want to chase performance when this scales to thousands, you'll want to use generator expressions, not list comprehensions. So chain.from_iterable(wf.fws for wf in wfs), without the square brackets.

# get checked out prematurely
self.workflows.insert_many([wf.to_db_dict() for wf in wfs])
all_fws = chain.from_iterable([wf.fws for wf in wfs])
self.fireworks.insert_many([fw.to_db_dict() for fw in all_fws])
@richardjgowers (Contributor)

Similarly, pymongo accepts generator expressions here too, so you can pass this as (fw.to_db_dict() for fw in all_fws) and allow pymongo to lazily evaluate the FWs.
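(Putting the two suggestions together, the insertion block inside the bulk-add method would read something like the following sketch.)

from itertools import chain

# Generator expressions keep the to_db_dict() conversion lazy, and pymongo's
# insert_many accepts any iterable of documents, so nothing is materialized
# into an intermediate list.
self.workflows.insert_many(wf.to_db_dict() for wf in wfs)
all_fws = chain.from_iterable(wf.fws for wf in wfs)
self.fireworks.insert_many(fw.to_db_dict() for fw in all_fws)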

@computron (Member)

Thanks @richardjgowers for the helpful comments!

@montoyjh let me know when you have a revised version ready for re-review

@montoyjh (Contributor, Author)

Ok, a new revision is up; here are some timings. It looks like bulk addition gives about a factor of 10 speedup. I'm really not 100% confident this is all that robust, so I'm reluctant to merge the tried-and-true lpad.add_wf functionality with this one (in fact I renamed the function to bulk_add_wfs to signal the bulk write operation and perhaps to discourage accidental use). Up to you though.

[fws_timings: attached image of the timing comparison]

@computron (Member)

Ok, pulling as-is. I agree it would be better to have a single solution for both bulk and single insertion, but I guess this is better than the current iteration of the code.

@computron merged commit ad07276 into materialsproject:master on Feb 21, 2018