DM-30145: Support IN operator in database deletes #519

timj · 2021-05-11T21:41:10Z

In my quick tests of deleting 5000 datasets from a collection, the bridge emptyTrash deletes go from 10 seconds to 0.5 seconds with this change.

ktlim · 2021-05-11T22:20:17Z

python/lsst/daf/butler/registry/interfaces/_database.py

+            n_loops = math.ceil(n_elements / MAX_ELEMENTS_PER_IN)
+            n_per_loop = math.ceil(n_elements / n_loops)


This seems kind of complicated; why not just take MAX_ELEMENTS_PER_IN at a time until you run out? Then (because you would explicitly check) you also don't have the possibility that endpos is off the end (which I think you have now).

I was trying to avoid the situation where, with 5001 rows and MAX_ELEMENTS_PER_IN set to 1000, it runs 6 times when I could make it run 5 times by using 1001 per loop.

Re "endpos off the end" -- I checked and Python is perfectly fine with that, so I don't need any special logic to handle overrun.

I do realize the logic is not quite right regardless but the loop still covers everything.
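The slicing behaviour referred to here can be checked directly: a Python slice whose end index exceeds the sequence length is truncated rather than raising an error. The following is a minimal sketch with hypothetical values (`items` and `chunk_size` are illustrative names, not taken from the PR):

```python
items = list(range(10))
chunk_size = 4  # hypothetical chunk size, for illustration only

# The end index (12) is past the end of the list; the slice is simply truncated.
last_chunk = items[8:8 + chunk_size]

# Stepping through the list in chunk_size strides still visits every element once.
chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
flattened = [x for chunk in chunks for x in chunk]
```

Here `last_chunk` holds only the two remaining elements, and `flattened` reproduces `items`, so no explicit bounds check is needed.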

Except I think you still get 6 loops of 834 in the case you gave, not 5 of 1001. (You'd need to replace the first computation with a math.floor.) What's the use of having a MAX if it's not the maximum? And is an extra loop/query really that much time (compared with the thousands that you are avoiding)?

Just saw you have a new calculation. It is better, but still a lot of comments and code for not a lot of gain.

Yes. Whilst doing the school run I realized which way I had messed it up. I can remove the complication -- I just felt like I preferred a more even spread...
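The arithmetic in this thread can be worked through for the 5001-row example. The sketch below compares three strategies: taking MAX_ELEMENTS_PER_IN at a time, the ceil/ceil calculation from the quoted diff, and the math.floor variant mentioned above. The function names are hypothetical and only the chunk sizes are computed; no queries are issued.

```python
import math


def fixed_chunks(n_elements, max_per_in):
    """Chunk sizes when taking max_per_in at a time until nothing is left."""
    return [min(max_per_in, n_elements - start)
            for start in range(0, n_elements, max_per_in)]


def even_chunks(n_elements, max_per_in, rounding=math.ceil):
    """Chunk sizes when spreading elements evenly over n_loops queries."""
    n_loops = rounding(n_elements / max_per_in)
    n_per_loop = math.ceil(n_elements / n_loops)
    return [min(n_per_loop, n_elements - start)
            for start in range(0, n_elements, n_per_loop)]


fixed = fixed_chunks(5001, 1000)                             # 5 x 1000, then 1
ceil_spread = even_chunks(5001, 1000)                        # 5 x 834, then 831
floor_spread = even_chunks(5001, 1000, rounding=math.floor)  # 4 x 1001, then 997
```

With these inputs the ceil/ceil version still issues 6 queries, while the floor version issues 5 but exceeds the stated maximum. The floor variant as sketched would also divide by zero when n_elements is smaller than max_per_in, which is one more argument for the simpler fixed-size approach.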

ktlim · 2021-05-11T22:22:34Z

python/lsst/daf/butler/registry/interfaces/_database.py

+                newsql = sql.where(sqlalchemy.sql.and_(*clauses, in_clause))
+                rowcount += self._connection.execute(newsql).rowcount
+            return rowcount
+        else:


Sometimes it's more readable for the short, default/fallback case to come before the long, specialized case.

Yes, when I started they were about the same length.

andy-slac · 2021-05-12T03:09:10Z

python/lsst/daf/butler/registry/interfaces/_database.py

+            # Nothing to calculate since we can always use IN
+            column = columns[0]
+            changing_columns = [column]
+            content[column] = set([row[column] for row in rows])


No need to make a list before turning it into a set; any iterable is OK.
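As a small illustration of this suggestion, the three spellings below build the same set; only the first creates a temporary list. The `rows` data and the column name are hypothetical.

```python
rows = [{"dataset_id": 1}, {"dataset_id": 2}, {"dataset_id": 1}]  # hypothetical rows
column = "dataset_id"

via_list = set([row[column] for row in rows])      # builds a throwaway list first
via_generator = set(row[column] for row in rows)   # consumes the iterable directly
via_comprehension = {row[column] for row in rows}  # set comprehension, same result
```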

andy-slac · 2021-05-12T03:17:53Z

python/lsst/daf/butler/registry/interfaces/_database.py

+            for row in rows:
+                for k, v in row.items():
+                    content[k].add(v)
+            changing_columns = [col for col in content if len(content[col]) > 1]


[col for col, values in content.items() if len(values) > 1] for extra efficiency?
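A minimal sketch of the two forms side by side, assuming `content` is a mapping from column name to a set of values (built here with a `defaultdict`, which is an assumption about the surrounding code). The sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: dataset_id varies, datastore_name is constant.
rows = [
    {"dataset_id": 1, "datastore_name": "main"},
    {"dataset_id": 2, "datastore_name": "main"},
]

content = defaultdict(set)
for row in rows:
    for k, v in row.items():
        content[k].add(v)

# Form from the diff: one extra dictionary lookup per column.
by_key = [col for col in content if len(content[col]) > 1]

# Suggested form: iterate over key/value pairs directly.
by_items = [col for col, values in content.items() if len(values) > 1]
```

Both lists contain only the column whose value changes between rows; the `.items()` form just avoids the repeated lookup.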

andy-slac · 2021-05-12T03:37:09Z

python/lsst/daf/butler/registry/interfaces/_database.py

+            iposn = 0
+            while iposn < n_elements:
+                endpos = iposn + n_per_loop
+                in_clause = table.columns[name].in_(in_content[iposn:endpos])
+                iposn = endpos


It may be easier to read as:

for iposn in range(0, n_elements, n_per_loop):
    endpos = iposn + n_per_loop
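To show that the `range`-based form visits the same slices as the `while` loop in the diff, here is a self-contained sketch. The two helper functions are hypothetical and collect the slices in a list instead of building IN clauses.

```python
def chunks_while(in_content, n_per_loop):
    """Slices produced by the while-loop form quoted in the diff."""
    n_elements = len(in_content)
    out = []
    iposn = 0
    while iposn < n_elements:
        endpos = iposn + n_per_loop
        out.append(in_content[iposn:endpos])
        iposn = endpos
    return out


def chunks_range(in_content, n_per_loop):
    """Slices produced by the suggested range-based form."""
    n_elements = len(in_content)
    out = []
    for iposn in range(0, n_elements, n_per_loop):
        endpos = iposn + n_per_loop
        out.append(in_content[iposn:endpos])
    return out


data = list(range(11))
while_result = chunks_while(data, 4)
range_result = chunks_range(data, 4)
```

The `range` version drops the manual counter update, which removes one place where the loop bookkeeping could go wrong.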

python/lsst/daf/butler/registry/interfaces/_database.py

andy-slac · 2021-05-12T03:52:59Z

Sorry, forgot to add a comment before clicking Approved. Looks OK, but check my comment about transactions; I think we need to wrap it in a single transaction.
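The transaction concern can be illustrated with a sketch: if a delete is split into several IN queries, running them inside one transaction means either every chunk is removed or none is. The example below uses the standard-library `sqlite3` module as a stand-in, not the SQLAlchemy connection used in the PR; the table name, the function name, and the small `MAX_ELEMENTS_PER_IN` value are all assumptions made for the demo.

```python
import sqlite3

MAX_ELEMENTS_PER_IN = 100  # hypothetical limit, kept small for the demo


def delete_in_chunks(conn, ids):
    """Delete rows whose id is in ``ids``, chunk by chunk, in one transaction."""
    ids = list(ids)
    rowcount = 0
    # ``with conn`` commits if the block succeeds and rolls back if it raises,
    # so a failure part-way through leaves the table unchanged.
    with conn:
        for start in range(0, len(ids), MAX_ELEMENTS_PER_IN):
            chunk = ids[start:start + MAX_ELEMENTS_PER_IN]
            placeholders = ", ".join("?" * len(chunk))
            cursor = conn.execute(
                f"DELETE FROM dataset WHERE id IN ({placeholders})", chunk
            )
            rowcount += cursor.rowcount
    return rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO dataset (id) VALUES (?)", [(i,) for i in range(250)])
conn.commit()

deleted = delete_in_chunks(conn, range(0, 250, 2))  # every even id, two chunks
remaining = conn.execute("SELECT COUNT(*) FROM dataset").fetchone()[0]
```

The summed `rowcount` mirrors the accumulation in the diff, and the single enclosing transaction is what keeps a multi-query delete atomic.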


Allow IN operator to be used in database DELETE

e80d650

ktlim reviewed May 11, 2021


timj force-pushed the tickets/DM-30145 branch from 4dda5eb to 34afded Compare May 11, 2021 23:22

andy-slac approved these changes May 12, 2021


timj added 3 commits May 12, 2021 10:31

Improve detection for IN compatibility in DELETE

0556b42

Now checks to see if a multi-column delete can use IN

Split large deletes into chunks when using IN

5f375c7

Temporarily pin click version

6abef65

timj force-pushed the tickets/DM-30145 branch from 34afded to 6abef65 Compare May 12, 2021 17:33

Rearrange the if clause to put simplest one first

0d41652

timj merged commit 20a983f into master May 12, 2021

timj deleted the tickets/DM-30145 branch May 12, 2021 21:15
