Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-30145: Support IN operator in database deletes #519

Merged
merged 5 commits into from May 12, 2021
Merged

Conversation

timj
Copy link
Member

@timj timj commented May 11, 2021

In my quick tests of deleting 5000 datasets from a collection the bridge emptyTrash deletes go from 10 seconds to 0.5 second with this change.

Comment on lines 1468 to 1480
n_loops = math.ceil(n_elements / MAX_ELEMENTS_PER_IN)
n_per_loop = math.ceil(n_elements / n_loops)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems kind of complicated; why not just take MAX_ELEMENTS_PER_IN at a time until you run out? Then (because you would explicitly check) you also don't have the possibility that endpos is off the end (which I think you have now).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying not to have a situation where there are 5001 rows and 1000 as MAX_ELEMENTS_PER_IN to make it run 6 times when I can make it run 5 times if I use 1001.

Copy link
Member Author

@timj timj May 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re "endpos off the end" -- I checked and python is perfectly fine with that so I don't need any special logic to handle overrun.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do realize the logic is not quite right regardless but the loop still covers everything.

Copy link
Contributor

@ktlim ktlim May 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except I think you still get 6 loops of 834 in the case you gave, not 5 of 1001. (You'd need to replace the first computation with a math.floor.) What's the use of having a MAX if it's not the maximum? And is an extra loop/query really that much time (compared with the thousands that you are avoiding)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just saw you have a new calculation. It is better, but still a lot of comments and code for not a lot of gain.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Whilst doing the school run I realized which way I had messed it up. I can remove the complication -- I just felt like I preferred a more even spread...

newsql = sql.where(sqlalchemy.sql.and_(*clauses, in_clause))
rowcount += self._connection.execute(newsql).rowcount
return rowcount
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes it's more readable for the short, default/fallback case to come before the long, specialized case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, when I started they were about the same length.

# Nothing to calculate since we can always use IN
column = columns[0]
changing_columns = [column]
content[column] = set([row[column] for row in rows])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to make list before turning it into set, iterable is OK.

for row in rows:
for k, v in row.items():
content[k].add(v)
changing_columns = [col for col in content if len(content[col]) > 1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[col for col, values in content.items() if len(values) > 1] for extra efficiency?

Comment on lines 1483 to 1487
iposn = 0
while iposn < n_elements:
endpos = iposn + n_per_loop
in_clause = table.columns[name].in_(in_content[iposn:endpos])
iposn = endpos
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It maybe easier to read as:

for iposn in range(0, n_elements, n_per_loop):
    endpos = iposn + n_per_loop

@andy-slac
Copy link
Contributor

Sorry, forgot to add comment before clicking Approved. Looks OK, but check my comment about transactions, I think we need to wrap it into a single transaction.

@timj timj merged commit 20a983f into master May 12, 2021
@timj timj deleted the tickets/DM-30145 branch May 12, 2021 21:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants