DM-30145: Support IN operator in database deletes #519
Conversation
n_loops = math.ceil(n_elements / MAX_ELEMENTS_PER_IN)
n_per_loop = math.ceil(n_elements / n_loops)
This seems kind of complicated; why not just take MAX_ELEMENTS_PER_IN at a time until you run out? Then (because you would explicitly check) you also don't have the possibility that endpos is off the end (which I think you have now).
I was trying to avoid the situation where, with 5001 rows and a MAX_ELEMENTS_PER_IN of 1000, the loop runs 6 times when it could run 5 times with batches of 1001.
Re "endpos off the end" -- I checked, and Python is perfectly fine with a slice end point past the end of the sequence, so I don't need any special logic to handle overrun.
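To illustrate the point about slice overrun (a minimal standalone sketch, not code from the PR): Python silently clamps an out-of-range slice stop, and a start index past the end just yields an empty list.

```python
# Python slicing never raises for out-of-range bounds; it clamps instead.
values = list(range(5))

print(values[3:100])   # stop past the end -> truncated slice: [3, 4]
print(values[10:20])   # start past the end -> empty list: []
```

This is why the batching loop needs no special-case logic for the final, shorter batch.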
I do realize the logic is not quite right regardless, but the loop still covers everything.
Except I think you still get 6 loops of 834 in the case you gave, not 5 of 1001. (You'd need to replace the first computation with a math.floor.) What's the use of having a MAX if it's not the maximum? And is an extra loop/query really that much time (compared with the thousands that you are avoiding)?
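The arithmetic being debated can be checked directly. A minimal sketch using the numbers from the thread (5001 elements, a cap of 1000 per IN clause); the batch lists are illustrative, not code from the PR:

```python
import math

MAX_ELEMENTS_PER_IN = 1000  # cap per IN clause, from the thread
n_elements = 5001

# Strategy 1: take MAX at a time -> 6 queries, the last with 1 element.
simple_batches = [min(MAX_ELEMENTS_PER_IN, n_elements - i)
                  for i in range(0, n_elements, MAX_ELEMENTS_PER_IN)]
print(simple_batches)  # [1000, 1000, 1000, 1000, 1000, 1]

# Strategy 2 (the PR's ceil/ceil spread): still 6 queries, but balanced.
n_loops = math.ceil(n_elements / MAX_ELEMENTS_PER_IN)   # 6
n_per_loop = math.ceil(n_elements / n_loops)            # 834
even_batches = [min(n_per_loop, n_elements - i)
                for i in range(0, n_elements, n_per_loop)]
print(even_batches)  # [834, 834, 834, 834, 834, 831]
```

Getting 5 queries of 1001 would require math.floor in the first computation, and the resulting batch size exceeds MAX_ELEMENTS_PER_IN -- which is exactly the "what's the use of having a MAX" tension raised above.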
Just saw you have a new calculation. It is better, but still a lot of comments and code for not a lot of gain.
Yes. Whilst doing the school run I realized which way I had messed it up. I can remove the complication -- I just felt like I preferred a more even spread...
newsql = sql.where(sqlalchemy.sql.and_(*clauses, in_clause))
rowcount += self._connection.execute(newsql).rowcount
return rowcount
else:
Sometimes it's more readable for the short, default/fallback case to come before the long, specialized case.
Yes; when I started, they were about the same length.
# Nothing to calculate since we can always use IN
column = columns[0]
changing_columns = [column]
content[column] = set([row[column] for row in rows])
No need to make a list before turning it into a set; an iterable is OK.
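The suggestion above in concrete form (a standalone sketch with hypothetical rows, not code from the PR): passing a generator expression, or using a set comprehension, builds the set without materializing an intermediate list.

```python
rows = [{"id": 1}, {"id": 2}, {"id": 1}]  # hypothetical rows for illustration
column = "id"

# What the diff did: build a list, then convert it to a set.
as_list = set([row[column] for row in rows])

# Generator expression: no intermediate list is allocated.
as_gen = set(row[column] for row in rows)

# Most idiomatic: a set comprehension.
as_comp = {row[column] for row in rows}

assert as_list == as_gen == as_comp == {1, 2}
```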
for row in rows:
    for k, v in row.items():
        content[k].add(v)
changing_columns = [col for col in content if len(content[col]) > 1]
[col for col, values in content.items() if len(values) > 1] for extra efficiency?
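The two comprehensions are equivalent; the items() form just avoids one dict lookup per key. A small sketch with a hypothetical content mapping (not data from the PR):

```python
content = {"a": {1, 2}, "b": {3}, "c": {4, 5, 6}}  # hypothetical column -> values

# Original: iterates keys, then looks each key up again in the dict.
by_lookup = [col for col in content if len(content[col]) > 1]

# Suggested: items() yields each (key, values) pair together, no re-lookup.
by_items = [col for col, values in content.items() if len(values) > 1]

assert by_lookup == by_items == ["a", "c"]
```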
iposn = 0
while iposn < n_elements:
    endpos = iposn + n_per_loop
    in_clause = table.columns[name].in_(in_content[iposn:endpos])
    iposn = endpos
It may be easier to read as:
for iposn in range(0, n_elements, n_per_loop):
    endpos = iposn + n_per_loop
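The two loop shapes above produce identical batches; the for/range form simply drops the manual index bookkeeping. A standalone sketch with hypothetical values (no database involved):

```python
in_content = list(range(10))  # hypothetical values destined for the IN clause
n_elements = len(in_content)
n_per_loop = 4

# Original while-loop form: advance iposn by hand.
while_batches = []
iposn = 0
while iposn < n_elements:
    endpos = iposn + n_per_loop
    while_batches.append(in_content[iposn:endpos])
    iposn = endpos

# Suggested for/range form: range() does the stepping.
for_batches = [in_content[iposn:iposn + n_per_loop]
               for iposn in range(0, n_elements, n_per_loop)]

assert while_batches == for_batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Note that the final slice runs past the end of the list and is safely truncated, as discussed earlier in the thread.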
Sorry, forgot to add a comment before clicking Approved. Looks OK, but check my comment about transactions; I think we need to wrap it into a single transaction.
Now checks to see if a multi-column delete can use IN
In my quick tests of deleting 5000 datasets from a collection, the bridge emptyTrash deletes go from 10 seconds to 0.5 seconds with this change.