Implement hash_join for merges #57970
Thanks for the refactor - this structure makes a lot more sense
if self.na_position == -1:
    continue
if needs_resize(l.size, l.capacity):
    with gil:
You shouldn't need to reacquire the gil here - was Cython giving an error?
Yep
Discarding owned Python object not allowed without gil
Resizing the ndarray needs the GIL (happy to be convinced otherwise if I am mistaken)
Huh OK - I see that pattern elsewhere in the codebase. I think there is something simple being overlooked that requires the GIL, but let's leave that to a separate PR
> I think there is something simple being overlooked that requires the GIL
That would make me very happy as a Dask developer :)
nice work @phofl
Thanks for your review! Merging to unblock follow-up work, but happy to address other comments in follow-ups.
cc @mroeschke
Our abstraction in merges is bad, and this makes it a little worse, unfortunately. But it enables a potentially huge performance improvement for joins that could be hash joins. I am using "right" to make the decision, because left determines the result order: deciding based on left would mean we have to sort after we are finished, which gives the performance improvement back. Using right makes this problem go away.
We get time complexity O(m+n) here, compared with O(m*n) before, with a non-trivial factor as before.
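To illustrate where the O(m+n) comes from, here is a minimal pure-Python sketch of a hash join. It is not the Cython code in this PR: the name `hash_join_indexers` is hypothetical, and it assumes the hash table is built on the right keys and probed with the left keys so that the output follows the order of left without a final sort.

```python
from collections import defaultdict


def hash_join_indexers(left_keys, right_keys):
    """Inner join on keys; returns matching (left_idx, right_idx) positions.

    Hypothetical helper for illustration only, not a pandas API.
    """
    # Build phase: one pass over right, mapping key -> positions. O(n).
    table = defaultdict(list)
    for j, key in enumerate(right_keys):
        table[key].append(j)

    # Probe phase: one pass over left, in order, so the result keeps
    # left's ordering. O(m) lookups plus the number of matches emitted.
    left_idx, right_idx = [], []
    for i, key in enumerate(left_keys):
        for j in table.get(key, ()):
            left_idx.append(i)
            right_idx.append(j)
    return left_idx, right_idx


left_idx, right_idx = hash_join_indexers(["b", "a", "c", "a"], ["a", "b", "d"])
print(left_idx, right_idx)
```

The bound is O(m+n) plus the size of the output, so it is tightest when the right keys are unique; with many duplicates on both sides the output itself can grow toward m*n rows.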
This makes adding semi joins pretty easy as well, which is nice next to the performance improvements here.
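As a rough sketch of why a semi join falls out of the same structure (again plain Python with a hypothetical name, not what this PR implements): once the right keys are hashed, a semi join only needs a membership test per left row, and each left row is emitted at most once.

```python
def semi_join_positions(left_keys, right_keys):
    """Positions of left rows whose key appears at least once in right.

    Hypothetical helper for illustration only, not a pandas API.
    """
    # Build phase: only membership matters, so a set is enough. O(n).
    right_set = set(right_keys)
    # Probe phase: keep left's order; duplicates in right do not
    # multiply the output the way they would in an inner join. O(m).
    return [i for i, key in enumerate(left_keys) if key in right_set]


print(semi_join_positions(["b", "a", "c", "a"], ["a", "a", "b"]))
```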