Implement hash_join for merges #57970

Merged · 25 commits merged into pandas-dev:main on Mar 24, 2024
Conversation

@phofl (Member) commented Mar 22, 2024:

cc @mroeschke

Our abstraction in merges is bad, and unfortunately this makes it a little worse. But it enables a potentially huge performance improvement for joins that can be executed as hash joins. I am using "right" to make the decision, because left determines the result order, which means we would otherwise have to sort after we are finished, giving the performance improvement back. Using right makes this problem go away.

We get O(m + n) time complexity here instead of the O(m * n) with a non-trivial constant factor that we had before.

This also makes adding semi joins pretty easy, which is a nice bonus on top of the performance improvements here.
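
For intuition, here is a minimal pure-Python sketch of the general hash-join idea (illustrative only, not the Cython implementation in this PR; the function name `hash_join_indexers` is made up): build a hash table over the right keys once, then probe it with the left keys in order, so the output already follows the left ordering and no post-sort is needed. This is roughly where the O(m + n) behaviour comes from (plus the size of the output).

```python
from collections import defaultdict

def hash_join_indexers(left_keys, right_keys):
    """Illustrative inner-join indexers in O(m + n): build on the right, probe with the left."""
    # Build phase: map each right key to all of its positions.
    table = defaultdict(list)
    for j, key in enumerate(right_keys):
        table[key].append(j)

    # Probe phase: walk the left keys in order; the result keeps the left
    # ordering, so no sort is needed afterwards.
    left_indexer, right_indexer = [], []
    for i, key in enumerate(left_keys):
        for j in table.get(key, ()):  # dict.get avoids inserting missing keys
            left_indexer.append(i)
            right_indexer.append(j)
    return left_indexer, right_indexer

# Example: keys [1, 2, 2] joined with [2, 3, 1] -> row pairs (0, 2), (1, 0), (2, 0)
print(hash_join_indexers([1, 2, 2], [2, 3, 1]))  # ([0, 1, 2], [2, 0, 0])
```

A left semi join falls out of the same structure by emitting each left position at most once when its key is present in the table.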

| Change   | Before [38086f11] <backtest>   | After [3b6b787e] <to>   |   Ratio | Benchmark (Parameter)                                                       |
|----------|--------------------------------|-------------------------|---------|-----------------------------------------------------------------------------|
| -        | 193±4ms                        | 165±5ms                 |    0.85 | join_merge.I8Merge.time_i8merge('inner')                                    |
| -        | 7.76±0.2ms                     | 6.53±0.06ms             |    0.84 | join_merge.Merge.time_merge_2intkey(False)                                  |
| -        | 1.03±0.02ms                    | 834±4μs                 |    0.81 | join_merge.Merge.time_merge_dataframe_integer_2key(False)                   |
| -        | 485±2μs                        | 339±5μs                 |    0.7  | join_merge.Merge.time_merge_dataframe_integer_key(False)                    |
| -        | 3.29±0.2ms                     | 2.29±0.07ms             |    0.7  | join_merge.MergeDatetime.time_merge(('ms', 'ms'), 'Europe/Brussels', False) |
| -        | 3.31±0.07ms                    | 2.27±0.07ms             |    0.69 | join_merge.MergeDatetime.time_merge(('ms', 'ms'), None, False)              |
| -        | 2.79±0.09ms                    | 1.77±0.01ms             |    0.63 | join_merge.MergeDatetime.time_merge(('ns', 'ms'), 'Europe/Brussels', False) |
| -        | 2.89±0.04ms                    | 1.78±0.05ms             |    0.62 | join_merge.MergeDatetime.time_merge(('ns', 'ms'), None, False)              |
| -        | 2.57±0.09ms                    | 1.56±0.03ms             |    0.61 | join_merge.MergeDatetime.time_merge(('ns', 'ns'), None, False)              |
| -        | 1.97±0.05ms                    | 1.18±0.02ms             |    0.6  | join_merge.MergeEA.time_merge('Float32', False)                             |
| -        | 1.84±0.02ms                    | 1.10±0.03ms             |    0.6  | join_merge.MergeEA.time_merge('UInt16', False)                              |
| -        | 2.09±0.04ms                    | 1.24±0.01ms             |    0.59 | join_merge.MergeEA.time_merge('UInt64', False)                              |
| -        | 2.10±0.09ms                    | 1.22±0.01ms             |    0.58 | join_merge.MergeEA.time_merge('Float64', False)                             |
| -        | 2.09±0.08ms                    | 1.22±0.01ms             |    0.58 | join_merge.MergeEA.time_merge('UInt32', False)                              |
| -        | 2.70±0.1ms                     | 1.54±0.02ms             |    0.57 | join_merge.MergeDatetime.time_merge(('ns', 'ns'), 'Europe/Brussels', False) |
| -        | 1.72±0.02ms                    | 971±20μs                |    0.57 | join_merge.MergeEA.time_merge('Int16', False)                               |
| -        | 1.76±0.03ms                    | 973±10μs                |    0.55 | join_merge.MergeEA.time_merge('Int32', False)                               |
| -        | 1.94±0.09ms                    | 1.07±0.03ms             |    0.55 | join_merge.MergeEA.time_merge('Int64', False)                               |
| -        | 57.0±2ms                       | 21.7±0.3ms              |    0.38 | join_merge.UniqueMerge.time_unique_merge(1000000)                           |
| -        | 106±7ms                        | 34.1±0.3ms              |    0.32 | join_merge.UniqueMerge.time_unique_merge(4000000)                           |

@phofl requested a review from WillAyd as a code owner on March 22, 2024 22:27

Review comments (since resolved) were left on pandas/_libs/hashtable_class_helper.pxi.in and pandas/core/reshape/merge.py.
@WillAyd (Member) left a comment:

Thanks for the refactor - this structure makes a lot more sense

On pandas/_libs/hashtable_class_helper.pxi.in:

    if self.na_position == -1:
        continue
    if needs_resize(l.size, l.capacity):
        with gil:
Member commented:

You shouldn't need to reacquire the GIL here - was Cython giving an error?

@phofl (Member, Author) commented Mar 23, 2024:

Yep:

    Discarding owned Python object not allowed without gil

Resizing the ndarray needs the GIL (happy to be convinced otherwise if I am mistaken).

Member commented:

Huh, OK - I see that pattern elsewhere in the codebase. I think there is something simple being overlooked that requires the GIL, but let's leave that to a separate PR.

@phofl (Member, Author) commented:

> I think there is something simple being overlooked that requires the GIL

That would make me very happy as a Dask developer :)

@WillAyd (Member) left a comment:

nice work @phofl

@phofl added the "Reshaping" label (Concat, Merge/Join, Stack/Unstack, Explode) on Mar 24, 2024
@phofl added this to the 3.0 milestone on Mar 24, 2024
@phofl (Member, Author) commented Mar 24, 2024:

Thanks for your review! Merging to unblock follow-up work, but happy to address other comments in follow-ups.

@phofl merged commit 669ddfb into pandas-dev:main on Mar 24, 2024 (50 checks passed)
@phofl deleted the hash_join branch on March 24, 2024 00:41
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request on May 7, 2024