Implement hash_join for merges #57970

Merged · 25 commits merged into pandas-dev:main on Mar 24, 2024
Conversation

@phofl (Member) commented Mar 22, 2024:

cc @mroeschke

Our abstraction in merges is bad, and unfortunately this makes it a little worse. But it enables a potentially huge performance improvement for joins that can be executed as hash joins. I am using "right" to make the decision, because left determines the result order, which means we would otherwise have to sort after we are finished, giving the performance improvement back. Using right makes this problem go away.

We get O(m + n) time complexity here instead of the O(m * n) with a non-trivial constant factor that we had before.

This also makes adding semi joins pretty easy, which is a nice bonus on top of the performance improvements here.
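
For intuition, here is a minimal pure-Python sketch of the general hash-join idea (illustrative only, not the Cython implementation in this PR; the function name `hash_join_indexers` is made up): build a hash table over the right keys once, then probe it with the left keys in order, so the output already follows the left ordering and no post-sort is needed. This is roughly where the O(m + n) behaviour comes from (plus the size of the output).

```python
from collections import defaultdict

def hash_join_indexers(left_keys, right_keys):
    """Illustrative inner-join indexers in O(m + n): build on the right, probe with the left."""
    # Build phase: map each right key to all of its positions.
    table = defaultdict(list)
    for j, key in enumerate(right_keys):
        table[key].append(j)

    # Probe phase: walk the left keys in order; the result keeps the left
    # ordering, so no sort is needed afterwards.
    left_indexer, right_indexer = [], []
    for i, key in enumerate(left_keys):
        for j in table.get(key, ()):  # dict.get avoids inserting missing keys
            left_indexer.append(i)
            right_indexer.append(j)
    return left_indexer, right_indexer

# Example: keys [1, 2, 2] joined with [2, 3, 1] -> row pairs (0, 2), (1, 0), (2, 0)
print(hash_join_indexers([1, 2, 2], [2, 3, 1]))  # ([0, 1, 2], [2, 0, 0])
```

A left semi join falls out of the same structure by emitting each left position at most once when its key is present in the table.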

| Change   | Before [38086f11] <backtest>   | After [3b6b787e] <to>   |   Ratio | Benchmark (Parameter)                                                       |
|----------|--------------------------------|-------------------------|---------|-----------------------------------------------------------------------------|
| -        | 193±4ms                        | 165±5ms                 |    0.85 | join_merge.I8Merge.time_i8merge('inner')                                    |
| -        | 7.76±0.2ms                     | 6.53±0.06ms             |    0.84 | join_merge.Merge.time_merge_2intkey(False)                                  |
| -        | 1.03±0.02ms                    | 834±4μs                 |    0.81 | join_merge.Merge.time_merge_dataframe_integer_2key(False)                   |
| -        | 485±2μs                        | 339±5μs                 |    0.7  | join_merge.Merge.time_merge_dataframe_integer_key(False)                    |
| -        | 3.29±0.2ms                     | 2.29±0.07ms             |    0.7  | join_merge.MergeDatetime.time_merge(('ms', 'ms'), 'Europe/Brussels', False) |
| -        | 3.31±0.07ms                    | 2.27±0.07ms             |    0.69 | join_merge.MergeDatetime.time_merge(('ms', 'ms'), None, False)              |
| -        | 2.79±0.09ms                    | 1.77±0.01ms             |    0.63 | join_merge.MergeDatetime.time_merge(('ns', 'ms'), 'Europe/Brussels', False) |
| -        | 2.89±0.04ms                    | 1.78±0.05ms             |    0.62 | join_merge.MergeDatetime.time_merge(('ns', 'ms'), None, False)              |
| -        | 2.57±0.09ms                    | 1.56±0.03ms             |    0.61 | join_merge.MergeDatetime.time_merge(('ns', 'ns'), None, False)              |
| -        | 1.97±0.05ms                    | 1.18±0.02ms             |    0.6  | join_merge.MergeEA.time_merge('Float32', False)                             |
| -        | 1.84±0.02ms                    | 1.10±0.03ms             |    0.6  | join_merge.MergeEA.time_merge('UInt16', False)                              |
| -        | 2.09±0.04ms                    | 1.24±0.01ms             |    0.59 | join_merge.MergeEA.time_merge('UInt64', False)                              |
| -        | 2.10±0.09ms                    | 1.22±0.01ms             |    0.58 | join_merge.MergeEA.time_merge('Float64', False)                             |
| -        | 2.09±0.08ms                    | 1.22±0.01ms             |    0.58 | join_merge.MergeEA.time_merge('UInt32', False)                              |
| -        | 2.70±0.1ms                     | 1.54±0.02ms             |    0.57 | join_merge.MergeDatetime.time_merge(('ns', 'ns'), 'Europe/Brussels', False) |
| -        | 1.72±0.02ms                    | 971±20μs                |    0.57 | join_merge.MergeEA.time_merge('Int16', False)                               |
| -        | 1.76±0.03ms                    | 973±10μs                |    0.55 | join_merge.MergeEA.time_merge('Int32', False)                               |
| -        | 1.94±0.09ms                    | 1.07±0.03ms             |    0.55 | join_merge.MergeEA.time_merge('Int64', False)                               |
| -        | 57.0±2ms                       | 21.7±0.3ms              |    0.38 | join_merge.UniqueMerge.time_unique_merge(1000000)                           |
| -        | 106±7ms                        | 34.1±0.3ms              |    0.32 | join_merge.UniqueMerge.time_unique_merge(4000000)                           |

@phofl requested a review from WillAyd as a code owner on March 22, 2024 22:27

Review comments (since resolved) were left on pandas/_libs/hashtable_class_helper.pxi.in and pandas/core/reshape/merge.py.
@WillAyd (Member) left a comment:

Thanks for the refactor - this structure makes a lot more sense

On pandas/_libs/hashtable_class_helper.pxi.in:

    if self.na_position == -1:
        continue
    if needs_resize(l.size, l.capacity):
        with gil:
Member commented:

You shouldn't need to reacquire the GIL here - was Cython giving an error?

@phofl (Member, Author) commented Mar 23, 2024:

Yep:

    Discarding owned Python object not allowed without gil

Resizing the ndarray needs the GIL (happy to be convinced otherwise if I am mistaken).

Member commented:

Huh, OK - I see that pattern elsewhere in the codebase. I think there is something simple being overlooked that requires the GIL, but let's leave that to a separate PR.

@phofl (Member, Author) commented:

> I think there is something simple being overlooked that requires the GIL

That would make me very happy as a Dask developer :)

@WillAyd (Member) left a comment:

nice work @phofl

@phofl added the "Reshaping" label (Concat, Merge/Join, Stack/Unstack, Explode) on Mar 24, 2024
@phofl added this to the 3.0 milestone on Mar 24, 2024
@phofl (Member, Author) commented Mar 24, 2024:

Thanks for your review! Merging to unblock follow-up work, but happy to address other comments in follow-ups.

@phofl merged commit 669ddfb into pandas-dev:main on Mar 24, 2024 (50 checks passed)
@phofl deleted the hash_join branch on March 24, 2024 00:41
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request on May 7, 2024