mzeitlin11
Member

Broken off a branch working towards #13745

This doesn't have a noticeable user-facing impact on its own, since it is only a smaller part of the merge operation. Some timings:

import numpy as np
import pandas._libs.join as libjoin
np.random.seed(0)
arr1 = np.random.randint(0, 100, 100000)
arr2 = np.random.randint(0, 100, 100000)

Master:

In [2]: %timeit libjoin.inner_join(arr1, arr2, 100)
1.19 s ± 17.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit libjoin.left_outer_join(arr1, arr2, 100)
1.22 s ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit libjoin.full_outer_join(arr1, arr2, 100)
1.26 s ± 33.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This PR:

In [2]: %timeit libjoin.inner_join(arr1, arr2, 100)
729 ms ± 17.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit libjoin.left_outer_join(arr1, arr2, 100)
714 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit libjoin.full_outer_join(arr1, arr2, 100)
715 ms ± 6.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
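
For a rough sense of scale, the ratios implied by the timings above can be worked out as below. This is only a minimal sketch: the dictionary and variable names are made up for illustration, and the values are simply the mean `%timeit` figures copied from the output above.

```python
# Mean timings copied from the %timeit output above, converted to seconds.
master = {"inner_join": 1.19, "left_outer_join": 1.22, "full_outer_join": 1.26}
this_pr = {"inner_join": 0.729, "left_outer_join": 0.714, "full_outer_join": 0.715}

# Speedup = time on master / time on this PR.
# (Illustrative arithmetic only; none of these names exist in pandas.)
speedups = {name: master[name] / this_pr[name] for name in master}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

On this synthetic input, each of the three functions comes out roughly 1.6x to 1.8x faster than on master.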

@mzeitlin11 mzeitlin11 added Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jun 16, 2021
@jreback
Contributor

jreback commented Jun 16, 2021

wow!

Member

@jbrockmendel jbrockmendel left a comment

LGTM

@jreback jreback added this to the 1.3 milestone Jun 17, 2021
@jreback
Contributor

jreback commented Jun 17, 2021

might as well backport this

@jreback
Contributor

jreback commented Jun 17, 2021

@meeseeksdev backport 1.3.x

@lumberbot-app

lumberbot-app bot commented Jun 17, 2021

Something went wrong ... Please have a look at my logs.

jreback pushed a commit that referenced this pull request Jun 17, 2021
Co-authored-by: Matthew Zeitlin <37011898+mzeitlin11@users.noreply.github.com>
@mzeitlin11 mzeitlin11 deleted the get_result_indexer branch June 17, 2021 17:07
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021