PERF: join/merge on subset of MultiIndex #48611

lukemanley · 2022-09-17T18:51:00Z

Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/v1.6.0.rst file if fixing a bug or adding a new feature.

Existing code passes a range to algos.take_nd. Passing an ndarray is faster.
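The effect can be approximated outside pandas. The sketch below is an illustration only: it uses NumPy's public `ndarray.take` as a stand-in for the internal `algos.take_nd`, and the array size and repeat count are arbitrary assumptions. A `range` indexer has to be converted to an array on every call, while an ndarray indexer is used as-is.

```python
import timeit

import numpy as np

# Stand-in data; the size is an arbitrary choice for illustration.
values = np.arange(200_000, dtype="int64") * 2
n = len(values)

range_indexer = range(n)      # a lazy Python range, converted on each take
array_indexer = np.arange(n)  # an integer ndarray built once up front

# Time both indexer kinds with the public ndarray.take (not pandas' take_nd).
t_range = timeit.timeit(lambda: values.take(range_indexer), number=3)
t_array = timeit.timeit(lambda: values.take(array_indexer), number=3)
print(f"range indexer:   {t_range:.4f}s")
print(f"ndarray indexer: {t_array:.4f}s")

# Both indexers select the same elements; only the cost differs.
same_result = np.array_equal(
    values.take(range_indexer), values.take(array_indexer)
)
```

Absolute timings depend on the machine, so the numbers printed here are not comparable to the asv results in this thread.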

       before           after         ratio
     [a712c501]       [00034d83]
     <main>           <multiindex-join-subset>
-    45.8±0.7ms       26.0±0.8ms      0.57  join_merge.JoinMultiindexSubset.time_join_multiindex_subset
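For context, the benchmark name suggests a join where one frame's index levels are a subset of the other's. A minimal sketch of that kind of call, with made-up data and column names (`col1`, `col2`, and the level names `a`, `b`, `c` are assumptions, not taken from the benchmark source):

```python
import pandas as pd

# Left frame is indexed by three levels, right frame by two of them.
left_index = pd.MultiIndex.from_arrays(
    [[0, 1, 2], [0, 1, 2], [0, 1, 2]], names=["a", "b", "c"]
)
right_index = pd.MultiIndex.from_arrays(
    [[0, 1, 2], [0, 1, 2]], names=["a", "b"]
)

left = pd.DataFrame({"col1": [1, 1, 1]}, index=left_index)
right = pd.DataFrame({"col2": [2, 2, 2]}, index=right_index)

# The join matches on the shared level names ("a" and "b"); the level
# missing from the right index ("c") has to be restored in the result.
joined = left.join(right)
print(joined)
```

Restoring the dropped level is the step touched by the diff in `pandas/core/reshape/merge.py`.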

phofl · 2022-09-17T19:31:44Z

pandas/core/reshape/merge.py

+        # for left and right respectively. If left/right is None then
+        # the join occurred on all indices of left/right
        if dropped_level_name in left.names:
+            if lindexer is None:


Why are you pulling this inside the loop? This is harder to read than before

lukemanley · 2022-09-18

Agreed. We can avoid calling take_nd entirely when an indexer is None, since None means "take everything". I pushed another commit that further improves the timings and should be clearer.

       before           after         ratio
     [a712c501]       [ad3f42b5]
     <main>           <multiindex-join-subset>
-    43.8±1ms         10.5±1ms        0.24  join_merge.JoinMultiindexSubset.time_join_multiindex_subset
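The short-circuit described above can be sketched as follows. This is a simplified illustration, not the code in `merge.py`: `restore_codes` is a hypothetical helper, NumPy's `take` stands in for `take_nd`, and treating `-1` in the indexer as "missing" is an assumption made for the example.

```python
import numpy as np


def restore_codes(codes, indexer):
    """Hypothetical helper: select level codes for the joined rows.

    A None indexer is taken to mean "every row, in original order",
    so the codes are returned untouched and no take is performed.
    """
    if indexer is None:
        return codes
    indexer = np.asarray(indexer)
    taken = codes.take(indexer)  # returns a new array
    taken[indexer == -1] = -1    # assumed convention: -1 marks a missing row
    return taken


codes = np.array([0, 1, 2, 1])

# No indexer: the original array comes back without any copy.
unchanged = restore_codes(codes, None)

# Explicit indexer: rows are reordered and the missing row is coded -1.
reordered = restore_codes(codes, np.array([2, -1, 0]))
print(reordered)
```

Skipping the take avoids both building an indexer and copying the codes, which is consistent with the further drop in the timing shown above.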

pandas/core/reshape/merge.py

mroeschke

LGTM. Merge when ready @phofl

phofl · 2022-09-20T17:21:21Z

thx @lukemanley

lukemanley added 2 commits September 17, 2022 14:36

multiindex join subset

00034d8

whatsnew

0ab2d01

phofl reviewed Sep 17, 2022

lukemanley added 2 commits September 17, 2022 23:13

simplify

ad3f42b

Merge remote-tracking branch 'upstream/main' into multiindex-join-subset

c01606d

mroeschke reviewed Sep 19, 2022

mroeschke added the Performance, Reshaping, and MultiIndex labels Sep 19, 2022

lukemanley added 2 commits September 19, 2022 19:50

clean comments

2b24442

Merge remote-tracking branch 'upstream/main' into multiindex-join-subset

cbf0f76

mroeschke approved these changes Sep 20, 2022

phofl approved these changes Sep 20, 2022

phofl merged commit 41ec469 into pandas-dev:main Sep 20, 2022

phofl added this to the 1.6 milestone Sep 20, 2022

phofl pushed a commit to phofl/pandas that referenced this pull request Sep 22, 2022

PERF: join/merge on subset of MultiIndex (pandas-dev#48611)

4d07469

lukemanley deleted the multiindex-join-subset branch September 24, 2022 00:48

mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022

noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022

PERF: join/merge on subset of MultiIndex (pandas-dev#48611)

a2f890d
