Performance regression in DataFrame[bool_indexer] #33924

TomAugspurger · 2020-05-01T15:45:02Z

import numpy as np
import pandas as pd

idx_dupe = np.array(range(30)) * 99
df = pd.DataFrame(np.random.randn(10000, 5))
bool_indexer = [True] * 5000 + [False] * 5000


%timeit df[bool_indexer]

# 1.0.2
2.58 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# master
5.47 ms ± 192 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Note that this only affects the case where the indexer is a list. Performance looks fine when an ndarray is passed.
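To check the list-versus-ndarray claim outside IPython, a minimal sketch along these lines could be used. It relies on the standard-library `timeit` module instead of `%timeit`; the names `bool_list`, `bool_array`, and the loop count `n` are illustrative choices, not part of the original report, and absolute timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 5))
bool_list = [True] * 5000 + [False] * 5000   # list indexer (the reported slow path)
bool_array = np.array(bool_list)             # ndarray indexer (reported as fine)

# Both indexers select the same rows; only the container type differs.
assert df[bool_list].equals(df[bool_array])

n = 200  # illustrative loop count
t_list = timeit.timeit(lambda: df[bool_list], number=n) / n
t_array = timeit.timeit(lambda: df[bool_array], number=n) / n

print(f"list indexer:    {t_list * 1e3:.3f} ms per loop")
print(f"ndarray indexer: {t_array * 1e3:.3f} ms per loop")
```

On an affected build the list timing should sit noticeably above the ndarray timing; on an unaffected build the two should be close.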

The asv history at https://pandas.pydata.org/speed/pandas/index.html#indexing.DataFrameNumericIndexing.time_bool_indexer?commits=80d37adcc3d9bfbbe17e8aa626d6b5873465ca98-4f89c261f624305fc7bae6c43ae862663994be34 points to somewhere in 80d37ad..4f89c26, a range that contains a bunch of commits.


rohitkg98 · 2020-05-02T07:48:54Z

The performance issue seems to be in check_array_indexer:

pandas/pandas/core/indexers.py

Line 348 in e835b76

def check_array_indexer(array: AnyArrayLike, indexer: Any) -> Any:

That function calls array:

pandas/pandas/core/construction.py

Line 56 in 911e19b

def array(

which in turn calls infer_dtype twice.
The regression seems to have started with PR #31150.
The performance also seems to have worsened just a tiny bit after this commit.
I used cProfile for the analysis. I apologise for any mistakes; I'm still a novice dev.
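For anyone who wants to repeat this kind of analysis, a cProfile run might look roughly like the sketch below. This is not the exact script used for the comment above; the loop count and the number of rows printed are arbitrary, and which functions show up (for example check_array_indexer or infer_dtype) depends on the pandas version installed.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 5))
bool_indexer = [True] * 5000 + [False] * 5000

# Profile repeated list-based boolean indexing.
profiler = cProfile.Profile()
profiler.enable()
for _ in range(200):  # arbitrary repeat count to get stable numbers
    df[bool_indexer]
profiler.disable()

# Sort by cumulative time so the expensive call chains appear first.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(15)  # show only the top 15 entries
report = stream.getvalue()
print(report)
```

Reading the cumulative column from the top down shows which calls under `DataFrame.__getitem__` account for most of the time.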

#34199)

* PERF: Remove unnecessary copies in sorting functions
* PERF: Create array from list with given dtype=bool
* Run black
* Run tests
* Run tests
* Run tests
* Fix imports
* Add asv
* Run black
* Remove asv
* Add requested changes
* Run black
* Delete newline
* Fix whatsnew
* Add requested changes
* Fix
* Fix
* Fix typo
* Fix
* Update asv

Co-authored-by: mproszewska <magdalena.proszewska@gmail.com>
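The commit message above mentions "Create array from list with given dtype=bool". The sketch below illustrates that general idea only; it is not the actual patch from #34199. It compares converting a list of bools with an explicit dtype against letting `pd.array` infer the dtype. The variable names and loop count are illustrative.

```python
import timeit

import numpy as np
import pandas as pd

bool_list = [True] * 5000 + [False] * 5000

# Explicit dtype: NumPy converts the list directly to a boolean array.
explicit = np.asarray(bool_list, dtype=bool)

# Inferred dtype: pd.array has to inspect the values to choose a dtype.
inferred = pd.array(bool_list)

# Both routes describe the same boolean mask.
assert np.array_equal(explicit, np.asarray(inferred, dtype=bool))

n = 200  # illustrative loop count
t_explicit = timeit.timeit(lambda: np.asarray(bool_list, dtype=bool), number=n) / n
t_inferred = timeit.timeit(lambda: pd.array(bool_list), number=n) / n

print(f"explicit dtype=bool: {t_explicit * 1e6:.1f} µs per loop")
print(f"inferred dtype:      {t_inferred * 1e6:.1f} µs per loop")
```

The relative cost of the two conversions depends on the pandas and NumPy versions in use, so the numbers are only indicative.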

TomAugspurger added the Indexing, Performance, and Needs Triage labels May 1, 2020

TomAugspurger added this to the 1.1 milestone May 1, 2020

TomAugspurger added the Regression label May 1, 2020

mroeschke removed the Needs Triage label May 11, 2020

Zaharid mentioned this issue May 13, 2020 in BUG: Performance regression in categorical indexer #34162 (closed)

mproszewska mentioned this issue May 15, 2020 in PERF: Fixes performance regression in DataFrame[bool_indexer] (#33924) #34199 (merged)

jreback closed this as completed in #34199 Jun 1, 2020
