Skip to content

Conversation

@mproszewska
Copy link
Contributor

Insted of calling construction.array in check_array_indexer, creates array with dtype=bool before calling check_array_indexer.

Fixes performance regression after commit b9bcdc3.

setup = """
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(100000, 5))
bool_indexer = [True] * 50000 + [False] * 50000
"""
import timeit
timeit.timeit("df[bool_indexer]",setup=setup, number=1000)

# master
# 27.29123323399108
# now
# 8.814320757053792

@pep8speaks
Copy link

pep8speaks commented May 15, 2020

Hello @mproszewska! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-05-31 23:38:07 UTC

else:
if not is_array_like(result):
# GH 33924
result = pd_array(result, dtype=bool)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just np.array?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because of nullable input

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the input is nullable, wouldn't not is_array_like(result) be False?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that this array/list can have nullable elements.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add an asv that hits this case & a whatsnew note

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels May 18, 2020
@mproszewska
Copy link
Contributor Author

indexing.DataFrameNumericIndexing.time_bool_indexer looks smiliar to the case that linked issue describes, so I'm not sure if additional asv is needed

@jreback jreback added this to the 1.1 milestone May 25, 2020
@jreback
Copy link
Contributor

jreback commented May 25, 2020

can you run and post the indexing the asvs

@mproszewska
Copy link
Contributor Author

[ 50.00%] · For pandas commit e0a34f8f <perf-bool> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.6-Cython0.29.16-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 55.00%] ··· indexing.DataFrameNumericIndexing.time_bool_indexer                                                                                    1.29±0.3ms
[ 60.00%] ··· indexing.DataFrameNumericIndexing.time_iloc                                                                                              147±20μs
[ 65.00%] ··· indexing.DataFrameNumericIndexing.time_iloc_dups                                                                                         204±20μs
[ 70.00%] ··· indexing.DataFrameNumericIndexing.time_loc                                                                                               83.2±4μs
[ 75.00%] ··· indexing.DataFrameNumericIndexing.time_loc_dups                                                                                        4.73±0.3ms
[ 75.00%] · For pandas commit 5f26c342 <master^2> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython0.29.16-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 75.00%] ·· Benchmarking conda-py3.6-Cython0.29.16-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 80.00%] ··· indexing.DataFrameNumericIndexing.time_bool_indexer                                                                                    3.63±0.1ms
[ 85.00%] ··· indexing.DataFrameNumericIndexing.time_iloc                                                                                               116±6μs
[ 90.00%] ··· indexing.DataFrameNumericIndexing.time_iloc_dups                                                                                         180±20μs
[ 95.00%] ··· indexing.DataFrameNumericIndexing.time_loc                                                                                               85.1±9μs
[100.00%] ··· indexing.DataFrameNumericIndexing.time_loc_dups                                                                                        5.86±0.8ms

@mproszewska
Copy link
Contributor Author

I could change size of tested DataFrame, so that benchmarks would change more significantly.

@jreback
Copy link
Contributor

jreback commented May 31, 2020

I could change size of tested DataFrame, so that benchmarks would change more significantly.

sure as long as benchamarks are << 1s its ok (e.g. in ms). please post the actual asv run. otherwise looks fine. ping on green.

@mproszewska
Copy link
Contributor Author

After asv change

 before           after         ratio
     [5f26c342]       [e0a34f8f]
     <master^2>       <perf-bool>
-      26.3±0.3ms      7.27±0.03ms     0.28  indexing.DataFrameNumericIndexing.time_bool_indexer

@jreback
Copy link
Contributor

jreback commented Jun 1, 2020

lgtm ping on green

@mproszewska
Copy link
Contributor Author

lgtm ping on green

ping

@jreback jreback merged commit c6c273a into pandas-dev:master Jun 1, 2020
@jreback
Copy link
Contributor

jreback commented Jun 1, 2020

thanks @mproszewska

@mproszewska mproszewska deleted the perf-bool branch June 1, 2020 01:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Performance regression in DataFrame[bool_indexer]

4 participants