BUG: GroupBy().fillna() performance regression #37149

smithto1 · 2020-10-15T22:52:55Z

closes BUG: GroupBy().fillna() performance regression #36757
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

PR #30679 introduced a performance regression when grouping on an axis with duplicate labels. The regression can be avoided by skipping the call to get_indexer_non_unique when the axis is unchanged, as is the case for fillna.
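To illustrate why that call is costly on a duplicated axis, here is a minimal sketch (not the code from this PR). It relies only on the public `Index.get_indexer_non_unique`, `Index.equals` and `Index.has_duplicates`; the helper name `needs_realignment` is hypothetical and only mirrors the idea of skipping the lookup when the axis is unchanged.

```python
import pandas as pd

N = 50
idx = pd.Index([1] * N)  # an axis where every label is a duplicate

# Each of the N target labels matches all N positions, so the
# returned indexer holds N * N entries: the work grows quadratically.
indexer, missing = idx.get_indexer_non_unique(idx)
n_matches = len(indexer)


def needs_realignment(result_index: pd.Index, original_index: pd.Index) -> bool:
    """Hypothetical helper: report whether the expensive lookup is needed.

    If the result already carries the original axis (as after a fill),
    there is nothing to reorder and the lookup can be skipped.
    """
    return original_index.has_duplicates and not result_index.equals(original_index)


unchanged = needs_realignment(idx, idx)
reordered = needs_realignment(pd.Index([1, 2, 1]), pd.Index([1, 1, 2]))
```

With the values above, `n_matches` equals `N * N`, `unchanged` is `False` and `reordered` is `True`. The cheap `equals` check is what lets an operation that preserves the axis avoid the quadratic path.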

I don't know how to write a test for this fix, since it is purely a matter of speed. If there is a test pattern or other check that should be included, please point it out and I will add it.

In lieu of a test, running the minimal example from the issue report on the fixed branch shows the performance fix:

In [5]: import pandas as pd
   ...: import numpy as np
   ...: 
   ...: N = 2000
   ...: df = pd.DataFrame({"A": [1] * N, "B": [np.nan, 1.0] * (N // 2)})
   ...: df = df.sort_values("A").set_index("A")
   ...: 
   ...: %time df.groupby("A")["B"].fillna(method="ffill")
   ...:
Wall time: 1.03 ms
Out[5]:
A
1    NaN
1    1.0
1    1.0
1    1.0
1    1.0
    ...
1    1.0
1    1.0
1    1.0
1    1.0
1    1.0
Name: B, Length: 2000, dtype: float64

jreback · 2020-10-15T23:45:45Z

an asv is appropriate here (or, if we already have one that covers this, just show the results)

jreback · 2020-10-15T23:46:04Z

if not, you can add one along the lines of what I posted

jreback

pls add an asv or show an existing one that is caught by this

pep8speaks · 2020-10-16T23:59:08Z

Hello @smithto1! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-17 22:24:29 UTC

smithto1 · 2020-10-17T00:05:53Z

> pls add an asv or show an existing one that is caught by this

No existing one seemed to catch it so added a new one.

asv continuous --quick -f 1.1 -b groupby.FillNA upstream/master HEAD

(pandas-dev) C:\git\pandas-smithto1\asv_bench>python C:\anaconda3\envs\pandas-dev\Scripts\asv.exe continuous --quick -f 1.1 -b groupby.FillNA upstream/master HEAD
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing 5412aaa into conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
· Running 4 total benchmarks (2 commits * 1 environments * 2 benchmarks)
[ 0.00%] · For pandas commit 58dcafa (round 1/2):
[ 0.00%] ·· Building for conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 0.00%] ·· Benchmarking conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] · For pandas commit 5412aaa (round 1/2):
[ 25.00%] ·· Building for conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ·· Benchmarking conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] · For pandas commit 5412aaa (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 62.50%] ··· groupby.FillNA.time_df_ffill 5.28±0ms
[ 75.00%] ··· groupby.FillNA.time_srs_ffill 3.12±0ms
[ 75.00%] · For pandas commit 58dcafa (round 2/2):
[ 75.00%] ·· Building for conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ·· Benchmarking conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 87.50%] ··· groupby.FillNA.time_df_ffill 18.8±0s
[100.00%] ··· groupby.FillNA.time_srs_ffill 17.7±0s
       before           after         ratio
     [58dcafa]       [5412aaa]
       18.8±0s         5.28±0ms     0.00  groupby.FillNA.time_df_ffill
       17.7±0s         3.12±0ms     0.00  groupby.FillNA.time_srs_ffill

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
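
For reference, an asv benchmark of this kind could look roughly like the sketch below. The names `FillNA`, `time_df_ffill` and `time_srs_ffill` come from the output above; the data built in `setup` (the size and the column names `group` and `value`) is an assumption. The sketch calls `.ffill()` rather than `fillna(method="ffill")` so that it also runs on pandas versions where the `method` argument is no longer available, which means it does not necessarily exercise the same code path as the benchmark in this PR.

```python
import numpy as np
import pandas as pd


class FillNA:
    """Sketch of an asv benchmark class; asv runs setup() before timing
    each method whose name starts with time_."""

    def setup(self):
        N = 100  # assumed size, kept small so the benchmark stays quick
        # Two groups, both stored as duplicated index labels.
        self.df = pd.DataFrame(
            {"group": [1] * N + [2] * N, "value": [np.nan, 1.0] * N}
        ).set_index("group")

    def time_df_ffill(self):
        # Forward-fill every column within each group (DataFrame path).
        self.df.groupby("group").ffill()

    def time_srs_ffill(self):
        # Forward-fill a single selected column (Series path).
        self.df.groupby("group")["value"].ffill()
```

The important property is that the grouping key lives in an index with duplicate labels, since that is the case the regression affected.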

jreback

pls show the results of the asv

smithto1 · 2020-10-17T22:25:26Z

> pls show the results of the asv

(pandas-dev) C:\git\pandas-smithto1\asv_bench>python C:\anaconda3\envs\pandas-dev\Scripts\asv.exe continuous -f 1.1 -b groupby.FillNA upstream/master HEAD
       before           after         ratio
     [85793fb8]       [8a93e0b8]
     <issue36757~1^2>       <issue36757>
-     2.68±0.04ms      2.00±0.02ms     0.74  groupby.FillNA.time_df_bfill
-     2.68±0.01ms      1.99±0.05ms     0.74  groupby.FillNA.time_df_ffill
-     1.85±0.03ms      1.28±0.02ms     0.69  groupby.FillNA.time_srs_bfill
-     1.87±0.03ms      1.26±0.04ms     0.67  groupby.FillNA.time_srs_ffill

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

alippai · 2020-10-17T22:27:43Z

LGTM, thanks @smithto1 for fixing the issue I recently reported so swiftly

smithto1 · 2020-10-18T08:20:53Z

@jreback can you have another look?

jreback · 2020-10-18T14:56:36Z

thanks @smithto1

simonjayhawkins · 2020-10-18T17:01:23Z

@meeseeksdev backport 1.1.x

smithto1 · 2020-10-19T09:50:03Z

> LGTM, thanks @smithto1 for fixing the issue I recently reported so swiftly

@alippai Thanks for reporting the issue.

smithto1 added 3 commits October 15, 2020 23:26

pandas-dev#36757 fix for speed issue

8bc61db

whatsnew

1f8f593

merging master

8199ba4

jreback requested changes Oct 16, 2020

pandas/core/groupby/groupby.py

doc/source/whatsnew/v1.1.4.rst (outdated)

jreback added the Missing-data and Performance labels Oct 16, 2020

jreback added this to the 1.1.4 milestone Oct 16, 2020

jreback added the Groupby label Oct 16, 2020

smithto1 added 4 commits October 16, 2020 21:50

addressing comments

03bb0be

Merge remote-tracking branch 'upstream/master' into issue36757

045e068

added asv FillNA

19ddf0d

fixing FillNA

5412aaa

black

bafedb5

alippai approved these changes Oct 17, 2020

asv_bench/benchmarks/groupby.py

asv_bench/benchmarks/groupby.py (outdated)

jreback requested changes Oct 17, 2020

asv_bench/benchmarks/groupby.py

asv_bench/benchmarks/groupby.py (outdated)

smithto1 added 2 commits October 17, 2020 22:40

merging master

3272b9b

addressing comments

8a93e0b

jreback approved these changes Oct 18, 2020

jreback merged commit de10e72 into pandas-dev:master Oct 18, 2020

meeseeksmachine mentioned this pull request Oct 18, 2020

Backport PR #37149 on branch 1.1.x (BUG: GroupBy().fillna() performance regression) #37223

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Oct 18, 2020

Backport PR pandas-dev#37149: BUG: GroupBy().fillna() performance regression

2d237b6

simonjayhawkins pushed a commit that referenced this pull request Oct 19, 2020

Backport PR #37149: BUG: GroupBy().fillna() performance regression (#37223) Co-authored-by: Thomas Smith <thomassmith0304@gmail.com>

5d310b8

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Oct 26, 2020

BUG: GroupBy().fillna() performance regression (pandas-dev#37149)

cd11253

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

BUG: GroupBy().fillna() performance regression (pandas-dev#37149)

cce0ee2
