BUG: GroupBy().fillna() performance regression #36757

alippai · 2020-10-01T01:27:16Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

import pandas as pd
import numpy as np

N = 2000
df = pd.DataFrame({"A": [1] * N, "B": [np.nan, 1.0] * (N // 2)})
df = df.sort_values("A").set_index("A")

df["B"] = df.groupby("A")["B"].fillna(method="ffill")

Problem description

The groupby + fillna call becomes extremely slow as N increases.
This is a regression between 1.0.5 and 1.1.0.

Note: if I remove the .set_index("A") it's fast again.

Expected Output

Same output, just faster.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : d9fff27
python : 3.7.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.110-1.el7.elrepo.x86_64
Version : #1 SMP Fri Jan 5 11:35:48 EST 2018
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 49.6.0.post20200917
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None


MarcoGorelli · 2020-10-01T06:45:53Z

Thanks @alippai for the report, can confirm this reproduces

on master:

In [5]: import pandas as pd 
   ...: import numpy as np 
   ...:  
   ...: N = 2000 
   ...: df = pd.DataFrame({"A": [1] * N, "B": [np.nan, 1.0] * (N // 2)}) 
   ...: df = df.sort_values("A").set_index("A") 
   ...: %time df.groupby("A")["B"].fillna(method="ffill")                       
CPU times: user 1.09 s, sys: 571 ms, total: 1.66 s
Wall time: 1.66 s
Out[5]: 
A
1    NaN
1    1.0
1    1.0
1    1.0
1    1.0
    ... 
1    1.0
1    1.0
1    1.0
1    1.0
1    1.0
Name: B, Length: 2000, dtype: float64

on 1.0.5:

In [8]: import pandas as pd 
   ...: import numpy as np 
   ...:  
   ...: N = 2000 
   ...: df = pd.DataFrame({"A": [1] * N, "B": [np.nan, 1.0] * (N // 2)}) 
   ...: df = df.sort_values("A").set_index("A") 
   ...:  
   ...: %time df.groupby("A")["B"].fillna(method="ffill")                       
CPU times: user 3.99 ms, sys: 0 ns, total: 3.99 ms
Wall time: 3.39 ms
Out[8]: 
A
1    NaN
1    1.0
1    1.0
1    1.0
1    1.0
    ... 
1    1.0
1    1.0
1    1.0
1    1.0
1    1.0
Name: B, Length: 2000, dtype: float64

alippai · 2020-10-01T12:36:16Z

ffill() is fast, but the output is different: #34725

alippai · 2020-10-01T15:32:07Z

For larger N (starting from 10k) this never completes. Can we consider adding back the Bug label? It looks like quadratic complexity or worse.
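One way to check the "quadratic or worse" suspicion is to time the call at several sizes and look at the slope on a log-log scale: a slope near 1 suggests linear growth, near 2 quadratic. The sketch below is only an illustration. The helper names `measure` and `scaling_exponents` are made up for this example and are not part of pandas, and the numbers in the usage lines are synthetic, not measurements from this issue.

```python
import math
import time
from typing import Callable, Iterable, List, Tuple


def measure(func: Callable[[int], object], sizes: Iterable[int]) -> List[Tuple[int, float]]:
    """Run func(n) once per size and return (n, wall-clock seconds) pairs."""
    results = []
    for n in sizes:
        start = time.perf_counter()
        func(n)
        results.append((n, time.perf_counter() - start))
    return results


def scaling_exponents(timings: Iterable[Tuple[int, float]]) -> List[float]:
    """Estimate the growth exponent between consecutive (n, seconds) pairs.

    For t ~ c * n**k, the ratio log(t2 / t1) / log(n2 / n1) equals k.
    All timings must be strictly positive.
    """
    pairs = sorted(timings)
    return [
        math.log(t2 / t1) / math.log(n2 / n1)
        for (n1, t1), (n2, t2) in zip(pairs, pairs[1:])
    ]


# Synthetic timings that quadruple each time n doubles, i.e. quadratic growth.
synthetic = [(1000, 1.0), (2000, 4.0), (4000, 16.0)]
exponents = scaling_exponents(synthetic)
```

To apply this to the report, one could wrap the reproducer in a function taking N and pass it to `measure` with sizes such as 500, 1000, 2000, keeping N small enough that the slow version still finishes.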

erfannariman · 2020-10-01T21:33:02Z

Running the profiler gives:

         243385 function calls (235532 primitive calls) in 10.477 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 1632/421    7.849    0.005    7.882    0.019 {built-in method numpy.core._multiarray_umath.implement_array_function}
        1    2.054    2.054    9.936    9.936 {method 'get_indexer_non_unique' of 'pandas._libs.index.IndexEngine' objects}
      375    0.048    0.000    0.048    0.000 {built-in method marshal.loads}
      377    0.042    0.000    0.042    0.000 {method 'read' of '_io.BufferedReader' objects}
        1    0.042    0.042    0.042    0.042 {method 'unique' of 'pandas._libs.hashtable.Int64HashTable' objects}
    83/81    0.037    0.000    0.039    0.000 {built-in method _imp.create_dynamic}
        1    0.022    0.022   10.002   10.002 groupby.py:1167(_concat_objects)
      410    0.020    0.000    0.020    0.000 {built-in method builtins.compile}
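The comment does not say how the profile was collected. A listing in this shape (sorted by internal time, top few rows) can be produced with the standard library's `cProfile` and `pstats`, so the sketch below assumes that approach. The helper name `profile_by_internal_time` is hypothetical.

```python
import cProfile
import io
import pstats


def profile_by_internal_time(func, *args, limit=8, **kwargs):
    """Profile one call of func and return the report as a string.

    The report is sorted by "tottime" (time spent inside each function,
    excluding callees) and truncated to the first `limit` rows.
    """
    profiler = cProfile.Profile()
    profiler.runcall(func, *args, **kwargs)

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("tottime").print_stats(limit)
    return stream.getvalue()


# Example with a cheap built-in call; any callable can be profiled this way.
report = profile_by_internal_time(sorted, [3, 1, 2])
```

For the case in this issue, the callable would be a small function that runs the `df.groupby("A")["B"].fillna(method="ffill")` line from the report on a pandas version that still accepts that call.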

smithto1 · 2020-10-12T09:56:19Z

> ffill() is fast, but the output is different: #34725

@alippai #34725 is fixed now if that helps

smithto1 · 2020-10-15T22:08:15Z

take

smithto1 · 2020-10-15T22:37:01Z

It seems the regression was introduced in #30679

alippai added the Bug and Needs Triage labels Oct 1, 2020

MarcoGorelli added the Performance label and removed the Bug and Needs Triage labels Oct 1, 2020

github-actions bot assigned smithto1 Oct 15, 2020

smithto1 added a commit to smithto1/pandas that referenced this issue Oct 15, 2020

pandas-dev#36757 fix for speed issue

8bc61db

smithto1 mentioned this issue Oct 15, 2020 in pull request "BUG: GroupBy().fillna() performance regression #37149" (merged)

jreback added this to the 1.1.4 milestone Oct 16, 2020

jreback added the Groupby and Missing-data labels Oct 16, 2020

jreback closed this as completed in #37149 Oct 18, 2020
