groupby().first() much slower with a str column present in the data. #19283

Closed · jdmarino opened this issue Jan 17, 2018 · 12 comments
Labels: Performance (memory or execution speed), Regression (functionality that used to work in a prior pandas version)

@jdmarino commented Jan 17, 2018

(I copied this code from a Jupyter notebook.)

import sys

import numpy as np
import pandas as pd

pd.options.display.max_rows = 10
print('pandas version', pd.__version__)
print('python version', sys.version)

# pandas version 0.22.0
# python version 3.6.3 |Anaconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)]

msgs = pd.DataFrame({'orderid': np.random.random_sample(size=100000),
                     'qty': np.random.random_sample(size=100000)})
msgs['date'] = '1900-01-01'
msgs['textcol'] = 'lorem ipsum etc'
msgs.info()

# omits textcol from the data: takes 59 ms
g = msgs[['date', 'orderid', 'qty']].groupby(['date', 'orderid'])
%time orders = g.first()
orders.info(null_counts=True)

# includes textcol in the data: takes 10.6 s
g = msgs.groupby(['date', 'orderid'])
%time orders = g.first()
orders.info(null_counts=True)

Problem description

I find that the presence of a text column in a dataframe's data (i.e. in the values, not in the groupby keys) dramatically slows down groupby().first() in version 0.22 (but not 0.21.1), by two orders of magnitude: the operation takes 59 ms without a text column present in the data and 10.6 s when one is. (The problem is not limited to this kind of made-up data; I discovered it in my own work after upgrading pandas.)

Expected Output

When I run the same code under 0.21.1 the times are 55 ms and 67 ms.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 44 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: 2.7.3.2 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger (Contributor) commented Jan 17, 2018

I'm unable to reproduce your slowdown:

On 0.21.1:

In [6]: %timeit g1.first()
2.63 ms ± 51.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit g2.first()
7.96 ms ± 71.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: pd.__version__
Out[8]: '0.21.1'

On 0.22.0:

In [8]: g1 = msgs[['date','orderid','qty']].groupby(['date','orderid'])

In [9]: g2 = msgs.groupby(['date', 'orderid'])

In [10]: %timeit g1.first()
2.49 ms ± 70.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: %timeit g2.first()
5.07 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]: pd.__version__
Out[12]: '0.22.0'

Are there any other differences in your environments? 0.22.0 only contained one change relative to 0.21.1, and I'd be surprised if it had an impact on GroupBy.first.

Edit: Whoops, I'm apparently unable to read. I missed the ms vs. s in the second groupby. Consider me surprised then.

@jdmarino (Author)

It might be worth removing your line "I'm unable to reproduce your slowdown". People like me might stop reading right there, assuming the OP is a crackpot. ;-)

@chris-b1 (Contributor)

Thanks @jdmarino - it seems the min_count change in #18921 is causing the groupby dispatch to the right Cython method to fail. Fortunately/unfortunately there's a fallback, so the tests didn't pick it up. As a much faster workaround you can use nth(0):

In [18]: g2 = msgs.groupby(['date', 'orderid'])

In [19]: %timeit g2.nth(0)
64.2 ms ± 463 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [16]: from pandas.testing import assert_frame_equal

In [17]: assert_frame_equal(g2.nth(0), g2.first())

@TomAugspurger - looks like the 1 in this line is the issue?

'f': lambda func, a, b, c, d, e: func(a, b, c, d, 1, -1)

# Calling the Cython aggregation path directly surfaces the TypeError that
# GroupBy.first() would otherwise swallow before falling back to a slow
# pure-Python path. `fail` stands in for the fallback and is never reached.
def fail():
    raise Exception

g2 = msgs.groupby(['date', 'orderid'])

g2._cython_agg_general('first', alt=fail, numeric_only=False, min_count=1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-220c58451090> in <module>()
----> 1 g2._cython_agg_general('first', alt=fail, numeric_only=False, min_count=1)

~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
   3607                             min_count=-1):
   3608         new_items, new_blocks = self._cython_agg_blocks(
-> 3609             how, alt=alt, numeric_only=numeric_only, min_count=min_count)
   3610         return self._wrap_agged_blocks(new_items, new_blocks)
   3611 

~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in _cython_agg_blocks(self, how, alt, numeric_only, min_count)
   3651             try:
   3652                 result, _ = self.grouper.aggregate(
-> 3653                     block.values, how, axis=agg_axis, min_count=min_count)
   3654             except NotImplementedError:
   3655                 # generally if we have numeric_only=False

~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in aggregate(self, values, how, axis, min_count)
   2312     def aggregate(self, values, how, axis=0, min_count=-1):
   2313         return self._cython_operation('aggregate', values, how, axis,
-> 2314                                       min_count=min_count)
   2315 
   2316     def transform(self, values, how, axis=0):

~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in _cython_operation(self, kind, values, how, axis, min_count)
   2268             result = self._aggregate(
   2269                 result, counts, values, labels, func, is_numeric,
-> 2270                 is_datetimelike, min_count)
   2271         elif kind == 'transform':
   2272             result = _maybe_fill(np.empty_like(values, dtype=out_dtype),

~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in _aggregate(self, result, counts, values, comp_ids, agg_func, is_numeric, is_datetimelike, min_count)
   2330                          min_count)
   2331         else:
-> 2332             agg_func(result, counts, values, comp_ids, min_count)
   2333 
   2334         return result

~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in wrapper(*args, **kwargs)
   2169 
   2170                 def wrapper(*args, **kwargs):
-> 2171                     return f(afunc, *args, **kwargs)
   2172 
   2173                 # need to curry our sub-function

~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in <lambda>(func, a, b, c, d, e)
   2113             'first': {
   2114                 'name': 'group_nth',
-> 2115                 'f': lambda func, a, b, c, d, e: func(a, b, c, d, 1, -1)
   2116             },
   2117             'last': 'group_last',

pandas/_libs/groupby.pyx in pandas._libs.groupby.group_nth_object()

TypeError: group_nth_object() takes exactly 5 positional arguments (6 given)

@chris-b1 added the Performance and Regression labels Jan 17, 2018
@chris-b1 added this to the 0.23.0 milestone Jan 17, 2018
@TomAugspurger (Contributor) commented Jan 17, 2018

Thanks @chris-b1. I added a min_count to group_nth_ in groupby_helper, but group_nth_object is defined separately, and so lost it. Adding an unused min_count to group_nth_object will fix things.

@lv10 commented Jan 29, 2018

Hi,

I would like to add that we are having the same issue. We have a block of code that groups by a column and then calls first, as follows:

>>> groups = self.df.groupby(['col_name'])
>>> firsts = groups.first()

Under pandas 0.21 it takes less than 3 seconds.
Under pandas 0.22 it takes between 2 and 3 minutes.
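
Until the fix lands, the nth(0) workaround shown above applies to this pattern too; a minimal sketch, reusing self.df and 'col_name' from the snippet (note that nth(0) takes the literal first row per group, so it matches first() only when the grouped columns contain no missing values):

>>> # Workaround sketch: nth(0) avoids the slow fallback path that
>>> # first() hits in 0.22.0 when object-dtype columns are present.
>>> groups = self.df.groupby(['col_name'])
>>> firsts = groups.nth(0)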

@albertvillanova (Contributor)

The same performance issue is also present for GroupBy.last.

@TomAugspurger (Contributor)

Anybody interested in submitting a fix? It'll involve touching some Cython, but the only change is adding a min_count parameter to

def group_nth_object(ndarray[object, ndim=2] out, ...

in pandas/_libs/groupby.pyx, and asserting that it's always -1.
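
A minimal sketch of what that change might look like; the parameter names other than out are assumptions based on the call sites in the traceback above, and the existing function body would be unchanged:

def group_nth_object(ndarray[object, ndim=2] out,
                     ndarray[int64_t] counts,
                     ndarray[object, ndim=2] values,
                     ndarray[int64_t] labels,
                     int64_t rank,
                     Py_ssize_t min_count=-1):  # new, unused parameter
    # min_count exists only so the shared dispatch lambda can pass the same
    # argument list to every 'first'/'nth' kernel; the object-dtype
    # implementation does not use it.
    assert min_count == -1, 'min_count not supported for object dtype'
    ...  # existing body unchanged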

@jreback (Contributor) commented Jan 30, 2018

http://pandas.pydata.org/pandas-docs/stable/contributing.html

@sursu (Contributor) commented Apr 11, 2018

A friendly question:
Has this problem been resolved, or does the recommended solution remain to use .nth(0)?

Tears fall down my face due to how slow .first() and .last() are.
I have Anaconda 5.1 installed, and pd.__version__ returns '0.22.0'.

I could use .nth(0) instead of .first(), but I am not quite sure what to do with .last(), apart from doing the computations on DataFrame.as_matrix() with numpy.
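
A minimal sketch of the symmetric workaround for .last(), with made-up data; the caveat is that .nth(-1) returns the literal last row of each group, while .last() returns the last non-null value per column, so the two differ when groups contain NaN:

import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1.0, 2.0, 3.0]})
lasts = df.groupby('key').nth(-1)  # fast path; matches .last() here because there are no NaNs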

@chris-b1 (Contributor) commented Apr 11, 2018 via email

@sursu (Contributor) commented Apr 11, 2018

Opened an issue here: #20657


I also find that .last() does not behave as I expected (this might be related; I'm not sure whether I need to open a new issue):

Example:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[179293473, '2016-06-01 00:00:03.549745', 'http://www.dr.dk/nyheder/', 39169523],
     [179293473, '2016-06-01 00:04:22.346018', 'https://www.information.dk/indland/2016/05/hvert-tredje-offer-naar-anmelde-voldtaegt-tide', 39125224],
     [179773461, '2016-06-01 22:13:16.588146', 'https://www.google.dk', 31658124],
     [179773461, '2016-06-01 22:14:04.059781', 'https://www.google.dk', 31658124],
     [179773461, '2016-06-01 22:16:37.230587', np.nan, 31658124],
     [179773461, '2016-06-01 22:23:09.847149', 'https://www.google.dk', 32718401],
     [179773461, '2016-06-01 22:23:55.158929', np.nan, 32718401],
     [179773461, '2016-06-01 22:27:00.857224', np.nan, 32718401]],
    columns=['SessionID', 'PageTime', 'ReferrerURL', 'PageID'])

Now, when I run:

df.groupby('SessionID').last()

I get:

           PageTime                    ReferrerURL                                         PageID
SessionID
179293473  2016-06-01 00:04:22.346018  https://www.information.dk/indland/2016/05/hve...  39125224
179773461  2016-06-01 22:27:00.857224  https://www.google.dk                               32718401

When, in fact, I was expecting the same result as obtained from:

df.groupby('SessionID').nth(-1)

           PageID    PageTime                    ReferrerURL
SessionID
179293473  39125224  2016-06-01 00:04:22.346018  https://www.information.dk/indland/2016/05/hve...
179773461  32718401  2016-06-01 22:27:00.857224  NaN
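
The mismatch above follows from missing-value handling rather than a second bug: within each group, .last() takes the last non-null value per column, while .nth(-1) takes the literal last row. A minimal illustration with made-up data:

import numpy as np
import pandas as pd

d = pd.DataFrame({'k': [1, 1], 'v': ['x', np.nan]})
print(d.groupby('k').last())    # v is 'x': the last non-null value per column
print(d.groupby('k').nth(-1))   # v is NaN: the literal last row of the group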

And while we are at .nth(): why does it mix up my column order?

@chris-b1 (Contributor) commented Apr 11, 2018 via email

This issue was closed.