Boolean operations in groupby objects are extremely slow compared to numpy counterpart. #15435

Closed
daniel-severo opened this Issue Feb 17, 2017 · 3 comments

daniel-severo commented Feb 17, 2017

import numpy as np
import pandas as pd

%timeit -n 10000 np.random.choice([True, False], 10000).any()
%timeit -n 10000 np.random.choice([True, False], 10000).sum().astype(bool)
10000 loops, best of 3: 83.7 µs per loop
10000 loops, best of 3: 106 µs per loop

df = pd.DataFrame({
    "a": np.random.randint(0,20, 10000),
    "b": np.random.randint(0,20, 10000),
    "c": np.random.choice([True, False], 10000)
})

%timeit -n 100 df.groupby(["a", "b"])["c"].any()
%timeit -n 100 df.groupby(["a", "b"])["c"].sum().astype(bool)
100 loops, best of 3: 40.9 ms per loop
100 loops, best of 3: 1.46 ms per loop

Problem description

The issue here is that the any method for groupby objects seems to be extremely slow: 40.9 ms versus 1.46 ms in the timings above. It is actually faster to sum up all the boolean values and cast the result with .astype(bool). In numpy the two operations have similar timings, and there the any method is actually the faster one.
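For anyone wanting to use the workaround, here is a small sketch that checks the two expressions give the same answer on a boolean column. The seed and the variable names (`rng`, `via_any`, `via_sum`, `same`) are just for illustration, and it assumes the column contains only True/False with no missing values.

```python
import numpy as np
import pandas as pd

# Fixed seed so the check is reproducible (the seed itself is arbitrary).
rng = np.random.RandomState(42)
df = pd.DataFrame({
    "a": rng.randint(0, 20, 10000),
    "b": rng.randint(0, 20, 10000),
    "c": rng.choice([True, False], 10000),
})

grouped = df.groupby(["a", "b"])["c"]

# Direct per-group any() (the slow path reported in this issue).
via_any = grouped.any()

# Workaround: count the True values per group, then cast to bool.
# A group has at least one True exactly when its sum is non-zero.
via_sum = grouped.sum().astype(bool)

# Compare the underlying values; both results share the same group index.
same = bool((via_any.values == via_sum.values).all())
print(same)
```

The cast works because a non-zero count means at least one True in the group. With missing values or non-boolean data the two expressions may not agree, so this is only a sketch for the case described here.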

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: None
pip: 8.1.1
setuptools: None
Cython: None
numpy: 1.12.0
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.2.2
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: 2.45.0
pandas_datareader: None

Member

jorisvandenbossche commented Feb 17, 2017

@daniel-severo The reason for this difference is that for sum we have a specialized groupby version (in cython), and we don't have this for any. So in the case of any, the function is generally applied individually on each group, making it a lot slower.
But, if you or someone would be interested, I don't think it would be too hard to make such a specialized groupby version for any as well.
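To illustrate the idea of a specialized groupby version, here is a rough NumPy sketch of a single-pass grouped any that works on integer group codes instead of calling a function once per group. This is not the pandas implementation; `grouped_any` is a hypothetical helper, and the sketch assumes a pandas version that provides `GroupBy.ngroup()` to obtain the group codes.

```python
import numpy as np
import pandas as pd


def grouped_any(codes, values, ngroups):
    """Return a boolean array where out[g] is True if any row of group g is True.

    codes   : integer array, group id (0..ngroups-1) of each row
    values  : boolean array, one entry per row
    ngroups : total number of groups
    """
    out = np.zeros(ngroups, dtype=bool)
    # codes[values] picks the group id of every True row; mark those groups.
    out[codes[values]] = True
    return out


rng = np.random.RandomState(0)
df = pd.DataFrame({
    "a": rng.randint(0, 20, 10000),
    "b": rng.randint(0, 20, 10000),
    "c": rng.choice([True, False], 10000),
})

grouped = df.groupby(["a", "b"])

# ngroup() numbers the groups in the same (sorted) order as the result index.
codes = grouped.ngroup().values
result = grouped_any(codes, df["c"].values, grouped.ngroups)

# Reference result from the built-in per-group any().
expected = grouped["c"].any().values
print((result == expected).all())
```

The point of the sketch is that all groups are handled in one vectorized pass over the data, which is roughly what a Cython groupby kernel does, rather than slicing out each group and calling any on it separately.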

@jreback jreback added this to the Next Major Release milestone Feb 17, 2017

daniel-severo commented Feb 17, 2017

I see. I'll take a crack at it :)

Member

jorisvandenbossche commented Feb 26, 2017

@jreback jreback modified the milestones: Next Major Release, Next Minor Release Mar 29, 2017

@jreback jreback modified the milestones: Interesting Issues, Next Major Release Nov 26, 2017

@mroeschke mroeschke referenced this issue Jan 10, 2018

Open

PERF: Discrepancy in groupby methods #19165

4 of 7 tasks complete

@WillAyd WillAyd referenced this issue Feb 16, 2018

Merged

Cythonized GroupBy any #19722

4 of 4 tasks complete

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Feb 27, 2018
