PERF: isin() is slower for categorical data than for integers #20003

Closed
vfilimonov opened this Issue Mar 5, 2018 · 2 comments

@vfilimonov
Contributor

vfilimonov commented Mar 5, 2018

Problem description

For long Series with many categories, 'Series.isin()' is slower on categorical data than on int64. If the categories are built from strings, the performance degradation is even larger.

import pandas as pd
import numpy as np

N = 3000000
Ncats = 100

cats = pd.Series(['abcdef%d'%_ for _ in range(Ncats)])

df = pd.DataFrame({'A': np.random.randn(N),
                   'B': np.random.randn(N),
                   'C': np.random.randint(0, Ncats, N),
                  })
df['D'] = cats.loc[df['C'].values].values
df['E'] = df['C'].astype('category')
df['F'] = df['D'].astype('category')

sel_codes = [1,2]
sel_cats = cats.loc[sel_codes].values

%timeit inds = df.C.isin(sel_codes)  # int64
%timeit inds = df.E.isin(sel_codes)  # category based on int64
%timeit inds = df.D.isin(sel_cats)  # object / string
%timeit inds = df.F.isin(sel_cats)  # category based on string

On my machine:

6.25 ms ± 412 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
28.7 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
104 ms ± 4.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
142 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Interestingly, if there are many values to compare against, categorical data is faster, e.g. for

sel_codes = range(90)
sel_cats = cats.loc[sel_codes].values

%timeit inds = df.C.isin(sel_codes)  # int64
%timeit inds = df.E.isin(sel_codes)  # category based on int64
%timeit inds = df.D.isin(sel_cats)  # object / string
%timeit inds = df.F.isin(sel_cats)  # category based on string

the timings are:

441 ms ± 61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
422 ms ± 68.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
147 ms ± 7.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
171 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
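
The timings above rely on IPython's %timeit magic. For anyone reproducing them in a plain Python session, here is a rough sketch using the standard-library timeit module. The helper name time_isin and its parameters are made up for illustration and are not part of the original report. The default sizes are scaled down, so absolute numbers will differ from those quoted.

```python
import timeit

import numpy as np
import pandas as pd


def time_isin(n_rows=100_000, n_cats=100, n_selected=2, repeat=3):
    """Hypothetical helper: best-of-`repeat` wall time (seconds) of isin() per dtype."""
    cats = pd.Series(['abcdef%d' % i for i in range(n_cats)])
    codes = np.random.randint(0, n_cats, n_rows)

    df = pd.DataFrame({'C': codes})               # int64
    df['D'] = cats.loc[codes].values              # object / string
    df['E'] = df['C'].astype('category')          # category based on int64
    df['F'] = df['D'].astype('category')          # category based on string

    sel_codes = list(range(n_selected))
    sel_cats = cats.loc[sel_codes].values

    cases = {
        'int64': (df['C'], sel_codes),
        'category (int64)': (df['E'], sel_codes),
        'object': (df['D'], sel_cats),
        'category (string)': (df['F'], sel_cats),
    }
    results = {}
    for name, (col, vals) in cases.items():
        # One call per measurement; keep the fastest of `repeat` runs.
        timings = timeit.repeat(lambda c=col, v=vals: c.isin(v),
                                number=1, repeat=repeat)
        results[name] = min(timings)
    return results


if __name__ == '__main__':
    for label, seconds in time_isin().items():
        print('%-18s %.2f ms' % (label, seconds * 1e3))
```

Calling time_isin(n_rows=3000000, n_selected=90) should correspond to the second setup in the report.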

P.S. I'm not sure whether such performance issues are worth filing.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: 0.7.1
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: 1.3.0
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0b10
sqlalchemy: 1.1.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger

Contributor

TomAugspurger commented Mar 5, 2018

Thanks for the report, these are absolutely worth filing.

In this case we'll want to get the index positions of the values in the categorical's categories (get_indexer should do the trick) and then pass the codes to algos.isin. We'll just have to be careful with missing values: a value that get_indexer does not find and a missing entry in the codes are both -1 by default.
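
A minimal sketch of the idea described above, using only public pandas and NumPy calls. It is not the eventual pandas implementation: the helper name categorical_isin is hypothetical, and np.isin stands in for the internal algos.isin. Positions equal to -1 are dropped before the comparison so that missing rows, whose code is also -1, never match by accident. Matching NaN itself (e.g. isin([np.nan])) is deliberately left out of this sketch.

```python
import numpy as np
import pandas as pd


def categorical_isin(cat_series, values):
    """Hypothetical helper: test membership on integer codes, not category values."""
    accessor = cat_series.cat
    # Position of each requested value within the categories;
    # get_indexer returns -1 for values that are not categories.
    positions = accessor.categories.get_indexer(list(values))
    # Drop the "not found" marker: codes also use -1, but for missing data.
    positions = positions[positions != -1]
    # Compare small integer codes instead of the (possibly string) values.
    mask = np.isin(np.asarray(accessor.codes), positions)
    return pd.Series(mask, index=cat_series.index)


s = pd.Series(['a', 'b', 'c', 'a', None], dtype='category')
print(categorical_isin(s, ['a', 'z']).tolist())
```

The lookup cost then depends on the number of categories and on integer comparisons over the codes, rather than on hashing every element of the column.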

@ma3axaka

Contributor

ma3axaka commented Mar 26, 2018

I'm looking at this issue.

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Apr 9, 2018

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Apr 24, 2018
