performance of Series label-indexing with a list of labels #16285

Closed
azag0 opened this Issue May 8, 2017 · 5 comments

azag0 commented May 8, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd
from numpy import random
dct = dict(zip(range(1000), random.randint(1000, size=1000)))
keys = random.randint(1000, size=1000000).tolist()
%timeit [dct[k] for k in keys]
# 86.2 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
sdct = pd.Series(dct)
%timeit sdct[keys]
# 673 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Problem description

I would expect the Series performance to be comparable to, if not faster than, the Python comprehension.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 35.0.2
Cython: None
numpy: 1.12.1
scipy: 0.19.0
statsmodels: 0.6.1
xarray: None
IPython: 6.0.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
matplotlib: 2.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
boto: None
pandas_datareader: None


chris-b1 (Contributor) commented May 8, 2017

Yeah, this is slow. Note that it is much faster if the keys are wrapped in an array.

In [11]: %timeit [dct[k] for k in keys]
10 loops, best of 3: 79.5 ms per loop

In [12]: %timeit sdct[keys]
1 loop, best of 3: 645 ms per loop

In [13]: %timeit sdct[np.array(keys)]
10 loops, best of 3: 65.1 ms per loop

The bottleneck seems to be in clean_index_list.
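One way to see where the time goes is to profile the lookup directly. The sketch below uses `cProfile` from the standard library; note that the hot-path function names (such as `clean_index_list`) are specific to pandas of this era, so newer versions will show different symbols in the output.

```python
import cProfile

import numpy as np
import pandas as pd

# Reproduce the setup from the issue (smaller key list to keep it quick).
dct = dict(zip(range(1000), np.random.randint(1000, size=1000)))
keys = np.random.randint(1000, size=100_000).tolist()
sdct = pd.Series(dct)

# Profile the list-keyed lookup; the cumulative-time column shows which
# internal step dominates (on pandas 0.19/0.20, the index conversion
# rather than the actual value gather).
pr = cProfile.Profile()
with pr:
    result = sdct.loc[keys]
pr.print_stats("cumulative")
```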


azag0 commented May 8, 2017

Ah, I see, good catch; I could have tried that.

That makes the Series lookup comparable in speed to the comprehension. Shouldn't it be significantly faster, though? I assume the comprehension is interpreted, whereas the Series lookup is a single call into the C extension. Or does it come down to the relative efficiency of pandas's and CPython's hash tables?

chris-b1 (Contributor) commented May 8, 2017

The bulk of the time in this operation is actually spent placing the new values, not in the hash table lookups. Below I skip the hash table lookup in both cases:

In [154]: vals = list(dct.values())

# python
In [157]: %timeit [vals[x] for x in keys]
10 loops, best of 3: 64.4 ms per loop

In [158]: %timeit [dct[k] for k in keys]
10 loops, best of 3: 80.5 ms per loop
# 16.1 ms of "overhead"

# pandas / numpy
In [160]: %timeit sdct.values.take(keys)
10 loops, best of 3: 52.2 ms per loop

In [159]: %timeit sdct[np.array(keys)]
10 loops, best of 3: 62.3 ms per loop
# 10.1 ms of "overhead"

You are right that these are C operations that avoid Python overhead, but the comprehension's dict and list lookups are themselves basic Python ops on optimized data structures, so there is not as much pickup as you might guess.
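The two pandas-side steps being timed separately above can also be pulled apart explicitly with public API, which makes the "hash table lookup vs. placement" split concrete (a sketch using `Index.get_indexer` rather than pandas internals):

```python
import numpy as np
import pandas as pd

dct = dict(zip(range(1000), np.random.randint(1000, size=1000)))
keys = np.random.randint(1000, size=100_000)
sdct = pd.Series(dct)

# Step 1 (hash table): map each label to its integer position in the index.
positions = sdct.index.get_indexer(keys)

# Step 2 (placement): gather the underlying values at those positions.
result = sdct.to_numpy().take(positions)

# Same answer as the indexing expression benchmarked above.
assert (result == sdct.loc[keys].to_numpy()).all()
```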

jorisvandenbossche (Member) commented May 8, 2017

Bottleneck seems to be in clean_index_list

This is indeed the case:

In [29]: %timeit pd._libs.lib.clean_index_list(keys)
1 loop, best of 3: 532 ms per loop

@chris-b1 do you understand the purpose of _ensure_index / clean_index_list in this case? (or @jreback?)
I mean, why does it check, for each item in the list, whether that item is a list?
This check seems to have been introduced in 27e34a4 (at that point still in Python), but that was for setting an axis, not getting. If that is the only use case, we could certainly split this off from the current _ensure_index so it is not used for getting.

jreback added a commit to jreback/pandas that referenced this issue May 9, 2017

@jreback jreback added this to the 0.20.2 milestone May 9, 2017

jreback (Contributor) commented May 9, 2017

So you can look at #16295, but this is actually quite subtle. You cannot simply np.asarray things; otherwise numpy converts them oddly in some cases. This is why np.array works above: it's exactly what we do when we can convert cleanly (e.g. a list of integers).
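A couple of examples of the odd conversions being referred to; this is standard NumPy coercion behavior, shown here for illustration:

```python
import numpy as np

# Mixed-type label lists get coerced to a common dtype, silently
# changing the labels themselves:
a = np.asarray([1, 2, 'a'])
assert a.dtype.kind == 'U'   # everything became a string

b = np.asarray([1, 2, 3.5])
assert b.dtype.kind == 'f'   # the integers became floats

# A homogeneous integer list converts cleanly, which is why wrapping
# the keys in np.array was both safe and fast above:
c = np.asarray([1, 2, 3])
assert c.dtype.kind == 'i'
```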

@jreback jreback closed this in #16295 May 9, 2017

jreback added a commit that referenced this issue May 9, 2017

pawroman added a commit to pawroman/pandas that referenced this issue May 9, 2017

pcluo added a commit to pcluo/pandas that referenced this issue May 22, 2017

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue May 29, 2017

PERF: fix clean_index_list perf (#16295)
closes #16285
(cherry picked from commit ce4eef3)

TomAugspurger added a commit that referenced this issue May 30, 2017


stangirala added a commit to stangirala/pandas that referenced this issue Jun 11, 2017
