PERF: PeriodIndex.size #14822
MaximilianR referenced this issue in pydata/xarray on Dec 8, 2016: PERF: Use len rather than size #1157 (Merged)
chris-b1 added the Performance, Period, and Regression labels on Dec 8, 2016
@sinhrks thoughts?

I am going to release 0.19.2 in a few days, so if there is a PR, it could maybe still be included.

I think changing the definition of

@sinhrks do you know of any other properties or methods like this? In this case,

Should all of the methods of
More generally, is it dangerous that calling

so for
So changing the implementation would be fine (though any errors that show up need to be looked at), and potential perf comparisons...

OK, thanks @jreback.
I think the problem is bigger than I imagined - a shallow copy takes 142ms, and even a basic lookup takes 1.4ms:
In [1]: import pandas as pd
In [2]: index=pd.PeriodIndex(start='2000', periods=50000, freq='D')
In [3]: %timeit index._shallow_copy()
1 loop, best of 3: 162 ms per loop
In [4]: %timeit index._shallow_copy(values=index._values)
The slowest run took 5.87 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 14.1 µs per loop
In [6]: all(index._shallow_copy(values=index._values) == index._shallow_copy())
Out[6]: True
So almost 1000x slower than
In [13]: index = pd.Int64Index(range(0,50000))
In [14]: %timeit index.get_loc(index[500])
The slowest run took 475.16 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.83 µs per loop
@jreback & @sinhrks do you have any suggestions for the most efficient way to solve this? I had planned to replace
FWIW this makes the latest pandas unusable in our environment - the speed has fallen by a multiple, given how much we use

There is probably some boxing going on. This should be similar speed to a DTI index. The _shallow_copy is the tip-off.

Boxing everywhere! For
I think the core issue is that lots of places we rely on

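To make the boxing cost discussed above concrete, here is a minimal, stdlib-only sketch. It is a toy model, not pandas code: `ToyPeriod`, `ToyPeriodIndex`, `size_slow`, `size_fast`, and `boxed_count` are all hypothetical names invented for illustration. It only assumes that the public `values` accessor builds one Python object per element on every access, while an internal `_values` accessor returns the integer storage untouched.

```python
class ToyPeriod:
    """Hypothetical boxed scalar: one Python object per element."""
    __slots__ = ("ordinal",)

    def __init__(self, ordinal):
        self.ordinal = ordinal


class ToyPeriodIndex:
    """Toy model only; this is not how pandas implements PeriodIndex."""

    def __init__(self, ordinals):
        self._ordinals = list(ordinals)  # cheap internal integer storage
        self.boxed_count = 0             # total scalars boxed so far

    @property
    def values(self):
        # Boxes every element on each access: O(n) work and n allocations.
        self.boxed_count += len(self._ordinals)
        return [ToyPeriod(o) for o in self._ordinals]

    @property
    def _values(self):
        # Returns the internal storage without creating any objects.
        return self._ordinals

    @property
    def size_slow(self):
        # Goes through the boxed representation just to count elements.
        return len(self.values)

    @property
    def size_fast(self):
        # Counts elements from the unboxed storage.
        return len(self._values)


index = ToyPeriodIndex(range(50_000))
fast = index.size_fast
boxed_after_fast = index.boxed_count
slow = index.size_slow
boxed_after_slow = index.boxed_count
```

Both properties return the same number, but only `size_slow` allocates an object per element. In this model, routing a method through the unboxed accessor is enough to avoid the cost; whether the same holds for every affected pandas method is not something this sketch can show.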
The weirdness deepens... I've tracked down the
index = pd.PeriodIndex(start='2000', periods=50000, freq='B')
In [37]: index._int64index
Out[37]:
Int64Index([ 7827, 7828, 7829, 7830, 7831, 7832, 7833, 7834, 7835,
7836,
...
57817, 57818, 57819, 57820, 57821, 57822, 57823, 57824, 57825,
57826],
dtype='int64', length=50000)
In [35]: %timeit index._int64index.get_loc(12827)
100 loops, best of 3: 1.57 ms per loop # really slow
But if I create exactly the same index directly as an
In [40]: int_index = pd.Int64Index(range(7827,57827))
In [44]: int_index.equals(index._int64index)
Out[44]: True
In [41]: %timeit int_index.get_loc(12827)
The slowest run took 765.78 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.87 µs per loop # really fast
Any ideas?

I am not sure why this is not cached. The reason I don't think this will have any negative effects and should fix most of the speed issues.
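One way to picture why a missing cache could produce the slow lookup measured earlier is sketched below. This is an assumption for illustration, not a confirmed description of pandas internals: `ToyIntIndex`, `Uncached`, `Cached`, `int_index`, and `lookups` are hypothetical names. The sketch assumes the derived integer index builds its lookup table lazily on first `get_loc`, so returning a fresh object from an uncached property forces a rebuild on every lookup.

```python
from functools import cached_property


class ToyIntIndex:
    """Hypothetical integer index with a lazily built lookup table."""
    builds = 0  # class-level counter of lookup-table constructions

    def __init__(self, values):
        self._values = values
        self._table = None

    def get_loc(self, key):
        if self._table is None:
            # O(n) construction, paid once per ToyIntIndex instance.
            ToyIntIndex.builds += 1
            self._table = {v: i for i, v in enumerate(self._values)}
        return self._table[key]


class Uncached:
    def __init__(self, ordinals):
        self._ordinals = list(ordinals)

    @property
    def int_index(self):
        # A new ToyIntIndex on every access, so its table is never reused.
        return ToyIntIndex(self._ordinals)


class Cached:
    def __init__(self, ordinals):
        self._ordinals = list(ordinals)

    @cached_property
    def int_index(self):
        # Built on first access, then stored on the instance.
        return ToyIntIndex(self._ordinals)


def lookups(index, n):
    """Run n lookups and report how many lookup tables were built."""
    ToyIntIndex.builds = 0
    locs = [index.int_index.get_loc(12827) for _ in range(n)]
    return locs, ToyIntIndex.builds


uncached_locs, uncached_builds = lookups(Uncached(range(7827, 57827)), 3)
cached_locs, cached_builds = lookups(Cached(range(7827, 57827)), 3)
```

In this model the uncached variant rebuilds the table once per lookup, while the cached variant builds it once in total. A cached derived index is only safe if the underlying values never change after construction, which this sketch takes for granted.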
@MaximilianR if you want to make this change (and do a perf comparison), there are no negative effects, and you do it soon, then we could include it in 0.19.2.
jreback added this to the Next Major Release milestone on Dec 20, 2016

The size issue is still related to boxing though.

OK I'll work on that now, + the boxing. One more q - should the

We also may not have sufficient asv benchmarks for Period (though not sure). Please add them for these cases.
jreback modified the milestone: 0.19.2, Next Major Release on Dec 20, 2016

Closed by #14931
MaximilianR commented Dec 8, 2016
Code Sample, a copy-pastable example if possible
Problem description
@sinhrks - now that the PeriodIndex call to .values unboxes all the periods, operations like PeriodIndex.size are much slower.
What's the best way around this? Should we override more methods so that they call into ._values rather than .values?
Output of pd.show_versions()
In [6]: pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.1.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: 1.2.0
tables: None
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: 1.1.2
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.5
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None