New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: PeriodIndex.size #14822

Closed
max-sixty opened this Issue Dec 8, 2016 · 17 comments

Comments

Projects
None yet
5 participants
@max-sixty
Contributor

max-sixty commented Dec 8, 2016

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd

In [2]: index = pd.PeriodIndex(start='1952', periods=50000, freq='B')

In [3]: %timeit index.size
10 loops, best of 3: 102 ms per loop

In [4]: %timeit index._values.size
The slowest run took 24.79 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 233 ns per loop

Problem description

@sinhrks - now that the PeriodIndex call to .values unboxes all the periods, operations like PeriodIndex.size are much slower.

What's the best way around this? Should we override more methods so that they call into ._values rather than .values?

Output of pd.show_versions()

In [6]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.1.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: 1.2.0
tables: None
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: 1.1.2
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.5
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None

@max-sixty

This comment has been minimized.

Show comment
Hide comment
@max-sixty

max-sixty Dec 15, 2016

Contributor

@sinhrks thoughts?

Contributor

max-sixty commented Dec 15, 2016

@sinhrks thoughts?

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Dec 15, 2016

Member

I am going to release 0.19.2 in a few days, so if there is a PR, it could maybe still be included.

Member

jorisvandenbossche commented Dec 15, 2016

I am going to release 0.19.2 in a few days, so if there is a PR, it could maybe still be included.

@sinhrks

This comment has been minimized.

Show comment
Hide comment
@sinhrks

sinhrks Dec 15, 2016

Member

I think changing the definition of IndexOpsMixin.size to use _values should work (including Series with extension types)

Member

sinhrks commented Dec 15, 2016

I think changing the definition of IndexOpsMixin.size to use _values should work (including Series with extension types)

@max-sixty

This comment has been minimized.

Show comment
Hide comment
@max-sixty

max-sixty Dec 15, 2016

Contributor

@sinhrks do you know of any other properties or methods like this?

In this case, size could call __len__() and we can abstract over that

Contributor

max-sixty commented Dec 15, 2016

@sinhrks do you know of any other properties or methods like this?

In this case, size could call __len__() and we can abstract over that

@max-sixty

This comment has been minimized.

Show comment
Hide comment
@max-sixty

max-sixty Dec 17, 2016

Contributor

Should all of the methods of OpsIndexMixin: https://github.com/pandas-dev/pandas/blob/master/pandas/core/base.py#L817 use _values?

More generally, is it dangerous that calling .values on an object is a slow operation? Are we going to be chasing down performance issues for a while?

Contributor

max-sixty commented Dec 17, 2016

Should all of the methods of OpsIndexMixin: https://github.com/pandas-dev/pandas/blob/master/pandas/core/base.py#L817 use _values?

More generally, is it dangerous that calling .values on an object is a slow operation? Are we going to be chasing down performance issues for a while?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 17, 2016

Contributor

so _values are 'internal', while .values' is external. for lots of indexes (e.g. Int64) and non-pandas-dtypes Series these are the same. But for extensions these are different by design, e.g. a ._values should return for example a DatetimeIndex (for timezeone-aware series). So yes, using ._values will not involved coercions, while .values potentially could.

for PeriodIndex this previously didn't matter, but will going forward.

So changing the implementation would be fine (if any errors show up need to be looked at though), and potential perf comparisons...

Contributor

jreback commented Dec 17, 2016

so _values are 'internal', while .values' is external. for lots of indexes (e.g. Int64) and non-pandas-dtypes Series these are the same. But for extensions these are different by design, e.g. a ._values should return for example a DatetimeIndex (for timezeone-aware series). So yes, using ._values will not involved coercions, while .values potentially could.

for PeriodIndex this previously didn't matter, but will going forward.

So changing the implementation would be fine (if any errors show up need to be looked at though), and potential perf comparisons...

@max-sixty

This comment has been minimized.

Show comment
Hide comment
@max-sixty

max-sixty Dec 20, 2016

Contributor

OK, thanks @jreback

I think the problem is bigger than I imagined - a shallow copy takes 142ms, and even a basic lookup takes 1.4ms:

In [1]: import pandas as pd

In [2]: index=pd.PeriodIndex(start='2000', periods=50000, freq='D')

In [3]: %timeit index._shallow_copy()
1 loop, best of 3: 162 ms per loop

In [4]: %timeit index._shallow_copy(values=index._values)
The slowest run took 5.87 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 14.1 µs per loop

In [6]: all(index._shallow_copy(values=index._values) == index._shallow_copy())
Out[6]: True

In [9]: %timeit index.get_loc(index[500])
The slowest run took 26.78 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 1.46 ms per loop

So almost 1000x slower than Int64Index:

In [13]: index = pd.Int64Index(range(0,50000))

In [14]: %timeit index.get_loc(index[500])
The slowest run took 475.16 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.83 µs per loop

@jreback & @sinhrks do you have any suggestions for the most efficient way to solve this? I had planned to replace .values with ._values in a few places, but the required changes are probably much greater than that...

FWIW this makes the latest pandas unusable in our environment - the speed has fallen by a multiple, given how much we use PeriodIndex. Keen to be part of the solution though...

Contributor

max-sixty commented Dec 20, 2016

OK, thanks @jreback

I think the problem is bigger than I imagined - a shallow copy takes 142ms, and even a basic lookup takes 1.4ms:

In [1]: import pandas as pd

In [2]: index=pd.PeriodIndex(start='2000', periods=50000, freq='D')

In [3]: %timeit index._shallow_copy()
1 loop, best of 3: 162 ms per loop

In [4]: %timeit index._shallow_copy(values=index._values)
The slowest run took 5.87 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 14.1 µs per loop

In [6]: all(index._shallow_copy(values=index._values) == index._shallow_copy())
Out[6]: True

In [9]: %timeit index.get_loc(index[500])
The slowest run took 26.78 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 1.46 ms per loop

So almost 1000x slower than Int64Index:

In [13]: index = pd.Int64Index(range(0,50000))

In [14]: %timeit index.get_loc(index[500])
The slowest run took 475.16 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.83 µs per loop

@jreback & @sinhrks do you have any suggestions for the most efficient way to solve this? I had planned to replace .values with ._values in a few places, but the required changes are probably much greater than that...

FWIW this makes the latest pandas unusable in our environment - the speed has fallen by a multiple, given how much we use PeriodIndex. Keen to be part of the solution though...

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 20, 2016

Contributor

there is probably some boxing going on

this should be similar speed to a DTI index

the _shallow_copy is the tip off

Contributor

jreback commented Dec 20, 2016

there is probably some boxing going on

this should be similar speed to a DTI index

the _shallow_copy is the tip off

@max-sixty

This comment has been minimized.

Show comment
Hide comment
@max-sixty

max-sixty Dec 20, 2016

Contributor

there is probably some boxing going on
this should be similar speed to a DTI index
the _shallow_copy is the tip off

Boxing everywhere! For _shallow_copy it's here: https://github.com/pandas-dev/pandas/blob/master/pandas/indexes/base.py#L374 (no cover in PeriodIndex)

I think the core issue is that lots of places we rely on .values being a fast array. Should these all be ._values?

Contributor

max-sixty commented Dec 20, 2016

there is probably some boxing going on
this should be similar speed to a DTI index
the _shallow_copy is the tip off

Boxing everywhere! For _shallow_copy it's here: https://github.com/pandas-dev/pandas/blob/master/pandas/indexes/base.py#L374 (no cover in PeriodIndex)

I think the core issue is that lots of places we rely on .values being a fast array. Should these all be ._values?

@max-sixty

This comment has been minimized.

Show comment
Hide comment
@max-sixty

max-sixty Dec 20, 2016

Contributor

The weirdness deepens... I've tracked down the get_loc slowdown to getting the ordinal from the Int64Index:

index = pd.PeriodIndex(start='2000', periods=50000, freq='B')

In [37]: index._int64index
Out[37]:
Int64Index([ 7827,  7828,  7829,  7830,  7831,  7832,  7833,  7834,  7835,
             7836,
            ...
            57817, 57818, 57819, 57820, 57821, 57822, 57823, 57824, 57825,
            57826],
           dtype='int64', length=50000)

In [35]: %timeit index._int64index.get_loc(12827)
100 loops, best of 3: 1.57 ms per loop  # really slow

But if I create exactly the same index directly as an Int64Index, the problem goes away:

In [40]: int_index = pd.Int64Index(range(7827,57827))

In [44]: int_index.equals(index._int64index)
Out[44]: True

In [41]: %timeit int_index.get_loc(12827)
The slowest run took 765.78 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.87 µs per loop  # really fast

Any ideas?

Contributor

max-sixty commented Dec 20, 2016

The weirdness deepens... I've tracked down the get_loc slowdown to getting the ordinal from the Int64Index:

index = pd.PeriodIndex(start='2000', periods=50000, freq='B')

In [37]: index._int64index
Out[37]:
Int64Index([ 7827,  7828,  7829,  7830,  7831,  7832,  7833,  7834,  7835,
             7836,
            ...
            57817, 57818, 57819, 57820, 57821, 57822, 57823, 57824, 57825,
            57826],
           dtype='int64', length=50000)

In [35]: %timeit index._int64index.get_loc(12827)
100 loops, best of 3: 1.57 ms per loop  # really slow

But if I create exactly the same index directly as an Int64Index, the problem goes away:

In [40]: int_index = pd.Int64Index(range(7827,57827))

In [44]: int_index.equals(index._int64index)
Out[44]: True

In [41]: %timeit int_index.get_loc(12827)
The slowest run took 765.78 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.87 µs per loop  # really fast

Any ideas?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 20, 2016

Contributor

I am not sure why this is not cached. The reason .asi8 doen't need caching it is a cheap view (as its on the values), but here we need an actual index. And there is some overhead to creating an index itself (even if in this case its a pass thru).

I don't think this will have any negative effects and should fix most of the speed issues.

In [1]: index = pd.PeriodIndex(start='2000', periods=50000, freq='B')

In [2]: %timeit index._int64index.get_loc(12827)
The slowest run took 1381.93 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.3 µs per loop
diff --git a/pandas/tseries/period.py b/pandas/tseries/period.py
index 4bab3bc..593ac3d 100644
--- a/pandas/tseries/period.py
+++ b/pandas/tseries/period.py
@@ -356,9 +356,8 @@ class PeriodIndex(DatelikeOps, DatetimeIndexOpsMixin, Int64Index):
     def asi8(self):
         return self._values.view('i8')
 
-    @property
+    @cache_readonly
     def _int64index(self):
-        # do not cache, same as .asi8
         return Int64Index(self.asi8, name=self.name, fastpath=True)
 
     @property
Contributor

jreback commented Dec 20, 2016

I am not sure why this is not cached. The reason .asi8 doen't need caching it is a cheap view (as its on the values), but here we need an actual index. And there is some overhead to creating an index itself (even if in this case its a pass thru).

I don't think this will have any negative effects and should fix most of the speed issues.

In [1]: index = pd.PeriodIndex(start='2000', periods=50000, freq='B')

In [2]: %timeit index._int64index.get_loc(12827)
The slowest run took 1381.93 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.3 µs per loop
diff --git a/pandas/tseries/period.py b/pandas/tseries/period.py
index 4bab3bc..593ac3d 100644
--- a/pandas/tseries/period.py
+++ b/pandas/tseries/period.py
@@ -356,9 +356,8 @@ class PeriodIndex(DatelikeOps, DatetimeIndexOpsMixin, Int64Index):
     def asi8(self):
         return self._values.view('i8')
 
-    @property
+    @cache_readonly
     def _int64index(self):
-        # do not cache, same as .asi8
         return Int64Index(self.asi8, name=self.name, fastpath=True)
 
     @property
@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 20, 2016

Contributor

@MaximilianR if you want to make this change (and do a perf comparison) and no negative effects, and you do it soon, then could include in 0.19.2.

Contributor

jreback commented Dec 20, 2016

@MaximilianR if you want to make this change (and do a perf comparison) and no negative effects, and you do it soon, then could include in 0.19.2.

@jreback jreback added this to the Next Major Release milestone Dec 20, 2016

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 20, 2016

Contributor

the size issue is still related to boxing though.

Contributor

jreback commented Dec 20, 2016

the size issue is still related to boxing though.

@max-sixty

This comment has been minimized.

Show comment
Hide comment
@max-sixty

max-sixty Dec 20, 2016

Contributor

OK I'll work on that now, + the boxing.

One more q - should the .shape-like properties be calling into ._values or ._data? I'm not sure of the distinction

Contributor

max-sixty commented Dec 20, 2016

OK I'll work on that now, + the boxing.

One more q - should the .shape-like properties be calling into ._values or ._data? I'm not sure of the distinction

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 20, 2016

Contributor

._values is fine (you can try to do this in pandas.core.base and it should be ok, but if you encounter issues then should do it in sub-classes (but try in the base class first)

Contributor

jreback commented Dec 20, 2016

._values is fine (you can try to do this in pandas.core.base and it should be ok, but if you encounter issues then should do it in sub-classes (but try in the base class first)

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 20, 2016

Contributor

we also may not have sufficient asv for period (though not sure). for these cases pls add.

Contributor

jreback commented Dec 20, 2016

we also may not have sufficient asv for period (though not sure). for these cases pls add.

@jreback jreback modified the milestones: 0.19.2, Next Major Release Dec 20, 2016

@max-sixty max-sixty referenced this issue Dec 20, 2016

Merged

PERF: PeriodIndex speed up #14931

4 of 4 tasks complete
@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 21, 2016

Contributor

closed by #14931

Contributor

jreback commented Dec 21, 2016

closed by #14931

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment