DataFrame __getitem__ ~100X slower on Pandas 0.19.1 vs 0.18.1 possibly getitem caching? #14930

Closed
dragoljub opened this Issue Dec 20, 2016 · 7 comments

@dragoljub
# Run on Pandas 0.19.1

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000,5000), columns=['C'+str(c) for c in range(5000)])

%%prun
for col in df:
    df[col]

         65011 function calls (65010 primitive calls) in 2.012 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5000    1.935    0.000    1.955    0.000 base.py:1362(__contains__)
     5000    0.025    0.000    2.002    0.000 frame.py:2033(__getitem__)
        1    0.011    0.011    2.012    2.012 <string>:2(<module>)
     5000    0.010    0.000    0.016    0.000 frame.py:2059(_getitem_column)
     5001    0.007    0.000    0.007    0.000 {method 'view' of 'numpy.ndarray' objects}
     5001    0.004    0.000    0.012    0.000 base.py:492(values)
     5000    0.004    0.000    0.007    0.000 generic.py:1381(_get_item_cache)
     5000    0.004    0.000    0.018    0.000 base.py:1275(<lambda>)
     5000    0.003    0.000    0.014    0.000 base.py:874(_values)
     5000    0.002    0.000    0.002    0.000 {built-in method builtins.isinstance}
     5000    0.002    0.000    0.003    0.000 common.py:438(_apply_if_callable)
     5000    0.002    0.000    0.002    0.000 {method 'get' of 'dict' objects}
     5000    0.001    0.000    0.001    0.000 {built-in method builtins.hash}
     5000    0.001    0.000    0.001    0.000 {built-in method builtins.callable}
        1    0.000    0.000    2.012    2.012 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 generic.py:833(__iter__)
        1    0.000    0.000    0.000    0.000 generic.py:401(_info_axis)
      2/1    0.000    0.000    0.000    0.000 {built-in method builtins.iter}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        1    0.000    0.000    0.000    0.000 base.py:1315(__iter__)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

%time df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Columns: 5000 entries, C0 to C4999
dtypes: float64(5000)
memory usage: 38.1 MB
Wall time: 4.39 s

Problem description

Judging by the profile, _get_item_cache() or Index.__contains__ appears to be responsible: base.py:1362(__contains__) accounts for 1.935 s of the 2.012 s total. This also affects other functionality built on column lookup, such as df.info(), which is now likewise ~100X slower.
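
A quick way to confirm where the time goes (a hypothetical isolation test, not part of the original report) is to time membership tests against the column Index directly, bypassing __getitem__ entirely:

import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 5000),
                  columns=['C' + str(c) for c in range(5000)])

# Time 5000 membership tests against the column Index alone; on 0.19.1
# this dominates the loop above, while on 0.18.1 it is negligible.
elapsed = timeit.timeit(lambda: 'C4999' in df.columns, number=5000)
print('5000 __contains__ calls: %.3f s' % elapsed)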

Expected Output

# Run on Pandas 0.18.1

In [6]: %%prun
   ...: for col in df:
   ...:     df[col]
   ...:
         45011 function calls (45010 primitive calls) in 0.032 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5000    0.012    0.000    0.028    0.000 frame.py:1973(__getitem__)
     5000    0.005    0.000    0.008    0.000 frame.py:1999(_getitem_column)
     5000    0.004    0.000    0.005    0.000 base.py:1233(__contains__)
        1    0.004    0.004    0.032    0.032 <string>:2(<module>)
     5000    0.002    0.000    0.003    0.000 generic.py:1345(_get_item_cache)
     5000    0.001    0.000    0.002    0.000 common.py:1846(_apply_if_callable)
     5000    0.001    0.000    0.001    0.000 {built-in method builtins.isinstance}
     5000    0.001    0.000    0.001    0.000 {method 'get' of 'dict' objects}
     5000    0.001    0.000    0.001    0.000 {built-in method builtins.hash}
     5000    0.000    0.000    0.000    0.000 {built-in method builtins.callable}
        1    0.000    0.000    0.032    0.032 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 generic.py:808(__iter__)
        1    0.000    0.000    0.000    0.000 base.py:440(values)
      2/1    0.000    0.000    0.000    0.000 {built-in method builtins.iter}
        1    0.000    0.000    0.000    0.000 {method 'view' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 generic.py:381(_info_axis)
        1    0.000    0.000    0.000    0.000 base.py:1186(__iter__)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

%time df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Columns: 5000 entries, C0 to C4999
dtypes: float64(5000)
memory usage: 38.1 MB
Wall time: 40 ms

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 8.1.1
setuptools: 20.10.1
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.7.1
IPython: 4.2.0
sphinx: 1.3.6
patsy: 0.4.0
dateutil: 2.5.0
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.2.5
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: None
lxml: 3.6.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

jreback added this to the 0.19.2 milestone Dec 20, 2016

@jreback (Contributor) commented Dec 20, 2016

I have a fix for this.

@jreback (Contributor) commented Dec 20, 2016

I changed this in 3c96442 to make it lazy, but it keeps reinitializing, which is not great.

@dragoljub commented Dec 20, 2016

@jreback Thanks for looking into this. 👍

Can you provide some info about over_size_threshold? It appears this issue does not affect all frames, just those with many columns. What is over_size_threshold currently set to? Is there a way to prevent creation of hash tables on the column index?

Was this regression due to something that improved performance of row-based indexing but possibly hurt columnar access?

@jreback (Contributor) commented Dec 20, 2016

See the referenced issues. They explain all.

@jreback (Contributor) commented Dec 20, 2016

@dragoljub short answer is this.

When over_size_threshold was triggered (index length > 1M), a different path was followed where the hashtable was constructed for any type of indexing operation (once it's constructed it stays until collected). The issue was that for these large indexes we actually use searchsorted rather than the hashtable to look things up, so we were needlessly constructing it and using memory for it.

The solution was to do our uniqueness checking at the same time as we are already checking for sortedness (e.g. is_monotonic_increasing).

However, it seems that this was reconstructing the index mapping each time (bad).

cc @llllllllll
cc @ssanderson
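
A simplified illustration of the mechanism described above (a sketch only; LazyEngine, its threshold, and its methods are hypothetical, not the actual pandas internals, and the values are assumed numeric): for a large sorted index, lookups can go through searchsorted and the hashtable is never built, while the monotonicity/uniqueness bookkeeping is computed once and cached rather than redone per lookup, which is what the regression effectively did.

import numpy as np

class LazyEngine:
    """Hypothetical sketch of a lazily initialized index engine."""

    OVER_SIZE_THRESHOLD = 1000000  # skip the hashtable above this length

    def __init__(self, values):
        self.values = np.asarray(values)
        self.mapping = None          # hashtable, built only on demand
        self._is_monotonic = None
        self._is_unique = None

    def _check_sortedness(self):
        # Compute monotonicity and uniqueness in one pass and cache them;
        # the fix was to stop redoing this work on every lookup.
        if self._is_monotonic is None:
            diffs = np.diff(self.values)
            self._is_monotonic = bool((diffs >= 0).all())
            self._is_unique = bool((diffs > 0).all())

    def get_loc(self, key):
        self._check_sortedness()
        if len(self.values) > self.OVER_SIZE_THRESHOLD and self._is_monotonic:
            # Large sorted index: binary search, no hashtable needed.
            loc = int(np.searchsorted(self.values, key))
            if loc == len(self.values) or self.values[loc] != key:
                raise KeyError(key)
            return loc
        if self.mapping is None:     # otherwise build the hashtable once
            self.mapping = {v: i for i, v in enumerate(self.values)}
        return self.mapping[key]

engine = LazyEngine(np.arange(2000000))
print(engine.get_loc(1234567))  # searchsorted path, no hashtable built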

@ssanderson (Contributor) commented Dec 21, 2016

@jreback I took a look at your proposed fix. I think it looks correct, but I think there might be a simpler fix that just makes `initialized` synonymous with `self.mapping is not None` (or drops `initialized` entirely in favor of the `is not None` check), and changes the gating behavior on the `is_unique` property. It's certainly possible that I missed something though: keeping all the different flags straight in my head is tricky.
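
In code, ssanderson's suggestion amounts to deriving the flag from the mapping itself so the two can never disagree (a minimal sketch; the class and attribute names mirror the discussion, not the actual pandas source):

class Engine:
    def __init__(self):
        self.mapping = None  # hashtable, populated lazily

    @property
    def initialized(self):
        # Derive 'initialized' from mapping directly instead of keeping
        # a separate boolean flag that can drift out of sync with it.
        return self.mapping is not None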

@dragoljub commented Dec 21, 2016

@jreback Thanks for the explanation and quick response time on this. 😄 This will be a great last-minute perf fix for 0.19.2!

jreback closed this in 07c83ee Dec 21, 2016

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this issue Dec 24, 2016

PERF: fix getitem unique_check / initialization issue
closes #14930

Author: Jeff Reback <jeff@reback.net>

Closes #14933 from jreback/perf and squashes the following commits:

dc32b39 [Jeff Reback] PERF: fix getitem unique_check / initialization issue

(cherry picked from commit 07c83ee)

ShaharBental added a commit to ShaharBental/pandas that referenced this issue Dec 26, 2016

PERF: fix getitem unique_check / initialization issue
closes #14930

Author: Jeff Reback <jeff@reback.net>

Closes #14933 from jreback/perf and squashes the following commits:

dc32b39 [Jeff Reback] PERF: fix getitem unique_check / initialization issue