New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

performance regression in ewm.corr(pairwise=True) #17917

Closed
grumpyquant opened this Issue Oct 19, 2017 · 5 comments

Comments

Projects
None yet
2 participants
@grumpyquant

grumpyquant commented Oct 19, 2017

Problem description

The deprecation of Panel in the 0.20.x releases has introduced a severe performance regression in ewm.corr(pairwise=True) for a common case when this function is called on a long time series (e.g. a dataframe with 1 million rows and 6 columns). The issue is the last 3 lines of code in this section of core/window.py:

                # TODO: not the most efficient (perf-wise)
                # though not bad code-wise
                from pandas import Panel, MultiIndex, concat

                with warnings.catch_warnings(record=True):
                    p = Panel.from_dict(results).swapaxes('items', 'major')
                    if len(p.major_axis) > 0:
                        p.major_axis = arg1.columns[p.major_axis]
                    if len(p.minor_axis) > 0:
                        p.minor_axis = arg2.columns[p.minor_axis]

                if len(p.items):
                    result = concat(
                        [p.iloc[i].T for i in range(len(p.items))],
                        keys=p.items)

The result is converted from a Panel to a DataFrame by running a concat along an axis that is typically very long. This is killing performance for me compared to the 0.19 releases.

My solution was to replace the last 3 lines with:

                      result = DataFrame(
                          p.values.reshape((p.shape[0], p.shape[1]*p.shape[2])),
                          index=p.items,
                          columns=MultiIndex.from_product((arg1.columns, arg2.columns))
                      )
                      result = result.stack(dropna=False)

This works for me but I'm no pandas internals expert, so perhaps this solution does not work in all cases.

I'd really appreciate getting a workaround in though - clearly from the TODO comment the developers were at least aware this was a performance issue when they added the code. I'm happy to rejig the above if it has problems.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.22.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: 2.8.5
pip: 9.0.1
setuptools: 26.1.1
Cython: 0.26
numpy: 1.13.3
scipy: 0.19.1
xarray: 0.8.2
IPython: 6.0.0
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None

@jreback

This comment has been minimized.

Contributor

jreback commented Oct 19, 2017

if you replace this does it pass the tests? want to do a PR? (would also appreciate an asv for this).

@grumpyquant

This comment has been minimized.

grumpyquant commented Oct 19, 2017

What's an ASV?

@grumpyquant

This comment has been minimized.

grumpyquant commented Oct 19, 2017

Thanks.

I realized my patch above won't work if arg1.columns or arg2.columns is a MultiIndex. I can fix it but I need a way of combining two indices into a MultiIndex where one (or both) of the indices could be a MultiIndex.

I.e. if idx1 and idx2 are Index objects, then I need an expression to compute idx equivalent to:

idx = concat([DataFrame([], columns=idx2)] * len(idx1), keys=idx1).index

The above is too slow since it creates too many unnecessary empty DataFrames.

If someone can provide me with a fast way of doing that I can get this working. Apologies, this is probably the wrong venue for this...I don't ever submit patches.

@jreback

This comment has been minimized.

Contributor

jreback commented Oct 19, 2017

not really sure what you need, pls show an example

In [15]: idx1 = pd.Index(list('abc'))

In [16]: idx2 = pd.Index([1, 2, 3])

In [17]: pd.MultiIndex.from_arrays([idx1, idx2])
Out[17]: 
MultiIndex(levels=[['a', 'b', 'c'], [1, 2, 3]],
           labels=[[0, 1, 2], [0, 1, 2]])

@jreback jreback added this to the 0.21.1 milestone Oct 20, 2017

@jreback jreback modified the milestones: 0.21.1, Next Major Release Dec 2, 2017

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Jan 15, 2018

jreback added a commit to jreback/pandas that referenced this issue Jan 15, 2018

jreback added a commit to jreback/pandas that referenced this issue Jan 16, 2018

jreback added a commit to jreback/pandas that referenced this issue Jan 27, 2018

jreback added a commit to jreback/pandas that referenced this issue Feb 1, 2018

jreback added a commit that referenced this issue Feb 1, 2018

PERF: remove use of Panel & perf in rolling corr/cov (#19257)
* PERF: remove use of Panel & perf in rolling corr/cov

closes #17917

harisbal pushed a commit to harisbal/pandas that referenced this issue Feb 28, 2018

PERF: remove use of Panel & perf in rolling corr/cov (pandas-dev#19257)
* PERF: remove use of Panel & perf in rolling corr/cov

closes pandas-dev#17917
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment