performance regression in ewm.corr(pairwise=True) #17917
The deprecation of Panel in the 0.20.x releases introduced a severe performance regression in ewm.corr(pairwise=True) for a common case: calling it on a long time series (e.g. a DataFrame with 1 million rows and 6 columns). The issue is the last 3 lines of code in this section of core/window.py:
```python
# TODO: not the most efficient (perf-wise)
# though not bad code-wise
from pandas import Panel, MultiIndex, concat

with warnings.catch_warnings(record=True):
    p = Panel.from_dict(results).swapaxes('items', 'major')
    if len(p.major_axis) > 0:
        p.major_axis = arg1.columns[p.major_axis]
    if len(p.minor_axis) > 0:
        p.minor_axis = arg2.columns[p.minor_axis]
    if len(p.items):
        result = concat(
            [p.iloc[i].T for i in range(len(p.items))],
            keys=p.items)
```
The result is converted from a Panel to a DataFrame by running concat along an axis that is typically very long (one DataFrame per row of the input). This kills performance for me compared to the 0.19 releases.
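A minimal reproduction of the slow path (a sketch: sizes are reduced from the report, and the `span` value is an arbitrary choice):

```python
import numpy as np
import pandas as pd

# Small stand-in for the reported workload (1M rows x 6 columns);
# the pairwise result is assembled one (6 x 6) correlation block
# per input row, which is what makes long inputs slow.
df = pd.DataFrame(np.random.randn(1000, 6))

result = df.ewm(span=20).corr(pairwise=True)

# result.index is a MultiIndex of (time, column), so the output has
# len(df) * df.shape[1] rows.
```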
My solution was to replace the last 3 lines with:
```python
result = DataFrame(
    p.values.reshape((p.shape[0], p.shape[1] * p.shape[2])),
    index=p.items,
    columns=MultiIndex.from_product((arg1.columns, arg2.columns)))
result = result.stack(dropna=False)
```
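To sanity-check the reshape, here is a self-contained sketch with small made-up sizes (it uses a plain NumPy array in place of the deprecated Panel's values, and plain `stack()` rather than `dropna=False`, which only matters when a block is all-NaN), showing that concatenating transposed per-item slices equals a single reshape-plus-stack:

```python
import numpy as np
import pandas as pd

values = np.arange(24).reshape(2, 3, 4)   # (items, major, minor)
items = pd.Index(['t0', 't1'])            # stands in for p.items
cols1 = pd.Index(list('abc'))             # stands in for arg1.columns
cols2 = pd.Index(list('wxyz'))            # stands in for arg2.columns

# Slow path: one DataFrame per item, transposed, then concat.
slow = pd.concat(
    [pd.DataFrame(values[i], index=cols1, columns=cols2).T
     for i in range(len(items))],
    keys=items)

# Fast path: reshape the whole 3-D block once, then stack the
# innermost column level into the index.
fast = pd.DataFrame(
    values.reshape(len(items), len(cols1) * len(cols2)),
    index=items,
    columns=pd.MultiIndex.from_product((cols1, cols2))).stack()
```

Both paths produce a frame indexed by (item, cols2) with cols1 as columns, but the fast path avoids building one DataFrame per row of the input.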
This works for me but I'm no pandas internals expert, so perhaps this solution does not work in all cases.
I'd really appreciate getting a workaround in, though - clearly from the TODO comment the developers were aware this was a performance issue when they added the code. I'm happy to rejig the above if it has problems.
I realized my patch above won't work if arg1.columns or arg2.columns is a MultiIndex. I can fix it but I need a way of combining two indices into a MultiIndex where one (or both) of the indices could be a MultiIndex.
I.e. if idx1 and idx2 are Index objects, then I need an expression to compute idx equivalent to:
```python
idx = concat([DataFrame(index=idx2)] * len(idx1), keys=idx1).index
```
The above is too slow since it creates too many unnecessary empty DataFrames.
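One faster way (a sketch; `product_index` is a hypothetical helper name, not an existing pandas function) is to build the product at the array level with repeat/tile, which handles the case where either input is already a MultiIndex:

```python
import numpy as np
import pandas as pd

def product_index(idx1, idx2):
    """Hypothetical helper: the product of two indices as a MultiIndex,
    where either input may itself be a MultiIndex."""
    n1, n2 = len(idx1), len(idx2)
    left = idx1.repeat(n2)                         # each idx1 label repeated n2 times
    right = idx2.take(np.tile(np.arange(n2), n1))  # idx2 tiled n1 times
    # Flatten both sides into their component levels (a flat Index has
    # nlevels == 1) and combine them into a single MultiIndex.
    arrays = [left.get_level_values(i) for i in range(left.nlevels)]
    arrays += [right.get_level_values(i) for i in range(right.nlevels)]
    return pd.MultiIndex.from_arrays(arrays)
```

For flat inputs this matches MultiIndex.from_product; the point is that it also works when idx1 or idx2 already has multiple levels, without constructing any throwaway DataFrames.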
If someone can provide me with a fast way of doing that, I can get this working. Apologies, this is probably the wrong venue for this...I don't ever submit patches.
not really sure what you need, pls show an example