
ENH: Enable DataFrame.corrwith to compute rank correlations #22375

Merged 10 commits into pandas-dev:master on Dec 31, 2018

Conversation

@dsaxton (Member) commented Aug 16, 2018

This PR enables DataFrame.corrwith to compute rank correlations in addition to Pearson's correlation, and should be fully backwards compatible. It also clarifies the functionality in the docstring, raises an informative error if the user specifies a form of correlation that isn't implemented, and adds tests for the new behavior.

closes #22328
closes #21925
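For illustration, the enhanced API can be exercised like this (a minimal sketch against the merged behavior; note that the rank methods route through scipy.stats, so SciPy must be installed):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
df2 = df1 + rng.normal(scale=0.5, size=(100, 3))  # noisy copy, so correlations are high

pearson = df1.corrwith(df2)                      # default, unchanged behavior
spearman = df1.corrwith(df2, method="spearman")  # new: rank correlation (needs SciPy)

# a method that isn't implemented raises a ValueError
try:
    df1.corrwith(df2, method="not-a-method")
except ValueError as err:
    print(err)
```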

@jreback (Contributor) left a comment

since you are changing the impl can you run the asv for perf check

@dsaxton (Member, PR author) commented Aug 16, 2018

I'm having some trouble finding a relevant benchmark for corrwith (a grep on the benchmarks folder returns no hits). Should one be added to the Correlation class in stats_ops.py?

@dsaxton (Member, PR author) commented Aug 16, 2018

Here is one I put together which seems to suggest around a 33-50% speed-up with the new implementation.

import numpy as np
from pandas import DataFrame, Series

np.random.seed(2357)

def corrwith_old(df1, df2, axis=0, drop=False):
    axis = df1._get_axis_number(axis)
    this = df1._get_numeric_data()

    if isinstance(df2, Series):
        return this.apply(df2.corr, axis=axis)

    df2 = df2._get_numeric_data()

    left, right = this.align(df2, join='inner', copy=False)

    # mask missing values
    left = left + right * 0
    right = right + left * 0

    if axis == 1:
        left = left.T
        right = right.T

    # demeaned data
    ldem = left - left.mean()
    rdem = right - right.mean()

    num = (ldem * rdem).sum()
    dom = (left.count() - 1) * left.std() * right.std()

    correl = num / dom

    if not drop:
        raxis = 1 if axis == 0 else 0
        result_index = this._get_axis(raxis).union(df2._get_axis(raxis))
        correl = correl.reindex(result_index)

    return correl


def corrwith_new(df1, df2, axis=0, drop=False, method='pearson'):
    if method not in ['pearson', 'spearman', 'kendall']:
        raise ValueError("method must be either 'pearson', "
                         "'spearman', or 'kendall', '{method}' "
                         "was supplied".format(method=method))

    axis = df1._get_axis_number(axis)
    this = df1._get_numeric_data()

    if isinstance(df2, Series):
        return this.apply(lambda x: df2.corr(x, method=method),
                          axis=axis)

    df2 = df2._get_numeric_data()
    left, right = this.align(df2, join='inner', copy=False)

    if axis == 1:
        left = left.T
        right = right.T

    correl = (left.apply(lambda x:
                         x.corr(right[x.name], method=method)))

    if drop:
        correl.dropna(inplace=True)

    return correl


data1 = DataFrame(np.random.normal(size=(10**6, 10)))
data2 = DataFrame(np.random.normal(size=(10**6, 10)))

%timeit corrwith_old(data1, data2)
%timeit corrwith_new(data1, data2)

@jreback (Contributor) commented Aug 16, 2018

can u add this as an asv

in particular i suspect your method will slow down when used with a larger number of columns

so pls show that as well

@codecov (bot) commented Aug 16, 2018

Codecov Report

Merging #22375 into master will decrease coverage by 60.42%.
The diff coverage is 0%.


@@             Coverage Diff             @@
##           master   #22375       +/-   ##
===========================================
- Coverage   92.31%   31.89%   -60.43%     
===========================================
  Files         166      166               
  Lines       52335    52429       +94     
===========================================
- Hits        48313    16722    -31591     
- Misses       4022    35707    +31685
Flag       Coverage Δ
#multiple  30.28% <0%> (-60.45%) ⬇️
#single    31.89% <0%> (-11.16%) ⬇️

Impacted Files                      Coverage Δ
pandas/core/frame.py                27.82% <0%> (-69.1%) ⬇️
pandas/io/formats/latex.py          0% <0%> (-100%) ⬇️
pandas/core/categorical.py          0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py      0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py          0% <0%> (-100%) ⬇️
pandas/tseries/converter.py         0% <0%> (-100%) ⬇️
pandas/io/formats/html.py           0% <0%> (-98.65%) ⬇️
pandas/core/groupby/categorical.py  0% <0%> (-95.46%) ⬇️
pandas/core/reshape/reshape.py      8.06% <0%> (-91.51%) ⬇️
pandas/io/sas/sas7bdat.py           0% <0%> (-91.17%) ⬇️
... and 127 more

Last update aeff38d...870d1a3.

@dsaxton (Member, PR author) commented Aug 16, 2018

I've added the new benchmark function time_corrwith to stats_ops.py within the Correlation class. I'll try to run the benchmark locally (although I botched the setup of asv on my main computer and it seems to be broken).
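For context, an asv benchmark of the shape being described might look like this (a sketch following the conventions in asv_bench/benchmarks/stat_ops.py; the exact names and sizes in the merged benchmark may differ):

```python
import numpy as np
import pandas as pd


class Correlation:
    # asv runs each time_* method once per entry in params
    params = ["spearman", "kendall", "pearson"]
    param_names = ["method"]

    def setup(self, method):
        self.df = pd.DataFrame(np.random.randn(1000, 30))
        self.df2 = pd.DataFrame(np.random.randn(1000, 30))

    def time_corr(self, method):
        self.df.corr(method=method)

    def time_corrwith_cols(self, method):
        self.df.corrwith(self.df2, method=method)

    def time_corrwith_rows(self, method):
        self.df.corrwith(self.df2, axis=1, method=method)
```

Since asv treats each params entry as a separate variant, the Pearson timing stays comparable against master even though the two rank variants have no baseline there yet.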

@dsaxton (Member, PR author) commented Aug 16, 2018

@jreback I think your suspicion may have been accurate regarding the dimension of the inputs. In any case, for the tests it looks like there's an issue with scipy during the travis build. Am I missing an import somewhere perhaps?

@WillAyd (Member) commented Aug 16, 2018

@dsaxton we test quite a few different configurations, not all of which have SciPy installed. If your tests depend on that package you should use the skip_if_no_scipy decorator, which you'll see used in other places in that same module.


    def time_corr(self, method):
        self.df.corr(method=method)

    def time_corrwith(self, method):
        self.df.corrwith(df2, method=method)
Review comment (Member):

df2 should be self.df2 here.

Reply (@dsaxton):

Good catch, thank you.

@dsaxton (Member, PR author) commented Aug 16, 2018

@jreback If a certain benchmark is actually part of the PR, what would you say is the most straightforward way to show that there isn't a degradation in performance? Here the time_corrwith function will not work for the current implementation because it uses all three correlation types. I suppose it could temporarily be modified to look only at Pearson?

@WillAyd (Member) commented Aug 16, 2018

You should have benchmarks for all three. The two new ones will fail since they don't exist on master, but that's OK: it still sets a baseline going forward, and we can still get insights out of the Pearson benchmark.

@dsaxton (Member, PR author) commented Aug 16, 2018

[screenshot: asv benchmark results, 2018-08-16 4:31 pm]

Here are the results I get running the stat_ops.Correlation benchmarks (all DataFrames involved are built from 1000 x 30 Gaussian arrays). The Pearson calculation somehow took longer than DataFrame.corr, even though the latter should be doing more work, since it computes the full 30 x 30 correlation matrix. I'll need to run it against the current corrwith to see how that performs.

@dsaxton (Member, PR author) commented Aug 17, 2018

I modified the body of the function to mimic how the current calculation is done in the special case where method='pearson' (in case there was overhead associated with repeatedly calling Series.corr), and I saw essentially the same performance as before (shown below). My guess, then, is that DataFrame.corr gets its speed from nancorr within pandas._libs.algos. Could it make sense to use functions from algos inside corrwith as well?

        if method in ['spearman', 'kendall']:
            correl = (left.apply(lambda x:
                                 x.corr(right[x.name], method=method)))
        else:
            correl = (((right - right.mean()) * (left - left.mean())).mean()
                      / right.std() / left.std())

[screenshot: asv benchmark results, 2018-08-16 6:52 pm]
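The speed difference is consistent with nancorr making a single vectorized pass over the data. The idea can be sketched in plain NumPy (my own illustration of the approach, not the pandas internals; it assumes no missing values):

```python
import numpy as np

def columnwise_pearson(a, b):
    # Pearson correlation between matching columns of two equal-shape
    # 2-D arrays, computed in one vectorized pass (no per-column Python loop)
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    num = (a * b).sum(axis=0)
    dom = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))
    return num / dom

rng = np.random.default_rng(1)
x = rng.normal(size=(1000, 5))
y = x + rng.normal(scale=0.5, size=(1000, 5))
r = columnwise_pearson(x, y)  # one correlation per column
```

Compared with DataFrame.apply calling Series.corr per column, this avoids constructing a Series object for every column.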

@jreback (Contributor) left a comment

pls add a whatsnew note in other enhancements

(resolved review threads: asv_bench/benchmarks/stat_ops.py, pandas/core/frame.py)
method : {'pearson', 'kendall', 'spearman'}
* pearson : standard correlation coefficient
* kendall : Kendall Tau correlation coefficient
* spearman : Spearman rank correlation

Returns
-------
correls : Series
"""
Review comment (Contributor):

can you add a See Also and revert to .corr (and add to .corr a See Also referring to .corrwith).

@jreback (Contributor) left a comment

what do the asvs show?

(resolved review thread: pandas/core/frame.py)
@jreback added the Enhancement and Numeric Operations (Arithmetic, Comparison, and Logical operations) labels Aug 17, 2018
@jreback jreback added this to the 0.24.0 milestone Aug 17, 2018
@jreback (Contributor) left a comment

pls add a whatsnew as well (with both issues)

(resolved review thread: pandas/tests/frame/test_analytics.py)
@jreback (Contributor) commented Aug 17, 2018

don't post pictures of the asvs, just copy-paste the text pls.

@dsaxton (Member, PR author) commented Aug 17, 2018

asv results for the Correlation class (DataFrames are built from np.random.randn(1000, 30)):

> asv continuous -f 1.1 origin/corrwith-dev HEAD -b stat_ops.Correlation
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Building a511a498 for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt........................................
·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
· Running 6 total benchmarks (2 commits * 1 environments * 3 benchmarks)
[  0.00%] · For pandas commit hash a511a498:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  0.00%] ··· Running (stat_ops.Correlation.time_corr--)...
[ 16.67%] ··· stat_ops.Correlation.time_corr                                                                                         ok
[ 16.67%] ··· ========== =============
                method                
              ---------- -------------
               spearman     93.6±2ms  
               kendall      219±4ms   
               pearson    2.14±0.09ms 
              ========== =============

[ 33.33%] ··· stat_ops.Correlation.time_corrwith_cols                                                                                ok
[ 33.33%] ··· ========== ============
                method               
              ---------- ------------
               spearman    51.8±1ms  
               kendall     20.6±1ms  
               pearson    17.0±0.7ms 
              ========== ============

[ 50.00%] ··· stat_ops.Correlation.time_corrwith_rows                                                                                ok
[ 50.00%] ··· ========== ==========
                method             
              ---------- ----------
               spearman   758±30ms 
               kendall    493±7ms  
               pearson    258±7ms  
              ========== ==========

[ 50.00%] · For pandas commit hash a511a498:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running (stat_ops.Correlation.time_corr--)...
[ 66.67%] ··· stat_ops.Correlation.time_corr                                                                                         ok
[ 66.67%] ··· ========== =============
                method                
              ---------- -------------
               spearman     101±6ms   
               kendall      226±9ms   
               pearson    2.16±0.03ms 
              ========== =============

[ 83.33%] ··· stat_ops.Correlation.time_corrwith_cols                                                                                ok
[ 83.33%] ··· ========== ============
                method               
              ---------- ------------
               spearman    52.1±4ms  
               kendall     21.1±2ms  
               pearson    16.5±0.6ms 
              ========== ============

[100.00%] ··· stat_ops.Correlation.time_corrwith_rows                                                                                ok
[100.00%] ··· ========== ==========
                method             
              ---------- ----------
               spearman   752±20ms 
               kendall    490±4ms  
               pearson    266±3ms  
              ========== ==========


BENCHMARKS NOT SIGNIFICANTLY CHANGED.
> 

(resolved review thread: pandas/core/frame.py)
@TomAugspurger (Contributor) commented Aug 23, 2018 via email

@dsaxton (Member, PR author) commented Aug 23, 2018

@TomAugspurger I just pushed some changes that include the above (zip / map instead of apply, more clarity in the whatsnew), and also added a relatively simple test to check if the method can handle duplicate columns (below). I may have mauled the git history a bit with all the merging, so a rebase might be needed.
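The zip / map idea amounts to pairing the underlying column arrays and correlating each pair; roughly (a sketch of the approach with illustrative names, not the exact code in the PR):

```python
import numpy as np
import pandas as pd

def corrwith_zip(left, right, method="pearson"):
    # pair matching columns as raw NumPy arrays and correlate each pair
    # through Series.corr, instead of going through DataFrame.apply
    pairs = zip(left.values.T, right.values.T)
    return pd.Series(
        list(map(lambda p: pd.Series(p[0]).corr(pd.Series(p[1]), method=method),
                 pairs)),
        index=left.columns,
    )

df1 = pd.DataFrame(np.arange(20.0).reshape(10, 2), columns=["a", "b"])
df2 = df1 * 2 + 1  # perfectly linearly related to df1
result = corrwith_zip(df1, df2)
```

Because left.values.T iterates over columns positionally as plain arrays, the inner loop never does label alignment, which sidesteps the duplicate-label reindexing problem.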

Regarding the test, I am creating a DataFrame with three columns equal to the integers 0, 1, ... , 9, then creating another DataFrame which adds an additional column of the same form with a duplicate column name. The result of corrwith should be a Series of four ones (and should not generate an error). Please let me know if there's a more straightforward way to simply check that the method doesn't err out.

    def test_corrwith_dup_cols(self):
        # GH 21925
        df1 = pd.DataFrame(np.vstack([np.arange(10)] * 3).T)
        df2 = df1.copy()
        df2 = pd.concat((df2, df2[0]), axis=1)

        result = df1.corrwith(df2).values
        expected = np.ones(4)
        tm.assert_almost_equal(result, expected)

@jreback (Contributor) commented Sep 4, 2018

can you rebase

@pep8speaks (bot) commented Sep 5, 2018

Hello @dsaxton! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on December 28, 2018 at 02:42 Hours UTC

@jreback (Contributor) left a comment

what is the perf diff with this (also add something with a lot of columns, like 100) to see what effect that has


        num = (ldem * rdem).sum()
        dom = (left.count() - 1) * left.std() * right.std()
        correl = Series(map(lambda x: Series(x[0]).corr(Series(x[1]),
Review comment (Contributor):

instead of nesting this, can you create an in-line named function that you pass to the map

@jreback (Contributor) commented Dec 23, 2018

can you merge master and update to comments

@mroeschke (Member) commented:

Looks like there's a linting error:

2018-12-26T19:14:53.0223964Z ##[error]./asv_bench/benchmarks/stat_ops.py(115,1): error W293: blank line contains whitespace

(resolved review threads: doc/source/whatsnew/v0.24.0.rst (two threads), pandas/core/frame.py)
            dom = (left.count() - 1) * left.std() * right.std()
            correl = num / dom

        else:
Review comment (Contributor):

do elif method in ['.....], and add an else clause that raises (wrong method is passed)

    @pytest.mark.xfail
    def test_corrwith_dup_cols(self):
        # GH 21925
        df1 = pd.DataFrame(np.vstack([np.arange(10)] * 3).T)
Review comment (Contributor):

can you add an example with an empty frame.

Reply (@dsaxton):

A test to ensure that the output is an empty Series (with the proper index)?

(resolved review thread: doc/source/whatsnew/v0.24.0.rst)
@@ -466,6 +466,33 @@ def test_corrwith_mixed_dtypes(self):
        expected = pd.Series(data=corrs, index=['a', 'b'])
        tm.assert_series_equal(result, expected)

@pytest.mark.xfail
Review comment (Contributor):

Perhaps I'm missing something, but why is this xfailed?

Reply (@dsaxton):

I had xfailed this because master was giving an error when columns were duplicated (ValueError: cannot reindex from a duplicate axis)

Review comment (Contributor):

Could you rewrite the test to not have duplicate labels?

But I notice now that on master we do allow duplicates in DataFrame.corrwith

In [6]: df = pd.DataFrame([[1, 2], [3, 4]], columns=[0, 0], index=[1, 1])

In [7]: df.corrwith(df)
Out[7]:
0    1.0
0    1.0
dtype: float64

so we may need to adjust the implementation to not regress on that. It'd be good to add some tests for that if they aren't already present.

Reply (@dsaxton):

Hmm, I wonder why that code sample works but this one gives an error:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.random(size=(10, 2)), columns=["a", "b"])
df2 = pd.DataFrame(np.random.random(size=(10, 2)), columns=["a", "a"])

df1.corrwith(df2)

Reply (@dsaxton):

@TomAugspurger By duplicate labels do you mean the values within the DataFrame themselves (i.e., not the indices or column names)?

(resolved review threads: doc/source/whatsnew/v0.24.0.rst, pandas/core/frame.py, pandas/tests/frame/test_analytics.py)
@jreback (Contributor) commented Dec 29, 2018

can you merge master, also update to handle duplicates

daniel saxton and others added 9 commits December 29, 2018 12:10
* Remove incorrect error (didn't account for callables)
* Add xfail to duplicate columns test
* Fix transpose (was taken twice for Pearson)
* Remove inplace usage for dropna
* Check for invalid method
* Do not cast arrays to Series in function c
@jreback (Contributor) left a comment

minor comment. ping on green.

(resolved review thread: pandas/core/frame.py)
@jreback jreback added this to the 0.24.0 milestone Dec 30, 2018
* Add comment for when drop is False
* Check if len(idx_diff) > 0
* Remove unnecessary string casting in error message
@jreback (Contributor) commented Dec 30, 2018

lgtm. ping on green.

@dsaxton (Member, PR author) commented Dec 31, 2018

@jreback Looks like it's passing. Thanks for your help and patience!

@jreback jreback merged commit 08640c3 into pandas-dev:master Dec 31, 2018
@jreback (Contributor) commented Dec 31, 2018

thanks @dsaxton

Labels
Enhancement; Numeric Operations (Arithmetic, Comparison, and Logical operations)
Development

Successfully merging this pull request may close these issues.

* DataFrame.corrwith has unintuitive behavior
* Allow different methods of correlation when using corrwith
6 participants