
ENH: Enable DataFrame.corrwith to compute rank correlations #22375

Merged 10 commits into pandas-dev:master on Dec 31, 2018

Conversation

@dsaxton (Member) commented Aug 16, 2018

This PR enables DataFrame.corrwith to compute rank correlations in addition to Pearson's correlation, and should be fully backwards compatible. It also clarifies the functionality in the docstring, raises an informative error if the user specifies a form of correlation that isn't implemented, and adds tests for the new behavior.

closes #22328
closes #21925
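For illustration, the enhanced API can be exercised like this (a minimal sketch against the merged behavior; note that the rank methods route through scipy.stats, so SciPy must be installed):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
df2 = df1 + rng.normal(scale=0.5, size=(100, 3))  # noisy copy, so correlations are high

pearson = df1.corrwith(df2)                      # default, unchanged behavior
spearman = df1.corrwith(df2, method="spearman")  # new: rank correlation (needs SciPy)

# a method that isn't implemented raises a ValueError
try:
    df1.corrwith(df2, method="not-a-method")
except ValueError as err:
    print(err)
```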

@jreback (Contributor) left a comment

since you are changing the impl can you run the asv for perf check

@dsaxton (Member, PR author) commented Aug 16, 2018

I'm having some trouble finding a relevant benchmark for corrwith (a grep on the benchmarks folder returns no hits). Should one be added to the Correlation class in stats_ops.py?

@dsaxton (Member, PR author) commented Aug 16, 2018

Here is one I put together which seems to suggest around a 33-50% speed-up with the new implementation.

import numpy as np
from pandas import DataFrame, Series

np.random.seed(2357)

def corrwith_old(df1, df2, axis=0, drop=False):
    axis = df1._get_axis_number(axis)
    this = df1._get_numeric_data()

    if isinstance(df2, Series):
        return this.apply(df2.corr, axis=axis)

    df2 = df2._get_numeric_data()

    left, right = this.align(df2, join='inner', copy=False)

    # mask missing values
    left = left + right * 0
    right = right + left * 0

    if axis == 1:
        left = left.T
        right = right.T

    # demeaned data
    ldem = left - left.mean()
    rdem = right - right.mean()

    num = (ldem * rdem).sum()
    dom = (left.count() - 1) * left.std() * right.std()

    correl = num / dom

    if not drop:
        raxis = 1 if axis == 0 else 0
        result_index = this._get_axis(raxis).union(df2._get_axis(raxis))
        correl = correl.reindex(result_index)

    return correl


def corrwith_new(df1, df2, axis=0, drop=False, method='pearson'):
    if method not in ['pearson', 'spearman', 'kendall']:
        raise ValueError("method must be either 'pearson', "
                         "'spearman', or 'kendall', '{method}' "
                         "was supplied".format(method=method))

    axis = df1._get_axis_number(axis)
    this = df1._get_numeric_data()

    if isinstance(df2, Series):
        return this.apply(lambda x: df2.corr(x, method=method),
                          axis=axis)

    df2 = df2._get_numeric_data()
    left, right = this.align(df2, join='inner', copy=False)

    if axis == 1:
        left = left.T
        right = right.T

    correl = (left.apply(lambda x:
                         x.corr(right[x.name], method=method)))

    if drop:
        correl.dropna(inplace=True)

    return correl


data1 = DataFrame(np.random.normal(size=(10**6, 10)))
data2 = DataFrame(np.random.normal(size=(10**6, 10)))

%timeit corrwith_old(data1, data2)
%timeit corrwith_new(data1, data2)

@jreback (Contributor) commented Aug 16, 2018

can u add this as an asv

in particular i suspect your method will slow down when used with a larger number of columns

so pls show that as well

@codecov (bot) commented Aug 16, 2018

Codecov Report

Merging #22375 into master will decrease coverage by 60.42%.
The diff coverage is 0%.


@@             Coverage Diff             @@
##           master   #22375       +/-   ##
===========================================
- Coverage   92.31%   31.89%   -60.43%     
===========================================
  Files         166      166               
  Lines       52335    52429       +94     
===========================================
- Hits        48313    16722    -31591     
- Misses       4022    35707    +31685
Flag       Coverage Δ
#multiple  30.28% <0%> (-60.45%) ⬇️
#single    31.89% <0%> (-11.16%) ⬇️

Impacted Files                      Coverage Δ
pandas/core/frame.py                27.82% <0%> (-69.1%) ⬇️
pandas/io/formats/latex.py          0% <0%> (-100%) ⬇️
pandas/core/categorical.py          0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py      0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py          0% <0%> (-100%) ⬇️
pandas/tseries/converter.py         0% <0%> (-100%) ⬇️
pandas/io/formats/html.py           0% <0%> (-98.65%) ⬇️
pandas/core/groupby/categorical.py  0% <0%> (-95.46%) ⬇️
pandas/core/reshape/reshape.py      8.06% <0%> (-91.51%) ⬇️
pandas/io/sas/sas7bdat.py           0% <0%> (-91.17%) ⬇️
... and 127 more

Last update aeff38d...870d1a3.

@dsaxton (Member, PR author) commented Aug 16, 2018

I've added the new benchmark function time_corrwith to stats_ops.py within the Correlation class. I'll try to run the benchmark locally (although I botched the setup of asv on my main computer and it seems to be broken).
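For context, an asv benchmark of the shape being described might look like this (a sketch following the conventions in asv_bench/benchmarks/stat_ops.py; the exact names and sizes in the merged benchmark may differ):

```python
import numpy as np
import pandas as pd


class Correlation:
    # asv runs each time_* method once per entry in params
    params = ["spearman", "kendall", "pearson"]
    param_names = ["method"]

    def setup(self, method):
        self.df = pd.DataFrame(np.random.randn(1000, 30))
        self.df2 = pd.DataFrame(np.random.randn(1000, 30))

    def time_corr(self, method):
        self.df.corr(method=method)

    def time_corrwith_cols(self, method):
        self.df.corrwith(self.df2, method=method)

    def time_corrwith_rows(self, method):
        self.df.corrwith(self.df2, axis=1, method=method)
```

Since asv treats each params entry as a separate variant, the Pearson timing stays comparable against master even though the two rank variants have no baseline there yet.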

@dsaxton (Member, PR author) commented Aug 16, 2018

@jreback I think your suspicion may have been accurate regarding the dimension of the inputs. In any case, for the tests it looks like there's an issue with scipy during the travis build. Am I missing an import somewhere perhaps?

@WillAyd (Member) commented Aug 16, 2018

@dsaxton we test quite a few different configurations, not all of which have SciPy installed. If your tests depend on that package you should use the skip_if_no_scipy decorator, which you'll see used in other places in that same module.


    def time_corr(self, method):
        self.df.corr(method=method)

    def time_corrwith(self, method):
        self.df.corrwith(df2, method=method)
Review comment (Member):

df2 should be self.df2 here.

Reply (@dsaxton):

Good catch, thank you.

@dsaxton (Member, PR author) commented Aug 16, 2018

@jreback If a certain benchmark is actually part of the PR, what would you say is the most straightforward way to show that there isn't a degradation in performance? Here the time_corrwith function will not work for the current implementation because it uses all three correlation types. I suppose it could temporarily be modified to look only at Pearson?

@WillAyd (Member) commented Aug 16, 2018

You should have benchmarks for all three. The two new ones will fail since they don't exist on master, but that's OK: it still sets a baseline going forward, and we can still get insights out of the Pearson benchmark.

@dsaxton (Member, PR author) commented Aug 16, 2018

[screenshot: asv benchmark results, 2018-08-16 4:31 pm]

Here are the results I get running the stat_ops.Correlation benchmarks (all DataFrames involved are built from 1000 x 30 Gaussian arrays). The Pearson calculation somehow took longer than DataFrame.corr, even though the latter should be doing more work, since it computes the full 30 x 30 correlation matrix. I'll need to run it against the current corrwith to see how that performs.

@dsaxton (Member, PR author) commented Aug 17, 2018

I modified the body of the function to mimic how the current calculation is done in the special case where method='pearson' (in case there was overhead associated with repeatedly calling Series.corr), and I saw essentially the same performance as before (shown below). My guess, then, is that DataFrame.corr gets its speed from nancorr within pandas._libs.algos. Could it make sense to use functions from algos inside corrwith as well?

        if method in ['spearman', 'kendall']:
            correl = (left.apply(lambda x:
                                 x.corr(right[x.name], method=method)))
        else:
            correl = (((right - right.mean()) * (left - left.mean())).mean()
                      / right.std() / left.std())

[screenshot: asv benchmark results, 2018-08-16 6:52 pm]
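The speed difference is consistent with nancorr making a single vectorized pass over the data. The idea can be sketched in plain NumPy (my own illustration of the approach, not the pandas internals; it assumes no missing values):

```python
import numpy as np

def columnwise_pearson(a, b):
    # Pearson correlation between matching columns of two equal-shape
    # 2-D arrays, computed in one vectorized pass (no per-column Python loop)
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    num = (a * b).sum(axis=0)
    dom = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))
    return num / dom

rng = np.random.default_rng(1)
x = rng.normal(size=(1000, 5))
y = x + rng.normal(scale=0.5, size=(1000, 5))
r = columnwise_pearson(x, y)  # one correlation per column
```

Compared with DataFrame.apply calling Series.corr per column, this avoids constructing a Series object for every column.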

@jreback (Contributor) left a comment

pls add a whatsnew note in other enhancements

(resolved review threads: asv_bench/benchmarks/stat_ops.py, pandas/core/frame.py)
method : {'pearson', 'kendall', 'spearman'}
* pearson : standard correlation coefficient
* kendall : Kendall Tau correlation coefficient
* spearman : Spearman rank correlation

Returns
-------
correls : Series
"""
Review comment (Contributor):

can you add a See Also and revert to .corr (and add to .corr a See Also referring to .corrwith).

@jreback (Contributor) left a comment

what do the asvs show?

(resolved review thread: pandas/core/frame.py)
@jreback added the Enhancement and Numeric Operations (Arithmetic, Comparison, and Logical operations) labels Aug 17, 2018
@jreback jreback added this to the 0.24.0 milestone Aug 17, 2018
@jreback (Contributor) left a comment

pls add a whatsnew as well (with both issues)

(resolved review thread: pandas/tests/frame/test_analytics.py)
@jreback (Contributor) commented Aug 17, 2018

don't post pictures of the asvs, just copy-paste the text pls.

@dsaxton (Member, PR author) commented Aug 17, 2018

asv results for the Correlation class (DataFrames are built from np.random.randn(1000, 30)):

> asv continuous -f 1.1 origin/corrwith-dev HEAD -b stat_ops.Correlation
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Building a511a498 for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt........................................
·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
· Running 6 total benchmarks (2 commits * 1 environments * 3 benchmarks)
[  0.00%] · For pandas commit hash a511a498:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  0.00%] ··· Running (stat_ops.Correlation.time_corr--)...
[ 16.67%] ··· stat_ops.Correlation.time_corr                                                                                         ok
[ 16.67%] ··· ========== =============
                method                
              ---------- -------------
               spearman     93.6±2ms  
               kendall      219±4ms   
               pearson    2.14±0.09ms 
              ========== =============

[ 33.33%] ··· stat_ops.Correlation.time_corrwith_cols                                                                                ok
[ 33.33%] ··· ========== ============
                method               
              ---------- ------------
               spearman    51.8±1ms  
               kendall     20.6±1ms  
               pearson    17.0±0.7ms 
              ========== ============

[ 50.00%] ··· stat_ops.Correlation.time_corrwith_rows                                                                                ok
[ 50.00%] ··· ========== ==========
                method             
              ---------- ----------
               spearman   758±30ms 
               kendall    493±7ms  
               pearson    258±7ms  
              ========== ==========

[ 50.00%] · For pandas commit hash a511a498:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running (stat_ops.Correlation.time_corr--)...
[ 66.67%] ··· stat_ops.Correlation.time_corr                                                                                         ok
[ 66.67%] ··· ========== =============
                method                
              ---------- -------------
               spearman     101±6ms   
               kendall      226±9ms   
               pearson    2.16±0.03ms 
              ========== =============

[ 83.33%] ··· stat_ops.Correlation.time_corrwith_cols                                                                                ok
[ 83.33%] ··· ========== ============
                method               
              ---------- ------------
               spearman    52.1±4ms  
               kendall     21.1±2ms  
               pearson    16.5±0.6ms 
              ========== ============

[100.00%] ··· stat_ops.Correlation.time_corrwith_rows                                                                                ok
[100.00%] ··· ========== ==========
                method             
              ---------- ----------
               spearman   752±20ms 
               kendall    490±4ms  
               pearson    266±3ms  
              ========== ==========


BENCHMARKS NOT SIGNIFICANTLY CHANGED.
> 

(resolved review thread: pandas/core/frame.py)
@TomAugspurger (Contributor) commented Aug 23, 2018 via email

@dsaxton (Member, PR author) commented Aug 23, 2018

@TomAugspurger I just pushed some changes that include the above (zip / map instead of apply, more clarity in the whatsnew), and also added a relatively simple test to check if the method can handle duplicate columns (below). I may have mauled the git history a bit with all the merging, so a rebase might be needed.
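The zip / map idea amounts to pairing the underlying column arrays and correlating each pair; roughly (a sketch of the approach with illustrative names, not the exact code in the PR):

```python
import numpy as np
import pandas as pd

def corrwith_zip(left, right, method="pearson"):
    # pair matching columns as raw NumPy arrays and correlate each pair
    # through Series.corr, instead of going through DataFrame.apply
    pairs = zip(left.values.T, right.values.T)
    return pd.Series(
        list(map(lambda p: pd.Series(p[0]).corr(pd.Series(p[1]), method=method),
                 pairs)),
        index=left.columns,
    )

df1 = pd.DataFrame(np.arange(20.0).reshape(10, 2), columns=["a", "b"])
df2 = df1 * 2 + 1  # perfectly linearly related to df1
result = corrwith_zip(df1, df2)
```

Because left.values.T iterates over columns positionally as plain arrays, the inner loop never does label alignment, which sidesteps the duplicate-label reindexing problem.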

Regarding the test, I am creating a DataFrame with three columns equal to the integers 0, 1, ... , 9, then creating another DataFrame which adds an additional column of the same form with a duplicate column name. The result of corrwith should be a Series of four ones (and should not generate an error). Please let me know if there's a more straightforward way to simply check that the method doesn't err out.

    def test_corrwith_dup_cols(self):
        # GH 21925
        df1 = pd.DataFrame(np.vstack([np.arange(10)] * 3).T)
        df2 = df1.copy()
        df2 = pd.concat((df2, df2[0]), axis=1)

        result = df1.corrwith(df2).values
        expected = np.ones(4)
        tm.assert_almost_equal(result, expected)

@jreback (Contributor) commented Sep 4, 2018

can you rebase

@pep8speaks (bot) commented Sep 5, 2018

Hello @dsaxton! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on December 28, 2018 at 02:42 Hours UTC

@jreback (Contributor) left a comment

what is the perf diff with this (also add something with a lot of columns, like 100) to see what effect that has


        num = (ldem * rdem).sum()
        dom = (left.count() - 1) * left.std() * right.std()
        correl = Series(map(lambda x: Series(x[0]).corr(Series(x[1]),
Review comment (Contributor):

instead of nesting this, can you create an in-line named function that you pass to the map

@jreback (Contributor) commented Dec 23, 2018

can you merge master and update to comments

@mroeschke (Member) commented:

Looks like there's a linting error:

2018-12-26T19:14:53.0223964Z ##[error]./asv_bench/benchmarks/stat_ops.py(115,1): error W293: blank line contains whitespace

(resolved review threads: doc/source/whatsnew/v0.24.0.rst (two threads), pandas/core/frame.py)
            dom = (left.count() - 1) * left.std() * right.std()
            correl = num / dom

        else:
Review comment (Contributor):

do elif method in ['.....], and add an else clause that raises (wrong method is passed)

    @pytest.mark.xfail
    def test_corrwith_dup_cols(self):
        # GH 21925
        df1 = pd.DataFrame(np.vstack([np.arange(10)] * 3).T)
Review comment (Contributor):

can you add an example with an empty frame.

Reply (@dsaxton):

A test to ensure that the output is an empty Series (with the proper index)?

(resolved review thread: doc/source/whatsnew/v0.24.0.rst)
@@ -466,6 +466,33 @@ def test_corrwith_mixed_dtypes(self):
        expected = pd.Series(data=corrs, index=['a', 'b'])
        tm.assert_series_equal(result, expected)

@pytest.mark.xfail
Review comment (Contributor):

Perhaps I'm missing something, but why is this xfailed?

Reply (@dsaxton):

I had xfailed this because master was giving an error when columns were duplicated (ValueError: cannot reindex from a duplicate axis)

Review comment (Contributor):

Could you rewrite the test to not have duplicate labels?

But I notice now that on master we do allow duplicates in DataFrame.corrwith

In [6]: df = pd.DataFrame([[1, 2], [3, 4]], columns=[0, 0], index=[1, 1])

In [7]: df.corrwith(df)
Out[7]:
0    1.0
0    1.0
dtype: float64

so we may need to adjust the implementation to not regress on that. It'd be good to add some tests for that if they aren't already present.

Reply (@dsaxton):

Hmm, I wonder why that code sample works but this one gives an error:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.random(size=(10, 2)), columns=["a", "b"])
df2 = pd.DataFrame(np.random.random(size=(10, 2)), columns=["a", "a"])

df1.corrwith(df2)

Reply (@dsaxton):

@TomAugspurger By duplicate labels do you mean the values within the DataFrame themselves (i.e., not the indices or column names)?

(resolved review threads: doc/source/whatsnew/v0.24.0.rst, pandas/core/frame.py, pandas/tests/frame/test_analytics.py)
@jreback (Contributor) commented Dec 29, 2018

can you merge master, also update to handle duplicates

daniel saxton and others added 9 commits December 29, 2018 12:10
* Remove incorrect error (didn't account for callables)
* Add xfail to duplicate columns test
* Fix transpose (was taken twice for Pearson)
* Remove inplace usage for dropna
* Check for invalid method
* Do not cast arrays to Series in function c
@jreback (Contributor) left a comment

minor comment. ping on green.

(resolved review thread: pandas/core/frame.py)
@jreback jreback added this to the 0.24.0 milestone Dec 30, 2018
* Add comment for when drop is False
* Check if len(idx_diff) > 0
* Remove unnecessary string casting in error message
@jreback (Contributor) commented Dec 30, 2018

lgtm. ping on green.

@dsaxton (Member, PR author) commented Dec 31, 2018

@jreback Looks like it's passing. Thanks for your help and patience!

@jreback jreback merged commit 08640c3 into pandas-dev:master Dec 31, 2018
@jreback (Contributor) commented Dec 31, 2018

thanks @dsaxton

Labels
Enhancement; Numeric Operations (Arithmetic, Comparison, and Logical operations)
Development

Successfully merging this pull request may close these issues.

* DataFrame.corrwith has unintuitive behavior
* Allow different methods of correlation when using corrwith
6 participants