ENH: implement pd.Series.corr(method="distance") #22402

dsaxton · 2018-08-17T19:53:58Z

Distance correlation (https://en.wikipedia.org/wiki/Distance_correlation) is a powerful yet underused measure of dependence between two random variables that I think would make a very nice addition to the existing correlation methods in pandas. In particular, it has the property that two random variables $X$ and $Y$ (with finite first moments) are independent if and only if their distance correlation is zero, which cannot be said of Pearson, Spearman or Kendall.
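For reference, the sample statistic can be written out as follows. This is a sketch of the usual plug-in estimator for paired observations $(x_i, y_i)$, $i = 1, \dots, n$; the symbols $a$, $b$, $A$, $B$ are chosen to match the variable names in the code in this post, and the bar notation for row, column and grand means is an assumption of this note.

```latex
\begin{aligned}
a_{ij} &= |x_i - x_j|, &
A_{ij} &= a_{ij} - \bar{a}_{i\cdot} - \bar{a}_{\cdot j} + \bar{a}_{\cdot\cdot}, \\
b_{ij} &= |y_i - y_j|, &
B_{ij} &= b_{ij} - \bar{b}_{i\cdot} - \bar{b}_{\cdot j} + \bar{b}_{\cdot\cdot}, \\
\operatorname{dCov}_n^2(x, y) &= \frac{1}{n^2} \sum_{i,j=1}^{n} A_{ij} B_{ij}, &
\operatorname{dVar}_n^2(x) &= \frac{1}{n^2} \sum_{i,j=1}^{n} A_{ij}^2, \\
\operatorname{dCor}_n(x, y) &= \frac{\operatorname{dCov}_n(x, y)}{\sqrt{\operatorname{dVar}_n(x)\,\operatorname{dVar}_n(y)}}.
\end{aligned}
```

By convention the statistic is taken to be zero when either distance variance is zero, for example when one input is constant.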

The code below is a pure NumPy implementation (which could certainly be optimized or written more elegantly) that could become part of the Series class and be called from within corr. Later it could be integrated with corrwith. If this feature were available, it would personally be one of the first things I would look at when approaching a regression problem.

import numpy as np

# self and other can be assumed to be aligned already
def nandistcorr(self, other):
    n = len(self)
    a = np.zeros(shape=(n, n))
    b = np.zeros(shape=(n, n))

    # fill the upper triangles with the pairwise absolute distances
    for i in range(n):
        for j in range(i+1, n):
            a[i, j] = abs(self[i] - self[j])
            b[i, j] = abs(other[i] - other[j])

    # mirror into the lower triangles; the diagonals stay zero
    a = a + a.T
    b = b + b.T

    # column means repeated down the rows (the transpose gives the row means)
    a_bar = np.vstack([np.nanmean(a, axis=0)] * n)
    b_bar = np.vstack([np.nanmean(b, axis=0)] * n)

    # double-centered distance matrices
    A = a - a_bar - a_bar.T + np.full(shape=(n, n), fill_value=a_bar.mean())
    B = b - b_bar - b_bar.T + np.full(shape=(n, n), fill_value=b_bar.mean())

    # sample distance covariance and the two normalizing factors
    cov_ab = np.sqrt(np.nansum(A * B)) / n
    std_a = np.sqrt(np.sqrt(np.nansum(A**2)) / n)
    std_b = np.sqrt(np.sqrt(np.nansum(B**2)) / n)

    return cov_ab / std_a / std_b

Here's an example that shows how distance correlation can detect relationships that the other common correlation methods miss:

import numpy as np
import pandas as pd
np.random.seed(2357)

s1 = pd.Series(np.random.randn(1000))
s2 = s1**2

s1.corr(s2, method="pearson")
s1.corr(s2, method="spearman")
s1.corr(s2, method="kendall")
nandistcorr(s1.values, s2.values)

gfyoung · 2018-08-19T08:09:13Z

Interesting suggestion, except I'm a little concerned by the phrase "under-used," as it makes me wonder how much benefit this addition might have for the overall user base vs. maintenance work.

If you can write a relatively simple implementation in pandas, that would be good.

cc @jreback

dsaxton · 2018-08-19T13:33:54Z

@gfyoung I can understand that concern, as it could be argued that distance correlation is more interesting as a theoretical rather than applied measure. Is there a way to gauge interest in a feature within the pandas user base?

gfyoung · 2018-08-19T17:50:42Z

There is the pandas mailing list.

TomAugspurger · 2018-09-18T14:06:41Z

#22684 adds support for passing a callable with the correct signature as method. I think we would happily accept a PR for a cookbook recipe that implements distance correlation on top of #22684.
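As a rough illustration of what such a recipe might look like, here is a minimal sketch. It assumes a pandas version in which Series.corr and DataFrame.corr accept a callable that takes two 1-D arrays and returns a float. The helper name distcorr is hypothetical, the implementation is a vectorized variant rather than the code from the original post, and it assumes the inputs contain no NaNs.

```python
import numpy as np
import pandas as pd


def distcorr(x, y):
    """Sample distance correlation of two equal-length 1-D arrays.

    Hypothetical helper for illustration; assumes no NaNs in x or y.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Pairwise absolute distances |x_i - x_j| and |y_i - y_j|.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])

    # Double-center: subtract row and column means, add back the grand mean.
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()

    dcov2 = (A * B).mean()                              # squared distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())    # sqrt of product of squared distance variances
    if denom == 0:
        # Convention: zero when either input has zero distance variance.
        return 0.0
    # Clip tiny negative values caused by floating-point rounding.
    return float(np.sqrt(max(dcov2, 0.0) / denom))


rng = np.random.RandomState(2357)
s1 = pd.Series(rng.randn(1000))
s2 = s1**2

# Pass the function itself as the method.
r = s1.corr(s2, method=distcorr)

# The same callable can be used for a full correlation matrix.
corr_matrix = pd.DataFrame({"s1": s1, "s2": s2}).corr(method=distcorr)
```

Building the full n-by-n distance matrices costs O(n^2) memory, so this sketch is only practical for moderately sized inputs.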

dsaxton · 2018-09-18T14:39:39Z

@TomAugspurger Sounds good, #22684 looks like a nice change. Can you point me to any documentation on how to create cookbook recipes?

TomAugspurger · 2018-09-18T14:55:37Z

It's nothing fancy, just a bunch of code snippets in https://github.com/pandas-dev/pandas/blob/master/doc/source/cookbook.rst

(it would be nice to enforce some consistency, and pick a better presentation format; but that's another issue).

dsaxton · 2018-09-18T18:53:17Z

I could add this within the "Computation" section if that makes sense.

nickcorona · 2020-07-28T21:50:50Z

Why closed?

WillAyd added the Enhancement and Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) labels on Aug 19, 2018

dsaxton mentioned this issue on Sep 19, 2018 in DOC: Add cookbook entry using callable method for DataFrame.corr #22761 (merged)

dsaxton closed this as completed Jan 24, 2019
