ENH: implement pd.Series.corr(method="distance") #22402

Closed
dsaxton opened this issue Aug 17, 2018 · 8 comments
Labels
Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Enhancement

Comments

@dsaxton
Member

dsaxton commented Aug 17, 2018

Distance correlation (https://en.wikipedia.org/wiki/Distance_correlation) is a powerful yet underused measure of dependence between two random variables that I think would make a very nice addition to the existing correlation methods in pandas. For one, it has the distinctive property that two random variables $X$ and $Y$ are independent if and only if their distance correlation is zero, which cannot be said of Pearson, Spearman or Kendall.

The code below is an implementation in pure numpy (which could certainly be optimized / more elegantly written) that could be part of the Series class and then called within corr. Later it could be integrated with corrwith. If this feature were available, I know it would be one of the first things I would look at when approaching a regression problem.

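import numpy as np

# self and other can be assumed to be aligned already
def nandistcorr(self, other):
    # Work on plain float arrays so that indexing below is positional
    x = np.asarray(self, dtype=float)
    y = np.asarray(other, dtype=float)
    n = len(x)
    a = np.zeros(shape=(n, n))
    b = np.zeros(shape=(n, n))

    # Fill the upper triangle with the pairwise absolute differences
    for i in range(n):
        for j in range(i + 1, n):
            a[i, j] = abs(x[i] - x[j])
            b[i, j] = abs(y[i] - y[j])

    # Mirror the upper triangle to get the full symmetric distance matrices
    a = a + a.T
    b = b + b.T

    # Column means repeated on every row; the transpose gives the row means
    a_bar = np.vstack([np.nanmean(a, axis=0)] * n)
    b_bar = np.vstack([np.nanmean(b, axis=0)] * n)

    # Double centering: subtract row and column means, add back the grand mean
    A = a - a_bar - a_bar.T + a_bar.mean()
    B = b - b_bar - b_bar.T + b_bar.mean()

    # Sample distance covariance and the two distance standard deviations
    cov_ab = np.sqrt(np.nansum(A * B)) / n
    std_a = np.sqrt(np.sqrt(np.nansum(A**2)) / n)
    std_b = np.sqrt(np.sqrt(np.nansum(B**2)) / n)

    return cov_ab / std_a / std_b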
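
For reference, the quantities computed by the function above correspond to the usual sample definition of distance correlation. The notation below is chosen to mirror the variable names in the code and is not part of the original post; it also ignores missing values.

```latex
\begin{aligned}
a_{ij} &= |x_i - x_j|, \qquad b_{ij} = |y_i - y_j|, \qquad i, j = 1, \dots, n \\
A_{ij} &= a_{ij} - \bar{a}_{i\cdot} - \bar{a}_{\cdot j} + \bar{a}_{\cdot\cdot}, \qquad
B_{ij} = b_{ij} - \bar{b}_{i\cdot} - \bar{b}_{\cdot j} + \bar{b}_{\cdot\cdot} \\
\operatorname{dCov}_n^2(x, y) &= \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} B_{ij}, \qquad
\operatorname{dVar}_n^2(x) = \operatorname{dCov}_n^2(x, x) \\
\operatorname{dCor}_n(x, y) &= \frac{\operatorname{dCov}_n(x, y)}{\sqrt{\operatorname{dVar}_n(x)\,\operatorname{dVar}_n(y)}}
\end{aligned}
```

Here cov_ab is $\operatorname{dCov}_n(x, y)$, while std_a and std_b are $\sqrt{\operatorname{dVar}_n(x)}$ and $\sqrt{\operatorname{dVar}_n(y)}$, so the returned ratio is $\operatorname{dCor}_n(x, y)$.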

Here's an example that shows how distance correlation can detect relationships that the other common correlation methods miss:

import numpy as np
import pandas as pd
np.random.seed(2357)

s1 = pd.Series(np.random.randn(1000))
s2 = s1**2

s1.corr(s2, method="pearson")
s1.corr(s2, method="spearman")
s1.corr(s2, method="kendall")
nandistcorr(s1.values, s2.values)

WillAyd added the Enhancement and Algos labels on Aug 19, 2018

@gfyoung
Member

gfyoung commented Aug 19, 2018

Interesting suggestion, except I'm a little concerned by the phrase "under-used," as it makes me wonder how much benefit this addition might have for the overall user base vs. maintenance work.

If you can write a relatively simple implementation in pandas, that would be good.

cc @jreback

@dsaxton
Member Author

dsaxton commented Aug 19, 2018

@gfyoung I can understand that concern, as it could be argued that distance correlation is more interesting as a theoretical rather than applied measure. Is there a way to gauge interest in a feature within the pandas user base?

@gfyoung
Member

gfyoung commented Aug 19, 2018

There is the pandas mailing list.

@TomAugspurger
Contributor

#22684 will allow the method argument to be a callable with the correct signature. I think we would happily accept a PR for a cookbook recipe with a distance correlation that uses #22684.
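
A minimal sketch of what such a recipe might look like, assuming a pandas version in which Series.corr accepts a callable that takes two 1-D ndarrays and returns a float. The name distance_correlation is hypothetical, the pairwise differences are vectorized rather than looped, and unlike nandistcorr above the function does no NaN handling of its own.

```python
import numpy as np
import pandas as pd


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length 1-D arrays.

    Hypothetical helper for illustration; assumes the inputs contain no NaN.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Pairwise absolute differences |x_i - x_j| and |y_i - y_j|
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])

    # Double centering: subtract row and column means, add back the grand mean
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()

    # Squared sample distance covariance and distance variances
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    if denom == 0:
        return np.nan  # at least one input is constant

    # Clip tiny negative values caused by floating-point rounding
    return np.sqrt(max(dcov2, 0.0) / denom)


rng = np.random.RandomState(2357)
s1 = pd.Series(rng.randn(1000))
s2 = s1 ** 2

pearson = s1.corr(s2, method="pearson")
dcor = s1.corr(s2, method=distance_correlation)
print("pearson:", pearson)
print("distance:", dcor)
```

Note that the two n-by-n distance matrices make this quadratic in memory, so it is only practical for series of moderate length.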

@dsaxton
Member Author

dsaxton commented Sep 18, 2018

@TomAugspurger Sounds good, #22684 looks like a nice change. Can you point me to any documentation on how to create cookbook recipes?

@TomAugspurger
Contributor

It's nothing fancy, just a bunch of code snippets in https://github.com/pandas-dev/pandas/blob/master/doc/source/cookbook.rst

(it would be nice to enforce some consistency, and pick a better presentation format; but that's another issue).

@dsaxton
Member Author

dsaxton commented Sep 18, 2018

I could add this within the "Computation" section if that makes sense.

@nickcorona

Why closed?

5 participants