Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: correlation function accepts method being a callable #22684

Merged
merged 1 commit into from Sep 26, 2018

Conversation

Projects
None yet
5 participants
@shadiakiki1986
Copy link
Contributor

commented Sep 13, 2018

  • other than the listed strings for the method argument, accept a callable for generic correlation calculations
  • 
    
  • closes #xxxx (no issue opened for PR)
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry (not sure if I should do this or the person merging this PR)
@pep8speaks

This comment has been minimized.

Copy link

commented Sep 13, 2018

Hello @shadiakiki1986! Thanks for updating the PR.

Comment last updated on September 26, 2018 at 02:15 Hours UTC
@TomAugspurger
Copy link
Contributor

left a comment

I like the idea. Does this close an issue?

We'll need a release note in 0.24.0.txt

@@ -14,7 +14,7 @@ lxml
matplotlib
nbsphinx
numexpr
openpyxl=2.5.5
openpyxl==2.5.5

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Sep 13, 2018

Contributor

This file is autogenerated. #22689

This comment has been minimized.

Copy link
@shadiakiki1986

shadiakiki1986 Sep 13, 2018

Author Contributor

ok should I roll this edit back?

This comment has been minimized.

Copy link
@gfyoung

gfyoung Sep 14, 2018

Member

Yes, you should.

This comment has been minimized.

Copy link
@shadiakiki1986

shadiakiki1986 Sep 17, 2018

Author Contributor

done

.. ipython:: python
# histogram intersection
histogram_intersection = lambda a, b: np.minimum( np.true_divide(a, a.sum()), np.true_divide(b, b.sum())).sum()

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Sep 13, 2018

Contributor

Make this pep8 compliant

This comment has been minimized.

Copy link
@shadiakiki1986

shadiakiki1986 Sep 13, 2018

Author Contributor

fixed in new commit

@@ -6652,10 +6652,12 @@ def corr(self, method='pearson', min_periods=1):
Parameters
----------
method : {'pearson', 'kendall', 'spearman'}
method : {'pearson', 'kendall', 'spearman', callable}

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Sep 13, 2018

Contributor

I think it'll be `}, or callable on the outside of the options

This comment has been minimized.

Copy link
@shadiakiki1986

shadiakiki1986 Sep 13, 2018

Author Contributor

moved it to outside of the options

* pearson : standard correlation coefficient
* kendall : Kendall Tau correlation coefficient
* spearman : Spearman rank correlation
* callable: callable with input two numpy 1d-array

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Sep 13, 2018

Contributor

"callable expecting two 1d ndarrays and returning a float" (does the callable get ndarrays or series?)

This comment has been minimized.

Copy link
@shadiakiki1986

shadiakiki1986 Sep 13, 2018

Author Contributor

ndarrays

@@ -789,6 +789,41 @@ def test_corr_invalid_method(self):
with tm.assert_raises_regex(ValueError, msg):
s1.corr(s2, method="____")

def test_corr_callable_method(self):

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Sep 13, 2018

Contributor

Hmm I care less about testing this exact way of computing the correlation, and more about ensure that the method is dispatched to.

Would it be possible to define a very simple "correlation" function that just returns something like the index of the columns? So the correlation of the nth or and mth column would be like (n + m). Not sure if that's possible.

This comment has been minimized.

Copy link
@shadiakiki1986

shadiakiki1986 Sep 13, 2018

Author Contributor

ok will re-write the test tomorrow

This comment has been minimized.

Copy link
@shadiakiki1986

shadiakiki1986 Sep 17, 2018

Author Contributor

The test now includes a simpler correlation function. It is not possible to identify the nth/mth column as in your example because the correlation function itself does not know about the dataframe as a whole but only as each series on its own. The correlation function I chose is a simple 1 if exact equality else 0

@shadiakiki1986

This comment has been minimized.

Copy link
Contributor Author

commented Sep 13, 2018

This doesn't fix any issue, but I added a note in the 0.24.0 release notes

@@ -17,6 +17,10 @@ New features

- ``ExcelWriter`` now accepts ``mode`` as a keyword argument, enabling append to existing workbooks when using the ``openpyxl`` engine (:issue:`3441`)


- :meth:`DataFrame.corr` and :meth:`Series.corr` now accept a callable for generic calculation methods of correlation, e.g. histogram intersection

This comment has been minimized.

Copy link
@gfyoung

gfyoung Sep 14, 2018

Member

Use your PR as the issue number. Also, no new line above this sentence.

This comment has been minimized.

Copy link
@shadiakiki1986

shadiakiki1986 Sep 17, 2018

Author Contributor

done

@shadiakiki1986 shadiakiki1986 force-pushed the shadiakiki1986:corr-method-callable branch 2 times, most recently from 10dcb68 to 2e22403 Sep 17, 2018

@TomAugspurger
Copy link
Contributor

left a comment

cc @jreback if you have a chance.

This provides a nice, generic alternative to #21925.

@jreback
Copy link
Contributor

left a comment

small comments.

* pearson : standard correlation coefficient
* kendall : Kendall Tau correlation coefficient
* spearman : Spearman rank correlation
* callable: callable with input two 1d ndarrays

This comment has been minimized.

Copy link
@jreback

jreback Sep 18, 2018

Contributor

can you add an example in Examples

This comment has been minimized.

Copy link
@shadiakiki1986

shadiakiki1986 Sep 20, 2018

Author Contributor

done

@@ -764,6 +764,9 @@ def nancorr(a, b, method='pearson', min_periods=None):


def get_corr_func(method):
if callable(method):
return method

if method in ['kendall', 'spearman']:

This comment has been minimized.

Copy link
@jreback

jreback Sep 18, 2018

Contributor

elif

This comment has been minimized.

Copy link
@shadiakiki1986

shadiakiki1986 Sep 20, 2018

Author Contributor

done

* pearson : standard correlation coefficient
* kendall : Kendall Tau correlation coefficient
* spearman : Spearman rank correlation
* callable: callable with input two 1d ndarray

This comment has been minimized.

Copy link
@jreback

jreback Sep 18, 2018

Contributor

I am not sure how to doc-string this signature here

This comment has been minimized.

Copy link
@shadiakiki1986

shadiakiki1986 Sep 20, 2018

Author Contributor

Should I just leave it as is?

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Sep 20, 2018

Contributor

Probably fine as is. The type would be Callable[[ndarray, ndarray], float], but I'm not sure how familiar people are with typing yet.

@jreback jreback added this to the 0.24.0 milestone Sep 18, 2018

@jreback

This comment has been minimized.

Copy link
Contributor

commented Sep 18, 2018

@TomAugspurger yep agreed

@codecov

This comment has been minimized.

Copy link

commented Sep 20, 2018

Codecov Report

Merging #22684 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22684      +/-   ##
==========================================
+ Coverage   92.18%   92.19%   +<.01%     
==========================================
  Files         169      169              
  Lines       50819    50821       +2     
==========================================
+ Hits        46850    46852       +2     
  Misses       3969     3969
Flag Coverage Δ
#multiple 90.6% <100%> (ø) ⬆️
#single 42.37% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/nanops.py 95.14% <100%> (+0.01%) ⬆️
pandas/core/series.py 93.74% <100%> (ø) ⬆️
pandas/core/frame.py 97.2% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a936399...dbfd95f. Read the comment docs.

@shadiakiki1986 shadiakiki1986 force-pushed the shadiakiki1986:corr-method-callable branch from c75bc10 to 3ca092a Sep 20, 2018

>>> df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],
... columns=['dogs', 'cats'])
>>> df.corr(method = histogram_intersection)

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Sep 20, 2018

Contributor

This is failing our doctests. Is there an issue with the output?

Also, pep8: no spaces around the =

This comment has been minimized.

Copy link
@shadiakiki1986

shadiakiki1986 Sep 22, 2018

Author Contributor

fixed

>>> s1 = pd.Series([1, 0, 2, 1])
>>> s2 = pd.Series([2, 3, 0, 1])
>>> s1.corr(s2, method = histogram_intersection)
0.416667

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Sep 20, 2018

Contributor

Will need to round this, or write it out at full precision.

This comment has been minimized.

Copy link
@shadiakiki1986

shadiakiki1986 Sep 22, 2018

Author Contributor

fixed

* pearson : standard correlation coefficient
* kendall : Kendall Tau correlation coefficient
* spearman : Spearman rank correlation
* callable: callable with input two 1d ndarray

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Sep 20, 2018

Contributor

Probably fine as is. The type would be Callable[[ndarray, ndarray], float], but I'm not sure how familiar people are with typing yet.

@shadiakiki1986 shadiakiki1986 force-pushed the shadiakiki1986:corr-method-callable branch 2 times, most recently from df06de9 to dc87331 Sep 22, 2018

# simple correlation example
# returns 1 if exact equality, 0 otherwise
my_corr = lambda a, b: 1. if (a == b).all() else 0.

This comment has been minimized.

Copy link
@jreback

jreback Sep 23, 2018

Contributor

can you use result= and expected= here, rather than expected_1 and such. its much easier to follow

This comment has been minimized.

Copy link
@shadiakiki1986

shadiakiki1986 Sep 26, 2018

Author Contributor

done

@shadiakiki1986 shadiakiki1986 force-pushed the shadiakiki1986:corr-method-callable branch 2 times, most recently from 60b3498 to 4a69a70 Sep 26, 2018

ENH: correlation function accepts method being a callable
- other than the listed strings for the `method` argument, accept a callable for generic correlation calculations

@shadiakiki1986 shadiakiki1986 force-pushed the shadiakiki1986:corr-method-callable branch from 4a69a70 to dbfd95f Sep 26, 2018

@jreback jreback merged commit a393675 into pandas-dev:master Sep 26, 2018

6 checks passed

ci/circleci: py27_compat Your tests passed on CircleCI!
Details
ci/circleci: py35_ascii Your tests passed on CircleCI!
Details
ci/circleci: py36_locale Your tests passed on CircleCI!
Details
ci/circleci: py36_locale_slow Your tests passed on CircleCI!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
pandas-dev.pandas Build #20180926.6 succeeded
Details
@jreback

This comment has been minimized.

Copy link
Contributor

commented Sep 26, 2018

thanks @shadiakiki1986 nice change!

Sup3rGeo added a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.