
PERF: faster corrwith method for pearson and spearman correlation when other is a Series and axis = 0 (column-wise) #46174

Merged
merged 14 commits into pandas-dev:main
Mar 3, 2022

Conversation

fractionalhare
Contributor

@fractionalhare fractionalhare commented Feb 28, 2022

Description of Changes

This PR modifies the corrwith() method defined in pandas/core/frame.py to be faster when all of the following (common) conditions are met.

  • the axis argument is set to 0
  • the other argument is a Pandas Series object
  • the method argument is equal to "pearson" OR "spearman"

When all three conditions hold, the apply() call on self is replaced with column-wise iteration over NumPy arrays, and the correlation against other is computed with NumPy's np.corrcoef(). If any condition is not met, the method falls back to the existing code path with no performance change. Spearman rank correlation is obtained from np.corrcoef() by taking the Pearson correlation coefficient of the element ranks, computed with argsort().
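The rank trick can be sketched in isolation (illustrative only, not the code from this PR; the function name is made up):

```python
import numpy as np

def spearman_via_corrcoef(x: np.ndarray, y: np.ndarray) -> float:
    # argsort().argsort() maps each value to its 0-based rank;
    # this matches conventional average ranks only when there are no
    # ties, which holds for the continuous random data benchmarked here
    rx = x.argsort().argsort()
    ry = y.argsort().argsort()
    # Spearman = Pearson correlation of the ranks
    return np.corrcoef(rx, ry)[0, 1]

x = np.array([10.0, 20.0, 30.0, 40.0])
y = np.array([1.0, 4.0, 9.0, 16.0])  # monotone increasing in x
print(spearman_via_corrcoef(x, y))   # any monotone relationship gives 1.0
```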

When method = "pearson", this change reduces execution time by 40 - 50% on small and large DataFrame objects.

When method = "spearman", this change reduces execution time by 70 - 80% on small and large DataFrame objects.

This change remains robust to null values on either side, even though np.corrcoef() itself is not: while iterating the vectors, null positions are masked out of both objects with a combined boolean & mask, so it returns accurate results.
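A minimal sketch of that masking (illustrative; assumes float arrays where missing values appear as NaN):

```python
import numpy as np

r = np.array([1.4, 3.5, np.nan, 5.6])   # one column of the frame
k = np.array([1.4, 6.5, 7.4, np.nan])   # the other Series
# keep only positions that are non-null on BOTH sides
mask = ~np.isnan(r) & ~np.isnan(k)
corr = np.corrcoef(r[mask], k[mask])[0, 1]
# np.corrcoef on the unmasked arrays would return nan instead
```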

Testing Environment & Performance Assessment

My testing environment is as follows:

  • Hardware: MacBook Pro (16-inch, 2021) with Apple M1 Max Chip and 64GB RAM
  • OS: macOS Monterey v12.2
  • Python: 3.8.12
  • Pandas: v1.4.1 (Control, Unmodified) and v1.5.0.dev (Self-Built Test)
  • NumPy: v1.22.2
  • Installer/Package Manager: Anaconda3 (miniforge3 for Apple Silicon/arm64)

My test script was the following:

import timeit
import numpy as np
import pandas as pd

pearson_results = np.array([])
spearman_results = np.array([])
num_tests = 10
num_runs = 10

for k in range(num_tests):
    print(f'Running test {k + 1}...')
    data = np.random.uniform(-1, 1, size=(100, 50000))
    df = pd.DataFrame(data)
    df.columns = [str(c) for c in df.columns]
    pearson_time = timeit.timeit("df.corrwith(df['0'], axis=0, method='pearson')", "from __main__ import df", number=num_runs, timer=timeit.default_timer)
    spearman_time = timeit.timeit("df.corrwith(df['0'], axis=0, method='spearman')", "from __main__ import df", number=num_runs, timer=timeit.default_timer)
    pearson_results = np.append(pearson_results, pearson_time)
    spearman_results = np.append(spearman_results, spearman_time)

print(f'Average Pearson Correlation Time Across {num_tests} DataFrames with {num_runs} runs each: {pearson_results.mean() / num_runs}')
print(f'Average Spearman Correlation Time Across {num_tests} DataFrames with {num_runs} runs each: {spearman_results.mean() / num_runs}')

First I created a fresh conda environment with the current stable version of Pandas (v1.4.1). For a DataFrame with 100 rows and 50,000 columns, the unmodified corrwith() with method="pearson" took approximately 3.0697 seconds on average across 10 runs each on 10 different randomly generated DataFrames. With method="spearman", the same evaluation took approximately 8.5808 seconds on average.

Then I applied my changes to my fork of Pandas main, built it in a new conda environment, and ran the same script. For an equivalently sized DataFrame with random numerical data, the modified corrwith() with method="pearson" took approximately 1.5342 seconds. With method="spearman", it took approximately 2.0556 seconds.

@jreback
Contributor

jreback commented Feb 28, 2022

do we have asv's for this?

@jreback jreback added Performance Memory or execution speed performance Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Feb 28, 2022
@fractionalhare
Contributor Author

I haven't used asv before, but I can generate some test results from doing so.

@fractionalhare
Contributor Author

fractionalhare commented Feb 28, 2022

@jreback I'm having difficulty getting asv to work after following the documentation here: https://pandas.pydata.org/docs/development/contributing_codebase.html?highlight=asv#running-the-performance-test-suite

I did the following to install and prep:

git remote add upstream https://github.com/pandas-dev/pandas
conda install asv
cd pandas-fractionalhare/asv_bench

Attempting to run from any upstream/{branch} branch results in an Unknown commit upstream/{branch} error. If I run it locally instead, on my branch corrwith-perf-improv like so:

asv continuous -f 1.1 -E conda corrwith-perf-improv

then it fails with the following error:

(base) Dylans-MacBook-Pro:asv_bench dhoulihan$ asv continuous -f 1.1 -E conda corrwith-perf-improv
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.8-Cython0.29.24-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
·· Installing c4ea97f3 <corrwith-perf-improv> into conda-py3.8-Cython0.29.24-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Error running /Users/dhoulihan/PycharmProjects/pandas-fractionalhare/asv_bench/env/45e7123d25d88e958f183355b52ee854/bin/python /Users/dhoulihan/miniforge3/lib/python3.9/site-packages/asv/benchmark.py discover /Users/dhoulihan/PycharmProjects/pandas-fractionalhare/asv_bench/benchmarks /var/folders/_j/2v9pmrp15qgcxw0ml45sj9zs_4bd6t/T/tmpnrx32tqr/result.json (exit status -11)
   STDOUT -------->
   
   STDERR -------->

What am I doing wrong? I was able to build my fork of Pandas locally using setup.py. I also tried with pip install git+https://github.com/airspeed-velocity/asv instead and there was no change.

@fractionalhare
Contributor Author

Separately, I do see one test consistently failing in the automated suite: test_corrwith_mixed_dtypes. I will resolve this and push.

@fractionalhare
Contributor Author

@jreback looks like the full asv benchmark ran via GitHub Actions, so we have results there. also looks like all the other GHA tests passed (except for an Azure build which timed out)

@jreback
Contributor

jreback commented Feb 28, 2022

@jreback looks like the full asv benchmark ran via GitHub Actions, so we have results there. also looks like all the other GHA tests passed (except for an Azure build which timed out)

@fractionalhare a specific asv for this case (I think we do have it but pls check). then show the asv results

@fractionalhare
Contributor Author

Yep, I see it here: https://github.com/pandas-dev/pandas/blob/main/asv_bench/benchmarks/stat_ops.py#L124. I'll run and add results here.

@fractionalhare
Contributor Author

fractionalhare commented Feb 28, 2022

@jreback I can't get asv to work on my machine; it has odd build errors about hdf5 and llvmlite regardless of whether I run with conda or virtualenv. This seems to have to do with the platform dependency of tables vs pytables; I'm not sure if it's related to me trying to run it on an M1. Even if I try to run it with my existing environment I get the following error:

asv continuous -f 1.1 -E existing corrwith-perf-improv -b stat_ops.Correlation
No range spec may be specified if benchmarking in an existing environment

However, you can see the results of just the relevant benchmark from the automated test suite at this line:
https://github.com/pandas-dev/pandas/runs/5354404445?check_suite_focus=true#step:6:8761. The results report as ok.

@fractionalhare
Contributor Author

Adding to discussion here: is it worth adding more tests of corrwith correctness with expected outputs for sample inputs, up to precision with something like np.isclose()?

@jreback
Contributor

jreback commented Mar 1, 2022

Adding to discussion here: is it worth adding more tests of corrwith correctness with expected outputs for sample inputs, up to precision with something like np.isclose()?

yes

@jreback
Contributor

jreback commented Mar 1, 2022

@fractionalhare show the asv result above (or at the very least show a timeit with before/after of the sample)

@fractionalhare
Contributor Author

Adding to discussion here: is it worth adding more tests of corrwith correctness with expected outputs for sample inputs, up to precision with something like np.isclose()?

yes

Okay, I'll add to this PR.

@fractionalhare
Contributor Author

fractionalhare commented Mar 1, 2022

@fractionalhare show the asv result above (or at the very least show a timeit with before/after of the sample)

Here are the corrwith ASV results from the latest GHA run:

[ 39.31%] ··· stat_ops.Correlation.time_corrwith_cols                         ok
[ 39.31%] ··· ========== ==========
                method             
              ---------- ----------
               spearman   314±0ms  
               kendall    326±0ms  
               pearson    10.7±0ms 
              ========== ==========

[ 39.36%] ··· stat_ops.Correlation.time_corrwith_rows                         ok
[ 39.36%] ··· ========== ==========
                method             
              ---------- ----------
               spearman   656±0ms  
               kendall    637±0ms  
               pearson    10.7±0ms 
              ========== ==========

There are only two that apply to this function.

And here are my timeit comparisons. I created a 100 x 50,000 DataFrame to test, then pickled it and read it back in to run the timeit test with stable Pandas and my build with this change. Tests were executed like so:

import timeit
import numpy as np
import pandas as pd
pearson_results = np.array([])
spearman_results = np.array([])
num_runs = 1
# data = np.random.uniform(-1, 1, size=(100, 50000))
# df = pd.DataFrame(data)
# df.to_pickle('~/test_dataframe')
df = pd.read_pickle('~/test_dataframe')
print(f'Test DataFrame has shape {df.shape}')
df.columns = [str(c) for c in df.columns]
pearson_time = timeit.timeit("df.corrwith(df['0'], axis=0, method='pearson')", "from __main__ import df", number=num_runs, timer=timeit.default_timer)
spearman_time = timeit.timeit("df.corrwith(df['0'], axis=0, method='spearman')", "from __main__ import df", number=num_runs, timer=timeit.default_timer)

Results before:

Test DataFrame has shape (100, 50000)
Average Pearson Correlation Time with 1 runs: 2.997473542
Average Spearman Correlation Time with 1 runs: 8.714665584

Results after:

Test DataFrame has shape (100, 50000)
Average Pearson Correlation Time with 1 runs: 1.5199412919999986
Average Spearman Correlation Time with 1 runs: 2.0341787089999883

@fractionalhare
Contributor Author

fractionalhare commented Mar 1, 2022

If I instead do it on a small, 100 x 100 DataFrame, I get similar results in terms of percent change.

Results before:

Test DataFrame has shape (100, 100)
Average Pearson Correlation Time with 1 runs: 0.008117292000001441
Average Spearman Correlation Time with 1 runs: 0.01831337500000174

Results after:

Test DataFrame has shape (100, 100)
Average Pearson Correlation Time with 1 runs: 0.004528458000000235
Average Spearman Correlation Time with 1 runs: 0.00520883300000019

In both cases (small and large), it's a reduction of 40% - 50% execution time for Pearson and 70% - 80% for Spearman.
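Those percentages follow directly from the large-DataFrame timings quoted above, recomputed here as a quick sanity check:

```python
# timings (seconds) taken from the timeit runs reported in this thread
before = {"pearson": 2.997473542, "spearman": 8.714665584}  # stable v1.4.1
after = {"pearson": 1.519941292, "spearman": 2.034178709}   # this branch

for method in before:
    reduction = 1 - after[method] / before[method]
    print(f"{method}: {reduction:.1%} less execution time")
# pearson comes out near 49%, spearman near 77%
```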

@fractionalhare fractionalhare changed the title PERF: faster corrwith method for pearson and spearman correlation when other is a Series and axis = 0 PERF: faster corrwith method for pearson and spearman correlation when other is a Series and axis = 0 (column-wise) Mar 2, 2022
@fractionalhare
Contributor Author

Unclear why the ASV benchmark failed: a test in unrelated/unmodified code triggered an exit code 1 in the benchmarks after I merged main into this branch, but the ASVs for corrwith remain clean: https://github.com/pandas-dev/pandas/runs/5392988518?check_suite_focus=true#step:6:9179

[ 39.33%] ··· stat_ops.Correlation.time_corrwith_cols                         ok
[ 39.33%] ··· ========== ==========
                method             
              ---------- ----------
               spearman   204±0ms  
               kendall    202±0ms  
               pearson    6.61±0ms 
              ========== ==========

[ 39.38%] ··· stat_ops.Correlation.time_corrwith_rows                         ok
[ 39.38%] ··· ========== ==========
                method             
              ---------- ----------
               spearman   384±0ms  
               kendall    366±0ms  
               pearson    6.84±0ms 
              ========== ==========

Tests were all green on this PR prior to me pulling main just now, so maybe this is ephemeral?

@fractionalhare
Contributor Author

fractionalhare commented Mar 2, 2022

@jreback Apologies since this is my first time using asv, so I might be wrong here, but it looks like there is a performance regression in master now, if I'm interpreting the last two automated asv runs correctly.

My last commit with an actual code change to the corrwith function was b843595, and that one was all green for test results. My last two commits to this branch have been merge commits to catch up with main, and those are failing in an unrelated asv benchmark (0d870f9 and 280e618).

Am I interpreting this correctly?

Other than that, I have perf results for asv in this thread along with more explicit before/after timeit results. Do you want me to make any changes to the docstring or inline comments mentioning this PR number? Correctness coverage looks okay; it's already comparing to np.corrcoef() with np.isclose().
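For reference, a correctness test of the kind described could look roughly like this (a sketch with made-up sample data, not the actual test in the suite):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 1.0, 4.0, 3.0]})
s = pd.Series([1.5, 2.5, 3.0, 5.0])

result = df.corrwith(s, axis=0, method="pearson")
# independent reference computation straight from np.corrcoef
expected = pd.Series(
    {col: np.corrcoef(df[col].to_numpy(), s.to_numpy())[0, 1] for col in df.columns}
)
assert np.isclose(result, expected).all()
```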

Contributor

@jreback jreback left a comment


also pls add a whatsnew note in 1.5 in the perf improvements section

pandas/core/frame.py (review comment, resolved)
@fractionalhare
Contributor Author

@jreback In addition to the changes you asked for and the perf tests I've shown with timeit, here is a demonstration that the method is 1) still robust to nulls and 2) functionally equivalent to the existing logic. Given an unmodified build of Pandas:

import numpy as np
import pandas as pd
from typing import Union

# test modified method against df.corrwith in unmodified Pandas build by making standalone function
def corr_with(df: pd.DataFrame, other: Union[pd.Series, pd.DataFrame], method: str, axis: int) -> pd.Series:
    if axis == 0 and method in ['pearson', 'spearman'] and isinstance(other, pd.Series):
        corrs = {}
        numeric_cols = df.select_dtypes(include=np.number).columns
        ndf = df[numeric_cols].values.transpose()
        k = other.values
        if method == "pearson":
            for i, r in enumerate(ndf):
                nonnull_mask = ~np.isnan(r) & ~np.isnan(k)
                corrs[numeric_cols[i]] = np.corrcoef(
                    r[nonnull_mask], k[nonnull_mask]
                )[0, 1]
        else:
            for i, r in enumerate(ndf):
                nonnull_mask = ~np.isnan(r) & ~np.isnan(k)
                corrs[numeric_cols[i]] = np.corrcoef(
                    r[nonnull_mask].argsort().argsort(),
                    k[nonnull_mask].argsort().argsort(),
                )[0, 1]
        return pd.Series(corrs)
    else:
        return df.corrwith(other, method=method, axis=axis)

# create a sample dataframe and series, ideally with nans
df = pd.DataFrame({'a': [1.4, 3.5, np.nan, 5.6, np.nan, 7.8], 'b': [5.7, 3.6, 2.3, 1.5, 8.7, 3.2]})

s = pd.Series([1.4, 6.5, 7.4, 2.6, np.nan, 3.4])

a = df.corrwith(s, method='pearson', axis=0)
b = corr_with(df, s, 'pearson', 0)
np.isclose(a, b)

which will output

Out[14]: array([ True,  True])

Likewise if we do with Spearman instead of Pearson:

a = df.corrwith(s, method='spearman', axis=0)
b = corr_with(df, s, 'spearman', 0)
np.isclose(a, b)

we have

Out[17]: array([ True,  True])

@jreback jreback added this to the 1.5 milestone Mar 3, 2022
@jreback jreback merged commit 5efb570 into pandas-dev:main Mar 3, 2022
@jreback
Contributor

jreback commented Mar 3, 2022

thanks @fractionalhare

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
@MarcoGorelli MarcoGorelli mentioned this pull request Oct 7, 2022
MarcoGorelli pushed a commit to MarcoGorelli/pandas that referenced this pull request Oct 17, 2022
…tion when other is a Series and axis = 0 (column-wise) (pandas-dev#46174)"

This reverts commit 5efb570.
mroeschke pushed a commit that referenced this pull request Oct 17, 2022
…tion when other is a Series and axis = 0 (column-wise) (#46174)" (#49140)

* Update frame.py

* Revert "PERF: faster corrwith method for pearson and spearman correlation when other is a Series and axis = 0 (column-wise) (#46174)"

This reverts commit 5efb570.

* fix GH issue number in test

* add test from original issue

* skip if no scipy

* add extra test case

Co-authored-by: Yuanhao Geng <41546976+GYHHAHA@users.noreply.github.com>
Co-authored-by: MarcoGorelli <>
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Oct 17, 2022
…r pearson and spearman correlation when other is a Series and axis = 0 (column-wise) (pandas-dev#46174)"
MarcoGorelli added a commit that referenced this pull request Oct 17, 2022
…hod for pearson and spearman correlation when other is a Series and axis = 0 (column-wise) (#46174)") (#49151)

Backport PR #49140: Revert "PERF: faster corrwith method for pearson and spearman correlation when other is a Series and axis = 0 (column-wise) (#46174)"

Co-authored-by: Marco Edward Gorelli <33491632+MarcoGorelli@users.noreply.github.com>
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
…tion when other is a Series and axis = 0 (column-wise) (pandas-dev#46174)" (pandas-dev#49140)
