
PERF: faster corrwith method for pearson and spearman correlation when other is a Series and axis = 0 (column-wise) #46174

Merged
merged 14 commits into pandas-dev:main
Mar 3, 2022

Conversation

fractionalhare
Contributor

@fractionalhare fractionalhare commented Feb 28, 2022

Description of Changes

This PR modifies the corrwith() method defined in pandas/core/frame.py to be faster when all of the following (common) conditions are met.

  • the axis argument is set to 0
  • the other argument is a Pandas Series object
  • the method argument is equal to "pearson" OR "spearman"

When all three conditions hold, the apply() call on self is replaced with column-wise iteration over NumPy arrays, and the correlation against other is computed with NumPy's np.corrcoef(). If any condition is not met, the method falls back to the existing code path with no performance change. Spearman rank correlation is obtained from np.corrcoef() by taking the Pearson correlation coefficient of the element ranks, computed with argsort().
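The rank trick can be sketched in isolation (illustrative only, not the code from this PR; the function name is made up):

```python
import numpy as np

def spearman_via_corrcoef(x: np.ndarray, y: np.ndarray) -> float:
    # argsort().argsort() maps each value to its 0-based rank;
    # this matches conventional average ranks only when there are no
    # ties, which holds for the continuous random data benchmarked here
    rx = x.argsort().argsort()
    ry = y.argsort().argsort()
    # Spearman = Pearson correlation of the ranks
    return np.corrcoef(rx, ry)[0, 1]

x = np.array([10.0, 20.0, 30.0, 40.0])
y = np.array([1.0, 4.0, 9.0, 16.0])  # monotone increasing in x
print(spearman_via_corrcoef(x, y))   # any monotone relationship gives 1.0
```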

When method = "pearson", this change reduces execution time by 40 - 50% on small and large DataFrame objects.

When method = "spearman", this change reduces execution time by 70 - 80% on small and large DataFrame objects.

This change remains robust to null values on either side, even though np.corrcoef() itself is not: while iterating the vectors, null positions are masked out of both objects with a combined boolean & mask, so it returns accurate results.
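A minimal sketch of that masking (illustrative; assumes float arrays where missing values appear as NaN):

```python
import numpy as np

r = np.array([1.4, 3.5, np.nan, 5.6])   # one column of the frame
k = np.array([1.4, 6.5, 7.4, np.nan])   # the other Series
# keep only positions that are non-null on BOTH sides
mask = ~np.isnan(r) & ~np.isnan(k)
corr = np.corrcoef(r[mask], k[mask])[0, 1]
# np.corrcoef on the unmasked arrays would return nan instead
```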

Testing Environment & Performance Assessment

My testing environment is as follows:

  • Hardware: MacBook Pro (16-inch, 2021) with Apple M1 Max Chip and 64GB RAM
  • OS: macOS Monterey v12.2
  • Python: 3.8.12
  • Pandas: v1.4.1 (Control, Unmodified) and v1.5.0.dev (Self-Built Test)
  • NumPy: v1.22.2
  • Installer/Package Manager: Anaconda3 (miniforge3 for Apple Silicon/arm64)

My test script was the following:

import timeit
import numpy as np
import pandas as pd

pearson_results = np.array([])
spearman_results = np.array([])
num_tests = 10
num_runs = 10

for k in range(num_tests):
    print(f'Running test {k + 1}...')
    data = np.random.uniform(-1, 1, size=(100, 50000))
    df = pd.DataFrame(data)
    df.columns = [str(c) for c in df.columns]
    pearson_time = timeit.timeit("df.corrwith(df['0'], axis=0, method='pearson')", "from __main__ import df", number=num_runs, timer=timeit.default_timer)
    spearman_time = timeit.timeit("df.corrwith(df['0'], axis=0, method='spearman')", "from __main__ import df", number=num_runs, timer=timeit.default_timer)
    pearson_results = np.append(pearson_results, pearson_time)
    spearman_results = np.append(spearman_results, spearman_time)

print(f'Average Pearson Correlation Time Across {num_tests} DataFrames with {num_runs} runs each: {pearson_results.mean() / num_runs}')
print(f'Average Spearman Correlation Time Across {num_tests} DataFrames with {num_runs} runs each: {spearman_results.mean() / num_runs}')

First I created a fresh conda environment with the current stable version of Pandas (v1.4.1). For a DataFrame with 100 rows and 50,000 columns, the unmodified corrwith() with method="pearson" took approximately 3.0697 seconds on average across 10 runs each on 10 different randomly generated DataFrames. With method="spearman", the same evaluation took approximately 8.5808 seconds on average.

Then I applied my changes to my fork of Pandas main, built it in a new conda environment, and ran the same script. For an equivalently sized DataFrame with random numerical data, the modified corrwith() with method="pearson" took approximately 1.5342 seconds. With method="spearman", it took approximately 2.0556 seconds.

@jreback
Contributor

jreback commented Feb 28, 2022

do we have asv's for this?

@jreback jreback added Performance Memory or execution speed performance Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Feb 28, 2022
@fractionalhare
Contributor Author

I haven't used asv before, but I can generate some test results from doing so.

@fractionalhare
Contributor Author

fractionalhare commented Feb 28, 2022

@jreback I'm having difficulty getting asv to work after following the documentation here: https://pandas.pydata.org/docs/development/contributing_codebase.html?highlight=asv#running-the-performance-test-suite

I did the following to install and prep:

git remote add upstream https://github.com/pandas-dev/pandas
conda install asv
cd pandas-fractionalhare/asv_bench

Attempting to run from any upstream/{branch} branch results in an Unknown commit upstream/{branch} error. If I run it locally instead, on my branch corrwith-perf-improv like so:

asv continuous -f 1.1 -E conda corrwith-perf-improv

then it fails with the following error:

(base) Dylans-MacBook-Pro:asv_bench dhoulihan$ asv continuous -f 1.1 -E conda corrwith-perf-improv
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.8-Cython0.29.24-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
·· Installing c4ea97f3 <corrwith-perf-improv> into conda-py3.8-Cython0.29.24-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pyarrow-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Error running /Users/dhoulihan/PycharmProjects/pandas-fractionalhare/asv_bench/env/45e7123d25d88e958f183355b52ee854/bin/python /Users/dhoulihan/miniforge3/lib/python3.9/site-packages/asv/benchmark.py discover /Users/dhoulihan/PycharmProjects/pandas-fractionalhare/asv_bench/benchmarks /var/folders/_j/2v9pmrp15qgcxw0ml45sj9zs_4bd6t/T/tmpnrx32tqr/result.json (exit status -11)
   STDOUT -------->
   
   STDERR -------->

What am I doing wrong? I was able to build my fork of Pandas locally using setup.py. I also tried with pip install git+https://github.com/airspeed-velocity/asv instead and there was no change.

@fractionalhare
Contributor Author

Separately, I do see one test consistently failing in the automated suite: test_corrwith_mixed_dtypes. I will resolve this and push.

@fractionalhare
Contributor Author

@jreback looks like the full asv benchmark ran via GitHub Actions, so we have results there. also looks like all the other GHA tests passed (except for an Azure build which timed out)

@jreback
Contributor

jreback commented Feb 28, 2022

@jreback looks like the full asv benchmark ran via GitHub Actions, so we have results there. also looks like all the other GHA tests passed (except for an Azure build which timed out)

@fractionalhare a specific asv for this case (I think we do have it but pls check). then show the asv results

@fractionalhare
Contributor Author

Yep, I see it here: https://github.com/pandas-dev/pandas/blob/main/asv_bench/benchmarks/stat_ops.py#L124. I'll run and add results here.

@fractionalhare
Contributor Author

fractionalhare commented Feb 28, 2022

@jreback I can't get asv to work on my machine; it has odd build errors about hdf5 and llvmlite regardless of whether I run with conda or virtualenv. This seems to have to do with the platform dependency of tables vs pytables; I'm not sure if it's related to me trying to run it on an M1. Even if I try to run it with my existing environment I get the following error:

asv continuous -f 1.1 -E existing corrwith-perf-improv -b stat_ops.Correlation
No range spec may be specified if benchmarking in an existing environment

However, you can see the results of just the relevant benchmark from the automated test suite at this line:
https://github.com/pandas-dev/pandas/runs/5354404445?check_suite_focus=true#step:6:8761. The results report as ok.

@fractionalhare
Contributor Author

Adding to discussion here: is it worth adding more tests of corrwith correctness with expected outputs for sample inputs, up to precision with something like np.isclose()?

@jreback
Contributor

jreback commented Mar 1, 2022

Adding to discussion here: is it worth adding more tests of corrwith correctness with expected outputs for sample inputs, up to precision with something like np.isclose()?

yes

@jreback
Contributor

jreback commented Mar 1, 2022

@fractionalhare show the asv result above (or at the very least show a timeit with before/after of the sample)

@fractionalhare
Contributor Author

Adding to discussion here: is it worth adding more tests of corrwith correctness with expected outputs for sample inputs, up to precision with something like np.isclose()?

yes

Okay, I'll add to this PR.

@fractionalhare
Contributor Author

fractionalhare commented Mar 1, 2022

@fractionalhare show the asv result above (or at the very least show a timeit with before/after of the sample)

Here are the corrwith ASV results from the latest GHA run:

[ 39.31%] ··· stat_ops.Correlation.time_corrwith_cols                         ok
[ 39.31%] ··· ========== ==========
                method             
              ---------- ----------
               spearman   314±0ms  
               kendall    326±0ms  
               pearson    10.7±0ms 
              ========== ==========

[ 39.36%] ··· stat_ops.Correlation.time_corrwith_rows                         ok
[ 39.36%] ··· ========== ==========
                method             
              ---------- ----------
               spearman   656±0ms  
               kendall    637±0ms  
               pearson    10.7±0ms 
              ========== ==========

There are only two that apply to this function.

And here are my timeit comparisons. I created a 100 x 50,000 DataFrame to test, then pickled it and read it back in to run the timeit test with stable Pandas and my build with this change. Tests were executed like so:

import timeit
import numpy as np
import pandas as pd
pearson_results = np.array([])
spearman_results = np.array([])
num_runs = 1
# data = np.random.uniform(-1, 1, size=(100, 50000))
# df = pd.DataFrame(data)
# df.to_pickle('~/test_dataframe')
df = pd.read_pickle('~/test_dataframe')
print(f'Test DataFrame has shape {df.shape}')
df.columns = [str(c) for c in df.columns]
pearson_time = timeit.timeit("df.corrwith(df['0'], axis=0, method='pearson')", "from __main__ import df", number=num_runs, timer=timeit.default_timer)
spearman_time = timeit.timeit("df.corrwith(df['0'], axis=0, method='spearman')", "from __main__ import df", number=num_runs, timer=timeit.default_timer)

Results before:

Test DataFrame has shape (100, 50000)
Average Pearson Correlation Time with 1 runs: 2.997473542
Average Spearman Correlation Time with 1 runs: 8.714665584

Results after:

Test DataFrame has shape (100, 50000)
Average Pearson Correlation Time with 1 runs: 1.5199412919999986
Average Spearman Correlation Time with 1 runs: 2.0341787089999883

@fractionalhare
Contributor Author

fractionalhare commented Mar 1, 2022

If I instead do it on a small, 100 x 100 DataFrame, I get similar results in terms of percent change.

Results before:

Test DataFrame has shape (100, 100)
Average Pearson Correlation Time with 1 runs: 0.008117292000001441
Average Spearman Correlation Time with 1 runs: 0.01831337500000174

Results after:

Test DataFrame has shape (100, 100)
Average Pearson Correlation Time with 1 runs: 0.004528458000000235
Average Spearman Correlation Time with 1 runs: 0.00520883300000019

In both cases (small and large), it's a reduction of 40% - 50% execution time for Pearson and 70% - 80% for Spearman.
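Those percentages follow directly from the large-DataFrame timings quoted above, recomputed here as a quick sanity check:

```python
# timings (seconds) taken from the timeit runs reported in this thread
before = {"pearson": 2.997473542, "spearman": 8.714665584}  # stable v1.4.1
after = {"pearson": 1.519941292, "spearman": 2.034178709}   # this branch

for method in before:
    reduction = 1 - after[method] / before[method]
    print(f"{method}: {reduction:.1%} less execution time")
# pearson comes out near 49%, spearman near 77%
```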

@fractionalhare fractionalhare changed the title PERF: faster corrwith method for pearson and spearman correlation when other is a Series and axis = 0 PERF: faster corrwith method for pearson and spearman correlation when other is a Series and axis = 0 (column-wise) Mar 2, 2022
@fractionalhare
Contributor Author

Unclear why the ASV benchmark failed: a test in unrelated/unmodified code triggered an exit code 1 in the benchmarks after I merged main into this branch, but the ASVs for corrwith remain clean: https://github.com/pandas-dev/pandas/runs/5392988518?check_suite_focus=true#step:6:9179

[ 39.33%] ··· stat_ops.Correlation.time_corrwith_cols                         ok
[ 39.33%] ··· ========== ==========
                method             
              ---------- ----------
               spearman   204±0ms  
               kendall    202±0ms  
               pearson    6.61±0ms 
              ========== ==========

[ 39.38%] ··· stat_ops.Correlation.time_corrwith_rows                         ok
[ 39.38%] ··· ========== ==========
                method             
              ---------- ----------
               spearman   384±0ms  
               kendall    366±0ms  
               pearson    6.84±0ms 
              ========== ==========

Tests were all green on this PR prior to me pulling main just now, so maybe this is ephemeral?

@fractionalhare
Contributor Author

fractionalhare commented Mar 2, 2022

@jreback Apologies since this is my first time using asv, so I might be wrong here, but it looks like there is a performance regression in master now, if I'm interpreting the last two automated asv runs correctly.

My last commit with an actual code change to the corrwith function was b843595, and that one was all green for test results. My last two commits to this branch have been merge commits to catch up with main, and those are failing in an unrelated asv benchmark (0d870f9 and 280e618).

Am I interpreting this correctly?

Other than that, I have perf results for asv in this thread along with more explicit before/after timeit results. Do you want me to make any changes to the docstring or inline comments mentioning this PR number? Correctness coverage looks okay; it's already comparing to np.corrcoef() with np.isclose().
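For reference, a correctness test of the kind described could look roughly like this (a sketch with made-up sample data, not the actual test in the suite):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 1.0, 4.0, 3.0]})
s = pd.Series([1.5, 2.5, 3.0, 5.0])

result = df.corrwith(s, axis=0, method="pearson")
# independent reference computation straight from np.corrcoef
expected = pd.Series(
    {col: np.corrcoef(df[col].to_numpy(), s.to_numpy())[0, 1] for col in df.columns}
)
assert np.isclose(result, expected).all()
```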

Contributor

@jreback jreback left a comment


also pls add a whatsnew note in 1.5 in the perf improvements section

pandas/core/frame.py (review comment, resolved)
@fractionalhare
Contributor Author

@jreback In addition to the changes you asked for and the perf tests I've shown with timeit, here is a demonstration that the method is 1) still robust to nulls and 2) functionally equivalent to the existing logic. Given an unmodified build of Pandas:

import numpy as np
import pandas as pd
from typing import Union

# test modified method against df.corrwith in unmodified Pandas build by making standalone function
def corr_with(df: pd.DataFrame, other: Union[pd.Series, pd.DataFrame], method: str, axis: int) -> pd.Series:
    if axis == 0 and method in ['pearson', 'spearman'] and isinstance(other, pd.Series):
        corrs = {}
        numeric_cols = df.select_dtypes(include=np.number).columns
        ndf = df[numeric_cols].values.transpose()
        k = other.values
        if method == "pearson":
            for i, r in enumerate(ndf):
                nonnull_mask = ~np.isnan(r) & ~np.isnan(k)
                corrs[numeric_cols[i]] = np.corrcoef(
                    r[nonnull_mask], k[nonnull_mask]
                )[0, 1]
        else:
            for i, r in enumerate(ndf):
                nonnull_mask = ~np.isnan(r) & ~np.isnan(k)
                corrs[numeric_cols[i]] = np.corrcoef(
                    r[nonnull_mask].argsort().argsort(),
                    k[nonnull_mask].argsort().argsort(),
                )[0, 1]
        return pd.Series(corrs)
    else:
        return df.corrwith(other, method=method, axis=axis)

# create a sample dataframe and series, ideally with nans
df = pd.DataFrame({'a': [1.4, 3.5, np.nan, 5.6, np.nan, 7.8], 'b': [5.7, 3.6, 2.3, 1.5, 8.7, 3.2]})

s = pd.Series([1.4, 6.5, 7.4, 2.6, np.nan, 3.4])

a = df.corrwith(s, method='pearson', axis=0)
b = corr_with(df, s, 'pearson', 0)
np.isclose(a, b)

which will output

Out[14]: array([ True,  True])

Likewise if we do with Spearman instead of Pearson:

a = df.corrwith(s, method='spearman', axis=0)
b = corr_with(df, s, 'spearman', 0)
np.isclose(a, b)

we have

Out[17]: array([ True,  True])

@jreback jreback added this to the 1.5 milestone Mar 3, 2022
@jreback jreback merged commit 5efb570 into pandas-dev:main Mar 3, 2022
@jreback
Contributor

jreback commented Mar 3, 2022

thanks @fractionalhare

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
@MarcoGorelli MarcoGorelli mentioned this pull request Oct 7, 2022
MarcoGorelli pushed a commit to MarcoGorelli/pandas that referenced this pull request Oct 17, 2022
…tion when other is a Series and axis = 0 (column-wise) (pandas-dev#46174)"

This reverts commit 5efb570.
mroeschke pushed a commit that referenced this pull request Oct 17, 2022
…tion when other is a Series and axis = 0 (column-wise) (#46174)" (#49140)

* Update frame.py

* Revert "PERF: faster corrwith method for pearson and spearman correlation when other is a Series and axis = 0 (column-wise) (#46174)"

This reverts commit 5efb570.

* fix GH issue number in test

* add test from original issue

* skip if no scipy

* add extra test case

Co-authored-by: Yuanhao Geng <41546976+GYHHAHA@users.noreply.github.com>
Co-authored-by: MarcoGorelli <>
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Oct 17, 2022
…r pearson and spearman correlation when other is a Series and axis = 0 (column-wise) (pandas-dev#46174)"
MarcoGorelli added a commit that referenced this pull request Oct 17, 2022
…hod for pearson and spearman correlation when other is a Series and axis = 0 (column-wise) (#46174)") (#49151)

Backport PR #49140: Revert "PERF: faster corrwith method for pearson and spearman correlation when other is a Series and axis = 0 (column-wise) (#46174)"

Co-authored-by: Marco Edward Gorelli <33491632+MarcoGorelli@users.noreply.github.com>
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
…tion when other is a Series and axis = 0 (column-wise) (pandas-dev#46174)" (pandas-dev#49140)
