corr return without duplicates and sorted by correlation strength #24728

chrisluedtke · 2019-01-11T17:33:23Z

Code Sample, a copy-pastable example if possible

Functionalized example of what I'm seeking to implement in pandas as a corr argument or separate function:

import numpy as np
import pandas as pd

def correlate_sort(df: pd.DataFrame, method: str = 'pearson') -> pd.DataFrame:
  """
  pd.DataFrame.corr() without redundancy and sorted by strength
  """
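  df = df.corr(method)
  # Mask the lower triangle and the diagonal so each pair appears once.
  # The builtin bool replaces np.bool, which was removed in NumPy 1.24.
  df = df.mask(np.tril(np.ones(df.shape)).astype(bool))
  # stack() drops the masked (NaN) cells under its default dropna behavior
  df = df.stack().reset_index()
  df = df.rename(columns={0:method})
  
  # Sort by absolute correlation using a temporary helper column
  df['sort'] = df[method].abs()
  df = df.sort_values('sort', ascending=False)
  
  return df.drop('sort', axis=1).reset_index(drop=True)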

Problem description

pd.DataFrame.corr() returns a symmetric table, so every pair of columns appears twice and each column's correlation with itself sits on the diagonal. I'm interested in implementing an enhancement (as an argument to corr or as a separate function) that returns a DataFrame without this redundancy, sorted by correlation strength.

import pandas as pd

data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data'
df = pd.read_csv(data_url, names=['Age', 'op_year', 'pos_nodes', '5_yr_outcome'])
df.corr()

                   Age   op_year  pos_nodes  5_yr_outcome
Age           1.000000  0.089529  -0.063176     -0.067950
op_year       0.089529  1.000000  -0.003764      0.004768
pos_nodes    -0.063176 -0.003764   1.000000     -0.286768
5_yr_outcome -0.067950  0.004768  -0.286768      1.000000
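
To make the redundancy concrete, here is a minimal sketch on synthetic data (the column names `a` to `d` are placeholders, not the Haberman columns). It checks that the correlation matrix is symmetric and counts how many of its cells are distinct column pairs.

```python
import numpy as np
import pandas as pd

# Synthetic data with hypothetical column names, not the Haberman dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("abcd"))

corr = demo.corr()
n = corr.shape[0]

# The matrix is symmetric: corr(x, y) == corr(y, x).
is_symmetric = np.allclose(corr.to_numpy(), corr.to_numpy().T)

# Indices of the strict upper triangle (k=1 skips the diagonal).
rows, cols = np.triu_indices(n, k=1)
n_unique_pairs = len(rows)

print(is_symmetric)           # True
print(n * n, n_unique_pairs)  # 16 6
```

For four columns, only 6 of the 16 cells carry distinct information, which matches the six rows in the expected output.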

Expected Output

     level_0       level_1   pearson
0  pos_nodes  5_yr_outcome -0.286768
1        Age       op_year  0.089529
2        Age  5_yr_outcome -0.067950
3        Age     pos_nodes -0.063176
4    op_year  5_yr_outcome  0.004768
5    op_year     pos_nodes -0.003764
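
An output of this shape can also be built without masking and stacking. The following is a sketch under stated assumptions, not a pandas API: the function name `unique_pair_correlations` and the column labels `var_1` and `var_2` are hypothetical, and the `key` argument of `sort_values` requires pandas 1.1 or later. Unlike the `stack()` approach, it keeps pairs whose correlation is NaN (sorted last).

```python
import numpy as np
import pandas as pd

def unique_pair_correlations(df: pd.DataFrame, method: str = "pearson") -> pd.DataFrame:
    """Long-format correlations, one row per unordered column pair."""
    corr = df.corr(method=method)
    labels = corr.columns.to_numpy()
    # k=1 selects the strict upper triangle, skipping the diagonal.
    rows, cols = np.triu_indices(len(labels), k=1)
    out = pd.DataFrame({
        "var_1": labels[rows],   # hypothetical column label
        "var_2": labels[cols],   # hypothetical column label
        method: corr.to_numpy()[rows, cols],
    })
    # Sort by absolute value without a temporary column (pandas >= 1.1).
    out = out.sort_values(method, key=lambda s: s.abs(), ascending=False)
    return out.reset_index(drop=True)

# Synthetic example: y is almost a linear function of x, z is independent.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "x": x,
    "y": 2 * x + 0.01 * rng.normal(size=200),
    "z": rng.normal(size=200),
})
result = unique_pair_correlations(demo)
print(result)
```

With three columns the result has three rows, and the `x`/`y` pair comes first because its correlation is close to 1.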

Output of `pd.show_versions()`

/usr/local/lib/python3.6/dist-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: . """)

INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.79+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.10.1
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.14.6
scipy: 1.1.0
pyarrow: None
xarray: 0.11.2
IPython: 5.5.0
sphinx: 1.8.3
patsy: 0.5.1
dateutil: 2.5.3
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 2.1.2
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.6
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: 0.2.0
fastparquet: None
pandas_gbq: 0.4.1
pandas_datareader: 0.7.0


mroeschke · 2019-01-11T20:01:03Z

This sounds like more of a usage question. We recommend using StackOverflow for these types of questions.

Issues are reserved for bug tracking and enhancement requests.

chrisluedtke · 2019-01-11T20:59:59Z

@mroeschke I want to make an enhancement. Is this a feature that would be useful?

mroeschke · 2019-01-11T21:06:06Z

Oh I see. I'll open it back up for discussion. I could see this as a useful cookbook example in our documentation; we are somewhat hesitant to expand pandas' large API without strong interest from the community.

jbrockmendel · 2020-09-24T01:26:05Z

I agree with @mroeschke; I have used this pattern before myself and don't need it implemented in pandas.

mroeschke closed this as completed Jan 11, 2019

mroeschke added the Usage Question label Jan 11, 2019

mroeschke added this to the No action milestone Jan 11, 2019

mroeschke reopened this Jan 11, 2019

mroeschke added the Needs Discussion label (requires discussion from core team before further action) and removed the Usage Question label Jan 11, 2019

mroeschke removed this from the No action milestone Jan 11, 2019

mroeschke added the Docs and Numeric Operations (arithmetic, comparison, and logical operations) labels Jan 11, 2019

jbrockmendel added the Closing Candidate label (may be closeable, needs more eyeballs) Sep 24, 2020

mroeschke removed the Needs Discussion and Closing Candidate labels Jul 5, 2022

jbrockmendel added the cov/corr label and removed the Numeric Operations label Mar 30, 2023
