corr return without duplicates and sorted by correlation strength #24728

Open
chrisluedtke opened this issue Jan 11, 2019 · 4 comments

chrisluedtke commented Jan 11, 2019

Code Sample, a copy-pastable example if possible

Functionalized example of what I'm seeking to implement in pandas as a corr argument or separate function:

import numpy as np
import pandas as pd

def correlate_sort(df: pd.DataFrame, method: str = 'pearson') -> pd.DataFrame:
  """
  pd.DataFrame.corr() without redundancy and sorted by strength
  """
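  df = df.corr(method=method)
  # Mask the lower triangle and the diagonal so each pair appears only once.
  # The builtin bool is used because np.bool is unavailable in some NumPy releases.
  df = df.mask(np.tril(np.ones(df.shape)).astype(bool))
  # Reshape the remaining upper triangle into one row per column pair
  df = df.stack().reset_index()
  df = df.rename(columns={0: method})

  # Sort by absolute correlation via a temporary helper column
  df['sort'] = df[method].abs()
  df = df.sort_values('sort', ascending=False)

  return df.drop('sort', axis=1).reset_index(drop=True)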

Problem description

pd.DataFrame.corr() returns a table with redundancies. I'm interested in implementing an enhancement (as an argument option or function, etc.) to return a DataFrame without redundancy and sorted by correlation strength.

import pandas as pd

data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data'
df = pd.read_csv(data_url, names=['Age', 'op_year', 'pos_nodes', '5_yr_outcome'])
df.corr()
                   Age   op_year  pos_nodes  5_yr_outcome
Age           1.000000  0.089529  -0.063176     -0.067950
op_year       0.089529  1.000000  -0.003764      0.004768
pos_nodes    -0.063176 -0.003764   1.000000     -0.286768
5_yr_outcome -0.067950  0.004768  -0.286768      1.000000

Expected Output

     level_0       level_1   pearson
0  pos_nodes  5_yr_outcome -0.286768
1        Age       op_year  0.089529
2        Age  5_yr_outcome -0.067950
3        Age     pos_nodes -0.063176
4    op_year  5_yr_outcome  0.004768
5    op_year     pos_nodes -0.003764
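
For readers on newer library versions, here is a minimal sketch of the same idea, not a proposed pandas API. The function name `corr_pairs` and the small `demo` frame are hypothetical and used only for illustration. It assumes pandas 1.1 or later (for the `key` argument of `sort_values`) and calls `dropna()` explicitly rather than relying on `stack()` to discard the masked entries.

```python
import numpy as np
import pandas as pd


def corr_pairs(df: pd.DataFrame, method: str = "pearson") -> pd.DataFrame:
    """Return each unique column pair once, sorted by absolute correlation."""
    corr = df.corr(method=method)
    # True strictly above the diagonal: drops self-correlations and mirrored pairs.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = (
        corr.where(upper)   # entries outside the upper triangle become NaN
        .stack()            # long format indexed by (row label, column label)
        .dropna()           # explicit, in case stack() keeps NaN entries
        .rename(method)
        .rename_axis(["level_0", "level_1"])
        .reset_index()
    )
    # Sort by magnitude without adding a temporary helper column.
    pairs = pairs.sort_values(method, key=lambda s: s.abs(), ascending=False)
    return pairs.reset_index(drop=True)


# Hypothetical toy data, only to exercise the function.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 1, 4, 3, 6],
    "c": [5, 3, 4, 1, 2],
})
result = corr_pairs(demo)
print(result)
```

For `n` numeric columns the result has `n * (n - 1) / 2` rows, one per unordered pair, with the same `level_0`, `level_1` and method-named columns as the expected output above.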

Output of pd.show_versions()

/usr/local/lib/python3.6/dist-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: . """)

INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.79+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.10.1
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.14.6
scipy: 1.1.0
pyarrow: None
xarray: 0.11.2
IPython: 5.5.0
sphinx: 1.8.3
patsy: 0.5.1
dateutil: 2.5.3
pytz: 2018.9
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 2.1.2
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.6
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: 0.2.0
fastparquet: None
pandas_gbq: 0.4.1
pandas_datareader: 0.7.0

@mroeschke
Member

This sounds more like a usage question. We recommend using StackOverflow for these types of questions.

Issues are reserved for bug tracking and enhancement requests.

@mroeschke mroeschke added this to the No action milestone Jan 11, 2019
@chrisluedtke
Author

@mroeschke I want to make an enhancement. Is this a feature that would be useful?

@mroeschke
Member

Oh I see. I'll open it back up for discussion. I could see this as a useful cookbook example in our documentation; we are somewhat hesitant to expand pandas' already large API without substantial interest from the community.

@mroeschke mroeschke reopened this Jan 11, 2019
@mroeschke mroeschke added Needs Discussion Requires discussion from core team before further action and removed Usage Question labels Jan 11, 2019
@mroeschke mroeschke removed this from the No action milestone Jan 11, 2019
@mroeschke mroeschke added Docs Numeric Operations Arithmetic, Comparison, and Logical operations labels Jan 11, 2019
@jbrockmendel
Member

I agree with @mroeschke; I have used this pattern before myself and don't need it implemented in pandas.

@jbrockmendel jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Sep 24, 2020
@mroeschke mroeschke removed Needs Discussion Requires discussion from core team before further action Closing Candidate May be closeable, needs more eyeballs labels Jul 5, 2022
@jbrockmendel jbrockmendel added cov/corr and removed Numeric Operations Arithmetic, Comparison, and Logical operations labels Mar 30, 2023