Columns and Index share the same numpy object underneath when pd.DataFrame.cov is used #14617

Closed
kapilsh opened this Issue Nov 8, 2016 · 4 comments


kapilsh commented Nov 8, 2016 edited

A small, complete example of the issue

In [1]: import pandas as pd
        import numpy as np
In [2]: df = pd.DataFrame(np.random.randn(4 * 1000).reshape(1000, 4), columns=list("abcd"))
In [3]: c = df.cov()

In [4]: c.index is c.columns
Out[4]: True

In [5]: c.index.name = "ABC"

In [6]: c.columns.name
Out[6]: 'ABC'

Expected Output

In [7]: c.index is c.columns
Out[7]: False

Output of pd.show_versions()

In [8]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.36.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.utf8
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.0
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.8
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext)
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None

kapilsh commented Nov 8, 2016

In my use case, I am doing something like below:

In [96]: df = pd.DataFrame({"Value": np.random.randn(1000), "Kind": map(chr, np.random.randint(65, 69, 1000))})

In [97]: df.pivot(values="Value", columns="Kind").ffill().diff().cov()
Out[97]: 
Kind             A             B             C             D
Kind                                                        
A     6.094439e-01  1.864854e-06 -5.956038e-07 -1.130525e-08
B     1.864854e-06  5.643768e-01  1.384354e-06  2.627663e-08
C    -5.956038e-07  1.384354e-06  4.964671e-01 -1.802524e-08
D    -1.130525e-08  2.627663e-08 -1.802524e-08  3.862837e-01

In [98]: cc = df.pivot(values="Value", columns="Kind").ffill().diff().cov()

In [99]: cc.index is cc.columns
Out[99]: True

As a result,

cc.unstack().reset_index()

fails.
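
A possible workaround, sketched under the assumption of a recent pandas release where `rename_axis` accepts `index=` and `columns=` keywords: give the two axes distinct names before unstacking, so that `reset_index` has no duplicate level names to insert. The names `Kind_row`, `Kind_col` and `cov` are made up for illustration, and the list comprehension replaces the Python 2 `map` call so the snippet also runs on Python 3.

```python
import numpy as np
import pandas as pd

# Fixed seed so the example is repeatable (an assumption, not in the report).
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "Value": rng.randn(1000),
    "Kind": [chr(i) for i in rng.randint(65, 69, 1000)],
})

cc = df.pivot(values="Value", columns="Kind").ffill().diff().cov()

# rename_axis returns a new frame; the axis names here are hypothetical.
named = cc.rename_axis(index="Kind_row", columns="Kind_col")

# With distinct level names, the long format can be built.
long_form = named.unstack().reset_index(name="cov")
```

`long_form` then has one row per pair of kinds, with the column level first and the row level second.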

Contributor

jreback commented Nov 8, 2016

Yeah, it should shallow copy the index first rather than setting the same object, so that metadata will not be shared.

Want to do a PR?
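
The effect of the suggested shallow copy can be sketched outside of `cov` itself. This is only an illustration of `Index.copy()`, not the actual patch: the copy holds the same labels but is a separate object, so setting a name on one does not leak to the other.

```python
import numpy as np
import pandas as pd

cols = pd.Index(list("abcd"))

# Shallow copy: a new Index object with the same labels.
idx = cols.copy()

# Hypothetical stand-in for a covariance result; the values are arbitrary.
result = pd.DataFrame(np.eye(4), index=idx, columns=cols)

# Naming one axis leaves the other untouched.
result.index.name = "ABC"
```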

jreback added this to the Next Major Release milestone Nov 8, 2016

kapilsh commented Nov 8, 2016

Sure! I can do a PR. Feel free to assign it to me.

kapilsh added a commit to kapilsh/pandas that referenced this issue Nov 16, 2016

kapilsh Updates for Bug #14617
Index now copies the columns array instead of using the same object in
covariance and correlations stats methods of dataframe
428cb37
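
A quick check of the intended behavior, assuming a pandas release that already includes a fix for this issue: for both `cov` and `corr`, the index and columns of the result should be distinct objects with independent names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 4), columns=list("abcd"))

results = {"cov": df.cov(), "corr": df.corr()}

checks = {}
for label, frame in results.items():
    # Name only the index, then record whether the columns stayed unnamed.
    frame.index.name = "ABC"
    checks[label] = (
        frame.index is not frame.columns,
        frame.columns.name is None,
    )
```

On an affected version such as the 0.19.1 install reported above, both entries of each tuple would be expected to come out `False`.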

kapilsh commented Nov 16, 2016

@jreback Made the changes to cov and corr.

jreback closed this in d0a281f Feb 28, 2017

AnkurDedania added a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017

mroeschke + AnkurDedania BUG: DataFrame index & column returned by corr & cov are the same (#14617)

closes #14617

Author: Matt Roeschke <emailformattr@gmail.com>

Closes #15528 from mroeschke/fix_14617 and squashes the following commits:

5a46f0a [Matt Roeschke] Bug:DataFrame index & column returned by corr & cov are the same (#14617)
4a3035b