DataFrame.delevel infer dtypes better #440

Closed
wesm opened this issue Dec 2, 2011 · 2 comments

wesm commented Dec 2, 2011

cc @lodagro


MultiIndex seems to always store its level data as dtype('object').
As a result, the columns that DataFrame.delevel() adds from the index also have dtype('object').
This prevents using DataFrame.delevel().corr() to look at the correlation between the original DataFrame columns and the index level values. Does anyone have an idea for working around this?

See example below:

In [1]: import pandas

In [2]: import numpy as np

In [3]: import itertools

In [4]: tuples = list(itertools.product(['foo', 'bar'], [10, 20], [1.0, 1.1]))

In [5]: index = pandas.MultiIndex.from_tuples(tuples, names=['prm0', 'prm1', 'prm2'])

In [6]: df = pandas.DataFrame(np.random.randn(8,3), columns=['A', 'B', 'C'], index=index)

In [7]: df
Out[7]:
                A       B       C
prm0 prm1 prm2
foo  10   1.0   0.2074  0.3425 -1.295
          1.1   0.3194  0.8114  2.133
foo  20   1.0  -0.1798 -1.162   0.5774
          1.1  -0.4635  1.436   1.419
bar  10   1.0  -1.013   0.7605 -1.184
          1.1  -0.4716  0.6983  0.5209
bar  20   1.0  -0.87   -0.3788  0.272
          1.1   1.018  -0.4496  1.132

In [8]: df.corr()
Out[8]:
   A       B        C
A  1      -0.2445   0.3852
B -0.2445  1        0.08211
C  0.3852  0.08211  1

In [9]: df.delevel().corr()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
   2535         cols = self.columns
   2536         mat = self.as_matrix(cols).T
-> 2537         baseCov = np.cov(mat)
   2538
   2539         sigma = np.sqrt(np.diag(baseCov))

.../python2.7/site-packages/numpy/lib/function_base.pyc in cov(m, y, rowvar, bias, ddof)
   1920         raise ValueError("ddof must be integer")
   1921
-> 1922     X = array(m, ndmin=2, dtype=float)
   1923     if X.shape[0] == 1:
   1924         rowvar = 1

ValueError: setting an array element with a sequence.

My guess is that this exception occurs because corr cannot work with strings.
So let's try it without the strings.

In [10]: df.delevel()[['prm1', 'prm2', 'A', 'B', 'C']]
Out[10]:
   prm1  prm2  A       B       C
0  10    1     0.2074  0.3425 -1.295
1  10    1.1   0.3194  0.8114  2.133
2  20    1    -0.1798 -1.162   0.5774
3  20    1.1  -0.4635  1.436   1.419
4  10    1    -1.013   0.7605 -1.184
5  10    1.1  -0.4716  0.6983  0.5209
6  20    1    -0.87   -0.3788  0.272
7  20    1.1   1.018  -0.4496  1.132

In [11]: df.delevel()[['prm1', 'prm2', 'A', 'B', 'C']].corr()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[...]
TypeError: function not supported for these types, and can't coerce safely to supported types

In [12]: df.delevel()['prm1'].values.dtype
Out[12]: dtype('object')

In [13]: df.delevel()['prm1']
Out[13]:
0    10
1    10
2    20
3    20
4    10
5    10
6    20
7    20
Name: prm1

In [14]: index.levels
Out[14]:
[Index([bar, foo], dtype=object),
 Index([10, 20], dtype=object),
 Index([1.0, 1.1], dtype=object)]
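
As a possible workaround for the question above, the level columns can be coerced to a numeric dtype by hand before calling corr(). This is only a minimal sketch, assuming a recent pandas in which delevel() is spelled reset_index() and pd.to_numeric is available; the seeded random generator is an assumption added so the run is repeatable.

```python
import itertools

import numpy as np
import pandas as pd

# Same index layout as the example above. The seed is an assumption,
# used only to make this sketch repeatable.
tuples = list(itertools.product(["foo", "bar"], [10, 20], [1.0, 1.1]))
index = pd.MultiIndex.from_tuples(tuples, names=["prm0", "prm1", "prm2"])
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((8, 3)), columns=["A", "B", "C"], index=index)

# reset_index() is the later name for delevel().
flat = df.reset_index()

# Coerce the numeric-looking level columns explicitly,
# whatever dtype they arrived with.
for col in ["prm1", "prm2"]:
    flat[col] = pd.to_numeric(flat[col])

# Leave out the string column prm0 and correlate the rest.
corr = flat[["prm1", "prm2", "A", "B", "C"]].corr()
print(corr)
```

The explicit coercion is harmless when the columns are already numeric, so the same lines should work whether or not the installed version infers the dtypes on its own.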

wesm commented Dec 2, 2011

Alright, I think I've got this working like it should. I also changed DataFrame.corr so it automatically excludes non-numeric dtypes, per the above. So deleveled.corr() will now work in the example you cited.
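
For readers who want the same effect by hand, the non-numeric columns can be dropped explicitly before correlating. This is a rough sketch, not necessarily how the change was implemented; it assumes a pandas version that provides DataFrame.select_dtypes, and the small frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: one string column and two numeric ones.
frame = pd.DataFrame(
    {
        "prm0": ["foo", "foo", "bar", "bar"],  # strings, cannot be correlated
        "prm1": [10, 20, 10, 20],
        "A": [0.5, -1.0, 1.5, 0.25],
    }
)

# Keep only numeric columns, then correlate what is left.
numeric = frame.select_dtypes(include="number")
corr = numeric.corr()
print(corr)
```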

wesm closed this as completed Dec 2, 2011

lodagro commented Feb 7, 2012

Below is an example where, after reset_index() (formerly delevel()), the dtype could be float but object is used.
This is on master.

In [36]: s
Out[36]:
time
0.0                 0.0000
0.707106781187      2.4525
1.41421356237       9.8100
2.12132034356      22.0725
2.82842712475      39.2400
3.53553390593      61.3125
4.24264068712      88.2900
4.94974746831     120.1725
5.65685424949     156.9600
6.36396103068     198.6525
7.07106781187     245.2500
7.77817459305     296.7525
8.48528137424     353.1600
9.19238815543     414.4725
9.89949493661     480.6900
Name: speed

In [37]: df = s.reset_index()

In [38]: df
Out[38]:
         time     speed
0           0    0.0000
1   0.7071068    2.4525
2    1.414214    9.8100
3     2.12132   22.0725
4    2.828427   39.2400
5    3.535534   61.3125
6    4.242641   88.2900
7    4.949747  120.1725
8    5.656854  156.9600
9    6.363961  198.6525
10   7.071068  245.2500
11   7.778175  296.7525
12   8.485281  353.1600
13   9.192388  414.4725
14   9.899495  480.6900

In [39]: df['time'].dtype
Out[39]: dtype('object')
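
Until the dtype is inferred automatically, the column can be converted after the fact. The sketch below is an assumption-laden illustration: it rebuilds a series resembling the one above (the formulas for time and speed are guessed from the printed values), forces an object-dtype index to mimic the report, and uses DataFrame.infer_objects from recent pandas releases.

```python
import numpy as np
import pandas as pd

# Guessed reconstruction of the series above: time = k / sqrt(2),
# speed = 0.5 * 9.81 * time**2. These formulas are an assumption.
time = np.arange(15) / np.sqrt(2)
speed = 0.5 * 9.81 * time**2

# Force an object-dtype index to mimic the situation in the report.
s = pd.Series(speed, index=pd.Index(time, dtype=object, name="time"), name="speed")

df = s.reset_index()

# infer_objects() converts object columns holding plain floats to float64
# and leaves columns that are already numeric untouched.
fixed = df.infer_objects()
print(fixed.dtypes)
```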
