### 4.1 Hierarchical Indexing

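#### 1. Hierarchical indexing on a Series
1. Hierarchical indexing means that a single axis carries two or more index levels.  
 To create a Series with a multi-level index, pass a list of lists/arrays as the index; each inner list supplies one level.  
 In the printed output, a blank in an outer index level means "same label as the row above".
2. Selecting a single label on the outermost level returns the same kind of object: a Series stays a Series, and a DataFrame stays a DataFrame.  
 The result has one index level fewer than the original.
3. To select on the inner level of a Series, use obj.loc[:, innerIdx]. To pick a single value, give both levels: obj[outerIdx, innerIdx].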

In [31]:
import pandas as pd
from pandas import Series,DataFrame
import numpy as np

# A Series with a multi-level index
obj = Series(np.random.randn(9),
            index = [list('aaabbccdd'),list('123131223')])
obj


a  1    0.617027
   2   -0.383758
   3    0.555424
b  1   -0.887932
   3    0.535598
c  1    0.919304
   2   -2.137841
d  2   -1.580804
   3    0.998615
dtype: float64

In [32]:
# The multi-level index
print(obj.index)
print("================[1]==============")
# Selecting one outer label returns a Series
print(obj['b'])
print("================[2]==============")
# Select the data for several outer labels
print(obj[['b','c']])
print("================[3]==============")
# Select the data for one inner label
obj.loc[:,'2']

MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [u'1', u'2', u'3']],
           labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1, 1, 2]])
================[1]==============
1   -0.887932
3    0.535598
dtype: float64
================[2]==============
b  1   -0.887932
   3    0.535598
c  1    0.919304
   2   -2.137841
dtype: float64
================[3]==============


a   -0.383758
c   -2.137841
d   -1.580804
dtype: float64
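The three selection styles above can be compared side by side. The following is a minimal sketch in Python 3 syntax; it assumes fixed values (`np.arange`) instead of the random data used in the cells above so that the results are reproducible, and the names `outer`, `inner`, and `scalar` are only illustrative.

```python
import numpy as np
import pandas as pd

# Fixed values (an assumption for reproducibility) with the same index as above.
obj = pd.Series(
    np.arange(9.0),
    index=[list("aaabbccdd"), list("123131223")],
)

outer = obj["b"]              # one outer label: Series indexed by the inner level
inner = obj.loc[:, "2"]       # one inner label: Series indexed by the outer level
scalar = obj.loc[("a", "2")]  # both levels given: a single value

print(outer)
print(inner)
print(scalar)
```

In both Series results, the level that was used for the selection is gone from the index, which is the "one level fewer" behaviour described in the list.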

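#### 2. stack and unstack
1. Series.unstack() : un-stacks the data by pivoting the inner index level of a Series into column labels. Returns a DataFrame; index combinations missing from the original Series are filled with NaN.
2. DataFrame.stack() : stacks the data by pivoting the column labels into the innermost index level. Returns a Series when the columns have a single level.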

In [38]:
df = obj.unstack()
df

          1         2         3
a  0.617027 -0.383758  0.555424
b -0.887932       NaN  0.535598
c  0.919304 -2.137841       NaN
d       NaN -1.580804  0.998615


In [39]:
df.stack()

a  1    0.617027
   2   -0.383758
   3    0.555424
b  1   -0.887932
   3    0.535598
c  1    0.919304
   2   -2.137841
d  2   -1.580804
   3    0.998615
dtype: float64
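The round trip between the two shapes can be checked directly. This is a small sketch in Python 3 syntax under two assumptions: fixed values replace the random data, and an explicit `dropna()` is added after `stack()` because whether `stack()` itself discards the NaN cells depends on the pandas version.

```python
import numpy as np
import pandas as pd

# Same index as above, with fixed values (an assumption for reproducibility).
obj = pd.Series(
    np.arange(9.0),
    index=[list("aaabbccdd"), list("123131223")],
)

wide = obj.unstack()          # inner level -> columns; missing pairs become NaN
tall = wide.stack().dropna()  # columns -> inner level; NaN cells removed explicitly

print(wide)
print(tall)
```

`wide` has 4 rows and 3 columns with three NaN cells, one for each (outer, inner) pair that the original Series does not contain; `tall` holds the same nine values as `obj`.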

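#### 3. Hierarchical indexing on a DataFrame
1. Either axis of a DataFrame can carry a hierarchical index: both index and columns accept a list of lists/arrays.
2. Each index level can be given a name:  
  1. DataFrame.index.names = [str, ...] : names the levels of the row index
  2. DataFrame.columns.names = [str, ...] : names the levels of the column index
  
 3. Selecting from a DataFrame with multi-level columns:  
  DataFrame['outIdx'] selects every column under an outer label; DataFrame['outIdx','innerIdx'] narrows the selection to a single inner label.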

In [48]:
# A DataFrame whose index and columns are both multi-level
df = DataFrame(np.arange(12).reshape(4,3),
              index = [list('aaba'),list('1212')],
              columns=[['Ohio', 'Ohio', 'Colorado'],['Green', 'Red', 'Green']])
df

     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
a 2     9  10       11


In [97]:
df.index.names=['key1','key2']
df.columns.names=['state','color']

df

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
a    2        9  10       11


In [98]:
df['Ohio','Green']

key1  key2
a     1       0
      2       3
b     1       6
a     2       9
Name: (Ohio, Green), dtype: int64
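The difference between selecting on the outer column level only and selecting on both levels can be sketched as follows (Python 3 syntax; the variable names `ohio` and `ohio_green` are illustrative and not part of the original notebook).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=[list("aaba"), list("1212")],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)
df.index.names = ["key1", "key2"]
df.columns.names = ["state", "color"]

ohio = df["Ohio"]                 # DataFrame; only the "color" level remains
ohio_green = df["Ohio", "Green"]  # Series for one (state, color) column

print(ohio)
print(ohio_green)
```

Selecting on the outer label alone keeps a DataFrame whose columns are the inner labels, while giving both labels returns a single column as a Series.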

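#### 4. Reordering and sorting the index levels of a DataFrame
1. Swap the inner and outer levels of the row index:  
 DataFrame.swaplevel('rowIdx1','rowIdx2',axis=0) : axis=0 is the default and swaps levels of the row index. The rows themselves are not reordered.
2. DataFrame.sort_index(level=[n,n]) : sorts the DataFrame by the given index levels, in the order listed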

In [99]:
print(df.index.names)
# df.swaplevel('key1', 'key2',axis=0)
df.swaplevel(0,1,axis=0)

['key1', 'key2']


state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    a        9  10       11


In [104]:
df.swaplevel('key1','key2',axis=0).sort_index(level=[0,1])
# df.swaplevel(0,1,axis=0).sort_index(level=[0,1])

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     a        9  10       11
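To make the two steps easier to tell apart, here is a reduced sketch in Python 3 syntax. It assumes a simplified stand-in frame with a single `value` column (not the two-level columns used above) and the same row labels, including the repeated `('a', '2')` pair.

```python
import pandas as pd

# Simplified stand-in for the frame above: same row labels, one column.
index = pd.MultiIndex.from_arrays(
    [list("aaba"), list("1212")], names=["key1", "key2"]
)
df = pd.DataFrame({"value": [0, 3, 6, 9]}, index=index)

swapped = df.swaplevel("key1", "key2")      # levels exchanged, row order unchanged
ordered = swapped.sort_index(level=[0, 1])  # rows sorted by key2, then key1

print(swapped)
print(ordered)
```

`swaplevel` only changes which level is outermost; the rows move only once `sort_index` is applied.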


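#### 5. Summing over a multi-level index
1. Sums can be computed per label of one level of a multi-level index.  
 DataFrame.sum(level='',axis=n) : available in older pandas releases. The level argument was deprecated in pandas 1.3 and removed in 2.0, where DataFrame.groupby(level='').sum() is the replacement.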

In [102]:
df.sum(level='color',axis=1)


color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
a    2        20   10
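On pandas versions where `sum` no longer accepts `level`, the same per-level totals can be obtained with `groupby`. The sketch below (Python 3 syntax) is one possible way to write it, not the notebook's original code; it assumes a variant of the frame with unique row labels (`a, a, b, b`) and transposes before grouping so that the column level can be grouped like a row level.

```python
import numpy as np
import pandas as pd

# Variant of the frame above with unique row labels (an assumption).
df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=[list("aabb"), list("1212")],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)
df.index.names = ["key1", "key2"]
df.columns.names = ["state", "color"]

# Sum over a column level: transpose, group on that level, transpose back.
by_color = df.T.groupby(level="color").sum().T

# Sum over a row level: group directly on the index level.
by_key1 = df.groupby(level="key1").sum()

print(by_color)
print(by_key1)
```

The `Green` column of `by_color` adds the `Ohio` and `Colorado` values of each row, matching the totals shown in the output above.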


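#### 6. Turning DataFrame columns into the row index
1. DataFrame.set_index([columnIdx1,columnIdx2]) :  
 Uses one or more columns as the row index. By default these columns are removed from the data and the original index is discarded.  
 To keep the columns as well, pass drop=False.
 To keep the original index as an additional level, pass append=True.
 
2. DataFrame.reset_index() : moves every level of the row index back into columns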

In [107]:
df = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'], 'd': [0, 1, 2, 0, 1, 2, 3]},
                 index=list('zyxwvut'))
df

   a  b    c  d
z  0  7  one  0
y  1  6  one  1
x  2  5  one  2
w  3  4  two  0
v  4  3  two  1
u  5  2  two  2
t  6  1  two  3


In [119]:
df2 = df.set_index(['c','d'])
print(df2.index)
df2

MultiIndex(levels=[[u'one', u'two'], [0, 1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 0, 1, 2, 3]],
           names=[u'c', u'd'])


       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1


In [127]:
df3 = df.set_index(['c','d'],append=True)
print(df3.index)
df3

MultiIndex(levels=[[u't', u'u', u'v', u'w', u'x', u'y', u'z'], [u'one', u'two'], [0, 1, 2, 3]],
           labels=[[6, 5, 4, 3, 2, 1, 0], [0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 0, 1, 2, 3]],
           names=[None, u'c', u'd'])


         a  b
  c   d
z one 0  0  7
y one 1  1  6
x one 2  2  5
w two 0  3  4
v two 1  4  3
u two 2  5  2
t two 3  6  1


In [126]:
df3.reset_index()

  level_0    c  d  a  b
0       z  one  0  0  7
1       y  one  1  1  6
2       x  one  2  2  5
3       w  two  0  3  4
4       v  two  1  4  3
5       u  two  2  5  2
6       t  two  3  6  1
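The cells above do not exercise `drop=False`, so the sketch below (Python 3 syntax) puts the three `set_index` variants next to each other and then undoes one of them with `reset_index`. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": range(7),
        "b": range(7, 0, -1),
        "c": ["one", "one", "one", "two", "two", "two", "two"],
        "d": [0, 1, 2, 0, 1, 2, 3],
    },
    index=list("zyxwvut"),
)

dropped = df.set_index(["c", "d"])                # c, d leave the columns; old index discarded
kept = df.set_index(["c", "d"], drop=False)       # c, d stay as columns too
appended = df.set_index(["c", "d"], append=True)  # old index kept as the outermost level
restored = appended.reset_index()                 # every index level becomes a column

print(kept)
print(restored)
```

Because the original index has no name, `reset_index` labels the column it produces `level_0`, as in the output above.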


### 4.2 Merging Datasets