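### 4.1 Hierarchical indexing

#### 1. Hierarchical indexing on a Series
1. Hierarchical indexing means that a single axis carries two or more index levels.  
 To create a Series with a multi-level index, pass a list of lists/arrays as index; that list becomes the multi-level index.  
 A blank in the printed index means "use the label directly above".
2. Selecting by an outermost index label returns a Series if the original is a Series, and a DataFrame if the original is a DataFrame.  
 The result has one index level fewer than the original.
3. To select on the inner level of a Series, use obj[outerIdx,innerIdx]; to take one inner label across all outer labels, use obj.loc[:,innerIdx]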

In [3]:
import pandas as pd
from pandas import Series,DataFrame
import numpy as np

# Series with a multi-level index
obj = Series(np.random.randn(9),
            index = [list('aaabbccdd'),list('123131223')])
obj


a  1   -0.146974
   2    0.019822
   3    0.005051
b  1   -0.456671
   3    0.284298
c  1    1.432425
   2   -1.274916
d  2   -0.454502
   3   -0.846909
dtype: float64

In [4]:
# The multi-level index
print(obj.index)
print("================[1]==============")
# Selecting an outer label returns a Series
print(obj['b'])
print("================[2]==============")
# Select several outer labels at once
print(obj[['b','c']])
print("================[3]==============")
# Select by inner label across all outer labels
obj.loc[:,'2']

MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [u'1', u'2', u'3']],
           labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1, 1, 2]])
1   -0.456671
3    0.284298
dtype: float64
b  1   -0.456671
   3    0.284298
c  1    1.432425
   2   -1.274916
dtype: float64


a    0.019822
c   -1.274916
d   -0.454502
dtype: float64
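
The three selection styles listed above can be compared side by side. The following is a minimal sketch, not part of the original notebook: it uses fixed values instead of random ones so that the results are reproducible, and the variable names (s, outer, single, inner) are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Same two-level index as in the notes, but with fixed values 0.0 .. 8.0.
s = pd.Series(np.arange(9.0),
              index=[list('aaabbccdd'), list('123131223')])

# Outer label only: a Series indexed by the remaining inner level.
outer = s.loc['b']

# Outer and inner label together: a single scalar value.
single = s.loc[('b', '3')]

# One inner label across all outer labels: a Series indexed by the outer level.
inner = s.loc[:, '2']

print(outer)
print(single)
print(inner)
```

Using .loc with an explicit tuple keeps the lookup label-based, which avoids ambiguity when the labels happen to look like positions.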

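#### 2. stack and unstack
1. Series.unstack() : un-stacks the data by pivoting the inner index level of the Series into column labels. Returns a DataFrame
2. DataFrame.stack() : stacks the data by pivoting the column labels into the innermost index level. Returns a Series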

In [5]:
df = obj.unstack()
df

          1         2         3
a -0.146974  0.019822  0.005051
b -0.456671       NaN  0.284298
c  1.432425 -1.274916       NaN
d       NaN -0.454502 -0.846909


In [6]:
df.stack()

a  1   -0.146974
   2    0.019822
   3    0.005051
b  1   -0.456671
   3    0.284298
c  1    1.432425
   2   -1.274916
d  2   -0.454502
   3   -0.846909
dtype: float64
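
unstack pivots the inner level by default, but the level argument can name a different level to pivot. The sketch below is an illustration under assumed fixed data (the names s, wide and wide_outer are not from the original notebook); it also shows that index combinations missing from the Series come back as NaN.

```python
import numpy as np
import pandas as pd

# Fixed values so the result is reproducible.
s = pd.Series(np.arange(9.0),
              index=[list('aaabbccdd'), list('123131223')])

# Default: the inner level ('1', '2', '3') becomes the columns.
wide = s.unstack()

# level=0: the outer level ('a' .. 'd') becomes the columns instead.
wide_outer = s.unstack(level=0)

print(wide)
print(wide_outer)

# ('b', '2') does not exist in the Series, so the cell is filled with NaN.
print(pd.isna(wide.loc['b', '2']))
```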

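#### 3. Hierarchical indexing on a DataFrame
1. Either axis of a DataFrame can have a hierarchical index, i.e. both index and columns can be given as a list of lists/arrays
2. Each index level can be named:  
  1. DataFrame.index.names = [str, ...] : names each level of the row index
  2. DataFrame.columns.names = [str, ...] : names each level of the column index
  
 3. Selecting from a DataFrame with multi-level columns:  
  DataFrame['outerIdx'] selects every column under an outer label; DataFrame['outerIdx','innerIdx'] narrows the selection down to a single inner label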

In [7]:
# DataFrame whose index and columns are both multi-level
df = DataFrame(np.arange(12).reshape(4,3),
              index = [list('aaba'),list('1212')],
              columns=[['Ohio', 'Ohio', 'Colorado'],['Green', 'Red', 'Green']])
df

     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
a 2     9  10       11


In [8]:
df.index.names=['key1','key2']
df.columns.names=['state','color']

df

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
a    2        9  10       11


In [9]:
df['Ohio','Green']

key1  key2
a     1       0
      2       3
b     1       6
a     2       9
Name: (Ohio, Green), dtype: int64
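
Selecting on multi-level columns returns different types depending on how many levels are given. The following sketch rebuilds the frame from the cells above and contrasts three selections; the use of xs for the inner level is an addition for illustration and does not appear in the original notes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  index=[list('aaba'), list('1212')],
                  columns=[['Ohio', 'Ohio', 'Colorado'],
                           ['Green', 'Red', 'Green']])
df.index.names = ['key1', 'key2']
df.columns.names = ['state', 'color']

# Outer label only: a DataFrame with the remaining inner level as columns.
ohio = df['Ohio']

# Outer and inner label: a single column, returned as a Series.
ohio_green = df[('Ohio', 'Green')]

# Inner label across all outer labels: a cross-section on the 'color' level.
greens = df.xs('Green', axis=1, level='color')

print(ohio)
print(ohio_green)
print(greens)
```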

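#### 4. Reordering and sorting index levels of a DataFrame
1. To swap the inner and outer row-index levels of a DataFrame:  
 DataFrame.swaplevel('rowIdx1','rowIdx2',axis=0) : axis=0 is the default and swaps row-index levels
2. DataFrame.sort_index(level=[n,n]) : sorts the DataFrame by the index at the given level(s)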

In [10]:
# A MultiIndex has no single name, so this prints None; the per-level names are in df.index.names
print(df.index.name)
# df.swaplevel('key1', 'key2',axis=0)
df.swaplevel(0,1,axis=0)

None


state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    a        9  10       11


In [11]:
df.swaplevel('key1','key2',axis=0).sort_index(level=[0,1])
# df.swaplevel(0,1,axis=0).sort_index(level=[0,1])

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     a        9  10       11


#### 5. Summing over a level of a multi-level index
1. A sum can be computed over one level of a multi-level index.  
 DataFrame.sum(level='',axis=n)

In [12]:
df.sum(level='color',axis=1)


color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
a    2        20   10
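
The level argument of sum is tied to older pandas releases; to the best of my knowledge it was deprecated and then removed in pandas 2.0. A possible equivalent, sketched below, transposes the frame, groups on the level with groupby, and transposes back. This is one way to get the same totals, not the notebook's own method, and the row labels are simplified to be unique here (an assumption made only to keep the example small).

```python
import numpy as np
import pandas as pd

# Same values and columns as above; row labels made unique for this sketch.
df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  index=[list('aabb'), list('1212')],
                  columns=[['Ohio', 'Ohio', 'Colorado'],
                           ['Green', 'Red', 'Green']])
df.index.names = ['key1', 'key2']
df.columns.names = ['state', 'color']

# Transpose so the column levels become row levels, sum per 'color',
# then transpose back to the original orientation.
by_color = df.T.groupby(level='color').sum().T

print(by_color)
```

The Green column adds up Ohio/Green and Colorado/Green, which matches the totals shown in the output above.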


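#### 6. Turning DataFrame columns into the row index
1. DataFrame.set_index([columnIdx1,columnIdx2]) :  
 Uses one or more columns as the row index. By default these columns are removed from the DataFrame and the original index is discarded  
 To keep the columns, pass drop=False  
 To keep the original index, pass append=True
 
2. DataFrame.reset_index() : moves every level of the row index back into columns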

In [13]:
df = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'], 'd': [0, 1, 2, 0, 1, 2, 3]},
                 index=list('zyxwvut'))
df

   a  b    c  d
z  0  7  one  0
y  1  6  one  1
x  2  5  one  2
w  3  4  two  0
v  4  3  two  1
u  5  2  two  2
t  6  1  two  3


In [14]:
df2 = df.set_index(['c','d'])
print(df2.index)
df2

MultiIndex(levels=[[u'one', u'two'], [0, 1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 0, 1, 2, 3]],
           names=[u'c', u'd'])


       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1


In [15]:
df3 = df.set_index(['c','d'],append=True)
print(df3.index)
df3

MultiIndex(levels=[[u't', u'u', u'v', u'w', u'x', u'y', u'z'], [u'one', u'two'], [0, 1, 2, 3]],
           labels=[[6, 5, 4, 3, 2, 1, 0], [0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 0, 1, 2, 3]],
           names=[None, u'c', u'd'])


         a  b
  c   d
z one 0  0  7
y one 1  1  6
x one 2  2  5
w two 0  3  4
v two 1  4  3
u two 2  5  2
t two 3  6  1


In [16]:
df3.reset_index()

  level_0    c  d  a  b
0       z  one  0  0  7
1       y  one  1  1  6
2       x  one  2  2  5
3       w  two  0  3  4
4       v  two  1  4  3
5       u  two  2  5  2
6       t  two  3  6  1
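
The cells above cover the default behaviour and append=True, but not drop=False. The sketch below fills that gap with the same data and also shows set_index followed by reset_index as a round trip; the variable names kept and restored are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                   'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                   'd': [0, 1, 2, 0, 1, 2, 3]},
                  index=list('zyxwvut'))

# drop=False: 'c' and 'd' become the index but also stay as columns.
kept = df.set_index(['c', 'd'], drop=False)

# set_index followed by reset_index moves the key columns to the front
# and replaces the original index with a default integer index.
restored = df.set_index(['c', 'd']).reset_index()

print(kept)
print(restored)
```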


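### 4.2 Merging datasets

#### 1. Joining DataFrames : pandas.merge(df1,df2)
1. merge(df1,df2) :   
 By default the overlapping column names are used as the join keys, but it is better to name the join key explicitly (parameter on='columnIdx')  
2. If the key columns are named differently in the two DataFrames,   
 pass left_on='left_columnIdx', right_on='right_columnIdx'  
 merge performs an inner join by default, so key values without a match on the other side are dropped
3.  Parameter how='outer'/'left'/'right' :  
 selects a full outer join, a left join, or a right join respectively  
4. When the two DataFrames have overlapping non-key column names after the join, the parameter suffixes=('left_suffix','right_suffix') distinguishes which DataFrame each column came from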

In [18]:
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                 'data1': np.arange(7)})
df1

   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b


In [20]:
df2 = DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})
df2

   data2 key
0      0   a
1      1   b
2      2   d


In [33]:
# Join key with the same column name in both DataFrames
pd.merge(df1,df2,on='key')

   data1 key  data2
0      0   b      1
1      1   b      1
2      6   b      1
3      2   a      0
4      4   a      0
5      5   a      0


In [26]:
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],  'data2': range(3)})

In [24]:
df3

   data1 lkey
0      0    b
1      1    b
2      2    a
3      3    c
4      4    a
5      5    a
6      6    b


In [25]:
df4

   data2 rkey
0      0    a
1      1    b
2      2    d


In [28]:
# The join columns have different names in the two DataFrames
# merge defaults to an inner join, so the rows with lkey == 'c' and rkey == 'd' are dropped
pd.merge(df3,df4,left_on='lkey',right_on='rkey')

   data1 lkey  data2 rkey
0      0    b      1    b
1      1    b      1    b
2      6    b      1    b
3      2    a      0    a
4      4    a      0    a
5      5    a      0    a


In [29]:
# Full outer join (keeps the two unmatched keys), which introduces NaN values
pd.merge(df3,df4,left_on='lkey',right_on='rkey',how='outer')

   data1 lkey  data2 rkey
0    0.0    b    1.0    b
1    1.0    b    1.0    b
2    6.0    b    1.0    b
3    2.0    a    0.0    a
4    4.0    a    0.0    a
5    5.0    a    0.0    a
6    3.0    c    NaN  NaN
7    NaN  NaN    2.0    d
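
The notes mention how='left' and how='right' without running them. The sketch below compares all four join types on the df1/df2 data from the earlier cells by counting the resulting rows; the result names (inner, left, right, outer) are chosen for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

inner = pd.merge(df1, df2, on='key')               # only keys present in both
left = pd.merge(df1, df2, on='key', how='left')    # every row of df1 is kept
right = pd.merge(df1, df2, on='key', how='right')  # every key of df2 is kept
outer = pd.merge(df1, df2, on='key', how='outer')  # union of both key sets

for name, result in [('inner', inner), ('left', left),
                     ('right', right), ('outer', outer)]:
    print(name, len(result))

# In the left join, key 'c' has no partner in df2, so data2 is NaN there.
print(left[left['key'] == 'c'])
```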


In [36]:
pd.merge(df3,df3,left_on="lkey",right_on="lkey",how="inner",suffixes=("_left","_right"))

   data1_left lkey  data1_right
0           0    b            0
1           0    b            1
2           0    b            6
3           1    b            0
4           1    b            1
5           1    b            6
6           6    b            0
7           6    b            1
8           6    b            6
9           2    a            2
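
The effect of suffixes is easier to see with two different frames that share a non-key column name. The following is a small sketch with made-up data (the frames a and b and the column value are hypothetical, chosen only for this illustration).

```python
import pandas as pd

# Two hypothetical frames that both have a 'value' column besides the key.
a = pd.DataFrame({'key': ['a', 'b'], 'value': [1, 2]})
b = pd.DataFrame({'key': ['a', 'b'], 'value': [10, 20]})

# The key column is shared as-is; the clashing 'value' columns get a suffix.
merged = pd.merge(a, b, on='key', suffixes=('_left', '_right'))

print(merged)
```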


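#### 2. Using several columns as the merge key
1. In a multi-column merge, rows are joined only when the values in all key columns match
2. A multi-column merge can be thought of as forming a tuple from the keys and joining on that tuple as if it were a single key (this is not how it is actually implemented)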

In [31]:
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],'key2': ['one', 'two', 'one'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],'key2': ['one', 'one', 'one', 'two'],'rval': [4, 5, 6, 7]})
pd.merge(left,right,on=['key1','key2'],how='outer')

  key1 key2  lval  rval
0  foo  one   1.0   4.0
1  foo  one   1.0   5.0
2  foo  two   2.0   NaN
3  bar  one   3.0   6.0
4  bar  two   NaN   7.0
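
To see why all key columns must match, it helps to compare a merge on both keys with a merge on only one of them. The sketch below reuses the left/right frames from the cell above with the default inner join; the names both_keys and one_key are chosen for illustration.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

# Both keys: a row pair is joined only if key1 AND key2 are equal.
both_keys = pd.merge(left, right, on=['key1', 'key2'])

# Only key1: key2 is now an ordinary overlapping column, so it is
# duplicated with the default suffixes _x and _y.
one_key = pd.merge(left, right, on='key1')

print(both_keys)
print(one_key)
```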


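#### 3. Joining on the index
1. The parameters left_index=True / right_index=True make the left or right table use its index as the join key.  
2. The index of the left table can be joined against a column of the right table with (left_index=True, right_on="columnIdx")
3. If both tables join on their index, use left_index=True,right_index=True   
4. With a multi-level index: if the left table has a multi-level index and left_index=True, then list several columns of the right table: right_on=['columnIdx1','columnIdx2']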

In [44]:
# Join a column of the left table against the index of the right table
left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
print(left1)
print("==============[1]=============")
print(right1)
print("==============[2]=============")
pd.merge(left1,right1,left_on="key",right_index=True, how="left")

  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5
   group_val
a        3.5
b        7.0


  key  value  group_val
0   a      0        3.5
1   b      1        7.0
2   a      2        3.5
3   a      3        3.5
4   b      4        7.0
5   c      5        NaN


In [48]:
# Join a multi-level index against several columns
lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio','Nevada', 'Nevada'],
                                    'key2': [2000, 2001, 2002, 2001, 2002],
                                    'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
                      index=[['Nevada', 'Nevada', 'Ohio', 'Ohio','Ohio', 'Ohio'],
                                      [2001, 2000, 2000, 2000, 2001, 2002]],
                      columns=['event1', 'event2'])
print(lefth)
print("==============[1]=============")
print(righth)
print("==============[2]=============")
pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True,how='outer')

   data    key1  key2
0   0.0    Ohio  2000
1   1.0    Ohio  2001
2   2.0    Ohio  2002
3   3.0  Nevada  2001
4   4.0  Nevada  2002
             event1  event2
Nevada 2001       0       1
       2000       2       3
Ohio   2000       4       5
       2000       6       7
       2001       8       9
       2002      10      11


   data    key1  key2  event1  event2
0   0.0    Ohio  2000     4.0     5.0
0   0.0    Ohio  2000     6.0     7.0
1   1.0    Ohio  2001     8.0     9.0
2   2.0    Ohio  2002    10.0    11.0
3   3.0  Nevada  2001     0.0     1.0
4   4.0  Nevada  2002     NaN     NaN
4   NaN  Nevada  2000     2.0     3.0
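
The case where both tables join on their index (left_index=True, right_index=True) is listed above but not run. The sketch below uses two small made-up frames (left2 and right2 are hypothetical names and data, not from the original notebook) to show an outer and an inner join on the index.

```python
import pandas as pd

# Hypothetical frames whose row labels overlap only partly.
left2 = pd.DataFrame([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                     index=['a', 'c', 'e'],
                     columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7.0, 8.0], [9.0, 10.0], [11.0, 12.0], [13.0, 14.0]],
                      index=['b', 'c', 'd', 'e'],
                      columns=['Missouri', 'Alabama'])

# Outer join on both indexes: the union of the labels, NaN where missing.
both = pd.merge(left2, right2, left_index=True, right_index=True, how='outer')

# Inner join on both indexes: only labels present in both frames.
common = pd.merge(left2, right2, left_index=True, right_index=True)

print(both)
print(common)
```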


#### 6. Concatenating along an axis