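#### Lecture 3. Indexing
Indexing underlies almost every data-processing task, so a solid grasp of it makes working with data in pandas considerably easier.

This lecture covers three topics: single-level indexes, multi-level indexes, and common operations on indexes.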


In [1]:
import pandas as pd
import numpy as np

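##### 3.1. Single-level indexes

In pandas, every row is associated with a label, and the sequence of these labels is the row index of the Series or DataFrame. When each label is a single variable, the index is a single-level index.

For example, indexing a Series of students by student ID gives a single-level index.

When each label is a combination of several variables, the index is a multi-level index, for example indexing students by their class together with their ID within the class.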
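
The distinction can be seen in a minimal sketch. The student IDs, class names, and scores below are made up for illustration and are not part of the course dataset.

```python
import pandas as pd

# Hypothetical data: three students and their scores.
scores = [88, 92, 75]

# Single-level index: each row is labeled by one variable (the student ID).
single = pd.Series(scores, index=[1001, 1002, 1003], name="score")

# Multi-level index: each row is labeled by a (class, in-class ID) pair.
multi = pd.Series(
    scores,
    index=pd.MultiIndex.from_tuples(
        [("Class1", 1), ("Class1", 2), ("Class2", 1)],
        names=["class", "id"],
    ),
    name="score",
)

print(single.index.nlevels)  # number of levels in the single-level index
print(multi.index.nlevels)   # number of levels in the multi-level index
```

The `nlevels` attribute is a quick way to check which kind of index an object carries.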


a) Row index of a Series

Index labels can be strings or numbers.



In [20]:
# Use s[item] to look up a single label. A label that maps to one value returns a scalar; a label that maps to several values returns a Series. For example:
s=pd.Series([1,2,3,4,5,6],index=['a','b','a','a','a','c'])
s['b']

2

In [45]:
s['a']

a    1
a    3
a    4
a    5
dtype: int64

In [46]:
# Several labels can be selected at once by passing a list of labels inside []. For example:
s[['c','b']]

c    6
b    2
dtype: int64

In [47]:
# If both endpoint labels are unique in the index, the elements between them can be selected with a slice. For example:

In [48]:
s['c':'b':-2]

# Note: a label slice includes both endpoints.

c    6
a    4
b    2
dtype: int64

In [49]:
# If an endpoint label is duplicated (non-unique), the index must be sorted before slicing. For example, the following line raises an error:
s['a':'b']

KeyError: "Cannot get left slice bound for non-unique label: 'a'"

In [None]:
# To fix this, sort the index first and then slice. For example:
s.sort_index()['a':'b']

a    1
a    3
a    4
a    5
b    2
dtype: int64

In [None]:
# An integer slice selects by position. As in Python, the interval is closed on the left and open on the right.
s[1:-1:2]

b    2
a    4
dtype: int64
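
The different endpoint rules for label slices and positional slices are easy to mix up. The sketch below uses a small made-up Series (`s_demo` is a hypothetical name) and the explicit `.loc` and `.iloc` accessors to make the intent of each slice unambiguous.

```python
import pandas as pd

# Hypothetical Series with unique string labels.
s_demo = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = s_demo.loc["b":"c"]   # label slice: both endpoints are included
by_position = s_demo.iloc[1:2]   # positional slice: the right endpoint is excluded

print(by_label.tolist())
print(by_position.tolist())
```

Writing `.loc` or `.iloc` explicitly avoids having to remember how plain `[]` interprets a given slice.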

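b) Indexing a DataFrame

A DataFrame has both a row index and a column index.

Column indexing is the most common form. It is usually written as [...] and selects columns from the DataFrame by name. A single column name returns a Series; a list of column names returns a DataFrame.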
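
A short sketch of the return types, using a made-up table (`toy` and its contents are hypothetical stand-ins for the course dataset):

```python
import pandas as pd

# Hypothetical toy table.
toy = pd.DataFrame(
    {"Name": ["Ann", "Bob"], "Gender": ["Female", "Male"], "Weight": [50.0, 70.0]}
)

one_col = toy["Name"]               # single label -> Series
two_cols = toy[["Name", "Gender"]]  # list of labels -> DataFrame
also_frame = toy[["Name"]]          # a one-element list still gives a DataFrame

print(type(one_col).__name__, type(two_cols).__name__, type(also_frame).__name__)
```

The last line shows that the return type depends on whether a list is passed, not on how many columns are selected.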

In [21]:
df=pd.read_csv('data/learn_pandas.csv',usecols=['School','Grade','Name','Gender','Weight','Transfer'])
df['Name'].head() # attribute access, shown below, is an equivalent alternative
# df.Name.head()

0      Gaopeng Yang
1    Changqiang You
2           Mei Sun
3      Xiaojuan Sun
4       Gaojuan You
Name: Name, dtype: object

In [4]:
df[['Name','Gender']].head()

Unnamed: 0,Name,Gender
0,Gaopeng Yang,Female
1,Changqiang You,Male
2,Mei Sun,Male
3,Xiaojuan Sun,Female
4,Gaojuan You,Male


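A DataFrame can also be selected by row. There are two indexers: the label-based loc and the position-based iloc.
* The general form of the loc indexer is loc[*, *], where the first argument selects rows and the second selects columns. If the second argument is omitted, it can be written as loc[*]. Each * can be a single label, a list of labels, a label slice, or a boolean array.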
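
The four kinds of argument can be sketched on a small made-up table. The row labels `r1` to `r4` and the data are hypothetical.

```python
import pandas as pd

# Hypothetical table with string row labels.
toy = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cid", "Dee"], "Weight": [50.0, 70.0, 82.0, 61.0]},
    index=["r1", "r2", "r3", "r4"],
)

single = toy.loc["r2", "Name"]            # scalar row label and scalar column label
listed = toy.loc[["r1", "r3"], ["Name"]]  # lists of labels
sliced = toy.loc["r2":"r3"]               # label slice, both endpoints included
masked = toy.loc[toy["Weight"] > 60]      # boolean Series aligned on the row index

print(single)
print(list(sliced.index))
print(list(masked.index))
```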

In [None]:
# Select a single row
df.loc[4]

School      Fudan University
Grade              Sophomore
Name             Gaojuan You
Gender                  Male
Weight                  74.0
Transfer                   N
Name: 4, dtype: object

In [None]:
# Select a single row and a single column
df.loc[3,'Name']

'Xiaojuan Sun'

In [None]:
# Select several rows
df.loc[[1,3,5]]

Unnamed: 0,School,Grade,Name,Gender,Weight,Transfer
1,Peking University,Freshman,Changqiang You,Male,70.0,N
3,Fudan University,Sophomore,Xiaojuan Sun,Female,41.0,N
5,Tsinghua University,Freshman,Xiaoli Qian,Female,51.0,N


In [None]:
# Select several rows and several columns
df.loc[[1,3,5],['School','Name','Gender']]

Unnamed: 0,School,Name,Gender
1,Peking University,Changqiang You,Male
3,Fudan University,Xiaojuan Sun,Female
5,Tsinghua University,Xiaoli Qian,Female


In [None]:
# Select several rows and columns with slices
df.loc[1:10,'School':'Name']

Unnamed: 0,School,Grade,Name
1,Peking University,Freshman,Changqiang You
2,Shanghai Jiao Tong University,Senior,Mei Sun
3,Fudan University,Sophomore,Xiaojuan Sun
4,Fudan University,Sophomore,Gaojuan You
5,Tsinghua University,Freshman,Xiaoli Qian
6,Shanghai Jiao Tong University,Freshman,Qiang Chu
7,Tsinghua University,Junior,Gaoqiang Qian
8,Tsinghua University,Freshman,Changli Zhang
9,Peking University,Junior,Juan Xu
10,Shanghai Jiao Tong University,Freshman,Xiaopeng Zhou


In [23]:
# Filter rows by a condition
# Select students who weigh more than 70
df.Weight>70
df.loc[df.Weight>70].head()

Unnamed: 0,School,Grade,Name,Gender,Weight,Transfer
2,Shanghai Jiao Tong University,Senior,Mei Sun,Male,89.0,N
4,Fudan University,Sophomore,Gaojuan You,Male,74.0,N
10,Shanghai Jiao Tong University,Freshman,Xiaopeng Zhou,Male,74.0,N
18,Tsinghua University,Senior,Xiaofeng Sun,Male,71.0,N
23,Shanghai Jiao Tong University,Senior,Qiang Zheng,Male,87.0,N


In [25]:
# Conditions can also use isin and be combined with | and &.
# Example 1: select students whose School is Fudan University or Shanghai Jiao Tong University
df.School.isin(['Fudan University','Shanghai Jiao Tong University'])
df.loc[df.School.isin(['Fudan University','Shanghai Jiao Tong University'])].head()

Unnamed: 0,School,Grade,Name,Gender,Weight,Transfer
0,Shanghai Jiao Tong University,Freshman,Gaopeng Yang,Female,46.0,N
2,Shanghai Jiao Tong University,Senior,Mei Sun,Male,89.0,N
3,Fudan University,Sophomore,Xiaojuan Sun,Female,41.0,N
4,Fudan University,Sophomore,Gaojuan You,Male,74.0,N
6,Shanghai Jiao Tong University,Freshman,Qiang Chu,Female,52.0,N


In [11]:
# Example 2: select seniors who weigh more than 70 kg
condition=(df.Grade=='Senior') & (df.Weight>70)
# condition
df.loc[condition].head()

Unnamed: 0,School,Grade,Name,Gender,Weight,Transfer
2,Shanghai Jiao Tong University,Senior,Mei Sun,Male,89.0,N
18,Tsinghua University,Senior,Xiaofeng Sun,Male,71.0,N
23,Shanghai Jiao Tong University,Senior,Qiang Zheng,Male,87.0,N
66,Fudan University,Senior,Chengpeng Zhou,Male,81.0,N
127,Peking University,Senior,Changquan Han,Male,77.0,N


* The iloc indexer works like loc, except that it selects by position. Each * can be an integer, a list of integers, an integer slice, or a boolean array.

In [26]:
# For example, get the element in the second row and second column
df
df.iloc[1,1]

'Freshman'

In [None]:
# Select several rows and several columns
df.iloc[[1,3,5],[2,3,4]]

Unnamed: 0,Name,Gender,Weight
1,Changqiang You,Male,70.0
3,Xiaojuan Sun,Female,41.0
5,Xiaoli Qian,Female,51.0


In [None]:
# Select several rows and columns with slices
df.iloc[1:4,2:5]

Unnamed: 0,Name,Gender,Weight
1,Changqiang You,Male,70.0
2,Mei Sun,Male,89.0
3,Xiaojuan Sun,Female,41.0


In [31]:
# Pass a boolean array
s=df.Weight>80
# s
# s.values
df.iloc[s.values]  # Note: iloc does not accept a boolean Series; pass the Series' values instead.

Unnamed: 0,School,Grade,Name,Gender,Weight,Transfer
2,Shanghai Jiao Tong University,Senior,Mei Sun,Male,89.0,N
23,Shanghai Jiao Tong University,Senior,Qiang Zheng,Male,87.0,N
38,Peking University,Freshman,Qiang Han,Male,87.0,N
66,Fudan University,Senior,Chengpeng Zhou,Male,81.0,N
71,Shanghai Jiao Tong University,Sophomore,Feng Han,Male,82.0,N
99,Peking University,Freshman,Changpeng Zhao,Male,83.0,N
117,Shanghai Jiao Tong University,Freshman,Chunli Zhao,Male,83.0,N
134,Shanghai Jiao Tong University,Senior,Gaoli Zhao,Male,83.0,N


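##### 3.2 Multi-level indexes

A multi-level index extends the single-level index. This section covers its structure, its indexers, and its construction.

##### 1. Constructing a multi-level index

The three common constructors are from_tuples(), from_arrays(), and from_product(). All of them are class methods of MultiIndex.

* from_tuples() builds the index from a list of tuples.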

In [None]:
my_tuple = [('a','cat'),('a','dog'),('b','cat'),('b','dog')]
pd.MultiIndex.from_tuples(my_tuple, names=['First','Second'])

MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])


* from_arrays() builds the index from a list of arrays, one array per level.


In [None]:
my_array = [list('aabb'), ['cat', 'dog']*2]
pd.MultiIndex.from_arrays(my_array, names=['First','Second'])

MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])

* from_product() builds the index from the Cartesian product of several lists.

In [18]:
my_list1 = ['a','b']
my_list2 = ['cat','dog']
pd.MultiIndex.from_product([my_list1, my_list2], names=['First','Second'])

MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])
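
The three outputs above look identical. As a quick check, the sketch below rebuilds the same index with each constructor and compares the results with `equals`. The variable names are chosen for this illustration only.

```python
import pandas as pd

names = ["First", "Second"]

from_tuples = pd.MultiIndex.from_tuples(
    [("a", "cat"), ("a", "dog"), ("b", "cat"), ("b", "dog")], names=names
)
from_arrays = pd.MultiIndex.from_arrays(
    [list("aabb"), ["cat", "dog"] * 2], names=names
)
from_product = pd.MultiIndex.from_product(
    [["a", "b"], ["cat", "dog"]], names=names
)

# equals() compares the index entries element by element.
same = from_tuples.equals(from_arrays) and from_arrays.equals(from_product)
print(same)
```

from_product() is the most compact choice when every combination is wanted, while from_tuples() and from_arrays() can also describe an index in which some combinations are missing.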

##### 2. Structure of a table with multi-level indexes

In [32]:
# Build a DataFrame with multi-level row and column indexes
np.random.seed(0)
multi_index = pd.MultiIndex.from_product([list('ABCD'), df.Gender.unique()], names=('School', 'Gender'))
multi_column = pd.MultiIndex.from_product([['Height', 'Weight'], df.Grade.unique()], names=('Indicator', 'Grade'))
df_multi = pd.DataFrame(np.c_[(np.random.randn(8,4)*5 + 163).tolist(), (np.random.randn(8,4)*5 + 65).tolist()],
                        index = multi_index, columns = multi_column).round(1)
df_multi

Unnamed: 0_level_0,Indicator,Height,Height,Height,Height,Weight,Weight,Weight,Weight
Unnamed: 0_level_1,Grade,Freshman,Senior,Sophomore,Junior,Freshman,Senior,Sophomore,Junior
School,Gender,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
A,Female,171.8,165.0,167.9,174.2,60.6,55.1,63.3,65.8
A,Male,172.3,158.1,167.8,162.2,71.2,71.0,63.1,63.5
B,Female,162.5,165.1,163.7,170.3,59.8,57.9,56.5,74.8
B,Male,166.8,163.6,165.2,164.7,62.5,62.8,58.7,68.9
C,Female,170.5,162.0,164.6,158.7,56.9,63.9,60.5,66.9
C,Male,150.2,166.3,167.3,159.3,62.4,59.1,64.9,67.1
D,Female,174.3,155.7,163.2,162.1,65.3,66.5,61.8,63.2
D,Male,170.7,170.3,163.8,164.9,61.6,63.2,60.9,56.4


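The figure below uses color to mark the structure of the `DataFrame`. As in a table with a single-level index, there are three parts: the values, the row index, and the column index. Here both the row index and the column index are of type `MultiIndex`, and **each element of the index is a tuple** rather than a scalar as in a single-level index. For example, the fourth element of the row index is `("B", "Male")` and the second element of the column index is `("Height", "Senior")`. Note that when the same outer-level value appears several times in a row, every occurrence after the first is hidden in the display, which makes the result easier to read.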

<img src="img/multi_index.png" width="50%">

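Like a single-level index, a `MultiIndex` has names. In the figure, `School` and `Gender` are the names of the first and second levels of the row index, and `Indicator` and `Grade` are the names of the first and second levels of the column index.

The names and values of an index are available through the `names` and `values` attributes: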


In [36]:
df_multi.index.names

FrozenList(['School', 'Gender'])

In [67]:
df_multi.index.values

array([('A', 'Female'), ('A', 'Male'), ('B', 'Female'), ('B', 'Male'),
       ('C', 'Female'), ('C', 'Male'), ('D', 'Female'), ('D', 'Male')],
      dtype=object)

In [37]:
df_multi.columns

MultiIndex([('Height',  'Freshman'),
            ('Height',    'Senior'),
            ('Height', 'Sophomore'),
            ('Height',    'Junior'),
            ('Weight',  'Freshman'),
            ('Weight',    'Senior'),
            ('Weight', 'Sophomore'),
            ('Weight',    'Junior')],
           names=['Indicator', 'Grade'])

In [68]:
df_multi.columns.values

array([('Height', 'Freshman'), ('Height', 'Senior'),
       ('Height', 'Sophomore'), ('Height', 'Junior'),
       ('Weight', 'Freshman'), ('Weight', 'Senior'),
       ('Weight', 'Sophomore'), ('Weight', 'Junior')], dtype=object)

In [38]:
# To get the labels of a single level, use get_level_values()

df_multi.index.get_level_values(0)

Index(['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'], dtype='object', name='School')

In [40]:
df_multi.columns.get_level_values(1)

Index(['Freshman', 'Senior', 'Sophomore', 'Junior', 'Freshman', 'Senior',
       'Sophomore', 'Junior'],
      dtype='object', name='Grade')
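
The same attributes can be explored on a smaller index. In the sketch below, `idx` is a made-up two-level index. It also shows that get_level_values() accepts a level name as well as a level number.

```python
import pandas as pd

# Hypothetical two-level index.
idx = pd.MultiIndex.from_product(
    [["A", "B"], ["Female", "Male"]], names=["School", "Gender"]
)

print(list(idx.names))                         # names of the levels
print(idx[3])                                  # one element of the index is a tuple
print(idx.get_level_values(0).tolist())        # level selected by position
print(idx.get_level_values("Gender").tolist()) # level selected by name
```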

##### 3. The loc indexer

Using the learn_pandas dataset from earlier, set School and Grade as the index with the set_index function.

In [42]:
df_multi=df.set_index(['School','Grade'])
df_multi.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Gender,Weight,Transfer
School,Grade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Shanghai Jiao Tong University,Freshman,Gaopeng Yang,Female,46.0,N
Peking University,Freshman,Changqiang You,Male,70.0,N
Shanghai Jiao Tong University,Senior,Mei Sun,Male,89.0,N
Fudan University,Sophomore,Xiaojuan Sun,Female,41.0,N
Fudan University,Sophomore,Gaojuan You,Male,74.0,N


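Each element of a multi-level index is a tuple. loc is used the same way as with a single-level index, with each scalar label replaced by the corresponding tuple.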

In [43]:
df_multi.loc[('Peking University','Senior')]


Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Gender,Weight,Transfer
School,Grade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Peking University,Senior,Changli Lv,Female,41.0,N
Peking University,Senior,Feng Zheng,Female,49.0,N
Peking University,Senior,Feng Zhao,Male,66.0,N
Peking University,Senior,Changquan Han,Male,77.0,N
Peking University,Senior,Mei Feng,Female,51.0,N
Peking University,Senior,Chunpeng Qian,Female,,N
Peking University,Senior,Juan You,Male,69.0,
Peking University,Senior,Yanmei Qian,Female,49.0,


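Looking up rows in an unsorted multi-level index, as in the cell above, can trigger a pandas performance warning. To avoid it, sort the index first with the sort_index() function.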

In [44]:
df_multi_s=df_multi.sort_index()
df_multi_s.loc[('Peking University','Senior')]

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Gender,Weight,Transfer
School,Grade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Peking University,Senior,Changli Lv,Female,41.0,N
Peking University,Senior,Feng Zheng,Female,49.0,N
Peking University,Senior,Feng Zhao,Male,66.0,N
Peking University,Senior,Changquan Han,Male,77.0,N
Peking University,Senior,Mei Feng,Female,51.0,N
Peking University,Senior,Chunpeng Qian,Female,,N
Peking University,Senior,Juan You,Male,69.0,
Peking University,Senior,Yanmei Qian,Female,49.0,


In [45]:
# Rows under several index entries can be selected at once by passing a list of tuples separated by commas

df_multi_s.loc[[('Peking University','Senior'),('Shanghai Jiao Tong University','Freshman')]]

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Gender,Weight,Transfer
School,Grade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Peking University,Senior,Changli Lv,Female,41.0,N
Peking University,Senior,Feng Zheng,Female,49.0,N
Peking University,Senior,Feng Zhao,Male,66.0,N
Peking University,Senior,Changquan Han,Male,77.0,N
Peking University,Senior,Mei Feng,Female,51.0,N
Peking University,Senior,Chunpeng Qian,Female,,N
Peking University,Senior,Juan You,Male,69.0,
Peking University,Senior,Yanmei Qian,Female,49.0,
Shanghai Jiao Tong University,Freshman,Gaopeng Yang,Female,46.0,N
Shanghai Jiao Tong University,Freshman,Qiang Chu,Female,52.0,N


In [79]:
# A conditional expression also works
df_multi_s.loc[df_multi_s.Weight>70].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Gender,Weight,Transfer
School,Grade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Fudan University,Freshman,Feng Wang,Male,74.0,N
Fudan University,Junior,Chunqiang Chu,Male,72.0,N
Fudan University,Junior,Changfeng Lv,Male,76.0,N
Fudan University,Senior,Chengpeng Zhou,Male,81.0,N
Fudan University,Senior,Chengpeng Qian,Male,73.0,Y


In [82]:
# Slicing also works, but the index must be sorted first; otherwise an error is raised.
df_multi_s.loc[('Peking University','Senior'):].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Gender,Weight,Transfer
School,Grade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Peking University,Senior,Changli Lv,Female,41.0,N
Peking University,Senior,Feng Zheng,Female,49.0,N
Peking University,Senior,Feng Zhao,Male,66.0,N
Peking University,Senior,Changquan Han,Male,77.0,N
Peking University,Senior,Mei Feng,Female,51.0,N


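##### 3.3 Index operations

The sections above used the index to operate on data. The index itself can also be manipulated.

##### 1. Swapping and dropping index levels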

In [46]:
# Build a DataFrame with three-level row and column indexes

np.random.seed(0)
L1,L2,L3 = ['A','B'],['a','b'],['alpha','beta']
mul_index1 = pd.MultiIndex.from_product([L1,L2,L3], names=('Upper', 'Lower','Extra'))
L4,L5,L6 = ['C','D'],['c','d'],['cat','dog']
mul_index2 = pd.MultiIndex.from_product([L4,L5,L6], names=('Big', 'Small', 'Other'))
df_ex = pd.DataFrame(np.random.randint(-9,10,(8,8)), index=mul_index1,  columns=mul_index2)
df_ex

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Other,cat,dog,cat,dog,cat,dog,cat,dog
Upper,Lower,Extra,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3
A,a,alpha,3,6,-9,-6,-6,-2,0,9
A,a,beta,-5,-3,3,-8,-3,-2,5,8
A,b,alpha,-4,4,-1,0,7,-4,6,6
A,b,beta,-9,9,-6,8,5,-2,-9,-8
B,a,alpha,0,-9,1,-6,2,9,-7,-9
B,a,beta,-9,-5,-4,-3,-1,8,6,-5
B,b,alpha,0,1,-8,-8,-2,0,-6,-3
B,b,beta,2,5,9,-9,5,-6,3,1


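Index levels can be rearranged with swaplevel() and reorder_levels().

swaplevel() swaps exactly two levels.

reorder_levels() rearranges any number of levels into a given order.

Both take an axis argument that specifies whether the row index or the column index is rearranged.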


In [84]:
df_ex.swaplevel(0,2,axis=1) # swap the first and third levels of the column index

Unnamed: 0_level_0,Unnamed: 1_level_0,Other,cat,dog,cat,dog,cat,dog,cat,dog
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Big,C,C,C,C,D,D,D,D
Upper,Lower,Extra,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3
A,a,alpha,3,6,-9,-6,-6,-2,0,9
A,a,beta,-5,-3,3,-8,-3,-2,5,8
A,b,alpha,-4,4,-1,0,7,-4,6,6
A,b,beta,-9,9,-6,8,5,-2,-9,-8
B,a,alpha,0,-9,1,-6,2,9,-7,-9
B,a,beta,-9,-5,-4,-3,-1,8,6,-5
B,b,alpha,0,1,-8,-8,-2,0,-6,-3
B,b,beta,2,5,9,-9,5,-6,3,1


In [85]:
df_ex.swaplevel(0,2,axis=0) # swap the first and third levels of the row index

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Other,cat,dog,cat,dog,cat,dog,cat,dog
Extra,Lower,Upper,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3
alpha,a,A,3,6,-9,-6,-6,-2,0,9
beta,a,A,-5,-3,3,-8,-3,-2,5,8
alpha,b,A,-4,4,-1,0,7,-4,6,6
beta,b,A,-9,9,-6,8,5,-2,-9,-8
alpha,a,B,0,-9,1,-6,2,9,-7,-9
beta,a,B,-9,-5,-4,-3,-1,8,6,-5
alpha,b,B,0,1,-8,-8,-2,0,-6,-3
beta,b,B,2,5,9,-9,5,-6,3,1


In [47]:
df_ex.reorder_levels([2,0,1],axis=0) # reorder the row index so that the original levels 2, 0, 1 come first, second, third

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Other,cat,dog,cat,dog,cat,dog,cat,dog
Extra,Upper,Lower,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3
alpha,A,a,3,6,-9,-6,-6,-2,0,9
beta,A,a,-5,-3,3,-8,-3,-2,5,8
alpha,A,b,-4,4,-1,0,7,-4,6,6
beta,A,b,-9,9,-6,8,5,-2,-9,-8
alpha,B,a,0,-9,1,-6,2,9,-7,-9
beta,B,a,-9,-5,-4,-3,-1,8,6,-5
alpha,B,b,0,1,-8,-8,-2,0,-6,-3
beta,B,b,2,5,9,-9,5,-6,3,1
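
The two methods can be compared side by side on a smaller made-up table. The sketch also shows that reorder_levels() accepts level names in place of level numbers.

```python
import pandas as pd

# Hypothetical table with a three-level row index.
idx = pd.MultiIndex.from_product(
    [["A", "B"], ["a", "b"], ["alpha", "beta"]], names=["Upper", "Lower", "Extra"]
)
toy = pd.DataFrame({"v": range(8)}, index=idx)

swapped = toy.swaplevel(0, 2, axis=0)                              # exchange two levels
reordered = toy.reorder_levels([2, 0, 1], axis=0)                  # full new order by number
by_name = toy.reorder_levels(["Extra", "Upper", "Lower"], axis=0)  # same order by name

print(list(swapped.index.names))
print(list(reordered.index.names))
```

Neither method changes the order of the rows; only the arrangement of the levels within each label changes.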


To drop a level of the index, use the droplevel() method.

In [88]:
df_ex.droplevel(1,axis=1).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Other,cat,dog,cat,dog,cat,dog,cat,dog
Upper,Lower,Extra,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
A,a,alpha,3,6,-9,-6,-6,-2,0,9
A,a,beta,-5,-3,3,-8,-3,-2,5,8
A,b,alpha,-4,4,-1,0,7,-4,6,6
A,b,beta,-9,9,-6,8,5,-2,-9,-8
B,a,alpha,0,-9,1,-6,2,9,-7,-9


In [89]:
df_ex.droplevel([0,1],axis=0).head()

Big,C,C,C,C,D,D,D,D
Small,c,c,d,d,c,c,d,d
Other,cat,dog,cat,dog,cat,dog,cat,dog
Extra,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3
alpha,3,6,-9,-6,-6,-2,0,9
beta,-5,-3,3,-8,-3,-2,5,8
alpha,-4,4,-1,0,7,-4,6,6
beta,-9,9,-6,8,5,-2,-9,-8
alpha,0,-9,1,-6,2,9,-7,-9


##### 2. Modifying index attributes

rename_axis() changes the names of index levels. The usual way is to pass a dictionary that maps old names to new names.

In [90]:
df_ex.rename_axis(index={'Upper':'Changed_row'},columns={'Other':'Changed_col'})

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Changed_col,cat,dog,cat,dog,cat,dog,cat,dog
Changed_row,Lower,Extra,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3
A,a,alpha,3,6,-9,-6,-6,-2,0,9
A,a,beta,-5,-3,3,-8,-3,-2,5,8
A,b,alpha,-4,4,-1,0,7,-4,6,6
A,b,beta,-9,9,-6,8,5,-2,-9,-8
B,a,alpha,0,-9,1,-6,2,9,-7,-9
B,a,beta,-9,-5,-4,-3,-1,8,6,-5
B,b,alpha,0,1,-8,-8,-2,0,-6,-3
B,b,beta,2,5,9,-9,5,-6,3,1


rename() changes the index labels themselves. For a multi-level index, the level to modify must be given through the level argument.

In [48]:
df_ex.rename(columns={'cat':'not_cat'},level=2).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Other,not_cat,dog,not_cat,dog,not_cat,dog,not_cat,dog
Upper,Lower,Extra,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3
A,a,alpha,3,6,-9,-6,-6,-2,0,9
A,a,beta,-5,-3,3,-8,-3,-2,5,8
A,b,alpha,-4,4,-1,0,7,-4,6,6
A,b,beta,-9,9,-6,8,5,-2,-9,-8
B,a,alpha,0,-9,1,-6,2,9,-7,-9


In [92]:
# The argument can also be a function, which receives each index label as input
df_ex.rename(index=lambda x:str.upper(x),level=2).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Other,cat,dog,cat,dog,cat,dog,cat,dog
Upper,Lower,Extra,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3
A,a,ALPHA,3,6,-9,-6,-6,-2,0,9
A,a,BETA,-5,-3,3,-8,-3,-2,5,8
A,b,ALPHA,-4,4,-1,0,7,-4,6,6
A,b,BETA,-9,9,-6,8,5,-2,-9,-8
B,a,ALPHA,0,-9,1,-6,2,9,-7,-9
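
Since rename_axis() and rename() are easy to confuse, the sketch below applies both to the same made-up table: the first changes the name of a level, the second changes the labels inside a level.

```python
import pandas as pd

# Hypothetical table with a two-level row index.
idx = pd.MultiIndex.from_product(
    [["A", "B"], ["alpha", "beta"]], names=["Upper", "Extra"]
)
toy = pd.DataFrame({"v": [1, 2, 3, 4]}, index=idx)

# rename_axis: change the *name* of a level.
renamed_level = toy.rename_axis(index={"Upper": "Group"})

# rename: change the *labels* in one level (here selected by its name).
renamed_labels = toy.rename(index=str.upper, level="Extra")

print(list(renamed_level.index.names))
print(renamed_labels.index.get_level_values("Extra").tolist())
```

Both methods return a new object and leave `toy` unchanged.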


##### 3. Setting and resetting the index

In [93]:
df_new = pd.DataFrame({'A':list('aacd'), 'B':list('PQRT'), 'C':[1,2,3,4]})
df_new

Unnamed: 0,A,B,C
0,a,P,1
1,a,Q,2
2,c,R,3
3,d,T,4


In [94]:
# Set column A as the index
df_new.set_index('A')

Unnamed: 0_level_0,B,C
A,Unnamed: 1_level_1,Unnamed: 2_level_1
a,P,1
a,Q,2
c,R,3
d,T,4


In [95]:
# Set column A as an index level while keeping the existing index
df_new.set_index('A',append=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,B,C
Unnamed: 0_level_1,A,Unnamed: 2_level_1,Unnamed: 3_level_1
0,a,P,1
1,a,Q,2
2,c,R,3
3,d,T,4


In [96]:
# Set several columns as the index
df_new.set_index(['A','B'])

Unnamed: 0_level_0,Unnamed: 1_level_0,C
A,B,Unnamed: 2_level_1
a,P,1
a,Q,2
c,R,3
d,T,4


reset_index() is the inverse of set_index(): it turns index levels back into columns.

In [98]:
df_new=df_new.set_index(['A','B'])
df_new

Unnamed: 0_level_0,Unnamed: 1_level_0,C
A,B,Unnamed: 2_level_1
a,P,1
a,Q,2
c,R,3
d,T,4


In [99]:
df_new.reset_index('B')

Unnamed: 0_level_0,B,C
A,Unnamed: 1_level_1,Unnamed: 2_level_1
a,P,1
a,Q,2
c,R,3
d,T,4


In [100]:
# To discard the index level instead of turning it into a column, pass drop=True
df_new.reset_index('B',drop=True)

Unnamed: 0_level_0,C
A,Unnamed: 1_level_1
a,1
a,2
c,3
d,4
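
The inverse relationship can be checked with a round trip. This is a minimal sketch in which `toy` is a made-up copy of the small table used above.

```python
import pandas as pd

toy = pd.DataFrame({"A": list("aacd"), "B": list("PQRT"), "C": [1, 2, 3, 4]})

indexed = toy.set_index(["A", "B"])            # columns A and B become index levels
restored = indexed.reset_index()               # both levels return as columns
dropped = indexed.reset_index("B", drop=True)  # level B is discarded entirely

print(restored.columns.tolist())
print(dropped.index.name, dropped.columns.tolist())
```

With no arguments, reset_index() restores every level and leaves a default integer index behind.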


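##### Exercises

1. Index operations on company employee data

Given the company employee dataset data/Ex/company.csv, complete the following tasks.

(1) Use the loc indexer to select male employees who are no older than 40 and work in the Dairy or Bakery department.

(2) Select the first, third, and second-to-last columns of the rows whose employee ID is odd.

(3) Perform the following index operations in order:
* Set the last three columns as the index, then swap the inner and outer levels.
* Restore the middle index level to a column.
* Rename the outer index level to Gender.
* Merge the two row index levels with an underscore.
* Split the row index back into its previous state.
* Restore the index names to those of the original table.
* Restore the default index and keep the columns in the same relative order as in the original table.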

In [1]:
import pandas as pd
df=pd.read_csv('data/Ex/Company.csv')
df.head()

Unnamed: 0,EmployeeID,birthdate_key,age,city_name,department,job_title,gender
0,1318,1/3/1954,61,Vancouver,Executive,CEO,M
1,1319,1/3/1957,58,Vancouver,Executive,VP Stores,F
2,1320,1/2/1955,60,Vancouver,Executive,Legal Counsel,F
3,1321,1/2/1959,56,Vancouver,Executive,VP Human Resources,M
4,1322,1/9/1958,57,Vancouver,Executive,VP Finance,M


In [5]:
# (1) Use loc to select male employees who are no older than 40 and work in the Dairy or Bakery department.
# All three conditions are combined with &; the department name is spelled 'Bakery'.
condition=(df.age<=40) & (df.department.isin(['Dairy','Bakery'])) & (df.gender=='M')
df.loc[condition]


In [9]:
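# (2) Rows whose employee ID is odd, with columns chosen by 0-based position.
# [1,3,-2] picks the second, fourth, and second-to-last columns; counting "first" and "third" from one would be [0,2,-2].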
df.loc[df.EmployeeID%2!=0].iloc[:,[1,3,-2]]

Unnamed: 0,birthdate_key,city_name,job_title
1,1/3/1957,Vancouver,VP Stores
3,1/2/1959,Vancouver,VP Human Resources
5,1/9/1962,Vancouver,"Exec Assistant, VP Stores"
6,1/13/1964,Vancouver,"Exec Assistant, Legal Counsel"
8,1/23/1967,Terrace,Store Manager
...,...,...,...
6276,9/28/1989,Trail,Cashier
6277,4/7/1990,Nanaimo,Cashier
6278,10/18/1990,Abbotsford,Dairy Person
6280,9/26/1993,Prince George,Cashier


In [20]:
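# (3) Perform the following index operations in order:
# * Set the last three columns as the index, then swap the inner and outer levels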

df_tmp=df.set_index(['department','job_title','gender']).reorder_levels([2,1,0],axis=0)
df_tmp.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,EmployeeID,birthdate_key,age,city_name
gender,job_title,department,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
M,CEO,Executive,1318,1/3/1954,61,Vancouver
F,VP Stores,Executive,1319,1/3/1957,58,Vancouver
F,Legal Counsel,Executive,1320,1/2/1955,60,Vancouver
M,VP Human Resources,Executive,1321,1/2/1959,56,Vancouver
M,VP Finance,Executive,1322,1/9/1958,57,Vancouver


In [21]:
# * Restore the middle index level to a column
df_tmp.reset_index('job_title',inplace=True)

In [27]:
# * Rename the outer index level to Gender (rename_axis returns a new object, so assign the result)
df_tmp=df_tmp.rename_axis(index={'gender':'Gender'})
df_tmp

Unnamed: 0_level_0,Unnamed: 1_level_0,job_title,EmployeeID,birthdate_key,age,city_name
Gender,department,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
M,Executive,CEO,1318,1/3/1954,61,Vancouver
F,Executive,VP Stores,1319,1/3/1957,58,Vancouver
F,Executive,Legal Counsel,1320,1/2/1955,60,Vancouver
M,Executive,VP Human Resources,1321,1/2/1959,56,Vancouver
M,Executive,VP Finance,1322,1/9/1958,57,Vancouver
...,...,...,...,...,...,...
F,Customer Service,Cashier,8036,8/9/1992,23,New Westminister
M,Customer Service,Cashier,8181,9/26/1993,22,Prince George
M,Customer Service,Cashier,8223,2/11/1994,21,Trail
F,Customer Service,Cashier,8226,2/16/1994,21,Victoria


In [30]:
# * Merge the two row index levels with an underscore

new_index=df_tmp.index.map(lambda x:(x[0]+'_'+x[1]))
df_tmp.index=new_index
df_tmp.head()

Unnamed: 0,job_title,EmployeeID,birthdate_key,age,city_name
M_Executive,CEO,1318,1/3/1954,61,Vancouver
F_Executive,VP Stores,1319,1/3/1957,58,Vancouver
F_Executive,Legal Counsel,1320,1/2/1955,60,Vancouver
M_Executive,VP Human Resources,1321,1/2/1959,56,Vancouver
M_Executive,VP Finance,1322,1/9/1958,57,Vancouver


In [33]:
# * Split the row index back into its previous state

new_index=df_tmp.index.map(lambda x:tuple(x.split('_')))
df_tmp.index=new_index
df_tmp.head()

Unnamed: 0,Unnamed: 1,job_title,EmployeeID,birthdate_key,age,city_name
M,Executive,CEO,1318,1/3/1954,61,Vancouver
F,Executive,VP Stores,1319,1/3/1957,58,Vancouver
F,Executive,Legal Counsel,1320,1/2/1955,60,Vancouver
M,Executive,VP Human Resources,1321,1/2/1959,56,Vancouver
M,Executive,VP Finance,1322,1/9/1958,57,Vancouver
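
As an alternative sketch of the merge and split steps, the split can be done with MultiIndex.from_tuples(), which also allows the level names to be set at the same time. The table below is a made-up miniature of the exercise data.

```python
import pandas as pd

# Hypothetical miniature of the exercise table.
idx = pd.MultiIndex.from_tuples(
    [("M", "Executive"), ("F", "Executive"), ("F", "Dairy")],
    names=["gender", "department"],
)
toy = pd.DataFrame({"age": [61, 58, 40]}, index=idx)

# Merge: join the two levels of every label with an underscore.
merged = toy.copy()
merged.index = ["_".join(pair) for pair in toy.index]

# Split: cut each label at the first underscore and rebuild a named MultiIndex.
split = merged.copy()
split.index = pd.MultiIndex.from_tuples(
    [tuple(label.split("_", 1)) for label in merged.index],
    names=["gender", "department"],
)

print(merged.index.tolist())
print(list(split.index.names))
```

Splitting with `split("_", 1)` cuts only at the first underscore, so a second-level label that itself contained an underscore would survive the round trip.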


In [52]:
# * Restore the index names to those of the original table
df_tmp=df_tmp.rename_axis(index=['gender','department'])
df_tmp.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,job_title,EmployeeID,birthdate_key,age,city_name
gender,department,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
M,Executive,CEO,1318,1/3/1954,61,Vancouver
F,Executive,VP Stores,1319,1/3/1957,58,Vancouver
F,Executive,Legal Counsel,1320,1/2/1955,60,Vancouver
M,Executive,VP Human Resources,1321,1/2/1959,56,Vancouver
M,Executive,VP Finance,1322,1/9/1958,57,Vancouver


In [54]:
# * Restore the default index and keep the columns in the same relative order as in the original table
df_tmp.reset_index().reindex(df.columns,axis=1)

Unnamed: 0,EmployeeID,birthdate_key,age,city_name,department,job_title,gender
0,1318,1/3/1954,61,Vancouver,Executive,CEO,M
1,1319,1/3/1957,58,Vancouver,Executive,VP Stores,F
2,1320,1/2/1955,60,Vancouver,Executive,Legal Counsel,F
3,1321,1/2/1959,56,Vancouver,Executive,VP Human Resources,M
4,1322,1/9/1958,57,Vancouver,Executive,VP Finance,M
...,...,...,...,...,...,...,...
6279,8036,8/9/1992,23,New Westminister,Customer Service,Cashier,F
6280,8181,9/26/1993,22,Prince George,Customer Service,Cashier,M
6281,8223,2/11/1994,21,Trail,Customer Service,Cashier,M
6282,8226,2/16/1994,21,Victoria,Customer Service,Cashier,F
