<center><h1>Chapter 3 Index</h1></center>

In [1]:
import numpy as np
import pandas as pd

## 1. Indexer
### 1. Table column index
Column indexing is the most common form of indexing and is generally done with `[]`. `[column name]` extracts the corresponding column from a `DataFrame` and returns a `Series`. For example, to extract the name column from the table:

In [2]:
df = pd.read_csv('../data/learn_pandas.csv', usecols = ['School', 'Grade', 'Name', 'Gender', 'Weight', 'Transfer'])
df['Name'].head()

0      Gaopeng Yang
1    Changqiang You
2           Mei Sun
3      Xiaojuan Sun
4       Gaojuan You
Name: Name, dtype: object

If you want to extract multiple columns, you can use `[list of column names]`, and the return value is a `DataFrame`. For example, extract the gender and name columns from the table:

In [3]:
df[['Gender', 'Name']].head()

Unnamed: 0,Gender,Name
0,Female,Gaopeng Yang
1,Male,Changqiang You
2,Male,Mei Sun
3,Female,Xiaojuan Sun
4,Male,Gaojuan You


In addition, if you want to extract a single column and the column name does not contain spaces, you can use `.column name` to extract it, which is equivalent to `[column name]`:

In [4]:
df.Name.head()

0      Gaopeng Yang
1    Changqiang You
2           Mei Sun
3      Xiaojuan Sun
4       Gaojuan You
Name: Name, dtype: object
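
Attribute access only works when the column name is a valid Python identifier and does not clash with an existing `DataFrame` attribute or method. The following is a minimal sketch on a small made-up table (the columns `Body Weight` and `count` are illustrative and not part of `learn_pandas.csv`), showing where `.column name` and `[column name]` agree and where only brackets work:

```python
import pandas as pd

# Hypothetical toy table; not the learn_pandas dataset.
toy = pd.DataFrame({
    'Name': ['Ann', 'Bob'],
    'Body Weight': [50.0, 70.0],   # the name contains a space
    'count': [1, 2],               # the name clashes with DataFrame.count
})

# Both forms return the same Series for a plain identifier.
same = toy.Name.equals(toy['Name'])

# A name with a space can only be reached with brackets.
weights = toy['Body Weight']

# toy.count is the method, not the column; brackets return the column.
is_method = callable(toy.count)
count_col = toy['count']
```

For this reason, bracket indexing is the safer choice whenever column names come from external data.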

### 2. Row index of a sequence

[a] `Series` with string index

If you want to extract the corresponding element of a single index, you can use `[item]`. If the `Series` has only a single corresponding value, the scalar value is returned. If there are multiple corresponding values, a `Series` is returned:

In [5]:
s = pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'a', 'a', 'a', 'c'])
s['a']

a    1
a    3
a    4
a    5
dtype: int64

In [6]:
s['b']

2

If you want to retrieve the corresponding elements of multiple indices, you can use `[list of items]`:

In [7]:
s[['c', 'b']]

c    6
b    2
dtype: int64

To take out the elements between two index labels that are each unique in the index, you can use a slice. Note that a label slice includes both endpoints:

In [8]:
s['c': 'b': -2]

c    6
a    4
b    2
dtype: int64

If the start or end label is duplicated in the index, the index must be sorted before slicing:

In [9]:
try:
    s['a': 'b']
except Exception as e:
    Err_Msg = e
Err_Msg

KeyError("Cannot get left slice bound for non-unique label: 'a'")

In [10]:
s.sort_index()['a': 'b']

a    1
a    3
a    4
a    5
b    2
dtype: int64

[b] `Series` with integer index

When using the data reading function, if the corresponding column is not specified as the index, an integer index starting from 0 will be generated as the default index. Of course, any set of integers that meet the length requirements can be used as an index.

Like strings, if `[int]` or `[int_list]` is used, the value of the corresponding index **element** can be retrieved:

In [11]:
s = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'], index=[1, 3, 1, 2, 5, 4])
s[1]

1    a
1    c
dtype: object

In [12]:
s[[2,3]]

2    d
3    b
dtype: object

If you use integer slicing, the value at the corresponding index **position** will be taken out. Note that the integer slicing here does not include the right endpoint, just like the slicing in `Python`:

In [13]:
s[1:-1:2]

3    b
2    d
dtype: object
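
Because `[]` on an integer index switches between label and position depending on what is passed, it can help to state the intent explicitly. The sketch below uses the `loc` and `iloc` indexers covered later in this section on the same kind of `Series` as above; the variable name `s_int` is only illustrative:

```python
import pandas as pd

s_int = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'], index=[1, 3, 1, 2, 5, 4])

by_label = s_int.loc[1]          # every row whose index label equals 1
by_position = s_int.iloc[1]      # the element at position 1 (the second one)
pos_slice = s_int.iloc[1:-1:2]   # positional slice, right endpoint excluded
```

Here `by_label` is a two-row `Series`, while `by_position` is a single scalar.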

#### 【WARNING】Instructions on index types

To avoid trouble, do not use a pure floating-point index or an index of mixed types (a mixture of strings, integers, floats, and so on). Such indexes may raise errors or return unexpected results in specific operations, and there is rarely any reason to use them in real data analysis.

#### 【END】

### 3. loc indexer

We talked about selecting columns of `DataFrame` earlier, and now we will discuss its row selection. For tables, there are two indexers, one is the **element**-based `loc` indexer, and the other is the **position**-based `iloc` indexer.

The general form of the `loc` indexer is `loc[*, *]`, where the first `*` represents the selection of rows, and the second `*` represents the selection of columns. If the second position is omitted and written as `loc[*]`, this `*` refers to the selection of rows. Among them, there are five types of legal objects in the `*` position, namely: single element, element list, element slice, Boolean list and function, which will be explained in turn below.

To demonstrate the corresponding operation, we first use the `set_index` method to set the `Name` column as an index. Other uses of this function will be introduced in the chapter on multi-level indexing.

In [14]:
df_demo = df.set_index('Name')
df_demo.head()

Unnamed: 0_level_0,School,Grade,Gender,Weight,Transfer
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Gaopeng Yang,Shanghai Jiao Tong University,Freshman,Female,46.0,N
Changqiang You,Peking University,Freshman,Male,70.0,N
Mei Sun,Shanghai Jiao Tong University,Senior,Male,89.0,N
Xiaojuan Sun,Fudan University,Sophomore,Female,41.0,N
Gaojuan You,Fudan University,Sophomore,Male,74.0,N


【a】`*` is a single element

In this case, the corresponding row or column is extracted directly. If the element is repeated in the index, the result is a `DataFrame`; otherwise it is a `Series`:

In [15]:
df_demo.loc['Qiang Sun'] # several people share this name

Unnamed: 0_level_0,School,Grade,Gender,Weight,Transfer
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Qiang Sun,Tsinghua University,Junior,Female,53.0,N
Qiang Sun,Tsinghua University,Sophomore,Female,40.0,N
Qiang Sun,Shanghai Jiao Tong University,Junior,Female,,N


In [16]:
df_demo.loc['Quan Zhao'] # the name is unique

School      Shanghai Jiao Tong University
Grade                              Junior
Gender                             Female
Weight                               53.0
Transfer                                N
Name: Quan Zhao, dtype: object

It is also possible to select rows and columns simultaneously:

In [17]:
df_demo.loc['Qiang Sun', 'School'] # returns a Series

Name
Qiang Sun              Tsinghua University
Qiang Sun              Tsinghua University
Qiang Sun    Shanghai Jiao Tong University
Name: School, dtype: object

In [18]:
df_demo.loc['Quan Zhao', 'School'] # returns a single element

'Shanghai Jiao Tong University'

【b】`*` is an element list

In this case, the rows or columns corresponding to all element values in the list are extracted:

In [19]:
df_demo.loc[['Qiang Sun','Quan Zhao'], ['School','Gender']]

Unnamed: 0_level_0,School,Gender
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
Qiang Sun,Tsinghua University,Female
Qiang Sun,Tsinghua University,Female
Qiang Sun,Shanghai Jiao Tong University,Female
Quan Zhao,Shanghai Jiao Tong University,Female


【c】`*` is a slice

As mentioned for string-indexed `Series`, a slice can be used when the start and end labels are unique, and both endpoints are included; if they are not unique, an error is raised unless the index is sorted first. For example:

In [20]:
df_demo.loc['Gaojuan You':'Gaoqiang Qian', 'School':'Gender']

Unnamed: 0_level_0,School,Grade,Gender
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Gaojuan You,Fudan University,Sophomore,Male
Xiaoli Qian,Tsinghua University,Freshman,Female
Qiang Chu,Shanghai Jiao Tong University,Freshman,Female
Gaoqiang Qian,Tsinghua University,Junior,Female


Note that if a `DataFrame` uses an integer index, integer slices in `loc` follow the same rules as the string slices above: they are **element** slices that include both endpoints, and the start and end labels must not be duplicated.

In [21]:
df_loc_slice_demo = df_demo.copy()
df_loc_slice_demo.index = range(df_demo.shape[0],0,-1)
df_loc_slice_demo.loc[5:3]

Unnamed: 0,School,Grade,Gender,Weight,Transfer
5,Fudan University,Junior,Female,46.0,N
4,Tsinghua University,Senior,Female,50.0,N
3,Shanghai Jiao Tong University,Senior,Female,45.0,N


In [22]:
df_loc_slice_demo.loc[3:5] # empty result, which shows this is not positional integer slicing

Unnamed: 0,School,Grade,Gender,Weight,Transfer


【d】`*` is a Boolean list

In actual data processing, it is very common to filter rows based on conditions. Here, the Boolean list passed to `loc` must have the same length as the `DataFrame`; rows at positions where the list is `True` are selected, and rows where it is `False` are dropped.

For example, select students weighing more than 70kg:

In [23]:
df_demo.loc[df_demo.Weight>70].head()

Unnamed: 0_level_0,School,Grade,Gender,Weight,Transfer
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Mei Sun,Shanghai Jiao Tong University,Senior,Male,89.0,N
Gaojuan You,Fudan University,Sophomore,Male,74.0,N
Xiaopeng Zhou,Shanghai Jiao Tong University,Freshman,Male,74.0,N
Xiaofeng Sun,Tsinghua University,Senior,Male,71.0,N
Qiang Zheng,Shanghai Jiao Tong University,Senior,Male,87.0,N


Selection by an element list, as described above, can be written equivalently with the Boolean list returned by the `isin` method. For example, to select all the information of freshmen and seniors:

In [24]:
df_demo.loc[df_demo.Grade.isin(['Freshman', 'Senior'])].head()

Unnamed: 0_level_0,School,Grade,Gender,Weight,Transfer
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Gaopeng Yang,Shanghai Jiao Tong University,Freshman,Female,46.0,N
Changqiang You,Peking University,Freshman,Male,70.0,N
Mei Sun,Shanghai Jiao Tong University,Senior,Male,89.0,N
Xiaoli Qian,Tsinghua University,Freshman,Female,51.0,N
Qiang Chu,Shanghai Jiao Tong University,Freshman,Female,52.0,N


For compound conditions, combine `|` (or), `&` (and), and `~` (negation). For example, select the senior students at Fudan University who weigh more than 70kg, or the students at Peking University who weigh more than 80kg and are not seniors:

In [25]:
condition_1_1 = df_demo.School == 'Fudan University'
condition_1_2 = df_demo.Grade == 'Senior'
condition_1_3 = df_demo.Weight > 70
condition_1 = condition_1_1 & condition_1_2 & condition_1_3
condition_2_1 = df_demo.School == 'Peking University'
condition_2_2 = df_demo.Grade == 'Senior'
condition_2_3 = df_demo.Weight > 80
condition_2 = condition_2_1 & (~condition_2_2) & condition_2_3
df_demo.loc[condition_1 | condition_2]

Unnamed: 0_level_0,School,Grade,Gender,Weight,Transfer
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Qiang Han,Peking University,Freshman,Male,87.0,N
Chengpeng Zhou,Fudan University,Senior,Male,81.0,N
Changpeng Zhao,Peking University,Freshman,Male,83.0,N
Chengpeng Qian,Fudan University,Senior,Male,73.0,Y
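
In Python, `&`, `|`, and `~` bind more tightly than comparison operators such as `==` and `>`, so each comparison has to be wrapped in parentheses when the conditions are written inline rather than stored in variables first. A minimal sketch on a made-up three-row table (not the `learn_pandas` data):

```python
import pandas as pd

# Hypothetical toy table for illustration only.
toy = pd.DataFrame({'Grade': ['Senior', 'Junior', 'Senior'],
                    'Weight': [75.0, 82.0, 60.0]})

# Each comparison is parenthesized before being combined with &.
senior_heavy = toy.loc[(toy.Grade == 'Senior') & (toy.Weight > 70)]

# ~ negates a whole parenthesized condition.
other_heavy = toy.loc[~(toy.Grade == 'Senior') & (toy.Weight > 80)]

# | keeps rows that satisfy either condition.
either = toy.loc[(toy.Grade == 'Junior') | (toy.Weight < 65)]
```

Storing each condition in its own variable, as in the cell above, avoids the issue because every comparison is evaluated before it is combined.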


#### 【Practice】

`select_dtypes` is a practical function that can select columns of the corresponding type from a table. To select all numeric columns, just use `.select_dtypes('number')`. Please use the Boolean list selection method combined with the `dtypes` attribute of `DataFrame` to implement this function on the `learn_pandas` dataset.

#### 【END】

【e】`*` is a function

The function here must return one of the four legal forms mentioned above, and its input is the `DataFrame` itself. Taking the compound-condition example above, the logic can be wrapped in a function that returns the Boolean result. Note that the formal parameter `x` of the function is essentially `df_demo`:

In [26]:
def condition(x):
    condition_1_1 = x.School == 'Fudan University'
    condition_1_2 = x.Grade == 'Senior'
    condition_1_3 = x.Weight > 70
    condition_1 = condition_1_1 & condition_1_2 & condition_1_3
    condition_2_1 = x.School == 'Peking University'
    condition_2_2 = x.Grade == 'Senior'
    condition_2_3 = x.Weight > 80
    condition_2 = condition_2_1 & (~condition_2_2) & condition_2_3
    result = condition_1 | condition_2
    return result
df_demo.loc[condition]

Unnamed: 0_level_0,School,Grade,Gender,Weight,Transfer
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Qiang Han,Peking University,Freshman,Male,87.0,N
Chengpeng Zhou,Fudan University,Senior,Male,81.0,N
Changpeng Zhao,Peking University,Freshman,Male,83.0,N
Chengpeng Qian,Fudan University,Senior,Male,73.0,Y


In addition, lambda expressions are also supported, and their return value must also be one of the four forms mentioned previously:

In [27]:
df_demo.loc[lambda x:'Quan Zhao', lambda x:'Gender']

'Female'

Since the function cannot return a slice format such as `start: end: step`, the slice must be wrapped with a `slice` object when returning it:

In [28]:
df_demo.loc[lambda x: slice('Gaojuan You', 'Gaoqiang Qian')]

Unnamed: 0_level_0,School,Grade,Gender,Weight,Transfer
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Gaojuan You,Fudan University,Sophomore,Male,74.0,N
Xiaoli Qian,Tsinghua University,Freshman,Female,51.0,N
Qiang Chu,Shanghai Jiao Tong University,Freshman,Female,52.0,N
Gaoqiang Qian,Tsinghua University,Junior,Female,50.0,N
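
One situation where passing a function is particularly convenient is method chaining, because the intermediate table has no variable name to refer to. The following sketch assumes a made-up table and a hypothetical derived column `Weight_lb` (the factor 2.2 is only a rough kg-to-lb conversion):

```python
import pandas as pd

# Hypothetical toy table for illustration only.
toy = pd.DataFrame({'Name': ['Ann', 'Bob', 'Cid'],
                    'Weight': [50.0, 70.0, 90.0]}).set_index('Name')

# The lambda passed to loc receives the table returned by assign(),
# so it can filter on the column that was just created.
heavy = (toy
         .assign(Weight_lb=lambda x: x.Weight * 2.2)
         .loc[lambda x: x.Weight_lb > 150, ['Weight_lb']])
```

Without the callable form, the result of `assign` would first have to be stored in a temporary variable before it could be filtered.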


Finally, note that the `loc` indexer can also be used on a `Series`. It follows exactly the same rules as `loc[*]` for row selection in a `DataFrame`, so the details are not repeated here.

#### 【WARNING】Do not use chained assignment

When assigning values to a table or sequence, perform the assignment directly after a single indexer. After chained indexing, the value is assigned to a temporary copy, so the original elements are not actually modified and pandas raises a `SettingWithCopyWarning`. For example:

In [29]:
df_chain = pd.DataFrame([[0,0],[1,0],[-1,0]], columns=list('AB'))
df_chain
import warnings
with warnings.catch_warnings():
    warnings.filterwarnings('error')
    try:
        df_chain[df_chain.A!=0].B = 1 # bracket indexing followed by dot column access
    except Warning as w:
        Warning_Msg = w
print(Warning_Msg)
df_chain


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,A,B
0,0,0
1,1,0
2,-1,0


In [30]:
df_chain.loc[df_chain.A!=0,'B'] = 1
df_chain

Unnamed: 0,A,B
0,0,0
1,1,1
2,-1,1


#### 【END】

### 4. iloc indexer

The usage of `iloc` is exactly the same as `loc`, except that it is based on the position for filtering. There are also five types of legal objects at the corresponding `*` position, namely: integer, integer list, integer slice, Boolean list and function. The return value of the function must be one of the above four types of legal objects, and its input is also `DataFrame` itself.

In [31]:
df_demo.iloc[1, 1] # second row, second column

'Freshman'

In [32]:
df_demo.iloc[[0, 1], [0, 1]] # first two rows, first two columns

Unnamed: 0_level_0,School,Grade
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
Gaopeng Yang,Shanghai Jiao Tong University,Freshman
Changqiang You,Peking University,Freshman


In [33]:
df_demo.iloc[1: 4, 2:4] # the slice excludes the end point

Unnamed: 0_level_0,Gender,Weight
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
Changqiang You,Male,70.0
Mei Sun,Male,89.0
Xiaojuan Sun,Female,41.0


In [34]:
df_demo.iloc[lambda x: slice(1, 4)] # pass a function that returns a slice

Unnamed: 0_level_0,School,Grade,Gender,Weight,Transfer
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Changqiang You,Peking University,Freshman,Male,70.0,N
Mei Sun,Shanghai Jiao Tong University,Senior,Male,89.0,N
Xiaojuan Sun,Fudan University,Sophomore,Female,41.0,N


When using a Boolean list with `iloc`, pass the `values` of the Boolean sequence rather than the `Series` itself; otherwise an error is raised. For this reason, `loc` should be preferred for Boolean filtering.

For example, to select students weighing more than 80kg:

In [35]:
df_demo.iloc[(df_demo.Weight>80).values].head()

Unnamed: 0_level_0,School,Grade,Gender,Weight,Transfer
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Mei Sun,Shanghai Jiao Tong University,Senior,Male,89.0,N
Qiang Zheng,Shanghai Jiao Tong University,Senior,Male,87.0,N
Qiang Han,Peking University,Freshman,Male,87.0,N
Chengpeng Zhou,Fudan University,Senior,Male,81.0,N
Feng Han,Shanghai Jiao Tong University,Sophomore,Male,82.0,N
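
The difference can be checked on a small made-up table. In this sketch the exception is caught broadly, since the exact exception type raised by `iloc` for a `Series` mask is an assumption that may vary between pandas versions; `to_numpy()` is used as an equivalent of `.values`:

```python
import pandas as pd

# Hypothetical toy table for illustration only.
toy = pd.DataFrame({'Weight': [60.0, 85.0, 90.0]}, index=['Ann', 'Bob', 'Cid'])
mask = toy.Weight > 80                 # Boolean Series carrying an index

try:
    toy.iloc[mask]                     # a Series mask is rejected by iloc
    rejected = False
except (ValueError, NotImplementedError):
    rejected = True

via_iloc = toy.iloc[mask.to_numpy()]   # a plain Boolean array is accepted
via_loc = toy.loc[mask]                # loc accepts the Series directly
```

Both working forms return the same two rows.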


For `Series`, you can also use `iloc` to return the value or subsequence at the corresponding position:

In [36]:
df_demo.School.iloc[1]

'Peking University'

In [37]:
df_demo.School.iloc[1:5:2]

Name
Changqiang You    Peking University
Xiaojuan Sun       Fudan University
Name: School, dtype: object

### 5. query method

In `pandas`, you can pass a query expression as a string to the `query` method to filter data. The expression must evaluate to a Boolean list. For complex conditions, this approach avoids repeating the `DataFrame` name in front of every column name, which generally shortens the code without reducing readability.

For example, the compound condition query example in the `loc` section can be rewritten as follows:

In [38]:
df.query('((School == "Fudan University")&'
         ' (Grade == "Senior")&'
         ' (Weight > 70))|'
         '((School == "Peking University")&'
         ' (Grade != "Senior")&'
         ' (Weight > 80))')

Unnamed: 0,School,Grade,Name,Gender,Weight,Transfer
38,Peking University,Freshman,Qiang Han,Male,87.0,N
66,Fudan University,Senior,Chengpeng Zhou,Male,81.0,N
99,Peking University,Freshman,Changpeng Zhao,Male,83.0,N
131,Fudan University,Senior,Chengpeng Qian,Male,73.0,Y


In the query expression, all column names of the `DataFrame` are available as variables, and any `Series` method can be called on them just as in ordinary code. For example, to query students whose weight exceeds the mean:

In [39]:
df.query('Weight > Weight.mean()').head()

Unnamed: 0,School,Grade,Name,Gender,Weight,Transfer
1,Peking University,Freshman,Changqiang You,Male,70.0,N
2,Shanghai Jiao Tong University,Senior,Mei Sun,Male,89.0,N
4,Fudan University,Sophomore,Gaojuan You,Male,74.0,N
10,Shanghai Jiao Tong University,Freshman,Xiaopeng Zhou,Male,74.0,N
14,Tsinghua University,Senior,Xiaomei Zhou,Female,57.0,N


#### 【NOTE】Quote column names with spaces in query

For column names with spaces, you need to use `` `col name` `` to quote them.

#### 【END】
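
As a minimal sketch of the backtick syntax, consider a made-up table whose column names contain spaces (these columns are illustrative and not from `learn_pandas.csv`):

```python
import pandas as pd

# Hypothetical toy table for illustration only.
toy = pd.DataFrame({'Student Name': ['Ann', 'Bob', 'Cid'],
                    'Body Weight': [50.0, 70.0, 90.0]})

# Backticks let query treat `Body Weight` as a single column name.
heavy = toy.query('`Body Weight` > 60')
```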

In addition, `query` registers several English keywords to improve readability, such as `and`, `or`, `not`, `in`, and `not in`. For example, to select male students who are not freshmen or sophomores:

In [40]:
df.query('(Grade not in ["Freshman", "Sophomore"]) and (Gender == "Male")').head()

Unnamed: 0,School,Grade,Name,Gender,Weight,Transfer
2,Shanghai Jiao Tong University,Senior,Mei Sun,Male,89.0,N
16,Tsinghua University,Junior,Xiaoqiang Qin,Male,68.0,N
17,Tsinghua University,Junior,Peng Wang,Male,65.0,N
18,Tsinghua University,Senior,Xiaofeng Sun,Male,71.0,N
21,Shanghai Jiao Tong University,Senior,Xiaopeng Shen,Male,62.0,


In addition, when a comparison with a list appears in a string, `==` and `!=` respectively indicate that the element appears in the list and does not appear in the list, which is equivalent to `in` and `not in`. For example, to query all junior and senior students:

In [41]:
df.query('Grade == ["Junior", "Senior"]').head()

Unnamed: 0,School,Grade,Name,Gender,Weight,Transfer
2,Shanghai Jiao Tong University,Senior,Mei Sun,Male,89.0,N
7,Tsinghua University,Junior,Gaoqiang Qian,Female,50.0,N
9,Peking University,Junior,Juan Xu,Female,,N
11,Tsinghua University,Junior,Xiaoquan Lv,Female,43.0,N
12,Shanghai Jiao Tong University,Senior,Peng You,Female,48.0,


For the string in `query`, if you want to reference an external variable, just add the `@` symbol before the variable name. For example, to retrieve students whose weight is between 70kg and 80kg:

In [42]:
low, high = 70, 80
df.query('(Weight >= @low) & (Weight <= @high)').head()

Unnamed: 0,School,Grade,Name,Gender,Weight,Transfer
1,Peking University,Freshman,Changqiang You,Male,70.0,N
4,Fudan University,Sophomore,Gaojuan You,Male,74.0,N
10,Shanghai Jiao Tong University,Freshman,Xiaopeng Zhou,Male,74.0,N
18,Tsinghua University,Senior,Xiaofeng Sun,Male,71.0,N
35,Peking University,Freshman,Gaoli Zhao,Male,78.0,N
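
To make the correspondence with the Boolean-list approach concrete, the sketch below writes the same filter twice on a made-up table, once with `query` and `@` variables and once with `loc`, and the two results should agree:

```python
import pandas as pd

# Hypothetical toy table for illustration only.
toy = pd.DataFrame({'Gender': ['Male', 'Female', 'Male', 'Male'],
                    'Weight': [70.0, 75.0, 85.0, 78.0]})
low, high = 70, 80

# query: column names are used directly, outer variables need @.
by_query = toy.query('(Weight >= @low) & (Weight <= @high) & (Gender == "Male")')

# loc: the table name must be repeated for every column.
by_mask = toy.loc[(toy.Weight >= low) & (toy.Weight <= high) & (toy.Gender == 'Male')]
```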


### 6. Random Sampling

If you regard each row of a `DataFrame` as a sample, each column as a feature, and the entire `DataFrame` as a population, you can use the `sample` function to randomly sample rows or features. With a large data set, computing statistics on the full table to understand the approximate distribution of the data can be very time-consuming. Under simple random sampling with equal probability and without replacement, many sample statistics are unbiased estimates of the corresponding population statistics (the sample mean for the population mean, for example), so you can first draw part of the table and make an approximate estimate from it.

The main parameters in the `sample` function are `n, axis, frac, replace, weights`. The first three refer to the number of samples, the direction of sampling (0 for rows and 1 for columns), and the sampling ratio (0.3 means 30% of the samples are drawn from the population).

`replace` and `weights` refer to whether to replace and the relative probability of sampling each sample, respectively. When `replace = True`, it means sampling with replacement. For example, the following constructed `df_sample` uses the relative size of `value` as the sampling probability to perform a sampling with replacement, and the number of samples is 3.

In [43]:
df_sample = pd.DataFrame({'id': list('abcde'), 'value': [1, 2, 3, 4, 90]})
df_sample

Unnamed: 0,id,value
0,a,1
1,b,2
2,c,3
3,d,4
4,e,90


In [44]:
df_sample.sample(3, replace = True, weights = df_sample.value)

Unnamed: 0,id,value
4,e,90
4,e,90
4,e,90
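
The other parameters can be sketched on the same `df_sample` table. The `random_state` argument used here is not discussed above; it is an additional `sample` parameter that fixes the random seed so that a draw can be repeated:

```python
import pandas as pd

df_sample = pd.DataFrame({'id': list('abcde'), 'value': [1, 2, 3, 4, 90]})

# Same seed, same draw: useful when results must be repeatable.
first = df_sample.sample(n=3, random_state=0)
second = df_sample.sample(n=3, random_state=0)

# frac draws a proportion of the rows: 0.4 of 5 rows gives 2 rows.
part = df_sample.sample(frac=0.4, random_state=0)

# axis=1 samples columns (features) instead of rows.
one_col = df_sample.sample(n=1, axis=1, random_state=0)
```

Since `replace` defaults to `False`, no row appears twice in `first` or `part`.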


## 2. Multi-level index
### 1. Multi-level index and its table structure

In order to more clearly illustrate the `DataFrame` structure with multi-level index, a new table is constructed below. Readers can ignore the construction method here, which will be explained in more detail in Section 4.

In [45]:
np.random.seed(0)
multi_index = pd.MultiIndex.from_product([list('ABCD'), df.Gender.unique()], names=('School', 'Gender'))
multi_column = pd.MultiIndex.from_product([['Height', 'Weight'], df.Grade.unique()], names=('Indicator', 'Grade'))
df_multi = pd.DataFrame(np.c_[(np.random.randn(8,4)*5 + 163).tolist(), (np.random.randn(8,4)*5 + 65).tolist()],
                        index = multi_index, columns = multi_column).round(1)
df_multi

Unnamed: 0_level_0,Indicator,Height,Height,Height,Height,Weight,Weight,Weight,Weight
Unnamed: 0_level_1,Grade,Freshman,Senior,Sophomore,Junior,Freshman,Senior,Sophomore,Junior
School,Gender,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
A,Female,171.8,165.0,167.9,174.2,60.6,55.1,63.3,65.8
A,Male,172.3,158.1,167.8,162.2,71.2,71.0,63.1,63.5
B,Female,162.5,165.1,163.7,170.3,59.8,57.9,56.5,74.8
B,Male,166.8,163.6,165.2,164.7,62.5,62.8,58.7,68.9
C,Female,170.5,162.0,164.6,158.7,56.9,63.9,60.5,66.9
C,Male,150.2,166.3,167.3,159.3,62.4,59.1,64.9,67.1
D,Female,174.3,155.7,163.2,162.1,65.3,66.5,61.8,63.2
D,Male,170.7,170.3,163.8,164.9,61.6,63.2,60.9,56.4


The figure below uses colors to mark the structure of the `DataFrame`. Like a table with a single-level index, it has three parts: element values, row index, and column index. Here the row index and column index are both of type `MultiIndex`, and **each element of the index is a tuple** rather than a scalar as in a single-level index. For example, the fourth element of the row index is `("B", "Male")`, and the second element of the column index is `("Height", "Senior")`. Note that when the same value appears consecutively in the outer level, every occurrence after the first is hidden in the display, which makes the result more readable.

<img src="../source/_static/multi_index.png" width="50%">

Similar to a single-layer index, `MultiIndex` also has a name attribute. In the figure, `School` and `Gender` correspond to the names of the first and second-layer row indexes of the table, respectively, and `Indicator` and `Grade` correspond to the names of the first and second-layer column indexes, respectively.

The names and values of the index are available via the `names` and `values` attributes, respectively:

In [46]:
df_multi.index.names

FrozenList(['School', 'Gender'])

In [47]:
df_multi.columns.names

FrozenList(['Indicator', 'Grade'])

In [48]:
df_multi.index.values

array([('A', 'Female'), ('A', 'Male'), ('B', 'Female'), ('B', 'Male'),
       ('C', 'Female'), ('C', 'Male'), ('D', 'Female'), ('D', 'Male')],
      dtype=object)

In [49]:
df_multi.columns.values

array([('Height', 'Freshman'), ('Height', 'Senior'),
       ('Height', 'Sophomore'), ('Height', 'Junior'),
       ('Weight', 'Freshman'), ('Weight', 'Senior'),
       ('Weight', 'Sophomore'), ('Weight', 'Junior')], dtype=object)

If you want to get the index of a certain level, you need to get it through `get_level_values`:

In [50]:
df_multi.index.get_level_values(0)

Index(['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'], dtype='object', name='School')

However, for indexes, whether single-level or multi-level, users cannot modify elements by `index_obj[0] = item`, nor can they modify names by `index_name[0] = new_name`. The topic of how to modify these attributes will be discussed in Section 3.
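
The read-only behaviour can be observed directly. This sketch builds a small `MultiIndex` of its own (the replacement values `('C', 'Female')` and `'Campus'` are arbitrary) and records which in-place assignments are rejected:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['A', 'B'], ['Female', 'Male']],
                                 names=('School', 'Gender'))

rejected = []
try:
    idx[0] = ('C', 'Female')      # index elements cannot be assigned in place
except TypeError:
    rejected.append('element')

try:
    idx.names[0] = 'Campus'       # names is a FrozenList, also read-only
except TypeError:
    rejected.append('name')
```

After both attempts the index is unchanged.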

### 2. Loc indexer in multi-level index

After getting familiar with the structure, now return to the original table and set school and grade as indexes. At this time, the rows are multi-level indexes and the columns are single-level indexes. Since the default column index does not contain a name, the index name positions corresponding to `Indicator` and `Grade` in the figure just now are empty.

In [51]:
df_multi = df.set_index(['School', 'Grade'])
df_multi.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Gender,Weight,Transfer
School,Grade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Shanghai Jiao Tong University,Freshman,Gaopeng Yang,Female,46.0,N
Peking University,Freshman,Changqiang You,Male,70.0,N
Shanghai Jiao Tong University,Senior,Mei Sun,Male,89.0,N
Fudan University,Sophomore,Xiaojuan Sun,Female,41.0,N
Fudan University,Sophomore,Gaojuan You,Male,74.0,N


Since a single element of a multi-level index is a tuple, the `loc` and `iloc` usage introduced in the first section carries over directly: just replace the scalar with the corresponding tuple.

When passing in a list of tuples or a single tuple, or a function that returns the former or the latter, you need to sort the indexes first to avoid performance warnings:

In [52]:
with warnings.catch_warnings():
    warnings.filterwarnings('error')
    try:
        df_multi.loc[('Fudan University', 'Junior')].head()
    except Warning as w:
        Warning_Msg = w
Warning_Msg



In [53]:
df_sorted = df_multi.sort_index()
df_sorted.loc[('Fudan University', 'Junior')].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Gender,Weight,Transfer
School,Grade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Fudan University,Junior,Yanli You,Female,48.0,N
Fudan University,Junior,Chunqiang Chu,Male,72.0,N
Fudan University,Junior,Changfeng Lv,Male,76.0,N
Fudan University,Junior,Yanjuan Lv,Female,49.0,
Fudan University,Junior,Gaoqiang Zhou,Female,43.0,N


In [54]:
df_sorted.loc[[('Fudan University', 'Senior'), ('Shanghai Jiao Tong University', 'Freshman')]].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Gender,Weight,Transfer
School,Grade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Fudan University,Senior,Chengpeng Zheng,Female,38.0,N
Fudan University,Senior,Feng Zhou,Female,47.0,N
Fudan University,Senior,Gaomei Lv,Female,34.0,N
Fudan University,Senior,Chunli Lv,Female,56.0,N
Fudan University,Senior,Chengpeng Zhou,Male,81.0,N


In [55]:
df_sorted.loc[df_sorted.Weight > 70].head() # Boolean lists also work

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Gender,Weight,Transfer
School,Grade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Fudan University,Freshman,Feng Wang,Male,74.0,N
Fudan University,Junior,Chunqiang Chu,Male,72.0,N
Fudan University,Junior,Changfeng Lv,Male,76.0,N
Fudan University,Senior,Chengpeng Zhou,Male,81.0,N
Fudan University,Senior,Chengpeng Qian,Male,73.0,Y


In [56]:
df_sorted.loc[lambda x:('Fudan University','Junior')].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Gender,Weight,Transfer
School,Grade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Fudan University,Junior,Yanli You,Female,48.0,N
Fudan University,Junior,Chunqiang Chu,Male,72.0,N
Fudan University,Junior,Changfeng Lv,Male,76.0,N
Fudan University,Junior,Yanjuan Lv,Female,49.0,
Fudan University,Junior,Gaoqiang Zhou,Female,43.0,N


When using slices, please note that in a single-level index, as long as the slice endpoint elements are unique, then slicing can be performed. However, in a multi-level index, regardless of whether the tuples appear repeatedly in the index, they must be sorted before slicing can be used, otherwise an error will be reported:

In [57]:
try:
    df_multi.loc[('Fudan University', 'Senior'):].head()
except Exception as e:
    Err_Msg = e
Err_Msg

pandas.errors.UnsortedIndexError('Key length (2) was greater than MultiIndex lexsort depth (0)')

In [58]:
df_sorted.loc[('Fudan University', 'Senior'):].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Gender,Weight,Transfer
School,Grade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Fudan University,Senior,Chengpeng Zheng,Female,38.0,N
Fudan University,Senior,Feng Zhou,Female,47.0,N
Fudan University,Senior,Gaomei Lv,Female,34.0,N
Fudan University,Senior,Chunli Lv,Female,56.0,N
Fudan University,Senior,Chengpeng Zhou,Male,81.0,N


In [59]:
df_unique = df.drop_duplicates(subset=['School','Grade']).set_index(['School', 'Grade'])
df_unique.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Gender,Weight,Transfer
School,Grade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Shanghai Jiao Tong University,Freshman,Gaopeng Yang,Female,46.0,N
Peking University,Freshman,Changqiang You,Male,70.0,N
Shanghai Jiao Tong University,Senior,Mei Sun,Male,89.0,N
Fudan University,Sophomore,Xiaojuan Sun,Female,41.0,N
Tsinghua University,Freshman,Xiaoli Qian,Female,51.0,N


In [60]:
try:
    df_unique.loc[('Fudan University', 'Senior'):].head()
except Exception as e:
    Err_Msg = e
Err_Msg

pandas.errors.UnsortedIndexError('Key length (2) was greater than MultiIndex lexsort depth (0)')

In [61]:
df_unique.sort_index().loc[('Fudan University', 'Senior'):].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Gender,Weight,Transfer
School,Grade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Fudan University,Senior,Chengpeng Zheng,Female,38.0,N
Fudan University,Sophomore,Xiaojuan Sun,Female,41.0,N
Peking University,Freshman,Changqiang You,Male,70.0,N
Peking University,Junior,Juan Xu,Female,,N
Peking University,Senior,Changli Lv,Female,41.0,N
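
The same behaviour can be reproduced on a tiny made-up table, which also shows one way to check whether an index is already sorted before slicing. The use of `is_monotonic_increasing` here is a suggestion and does not come from the cells above:

```python
import pandas as pd

# Hypothetical unsorted two-level index with unique tuples.
idx = pd.MultiIndex.from_tuples(
    [('b', 2), ('a', 1), ('b', 1), ('a', 2)], names=('Upper', 'Lower'))
toy = pd.DataFrame({'value': [10, 20, 30, 40]}, index=idx)

sorted_before = toy.index.is_monotonic_increasing   # False: not sorted yet

try:
    toy.loc[('a', 2):]               # slicing an unsorted MultiIndex fails
    raised = False
except pd.errors.UnsortedIndexError:
    raised = True

toy_sorted = toy.sort_index()
tail = toy_sorted.loc[('a', 2):]     # rows from ('a', 2) to the end
```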


In addition, multi-level indexes support a special use of tuples that cross-combines the elements of several levels. In this form the column position of `loc` must be specified, with `:` selecting all columns. The elements to select at each level are stored in a list, and the form passed to `loc` is `[(level_0_list, level_1_list), cols]`. For example, to get all the sophomores and juniors of Peking University and Fudan University:

In [62]:
res = df_multi.loc[(['Peking University', 'Fudan University'], ['Sophomore', 'Junior']), :]
res.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Gender,Weight,Transfer
School,Grade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Peking University,Sophomore,Changmei Xu,Female,43.0,N
Peking University,Sophomore,Xiaopeng Qin,Male,,N
Peking University,Sophomore,Mei Xu,Female,39.0,N
Peking University,Sophomore,Xiaoli Zhou,Female,55.0,N
Peking University,Sophomore,Peng Han,Female,34.0,


In [63]:
res.shape

(33, 4)

The following statement looks similar, but it passes in a list whose elements are tuples. Its meaning is different: it selects the juniors of Peking University and the sophomores of Fudan University:

In [64]:
res = df_multi.loc[[('Peking University', 'Junior'), ('Fudan University', 'Sophomore')]]
res.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Name,Gender,Weight,Transfer
School,Grade,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Peking University,Junior,Juan Xu,Female,,N
Peking University,Junior,Changjuan You,Female,47.0,N
Peking University,Junior,Gaoli Xu,Female,48.0,N
Peking University,Junior,Gaoquan Zhou,Male,70.0,N
Peking University,Junior,Qiang You,Female,56.0,N


In [65]:
res.shape

(16, 4)
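The difference between the two forms can be checked on a small table. The following is a minimal sketch on a made-up frame (`df_demo`, its shortened school names, and its weights are assumptions for illustration, not part of the dataset above): a tuple of lists selects every combination of the listed elements, while a list of tuples selects only the listed pairs.

```python
import pandas as pd

# Hypothetical toy table: 2 schools x 3 grades, one row per pair.
index = pd.MultiIndex.from_product(
    [['Fudan', 'Peking'], ['Junior', 'Senior', 'Sophomore']],
    names=['School', 'Grade'])
df_demo = pd.DataFrame({'Weight': [50, 51, 52, 53, 54, 55]}, index=index)

# Tuple of lists: cross-combination of both levels (2 x 2 = 4 rows).
cross = df_demo.loc[(['Peking', 'Fudan'], ['Sophomore', 'Junior']), :]

# List of tuples: only the two listed (School, Grade) pairs (2 rows).
pairs = df_demo.loc[[('Peking', 'Junior'), ('Fudan', 'Sophomore')]]

print(cross.shape, pairs.shape)
```

With one row per pair, the row counts make the distinction visible: four rows for the cross-combination and two rows for the explicit pairs.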

### 3. IndexSlice object

The methods introduced above can only slice by whole tuples, not level by level, even when the index has no duplicates. Mixing slices with Boolean lists is also not allowed. The `IndexSlice` object solves this problem. It can be used in two forms: the `loc[idx[*,*]]` type and the `loc[idx[*,*],idx[*,*]]` type, both introduced below. For the convenience of demonstration, the following constructs a `DataFrame` with non-duplicate indexes:

In [66]:
np.random.seed(0)
L1,L2 = ['A','B','C'],['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
L3,L4 = ['D','E','F'],['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
df_ex = pd.DataFrame(np.random.randint(-9,10,(9,9)), index=mul_index1, columns=mul_index2)
df_ex

Unnamed: 0_level_0,Big,D,D,D,E,E,E,F,F,F
Unnamed: 0_level_1,Small,d,e,f,d,e,f,d,e,f
Upper,Lower,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
A,a,3,6,-9,-6,-6,-2,0,9,-5
A,b,-3,3,-8,-3,-2,5,8,-4,4
A,c,-1,0,7,-4,6,6,-9,9,-6
B,a,8,5,-2,-9,-8,0,-9,1,-6
B,b,2,9,-7,-9,-9,-5,-4,-3,-1
B,c,8,6,-5,0,1,-8,-8,-2,0
C,a,-6,-3,2,5,9,-9,5,-6,3
C,b,1,2,-5,-3,-5,6,-6,3,-5
C,c,-1,5,6,-6,6,4,7,8,-4


In order to use the slice object, you must first define it:

In [67]:
idx = pd.IndexSlice

[a] `loc[idx[*,*]]` type

In this form, the levels cannot be sliced separately. The first `*` indicates the selection of rows, and the second `*` indicates the selection of columns, which is similar to the simple `loc`:

In [68]:
df_ex.loc[idx['C':, ('D', 'f'):]]

Unnamed: 0_level_0,Big,D,E,E,E,F,F,F
Unnamed: 0_level_1,Small,f,d,e,f,d,e,f
Upper,Lower,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
C,a,2,5,9,-9,5,-6,3
C,b,-5,-3,-5,6,-6,3,-5
C,c,6,-6,6,4,7,8,-4


Additionally, indexing of Boolean sequences is supported:

In [69]:
df_ex.loc[idx[:'A', lambda x:x.sum()>0]] # columns whose sum is greater than 0

Unnamed: 0_level_0,Big,D,D,F
Unnamed: 0_level_1,Small,d,e,e
Upper,Lower,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
A,a,3,6,9
A,b,-3,3,-4
A,c,-1,0,9


[b] `loc[idx[*,*],idx[*,*]]` type

In this case, slicing can be done in layers, with the first `idx` referring to the row index and the second one to the column index.

In [70]:
df_ex.loc[idx[:'A', 'b':], idx['E':, 'e':]]

Unnamed: 0_level_0,Big,E,E,F,F
Unnamed: 0_level_1,Small,e,f,e,f
Upper,Lower,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
A,b,-2,5,-4,4
A,c,6,6,9,-6


But please note that the use of functions is not supported at this time:

In [71]:
try:
    df_ex.loc[idx[:'A', lambda x: 'b'], idx['E':, 'e':]]
except Exception as e:
    Err_Msg = e
Err_Msg

KeyError(<function __main__.<lambda>(x)>)
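As a compact recap of the second form, here is a minimal sketch on a smaller made-up table (`df_demo` and its integer values are assumptions for illustration). It slices each level separately and also passes a list of labels for one level, which is one way to pick specific elements without a function.

```python
import numpy as np
import pandas as pd

# Hypothetical table with two-level row and column indexes, values 0..35.
rows = pd.MultiIndex.from_product(
    [['A', 'B', 'C'], ['a', 'b', 'c']], names=('Upper', 'Lower'))
cols = pd.MultiIndex.from_product(
    [['D', 'E'], ['d', 'e']], names=('Big', 'Small'))
df_demo = pd.DataFrame(np.arange(36).reshape(9, 4), index=rows, columns=cols)

idx = pd.IndexSlice

# Rows: Upper from 'B' onward and Lower from 'b' onward; columns: all of 'E'.
sub = df_demo.loc[idx['B':, 'b':], idx['E', :]]

# Rows: Upper in ['A', 'C'] with Lower 'a'; columns: Small equal to 'd'.
picked = df_demo.loc[idx[['A', 'C'], 'a'], idx[:, 'd']]

print(sub)
print(picked)
```

Label slices like these rely on the index being sorted, which holds here because `from_product` was given sorted lists.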

### 4. Construction of multi-level index

The structure and slicing of multi-level indexed tables were described above. Besides `set_index`, how can you construct a multi-level index yourself? There are three commonly used methods: `from_tuples`, `from_arrays`, and `from_product`, all of which are functions under the `pd.MultiIndex` object.

`from_tuples` means to construct according to the list of tuples passed in:

In [72]:
my_tuple = [('a','cat'),('a','dog'),('b','cat'),('b','dog')]
pd.MultiIndex.from_tuples(my_tuple, names=['First','Second'])

MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])

`from_arrays` constructs the index from a list of lists, where each inner list gives the values of the corresponding level:

In [73]:
my_array = [list('aabb'), ['cat', 'dog']*2]
pd.MultiIndex.from_arrays(my_array, names=['First','Second'])

MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])

`from_product` constructs the index from the Cartesian product of the given lists:

In [74]:
my_list1 = ['a','b']
my_list2 = ['cat','dog']
pd.MultiIndex.from_product([my_list1, my_list2], names=['First','Second'])

MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])
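Since the three calls above print the same index, their equivalence can be checked directly. The sketch below rebuilds them and compares the results with `equals`; it also shows `pd.MultiIndex.from_frame`, a fourth constructor that is not covered in the text above and is included here only as an aside.

```python
import pandas as pd

names = ['First', 'Second']

from_tuples = pd.MultiIndex.from_tuples(
    [('a', 'cat'), ('a', 'dog'), ('b', 'cat'), ('b', 'dog')], names=names)
from_arrays = pd.MultiIndex.from_arrays(
    [list('aabb'), ['cat', 'dog'] * 2], names=names)
from_product = pd.MultiIndex.from_product(
    [['a', 'b'], ['cat', 'dog']], names=names)

# Aside: build the same index from the columns of a DataFrame;
# the level names are taken from the column names.
frame = pd.DataFrame({'First': list('aabb'), 'Second': ['cat', 'dog'] * 2})
from_frame = pd.MultiIndex.from_frame(frame)

all_equal = (from_tuples.equals(from_arrays)
             and from_arrays.equals(from_product)
             and from_product.equals(from_frame))
print(all_equal)
```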

## 3. Common methods of indexing
### 1. Exchange and deletion of index layers
To facilitate understanding of the exchange process, here is an example of constructing a three-level index:

In [75]:
np.random.seed(0)
L1,L2,L3 = ['A','B'],['a','b'],['alpha','beta']
mul_index1 = pd.MultiIndex.from_product([L1,L2,L3], names=('Upper', 'Lower','Extra'))
L4,L5,L6 = ['C','D'],['c','d'],['cat','dog']
mul_index2 = pd.MultiIndex.from_product([L4,L5,L6], names=('Big', 'Small', 'Other'))
df_ex = pd.DataFrame(np.random.randint(-9,10,(8,8)), index=mul_index1,  columns=mul_index2)
df_ex

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Other,cat,dog,cat,dog,cat,dog,cat,dog
Upper,Lower,Extra,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3
A,a,alpha,3,6,-9,-6,-6,-2,0,9
A,a,beta,-5,-3,3,-8,-3,-2,5,8
A,b,alpha,-4,4,-1,0,7,-4,6,6
A,b,beta,-9,9,-6,8,5,-2,-9,-8
B,a,alpha,0,-9,1,-6,2,9,-7,-9
B,a,beta,-9,-5,-4,-3,-1,8,6,-5
B,b,alpha,0,1,-8,-8,-2,0,-6,-3
B,b,beta,2,5,9,-9,5,-6,3,1


The exchange of index levels is done by `swaplevel` and `reorder_levels`. The former can only exchange two levels, while the latter can rearrange any number of levels. Both accept an `axis` argument that specifies whether the row index or the column index is rearranged:

In [76]:
df_ex.swaplevel(0,2,axis=1).head() # swap the first and third levels of the column index

Unnamed: 0_level_0,Unnamed: 1_level_0,Other,cat,dog,cat,dog,cat,dog,cat,dog
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Big,C,C,C,C,D,D,D,D
Upper,Lower,Extra,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3
A,a,alpha,3,6,-9,-6,-6,-2,0,9
A,a,beta,-5,-3,3,-8,-3,-2,5,8
A,b,alpha,-4,4,-1,0,7,-4,6,6
A,b,beta,-9,9,-6,8,5,-2,-9,-8
B,a,alpha,0,-9,1,-6,2,9,-7,-9


In [77]:
df_ex.reorder_levels([2,0,1],axis=0).head() # the numbers in the list refer to the levels of the original index

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Other,cat,dog,cat,dog,cat,dog,cat,dog
Extra,Upper,Lower,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3
alpha,A,a,3,6,-9,-6,-6,-2,0,9
beta,A,a,-5,-3,3,-8,-3,-2,5,8
alpha,A,b,-4,4,-1,0,7,-4,6,6
beta,A,b,-9,9,-6,8,5,-2,-9,-8
alpha,B,a,0,-9,1,-6,2,9,-7,-9


#### 【NOTE】Index exchange between axes
This section only involves exchanging levels within the row index or within the column index. Exchanging indexes between the two axes will be discussed in Chapter 5.
#### 【END】
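To relate the two level-exchange methods described above, the following minimal sketch (on a made-up frame `df_demo`) shows that a `swaplevel` call is a special case of `reorder_levels`, and that both accept level names as well as level numbers.

```python
import pandas as pd

# Hypothetical frame with a three-level row index.
index = pd.MultiIndex.from_product(
    [['A', 'B'], ['a', 'b'], ['alpha', 'beta']],
    names=('Upper', 'Lower', 'Extra'))
df_demo = pd.DataFrame({'value': range(8)}, index=index)

# Swap the outermost and innermost levels, referring to them by name.
swapped = df_demo.swaplevel('Upper', 'Extra', axis=0)

# The same result expressed as a full reordering of all three levels.
reordered = df_demo.reorder_levels(['Extra', 'Lower', 'Upper'], axis=0)

# A rotation of three levels cannot be done with a single swap.
rotated = df_demo.reorder_levels([2, 0, 1], axis=0)

print(list(swapped.index.names), list(rotated.index.names))
```

Neither method sorts the rows; only the order of the levels within each index tuple changes.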
If you want to delete the index of a certain level, you can use the `droplevel` method:

In [78]:
df_ex.droplevel(1,axis=1)

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Other,cat,dog,cat,dog,cat,dog,cat,dog
Upper,Lower,Extra,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
A,a,alpha,3,6,-9,-6,-6,-2,0,9
A,a,beta,-5,-3,3,-8,-3,-2,5,8
A,b,alpha,-4,4,-1,0,7,-4,6,6
A,b,beta,-9,9,-6,8,5,-2,-9,-8
B,a,alpha,0,-9,1,-6,2,9,-7,-9
B,a,beta,-9,-5,-4,-3,-1,8,6,-5
B,b,alpha,0,1,-8,-8,-2,0,-6,-3
B,b,beta,2,5,9,-9,5,-6,3,1


In [79]:
df_ex.droplevel([0,1],axis=0)

Big,C,C,C,C,D,D,D,D
Small,c,c,d,d,c,c,d,d
Other,cat,dog,cat,dog,cat,dog,cat,dog
Extra,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3
alpha,3,6,-9,-6,-6,-2,0,9
beta,-5,-3,3,-8,-3,-2,5,8
alpha,-4,4,-1,0,7,-4,6,6
beta,-9,9,-6,8,5,-2,-9,-8
alpha,0,-9,1,-6,2,9,-7,-9
beta,-9,-5,-4,-3,-1,8,6,-5
alpha,0,1,-8,-8,-2,0,-6,-3
beta,2,5,9,-9,5,-6,3,1


### 2. Modify index attributes
The name of the index layer can be modified through `rename_axis`. The common modification method is to pass in the dictionary mapping:

In [80]:
df_ex.rename_axis(index={'Upper':'Changed_row'}, columns={'Other':'Changed_Col'}).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Changed_Col,cat,dog,cat,dog,cat,dog,cat,dog
Changed_row,Lower,Extra,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3
A,a,alpha,3,6,-9,-6,-6,-2,0,9
A,a,beta,-5,-3,3,-8,-3,-2,5,8
A,b,alpha,-4,4,-1,0,7,-4,6,6
A,b,beta,-9,9,-6,8,5,-2,-9,-8
B,a,alpha,0,-9,1,-6,2,9,-7,-9


The index value can be modified through `rename`. If it is a multi-level index, the modified level number `level` needs to be specified:

In [81]:
df_ex.rename(columns={'cat':'not_cat'}, level=2).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Other,not_cat,dog,not_cat,dog,not_cat,dog,not_cat,dog
Upper,Lower,Extra,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3
A,a,alpha,3,6,-9,-6,-6,-2,0,9
A,a,beta,-5,-3,3,-8,-3,-2,5,8
A,b,alpha,-4,4,-1,0,7,-4,6,6
A,b,beta,-9,9,-6,8,5,-2,-9,-8
B,a,alpha,0,-9,1,-6,2,9,-7,-9


The passed parameter can also be a function, whose input value is the index element:

In [82]:
df_ex.rename(index=lambda x:str.upper(x), level=2).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Other,cat,dog,cat,dog,cat,dog,cat,dog
Upper,Lower,Extra,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3
A,a,ALPHA,3,6,-9,-6,-6,-2,0,9
A,a,BETA,-5,-3,3,-8,-3,-2,5,8
A,b,ALPHA,-4,4,-1,0,7,-4,6,6
A,b,BETA,-9,9,-6,8,5,-2,-9,-8
B,a,ALPHA,0,-9,1,-6,2,9,-7,-9


#### 【Practice】
Try passing a function to `rename_axis` to achieve the same result as in the example above, that is, replace `Upper` and `Other` with `Changed_row` and `Changed_Col` respectively.
#### 【END】
To replace all the elements of an index level, an iterator can be used:

In [83]:
new_values = iter(list('abcdefgh'))
df_ex.rename(index=lambda x:next(new_values), level=2)

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Other,cat,dog,cat,dog,cat,dog,cat,dog
Upper,Lower,Extra,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3
A,a,a,3,6,-9,-6,-6,-2,0,9
A,a,b,-5,-3,3,-8,-3,-2,5,8
A,b,c,-4,4,-1,0,7,-4,6,6
A,b,d,-9,9,-6,8,5,-2,-9,-8
B,a,e,0,-9,1,-6,2,9,-7,-9
B,a,f,-9,-5,-4,-3,-1,8,6,-5
B,b,g,0,1,-8,-8,-2,0,-6,-3
B,b,h,2,5,9,-9,5,-6,3,1


Modifying the element at a certain position is easy in a single-level index: take out the `values` attribute of the index, modify the resulting array, and assign it back to `index`. For a multi-level index it is more troublesome. One solution is to temporarily convert a level of the index into a column of the table, modify it, and then set it back as the index. The following section introduces these operations.
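Both routes for changing a single index element can be sketched briefly. The tables and names below (`df_single`, `df_multi_demo`, the label `'z'`) are made up for illustration, and the single-level case uses `tolist()` rather than the `values` attribute simply to obtain a mutable copy of the labels.

```python
import pandas as pd

# Single-level index: copy the labels, change one position, assign back.
df_single = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])
labels = df_single.index.tolist()
labels[1] = 'B'
df_single.index = labels

# Multi-level index: move one level into a column, edit it, move it back.
mi = pd.MultiIndex.from_product(
    [['A', 'B'], ['a', 'b']], names=['Upper', 'Lower'])
df_multi_demo = pd.DataFrame({'x': range(4)}, index=mi)

tmp = df_multi_demo.reset_index('Lower')           # 'Lower' becomes a column
tmp.iloc[1, tmp.columns.get_loc('Lower')] = 'z'    # edit the second row only
df_multi_demo = tmp.set_index('Lower', append=True)

print(df_single.index.tolist())
print(df_multi_demo.index.tolist())
```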

Another function worth introducing is `map`, a method defined on `Index`. It is similar to passing a function to `rename`, except that for a multi-level index the function receives the whole index tuple rather than the scalar value of one level, which makes cross-level modifications convenient. For example, the string conversion above can be written equivalently as:

In [84]:
df_temp = df_ex.copy()
new_idx = df_temp.index.map(lambda x: (x[0], x[1], str.upper(x[2])))
df_temp.index = new_idx
df_temp.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Other,cat,dog,cat,dog,cat,dog,cat,dog
Upper,Lower,Extra,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3
A,a,ALPHA,3,6,-9,-6,-6,-2,0,9
A,a,BETA,-5,-3,3,-8,-3,-2,5,8
A,b,ALPHA,-4,4,-1,0,7,-4,6,6
A,b,BETA,-9,9,-6,8,5,-2,-9,-8
B,a,ALPHA,0,-9,1,-6,2,9,-7,-9


Another use of `map` is to compress multi-level indexes, which is useful in some operations in Chapter 4 and Chapter 5:

In [85]:
df_temp = df_ex.copy()
new_idx = df_temp.index.map(lambda x: (x[0]+'-'+x[1]+'-'+x[2]))
df_temp.index = new_idx
df_temp.head() # single-level index

Big,C,C,C,C,D,D,D,D
Small,c,c,d,d,c,c,d,d
Other,cat,dog,cat,dog,cat,dog,cat,dog
A-a-alpha,3,6,-9,-6,-6,-2,0,9
A-a-beta,-5,-3,3,-8,-3,-2,5,8
A-b-alpha,-4,4,-1,0,7,-4,6,6
A-b-beta,-9,9,-6,8,5,-2,-9,-8
B-a-alpha,0,-9,1,-6,2,9,-7,-9


At the same time, it can also be expanded in the opposite direction:

In [86]:
new_idx = df_temp.index.map(lambda x:tuple(x.split('-')))
df_temp.index = new_idx
df_temp.head() # three-level index

Unnamed: 0_level_0,Unnamed: 1_level_0,Big,C,C,C,C,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,Small,c,c,d,d,c,c,d,d
Unnamed: 0_level_2,Unnamed: 1_level_2,Other,cat,dog,cat,dog,cat,dog,cat,dog
A,a,alpha,3,6,-9,-6,-6,-2,0,9
A,a,beta,-5,-3,3,-8,-3,-2,5,8
A,b,alpha,-4,4,-1,0,7,-4,6,6
A,b,beta,-9,9,-6,8,5,-2,-9,-8
B,a,alpha,0,-9,1,-6,2,9,-7,-9


### 3. Setting and resetting indexes
To illustrate the functions in this section, let's construct a new table:

In [87]:
df_new = pd.DataFrame({'A':list('aacd'), 'B':list('PQRT'), 'C':[1,2,3,4]})
df_new

Unnamed: 0,A,B,C
0,a,P,1
1,a,Q,2
2,c,R,3
3,d,T,4


The index setting can be done using `set_index`. The main parameter here is `append`, which indicates whether to keep the original index and directly add the new setting to the inner layer of the original index:

In [88]:
df_new.set_index('A')

Unnamed: 0_level_0,B,C
A,Unnamed: 1_level_1,Unnamed: 2_level_1
a,P,1
a,Q,2
c,R,3
d,T,4


In [89]:
df_new.set_index('A', append=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,B,C
Unnamed: 0_level_1,A,Unnamed: 2_level_1,Unnamed: 3_level_1
0,a,P,1
1,a,Q,2
2,c,R,3
3,d,T,4


You can specify multiple columns as indexes at the same time:

In [90]:
df_new.set_index(['A', 'B'])

Unnamed: 0_level_0,Unnamed: 1_level_0,C
A,B,Unnamed: 2_level_1
a,P,1
a,Q,2
c,R,3
d,T,4


If the values you want to use as an index are not a column of the table, you can pass the corresponding `Series` directly in the argument:

In [91]:
my_index = pd.Series(list('WXYZ'), name='D')
df_new = df_new.set_index(['A', my_index])
df_new

Unnamed: 0_level_0,Unnamed: 1_level_0,B,C
A,D,Unnamed: 2_level_1,Unnamed: 3_level_1
a,W,P,1
a,X,Q,2
c,Y,R,3
d,Z,T,4


`reset_index` is the inverse function of `set_index`. Its main parameter is `drop`, which indicates whether to discard the removed index layer instead of adding it to the column:

In [92]:
df_new.reset_index(['D'])

Unnamed: 0_level_0,D,B,C
A,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,W,P,1
a,X,Q,2
c,Y,R,3
d,Z,T,4


In [93]:
df_new.reset_index(['D'], drop=True)

Unnamed: 0_level_0,B,C
A,Unnamed: 1_level_1,Unnamed: 2_level_1
a,P,1
a,Q,2
c,R,3
d,T,4


If all indices are reset, pandas will simply regenerate a default index:

In [94]:
df_new.reset_index()

Unnamed: 0,A,D,B,C
0,a,W,P,1
1,a,X,Q,2
2,c,Y,R,3
3,d,Z,T,4


### 4. Index Transformation
In some cases, the index needs to be expanded or reduced. More specifically, given a new index, the elements of the original table are filled into a table built on the new index wherever the labels match. For example, the following table gives employee information, and we need a new table that adds an employee, removes the height column, and adds a gender column:

In [95]:
df_reindex = pd.DataFrame({"Weight":[60,70,80], "Height":[176,180,179]}, index=['1001','1003','1002'])
df_reindex

Unnamed: 0,Weight,Height
1001,60,176
1003,70,180
1002,80,179


In [96]:
df_reindex.reindex(index=['1001','1002','1003','1004'], columns=['Weight','Gender'])

Unnamed: 0,Weight,Gender
1001,60.0,
1002,80.0,
1003,70.0,
1004,,


This requirement often arises when filling in missing time points of a time-series index or expanding a set of `ID` numbers. Note that data in the original and new tables are aligned automatically by index label. For example, 1002 comes after 1003 in the original table, while the order is reversed in the new table; `reindex` aligns the elements by label regardless of position.
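The time-series case mentioned above can be sketched as follows. The series `s` and its dates are made up for illustration; `fill_value` and `method='ffill'` are optional arguments of `reindex` that the text above does not use, shown here as two common ways of handling the newly created positions.

```python
import pandas as pd

# Hypothetical daily range in which two of the four days have observations.
full_range = pd.date_range('2020-01-01', '2020-01-04', freq='D')
s = pd.Series([1.0, 3.0], index=full_range[[0, 2]])

# Default behaviour: labels absent from the original become missing values.
as_missing = s.reindex(full_range)

# Fill the new positions with a constant instead.
as_zero = s.reindex(full_range, fill_value=0.0)

# Carry the last observed value forward (requires a monotonic index).
carried = s.reindex(full_range, method='ffill')

print(as_missing.isna().tolist())
print(as_zero.tolist())
print(carried.tolist())
```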

A function similar to `reindex` is `reindex_like`, which transforms the index of the calling table to match the index of the table passed in. For example, if a table with the target index already exists, the result above can be obtained with the following code:

In [97]:
df_existed = pd.DataFrame(index=['1001','1002','1003','1004'], columns=['Weight','Gender'])
df_reindex.reindex_like(df_existed)

Unnamed: 0,Weight,Gender
1001,60.0,
1002,80.0,
1003,70.0,
1004,,


## 4. Index operations
### 1. Set operation rules

There is often a need to use set operations to extract rows that meet certain conditions. For example, there are two tables `A` and `B` whose indexes are employee numbers, and you need to select the information of all employees whose numbers appear in the indexes of both tables. This is easy to achieve through operations on `Index`.

However, before that, let's review the four common set operations:

$$\rm S_A.intersection(S_B) = \rm S_A \cap S_B \Leftrightarrow \rm \{x|x\in S_A\, and\, x\in S_B\}$$
$$\rm S_A.union(S_B) = \rm S_A \cup S_B \Leftrightarrow \rm \{x|x\in S_A\, or\, x\in S_B\}$$
$$\rm S_A.difference(S_B) = \rm S_A - S_B \Leftrightarrow \rm \{x|x\in S_A\, and\, x\notin S_B\}$$
$$\rm S_A.symmetric\_difference(S_B) = \rm S_A\triangle S_B\Leftrightarrow \rm \{x|x\in S_A\cup S_B - S_A\cap S_B\}$$

### 2. General index operations

Since the elements of a set are distinct, whereas an index may contain duplicate elements, use `unique` to remove duplicates before performing the operation. The following constructs two simple example tables for demonstration:

In [98]:
df_set_1 = pd.DataFrame([[0,1],[1,2],[3,4]], index = pd.Index(['a','b','a'],name='id1'))
df_set_2 = pd.DataFrame([[4,5],[2,6],[7,1]], index = pd.Index(['b','b','c'],name='id2'))
id1, id2 = df_set_1.index.unique(), df_set_2.index.unique()
id1.intersection(id2)

Index(['b'], dtype='object')

In [99]:
id1.union(id2)

Index(['a', 'b', 'c'], dtype='object')

In [100]:
id1.difference(id2)

Index(['a'], dtype='object')

In [101]:
id1.symmetric_difference(id2)

Index(['a', 'c'], dtype='object')
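The results of these operations are `Index` objects, so they can be used to filter the tables themselves. The sketch below is one possible way to do that, under the assumption that both tables share the index name `id` (the frames `df_a` and `df_b` are made up and mirror the example above); it also checks the symmetric-difference identity from the formulas in the previous subsection.

```python
import pandas as pd

df_a = pd.DataFrame({'v': [0, 1, 3]}, index=pd.Index(['a', 'b', 'a'], name='id'))
df_b = pd.DataFrame({'w': [4, 2, 7]}, index=pd.Index(['b', 'b', 'c'], name='id'))

# Remove duplicates first, then intersect the two indexes.
ids_a, ids_b = df_a.index.unique(), df_b.index.unique()
common = ids_a.intersection(ids_b)

# Keep only the rows whose label appears in both tables.
rows_a = df_a[df_a.index.isin(common)]
rows_b = df_b[df_b.index.isin(common)]

# Symmetric difference equals union minus intersection.
sym = ids_a.symmetric_difference(ids_b)
alt = ids_a.union(ids_b).difference(common)

print(rows_a)
print(rows_b)
print(sym.equals(alt))
```

Filtering with `isin` on the index keeps every matching row, including duplicates, which is why `rows_b` retains both `b` rows.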

If the columns on which the set operation is to be performed are not indexes, one way is to convert them into indexes first and restore them after the operation. Another way is to use the `isin` function. For example, after resetting the indexes, select the rows of the first table whose `id1` values also appear in the `id2` column of the second table:

In [102]:
df_set_in_col_1 = df_set_1.reset_index()
df_set_in_col_2 = df_set_2.reset_index()
df_set_in_col_1

Unnamed: 0,id1,0,1
0,a,0,1
1,b,1,2
2,a,3,4


In [103]:
df_set_in_col_2

Unnamed: 0,id2,0,1
0,b,4,5
1,b,2,6
2,c,7,1


In [104]:
df_set_in_col_1[df_set_in_col_1.id1.isin(df_set_in_col_2.id2)]

Unnamed: 0,id1,0,1
1,b,1,2


## 5. Exercises
### Ex1: Company employee dataset
There is a company employee dataset:

In [105]:
df = pd.read_csv('../data/company.csv')
df.head(3)

Unnamed: 0,EmployeeID,birthdate_key,age,city_name,department,job_title,gender
0,1318,1/3/1954,61,Vancouver,Executive,CEO,M
1,1319,1/3/1957,58,Vancouver,Executive,VP Stores,F
2,1320,1/2/1955,60,Vancouver,Executive,Legal Counsel,F


1. Use only `query` and `loc` to select males who are under 40 years old and work in `Dairy` or `Bakery`.
2. Select the first, third and second-to-last columns of the rows where the employee's `ID` number is odd.
3. Follow the steps below to perform indexing:

* Set the last three columns as indexes and swap the inner and outer layers
* Restore the middle layer index
* Change the outer layer index name to `Gender`
* Use an underscore to merge the two layers of row indexes
* Split the row index back to its original state
* Change the index names back to the names used in the original table
* Restore the default index and keep the columns in the relative position of the original table

### Ex2: Chocolate Dataset
There is a dataset about chocolate reviews:

In [106]:
df = pd.read_csv('../data/chocolate.csv')
df.head(3)

Unnamed: 0,Company,Review\r\nDate,Cocoa\r\nPercent,Company\r\nLocation,Rating
0,A. Morin,2016,63%,France,3.75
1,A. Morin,2015,70%,France,2.75
2,A. Morin,2015,70%,France,3.0


1. Replace `\n` in the column index name with a space.
2. The chocolate `Rating` score ranges from 1 to 5 in steps of 0.25. Please select samples with a score of 2.75 or below and a cocoa content `Cocoa Percent` higher than the median.
3. After setting `Review Date` and `Company Location` as indexes, select samples with `Review Date` after 2012 and `Company Location` not belonging to `France, Canada, Amsterdam, Belgium`.