<center><h1>Chapter 2 Pandas Basics</h1></center>

In [1]:
import numpy as np
import pandas as pd

Before you start, make sure that your `pandas` version is not lower than the one shown below; otherwise, upgrade first. Also make sure that the three packages `xlrd, xlwt, openpyxl` are installed, and that the version of `xlrd` is not higher than `2.0.0`.

In [2]:
pd.__version__

'1.1.5'

## 1. Reading and writing files
### 1. Reading files

There are many file formats that `pandas` can read. Here we mainly introduce reading `csv, excel, txt` files.

In [3]:
df_csv = pd.read_csv('../data/my_csv.csv')
df_csv

Unnamed: 0,col1,col2,col3,col4,col5
0,2,a,1.4,apple,2020/1/1
1,3,b,3.4,banana,2020/1/2
2,6,c,2.5,orange,2020/1/5
3,5,d,3.2,lemon,2020/1/7


In [4]:
df_txt = pd.read_table('../data/my_table.txt')
df_txt

Unnamed: 0,col1,col2,col3,col4
0,2,a,1.4,apple 2020/1/1
1,3,b,3.4,banana 2020/1/2
2,6,c,2.5,orange 2020/1/5
3,5,d,3.2,lemon 2020/1/7


In [5]:
df_excel = pd.read_excel('../data/my_excel.xlsx')
df_excel

Unnamed: 0,col1,col2,col3,col4,col5
0,2,a,1.4,apple,2020/1/1
1,3,b,3.4,banana,2020/1/2
2,6,c,2.5,orange,2020/1/5
3,5,d,3.2,lemon,2020/1/7


Here are some commonly used parameters shared by the three functions above:

* `header=None` means that the first row is not used as the column names

* `index_col` specifies one or more columns to use as the index; indexes will be covered in detail in Chapter 3

* `usecols` specifies the set of columns to read; all columns are read by default

* `parse_dates` specifies the columns to be converted to datetimes; time series will be covered in Chapter 10

* `nrows` specifies the number of data rows to read

In [6]:
pd.read_table('../data/my_table.txt', header=None)

Unnamed: 0,0,1,2,3
0,col1,col2,col3,col4
1,2,a,1.4,apple 2020/1/1
2,3,b,3.4,banana 2020/1/2
3,6,c,2.5,orange 2020/1/5
4,5,d,3.2,lemon 2020/1/7


In [7]:
pd.read_csv('../data/my_csv.csv', index_col=['col1', 'col2'])

Unnamed: 0_level_0,Unnamed: 1_level_0,col3,col4,col5
col1,col2,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2,a,1.4,apple,2020/1/1
3,b,3.4,banana,2020/1/2
6,c,2.5,orange,2020/1/5
5,d,3.2,lemon,2020/1/7


In [8]:
pd.read_table('../data/my_table.txt', usecols=['col1', 'col2'])

Unnamed: 0,col1,col2
0,2,a
1,3,b
2,6,c
3,5,d


In [9]:
pd.read_csv('../data/my_csv.csv', parse_dates=['col5'])

Unnamed: 0,col1,col2,col3,col4,col5
0,2,a,1.4,apple,2020-01-01
1,3,b,3.4,banana,2020-01-02
2,6,c,2.5,orange,2020-01-05
3,5,d,3.2,lemon,2020-01-07


In [10]:
pd.read_excel('../data/my_excel.xlsx', nrows=2)

Unnamed: 0,col1,col2,col3,col4,col5
0,2,a,1.4,apple,2020/1/1
1,3,b,3.4,banana,2020/1/2


When reading a `txt` file, we often encounter delimiters other than the default tab. `read_table` has a separator parameter `sep`, which lets users specify a custom delimiter for reading `txt` data. For example, the fields in the following table are separated by `||||`:

In [11]:
pd.read_table('../data/my_table_special_sep.txt')

Unnamed: 0,col1 |||| col2
0,TS |||| This is an apple.
1,GQ |||| My name is Bob.
2,WT |||| Well done!
3,PT |||| May I help you?


The above result is obviously not ideal. In this case, you can use `sep` and specify the engine as `python`:

In [12]:
# A raw string keeps the backslashes intact for the regular expression engine
pd.read_table('../data/my_table_special_sep.txt', sep=r' \|\|\|\| ', engine='python')

Unnamed: 0,col1,col2
0,TS,This is an apple.
1,GQ,My name is Bob.
2,WT,Well done!
3,PT,May I help you?


#### 【WARNING】`sep` is a regular expression parameter

When using `read_table`, note that a `sep` longer than one character is interpreted as a regular expression, so `|` must be escaped as `\|`; otherwise the file will not be split correctly. For the basics of regular expressions, please refer to Chapter 8 or other related materials.

#### 【END】
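The effect of escaping can be tried without any file on disk. The following is a minimal sketch that assumes a small in-memory text (the variable `raw` is a hypothetical stand-in for `my_table_special_sep.txt`) and passes the separator as a raw string:

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for a file whose fields are separated by ' |||| '
raw = "col1 |||| col2\nTS |||| This is an apple.\nGQ |||| My name is Bob.\n"

# Each '|' is escaped because a multi-character sep is treated as a regular expression
df_sep = pd.read_table(io.StringIO(raw), sep=r' \|\|\|\| ', engine='python')
print(df_sep.columns.tolist())  # ['col1', 'col2']
print(df_sep.shape)             # (2, 2)
```

Without the escapes, `|` would be read as the regular expression alternation operator rather than a literal character.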

### 2. Data writing

When writing data, the most common option is `index=False`, which omits the index from the saved file. This is especially useful when the index carries no special meaning.

In [13]:
df_csv.to_csv('../data/my_csv_saved.csv', index=False)
df_excel.to_excel('../data/my_excel_saved.xlsx', index=False)
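To see what `index=False` changes, here is a small sketch that writes to a string instead of a file (the toy table `df_small` is an assumption for illustration, not one of the data files used in this chapter):

```python
import io
import pandas as pd

# Hypothetical toy table
df_small = pd.DataFrame({'col1': [2, 3], 'col2': ['a', 'b']})

# With no path given, to_csv returns the CSV text as a string
with_index = df_small.to_csv()
without_index = df_small.to_csv(index=False)

# Reading the text back: a saved index reappears as an extra 'Unnamed: 0' column
back_with = pd.read_csv(io.StringIO(with_index))
back_without = pd.read_csv(io.StringIO(without_index))
print(back_with.columns.tolist())     # ['Unnamed: 0', 'col1', 'col2']
print(back_without.columns.tolist())  # ['col1', 'col2']
```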

There is no `to_table` function in `pandas`, but `to_csv` can save a `txt` file and accepts a custom delimiter, most commonly the tab character `\t`:

In [14]:
df_txt.to_csv('../data/my_txt_saved.txt', sep='\t', index=False)
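A tab-separated file written this way can be read back with `read_table`, whose default separator is the tab. The sketch below uses an in-memory buffer and a hypothetical toy table in place of the real files:

```python
import io
import pandas as pd

# Hypothetical toy table; the second column contains spaces but no tabs
df_small = pd.DataFrame({'col1': [2, 3],
                         'col4': ['apple 2020/1/1', 'banana 2020/1/2']})

buffer = io.StringIO()
df_small.to_csv(buffer, sep='\t', index=False)  # write tab-separated text
buffer.seek(0)                                  # rewind before reading

df_back = pd.read_table(buffer)                 # default sep is '\t'
print(df_back.columns.tolist())
print(df_back['col4'].tolist())
```

Because the delimiter is a tab, the spaces inside `col4` do not split the column.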

To quickly convert a table to `markdown` or `latex`, use the `to_markdown` and `to_latex` functions. `to_markdown` requires the `tabulate` package to be installed.

In [15]:
print(df_csv.to_markdown())

|    |   col1 | col2   |   col3 | col4   | col5     |
|---:|-------:|:-------|-------:|:-------|:---------|
|  0 |      2 | a      |    1.4 | apple  | 2020/1/1 |
|  1 |      3 | b      |    3.4 | banana | 2020/1/2 |
|  2 |      6 | c      |    2.5 | orange | 2020/1/5 |
|  3 |      5 | d      |    3.2 | lemon  | 2020/1/7 |


In [16]:
print(df_csv.to_latex())

\begin{tabular}{lrlrll}
\toprule
{} &  col1 & col2 &  col3 &    col4 &      col5 \\
\midrule
0 &     2 &    a &   1.4 &   apple &  2020/1/1 \\
1 &     3 &    b &   3.4 &  banana &  2020/1/2 \\
2 &     6 &    c &   2.5 &  orange &  2020/1/5 \\
3 &     5 &    d &   3.2 &   lemon &  2020/1/7 \\
\bottomrule
\end{tabular}



## 2. Basic data structure
`pandas` has two basic data storage structures, `Series` storing one-dimensional `values` and `DataFrame` storing two-dimensional `values`. Many properties and methods are defined on these two structures.

### 1. Series
`Series` generally consists of four parts, namely the value of the sequence `data`, the index `index`, the storage type `dtype`, and the name of the sequence `name`. Among them, the index can also specify its name, which is empty by default.

In [17]:
s = pd.Series(data = [100, 'a', {'dic1':5}],
              index = pd.Index(['id1', 20, 'third'], name='my_idx'),
              dtype = 'object',
              name = 'my_name')
s

my_idx
id1              100
20                 a
third    {'dic1': 5}
Name: my_name, dtype: object

#### 【NOTE】`object` type

`object` represents a mixed type, as in the example above, integers, strings, and `Python` dictionary data structures are stored. In addition, `pandas` currently considers pure string sequences to be a sequence of `object` type by default, but it can also be stored as `string` type. The content of text sequences will be discussed in Chapter 8.

#### 【END】

These attributes can be accessed with the `.` operator:

In [18]:
s.values

array([100, 'a', {'dic1': 5}], dtype=object)

In [19]:
s.index

Index(['id1', 20, 'third'], dtype='object', name='my_idx')

In [20]:
s.dtype

dtype('O')

In [21]:
s.name

'my_name'

Use `.shape` to get the length of the sequence:

In [22]:
s.shape

(3,)

Indexing is one of the most important concepts in `pandas` and will be discussed in detail in Chapter 3. To retrieve the value corresponding to a single index label, use `[index_item]`.
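As a minimal sketch of label-based retrieval, the following rebuilds a series like the one above under the hypothetical name `s_demo` and looks up two of its string labels:

```python
import pandas as pd

s_demo = pd.Series([100, 'a', {'dic1': 5}],
                   index=pd.Index(['id1', 20, 'third'], name='my_idx'),
                   name='my_name')

print(s_demo['id1'])    # 100
print(s_demo['third'])  # {'dic1': 5}
```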

### 2. DataFrame
`DataFrame` adds column indexes based on `Series`. A data frame can be constructed from two-dimensional `data` and row and column indexes:

In [23]:
data = [[1, 'a', 1.2], [2, 'b', 2.2], [3, 'c', 3.2]]
df = pd.DataFrame(data = data,
                  index = ['row_%d'%i for i in range(3)],
                  columns=['col_0', 'col_1', 'col_2'])
df

Unnamed: 0,col_0,col_1,col_2
row_0,1,a,1.2
row_1,2,b,2.2
row_2,3,c,3.2


But in general, more often a data frame is constructed using a mapping from column index names to data, along with row indexes:

In [24]:
df = pd.DataFrame(data = {'col_0': [1,2,3],
                          'col_1':list('abc'),
                          'col_2': [1.2, 2.2, 3.2]},
                  index = ['row_%d'%i for i in range(3)])
df

Unnamed: 0,col_0,col_1,col_2
row_0,1,a,1.2
row_1,2,b,2.2
row_2,3,c,3.2


Due to this mapping relationship, in `DataFrame`, `[col_name]` and `[col_list]` can be used to retrieve the corresponding columns and tables consisting of multiple columns, and the results are `Series` and `DataFrame` respectively:

In [25]:
df['col_0']

row_0    1
row_1    2
row_2    3
Name: col_0, dtype: int64

In [26]:
df[['col_0', 'col_1']]

Unnamed: 0,col_0,col_1
row_0,1,a
row_1,2,b
row_2,3,c


Similar to `Series`, you can also retrieve the corresponding attributes in the data frame:

In [27]:
df.values

array([[1, 'a', 1.2],
       [2, 'b', 2.2],
       [3, 'c', 3.2]], dtype=object)

In [28]:
df.index

Index(['row_0', 'row_1', 'row_2'], dtype='object')

In [29]:
df.columns

Index(['col_0', 'col_1', 'col_2'], dtype='object')

In [30]:
df.dtypes # returns a Series whose values are the dtypes of the corresponding columns

col_0      int64
col_1     object
col_2    float64
dtype: object

In [31]:
df.shape

(3, 3)

`.T` can be used to transpose a `DataFrame`:

In [32]:
df.T

Unnamed: 0,row_0,row_1,row_2
col_0,1,2,3
col_1,a,b,c
col_2,1.2,2.2,3.2


## 3. Commonly used basic functions
For illustration purposes, in the following section and the remaining chapters, we will use a virtual dataset called `learn_pandas.csv`, which records the personal information of students from four schools.

In [33]:
df = pd.read_csv('../data/learn_pandas.csv')
df.columns

Index(['School', 'Grade', 'Name', 'Gender', 'Height', 'Weight', 'Transfer',
       'Test_Number', 'Test_Date', 'Time_Record'],
      dtype='object')

The above column names represent school, grade, name, gender, height, weight, whether or not a transfer student, physical test session, test time, and 1000-meter results respectively. In this chapter, only the first seven columns need to be used.

In [34]:
df = df[df.columns[:7]]

### 1. Summary functions
The `head, tail` functions return the first `n` rows and the last `n` rows of a table or sequence respectively, where `n` defaults to 5:

In [35]:
df.head(2)

Unnamed: 0,School,Grade,Name,Gender,Height,Weight,Transfer
0,Shanghai Jiao Tong University,Freshman,Gaopeng Yang,Female,158.9,46.0,N
1,Peking University,Freshman,Changqiang You,Male,166.5,70.0,N


In [36]:
df.tail(3)

Unnamed: 0,School,Grade,Name,Gender,Height,Weight,Transfer
197,Shanghai Jiao Tong University,Senior,Chengqiang Chu,Female,153.9,45.0,N
198,Shanghai Jiao Tong University,Senior,Chengmei Shen,Male,175.3,71.0,N
199,Tsinghua University,Sophomore,Chunpeng Lv,Male,155.7,51.0,N


`info, describe` return a summary of the table and the main statistics of its numerical columns, respectively:

In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   School    200 non-null    object 
 1   Grade     200 non-null    object 
 2   Name      200 non-null    object 
 3   Gender    200 non-null    object 
 4   Height    183 non-null    float64
 5   Weight    189 non-null    float64
 6   Transfer  188 non-null    object 
dtypes: float64(2), object(5)
memory usage: 11.1+ KB


In [38]:
df.describe()

Unnamed: 0,Height,Weight
count,183.0,189.0
mean,163.218033,55.015873
std,8.608879,12.824294
min,145.4,34.0
25%,157.15,46.0
50%,161.9,51.0
75%,167.5,65.0
max,193.9,89.0


#### 【NOTE】More comprehensive data summary

`info, describe` display only limited information. For a comprehensive and effective overview of a data set, especially one with many columns, the [pandas-profiling](https://pandas-profiling.github.io/pandas-profiling/docs/) package is recommended. It will be mentioned again in Chapter 11.

#### 【END】

### 2. Feature statistical functions
Many statistical functions are defined on `Series` and `DataFrame`, the most common of which are `sum, mean, median, var, std, max, min`. For example, select the height and weight columns for demonstration:

In [39]:
df_demo = df[['Height', 'Weight']]
df_demo.mean()

Height    163.218033
Weight     55.015873
dtype: float64

In [40]:
df_demo.max()

Height    193.9
Weight     89.0
dtype: float64

In addition, it is necessary to introduce the three functions `quantile, count, idxmax`, which return the quantile, the number of non-missing values, and the index corresponding to the maximum value respectively:

In [41]:
df_demo.quantile(0.75)

Height    167.5
Weight     65.0
Name: 0.75, dtype: float64

In [42]:
df_demo.count()

Height    183
Weight    189
dtype: int64

In [43]:
df_demo.idxmax() # idxmin is the corresponding function for the minimum

Height    193
Weight      2
dtype: int64

All of the above functions are called aggregation functions because they reduce each sequence to a scalar. They share a common parameter `axis`, which defaults to 0 for column-by-column aggregation and can be set to 1 for row-by-row aggregation:

In [44]:
df_demo.mean(axis=1).head() # the mean of height and weight has no real meaning on this data set

0    102.45
1    118.25
2    138.95
3     41.00
4    124.00
dtype: float64

### 3. Unique value function
Using `unique` and `nunique` on a sequence returns the list of its unique values and the number of unique values, respectively:

In [45]:
df['School'].unique()

array(['Shanghai Jiao Tong University', 'Peking University',
       'Fudan University', 'Tsinghua University'], dtype=object)

In [46]:
df['School'].nunique()

4

`value_counts` returns the unique values and their corresponding frequencies:

In [47]:
df['School'].value_counts()

Tsinghua University              69
Shanghai Jiao Tong University    57
Fudan University                 40
Peking University                34
Name: School, dtype: int64

If you want to observe the unique values of combinations of multiple columns, you can use `drop_duplicates`. The key parameter is `keep`. The default value `first` means that each combination retains the row where it first appears, `last` means that the row where it last appears is retained, and `False` means that all rows with duplicate combinations are removed.

In [48]:
df_demo = df[['Gender','Transfer','Name']]
df_demo.drop_duplicates(['Gender', 'Transfer'])

Unnamed: 0,Gender,Transfer,Name
0,Female,N,Gaopeng Yang
1,Male,N,Changqiang You
12,Female,,Peng You
21,Male,,Xiaopeng Shen
36,Male,Y,Xiaojuan Qin
43,Female,Y,Gaoli Feng


In [49]:
df_demo.drop_duplicates(['Gender', 'Transfer'], keep='last')

Unnamed: 0,Gender,Transfer,Name
147,Male,,Juan You
150,Male,Y,Chengpeng You
169,Female,Y,Chengquan Qin
194,Female,,Yanmei Qian
197,Female,N,Chengqiang Chu
199,Male,N,Chunpeng Lv


In [50]:
df_demo.drop_duplicates(['Name', 'Gender'], keep=False).head() # keep only the gender and name combinations that appear exactly once

Unnamed: 0,Gender,Transfer,Name
0,Female,N,Gaopeng Yang
1,Male,N,Changqiang You
2,Male,N,Mei Sun
4,Male,N,Gaojuan You
5,Female,N,Xiaoli Qian


In [51]:
df['School'].drop_duplicates() # also works on a Series

0    Shanghai Jiao Tong University
1                Peking University
3                 Fudan University
5              Tsinghua University
Name: School, dtype: object

In addition, `duplicated` works similarly to `drop_duplicates`, but it returns a Boolean sequence marking the duplicates, and its `keep` parameter has the same meaning. In the returned sequence, duplicated elements are set to `True` and all others to `False`. `drop_duplicates` is equivalent to removing the rows where `duplicated` is `True`.

In [52]:
df_demo.duplicated(['Gender', 'Transfer']).head()

0    False
1    False
2     True
3     True
4     True
dtype: bool

In [53]:
df['School'].duplicated().head() # also works on a Series

0    False
1    False
2     True
3    False
4     True
Name: School, dtype: bool
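The stated relationship between `drop_duplicates` and `duplicated` can be checked directly. The sketch below uses a hypothetical toy table (`df_small`) rather than the chapter's data set and compares the two approaches for each value of `keep`:

```python
import pandas as pd

# Hypothetical toy data with one repeated (Gender, Transfer) combination
df_small = pd.DataFrame({'Gender': ['Female', 'Male', 'Male', 'Female'],
                         'Transfer': ['N', 'N', 'N', 'Y'],
                         'Name': ['A', 'B', 'C', 'D']})
subset = ['Gender', 'Transfer']

for keep in ['first', 'last', False]:
    dropped = df_small.drop_duplicates(subset, keep=keep)
    # Keep the rows that duplicated() does NOT flag
    filtered = df_small[~df_small.duplicated(subset, keep=keep)]
    print(keep, dropped.equals(filtered))
```

Each iteration prints `True`, since filtering with the negated Boolean sequence selects the same rows that `drop_duplicates` keeps.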

### 4. Replacement function
Generally speaking, the replacement operation is performed on a certain column, so the following examples all use `Series` as an example. The replacement functions in `pandas` can be summarized into three categories: mapping replacement, logical replacement, and numerical replacement. Mapping replacement includes the `replace` method, the `str.replace` method in Chapter 8, and the `cat.codes` method in Chapter 9. Here we introduce the usage of `replace`.

In `replace`, you can use dictionary construction or pass in two lists to perform replacement:

In [54]:
df['Gender'].replace({'Female':0, 'Male':1}).head()

0    0
1    1
2    1
3    0
4    1
Name: Gender, dtype: int64

In [55]:
df['Gender'].replace(['Female', 'Male'], [0, 1]).head()

0    0
1    1
2    1
3    0
4    1
Name: Gender, dtype: int64

In addition, `replace` supports a special directional replacement. If the `method` parameter is set to `ffill`, each matched value is replaced with the nearest preceding value that is not itself replaced, while `bfill` uses the nearest following one. As the following examples show, the two results differ:

In [56]:
s = pd.Series(['a', 1, 'b', 2, 1, 1, 'a'])
s.replace([1, 2], method='ffill')

0    a
1    a
2    b
3    b
4    b
5    b
6    a
dtype: object

In [57]:
s.replace([1, 2], method='bfill')

0    a
1    b
2    b
3    a
4    a
5    a
6    a
dtype: object

#### 【WARNING】Please use `str.replace` for regular replacement

Although regular replacement can be used for `replace`, there is still a `bug` for regular replacement of `string` type in the current version. Therefore, if you need this, please choose `str.replace` for replacement operation. The specific method will be explained in Chapter 8.

#### 【END】

Logical replacement includes `where` and `mask`. These two functions are completely symmetrical: `where` replaces the rows where the condition is `False`, while `mask` replaces the rows where the condition is `True`. When no replacement value is specified, missing values are used.

In [58]:
s = pd.Series([-1, 1.2345, 100, -50])
s.where(s<0)

0    -1.0
1     NaN
2     NaN
3   -50.0
dtype: float64

In [59]:
s.where(s<0, 100)

0     -1.0
1    100.0
2    100.0
3    -50.0
dtype: float64

In [60]:
s.mask(s<0)

0         NaN
1      1.2345
2    100.0000
3         NaN
dtype: float64

In [61]:
s.mask(s<0, -50)

0    -50.0000
1      1.2345
2    100.0000
3    -50.0000
dtype: float64

Note that the condition passed in only needs to be a Boolean sequence that matches the index of the Series being called:

In [62]:
s_condition= pd.Series([True,False,False,True],index=s.index)
s.mask(s_condition, -50)

0    -50.0000
1      1.2345
2    100.0000
3    -50.0000
dtype: float64
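The symmetry between `where` and `mask` means that negating the condition turns one into the other. A minimal sketch, using the hypothetical names `s_demo` and `cond`:

```python
import pandas as pd

s_demo = pd.Series([-1, 1.2345, 100, -50])
cond = s_demo < 0

# where keeps values where cond is True; mask keeps values where cond is False
left = s_demo.where(cond, 0)
right = s_demo.mask(~cond, 0)

print(left.tolist())       # [-1.0, 0.0, 0.0, -50.0]
print(left.equals(right))  # True
```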

Numeric replacement includes the methods `round, abs, clip`, which represent rounding to a given precision, taking the absolute value, and truncating, respectively:

In [63]:
s = pd.Series([-1, 1.2345, 100, -50])
s.round(2)

0     -1.00
1      1.23
2    100.00
3    -50.00
dtype: float64

In [64]:
s.abs()

0      1.0000
1      1.2345
2    100.0000
3     50.0000
dtype: float64

In [65]:
s.clip(0, 2) # the first two arguments are the lower and upper clipping bounds

0    0.0000
1    1.2345
2    2.0000
3    0.0000
dtype: float64

#### 【Practice】

With `clip`, values that exceed a boundary can only be truncated to the boundary value. If you want to replace the values that exceed the boundary with a custom value instead, how should you do it?

#### 【END】

### 5. Sorting function
There are two ways to sort, one is value sorting, the other is index sorting, and the corresponding functions are `sort_values` and `sort_index`.

In order to demonstrate the sorting function, the following first uses the `set_index` method to set the grade and name columns as indexes. The content of multi-level indexes and the method of index setting will be explained in detail in Chapter 3.

In [66]:
df_demo = df[['Grade', 'Name', 'Height', 'Weight']].set_index(['Grade','Name'])
df_demo.head(3)

Unnamed: 0_level_0,Unnamed: 1_level_0,Height,Weight
Grade,Name,Unnamed: 2_level_1,Unnamed: 3_level_1
Freshman,Gaopeng Yang,158.9,46.0
Freshman,Changqiang You,166.5,70.0
Senior,Mei Sun,188.9,89.0


Sort heights, the default parameter `ascending=True` is in ascending order:

In [67]:
df_demo.sort_values('Height').head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Height,Weight
Grade,Name,Unnamed: 2_level_1,Unnamed: 3_level_1
Junior,Xiaoli Chu,145.4,34.0
Senior,Gaomei Lv,147.3,34.0
Sophomore,Peng Han,147.8,34.0
Senior,Changli Lv,148.7,41.0
Sophomore,Changjuan You,150.5,40.0


In [68]:
df_demo.sort_values('Height', ascending=False).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Height,Weight
Grade,Name,Unnamed: 2_level_1,Unnamed: 3_level_1
Senior,Xiaoqiang Qin,193.9,79.0
Senior,Mei Sun,188.9,89.0
Senior,Gaoli Zhao,186.5,83.0
Freshman,Qiang Han,185.3,87.0
Senior,Qiang Zheng,183.9,87.0


Sorting often involves multiple columns. For example, sort by weight in ascending order and, when weights are equal, by height in descending order:

In [69]:
df_demo.sort_values(['Weight','Height'],ascending=[True,False]).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Height,Weight
Grade,Name,Unnamed: 2_level_1,Unnamed: 3_level_1
Sophomore,Peng Han,147.8,34.0
Senior,Gaomei Lv,147.3,34.0
Junior,Xiaoli Chu,145.4,34.0
Sophomore,Qiang Zhou,150.5,36.0
Freshman,Yanqiang Xu,152.4,38.0


The usage of index sorting is exactly the same as value sorting, except that the value of the element is in the index, and you need to specify the name or level number of the index level, represented by the parameter `level`. In addition, it should be noted that the order of the strings is determined by the alphabetical order.

In [70]:
df_demo.sort_index(level=['Grade','Name'],ascending=[True,False]).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Height,Weight
Grade,Name,Unnamed: 2_level_1,Unnamed: 3_level_1
Freshman,Yanquan Wang,163.5,55.0
Freshman,Yanqiang Xu,152.4,38.0
Freshman,Yanqiang Feng,162.3,51.0
Freshman,Yanpeng Lv,,65.0
Freshman,Yanli Zhang,165.1,52.0


### 6. apply method
The `apply` method is often used for row or column iteration of `DataFrame`. Its `axis` has the same meaning as the statistical aggregation function in Section 2. The parameter of `apply` is often a function with a sequence as input. For example, for `.mean()`, `apply` can be written as follows:

In [71]:
df_demo = df[['Height', 'Weight']]
def my_mean(x):
    res = x.mean()
    return res
df_demo.apply(my_mean)

Height    163.218033
Weight     55.015873
dtype: float64

Similarly, a `lambda` expression makes this more concise. Here, `x` refers to each column of `df_demo`, passed in one at a time as a `Series`:

In [72]:
df_demo.apply(lambda x:x.mean())

Height    163.218033
Weight     55.015873
dtype: float64

If `axis=1` is specified, a `Series` consisting of the elements of one row is passed into the function each time, and the result is consistent with the earlier row-by-row mean.

In [73]:
df_demo.apply(lambda x:x.mean(), axis=1).head()

0    102.45
1    118.25
2    138.95
3     41.00
4    124.00
dtype: float64

Here is another example: the `mad` function returns the mean absolute deviation of a sequence, that is, the mean of the absolute deviations from the mean. For example, for the sequence 1, 3, 7, 10, the mean is 5.25, the absolute deviations of the elements are 4.25, 2.25, 1.75, 4.75, and the mean of these deviations is 3.25. Now use `apply` to calculate the `mad` of height and weight:

In [74]:
df_demo.apply(lambda x:(x-x.mean()).abs().mean())

Height     6.707229
Weight    10.391870
dtype: float64

This is consistent with the result calculated using the built-in `mad` function:

In [75]:
df_demo.mad()

Height     6.707229
Weight    10.391870
dtype: float64

#### 【WARNING】Use `apply` with caution

Thanks to the processing of the passed custom function, `apply` has a high degree of freedom, but this comes at the cost of performance. Generally speaking, the speed of using `pandas`'s built-in functions and `apply` to handle the same task will be quite different, so only consider using `apply` when there is a real need for customization.

#### 【END】
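To get a feel for the performance gap, the following sketch compares a built-in row mean with the same computation through `apply`. The random table `df_rand` and its size are assumptions chosen for illustration, and the measured times will vary by machine and `pandas` version:

```python
import timeit
import numpy as np
import pandas as pd

# Hypothetical random data: 2000 rows, 2 columns
rng = np.random.default_rng(0)
df_rand = pd.DataFrame(rng.normal(size=(2000, 2)), columns=['a', 'b'])

# Both approaches compute the same row-wise means
builtin = df_rand.mean(axis=1)
via_apply = df_rand.apply(lambda row: row.mean(), axis=1)
print(np.allclose(builtin, via_apply))

# Time five repetitions of each approach
t_builtin = timeit.timeit(lambda: df_rand.mean(axis=1), number=5)
t_apply = timeit.timeit(
    lambda: df_rand.apply(lambda row: row.mean(), axis=1), number=5)
print(f'built-in: {t_builtin:.4f}s, apply: {t_apply:.4f}s')
```

The results agree, so the choice between the two is purely about speed and flexibility.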

## 4. Window Objects
There are 3 types of windows in `pandas`, namely sliding windows `rolling`, expanding windows `expanding`, and exponentially weighted windows `ewm`. It should be noted that the sliding window with date offset as the window size will be discussed in Chapter 10, and the exponentially weighted window can be seen in the exercises of this chapter.

### 1. Sliding Window Objects
To use the sliding window function, you must first use `.rolling` on a sequence to get the sliding window object, and its most important parameter is the window size `window`.

In [76]:
s = pd.Series([1,2,3,4,5])
roller = s.rolling(window = 3)
roller

Rolling [window=3,center=False,axis=0]

After obtaining the sliding window object, you can use the corresponding aggregate function for calculation. It should be noted that the window includes the element where the current row is located. For example, when performing the mean operation at the fourth position, (2+3+4)/3 should be calculated instead of (1+2+3)/3:

In [77]:
roller.mean()

0    NaN
1    NaN
2    2.0
3    3.0
4    4.0
dtype: float64

In [78]:
roller.sum()

0     NaN
1     NaN
2     6.0
3     9.0
4    12.0
dtype: float64

The calculation of the sliding correlation coefficient or sliding covariance can be written as follows:

In [79]:
s2 = pd.Series([1,2,6,16,30])
roller.cov(s2)

0     NaN
1     NaN
2     2.5
3     7.0
4    12.0
dtype: float64

In [80]:
roller.corr(s2)

0         NaN
1         NaN
2    0.944911
3    0.970725
4    0.995402
dtype: float64

In addition, it also supports using `apply` to pass in a custom function, where the passed value is the `Series` of the corresponding window. For example, the above mean function can be equivalently expressed as:

In [81]:
roller.apply(lambda x:x.mean())

0    NaN
1    NaN
2    2.0
3    3.0
4    4.0
dtype: float64

`shift, diff, pct_change` are a group of sliding-window-like functions. Their common parameter is `periods=n`, with a default of 1. They respectively take the value `n` elements earlier, take the difference from the value `n` elements earlier (unlike `Numpy`, where `n` means the `n`-th order difference), and compute the growth rate relative to the value `n` elements earlier. Here, `n` can be negative, which performs the same operations in the opposite direction.

In [82]:
s = pd.Series([1,3,6,10,15])
s.shift(2)

0    NaN
1    NaN
2    1.0
3    3.0
4    6.0
dtype: float64

In [83]:
s.diff(3)

0     NaN
1     NaN
2     NaN
3     9.0
4    12.0
dtype: float64

In [84]:
s.pct_change()

0         NaN
1    2.000000
2    1.000000
3    0.666667
4    0.500000
dtype: float64

In [85]:
s.shift(-1)

0     3.0
1     6.0
2    10.0
3    15.0
4     NaN
dtype: float64

In [86]:
s.diff(-2)

0   -5.0
1   -7.0
2   -9.0
3    NaN
4    NaN
dtype: float64
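The difference between the meaning of `n` in `diff` and in `Numpy` mentioned above can be seen in a short sketch (the name `s_demo` is hypothetical; the values match the series used in this section):

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1, 3, 6, 10, 15])

# pandas: difference with the element 2 positions earlier
lagged = s_demo.diff(2)
# NumPy: second-order difference, i.e. the difference of first differences
second_order = np.diff(s_demo.to_numpy(), n=2)

print(lagged.tolist())        # [nan, nan, 5.0, 7.0, 9.0]
print(second_order.tolist())  # [1, 1, 1]
```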

The reason they are considered sliding window functions is that their functionality can be equivalently replaced by the `rolling` method with a window size of `n+1`:

In [87]:
s.rolling(3).apply(lambda x:list(x)[0]) # s.shift(2)

0    NaN
1    NaN
2    1.0
3    3.0
4    6.0
dtype: float64

In [88]:
s.rolling(4).apply(lambda x:list(x)[-1]-list(x)[0]) # s.diff(3)

0     NaN
1     NaN
2     NaN
3     9.0
4    12.0
dtype: float64

In [89]:
def my_pct(x):
    L = list(x)
    return L[-1]/L[0]-1
s.rolling(2).apply(my_pct) # s.pct_change()

0         NaN
1    2.000000
2    1.000000
3    0.666667
4    0.500000
dtype: float64

#### 【Practice】

The default window direction of the `rolling` object is forward. In some cases, users need a backward window. For example, if the `sum` operation with a backward window of 2 is set for 1, 2, 3, the result is 3, 5, NaN. How should the backward sliding window operation be implemented?

#### 【END】

### 2. Expanding window
The expanding window is also called the cumulative window. It can be understood as a window of dynamic length. The size of the window is the corresponding position from the beginning of the sequence to the specific operation. The aggregation function used will act on these gradually expanding windows. Specifically, let the sequence be a1, a2, a3, a4, then the window corresponding to each position is \[a1\], \[a1, a2\], \[a1, a2, a3\], \[a1, a2, a3, a4\].

In [90]:
s = pd.Series([1, 3, 6, 10])
s.expanding().mean()

0    1.000000
1    2.000000
2    3.333333
3    5.000000
dtype: float64
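The growing windows described above can be inspected with `apply`. This sketch (with the hypothetical name `s_demo`) reports the size of each window and its first element:

```python
import pandas as pd

s_demo = pd.Series([1, 3, 6, 10])

# len(x) is the number of elements in the window passed to the function
sizes = s_demo.expanding().apply(lambda x: len(x))
# Every window starts at the first element of the sequence
firsts = s_demo.expanding().apply(lambda x: list(x)[0])

print(sizes.tolist())   # [1.0, 2.0, 3.0, 4.0]
print(firsts.tolist())  # [1.0, 1.0, 1.0, 1.0]
```

This agrees with the expanding mean above: at the third position, for instance, the window holds three elements and the mean is (1+3+6)/3.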

#### 【Practice】

`cummax, cumsum, cumprod` functions are typical expanding window functions. Please use `expanding` object to implement them in turn.

#### 【END】

## 5. Practice
### Ex1: Pokémon Dataset
There is a Pokémon data set. Here is some background information:

* `#` represents the national illustrated number. The same number in different rows represents the different states of the monster

* Monsters have single attributes and dual attributes. For single-attribute monsters, `Type 2` is a missing value

* `Total, HP, Attack, Defense, Sp. Atk, Sp. Def, Speed` represent race value, physical strength, physical attack, defense, special attack, special defense, and speed respectively. The race value is the sum of the last 6 items.

In [91]:
df = pd.read_csv('../data/pokemon.csv')
df.head(3)

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80


1. Add up `HP, Attack, Defense, Sp. Atk, Sp. Def, Speed` and verify whether it is the `Total` value.

2. For monsters with repeated `#`, only keep the first record and solve the following problems:

* Find the number of types of the first attribute and the types corresponding to the top three numbers

* Find the combination types of the first attribute and the second attribute

* Find the attribute combination that has not appeared yet

3. Construct `Series` according to the following requirements:

* Take out the physical attack, replace it with `high` if it is more than 120, replace it with `low` if it is less than 50, otherwise set it to `mid`

* Take out the first attribute, use `replace` and `apply` to replace all letters with uppercase

* Find the deviation of the six abilities of each monster, that is, the value with the largest deviation from the median among all abilities, add it to `df` and sort it from large to small

### Ex2: Exponentially weighted window

1. `ewm` window as an expansion window

In the expansion window, users can use various functions to perform historical cumulative indicator statistics, but these built-in statistical functions often give the same weight to all elements in the window. In fact, different weights can be given to elements in the window. The exponentially weighted window is such a special expansion window.

The most important parameter is `alpha`, which determines the default window weights $w_i=(1-\alpha)^i,i\in\{0,1,...,t\}$, where $i=0$ corresponds to the current element $x_t$ and $i=t$ corresponds to the first element $x_0$ of the sequence.

From the weight formula, we can see that the farther an element is from the current one, the smaller its weight. If the original sequence is $x$ and the updated current element is $y_t$, then after normalizing by the weights:

$$
\begin{split}y_t &=\frac{\sum_{i=0}^{t} w_i x_{t-i}}{\sum_{i=0}^{t} w_i} \\
&=\frac{x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2 x_{t-2} + ...
+ (1 - \alpha)^{t} x_{0}}{1 + (1 - \alpha) + (1 - \alpha)^2 + ...
+ (1 - \alpha)^{t}}\\\end{split}
$$

For `Series`, the `ewm` object can be used to calculate the exponentially smoothed sequence as follows:

In [92]:
np.random.seed(0)
s = pd.Series(np.random.randint(-1,2,30).cumsum())
s.head()

0   -1
1   -1
2   -2
3   -2
4   -2
dtype: int32

In [93]:
s.ewm(alpha=0.2).mean().head()

0   -1.000000
1   -1.000000
2   -1.409836
3   -1.609756
4   -1.725845
dtype: float64
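As a check of the formula against this output, take $\alpha=0.2$ and the first three values shown above, $x_0=-1$, $x_1=-1$, $x_2=-2$. The third smoothed value is then:

$$
y_2 = \frac{x_2 + (1-\alpha)x_1 + (1-\alpha)^2 x_0}{1 + (1-\alpha) + (1-\alpha)^2}
= \frac{-2 - 0.8 - 0.64}{1 + 0.8 + 0.64}
= \frac{-3.44}{2.44} \approx -1.409836
$$

which agrees with the value at index 2.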

Please use the `expanding` window to implement it.

2. `ewm` window as a sliding window

From question 1, we can see that `ewm`, as a special expanding window, always weights from the first element of the sequence. Now suppose we are given a window size `n` and want to use only the most recent `n` elements, including the current one, as the window for weighted smoothing. Please give the new update formulas for $w_i$ and $y_t$ based on the sliding window, and implement this through the `rolling` window.