<center><h1>Chapter 5 Transformation</h1></center>

In [1]:
import numpy as np
import pandas as pd

## 1. Transformation of long and wide tables

What is a long table, and what is a wide table? The distinction is always relative to a particular feature. If a table stores gender as the values of a column, it is a long table with respect to gender; if the gender values are used as column names and the cells under them hold other related feature values, it is a wide table with respect to gender. The following two tables are, respectively, the long and the wide form with respect to gender:

In [2]:
pd.DataFrame({'Gender':['F','F','M','M'], 'Height':[163, 160, 175, 180]})

Unnamed: 0,Gender,Height
0,F,163
1,F,160
2,M,175
3,M,180


In [3]:
pd.DataFrame({'Height: F':[163, 160], 'Height: M':[175, 180]})

Unnamed: 0,Height: F,Height: M
0,163,175
1,160,180


These two tables carry exactly the same information: both contain the same height statistics. Only the presentation differs, and it is determined by the layout chosen for the gender column, that is, whether gender is stored in $\color{red}{long}$ or $\color{red}{wide}$ form. `pandas` provides several functions for converting between these long and wide layouts.

### 1. pivot

`pivot` is the typical function for transforming a long table into a wide table. Consider an example: the following table stores the Chinese and math grades of San Zhang and Si Li, and we want to display the Chinese and math grades as separate columns.

In [4]:
df = pd.DataFrame({'Class':[1,1,2,2],
                   'Name':['San Zhang','San Zhang','Si Li','Si Li'],
                   'Subject':['Chinese','Math','Chinese','Math'],
                   'Grade':[80,75,90,85]})
df

Unnamed: 0,Class,Name,Subject,Grade
0,1,San Zhang,Chinese,80
1,1,San Zhang,Math,75
2,2,Si Li,Chinese,90
3,2,Si Li,Math,85


For a basic long-to-wide operation, the three essential elements are the row index of the result, the column whose values become the column index, and the values that fill the cells. These correspond to the `index`, `columns`, and `values` parameters of the `pivot` method. The column index of the new table consists of the unique values of the column passed to `columns`, the row index consists of the unique values of the column passed to `index`, and `values` names the column whose entries are displayed.

In [5]:
df.pivot(index='Name', columns='Subject', values='Grade')

Subject,Chinese,Math
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
San Zhang,80,75
Si Li,90,85


The color coding makes the transformation easier to follow:

<img src="../source/_static/ch5_pivot.png" width="20%">

`pivot` requires uniqueness: because each row and column position in the new table holds exactly one value, every combination of the `index` and `columns` values in the original table must appear only once. For example, if San Zhang's subject in the second row of the original table is changed from math to Chinese, an error is raised. The pair `("San Zhang", "Chinese")` then appears twice among the combinations of `Name` and `Subject`, so `pivot` cannot determine whether the cell should hold 80 or 75.

In [6]:
df.loc[1, 'Subject'] = 'Chinese'
try:
    df.pivot(index='Name', columns='Subject', values='Grade')
except Exception as e:
    Err_Msg = e
Err_Msg

ValueError('Index contains duplicate entries, cannot reshape')
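The uniqueness requirement can be checked before calling `pivot`. The following is a minimal sketch, assuming a small hypothetical table `df_dup` (not part of the original example) in which one `Name`/`Subject` pair is repeated; `DataFrame.duplicated` flags the offending rows.

```python
import pandas as pd

# Hypothetical long table in which ('San Zhang', 'Chinese') appears twice.
df_dup = pd.DataFrame({'Name': ['San Zhang', 'San Zhang', 'Si Li', 'Si Li'],
                       'Subject': ['Chinese', 'Chinese', 'Chinese', 'Math'],
                       'Grade': [80, 75, 90, 85]})

# keep=False marks every row that takes part in a repeated combination.
dup_mask = df_dup.duplicated(subset=['Name', 'Subject'], keep=False)

# pivot is only safe when no combination is repeated.
can_pivot = not dup_mask.any()
offending_rows = df_dup[dup_mask]
print(can_pivot)
print(offending_rows)
```

If `can_pivot` is false, the repeated rows either need to be cleaned up or aggregated, which is what `pivot_table` is for.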

Since pandas 1.1.0, each of the three `pivot` parameters can also be given as a list, in which case the result has a multi-level index. The following example illustrates this. The six columns of the table are class, name, examination type (midterm or final), subject, grade, and rank.

In [7]:
df = pd.DataFrame({'Class':[1, 1, 2, 2, 1, 1, 2, 2],
                   'Name':['San Zhang', 'San Zhang', 'Si Li', 'Si Li',
                              'San Zhang', 'San Zhang', 'Si Li', 'Si Li'],
                   'Examination': ['Mid', 'Final', 'Mid', 'Final',
                                    'Mid', 'Final', 'Mid', 'Final'],
                   'Subject':['Chinese', 'Chinese', 'Chinese', 'Chinese',
                                 'Math', 'Math', 'Math', 'Math'],
                   'Grade':[80, 75, 85, 65, 90, 85, 92, 88],
                   'rank':[10, 15, 21, 15, 20, 7, 6, 2]})
df

Unnamed: 0,Class,Name,Examination,Subject,Grade,rank
0,1,San Zhang,Mid,Chinese,80,10
1,1,San Zhang,Final,Chinese,75,15
2,2,Si Li,Mid,Chinese,85,21
3,2,Si Li,Final,Chinese,65,15
4,1,San Zhang,Mid,Math,90,20
5,1,San Zhang,Final,Math,85,7
6,2,Si Li,Mid,Math,92,6
7,2,Si Li,Final,Math,88,2


Now we want to move the four categories formed by examination type and subject (midterm Chinese, final Chinese, midterm math, final math) into the column index, and show both grades and ranks:

In [8]:
pivot_multi = df.pivot(index = ['Class', 'Name'],
                       columns = ['Subject','Examination'],
                       values = ['Grade','rank'])
pivot_multi

Unnamed: 0_level_0,Unnamed: 1_level_0,Grade,Grade,Grade,Grade,rank,rank,rank,rank
Unnamed: 0_level_1,Subject,Chinese,Chinese,Math,Math,Chinese,Chinese,Math,Math
Unnamed: 0_level_2,Examination,Mid,Final,Mid,Final,Mid,Final,Mid,Final
Class,Name,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3
1,San Zhang,80,75,90,85,10,15,20,7
2,Si Li,85,65,92,88,21,15,6,2


By the uniqueness principle, the row index of the new table is what `drop_duplicates` would return for the columns listed in `index`, and the length of the column index is the number of elements in `values` multiplied by the number of unique combinations of the columns listed in `columns`. The following diagram illustrates the operation:

<img src="../source/_static/ch5_mulpivot.png" width="35%">
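The shape rule stated above can be verified directly. This sketch rebuilds the eight-row examination table under the assumed name `df_exam` and compares the shape of the pivoted result with the counts obtained from `drop_duplicates`.

```python
import pandas as pd

df_exam = pd.DataFrame({
    'Class': [1, 1, 2, 2, 1, 1, 2, 2],
    'Name': ['San Zhang', 'San Zhang', 'Si Li', 'Si Li'] * 2,
    'Examination': ['Mid', 'Final'] * 4,
    'Subject': ['Chinese'] * 4 + ['Math'] * 4,
    'Grade': [80, 75, 85, 65, 90, 85, 92, 88],
    'rank': [10, 15, 21, 15, 20, 7, 6, 2]})

index_cols = ['Class', 'Name']
column_cols = ['Subject', 'Examination']
value_cols = ['Grade', 'rank']

wide = df_exam.pivot(index=index_cols, columns=column_cols, values=value_cols)

# Rows: unique combinations of the index columns.
n_rows = len(df_exam[index_cols].drop_duplicates())
# Columns: number of value columns times unique combinations of the column keys.
n_cols = len(value_cols) * len(df_exam[column_cols].drop_duplicates())

print(wide.shape, (n_rows, n_cols))
```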

### 2. pivot_table

`pivot` depends on the uniqueness condition. When that condition is not met, the multiple values that share a row and column combination must be aggregated into one. For example, suppose San Zhang and Si Li each took two Chinese exams and two math exams, and college regulations define the final grade as the average of the two. `pivot` cannot handle this case.

In [9]:
df = pd.DataFrame({'Name':['San Zhang', 'San Zhang', 
                              'San Zhang', 'San Zhang',
                              'Si Li', 'Si Li', 'Si Li', 'Si Li'],
                   'Subject':['Chinese', 'Chinese', 'Math', 'Math',
                                 'Chinese', 'Chinese', 'Math', 'Math'],
                   'Grade':[80, 90, 100, 90, 70, 80, 85, 95]})
df

Unnamed: 0,Name,Subject,Grade
0,San Zhang,Chinese,80
1,San Zhang,Chinese,90
2,San Zhang,Math,100
3,San Zhang,Math,90
4,Si Li,Chinese,70
5,Si Li,Chinese,80
6,Si Li,Math,85
7,Si Li,Math,95


`pandas` provides `pivot_table` to achieve this, where the `aggfunc` parameter is the aggregation function used. The above scenario can be written as follows:

In [10]:
df.pivot_table(index = 'Name',
               columns = 'Subject',
               values = 'Grade',
               aggfunc = 'mean')

Subject,Chinese,Math
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
San Zhang,85,95
Si Li,75,90


`aggfunc` accepts all of the valid aggregation strings introduced in the previous chapter. You can also pass a custom aggregation function that takes a sequence and returns a scalar. The call above can equivalently be written as:

In [11]:
df.pivot_table(index = 'Name',
               columns = 'Subject',
               values = 'Grade',
               aggfunc = lambda x:x.mean())

Subject,Chinese,Math
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
San Zhang,85,95
Si Li,75,90
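A custom function is most useful when no built-in string covers the statistic you need. As a sketch, the hypothetical helper `score_range` below (not part of pandas) returns the gap between the best and worst grade of each `Name`/`Subject` pair, using the same eight-row table as above.

```python
import pandas as pd

df_scores = pd.DataFrame({
    'Name': ['San Zhang'] * 4 + ['Si Li'] * 4,
    'Subject': ['Chinese', 'Chinese', 'Math', 'Math'] * 2,
    'Grade': [80, 90, 100, 90, 70, 80, 85, 95]})

def score_range(grades):
    """Illustrative aggregator: takes a Series, returns one scalar."""
    return grades.max() - grades.min()

spread = df_scores.pivot_table(index='Name',
                               columns='Subject',
                               values='Grade',
                               aggfunc=score_range)
print(spread)
```

In this particular dataset every pair differs by 10 points, so all four cells hold the same value.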


`pivot_table` also supports marginal aggregation through `margins=True`; the margins are computed with the same aggregation given in `aggfunc`. The following call computes the average Chinese and math grades, the average grades of San Zhang and Si Li, and the overall average of all grades:

In [12]:
df.pivot_table(index = 'Name',
               columns = 'Subject',
               values = 'Grade',
               aggfunc='mean',
               margins=True)

Subject,Chinese,Math,All
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
San Zhang,85,95.0,90.0
Si Li,75,90.0,82.5
All,80,92.5,86.25


#### 【Practice】
In the margins example above, each row or column margin equals the average of the corresponding row or column elements in the new table, and the overall margin equals the average of the four elements in the new table. Does this relationship always hold? If not, give a counterexample.
#### 【END】
### 3. melt

Long and wide tables differ only in how the data is presented; the information they contain is equivalent. Since `pivot` converts a long table into a wide one, there is a corresponding inverse operation that converts a wide table into a long one, and `melt` plays this role. In the following example, `Subject` is stored in the column index, and we want to compress it into a single column.

In [13]:
df = pd.DataFrame({'Class':[1,2],
                   'Name':['San Zhang', 'Si Li'],
                   'Chinese':[80, 90],
                   'Math':[80, 75]})
df

Unnamed: 0,Class,Name,Chinese,Math
0,1,San Zhang,80,80
1,2,Si Li,90,75


In [14]:
df_melted = df.melt(id_vars = ['Class', 'Name'],
                    value_vars = ['Chinese', 'Math'],
                    var_name = 'Subject',
                    value_name = 'Grade')
df_melted

Unnamed: 0,Class,Name,Subject,Grade
0,1,San Zhang,Chinese,80
1,2,Si Li,Chinese,90
2,1,San Zhang,Math,80
3,2,Si Li,Math,75


The main parameters of `melt` and the compression process are shown in the figure below:

<img src="../source/_static/ch5_melt.png" width="35%">

Because `melt` and `pivot` are inverse operations, `df_melted` can be converted back to `df` with `pivot`:

In [15]:
df_unmelted = df_melted.pivot(index = ['Class', 'Name'],
                              columns='Subject',
                              values='Grade')
df_unmelted # next, restore the index and rename the column index name

Unnamed: 0_level_0,Subject,Chinese,Math
Class,Name,Unnamed: 2_level_1,Unnamed: 3_level_1
1,San Zhang,80,80
2,Si Li,90,75


In [16]:
df_unmelted = df_unmelted.reset_index().rename_axis(columns={'Subject':''})
df_unmelted.equals(df)

True

### 4. wide_to_long

In `melt`, the values gathered from the compressed columns all land in one column, so they can carry only a single meaning, the one named by `value_name`. Now suppose the column labels encode crossed categories, such as examination type (midterm or final) and subject (Chinese or math). If you want to split the `Grade` column named by `value_name` into two columns, one for Chinese grades and one for math grades, and compress only the examination type, use `wide_to_long`.

In [17]:
df = pd.DataFrame({'Class':[1,2],'Name':['San Zhang', 'Si Li'],
                   'Chinese_Mid':[80, 75], 'Math_Mid':[90, 85],
                   'Chinese_Final':[80, 75], 'Math_Final':[90, 85]})
df

Unnamed: 0,Class,Name,Chinese_Mid,Math_Mid,Chinese_Final,Math_Final
0,1,San Zhang,80,90,80,90
1,2,Si Li,75,85,75,85


In [18]:
pd.wide_to_long(df,
                stubnames=['Chinese', 'Math'],
                i = ['Class', 'Name'],
                j='Examination',
                sep='_',
                suffix='.+')

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Chinese,Math
Class,Name,Examination,Unnamed: 3_level_1,Unnamed: 4_level_1
1,San Zhang,Mid,80,90
1,San Zhang,Final,80,90
2,Si Li,Mid,75,85
2,Si Li,Final,75,85


The specific transformation process is shown in the figure below. Elements of the same concept are marked with the same color:

<img src="../source/_static/ch5_wtl.png" width="35%">
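One way to see what `wide_to_long` does is to reproduce it by hand: melt every score column, split the compressed label into its two parts, then pivot the subject part back out. The sketch below assumes a hypothetical table `df_w` whose midterm and final values differ (so the comparison is meaningful) and uses `str.split` on the `_` separator.

```python
import pandas as pd

# Hypothetical wide table; values differ between Mid and Final on purpose.
df_w = pd.DataFrame({'Class': [1, 2], 'Name': ['San Zhang', 'Si Li'],
                     'Chinese_Mid': [80, 75], 'Math_Mid': [90, 85],
                     'Chinese_Final': [70, 65], 'Math_Final': [95, 88]})

# Direct route.
via_w2l = pd.wide_to_long(df_w, stubnames=['Chinese', 'Math'],
                          i=['Class', 'Name'], j='Examination',
                          sep='_', suffix='.+')

# Manual route: compress everything, split the label, spread the subject.
long = df_w.melt(id_vars=['Class', 'Name'], var_name='key', value_name='Grade')
long[['Subject', 'Examination']] = long['key'].str.split('_', expand=True)
via_melt = long.pivot(index=['Class', 'Name', 'Examination'],
                      columns='Subject', values='Grade')

# Row order may differ between the two routes, so compare after sorting.
same = (via_w2l.sort_index().to_numpy() == via_melt.sort_index().to_numpy()).all()
print(same)
```

The manual route makes explicit that the stub names become columns while only the suffix is compressed into the row index.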

The following is a more complex example: the result of the multi-column `pivot` from the earlier section (which has a multi-level index) is converted back to its original form with `wide_to_long`. The code uses `str.split`, which is covered in Chapter 8; for now, read it as splitting each string on a given delimiter.

In [19]:
res = pivot_multi.copy()
res.columns = res.columns.map(lambda x:'_'.join(x))
res = res.reset_index()
res = pd.wide_to_long(res, stubnames=['Grade', 'rank'],
                           i = ['Class', 'Name'],
                           j = 'Subject_Examination',
                           sep = '_',
                           suffix = '.+')
res

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Grade,rank
Class,Name,Subject_Examination,Unnamed: 3_level_1,Unnamed: 4_level_1
1,San Zhang,Chinese_Mid,80,10
1,San Zhang,Chinese_Final,75,15
1,San Zhang,Math_Mid,90,20
1,San Zhang,Math_Final,85,7
2,Si Li,Chinese_Mid,85,21
2,Si Li,Chinese_Final,65,15
2,Si Li,Math_Mid,92,6
2,Si Li,Math_Final,88,2


In [20]:
res = res.reset_index()
res[['Subject', 'Examination']] = res['Subject_Examination'].str.split('_', expand=True)
res = res[['Class', 'Name', 'Examination', 'Subject', 'Grade', 'rank']].sort_values('Subject')
res = res.reset_index(drop=True)
res

Unnamed: 0,Class,Name,Examination,Subject,Grade,rank
0,1,San Zhang,Mid,Chinese,80,10
1,1,San Zhang,Final,Chinese,75,15
2,2,Si Li,Mid,Chinese,85,21
3,2,Si Li,Final,Chinese,65,15
4,1,San Zhang,Mid,Math,90,20
5,1,San Zhang,Final,Math,85,7
6,2,Si Li,Mid,Math,92,6
7,2,Si Li,Final,Math,88,2


## 2. Index Transformation

### 1. stack and unstack

In Chapter 2, we used `swaplevel` and `reorder_levels` to exchange levels within a single index. Now we discuss exchanging levels between the $\color{red}{row and column indices}$. Because this exchange changes the shape of the `DataFrame`, it is a transformation operation. It differs from the four functions in Section 1, which all convert between the $\color{red}{elements}$ of one or more columns and the $\color{red}{column indices}$ rather than between indices.

`unstack` converts row index levels into column index levels, as in the following simple example:

In [21]:
df = pd.DataFrame(np.ones((4,2)),
                  index = pd.Index([('A', 'cat', 'big'),
                                    ('A', 'dog', 'small'),
                                    ('B', 'cat', 'big'),
                                    ('B', 'dog', 'small')]),
                  columns=['col_1', 'col_2'])
df

Unnamed: 0,Unnamed: 1,Unnamed: 2,col_1,col_2
A,cat,big,1.0,1.0
A,dog,small,1.0,1.0
B,cat,big,1.0,1.0
B,dog,small,1.0,1.0


In [22]:
df.unstack()

Unnamed: 0_level_0,Unnamed: 1_level_0,col_1,col_1,col_2,col_2
Unnamed: 0_level_1,Unnamed: 1_level_1,big,small,big,small
A,cat,1.0,,1.0,
A,dog,,1.0,,1.0
B,cat,1.0,,1.0,
B,dog,,1.0,,1.0


The main parameter of `unstack` is the level to move. By default, the innermost row level is moved to the innermost level of the column index. Several levels can also be moved at once:

In [23]:
df.unstack(2)

Unnamed: 0_level_0,Unnamed: 1_level_0,col_1,col_1,col_2,col_2
Unnamed: 0_level_1,Unnamed: 1_level_1,big,small,big,small
A,cat,1.0,,1.0,
A,dog,,1.0,,1.0
B,cat,1.0,,1.0,
B,dog,,1.0,,1.0


In [24]:
df.unstack([0,2])

Unnamed: 0_level_0,col_1,col_1,col_1,col_1,col_2,col_2,col_2,col_2
Unnamed: 0_level_1,A,A,B,B,A,A,B,B
Unnamed: 0_level_2,big,small,big,small,big,small,big,small
cat,1.0,,1.0,,1.0,,1.0,
dog,,1.0,,1.0,,1.0,,1.0


Like `pivot`, `unstack` has a uniqueness requirement: the combination of the $\color{red}{row index layer converted to column index}$ and the $\color{red}{retained row index layer}$ must be unique. For example, making the first two row index entries identical breaks uniqueness, and an error is raised:

In [25]:
my_index = df.index.to_list()
my_index[1] = my_index[0]
df.index = pd.Index(my_index)
df

Unnamed: 0,Unnamed: 1,Unnamed: 2,col_1,col_2
A,cat,big,1.0,1.0
A,cat,big,1.0,1.0
B,cat,big,1.0,1.0
B,dog,small,1.0,1.0


In [26]:
try:
    df.unstack()
except Exception as e:
    Err_Msg = e
Err_Msg

ValueError('Index contains duplicate entries, cannot reshape')

Conversely, `stack` moves levels of the column index into the row index, and it is used in the same way as `unstack`.

In [27]:
df = pd.DataFrame(np.ones((4,2)),
                  index = pd.Index([('A', 'cat', 'big'),
                                    ('A', 'dog', 'small'),
                                    ('B', 'cat', 'big'),
                                    ('B', 'dog', 'small')]),
                  columns=['index_1', 'index_2']).T
df

Unnamed: 0_level_0,A,A,B,B
Unnamed: 0_level_1,cat,dog,cat,dog
Unnamed: 0_level_2,big,small,big,small
index_1,1.0,1.0,1.0,1.0
index_2,1.0,1.0,1.0,1.0


In [28]:
df.stack()

Unnamed: 0_level_0,Unnamed: 1_level_0,A,A,B,B
Unnamed: 0_level_1,Unnamed: 1_level_1,cat,dog,cat,dog
index_1,big,1.0,,1.0,
index_1,small,,1.0,,1.0
index_2,big,1.0,,1.0,
index_2,small,,1.0,,1.0


In [29]:
df.stack([1, 2])

Unnamed: 0,Unnamed: 1,Unnamed: 2,A,B
index_1,cat,big,1.0,1.0
index_1,dog,small,1.0,1.0
index_2,cat,big,1.0,1.0
index_2,dog,small,1.0,1.0
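Since `stack` and `unstack` move a level in opposite directions, applying one after the other should return the original layout when no missing values are introduced. The sketch below assumes a small hypothetical frame `df_sq` whose row index is a full product of its two levels, so `unstack` produces no `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: every (letter, animal) pair is present exactly once.
df_sq = pd.DataFrame(np.arange(8).reshape(4, 2),
                     index=pd.MultiIndex.from_product([['A', 'B'],
                                                       ['cat', 'dog']]),
                     columns=['col_1', 'col_2'])

wide = df_sq.unstack()   # innermost row level -> innermost column level
back = wide.stack()      # innermost column level -> innermost row level

print(wide.shape, back.shape)
print((back.to_numpy() == df_sq.to_numpy()).all())
```

With an incomplete index, as in the examples above, the round trip is not guaranteed to be exact because the intermediate table contains missing values.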


### 2. The relationship between aggregation and transformation

Of all the functions introduced above, only `pivot_table` aggregates; the others leave the number of `values` unchanged and alter only how they are presented. The grouped aggregation discussed in the previous chapter can also be seen as a special kind of transformation, because it produces new row and column indexes. However, aggregation collapses multiple original values into one, so the number of `values` changes. This is the main difference between grouped aggregation and the transformation functions.
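The difference in value counts can be made concrete. In this sketch, `long_unique` and `long_repeated` are hypothetical tables modeled on the earlier examples: `pivot` keeps all four values, while `pivot_table` reduces eight values to four.

```python
import pandas as pd

# Each (Name, Subject) pair appears once: a pure transformation is possible.
long_unique = pd.DataFrame({'Name': ['San Zhang', 'San Zhang', 'Si Li', 'Si Li'],
                            'Subject': ['Chinese', 'Math', 'Chinese', 'Math'],
                            'Grade': [80, 75, 90, 85]})
wide = long_unique.pivot(index='Name', columns='Subject', values='Grade')

# Each pair appears twice: the values must be aggregated.
long_repeated = pd.DataFrame({'Name': ['San Zhang'] * 4 + ['Si Li'] * 4,
                              'Subject': ['Chinese', 'Chinese', 'Math', 'Math'] * 2,
                              'Grade': [80, 90, 100, 90, 70, 80, 85, 95]})
agg = long_repeated.pivot_table(index='Name', columns='Subject',
                                values='Grade', aggfunc='mean')

print(len(long_unique), wide.size)    # value count preserved
print(len(long_repeated), agg.size)   # value count reduced
```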

## 3. Other transformation functions

### 1. crosstab

`crosstab` is a somewhat redundant function, because everything it does can also be done with `pivot_table`. By default, `crosstab` counts the frequency of element combinations, that is, it performs a `count`. For example, count the combinations of school and transfer status in the `learn_pandas` dataset:

In [30]:
df = pd.read_csv('../data/learn_pandas.csv')
pd.crosstab(index = df.School, columns = df.Transfer)

Transfer,N,Y
School,Unnamed: 1_level_1,Unnamed: 2_level_1
Fudan University,38,1
Peking University,28,2
Shanghai Jiao Tong University,53,0
Tsinghua University,62,4


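This is equivalent to the following `crosstab` call, in which `aggfunc` specifies the aggregation (the only visible difference is that an empty combination appears as a missing value rather than 0):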

In [31]:
pd.crosstab(index = df.School, columns = df.Transfer, values = [0]*df.shape[0], aggfunc = 'count')

Transfer,N,Y
School,Unnamed: 1_level_1,Unnamed: 2_level_1
Fudan University,38.0,1.0
Peking University,28.0,2.0
Shanghai Jiao Tong University,53.0,
Tsinghua University,62.0,4.0


Similarly, `pivot_table` can perform the equivalent operation. Because only the frequency of combinations is counted, the result does not depend on which column is passed to `values`, provided that column has no missing values:

In [32]:
df.pivot_table(index = 'School',
               columns = 'Transfer',
               values = 'Name',
               aggfunc = 'count')

Transfer,N,Y
School,Unnamed: 1_level_1,Unnamed: 2_level_1
Fudan University,38.0,1.0
Peking University,28.0,2.0
Shanghai Jiao Tong University,53.0,
Tsinghua University,62.0,4.0


As shown above, the two functions differ in what they take as arguments: `crosstab` takes the actual sequences, whereas `pivot_table` takes column names of the DataFrame on which it is called. Passing the sequences themselves to `pivot_table` instead raises an error.

Besides the default `count`, any aggregation string or custom function that returns a scalar can be used, for example to compute the average height for each combination:
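The counts from `pivot_table` can be made to match the default `crosstab` output by filling empty combinations with 0. The sketch below uses a hypothetical five-row table `df_s` rather than the `learn_pandas` dataset, and relies on the `fill_value` parameter of `pivot_table`.

```python
import pandas as pd

# Hypothetical data: school 'Y' has no transfer students.
df_s = pd.DataFrame({'Name': ['a', 'b', 'c', 'd', 'e'],
                     'School': ['X', 'X', 'Y', 'Y', 'Y'],
                     'Transfer': ['N', 'Y', 'N', 'N', 'N']})

# crosstab takes the sequences themselves.
ct = pd.crosstab(index=df_s.School, columns=df_s.Transfer)

# pivot_table takes column names; fill_value=0 replaces the empty cell.
pt = df_s.pivot_table(index='School', columns='Transfer',
                      values='Name', aggfunc='count', fill_value=0)

print(ct)
print(pt)
```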

In [33]:
pd.crosstab(index = df.School, columns = df.Transfer, values = df.Height, aggfunc = 'mean')

Transfer,N,Y
School,Unnamed: 1_level_1,Unnamed: 2_level_1
Fudan University,162.04375,177.2
Peking University,163.42963,162.4
Shanghai Jiao Tong University,163.953846,
Tsinghua University,163.253571,164.55


### 2. explode

The `explode` method expands the elements of a column vertically. The cells to be expanded must hold one of `list, tuple, set, Series, np.ndarray`; other cells, such as the string in the example below, are left unchanged.

In [34]:
df_ex = pd.DataFrame({'A': [[1, 2], 'my_str', {1, 2}, pd.Series([3, 4])],
                      'B': 1})
df_ex

Unnamed: 0,A,B
0,"[1, 2]",1
1,my_str,1
2,"{1, 2}",1
3,0 3 1 4 dtype: int64,1


In [35]:
df_ex.explode('A')

Unnamed: 0,A,B
0,1,1
0,2,1
1,my_str,1
2,1,1
2,2,1
3,3,1
3,4,1


### 3. get_dummies

`get_dummies` is an important function for feature construction: it converts a categorical feature into indicator variables. For example, convert the `Grade` column into indicator variables, where the column for the grade a row belongs to is marked 1 and all others are marked 0:

In [36]:
pd.get_dummies(df.Grade).head()

Unnamed: 0,Freshman,Junior,Senior,Sophomore
0,1,0,0,0
1,1,0,0,0
2,0,0,1,0
3,0,0,0,1
4,0,0,0,1


## 4. Exercises
### Ex1: US illegal drug dataset

The following dataset covers illegal drugs in the United States; `SubstanceName` and `DrugReports` are the drug name and the number of reports, respectively:

In [37]:
df = pd.read_csv('../data/drugs.csv').sort_values(['State','COUNTY','SubstanceName'],ignore_index=True)
df.head(3)

Unnamed: 0,YYYY,State,COUNTY,SubstanceName,DrugReports
0,2011,KY,ADAIR,Buprenorphine,3
1,2012,KY,ADAIR,Buprenorphine,5
2,2013,KY,ADAIR,Buprenorphine,4


1. Convert the data into the following format:

<img src="../source/_static/Ex5_1.png" width="35%">

2. Restore the result in question 1 to the original table.
3. Compute the total number of reports per year for each `State`, with `State` as the column index and `YYYY` as the row index. Implement this in two ways, with `pivot_table` and with `groupby` + `unstack`, and consider how the two are connected.

### Ex2: Special wide_to_long method

Functionally, `melt` is a special case of `wide_to_long` in which `stubnames` has only one category. Use `wide_to_long` to generate the `df_melted` from the `melt` section. (Hint: add a suitable prefix to the column names.)

In [38]:
df = pd.DataFrame({'Class':[1,2],
                   'Name':['San Zhang', 'Si Li'],
                   'Chinese':[80, 90],
                   'Math':[80, 75]})
df

Unnamed: 0,Class,Name,Chinese,Math
0,1,San Zhang,80,80
1,2,Si Li,90,75
