## Data Normalization with Pandas

Goal: learn how to apply a function to each group after a groupby.

***What normalization does:***    
It maps values into the range [0, 1]. This is a rescaling of the values and does not change their relative order.

***The normalization formula:***  
<div style="text-align:left; width:300px;"><img src="./other_files/Normalization-Formula.jpg" style=""/></div>
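
For readers who cannot load the image, the formula can be written out as follows. This assumes the image shows the standard min-max formula, which is what the code in this section computes; the symbols $x_{\min}$ and $x_{\max}$ are notation introduced here for the column minimum and maximum.

$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$

As a worked example using numbers that appear later in this section (closing price 104.32, column minimum 15.72, column maximum 169.48):

$$
x' = \frac{104.32 - 15.72}{169.48 - 15.72} = \frac{88.60}{153.76} \approx 0.5762
$$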

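***Why normalize:***  
* Comparability: after normalization all indicators are on the same scale, so they can be compared and evaluated together;
* Faster convergence: it often helps machine learning models converge faster during training.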


***In this demo:***
1. Global normalization of numeric columns
2. Normalization within each group after a groupby

### 1. Global normalization of numeric columns

Example: normalizing the stock data of 4 internet companies

In [20]:
import pandas as pd
stocks = pd.read_excel('./datas/stocks/互联网公司股票.xlsx')
stocks.head()

Unnamed: 0,日期,公司,收盘,开盘,高,低,交易量,涨跌幅
0,2019-10-03,BIDU,104.32,102.35,104.73,101.15,2.24,0.02
1,2019-10-02,BIDU,102.62,100.85,103.24,99.5,2.69,0.01
2,2019-10-01,BIDU,102.0,102.8,103.26,101.0,1.78,-0.01
3,2019-10-03,BABA,169.48,166.65,170.18,165.0,10.39,0.02
4,2019-10-02,BABA,165.77,162.82,166.88,161.9,11.6,0.0


***Normalizing a single column***

In [21]:
max_value = stocks["收盘"].max()
max_value

169.48

In [22]:
min_value = stocks["收盘"].min()
min_value

15.72

In [23]:
stocks["收盘"].apply(lambda x:(x-min_value)/(max_value-min_value))

0     0.576223
1     0.565166
2     0.561134
3     1.000000
4     0.975871
5     0.971839
6     0.002211
7     0.000000
8     0.001301
9     0.085068
10    0.080255
11    0.081100
Name: 收盘, dtype: float64
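
The same single-column normalization can also be written without `apply`, using pandas' element-wise arithmetic on the whole Series. The sketch below is a minimal illustration on a small hand-made Series (`close` and `close_norm` are names chosen here, and the four values are a subset of the closing prices shown above), not a replacement for the cell above.

```python
import pandas as pd

# A small stand-in for stocks["收盘"]; values taken from the table above.
close = pd.Series([104.32, 102.62, 169.48, 15.72], name="收盘")

# Subtracting and dividing a Series by scalars is applied element-wise,
# so no lambda is needed.
close_norm = (close - close.min()) / (close.max() - close.min())

# The minimum maps to 0, the maximum to 1, and the ordering is unchanged.
print(close_norm)
print((close.rank() == close_norm.rank()).all())
```

On large columns the vectorized form is usually faster than a Python-level lambda, and it reads closer to the formula.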

***Normalizing multiple columns***

In [24]:
for column_name in ('收盘', '开盘', '高', '低', '交易量', '涨跌幅'):
    min_value = stocks[column_name].min()
    max_value = stocks[column_name].max()
    stocks[column_name] = stocks[column_name].apply(lambda x:(x-min_value)/(max_value-min_value))

In [25]:
stocks.head()

Unnamed: 0,日期,公司,收盘,开盘,高,低,交易量,涨跌幅
0,2019-10-03,BIDU,0.576223,0.568877,0.575854,0.573993,0.037067,0.75
1,2019-10-02,BIDU,0.565166,0.559028,0.566198,0.562984,0.073328,0.5
2,2019-10-01,BIDU,0.561134,0.571832,0.566328,0.572992,0.0,0.0
3,2019-10-03,BABA,1.0,0.99107,1.0,1.0,0.693795,0.75
4,2019-10-02,BABA,0.975871,0.965923,0.978614,0.979317,0.791297,0.25


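Thanks to normalization:  
the 收盘 (close), 开盘 (open), 高 (high) and 低 (low) columns, whose raw values are large, can now be compared on the same scale with 交易量 (volume) and 涨跌幅 (change), whose raw values are small.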
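
The multi-column loop above can also be sketched as a single expression, because subtracting a Series of per-column minimums from a DataFrame aligns on column names. This is a minimal illustration on a hand-made frame (`sample`, `columns`, `col_min` and `col_max` are names introduced here); since it only contains three rows, the resulting numbers differ from the full-table output shown earlier.

```python
import pandas as pd

# Hypothetical three-row sample with two of the numeric columns.
sample = pd.DataFrame({
    "收盘": [104.32, 102.62, 169.48],
    "交易量": [2.24, 2.69, 10.39],
})
columns = ["收盘", "交易量"]

# min() and max() return one value per column, indexed by column name.
col_min = sample[columns].min()
col_max = sample[columns].max()

# DataFrame - Series aligns the Series index with the DataFrame columns,
# so each column is scaled by its own minimum and maximum.
sample[columns] = (sample[columns] - col_min) / (col_max - col_min)
print(sample)
```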

### 2. Normalization within each group after a groupby

#### Background: pandas GroupBy follows the split-apply-combine pattern

<div style="text-align:left; width:500px;"><img src="./other_files/pandas-split-apply-combine.png" style=""/></div>


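#### Background: how to group first and then apply a function to each group
`GroupBy.apply(function)`  
* The first argument passed to `function` is the DataFrame of one group
* `function` may return a DataFrame, a Series, or a single value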

Here, split corresponds to pandas' groupby; we write the function passed to apply ourselves, and pandas combines the values it returns into the final result.
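
The split-apply-combine steps can be sketched on a tiny frame. Everything in this example (`df`, `key`, `value`, `value_range`, `min_max`) is made up for illustration. The `[["value"]]` selection is an assumption of this sketch: it passes only the non-key column to the function, which should behave the same across recent pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "value": [1, 3, 10, 30],
})

def value_range(group):
    """Receives the DataFrame of one group and returns a single value."""
    return group["value"].max() - group["value"].min()

def min_max(group):
    """Receives the DataFrame of one group and returns a Series."""
    return group["value"].agg(["min", "max"])

# split: groupby("key"); apply: our function; combine: done by pandas.
ranges = df.groupby("key")[["value"]].apply(value_range)  # Series indexed by key
stats = df.groupby("key")[["value"]].apply(min_max)       # DataFrame, one row per key

print(ranges)
print(stats)
```

A scalar return value yields one entry per group, while a Series return value yields one row per group with the Series index as columns.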

#### Demo: normalizing users' movie ratings

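Users rate on different scales: optimistic users tend to give high scores and pessimistic users low ones, so the ratings are normalized per user.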

In [8]:
ratings = pd.read_csv(
    "./datas/movielens-1m/ratings.dat", 
    sep="::",
    engine='python', 
    names="UserID::MovieID::Rating::Timestamp".split("::")
)
ratings.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [28]:
# Group by user ID, then normalize one column within each group
def ratings_norm(df):
    """
    @param df: the DataFrame of a single group
    """
    min_value = df["Rating"].min()
    max_value = df["Rating"].max()
    df["Rating_norm"] = df["Rating"].apply(
        lambda x:(x-min_value)/(max_value-min_value))
    return df

# group_keys=False keeps the original row index, matching the output below
ratings = ratings.groupby("UserID", group_keys=False).apply(ratings_norm)

In [30]:
ratings[ratings["UserID"]==1].head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp,Rating_norm
0,1,1193,5,978300760,1.0
1,1,661,3,978302109,0.0
2,1,914,3,978301968,0.0
3,1,3408,4,978300275,0.5
4,1,2355,5,978824291,1.0


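For the user with UserID==1, Rating==3 is the lowest score they gave, which marks them as an optimistic rater; after normalization that score maps to 0.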
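
The per-user normalization above can also be sketched with `GroupBy.transform`, which returns a result aligned with the original rows. The frame below is a hand-made stand-in for `ratings` (user 1 reuses the five ratings shown above; user 2 is hypothetical and gives the same score twice). One assumption is made explicit here: when a user's maximum equals their minimum, the formula divides by zero, and this sketch leaves those rows as NaN rather than choosing a value.

```python
import pandas as pd

# Hypothetical sample: user 1 from the output above, user 2 invented.
sample = pd.DataFrame({
    "UserID": [1, 1, 1, 1, 1, 2, 2],
    "Rating": [5, 3, 3, 4, 5, 4, 4],
})

grouped = sample.groupby("UserID")["Rating"]

# transform broadcasts each group's min/max back onto its rows.
low = grouped.transform("min")
high = grouped.transform("max")

# Where max == min the span is 0; replace it with NaN so the result is NaN.
span = (high - low).where(high != low)

sample["Rating_norm"] = (sample["Rating"] - low) / span
print(sample)
```

Whether constant-rating users should end up as NaN, 0, or some other value depends on how the normalized column is used afterwards, so that choice is left open here.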