# Scaling Data

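In this exercise, you will practice scaling data. When feature scaling comes up, you will sometimes see the terms **standardization** and **normalization** used interchangeably. However, these are two slightly different operations. Standardization rescales a column of values so that it has a mean of 0 and a standard deviation of 1. Normalization rescales a column of values into the range 0 to 1.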
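
To make the difference concrete, here is a minimal sketch that applies both operations to a small array. The numbers are made up for illustration and are not taken from the World Bank data.

```python
import numpy as np

# Toy values, made up for illustration only
values = np.array([2.0, 4.0, 6.0, 8.0])

# Normalization (min-max scaling): results fall in the range 0 to 1
normalized = (values - values.min()) / (values.max() - values.min())

# Standardization: results have a mean of 0 and a standard deviation of 1
# (np.std uses the population standard deviation, ddof=0, by default)
standardized = (values - values.mean()) / values.std()

print(normalized)
print(standardized)
```

After normalization the smallest value maps to 0 and the largest to 1. After standardization the values are centered on 0 and can be negative.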

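In this exercise, you will write code to implement both standardization and normalization. Libraries such as scikit-learn can do this for you, but in data engineering you won't always have a ready-made package available.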

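Run the first cell to read in the World Bank GDP and population data. This cell also filters the data for the year 2016 and removes the values for multi-country aggregates such as 'World' and 'OECD members'.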

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 

# read in the GDP and population data sets and do basic wrangling
gdp = pd.read_csv('../data/gdp_data.csv', skiprows=4)
gdp.drop(['Unnamed: 62', 'Country Code', 'Indicator Name', 'Indicator Code'], inplace=True, axis=1)
population = pd.read_csv('../data/population_data.csv', skiprows=4)
population.drop(['Unnamed: 62', 'Country Code', 'Indicator Name', 'Indicator Code'], inplace=True, axis=1)


# Reshape the data sets so that they are in long format
gdp_melt = gdp.melt(id_vars=['Country Name'], 
                    var_name='year', 
                    value_name='gdp')

# Forward fill missing gdp values within each country, then back fill whatever is still missing
gdp_melt['gdp'] = gdp_melt.sort_values('year').groupby('Country Name')['gdp'].ffill().bfill()

population_melt = population.melt(id_vars=['Country Name'], 
                                  var_name='year', 
                                  value_name='population')

# Forward fill missing population values within each country, then back fill whatever is still missing
population_melt['population'] = population_melt.sort_values('year').groupby('Country Name')['population'].ffill().bfill()

# merge the population and gdp data together into one data frame
df_country = gdp_melt.merge(population_melt, on=['Country Name', 'year'])

# filter data for the year 2016
df_2016 = df_country[df_country['year'] == '2016']

# filter out values that are not countries
non_countries = ['World',
 'High income',
 'OECD members',
 'Post-demographic dividend',
 'IDA & IBRD total',
 'Low & middle income',
 'Middle income',
 'IBRD only',
 'East Asia & Pacific',
 'Europe & Central Asia',
 'North America',
 'Upper middle income',
 'Late-demographic dividend',
 'European Union',
 'East Asia & Pacific (excluding high income)',
 'East Asia & Pacific (IDA & IBRD countries)',
 'Euro area',
 'Early-demographic dividend',
 'Lower middle income',
 'Latin America & Caribbean',
 'Latin America & the Caribbean (IDA & IBRD countries)',
 'Latin America & Caribbean (excluding high income)',
 'Europe & Central Asia (IDA & IBRD countries)',
 'Middle East & North Africa',
 'Europe & Central Asia (excluding high income)',
 'South Asia (IDA & IBRD)',
 'South Asia',
 'Arab World',
 'IDA total',
 'Sub-Saharan Africa',
 'Sub-Saharan Africa (IDA & IBRD countries)',
 'Sub-Saharan Africa (excluding high income)',
 'Middle East & North Africa (excluding high income)',
 'Middle East & North Africa (IDA & IBRD countries)',
 'Central Europe and the Baltics',
 'Pre-demographic dividend',
 'IDA only',
 'Least developed countries: UN classification',
 'IDA blend',
 'Fragile and conflict affected situations',
 'Heavily indebted poor countries (HIPC)',
 'Low income',
 'Small states',
 'Other small states',
 'Not classified',
 'Caribbean small states',
 'Pacific island small states']

# remove non countries from the data
df_2016 = df_2016[~df_2016['Country Name'].isin(non_countries)]


# show the first ten rows
print('first ten rows of data')
df_2016.head(10)

# Exercise - Normalization

To normalize data, pick a feature such as GDP and apply the following formula:

$x_{normalized} = \frac{x - x_{min}}{x_{max} - x_{min}}$

where
* x is a GDP value
* x_max is the maximum GDP value in the data set
* x_min is the minimum GDP value in the data set
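
As a quick worked example with made-up numbers (not values from the data set), suppose x = 30, x_min = 10 and x_max = 50. Plugging these into the formula gives:

$x_{normalized} = \frac{30 - 10}{50 - 10} = \frac{20}{40} = 0.5$

A value halfway between the minimum and the maximum maps to 0.5, the minimum itself maps to 0, and the maximum maps to 1.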

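First, write a function that returns x_min and x_max for a column of values. The input is a column of data (for example, the GDP data). The function returns the x_min and x_max values.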

In [None]:
def x_min_max(data):
    # TODO: Complete this function called x_min_max() 
    # The input is an array of data
    # The outputs are the minimum and maximum of that array
    minimum = None
    maximum = None
    return minimum, maximum

# this should give the result (36572611.88531479, 18624475000000.0)
x_min_max(df_2016['gdp'])

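Next, write a function that normalizes data. The inputs are an x value, a minimum value, and a maximum value. The function returns the normalized value.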

In [None]:
def normalize(x, x_min, x_max):
    # TODO: Complete this function
    # The input is a single value 
    # The output is the normalized value
    return None

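Why write two separate functions? Suppose you are training a machine learning model that uses normalized GDP as a feature. When new data arrives, you will want to make predictions from the new GDP values, and those new values need to be normalized as well. To do that, you need to store the x_min and x_max from the training set. That is why the x_min_max() function returns the minimum and maximum: you can save them in variables.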
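
The idea of computing the parameters once and reusing them can be sketched as follows. This is only an illustration under stated assumptions: the helper names `fit_min_max` and `apply_min_max` are hypothetical (chosen so they do not overwrite your own exercise functions), and the numbers are made up.

```python
import numpy as np

def fit_min_max(data):
    """Return the minimum and maximum of a sequence of numbers (hypothetical helper)."""
    return np.min(data), np.max(data)

def apply_min_max(x, x_min, x_max):
    """Scale x using a previously stored minimum and maximum (hypothetical helper)."""
    return (x - x_min) / (x_max - x_min)

# Hypothetical training values, made up for illustration
train_gdp = [100.0, 250.0, 400.0]

# Compute the parameters once, from the training data only
gdp_min, gdp_max = fit_min_max(train_gdp)

# Later, scale a new observation with the stored training parameters
new_gdp = 325.0
scaled = apply_min_max(new_gdp, gdp_min, gdp_max)
print(scaled)  # 0.75
```

Note that a new value outside the training range would scale to something below 0 or above 1, because the stored minimum and maximum describe only the training data.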

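A good way to keep track of the minimum and maximum values is to use a class. In the next cell, fill in the code for the Normalizer() class so that it normalizes a data set and stores the minimum and maximum values.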

In [None]:
class Normalizer():
    # TODO: Complete the normalizer class
    # The normalizer class receives a dataframe as its only input for initialization
    # For example, the data frame might contain gdp and population data in two separate columns
    # Follow the TODOs in each section
    
    def __init__(self, dataframe):
        
        # TODO: complete the init function. 
        # Assume the dataframe has an unknown number of columns like [['gdp', 'population']] 
        # iterate through each column calculating the min and max for each column
        # append the results to the params attribute list
        
        # For example, take the gdp column and calculate the minimum and maximum
        # Put these results in a tuple (minimum, maximum)
        # Append the tuple to the params variable
        # Then take the population column and do the same
        
        # HINT: You can put your x_min_max() function as part of this class and use it
        
        # HINT: Use a for loop to iterate through the columns of the dataframe
        
        self.params = []
            
    def x_min_max(self, data):
        # TODO: complete the x_min_max method
        # HINT: You can use the same function defined earlier in the exercise
        minimum = None
        maximum = None
        return minimum, maximum

    def normalize_data(self, x):
        # TODO: complete the normalize_data method
        # The function receives a data point as an input and then outputs the normalized version
        # For example, if an input data point of [gdp, population] were used. Then the output would
        # be the normalized version of the [gdp, population] data point
        # Put the results in the normalized variable defined below
        
        # Assume that the columns in the dataframe used to initialize an object are in the same
        # order as this data point x
        
        # HINT: The normalize() function defined earlier in the exercise takes a single value,
        # so you cannot pass x to it directly. You'll need to iterate through the individual
        # values in the x variable. A for loop and the Python enumerate function might be useful.
        # Use the params attribute where the min and max values are stored 
        normalized = []
        return normalized
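
If you get stuck, the sketch below shows one possible shape for such a class. It is not the reference solution: the class name `MinMaxNormalizerSketch` is hypothetical, the data is a made-up toy data frame, and it uses `zip` where the hints suggest `enumerate`.

```python
import pandas as pd

class MinMaxNormalizerSketch:
    """Stores per-column minimum and maximum values and scales data points with them."""

    def __init__(self, dataframe):
        # One (minimum, maximum) tuple per column, in column order
        self.params = [
            (dataframe[column].min(), dataframe[column].max())
            for column in dataframe.columns
        ]

    def normalize_data(self, x):
        # x holds one value per column, in the same order as the columns
        normalized = []
        for value, (x_min, x_max) in zip(x, self.params):
            normalized.append((value - x_min) / (x_max - x_min))
        return normalized

# Toy data, made up for illustration
toy = pd.DataFrame({'gdp': [10.0, 20.0, 50.0], 'population': [1.0, 3.0, 5.0]})
scaler = MinMaxNormalizerSketch(toy)
print(scaler.normalize_data([30.0, 2.0]))
```

With this toy data, the stored parameters are (10.0, 50.0) for gdp and (1.0, 5.0) for population, so the data point [30.0, 2.0] scales to 0.5 and 0.25.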

Run the code in the cells below to check your answers.

In [None]:
gdp_normalizer = Normalizer(df_2016[['gdp', 'population']])

In [None]:
# This cell should output: [(36572611.88531479, 18624475000000.0), (11097.0, 1378665000.0)]
gdp_normalizer.params

In [None]:
# This cell should output [0.7207969507229194, 0.9429407193285986]
gdp_normalizer.normalize_data([13424475000000.0, 1300000000])

# Conclusion

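When you normalize or standardize features for machine learning, you need to save the parameters used to scale the data so that you can apply the same scaling to new data at prediction time. In this exercise, you stored each feature's minimum and maximum values. When standardizing data, you would store the mean and standard deviation instead. The formula for standardization is: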

$x_{standardized} = \frac{x - \overline{x}}{S}$
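
Here $\overline{x}$ is the mean and $S$ is the standard deviation. A minimal sketch of storing and reusing these two parameters might look like the following. The values are made up, the `standardize` helper is hypothetical, and the choice of the sample standard deviation (`ddof=1`) is an assumption about which definition $S$ refers to.

```python
import numpy as np

# Hypothetical training values, made up for illustration
train_values = np.array([2.0, 4.0, 6.0, 8.0])

# Store the parameters computed from the training data.
# ddof=1 gives the sample standard deviation (an assumption);
# use ddof=0 for the population version.
mean = train_values.mean()
std = train_values.std(ddof=1)

def standardize(x, mean, std):
    """Scale x using a stored mean and standard deviation (hypothetical helper)."""
    return (x - mean) / std

# Scale the training data and a new observation with the same parameters
standardized_train = standardize(train_values, mean, std)
standardized_new = standardize(5.0, mean, std)
print(standardized_new)  # 0.0
```

The new value 5.0 equals the training mean, so it standardizes to 0.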