![Day 1](https://github.com/MachineLearning100/100-Days-Of-ML-Code/raw/master/Info-graphs/Day%201.jpg)

## Step 1: Import the required libraries

In [1]:
import numpy as np
import pandas as pd

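## Step 2: Import the dataset
`DataFrame.iloc` selects rows and columns by integer position.  
· Select all the data with iloc  
data.iloc[:,:] # all rows of all columns  

· Select specific columns with iloc  
data.iloc[:,[0]] # all rows of column 0; to select several columns, use data.iloc[:,[0,1]]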
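
The `iloc` patterns above can be tried on a small table. This is a minimal sketch: the DataFrame is a hypothetical stand-in for `Data.csv`, and its column names are assumptions rather than the contents of the real file.

```python
import pandas as pd

# Hypothetical miniature of Data.csv; the real file is not reproduced here.
data = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

everything = data.iloc[:, :]          # all rows, all columns
first_col = data.iloc[:, [0]]         # list index: DataFrame with one column
first_series = data.iloc[:, 0]        # scalar index: Series
features = data.iloc[:, :-1].values   # every column except the last, as a NumPy array
target = data.iloc[:, -1].values      # the last column, as a NumPy array

print(first_col.shape, first_series.shape, features.shape, target.shape)
```

Indexing the target with `-1` instead of a fixed position such as `3` keeps working if feature columns are added later.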

In [6]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:,:-1].values # all columns except the last
Y = dataset.iloc[:,3].values # the last column (the 4th column)

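## Step 3: Handle missing data
Missing values can be replaced with the mean or median of the whole column.  
The Imputer class from sklearn.preprocessing does this (it is available in scikit-learn releases before 0.22).  

`sklearn.preprocessing.Imputer(missing_values='NaN', strategy='mean', axis=0, verbose=0, copy=True)`

Main parameters:
1. missing_values: the placeholder for missing values, either an integer or NaN (a missing numpy.nan is written as the string 'NaN'); defaults to NaN
2. strategy: the replacement strategy, a string; defaults to 'mean'
    ① mean: replace with the mean of the feature column
    ② median: replace with the median of the feature column
    ③ most_frequent: replace with the most frequent value (mode) of the feature column
3. axis: the axis to impute along; the default axis=0 means columns, axis=1 means rows
4. copy: True leaves the original dataset unmodified; False imputes in place. Even with copy=False, the data is not modified in place when:
    ① X is not an array of floating-point values
    ② X is sparse and missing_values=0
    ③ axis=0 and X is a CSR matrix
    ④ axis=1 and X is a CSC matrix
5. statistics_ attribute: with axis=0, the array of fill values for each feature; with axis=1, accessing it raises an error because the attribute does not exist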



In [7]:
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:,1:3]) # fit on the 2nd and 3rd columns of X
X[:,1:3] = imputer.transform(X[:,1:3]) # fill in the missing values

print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
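
On scikit-learn 0.22 and later, `Imputer` is no longer importable; `sklearn.impute.SimpleImputer` covers the same column-wise mean imputation. The sketch below uses a small hypothetical array rather than the tutorial's `X`, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns (age, salary) with one missing value each.
numeric = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [38.0, 61000.0],
])

# SimpleImputer always works column-wise, so it takes no axis argument.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(numeric)

print(imputer.statistics_)  # per-column fill values learned by fit
print(filled)
```

Note that the missing-value marker is `np.nan` itself here, not the string `'NaN'` used by the older class.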


## Step 4: Encode categorical data
Convert the (non-numeric) label values into numbers.  
Use the LabelEncoder and OneHotEncoder classes from sklearn.preprocessing.  
* LabelEncoder assigns each label a code between 0 and n_classes-1.

In [9]:
from sklearn.preprocessing import LabelEncoder
LabelEncoder_X = LabelEncoder()
X[:,0] = LabelEncoder_X.fit_transform(X[:,0]) # encode the categories in the first column

print(X)

[[0 44.0 72000.0]
 [2 27.0 48000.0]
 [1 30.0 54000.0]
 [2 38.0 61000.0]
 [1 40.0 63777.77777777778]
 [0 35.0 58000.0]
 [2 38.77777777777778 52000.0]
 [0 48.0 79000.0]
 [1 50.0 83000.0]
 [0 37.0 67000.0]]
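
To see which code each label received, the fitted encoder's `classes_` attribute can be inspected. A minimal sketch on a hypothetical list of countries:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

countries = np.array(["France", "Spain", "Germany", "Spain"])

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ holds the sorted unique labels; a label's position is its code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
restored = encoder.inverse_transform(codes)

print(mapping)
print(codes, restored)
```

Because the codes follow alphabetical order, they imply an ordering (France < Germany < Spain) that has no real meaning, which is why dummy variables are created next.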


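### Create dummy variables
* OneHotEncoder performs one-hot encoding: an N-bit state register encodes N states, each state has its own register bit, and only one bit is active at any time.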
* A single pass over this data produces five columns: three country indicators followed by age and salary. The printed output below has seven, and its first two columns re-encode the France indicator, which suggests the cell was run more than once on the already-encoded X.

In [13]:
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
LabelEncoder_Y = LabelEncoder()
Y = LabelEncoder_Y.fit_transform(Y)

print(X,'\n',Y)

[[1.00000000e+00 0.00000000e+00 1.00000000e+00 0.00000000e+00
  0.00000000e+00 4.40000000e+01 7.20000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 0.00000000e+00
  1.00000000e+00 2.70000000e+01 4.80000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 1.00000000e+00
  0.00000000e+00 3.00000000e+01 5.40000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 0.00000000e+00
  1.00000000e+00 3.80000000e+01 6.10000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 1.00000000e+00
  0.00000000e+00 4.00000000e+01 6.37777778e+04]
 [1.00000000e+00 0.00000000e+00 1.00000000e+00 0.00000000e+00
  0.00000000e+00 3.50000000e+01 5.80000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 0.00000000e+00
  1.00000000e+00 3.87777778e+01 5.20000000e+04]
 [1.00000000e+00 0.00000000e+00 1.00000000e+00 0.00000000e+00
  0.00000000e+00 4.80000000e+01 7.90000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 1.00000000e+00
  0.00000000e+00 5.00000000e+01 8.30000000e+04]
 [1.000000
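
The `categorical_features` argument was removed in scikit-learn 0.22. On newer releases, one way to one-hot encode a single column is to wrap `OneHotEncoder` in a `ColumnTransformer`. The sketch below assumes a hypothetical DataFrame with assumed column names; it is not the tutorial's exact pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical slice of the feature table; the column names are assumed.
features = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0],
})

# One-hot encode only "Country"; pass the numeric columns through unchanged.
# sparse_threshold=0 asks ColumnTransformer to always return a dense array.
transformer = ColumnTransformer(
    [("country", OneHotEncoder(), ["Country"])],
    remainder="passthrough",
    sparse_threshold=0,
)
encoded = np.asarray(transformer.fit_transform(features), dtype=float)

print(encoded.shape)
print(encoded)
```

Current versions of `OneHotEncoder` accept string categories directly, so the feature column does not need a `LabelEncoder` pass first. The indicator columns come out in alphabetical order (France, Germany, Spain), followed by the passthrough columns.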

## Step 5: Split the dataset into a training set and a test set
The usual ratio between the two is 80:20.  
Use train_test_split from sklearn.model_selection.

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size = 0.2, random_state = 0)
print(X_train,'\n',X_test,'\n',Y_train,'\n',Y_test)

[[0.00000000e+00 1.00000000e+00 0.00000000e+00 1.00000000e+00
  0.00000000e+00 4.00000000e+01 6.37777778e+04]
 [1.00000000e+00 0.00000000e+00 1.00000000e+00 0.00000000e+00
  0.00000000e+00 3.70000000e+01 6.70000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 0.00000000e+00
  1.00000000e+00 2.70000000e+01 4.80000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 0.00000000e+00
  1.00000000e+00 3.87777778e+01 5.20000000e+04]
 [1.00000000e+00 0.00000000e+00 1.00000000e+00 0.00000000e+00
  0.00000000e+00 4.80000000e+01 7.90000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 0.00000000e+00
  1.00000000e+00 3.80000000e+01 6.10000000e+04]
 [1.00000000e+00 0.00000000e+00 1.00000000e+00 0.00000000e+00
  0.00000000e+00 4.40000000e+01 7.20000000e+04]
 [1.00000000e+00 0.00000000e+00 1.00000000e+00 0.00000000e+00
  0.00000000e+00 3.50000000e+01 5.80000000e+04]] 
 [[0.0e+00 1.0e+00 0.0e+00 1.0e+00 0.0e+00 3.0e+01 5.4e+04]
 [0.0e+00 1.0e+00 0.0e+00 1.0e+00 0.0e+00 5.0e+01 8.3e+04]
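
A quick way to check an 80:20 split is to look at the shapes of the returned arrays. The sketch below uses hypothetical data with 10 samples and shows that a fixed `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features, binary labels.
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 1, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Same seed, same split: repeating the call returns identical subsets.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```

With 10 samples and `test_size=0.2`, 2 rows go to the test set and 8 to the training set.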

## Step 6: Feature scaling
Standardization (z-score normalization) puts the features on a common scale.  
Use the StandardScaler class from sklearn.preprocessing.

In [17]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

print(X_train,'\n',X_test)

[[-1.          1.         -1.          2.64575131 -0.77459667  0.26306757
   0.12381479]
 [ 1.         -1.          1.         -0.37796447 -0.77459667 -0.25350148
   0.46175632]
 [-1.          1.         -1.         -0.37796447  1.29099445 -1.97539832
  -1.53093341]
 [-1.          1.         -1.         -0.37796447  1.29099445  0.05261351
  -1.11141978]
 [ 1.         -1.          1.         -0.37796447 -0.77459667  1.64058505
   1.7202972 ]
 [-1.          1.         -1.         -0.37796447  1.29099445 -0.0813118
  -0.16751412]
 [ 1.         -1.          1.         -0.37796447 -0.77459667  0.95182631
   0.98614835]
 [ 1.         -1.          1.         -0.37796447 -0.77459667 -0.59788085
  -0.48214934]] 
 [[-1.          1.         -1.          2.64575131 -0.77459667 -1.45882927
  -0.90166297]
 [-1.          1.         -1.          2.64575131 -0.77459667  1.98496442
   2.13981082]]
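
The scaler is fitted on the training set only, and the test set is transformed with the training mean and standard deviation. The sketch below, on hypothetical age and salary values, checks that `StandardScaler` matches the manual z-score `(x - mean) / std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (age, salary) rows, already split into train and test.
train = np.array([[44.0, 72000.0], [27.0, 48000.0], [38.0, 61000.0], [35.0, 58000.0]])
test = np.array([[30.0, 54000.0], [50.0, 83000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from train, then scale
test_scaled = scaler.transform(test)        # reuse the training statistics

# Manual z-score; np.std defaults to the population standard deviation (ddof=0),
# which is what StandardScaler uses.
mu = train.mean(axis=0)
sigma = train.std(axis=0)
manual_test = (test - mu) / sigma

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(np.allclose(test_scaled, manual_test))
```

After scaling, each training column has mean 0 and standard deviation 1; the test columns generally do not, because they are scaled with the training statistics.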
