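# Making Predictions with Scikit-Learn
### Scikit-Learn provides support in three areas.
1. Getting data: ***sklearn.datasets***
2. Preparing data: ***sklearn.preprocessing*** 
3. Machine learning: ***sklearn Estimator API*** 

There are many ways to get data (files, databases, web scraping, Kaggle Datasets, and so on).<br>
The simplest is to import one of the datasets bundled with sklearn. Because they are always at hand and need no download, they are usually called **toy datasets**.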

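# Basic workflow

* Load the data and pre-process it
* Split into a training set and a test set
* Fit the model
* Predict
* Evaluate (the score may be an error value, an accuracy, or another metric)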
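
The five steps above can be sketched end to end as follows. This is only an illustration: the choice of `LogisticRegression`, the `max_iter` value, the `random_state`, and the use of all four features are assumptions made here, not something this notebook prescribes.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Load the data (all four features, all three classes)
iris = datasets.load_iris()
X, y = iris.data, iris.target

# 2. Split into training and test sets (random_state fixed only for repeatability)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Pre-processing: learn the scaling statistics from the training set only
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# 3. Fit the model (LogisticRegression is an assumed, illustrative choice)
model = LogisticRegression(max_iter=1000)
model.fit(X_train_std, y_train)

# 4. Predict on the held-out test set
y_pred = model.predict(X_test_std)

# 5. Evaluate with accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"test accuracy: {accuracy:.3f}")
```

Any other estimator that follows the Estimator API (`fit`, then `predict`) could be swapped in at step 3 without changing the rest.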


In [1]:
%matplotlib inline

from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


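## Loading the Iris dataset and pre-processing

Iris Flowers dataset

This project uses the Iris Data Set. Each sample has 4 features and 1 class label. The dataset contains 3 classes with 50 samples each, for 150 samples in total.

Attribute information:

    sepal length (cm)
    sepal width (cm)
    petal length (cm)
    petal width (cm)
    class:
        Iris Setosa
        Iris Versicolour
        Iris Virginica

The features are all numeric and share the same unit (centimeters).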
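
The counts quoted above can be checked directly on the loaded dataset. A small sketch, where `class_counts` is a name introduced here for illustration:

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# 150 samples, 4 features each
print(iris.data.shape)
print(iris.feature_names)

# Number of samples per class label (0, 1, 2)
class_counts = np.bincount(iris.target)
print(dict(zip(iris.target_names, class_counts)))
```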

![Iris Flowers](images/iris_data.PNG)


In [2]:
iris = datasets.load_iris()
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

* Print the keys of iris and the data file name
* Look at the first 10 rows
* Check the data type
* Print the class names and the labels of the samples

In [3]:
print(iris.keys())

print(iris.filename)

print(iris.data[:10])

print(type(iris.data))

print(iris.target_names)
print(iris.target)

print(iris.data_module)

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
iris.csv
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
<class 'numpy.ndarray'>
['setosa' 'versicolor' 'virginica']
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
sklearn.datasets.data


In [4]:
# we only take the first two features. 
X = iris.data[:,:2] #sepal length & sepal width
print(X.shape)
Y = iris.target
print(Y.shape)


(150, 2)
(150,)


In [5]:
# Build a pandas DataFrame from the features (optional; the NumPy array works as well)
x = pd.DataFrame(iris.data, columns = iris['feature_names'])
x

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [6]:
print(iris.target_names)

['setosa' 'versicolor' 'virginica']


In [7]:
# Build the target column
y = pd.DataFrame(iris.target, columns = ['target'])
y

Unnamed: 0,target
0,0
1,0
2,0
3,0
4,0
...,...
145,2
146,2
147,2
148,2


In [8]:
# Concatenate the feature columns and the target column

iris_data = pd.concat([x, y], axis=1)
iris_data

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.2,3.4,5.4,2.3,2


In [9]:
# Keep only two of the features plus the target
iris_data = iris_data[['sepal length (cm)', 'petal length (cm)', 'target']]
iris_data

Unnamed: 0,sepal length (cm),petal length (cm),target
0,5.1,1.4,0
1,4.9,1.4,0
2,4.7,1.3,0
3,4.6,1.5,0
4,5.0,1.4,0
...,...,...,...
145,6.7,5.2,2
146,6.3,5.0,2
147,6.5,5.2,2
148,6.2,5.4,2


In [10]:
# Keep only the samples whose target is 0 or 1
iris_data = iris_data[iris_data['target'].isin([0,1])]
iris_data

Unnamed: 0,sepal length (cm),petal length (cm),target
0,5.1,1.4,0
1,4.9,1.4,0
2,4.7,1.3,0
3,4.6,1.5,0
4,5.0,1.4,0
...,...,...,...
95,5.7,4.2,1
96,5.7,4.2,1
97,6.2,4.3,1
98,5.1,3.0,1


In [11]:
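# DataFrame.size counts every cell: 100 rows x 3 columns = 300
print(iris_data.size)
# Caution: dividing by the 4 original feature names gives 75.0, which is not the row count; len(iris_data) gives the 100 rows
print(iris_data.size/len(iris.feature_names))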

300
75.0


## Splitting into training and test sets
> train_test_split()

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(
    iris_data[['sepal length (cm)', 'petal length (cm)']],
    iris_data[['target']],
    test_size=0.3,
)
# Double brackets [[...]] index with a list of column names, so the result is a DataFrame rather than a Series

In [13]:
X_train.head(5)

Unnamed: 0,sepal length (cm),petal length (cm)
67,5.8,4.1
79,5.7,3.5
33,5.5,1.4
2,4.7,1.3
3,4.6,1.5


In [14]:
X_test.head(5)

Unnamed: 0,sepal length (cm),petal length (cm)
39,5.1,1.5
86,6.7,4.7
7,5.0,1.5
30,4.8,1.6
95,5.7,4.2


In [15]:
Y_train.head(5)

Unnamed: 0,target
67,1
79,1
33,0
2,0
3,0


In [16]:
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

(70, 2) (30, 2) (70, 1) (30, 1)
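
The split above is random, so the rows that land in each set change from run to run. A sketch of a repeatable, class-balanced variant is shown below; `random_state=42` and `stratify` are options added here for illustration, and the subset is rebuilt from `load_iris(as_frame=True)` so the block runs on its own.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Rebuild the two-class, two-feature subset used in this notebook
iris_df = datasets.load_iris(as_frame=True).frame
subset = iris_df[iris_df["target"].isin([0, 1])]
features = subset[["sepal length (cm)", "petal length (cm)"]]
labels = subset["target"]

# random_state makes the shuffle repeatable; stratify keeps the 50/50 class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.3, random_state=42, stratify=labels
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts().to_dict())
```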


# Appendix 

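>Normalization and standardization serve a similar purpose.<br>
Both pre-process the data so that the values fall into a common numeric range, so that no feature is favored during modeling merely because of its scale.<br> 
* Normalization usually rescales the data into a chosen range, typically [0, 1], which removes the effect of the features' units on the model.<br> 
* Standardization usually means rescaling the data so that its mean is 0 and its variance is 1. It shifts and scales the values but does not change the shape of their distribution.<br> 

Normalization and standardization are therefore operations on the data that remove feature-importance bias caused by differences in numeric magnitude.<br>
Scaled data can also speed up training and help the algorithm converge.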
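
To make the difference concrete, here is a small NumPy sketch that applies both rescalings to the same array. The numbers in `X` are made up for illustration.

```python
import numpy as np

# Two features on very different scales (made-up values)
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Min-max normalization: each column is mapped onto [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: each column ends up with mean 0 and standard deviation 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_std)
```

After either transformation the two columns are identical, even though the raw values differ by two orders of magnitude.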

### Standardization (z-score)
    Compute the mean and standard deviation on the training set so that the same transformation can later be reapplied to the test set.
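
Written out, this is the usual z-score. The subscript "train" is notation introduced here to stress that both statistics come from the training set, including when the formula is applied to test samples:

$$
z = \frac{x - \mu_{\text{train}}}{\sigma_{\text{train}}}
$$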

In [17]:
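def norm_stats(dfs):
    # In the run shown here, min, max and mean are called without axis and
    # reduce over the whole DataFrame (scalars); only sigma is per column.
    minimum = np.min(dfs)
    maximum = np.max(dfs)
    mu = np.mean(dfs)
    sigma = np.std(dfs, axis=0)
    #sigma = float(np.std(dfs, axis=0))
    return (minimum, maximum, mu, sigma)

def z_score(col, stats):
    # Standardize each column: (x - mean) / std.
    # mu[c] requires a per-column mean, i.e. np.mean(dfs, axis=0) in norm_stats.
    m, M, mu, s = stats
    df = pd.DataFrame()
    for c in col.columns:
        df[c] = (col[c]-mu[c]) / s[c]
    return df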

In [18]:
X_train

Unnamed: 0,sepal length (cm),petal length (cm)
67,5.8,4.1
79,5.7,3.5
33,5.5,1.4
2,4.7,1.3
3,4.6,1.5
...,...,...
72,6.3,4.9
44,5.1,1.9
21,5.1,1.5
41,4.5,1.3


In [19]:
stats = norm_stats(X_train)
print(stats)

#arr_x_train = np.array(z_score(X_train, stats))
#arr_x_train
#arr_y_train = np.array(Y_train)
#arr_x_train[:5]

(1.0, 6.9, 4.115, sepal length (cm)    0.618835
petal length (cm)    1.412900
dtype: float64)
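
The printed tuple holds a single scalar for the minimum, maximum and mean but a per-column Series for sigma, so the two helpers do not fit together as written. A minimal sketch of a per-column variant is given below. `column_stats` and `z_score_columns` are hypothetical names introduced here, and `demo` reuses the first five rows of `X_train` shown earlier.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def column_stats(df):
    """Return the per-column mean and population standard deviation (ddof=0)."""
    return df.mean(axis=0), df.std(axis=0, ddof=0)

def z_score_columns(df, mu, sigma):
    """Standardize every column of df with the given per-column statistics."""
    # pandas aligns the Series mu and sigma with the DataFrame columns by name
    return (df - mu) / sigma

# First five rows of X_train from this notebook
demo = pd.DataFrame({
    "sepal length (cm)": [5.8, 5.7, 5.5, 4.7, 4.6],
    "petal length (cm)": [4.1, 3.5, 1.4, 1.3, 1.5],
})

mu, sigma = column_stats(demo)
demo_std = z_score_columns(demo, mu, sigma)

# StandardScaler also uses ddof=0, so both results should agree
sk_std = StandardScaler().fit_transform(demo)
print(np.allclose(demo_std.to_numpy(), sk_std))
```

Note that pandas' `DataFrame.std` defaults to `ddof=1`, so `ddof=0` has to be passed explicitly to match `np.std` and `StandardScaler`.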


## use sklearn

In [20]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler().fit(X_train)  #Compute the statistics to be used for later scaling.
print(sc.mean_)  #mean
print(sc.scale_) #standard deviation

[5.43 2.8 ]
[0.61883531 1.41289975]


In [21]:
#transform: (x-u)/std.
X_train_std = sc.transform(X_train)
X_train_std[:10]

array([[ 0.59789736,  0.92009359],
       [ 0.43630348,  0.49543501],
       [ 0.11311572, -0.99087001],
       [-1.17963533, -1.06164644],
       [-1.34122922, -0.92009359],
       [ 0.43630348,  0.92009359],
       [ 0.75949124,  1.41552859],
       [-1.66441698, -1.06164644],
       [ 0.2747096 ,  0.99087001],
       [-1.01804145, -0.99087001]])

The scaler instance can then be used on new data to transform it the same way it did on the training set:

In [22]:
X_test_std = sc.transform(X_test)
print(X_test_std[:10])

[[-0.53325981 -0.92009359]
 [ 2.05224229  1.34475216]
 [-0.69485369 -0.92009359]
 [-1.01804145 -0.84931716]
 [ 0.43630348  0.99087001]
 [ 0.11311572  0.84931716]
 [-0.53325981 -0.92009359]
 [ 0.75949124  0.99087001]
 [ 0.92108512  1.2031993 ]
 [ 0.11311572 -1.06164644]]


You can also use the fit_transform method (fit, then transform) on the training set. The test set should still go through transform only, so that it is scaled with the training statistics rather than its own:

In [23]:
X_train_std = sc.fit_transform(X_train)  # fit on the training set, then transform it
X_test_std = sc.transform(X_test)        # reuse the training mean and std
print(X_train_std[:10])
print(X_test_std[:10])


[[ 0.59789736  0.92009359]
 [ 0.43630348  0.49543501]
 [ 0.11311572 -0.99087001]
 [-1.17963533 -1.06164644]
 [-1.34122922 -0.92009359]
 [ 0.43630348  0.92009359]
 [ 0.75949124  1.41552859]
 [-1.66441698 -1.06164644]
 [ 0.2747096   0.99087001]
 [-1.01804145 -0.99087001]]
[[-0.53325981 -0.92009359]
 [ 2.05224229  1.34475216]
 [-0.69485369 -0.92009359]
 [-1.01804145 -0.84931716]
 [ 0.43630348  0.99087001]
 [ 0.11311572  0.84931716]
 [-0.53325981 -0.92009359]
 [ 0.75949124  0.99087001]
 [ 0.92108512  1.2031993 ]
 [ 0.11311572 -1.06164644]]


In [24]:
print('mean of X_train_std:',np.round(X_train_std.mean(),4))
print('std of X_train_std:',X_train_std.std())

mean of X_train_std: -0.0
std of X_train_std: 1.0
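
The check above averages over the whole array. Since the scaler works feature by feature, it can be more informative to pass `axis=0` and inspect each column separately. A sketch on synthetic data, where `X_demo` and its distribution parameters are invented for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data standing in for X_train (values are made up)
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(5.4, 0.6, 70),
    rng.normal(2.8, 1.4, 70),
])

X_demo_std = StandardScaler().fit_transform(X_demo)

# axis=0 reduces over the rows, giving one value per feature
col_means = X_demo_std.mean(axis=0)
col_stds = X_demo_std.std(axis=0)
print(np.round(col_means, 4))
print(np.round(col_stds, 4))
```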


## Min-Max Normalization
    Transforms features by scaling each feature to a given range.
    The transformation is given by:

    X' = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    X -> N-dimensional data
    


In [25]:
x1 = np.random.normal(50, 6, 100)  # np.random.normal(mu, sigma, size)
y1 = np.random.normal(5, 0.5, 100)

x2 = np.random.normal(30,6,100)
y2 = np.random.normal(4,0.5,100)
plt.scatter(x1,y1,c='b',marker='s',s=20,alpha=0.8)
plt.scatter(x2,y2,c='r', marker='^', s=20, alpha=0.8)

print(np.sum(x1)/len(x1))
print(np.sum(x2)/len(x2))

50.24962232819359
30.294966946091876


In [26]:
x_val = np.concatenate((x1,x2))
y_val = np.concatenate((y1,y2))

x_val.shape

(200,)

In [27]:
def minmax_norm(X):
    return (X - X.min(axis=0)) / ((X.max(axis=0) - X.min(axis=0)))

In [28]:
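# min and max are taken over these 10 values only, so the result differs from scaling all 200 values
minmax_norm(x_val[:10])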

array([0.80473716, 0.        , 0.64593345, 0.57939474, 0.934979  ,
       0.94479352, 0.7728475 , 0.42983595, 0.93192034, 1.        ])

In [29]:
from sklearn.preprocessing import MinMaxScaler
print(x_val.shape)
x_val=x_val.reshape(-1, 1)
print(x_val.shape)
scaler = MinMaxScaler().fit(x_val)  # default range 0~1
print(scaler.data_max_)
print(scaler.transform(x_val)[:10])

(200,)
(200, 1)
[65.0917563]
[[0.74808322]
 [0.37619227]
 [0.6746957 ]
 [0.64394635]
 [0.80827152]
 [0.81280708]
 [0.73334614]
 [0.57483117]
 [0.80685803]
 [0.83831949]]
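
The manual function and `MinMaxScaler` print different numbers above because they were given different data: the manual call saw only the first 10 values, while the scaler was fitted on all 200. The sketch below, on freshly generated data (the seed and the name `values` are assumptions made here), suggests that the two agree once both use the full array.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Regenerate data in the same style as x_val (seeded for repeatability)
rng = np.random.default_rng(0)
values = np.concatenate([
    rng.normal(50, 6, 100),
    rng.normal(30, 6, 100),
]).reshape(-1, 1)

def minmax_norm(X):
    """Scale each column of X onto [0, 1]."""
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

manual_full = minmax_norm(values)                # min/max over all 200 values
sk_full = MinMaxScaler().fit_transform(values)   # same statistics, via sklearn
manual_first10 = minmax_norm(values[:10])        # min/max over 10 values only

print(np.allclose(manual_full, sk_full))
print(manual_first10.min(), manual_first10.max())
```

Scaling a slice on its own always produces a 0 and a 1 inside that slice, which is why the 10-value output above contains both extremes.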


In [30]:
from IPython.display import Math

Math(r'x_{norm}^{i} = \frac{x^{i} - x_{min}}{x_{max} - x_{min}}')

<IPython.core.display.Math object>