The program starts by importing all the Python packages it uses:
pandas and NumPy, which store and manipulate the data,
and several machine learning models provided by scikit-learn.

In [1]:
# from sklearn.datasets import load_iris
from sklearn import preprocessing
import pandas as pd
import numpy as np
np.set_printoptions(threshold = 1e6)

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

Load the training data and testing data used in this assignment.

In [2]:
################################################################
## load dataset
df_iris_train = pd.read_csv("train.csv")
df_iris_test = pd.read_csv("test.csv")
df_iris_submission = pd.read_csv("submission.csv")
print("Raw data:")
print(df_iris_train[119:123])
print()

df_iris_train[119:123]

Raw data:
      id  花萼長度  花萼寬度  花瓣長度  花瓣寬度              屬種  type
119  120   5.9   3.0   5.1   1.8  Iris-virginica     3
120  121   NaN   3.0   4.9   1.2   Iris-new_type     4
121  122   5.2   NaN   5.1   1.8   Iris-new_type     4
122  123   6.1   3.2   5.1   1.8   Iris-new_type     4



Unnamed: 0,id,花萼長度,花萼寬度,花瓣長度,花瓣寬度,屬種,type
119,120,5.9,3.0,5.1,1.8,Iris-virginica,3
120,121,,3.0,4.9,1.2,Iris-new_type,4
121,122,5.2,,5.1,1.8,Iris-new_type,4
122,123,6.1,3.2,5.1,1.8,Iris-new_type,4


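After loading the data, preprocess it. First, use pandas to convert the text label of each class into a number. This is integer label encoding rather than one-hot encoding: each species name maps to a single integer.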

In [3]:
################################################################
print("###############################")
print("Preprocessing")
print("###############################")
# use pandas to encode the text in the '屬種' (species) column
df_iris_train_encode = df_iris_train.copy()
df_iris_train_encode['屬種'] = df_iris_train_encode['屬種'].replace({'Iris-setosa':1,'Iris-versicolor':2,'Iris-virginica':3,'Iris-new_type':4})   # convert names to integers
print("Data after encoding:")
print(df_iris_train_encode[119:123])
print()

df_iris_train_encode[119:123]

###############################
Preprocessing
###############################
Data after encoding:
      id  花萼長度  花萼寬度  花瓣長度  花瓣寬度  屬種  type
119  120   5.9   3.0   5.1   1.8   3     3
120  121   NaN   3.0   4.9   1.2   4     4
121  122   5.2   NaN   5.1   1.8   4     4
122  123   6.1   3.2   5.1   1.8   4     4



Unnamed: 0,id,花萼長度,花萼寬度,花瓣長度,花瓣寬度,屬種,type
119,120,5.9,3.0,5.1,1.8,3,3
120,121,,3.0,4.9,1.2,4,4
121,122,5.2,,5.1,1.8,4,4
122,123,6.1,3.2,5.1,1.8,4,4
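
The cell above maps each species name to one integer, which is label encoding. For comparison, here is a minimal sketch of label encoding next to one-hot encoding. The toy frame and its `species` and `label` columns are hypothetical and are not part of the assignment data.

```python
import pandas as pd

# Hypothetical toy data, not the assignment's train.csv
toy = pd.DataFrame({"species": ["Iris-setosa", "Iris-virginica", "Iris-setosa"]})

# Label encoding: one integer per class, similar in effect to the replace() call above
label_map = {"Iris-setosa": 1, "Iris-versicolor": 2, "Iris-virginica": 3}
toy["label"] = toy["species"].map(label_map)

# One-hot encoding: one 0/1 indicator column per class present in the data
one_hot = pd.get_dummies(toy["species"], dtype=int)

print(toy)
print(one_hot)
```

Label encoding keeps a single column, which suits a classification target like `type`. One-hot encoding is usually reserved for categorical input features, where the integer order would otherwise be misread as a magnitude.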


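Next, fill in the missing values.
First choose which columns of the data are features, then declare two empty dictionaries: one records how many valid values each feature has, and the other stores the value to fill in for that feature.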

In [4]:
################################################################
## fill missing values
featureArray = ['花萼長度','花萼寬度','花瓣長度','花瓣寬度']
totalAvailableFeatureCount = dict()
fillMissingValueFeature = dict()
for feature in featureArray:
    totalAvailableFeatureCount[feature]=0

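A for loop processes each feature. It first counts how many rows of that feature hold a valid value rather than NaN.
I chose to fill the missing entries with the feature's mean: the sum of its valid values divided by the number of valid values.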

In [5]:
# total number of rows
totalDataCount = len(df_iris_train_encode['id'])
#print("Total Data Count: %d"%(totalDataCount))

# for each feature
for feature in featureArray:
    # count the valid (non-NaN) values of this feature
    for i in df_iris_train_encode[feature]:
        if not(np.isnan(i)):
            totalAvailableFeatureCount[feature] += 1 # not NaN, so increment the count
    # mean of the valid values; np.sum on a pandas Series skips NaN
    fillMissingValueFeature[feature] = np.sum(df_iris_train_encode[feature])/totalAvailableFeatureCount[feature]
    print("Available %s Count: %d"%(feature,totalAvailableFeatureCount[feature]))
    print("%s Filled Number: %f"%(feature,fillMissingValueFeature[feature]))

Available 花萼長度 Count: 122
花萼長度 Filled Number: 5.840164
Available 花萼寬度 Count: 122
花萼寬度 Filled Number: 3.058197
Available 花瓣長度 Count: 123
花瓣長度 Filled Number: 3.764228
Available 花瓣寬度 Count: 123
花瓣寬度 Filled Number: 1.204878


With the fill values computed, use pandas to write them into the data.

In [6]:
# fill in the missing values; the dict maps each feature column to its mean
df_iris_train_encode = df_iris_train_encode.fillna(fillMissingValueFeature)
print("Fill in missing value: ")
print(df_iris_train_encode[119:123])
print()

df_iris_train_encode[119:123]

Fill in missing value: 
      id      花萼長度      花萼寬度  花瓣長度  花瓣寬度  屬種  type
119  120  5.900000  3.000000   5.1   1.8   3     3
120  121  5.840164  3.000000   4.9   1.2   4     4
121  122  5.200000  3.058197   5.1   1.8   4     4
122  123  6.100000  3.200000   5.1   1.8   4     4



Unnamed: 0,id,花萼長度,花萼寬度,花瓣長度,花瓣寬度,屬種,type
119,120,5.9,3.0,5.1,1.8,3,3
120,121,5.840164,3.0,4.9,1.2,4,4
121,122,5.2,3.058197,5.1,1.8,4,4
122,123,6.1,3.2,5.1,1.8,4,4
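
The counting loop and the fill step can also be written with pandas alone. This is a minimal sketch of mean imputation on a hypothetical toy frame; the column names `a` and `b` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})

# count() returns the number of non-NaN entries per column
valid_counts = toy.count()

# mean() skips NaN by default, so this is sum(valid) / count(valid)
fill_values = toy.mean()

# fillna accepts a Series or dict keyed by column name
filled = toy.fillna(fill_values)

print(valid_counts)
print(filled)
```

Here `a` is filled with 2.0 and `b` with 6.0, the means of their two valid values.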


The last preprocessing step is normalization. With scikit-learn's preprocessing.scale, each of the 4 features is standardized to zero mean and unit variance.

In [7]:
#################################################################
## normalize
#training dataset
df_iris_train_encode_normalized = df_iris_train_encode.copy()
# for each feature
for feature in featureArray:
    # standardize this feature to zero mean and unit variance
    np_normalized_col = preprocessing.scale(df_iris_train_encode_normalized[feature], axis=0)
    df_iris_train_encode_normalized[feature] = np_normalized_col
print("Training data after normalization: ")
print(df_iris_train_encode_normalized[119:123])

df_iris_train_encode_normalized[119:123]

Training data after normalization: 
      id      花萼長度          花萼寬度      花瓣長度      花瓣寬度  屬種  type
119  120  0.076276 -1.317572e-01  0.774593  0.792069   3     3
120  121  0.000000 -1.317572e-01  0.658616 -0.006492   4     4
121  122 -0.816047  2.010833e-15  0.774593  0.792069   4     4
122  123  0.331225  3.210422e-01  0.774593  0.792069   4     4


Unnamed: 0,id,花萼長度,花萼寬度,花瓣長度,花瓣寬度,屬種,type
119,120,0.076276,-0.1317572,0.774593,0.792069,3,3
120,121,0.0,-0.1317572,0.658616,-0.006492,4,4
121,122,-0.816047,2.010833e-15,0.774593,0.792069,4,4
122,123,0.331225,0.3210422,0.774593,0.792069,4,4


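Apply the same normalization to the testing data. Note that this cell scales the testing data with its own mean and standard deviation, not with the statistics of the training set.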

In [8]:
df_iris_test_normalized = df_iris_test.copy()
for feature in featureArray:
    # standardize this feature to zero mean and unit variance
    np_normalized_col = preprocessing.scale(df_iris_test_normalized[feature], axis=0)
    df_iris_test_normalized[feature] = np_normalized_col
print("Testing data after normalization: ")
print(df_iris_test_normalized[len(df_iris_test_normalized)-4:len(df_iris_test_normalized)])
print()

df_iris_test_normalized[len(df_iris_test_normalized)-4:len(df_iris_test_normalized)]

Testing data after normalization: 
    id      花萼長度      花萼寬度      花瓣長度      花瓣寬度
26  27 -0.995215 -1.472971  0.344187  0.627332
27  28  1.536781 -0.381881  1.317280  0.756236
28  29  0.903782 -1.472971  1.046976  0.756236
29  30  1.431281  1.527525  1.209158  1.658563



Unnamed: 0,id,花萼長度,花萼寬度,花瓣長度,花瓣寬度
26,27,-0.995215,-1.472971,0.344187,0.627332
27,28,1.536781,-0.381881,1.31728,0.756236
28,29,0.903782,-1.472971,1.046976,0.756236
29,30,1.431281,1.527525,1.209158,1.658563
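
The cells above scale the training and testing data separately. A common alternative is to fit the scaler on the training data only and reuse its mean and standard deviation for the testing data. The sketch below shows that alternative under stated assumptions, not what this notebook does; the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy arrays: rows are samples, columns are features
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from the training data
test_scaled = scaler.transform(test)        # reuses the training mean and std

print(scaler.mean_)   # per-feature training means
print(train_scaled)
print(test_scaled)
```

Reusing the training statistics keeps both sets on the same scale, so a value such as 5.0 cm maps to the same standardized number whether it appears in training or testing.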


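Next comes model training and prediction.
First split the preprocessed training data into the model's input part and output part.
Testing data normally has only the input part, but this assignment provides the correct answers for the testing data, so they are stored in the testing_output_GroundTurth array.

After that, I use models provided by scikit-learn to produce the final results: KNN, random forest, a naive Bayes classifier, and SVM.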

In [9]:
print("###############################")
print("Training and Predicting")
print("###############################")
## split the training dataset into input (x) and output (y)
#training dataset
training_input = df_iris_train_encode_normalized[featureArray]
training_output = df_iris_train_encode_normalized['type']
#print(training_input[119:123])
#print(training_output[119:123])

#testing dataset
testing_input = df_iris_test_normalized[featureArray]
testing_output_GroundTurth = np.asarray([1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3])
print("Ground truth:")
print(testing_output_GroundTurth)
print()

###############################
Training and Predicting
###############################
Ground truth:
[1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3]



KNN results

In [10]:
## KNN model
knn = KNeighborsClassifier(n_neighbors=3, weights='uniform')
knn.fit(training_input, training_output)
#predict
testing_output_KNN = knn.predict(testing_input)
print("KNN Results: ")
print(testing_output_KNN)
print(metrics.classification_report(testing_output_GroundTurth, testing_output_KNN), end='')
print("KNN confusion matrix: ")
print(metrics.confusion_matrix(testing_output_GroundTurth, testing_output_KNN))
print()

KNN Results: 
[1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 2 3 3 3]
             precision    recall  f1-score   support

          1       1.00      1.00      1.00        10
          2       0.91      1.00      0.95        10
          3       1.00      0.90      0.95        10

avg / total       0.97      0.97      0.97        30
KNN confusion matrix: 
[[10  0  0]
 [ 0 10  0]
 [ 0  1  9]]
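
The precision and recall values in the report can be derived directly from the confusion matrix. This is a minimal sketch using the KNN matrix printed above; the variable names are my own, and rows are taken as true classes and columns as predicted classes, which is scikit-learn's convention.

```python
import numpy as np

# Confusion matrix from the KNN output: rows = true class, columns = predicted class
cm = np.array([[10, 0, 0],
               [0, 10, 0],
               [0, 1, 9]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # divide by column sums (predicted counts)
recall = true_positives / cm.sum(axis=1)     # divide by row sums (true counts)
accuracy = true_positives.sum() / cm.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print(round(float(accuracy), 2))
```

Class 2 has precision 10/11 because one class 3 sample was predicted as class 2, and class 3 has recall 9/10 for the same reason. These round to the 0.91 and 0.90 shown in the report.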



Random forest results

In [11]:
## Random forest
# max_features='sqrt' is what 'auto' meant for classifiers; 'auto' was removed in scikit-learn 1.3
rfc = RandomForestClassifier(n_estimators=500, criterion='gini', max_features='sqrt', oob_score=True)
rfc.fit(training_input, training_output)
#predict
testing_output_RF = rfc.predict(testing_input)
print("Random Forest Results: ")
print(testing_output_RF)
print(metrics.classification_report(testing_output_GroundTurth, testing_output_RF), end='')
print("Random Forest confusion matrix: ")
print(metrics.confusion_matrix(testing_output_GroundTurth, testing_output_RF))
print()

Random Forest Results: 
[1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 2 3 3 3]
             precision    recall  f1-score   support

          1       1.00      1.00      1.00        10
          2       0.91      1.00      0.95        10
          3       1.00      0.90      0.95        10

avg / total       0.97      0.97      0.97        30
Random Forest confusion matrix: 
[[10  0  0]
 [ 0 10  0]
 [ 0  1  9]]



Naive Bayes classifier results

In [12]:
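## Naive Bayes classifier
bc = GaussianNB()
bc.fit(training_input, training_output)
#predict with the Naive Bayes model (bc)
# Note: the output recorded below was produced with rfc.predict,
# so it repeats the random forest predictions.
testing_output_BC = bc.predict(testing_input)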
print("Bayes Results: ")
print(testing_output_BC)
print(metrics.classification_report(testing_output_GroundTurth, testing_output_BC), end='')
print("Bayes confusion matrix: ")
print(metrics.confusion_matrix(testing_output_GroundTurth, testing_output_BC))
print()

Bayes Results: 
[1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 2 3 3 3]
             precision    recall  f1-score   support

          1       1.00      1.00      1.00        10
          2       0.91      1.00      0.95        10
          3       1.00      0.90      0.95        10

avg / total       0.97      0.97      0.97        30
Bayes confusion matrix: 
[[10  0  0]
 [ 0 10  0]
 [ 0  1  9]]



SVM results

In [13]:
## SVM model
svc = SVC(C=10.0, kernel="rbf", probability=True)
svc.fit(training_input, training_output)
#predict
testing_output_SVM = svc.predict(testing_input)
print("SVM Results: ")
print(testing_output_SVM)
print(metrics.classification_report(testing_output_GroundTurth, testing_output_SVM), end='')
print("SVM confusion matrix: ")
print(metrics.confusion_matrix(testing_output_GroundTurth, testing_output_SVM))
print()

SVM Results: 
[1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3]
             precision    recall  f1-score   support

          1       1.00      1.00      1.00        10
          2       1.00      1.00      1.00        10
          3       1.00      1.00      1.00        10

avg / total       1.00      1.00      1.00        30
SVM confusion matrix: 
[[10  0  0]
 [ 0 10  0]
 [ 0  0 10]]
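
The four cells above repeat the same fit, predict, and report steps. A loop over a dictionary of models is one way to keep each prediction paired with the model that was fitted. This sketch assumes scikit-learn's bundled iris data as a stand-in for train.csv and test.csv, and the split, `random_state` values, and `n_estimators=100` are my own choices, so its scores are not comparable to the results above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Assumption: the bundled iris data stands in for the assignment's CSV files
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(C=10.0, kernel="rbf"),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Each model predicts with itself, so fitted estimators cannot be mixed up
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```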



Finally, generate the submission.csv file required by Kaggle.

In [14]:
## generate the submission file
df_submission = pd.DataFrame({ 'id': df_iris_submission.id, 'type': testing_output_SVM })  # SVM results
df_submission.to_csv("submission.csv", index=False)

submission = pd.read_csv('submission.csv', encoding = "utf-8", dtype = {'type': np.int32})
print("Submission:")
#print(submission)

submission

Submission:


Unnamed: 0,id,type
0,1,1
1,2,1
2,3,1
3,4,1
4,5,1
5,6,1
6,7,1
7,8,1
8,9,1
9,10,1
