## Kaggle Competition

Data Science London + Scikit-learn

Hints: https://ai100-2.cupoy.com/mission/D48

Gradient Boosting, Random Forest, and Logistic Regression models are built side by side. After comparing their accuracy, the best-performing model is used to predict the test data.
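
The comparison described above can be sketched in a few lines. This is only an illustration: it assumes synthetic data from `make_classification` in place of the competition files, uses cross-validated accuracy rather than a single hold-out split, and the names `candidates`, `scores`, and `best_name` are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the competition data (assumption: 40 numeric features, binary label).
X, y = make_classification(n_samples=300, n_features=40, n_informative=10, random_state=0)

# The three model families compared in this notebook, with illustrative settings.
candidates = {
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Pick the model with the highest mean accuracy.
best_name = max(scores, key=scores.get)
print(scores)
print("Best model:", best_name)
```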

In [1]:
# Load the required packages
import os
import numpy as np 
import pandas as pd
from sklearn import datasets, metrics, linear_model
from sklearn.model_selection import train_test_split, KFold, RandomizedSearchCV
from sklearn.ensemble import GradientBoostingRegressor, RandomForestClassifier
from sklearn.metrics import accuracy_score

## 1. Read the files

In [2]:
# Set the data paths
dir_data = './data/'
train = os.path.join(dir_data, 'scikit_train.csv')
test = os.path.join(dir_data, 'scikit_test.csv')
label = os.path.join(dir_data, 'scikit_trainLabels.csv')

# Read the files (the CSVs have no header row)
s_train = pd.read_csv(train, header=None)
s_test = pd.read_csv(test, header=None)
s_trainLabels = pd.read_csv(label, header=None)
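
The files are read with `header=None` because they contain no header row. A small sketch of what that option does, using an in-memory CSV string as a stand-in for the real files (the values here are made up):

```python
import io
import pandas as pd

# Two rows of headerless numeric data, standing in for a competition CSV.
csv_text = "0.1,0.2,0.3\n0.4,0.5,0.6\n"

# With header=None the first line is treated as data and columns are numbered 0..n-1.
df = pd.read_csv(io.StringIO(csv_text), header=None)
print(df.shape)
print(list(df.columns))
```

Without `header=None`, pandas would consume the first data row as column names and the frame would end up one row short.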


## 2. Explore the data

In [3]:
s_train.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.299403 | -1.226624 | 1.498425 | -1.17615 | 5.289853 | 0.208297 | 2.404498 | 1.594506 | -0.051608 | 0.663234 | ... | -0.850465 | -0.62299 | -1.833057 | 0.293024 | 3.552681 | 0.717611 | 3.305972 | -2.715559 | -2.682409 | 0.10105 |
| 1 | -1.174176 | 0.332157 | 0.949919 | -1.285328 | 2.199061 | -0.151268 | -0.427039 | 2.619246 | -0.765884 | -0.09378 | ... | -0.81975 | 0.012037 | 2.038836 | 0.468579 | -0.517657 | 0.422326 | 0.803699 | 1.213219 | 1.382932 | -1.817761 |
| 2 | 1.192222 | -0.414371 | 0.067054 | -2.233568 | 3.658881 | 0.089007 | 0.203439 | -4.219054 | -1.184919 | -1.24031 | ... | -0.604501 | 0.750054 | -3.360521 | 0.856988 | -2.751451 | -1.582735 | 1.672246 | 0.656438 | -0.932473 | 2.987436 |
| 3 | 1.57327 | -0.580318 | -0.866332 | -0.603812 | 3.125716 | 0.870321 | -0.161992 | 4.499666 | 1.038741 | -1.092716 | ... | 1.022959 | 1.275598 | -3.48011 | -1.065252 | 2.153133 | 1.563539 | 2.767117 | 0.215748 | 0.619645 | 1.883397 |
| 4 | -0.613071 | -0.644204 | 1.112558 | -0.032397 | 3.490142 | -0.011935 | 1.443521 | -4.290282 | -1.761308 | 0.807652 | ... | 0.513906 | -1.803473 | 0.518579 | -0.205029 | -4.744566 | -1.520015 | 1.830651 | 0.870772 | -1.894609 | 0.408332 |


In [4]:
# Shape of the training data (rows, columns)
s_train.shape

(1000, 40)

In [5]:
# Check the data type of every column in train
s_train.dtypes

0     float64
1     float64
2     float64
3     float64
4     float64
5     float64
6     float64
7     float64
8     float64
9     float64
10    float64
11    float64
12    float64
13    float64
14    float64
15    float64
16    float64
17    float64
18    float64
19    float64
20    float64
21    float64
22    float64
23    float64
24    float64
25    float64
26    float64
27    float64
28    float64
29    float64
30    float64
31    float64
32    float64
33    float64
34    float64
35    float64
36    float64
37    float64
38    float64
39    float64
dtype: object

In [6]:
# Shape of the test data (rows, columns)
s_test.shape

(9000, 40)

## 3. Build the models
### 3.1. Gradient Boosting model

In [7]:
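# Build the model (a regressor; its continuous output is rounded to 0/1 after prediction)
clf = GradientBoostingRegressor(random_state=7)

# Split the labelled data into a training set and a hold-out test set
x_train, x_test, y_train, y_test = train_test_split(s_train, s_trainLabels, test_size=0.25, random_state=42)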

In [8]:
# First, see what the default parameters give on the training split
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
y_pred = y_pred.round()  # round the predictions to the nearest integer

In [9]:
# Prediction accuracy
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.852
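
Rounding a regressor's output works here, but nothing guarantees the rounded values land on 0 or 1. A minimal sketch of the classifier alternative follows, assuming synthetic data from `make_classification` in place of the competition files; `gb_clf`, `features`, and `labels` are illustrative names, not part of the original notebook. The labels are flattened with `.values.ravel()` so they are passed as a 1-D array, which is the shape scikit-learn estimators expect.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for s_train / s_trainLabels (assumption: 40 features, binary label).
X, y = make_classification(n_samples=400, n_features=40, n_informative=10, random_state=0)
features = pd.DataFrame(X)
labels = pd.DataFrame(y)  # single-column frame, like a labels CSV read with header=None

x_tr, x_te, y_tr, y_te = train_test_split(features, labels, test_size=0.25, random_state=42)

# A classifier predicts class labels directly, so no rounding step is needed.
gb_clf = GradientBoostingClassifier(random_state=7)
gb_clf.fit(x_tr, y_tr.values.ravel())

pred = gb_clf.predict(x_te)
acc = accuracy_score(y_te.values.ravel(), pred)
print("Accuracy: ", acc)
```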


#### Hyperparameter tuning

In [10]:
# Define the hyperparameter ranges to sample from
param_dist = {
        'n_estimators':range(50,1000, 5),
        'max_depth':range(1,10,1),
        }
# Create the search object from the model and the parameter dictionary (n_jobs=-1 uses all CPU cores in parallel)
random_search = RandomizedSearchCV(clf, param_dist, scoring="neg_mean_squared_error",n_jobs=-1, verbose=1, cv=5)

# Search for the best parameters
random_result = random_search.fit(x_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   10.7s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   15.6s finished

In [11]:
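# Print the best score and the best parameters (the score is a negative MSE, not an accuracy)
print("Best score (negative MSE): %f using %s" % (random_result.best_score_, random_result.best_params_))

Best score (negative MSE): -0.108012 using {'n_estimators': 205, 'max_depth': 4}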
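
Because the search above is scored with `neg_mean_squared_error`, `best_score_` is a negated error (closer to zero is better) rather than an accuracy. The sketch below shows the same kind of search scored with accuracy instead. It assumes synthetic data, a classifier in place of the regressor, and a deliberately small parameter range so it runs quickly. Setting `n_iter` and `random_state` explicitly makes the number of sampled candidates visible and the sampling repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training split (assumption: 40 features, binary label).
X, y = make_classification(n_samples=200, n_features=40, n_informative=10, random_state=0)

# Small illustrative ranges; the notebook samples from much wider ones.
param_dist = {"n_estimators": range(20, 60, 10), "max_depth": range(1, 4)}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=7),
    param_dist,
    n_iter=4,            # number of parameter combinations to sample (default is 10)
    scoring="accuracy",  # best_score_ is then a mean cross-validated accuracy
    cv=3,
    random_state=0,      # makes the sampled combinations repeatable
)
search.fit(X, y)

print("Best CV accuracy: %f using %s" % (search.best_score_, search.best_params_))
```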


In [12]:
# Rebuild the model with the best parameters
clf_bestparam = GradientBoostingRegressor(max_depth=random_result.best_params_['max_depth'],
                                           n_estimators=random_result.best_params_['n_estimators'])
# Train the model
clf_bestparam.fit(x_train, y_train)

# Predict on the hold-out test set
y_pred = clf_bestparam.predict(x_test)
y_pred = y_pred.round()  # round the predictions to the nearest integer

In [13]:
# Prediction accuracy
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.876


### 3.2. Random Forest model

In [14]:
# Build the model (20 trees, each with a maximum depth of 4)
rf = RandomForestClassifier(n_estimators=20, max_depth=4)

# Train the model
rf.fit(x_train, y_train)

# Predict on the hold-out test set
y_pred = rf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.844


#### Hyperparameter tuning

In [15]:
# Use random search to find the best hyperparameters
# Define the hyperparameter ranges to sample from
param_dist = {
        'n_estimators':range(100,1000, 5),
        'max_depth':range(1,10,1),
        }
# Create the search object from the model and the parameter dictionary (n_jobs=-1 uses all CPU cores in parallel)
random_search = RandomizedSearchCV(rf, param_dist, scoring="accuracy",n_jobs=-1, verbose=1, cv=5)

# Search for the best parameters
random_result = random_search.fit(x_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   10.4s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   13.0s finished


In [16]:
# Print the best cross-validated accuracy and the best parameters
print("Best Accuracy: %f using %s" % (random_result.best_score_, random_result.best_params_))

Best Accuracy: 0.858667 using {'n_estimators': 630, 'max_depth': 9}


In [17]:
# Build the model with the best parameters
rf = RandomForestClassifier(max_depth=random_result.best_params_['max_depth']
                            , n_estimators=random_result.best_params_['n_estimators'])
# Train the model
rf.fit(x_train, y_train)

# Predict on the hold-out test set
y_pred = rf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.872


### 3.3. Logistic Regression model

In [18]:
# Build the model
logreg = linear_model.LogisticRegression()

# Train the model
logreg.fit(x_train, y_train)

# Predict on the hold-out test set
y_pred = logreg.predict(x_test)

In [19]:
acc = accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.812


#### Hyperparameter tuning

In [20]:
# Use random search to find the best hyperparameters
# Define the hyperparameter values to try
param_ldist = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }

# Create the search object from the model and the parameter dictionary (n_jobs=-1 uses all CPU cores in parallel)
random_search = RandomizedSearchCV(logreg, param_ldist, scoring="accuracy",n_jobs=-1, verbose=1, cv=5)

# Search for the best parameters
random_result = random_search.fit(x_train, y_train)

Fitting 5 folds for each of 7 candidates, totalling 35 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  35 out of  35 | elapsed:    0.0s finished

In [22]:
# Print the best cross-validated accuracy and the best parameters
print("Best Accuracy: %f using %s" % (random_result.best_score_, random_result.best_params_))

Best Accuracy: 0.818667 using {'C': 0.01}
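
The tuned logistic regression is only scored by cross-validation above, while the other two models are also scored on the hold-out split. A hedged sketch of that extra step follows. It assumes synthetic data, and it swaps in `GridSearchCV` for `RandomizedSearchCV`, since with only seven values of `C` every candidate is tried anyway. `grid` and `holdout_acc` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the labelled data (assumption: 40 features, binary label).
X, y = make_classification(n_samples=400, n_features=40, n_informative=10, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Exhaustive search over the same seven C values used in the notebook.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
    scoring="accuracy",
    cv=5,
)
grid.fit(x_tr, y_tr)

# best_estimator_ is refit on the whole training split, so it can score the hold-out set directly.
holdout_acc = grid.best_estimator_.score(x_te, y_te)
print("Best CV accuracy: %f using %s" % (grid.best_score_, grid.best_params_))
print("Hold-out accuracy: ", holdout_acc)
```

Scoring all three tuned models on the same hold-out split keeps the final comparison like for like.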


## 4. Predict on the test dataset and save the results

In [56]:
# Predict on the test dataset
test_pred = rf.predict(s_test)
test_pred = test_pred.round()

In [57]:
submit = pd.DataFrame(test_pred)
submit.index=range(1,9001)
submit = submit.reset_index()
submit.columns = ['Id', 'Solution']
submit.head()

| | Id | Solution |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 2 | 0 |
| 2 | 3 | 0 |
| 3 | 4 | 0 |
| 4 | 5 | 0 |


In [58]:
# Save the results as a CSV file
submit.to_csv('Kaggle_48.csv', encoding='utf-8', index=False)
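
The submission frame can also be built in one step, with the `Id` column derived from the number of predictions instead of a hard-coded `range(1, 9001)`. The sketch below uses a short placeholder array in place of the real `test_pred`, and writes the CSV to a string so that it leaves no file behind.

```python
import numpy as np
import pandas as pd

# Placeholder predictions standing in for rf.predict(s_test).
test_pred = np.array([1, 0, 0, 0, 0])

# Ids start at 1 and run to the number of predictions.
submit = pd.DataFrame({
    "Id": np.arange(1, len(test_pred) + 1),
    "Solution": test_pred.astype(int),
})

# to_csv returns the CSV text when no path is given.
csv_text = submit.to_csv(index=False)
print(csv_text)
```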