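## [Assignment focus]
Learn how to use the hyper-parameter search tools in Sklearn to find the best hyperparameters.

### Assignment
Pick a different dataset and apply hyper-parameter search to see whether you can find the best combination of hyperparameters.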

In [37]:
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier, RandomForestClassifier


In [2]:
diabetes = datasets.load_diabetes()
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.25, random_state=42)

# Build the baseline model with default hyperparameters
clf = GradientBoostingRegressor(random_state=7)

In [3]:
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print(metrics.mean_squared_error(y_test, y_pred))

3194.3823958820526


In [4]:
n_estimators = [100, 200, 300]
max_depth = [1, 3, 5]
param_grid = dict(n_estimators=n_estimators, max_depth=max_depth)

# Build the search object from the model and the parameter grid (n_jobs=-1 uses all CPU cores in parallel)
grid_search = GridSearchCV(clf, param_grid, scoring="neg_mean_squared_error", n_jobs=-1, verbose=1)

# Start searching for the best parameters
grid_result = grid_search.fit(x_train, y_train)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:   30.4s finished
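The "9 candidates, totalling 45 fits" figure in the log follows from the grid size times the number of folds. A minimal sketch of that arithmetic, assuming scikit-learn 0.22 or later (where `cv` defaults to 5 folds); `ParameterGrid` is used here only to count the combinations:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = {"n_estimators": [100, 200, 300], "max_depth": [1, 3, 5]}

# Every combination of the listed values is one candidate: 3 * 3 = 9
n_candidates = len(ParameterGrid(param_grid))

# Assumption: cv was not passed, so GridSearchCV falls back to 5-fold CV
cv_folds = 5
total_fits = n_candidates * cv_folds

print(n_candidates, total_fits)  # 9 45
```

With the default `refit=True`, the search also trains the winning candidate once more on the full training set, which the log does not count.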


In [5]:
grid_result.best_params_

{'max_depth': 1, 'n_estimators': 200}
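Rebuilding the model by hand from `best_params_` is one option. Another is to use the estimator the search has already refit, which also carries over settings such as `random_state`. The sketch below is an illustration, not the notebook's original approach: it assumes the default `refit=True` and uses a deliberately smaller grid and `cv=3` so it runs quickly.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

diabetes = datasets.load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.25, random_state=42
)

# Assumption: a reduced grid and 3 folds, chosen only to keep the sketch fast
param_grid = {"n_estimators": [50, 100], "max_depth": [1, 3]}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=7),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(x_train, y_train)

# With refit=True (the default), best_estimator_ is the best candidate
# retrained on all of x_train, so no manual rebuild is needed
best_model = search.best_estimator_

# The scorer is negated MSE, so flip the sign to read it as an error
cv_mse = -search.best_score_
test_mse = metrics.mean_squared_error(y_test, best_model.predict(x_test))

print(search.best_params_, cv_mse, test_mse)
```

Comparing `cv_mse` with `test_mse` gives a rough sense of how well the cross-validated estimate carries over to the held-out split.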

In [6]:
clf_bestparam = GradientBoostingRegressor(max_depth=grid_result.best_params_['max_depth'],
                                           n_estimators=grid_result.best_params_['n_estimators'])

# Train the model with the best parameters
clf_bestparam.fit(x_train, y_train)

# Predict on the test set
y_pred = clf_bestparam.predict(x_test)

In [7]:
print(metrics.mean_squared_error(y_test, y_pred))

2812.9857279113457
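To put the two test-set scores side by side, the short calculation below uses the values printed in this notebook (3194.38 for the default model, 2812.99 after the search). It is only arithmetic on those numbers. Both come from a single train/test split, so the gap is indicative rather than conclusive.

```python
import math

baseline_mse = 3194.3823958820526  # default GradientBoostingRegressor
tuned_mse = 2812.9857279113457     # max_depth=1, n_estimators=200

# Fraction of the baseline error removed by tuning
relative_reduction = (baseline_mse - tuned_mse) / baseline_mse
print(f"MSE reduced by {relative_reduction:.1%}")  # MSE reduced by 11.9%

# RMSE is in the same units as the target, which is easier to interpret
baseline_rmse = math.sqrt(baseline_mse)
tuned_rmse = math.sqrt(tuned_mse)
print(round(baseline_rmse, 2), round(tuned_rmse, 2))
```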


In [63]:
import pandas as pd
import numpy as np

data_path = 'data/'
x_train = pd.read_csv(data_path + 'train.csv', header=None)
y_train = pd.read_csv(data_path + 'trainLabels.csv', header=None)
x_test = pd.read_csv(data_path + 'test.csv', header=None)
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)
x_test = np.asarray(x_test)
y_train = y_train.ravel()
print('training_x Shape:', x_train.shape, ',training_y Shape:', y_train.shape, ',testing_x Shape:', x_test.shape)

training_x Shape: (1000, 40) ,training_y Shape: (1000,) ,testing_x Shape: (9000, 40)


In [64]:
clf = RandomForestClassifier()
n_estimators = [100, 200, 300]
max_depth = [1, 3, 5]
param_grid = dict(n_estimators=n_estimators, max_depth=max_depth)
grid_search = GridSearchCV(clf, param_grid, scoring="accuracy", n_jobs=-1, verbose=1,cv=10)
grid_result = grid_search.fit(x_train, y_train)

Fitting 10 folds for each of 9 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   24.9s
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:   31.9s finished


In [65]:
grid_result.best_params_

{'max_depth': 5, 'n_estimators': 200}
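Here the best `max_depth` is 5, the largest value in the grid, so the true optimum may lie outside the range that was searched. The helper below is a hypothetical convenience (`params_on_grid_edge` is not part of scikit-learn) that flags parameters whose best value sits on the boundary of the grid, which is a common cue to widen the grid and search again.

```python
def params_on_grid_edge(param_grid, best_params):
    """Return names of parameters whose best value is the smallest or largest tried."""
    on_edge = []
    for name, best in best_params.items():
        values = sorted(param_grid[name])
        # A single-value grid has no interior, so it is not informative
        if len(values) > 1 and best in (values[0], values[-1]):
            on_edge.append(name)
    return on_edge


param_grid = {"n_estimators": [100, 200, 300], "max_depth": [1, 3, 5]}
best_params = {"max_depth": 5, "n_estimators": 200}  # result shown above

print(params_on_grid_edge(param_grid, best_params))  # ['max_depth']
```

A follow-up search might then try larger depths, for example `max_depth = [5, 7, 9]`; whether that helps on this dataset is something only another run can show.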

In [66]:
clf_bestparam = RandomForestClassifier(max_depth=grid_result.best_params_['max_depth'],
                                       n_estimators=grid_result.best_params_['n_estimators'])

# Train the model with the best parameters
clf_bestparam.fit(x_train, y_train)

# Predict on the test set
y_pred = clf_bestparam.predict(x_test)

In [67]:
y_pred

array([1, 0, 0, ..., 1, 0, 1], dtype=int64)

In [68]:
rf_best_pred = pd.DataFrame(y_pred, columns=["Solution"])  # a list, not a set: column labels must be ordered
rf_best_pred.index.name = 'ID'

In [69]:
rf_best_pred.index += 1
rf_best_pred

ID,Solution
1,1
2,0
3,0
4,0
5,0
...,...
8996,0
8997,1
8998,1
8999,0


In [70]:
rf_best_pred.to_csv(data_path+'Submission04.csv')
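The three steps above (build the frame, name the index `ID`, shift it to start at 1) can also be done in one constructor call. The sketch below is one possible way to do it and writes to an in-memory buffer so the resulting CSV can be inspected. The three-element `y_pred` is a made-up stand-in for the real predictions.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the 9000 predictions produced above
y_pred = np.array([1, 0, 0])

# A 1-based index named "ID" and a single "Solution" column
submission = pd.DataFrame(
    {"Solution": y_pred},
    index=pd.RangeIndex(start=1, stop=len(y_pred) + 1, name="ID"),
)

# Write to a buffer instead of a file to check the layout first
buffer = io.StringIO()
submission.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The first line of the output is the `ID,Solution` header, followed by one row per prediction.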

In [62]:
train = pd.read_csv(data_path +'train.csv', header=None)
test = pd.read_csv(data_path +'test.csv', header=None)
test.shape

(9000, 40)