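## [Assignment focus]
Make sure you understand what each hyperparameter of the random forest model means, and observe how tuning those hyperparameters affects the results.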
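
As a starting point for reading the hyperparameters, the sketch below prints a handful of the settings that `RandomForestClassifier` exposes through `get_params()`. The selection of names and the one-line descriptions in the comments are my own summary, not part of the original notebook, and the printed defaults depend on the installed scikit-learn version.

```python
from sklearn.ensemble import RandomForestClassifier

# Build a forest with default settings and read its hyperparameters back.
default_forest = RandomForestClassifier()
forest_params = default_forest.get_params()

# A hand-picked subset of hyperparameters (illustrative, not exhaustive).
described = {
    "n_estimators": "number of trees in the forest",
    "max_depth": "maximum depth of each tree (None lets trees grow fully)",
    "max_features": "number of features considered at each split",
    "min_samples_split": "minimum samples required to split a node",
    "min_samples_leaf": "minimum samples required in a leaf",
    "bootstrap": "whether each tree is trained on a bootstrap sample",
}

for name, meaning in described.items():
    print("{} = {!r}  # {}".format(name, forest_params[name], meaning))
```

Calling `get_params()` on any fitted or unfitted estimator works the same way, so the same loop can be reused after changing a setting to confirm what was actually applied.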

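## Assignment

1. Try adjusting the parameters in RandomForestClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results against the regression model and the decision tree.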
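
One possible way to approach item 2 on the wine dataset is sketched here. It assumes that "regression model" refers to logistic regression for this classification task; that reading, the variable names, and the fixed `random_state=4` are assumptions made for illustration, not taken from the notebook.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Load the wine data and hold out 25% for testing, keeping class proportions.
wine = load_wine()
wx_train, wx_test, wy_train, wy_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4, stratify=wine.target)

# Three candidate models; logistic regression gets scaled inputs.
wine_models = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision tree": DecisionTreeClassifier(random_state=4),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=4),
}

# Fit each model on the same split and record its test accuracy.
wine_scores = {}
for name, model in wine_models.items():
    model.fit(wx_train, wy_train)
    wine_scores[name] = model.score(wx_test, wy_test)
    print("{}: test accuracy = {:.4f}".format(name, wine_scores[name]))
```

Using one shared split keeps the comparison fair, although with a test set this small a single misclassified sample shifts the accuracy noticeably.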

In [1]:
import multiprocessing
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from xgboost import XGBRegressor
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import matthews_corrcoef
from sklearn.metrics import confusion_matrix

In [2]:
def performance(model, x_test, y_test):
    """Print the test accuracy and confusion matrix of a fitted classifier."""
    print("test accuracy: ", round(model.score(x_test, y_test), 4))
    print("confusion matrix: \n", confusion_matrix(y_test, model.predict(x_test)))

In [3]:
iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

In [4]:
for num_est in [100, 200, 300, 400]:
    for max_depth in [5, 6, 7, 8]:
        # No random_state is set, so the accuracies below can vary slightly between runs.
        clf = RandomForestClassifier(
            n_estimators=num_est,
            max_depth=max_depth,
            bootstrap=True)
        clf.fit(x_train, y_train)
        y_pred = clf.predict(x_test)
        acc = metrics.accuracy_score(y_test, y_pred)
        print("# of estimators: {} | Max depth: {} | Accuracy: {}".format(num_est, max_depth, acc))

# of estimators: 100 | Max depth: 5 | Accuracy: 0.9736842105263158
# of estimators: 100 | Max depth: 6 | Accuracy: 0.9473684210526315
# of estimators: 100 | Max depth: 7 | Accuracy: 0.9736842105263158
# of estimators: 100 | Max depth: 8 | Accuracy: 0.9736842105263158
# of estimators: 200 | Max depth: 5 | Accuracy: 0.9736842105263158
# of estimators: 200 | Max depth: 6 | Accuracy: 0.9736842105263158
# of estimators: 200 | Max depth: 7 | Accuracy: 0.9736842105263158
# of estimators: 200 | Max depth: 8 | Accuracy: 0.9736842105263158
# of estimators: 300 | Max depth: 5 | Accuracy: 0.9736842105263158
# of estimators: 300 | Max depth: 6 | Accuracy: 0.9736842105263158
# of estimators: 300 | Max depth: 7 | Accuracy: 0.9736842105263158
# of estimators: 300 | Max depth: 8 | Accuracy: 0.9736842105263158
# of estimators: 400 | Max depth: 5 | Accuracy: 0.9736842105263158
# of estimators: 400 | Max depth: 6 | Accuracy: 0.9736842105263158
# of estimators: 400 | Max depth: 7 | Accuracy: 0.973684210526
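
The test split here holds only 38 samples, so each misclassification moves the accuracy by about 0.026 and most settings end up tied. A hedged alternative is to score each setting with cross-validation. The sketch below uses a reduced grid, the full iris data, and a fixed `random_state=4`, all of which are choices made for illustration and not part of the original run.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris_x, iris_y = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for a small grid of settings.
cv_results = {}
for n_est in [100, 200]:
    for depth in [5, 8]:
        forest = RandomForestClassifier(
            n_estimators=n_est, max_depth=depth, random_state=4)
        fold_scores = cross_val_score(forest, iris_x, iris_y, cv=5)
        cv_results[(n_est, depth)] = fold_scores.mean()
        print("# of estimators: {} | Max depth: {} | CV accuracy: {:.4f} (+/- {:.4f})".format(
            n_est, depth, fold_scores.mean(), fold_scores.std()))
```

Reporting the spread across folds alongside the mean makes it easier to judge whether a difference between two settings is larger than the noise.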

In [5]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV

xgb = XGBClassifier(random_state=17, verbosity=0)
params = { 
    'min_child_weight': [1, 5, 10],
    'n_estimators': [100, 200, 300, 400],
    'subsample': [0.6, 0.8, 1.0],
    'learning_rate': [0.1, 0.01, 0.001],
    'max_depth': [5, 10, 15, 20]
}
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
random_search = RandomizedSearchCV(
    xgb, 
    param_distributions=params, 
    n_iter=5, 
    scoring="f1_weighted", 
    n_jobs=multiprocessing.cpu_count(), 
    cv=skf,  # the splitter is applied to x_train, y_train inside fit()
    verbose=2, 
    random_state=17)
random_search.fit(x_train, y_train)
xgb = random_search.best_estimator_
performance(xgb, x_test, y_test)

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=16)]: Using backend LokyBackend with 16 concurrent workers.


test accuracy:  0.9737
confusion matrix: 
 [[18  0  0]
 [ 0  7  1]
 [ 0  0 12]]


[Parallel(n_jobs=16)]: Done   7 out of  25 | elapsed:    1.4s remaining:    3.7s
[Parallel(n_jobs=16)]: Done  20 out of  25 | elapsed:    1.5s remaining:    0.3s
[Parallel(n_jobs=16)]: Done  25 out of  25 | elapsed:    1.5s finished


In [6]:
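# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release to run.
boston = datasets.load_boston()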
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=4)

In [7]:
for c in [1, 2, 3, 4, 5, 6]:
    reg = make_pipeline(StandardScaler(), svm.SVR(C=c))
    reg.fit(x_train, y_train)
    y_pred = reg.predict(x_test)
    mse = metrics.mean_squared_error(y_test, y_pred)
    print("Penalties: {} | MSE: {}".format(c, mse))

Penalties: 1 | MSE: 42.670029732855795
Penalties: 2 | MSE: 31.870269694362477
Penalties: 3 | MSE: 27.355687749608045
Penalties: 4 | MSE: 25.013394160606925
Penalties: 5 | MSE: 23.04952632025864
Penalties: 6 | MSE: 21.757800048310777


In [8]:
for max_depth in [1, 2, 3, 4, 5, 6]:
    reg = XGBRegressor(max_depth=max_depth, objective="reg:squarederror", verbosity=0)
    reg.fit(x_train, y_train)
    y_pred = reg.predict(x_test)
    mse = metrics.mean_squared_error(y_test, y_pred)
    print("Max depth: {} | MSE: {}".format(max_depth, mse))

Max depth: 1 | MSE: 16.694057583801456
Max depth: 2 | MSE: 17.454989601610414
Max depth: 3 | MSE: 13.994183218680243
Max depth: 4 | MSE: 12.8215578839198
Max depth: 5 | MSE: 14.113723560255737
Max depth: 6 | MSE: 18.419417445034618
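
The cells above cover SVR and XGBoost, but not the random forest, linear regression, or decision tree comparison that the assignment asks for. A minimal sketch of that comparison follows. It substitutes the bundled diabetes dataset for boston (an assumption, since `load_boston` is unavailable in recent scikit-learn releases), so its MSE values are not comparable with the numbers printed above.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in regression dataset (assumption: used here in place of boston).
diab_x, diab_y = load_diabetes(return_X_y=True)
dx_train, dx_test, dy_train, dy_test = train_test_split(
    diab_x, diab_y, test_size=0.25, random_state=4)

regressors = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=4),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=4),
}

# Fit each regressor on the same split and record its test MSE.
reg_mse = {}
for name, reg_model in regressors.items():
    reg_model.fit(dx_train, dy_train)
    reg_mse[name] = mean_squared_error(dy_test, reg_model.predict(dx_test))
    print("{} | MSE: {:.2f}".format(name, reg_mse[name]))
```

The same loop can be pointed at another regression dataset by swapping the loader, which keeps the three models on identical train and test splits.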
