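## [Key Points]
Make sure you understand what each hyperparameter of the random forest model means, and observe how tuning them affects the results.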

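## Homework

1. Try adjusting the parameters of RandomForestClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of the regression model and the decision tree.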

In [2]:
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')


In [8]:
# Use the iris data again, but with adjusted hyperparameters
# Load the iris dataset
iris = datasets.load_iris()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# Build the model (5 trees, each with a maximum depth of 2)
clf = RandomForestClassifier(n_estimators=5, max_depth=2)

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

print(iris.feature_names)
print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9736842105263158
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.14085379 0.         0.40368235 0.45546386]


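### Answer
- I tried several n_estimators / max_depth combinations: (10, 4), (5, 4), (10, 3), and (5, 2). Accuracy stayed at 0.9737 in every case, while the feature importances changed:
    - (10, 4): [0.04648369 0.01285688 0.55229025 0.38836918]
    - (5,  4): [0.01874326 0.00949248 0.30601228 0.66575198]
    - (10, 3): [0.08019059 0.02066845 0.45252995 0.44661102]
    - (5,  2): [0.14085379 0.         0.40368235 0.45546386]
- The model is built without a fixed random_state, so part of this variation may come from random seeding rather than from the hyperparameters alone.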
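To separate the effect of the hyperparameters from run-to-run randomness, the sweep can be repeated with a fixed seed. The following is a minimal sketch, not the notebook's original code: `random_state=0` and the `results` dictionary are assumptions introduced here, and the combinations are the four listed with their importances above.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# (n_estimators, max_depth) combinations taken from the answer above
combos = [(10, 4), (5, 4), (10, 3), (5, 2)]
results = {}  # hypothetical container, added for this sketch

for n_estimators, max_depth in combos:
    # random_state=0 is an arbitrary seed chosen for this sketch so that
    # differences between runs come only from the hyperparameters
    clf = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=0
    )
    clf.fit(x_train, y_train)
    acc = metrics.accuracy_score(y_test, clf.predict(x_test))
    results[(n_estimators, max_depth)] = (acc, clf.feature_importances_)
    print((n_estimators, max_depth), "accuracy:", round(acc, 4))
    print("  importances:", clf.feature_importances_.round(3))
```

With the seed fixed, rerunning the cell reproduces the same importances for each combination, so any remaining differences between rows can be attributed to `n_estimators` and `max_depth`.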

In [15]:
# Try the wine dataset

# Load the wine dataset
wine = datasets.load_wine()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.25, random_state=4)

# Build one model per (n_estimators, max_depth) combination
parameters = [(20, 4), (10, 4), (5, 4), (20, 3), (10, 3), (5, 3), (20, 2), (10, 2), (5, 2)]
for param in parameters:
    n_estimators, max_depth = param
    print('=' * 50)
    print(n_estimators, max_depth)
    print('=' * 50)

    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)

    # Train the model
    clf.fit(x_train, y_train)

    # Predict on the test set
    y_pred = clf.predict(x_test)

    acc = metrics.accuracy_score(y_test, y_pred)
    print("Accuracy: ", acc)

    print(wine.feature_names)
    print("Feature importance: ", clf.feature_importances_)

20 4
Accuracy:  0.9777777777777777
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Feature importance:  [0.14124954 0.03110262 0.01219913 0.03003404 0.03327835 0.04356411
 0.14123949 0.00864039 0.0097575  0.11904287 0.08130055 0.17546231
 0.17312911]
10 4
Accuracy:  0.9777777777777777
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Feature importance:  [0.10989523 0.11765354 0.01788187 0.04072617 0.02819106 0.03294067
 0.18214196 0.04413701 0.01659838 0.19139465 0.05689708 0.0556818
 0.10586059]
5 4
Accuracy:  0.9555555555555556
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 

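### Answer
- Compared with the decision tree homework (homework 42), where a DecisionTree was trained and evaluated on the Wine data, RandomForest reaches a much higher accuracy (0.95 to 0.97) than DecisionTree (0.87) across several parameter combinations.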
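The comparison above relies on numbers from two separate notebooks. A side-by-side check on the same data might look like the sketch below. It is an illustration under stated assumptions, not the code from homework 42: the 5-fold cross-validation, `random_state=0`, and the default `DecisionTreeClassifier` settings are all choices made here.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()

# Both models are evaluated on the same folds; the seeds are arbitrary
models = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(
        n_estimators=20, max_depth=4, random_state=0
    ),
}

cv_scores = {}  # hypothetical container, added for this sketch
for name, model in models.items():
    # 5-fold cross-validated accuracy instead of a single train/test split
    scores = cross_val_score(model, wine.data, wine.target, cv=5)
    cv_scores[name] = scores
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Cross-validation is used here because the wine test set holds only 45 samples, so a single split can shift accuracy by a couple of points from one misclassified sample.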

In [8]:
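# Load the Boston dataset
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
boston = datasets.load_boston()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=4)

# Build the model (20 trees, each with a maximum depth of 4)
clf = RandomForestRegressor(n_estimators=20, max_depth=4)

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

# Compute the MSE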
print(metrics.mean_squared_error(y_test, y_pred))

print(boston.feature_names)
print("Feature importance: ", clf.feature_importances_)

15.588027165295788
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
Feature importance:  [4.20903718e-02 1.27258129e-04 2.38059006e-03 9.20964577e-04
 1.54522478e-02 4.49292729e-01 1.25808065e-02 4.82739255e-02
 1.23961311e-03 4.48777793e-03 2.26816179e-02 4.94790150e-03
 3.95524196e-01]


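### Answer
- Compared with DecisionTreeRegressor, RandomForestRegressor gives a lower MSE: about 17 to 21 across runs (the run shown above gave 15.59), versus 25 to 28 for DecisionTreeRegressor even after GridSearch selected its best parameters.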
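The regression comparison can be sketched in one cell as follows. Because `load_boston` is no longer available in recent scikit-learn releases, this sketch swaps in `load_diabetes`, so its MSE values are not comparable to the Boston numbers above. The parameter grid, `cv=5`, and `random_state=0` are likewise assumptions made for illustration, not the settings used in the earlier decision tree homework.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Swapped-in dataset: diabetes instead of Boston (an assumption of this sketch)
data = datasets.load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=4
)

# Decision tree tuned with a small, hypothetical grid
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 3, 4, 5], "min_samples_split": [2, 5, 10]},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(x_train, y_train)
tree_mse = metrics.mean_squared_error(y_test, grid.predict(x_test))

# Random forest with the same settings as the cell above, plus a fixed seed
forest = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=0)
forest.fit(x_train, y_train)
forest_mse = metrics.mean_squared_error(y_test, forest.predict(x_test))

print("Best tree params:", grid.best_params_)
print("DecisionTreeRegressor MSE:", tree_mse)
print("RandomForestRegressor MSE:", forest_mse)
```

Evaluating both models on the same held-out split keeps the comparison fair, since the tuned tree and the forest are scored on identical test samples.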