## [作業重點]
確保你了解隨機森林模型中每個超參數的意義，並觀察調整超參數對結果的影響

## 作業

1. 試著調整 RandomForestClassifier(...) 中的參數，並觀察是否會改變結果？
2. 改用其他資料集 (boston, wine)，並與回歸模型與決策樹的結果進行比較

In [1]:
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

In [2]:
# 讀取鳶尾花資料集
iris = datasets.load_iris()

# 切分訓練集/測試集
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# 建立模型 (使用 20 顆樹，每棵樹的最大深度為 4)
clf = RandomForestClassifier(n_estimators=20, max_depth=4)

# 訓練模型
clf.fit(x_train, y_train)

# 預測測試集
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print()
print(iris.feature_names)
print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9736842105263158

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.08307542 0.0205459  0.4655107  0.43086799]


In [3]:
# 建立模型 (使用 40 顆樹，每棵樹的最大深度為 4)
clf = RandomForestClassifier(n_estimators=40, max_depth=4)

# 訓練模型
clf.fit(x_train, y_train)

# 預測測試集
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print()
print(iris.feature_names)
print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9736842105263158

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.09987197 0.04777135 0.40759228 0.44476441]


In [32]:
# 建立模型 (使用 20 顆樹，每棵樹的最大深度為 10)
clf = RandomForestClassifier(n_estimators=20,max_depth=10)

# 訓練模型
clf.fit(x_train, y_train)

# 預測測試集
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print()
print(iris.feature_names)
print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9777777777777777

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.11201313 0.02266162 0.0129218  0.01778589 0.02769078 0.06559569
 0.16513468 0.00919766 0.00520418 0.11971471 0.0707054  0.14831984
 0.22305461]


In [18]:
# 建立模型 (使用 40 顆樹，每棵樹的最大深度為 4)
clf = RandomForestClassifier(n_estimators=40, max_depth=4)

# 訓練模型
clf.fit(x_train, y_train)

# 預測測試集
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print()
print(iris.feature_names)
print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9736842105263158

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.10998094 0.03748818 0.4619191  0.39061178]


In [19]:
# 建立模型 (使用 100 顆樹，每棵樹的最大深度為 10)
clf = RandomForestClassifier(n_estimators=100, max_depth=10)

# 訓練模型
clf.fit(x_train, y_train)

# 預測測試集
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print()
print(iris.feature_names)
print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9736842105263158

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.10802709 0.0407653  0.39557087 0.45563675]


In [20]:
# 建立模型 (使用 500 顆樹，每棵樹的最大深度為不限)
clf = RandomForestClassifier(n_estimators=500, max_features="auto")

# 訓練模型
clf.fit(x_train, y_train)

# 預測測試集
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print()
print(iris.feature_names)
print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9736842105263158

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.09909199 0.03223516 0.42852839 0.44014445]


In [21]:
#讀取wine資料集
wine = datasets.load_wine()
#切分資料集
x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.25 , random_state=4)
# 建立模型 (使用 500 顆樹，每棵樹的最大深度為不限)
clf = RandomForestClassifier(n_estimators=500, max_features="auto")

# 訓練模型
clf.fit(x_train, y_train)

# 預測測試集
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print()
print(wine.feature_names)
print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9777777777777777

['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Feature importance:  [0.12418692 0.0271911  0.01624521 0.0424565  0.03717306 0.05052087
 0.1359205  0.01000377 0.02123676 0.15619418 0.08913086 0.12230721
 0.16743305]


In [31]:
#讀取boston資料集
boston = datasets.load_boston()
#切分資料集
x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.25 , random_state=4)
# 建立模型 (使用 500 顆樹，每棵樹的最大深度為不限)
clf = RandomForestRegressor(n_estimators=500, max_features="auto", min_samples_leaf=1)

# 訓練模型
clf.fit(x_train, y_train)

# 預測測試集
y_pred = clf.predict(x_test)

#acc = metrics.accuracy_score(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
r2 = metrics.r2_score(y_test, y_pred)
#print("Accuracy: ", acc)
print("MSE:", mse)
print("R2:", r2)
print()
print(boston.feature_names)
print("Feature importance: ", clf.feature_importances_)

MSE: 0.02807297777777778
R2: 0.9578905333333333

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
Feature importance:  [0.03524889 0.00252744 0.00169104 0.00445672 0.00777409 0.00247715
 0.38323382 0.00050295 0.00228299 0.09502683 0.02484664 0.17306349
 0.26686796]
