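## [Assignment Focus]
By now you should have a clear picture of what the data in a dataset looks like, including its features and labels. Whatever the project, remember that the data must be cleaned into this same format before it can be fed to a model for training.
Today's assignment introduces the decision tree, a very important model. Make sure you understand what each of its hyperparameters means, then try adjusting them and observe how the final predictions change.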
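
As a starting point for that kind of experiment, here is a minimal sketch that varies a single hyperparameter, `max_depth`, on the iris dataset and reports test accuracy together with the size of the fitted tree. The list of depths and the `random_state=0` seed are arbitrary choices for illustration, not values from the course material.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# Fit one tree per depth limit and record its test accuracy.
# The depth values below are illustrative assumptions.
depth_scores = {}
for depth in [1, 2, 3, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(x_train, y_train)
    depth_scores[depth] = metrics.accuracy_score(y_test, tree.predict(x_test))
    print(
        f"max_depth={depth}: accuracy={depth_scores[depth]:.3f}, "
        f"actual depth={tree.get_depth()}, leaves={tree.get_n_leaves()}"
    )
```

The same loop can be repeated for `min_samples_split`, `min_samples_leaf`, or `criterion` to see which settings actually move the score.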

## Assignment

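1. Try adjusting the parameters of DecisionTreeClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of a regression model.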
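
For the wine dataset, one possible way to set up the comparison in task 2 is sketched below. It assumes that "regression model" refers to logistic regression for this classification problem; the scaling step, `max_iter=1000`, and the random seeds are also assumptions made so the example runs cleanly, not part of the original assignment.

```python
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4
)

# Assumed comparison: a default decision tree against logistic regression.
# Features are standardized for logistic regression only; trees do not need it.
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

wine_scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    wine_scores[name] = metrics.accuracy_score(y_test, model.predict(x_test))
    print(f"{name}: accuracy={wine_scores[name]:.3f}")
```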

In [2]:
from sklearn import datasets, metrics
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split, GridSearchCV

In [6]:
# Revisit the iris dataset: use GridSearchCV to search for the best hyperparameters and see whether the result differs
#
# Reference signatures from an older scikit-learn release; some of these parameters
# (for example iid and presort) no longer exist in current versions.
#
# GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None, iid='warn',
# refit=True, cv='warn', verbose=0, pre_dispatch='2*n_jobs', error_score='raise-deprecating',
# return_train_score=False)
#
# DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None,
# min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
# max_features=None, random_state=None, max_leaf_nodes=None,
# min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)
#
iris = datasets.load_iris()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# Define the hyperparameter grid and build the grid-search model
parameters = {'criterion' : ['gini', 'entropy'],
              'splitter' : ['best', 'random'],
              'max_depth' : [3, 5, 7, None],
              'min_samples_split' : [2, 3, 4],
              'min_samples_leaf' : [1, 2]
             }
clf = GridSearchCV(DecisionTreeClassifier(), parameters)
# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

print(clf.best_params_)

{'criterion': 'gini', 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2, 'splitter': 'best'}


In [7]:
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.9736842105263158
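
To check whether the search actually changes the result, the tuned model can be put side by side with an untuned tree on the same split. This is only a sketch: the reduced grid, `cv=5`, and `random_state=0` are assumptions chosen to keep the example short and repeatable.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# Baseline: a tree with default hyperparameters.
baseline = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
baseline_acc = metrics.accuracy_score(y_test, baseline.predict(x_test))

# Tuned: grid search over a small, illustrative grid with 5-fold cross-validation.
param_grid = {"max_depth": [3, 5, 7, None], "min_samples_leaf": [1, 2]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(x_train, y_train)
tuned_acc = metrics.accuracy_score(y_test, search.predict(x_test))

print("baseline test accuracy:", baseline_acc)
print("tuned test accuracy:   ", tuned_acc)
print("best CV accuracy:      ", search.best_score_)
print("best parameters:       ", search.best_params_)
```

On a dataset as small and clean as iris the two test scores may well be identical; `best_score_` is the mean cross-validated score on the training folds, so it is not directly comparable to the test accuracy.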


In [13]:
# Regression tree, using the boston dataset
#
# Reference signature from an older scikit-learn release.
#
# DecisionTreeRegressor(criterion='mse', splitter='best', max_depth=None, min_samples_split=2,
# min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None,
# max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, presort=False)
#

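# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older release to run as written.
boston = datasets.load_boston()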

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=4)

# Build a model with default settings
clf = DecisionTreeRegressor()
# Train the model
clf.fit(x_train, y_train)
# Predict on the test set
y_pred = clf.predict(x_test)
# Compute the R^2 score
print(metrics.r2_score(y_test, y_pred))


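# Use GridSearchCV to search for the best parameters
# Note: 'mse' and 'mae' were renamed 'squared_error' and 'absolute_error' in
# scikit-learn 1.0, and the old names were removed in 1.2.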
parameters = {'criterion' : ['mse', 'friedman_mse', 'mae'],
              'splitter' : ['best', 'random'],
              'max_depth' : [3, 5, 7, None],
              'min_samples_split' : [2, 3, 4],
              'min_samples_leaf' : [1, 2]
             }
clf = GridSearchCV(DecisionTreeRegressor(), parameters)
# Train the model
clf.fit(x_train, y_train)
# Predict on the test set
y_pred = clf.predict(x_test)
# Best parameters
print(clf.best_params_)
# Compute the R^2 score
print(metrics.r2_score(y_test, y_pred))

0.7424762282399908
{'criterion': 'mse', 'max_depth': 7, 'min_samples_leaf': 1, 'min_samples_split': 2, 'splitter': 'best'}
0.7450228339071028
