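## [Example Highlights]
Understand the steps of building a machine learning model, the data types involved, and how the results are evaluated.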

In [1]:
from sklearn import datasets, metrics

# Use DecisionTreeClassifier for classification problems and DecisionTreeRegressor for regression problems
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import train_test_split

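## Four Steps to Build a Model

Building a machine learning model in Scikit-learn is straightforward. The workflow has roughly four steps:

1. Load the data and check its shape: how many samples (rows), how many features (columns), and what type the label is.
    - Ways to load data:
        - **Read a .csv file with pandas:** pd.read_csv
        - **Read a .txt file with numpy:** np.loadtxt
        - **Use a dataset built into Scikit-learn:** sklearn.datasets.load_xxx
    - **Check the data size:** data.shape (data should be an np.array or a DataFrame)
2. Split the data into training (train) and test (test) sets
    - train_test_split(data)
3. Create the model and fit it to the training data
    - clf = DecisionTreeClassifier()
    - clf.fit(x_train, y_train)
4. Feed the test features into the trained model to get predictions, then evaluate them against the test labels (y_test)
    - y_pred = clf.predict(x_test)
    - accuracy_score(y_test, y_pred)
    - f1_score(y_test, y_pred) (multi-class labels also need an average argument, e.g. average='macro')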
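
Step 1 can be sketched as follows. This is a minimal illustration, not part of the original notebook: the CSV text and its column names (`f1`, `f2`, `label`) are hypothetical and held in memory so the example runs without a file on disk.

```python
import io

import numpy as np
import pandas as pd
from sklearn import datasets

# Built-in dataset: load_iris returns an object with .data (features) and .target (labels)
iris = datasets.load_iris()
print(iris.data.shape)    # (150, 4): 150 samples, 4 features
print(iris.target.shape)  # (150,)
print(iris.target.dtype)  # integer class labels

# Hypothetical CSV content (made-up values), used in place of a real file path
csv_text = "f1,f2,label\n1.0,2.0,0\n3.0,4.0,1\n5.0,6.0,0\n"

# pd.read_csv accepts a path or a file-like object and returns a DataFrame
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (3, 3)

# np.loadtxt reads plain numeric text; skiprows=1 skips the header line
arr = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)
print(arr.shape)  # (3, 3)
```

Checking `shape` and the label dtype first makes it easier to decide between a classifier (discrete labels) and a regressor (continuous labels).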

In [19]:
# Load the iris dataset
iris = datasets.load_iris()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# Build the model
clf = DecisionTreeClassifier(criterion='gini', max_depth=None,
                             min_samples_split=2)

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

In [20]:
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.9736842105263158
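
Accuracy is only one of the metrics listed in step 4. The sketch below adds an F1 score and a confusion matrix for the same split. Iris has three classes, so `f1_score` needs an explicit `average` setting; `"macro"` is one reasonable choice here, not the only one. The `random_state=0` on the tree is an addition for repeatability and is not in the original cell.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# random_state on the tree is an assumption added here so repeated runs agree
clf = DecisionTreeClassifier(random_state=0)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

# average="macro": unweighted mean of the per-class F1 scores
macro_f1 = metrics.f1_score(y_test, y_pred, average="macro")
print("Macro F1:", macro_f1)

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)
```

The off-diagonal entries of the confusion matrix show which classes get mixed up, something a single accuracy number hides.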


In [21]:
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [22]:
print("Feature importance: ", clf.feature_importances_)

Feature importance:  [0.01796599 0.         0.52229134 0.45974266]
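
The raw importance array is easier to read when paired with the feature names printed earlier. A minimal sketch, assuming the same split as above and an added `random_state=0` on the tree (so the exact values may differ from the output shown):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# random_state is an assumption added for repeatability
clf = DecisionTreeClassifier(random_state=0)
clf.fit(x_train, y_train)

# Pair each feature name with its importance and sort from most to least important
ranked = sorted(
    zip(iris.feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:20s} {importance:.4f}")
```

The importances are normalized to sum to 1, so each value can be read as that feature's share of the total impurity reduction in the tree.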


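## Homework

1. Try adjusting the parameters of DecisionTreeClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare against the results of a regression model.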
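
One possible way to approach item 1 is to loop over a few parameter values and compare test accuracy. The sketch below uses the wine dataset from item 2; the parameter grid and the `random_state=0` on the tree are illustrative choices, not part of the original notebook.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4
)

results = {}
for criterion in ("gini", "entropy"):
    for max_depth in (1, 2, 3, None):  # None lets the tree grow until leaves are pure
        clf = DecisionTreeClassifier(
            criterion=criterion, max_depth=max_depth, random_state=0
        )
        clf.fit(x_train, y_train)
        acc = metrics.accuracy_score(y_test, clf.predict(x_test))
        results[(criterion, max_depth)] = acc
        print(f"criterion={criterion:8s} max_depth={max_depth!s:5s} accuracy={acc:.4f}")
```

A single train/test split gives a noisy comparison; differences of one or two test samples between settings should not be over-interpreted.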

In [47]:
# Note: load_boston was removed in scikit-learn 1.2; this cell needs an earlier version
boston = datasets.load_boston()
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=4)

In [36]:
clf = DecisionTreeRegressor()
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
y_pred

array([14.4, 22. , 21.1, 22.5, 44.8, 23.4, 34.9, 23.2, 17.2, 15.4, 24.1,
       16.5, 22.7, 23.7, 22.6, 13.4, 16.2, 12.8, 10.4, 14.8, 10.4, 15.4,
       21.5, 18.5, 19. , 21.4, 13.4, 15.2, 23. , 21.7,  9.5, 22.9, 36.5,
       20.3, 13.1,  9.5, 33.2, 46. , 24.6, 23.7, 46. , 24.8, 12.7, 29.4,
       25.1, 20.9, 50. , 19.4, 23. , 22.2, 30.8, 23.8, 11.3, 27.1, 15.7,
       19.3, 22. , 33.1, 14.5, 33.1, 16.1, 21.4, 37. , 19.3, 43.1, 29.4,
       21. ,  8.4, 23.2, 23.2, 21.7, 16.2, 22. , 30.1, 21.7, 33.4, 14.5,
       22. , 17.7, 22.2, 21.7, 15.2, 26.6, 23. , 24.7, 20.6, 32.2, 24.5,
       22.5, 50. , 29. , 50. , 19.4, 44.8, 24.4, 19.4, 20. , 23.1, 15.6,
       19. ,  8.4, 19.3, 34.9, 14.5, 23.7, 20.6, 34.9, 30.3, 50. , 22.3,
       22.2, 19.6, 13.2, 37. , 33.4, 29.6, 50. , 13.8,  7. , 19.6, 21.2,
       12.5, 23.3, 22.6, 16.2, 24. , 50. ])

In [37]:
y_test

array([16.5, 24.8, 17.4, 19.3, 37.6, 24.2, 35.4, 19.9, 27.5, 17. , 31.2,
       24.4, 16.1, 27. , 21. , 14.9, 18.9,  6.3, 16.3, 13.9,  8.8, 19.4,
       18.8, 19.8, 17.5, 19.3, 20. , 14.3, 16.1, 19.5, 11. , 21.9, 31. ,
       22. , 15.1, 13.3, 28.7, 46.7, 22.2, 22.8, 42.3, 41.3, 16.7, 31.1,
       26.7, 19.4, 50. , 16.6, 19.5, 24.4, 28.5, 22.3, 12.1, 28.6, 15.6,
       19.2, 27.5, 32. , 20.2, 32.4, 18.4, 19.9, 29.8, 20.1, 43.5, 24.5,
       50. ,  7.2, 19.1, 21.2, 22.6, 22.9, 25. , 23.3, 17.3, 33. , 17.8,
       23.8, 10.9, 18.6, 19.3, 16.7, 28. , 18.2, 29.1, 11.9, 32.7, 18.3,
       22.4, 45.4, 31.5, 48.5, 19.8, 41.7, 22.2, 20.3, 20.7, 50. , 11.8,
       19.5,  8.7, 23.3, 36.4, 13.3, 24.8, 20.4, 44. , 29. , 39.8, 22.9,
       23. , 15.3, 23.7, 30.5, 33.2, 26.4, 50. , 14.2,  8.1, 16. , 20. ,
        8.5, 23.7, 26.4, 18.5, 20. , 50. ])

In [38]:
mae = metrics.mean_absolute_error(y_test, y_pred)  # evaluate with MAE
mse = metrics.mean_squared_error(y_test, y_pred)  # evaluate with MSE
r2 = metrics.r2_score(y_test, y_pred)  # evaluate with R-squared

In [39]:
print("MAE: ", mae)
print("MSE: ", mse)
print("R-square: ", r2)

MAE:  3.418110236220472
MSE:  28.65881889763779
R-square:  0.7143644760680452
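
To make the three regression metrics concrete, the sketch below computes them by hand with NumPy and checks them against `sklearn.metrics`. Because `load_boston` is no longer available from scikit-learn 1.2 onward, this example assumes `load_diabetes` as a stand-in regression dataset, so its numbers are not comparable to the output above. The `random_state=0` on the tree is also an addition.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumption: load_diabetes is used as a stand-in built-in regression dataset
data = datasets.load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=4
)

reg = DecisionTreeRegressor(random_state=0)  # random_state added for repeatability
reg.fit(x_train, y_train)
y_pred = reg.predict(x_test)

residuals = y_test - y_pred

# MAE: mean of absolute errors
mae_manual = np.mean(np.abs(residuals))
# MSE: mean of squared errors
mse_manual = np.mean(residuals ** 2)
# R-squared: 1 - (residual sum of squares / total sum of squares)
r2_manual = 1 - np.sum(residuals ** 2) / np.sum((y_test - y_test.mean()) ** 2)

print("MAE matches:", np.isclose(mae_manual, metrics.mean_absolute_error(y_test, y_pred)))
print("MSE matches:", np.isclose(mse_manual, metrics.mean_squared_error(y_test, y_pred)))
print("R2 matches: ", np.isclose(r2_manual, metrics.r2_score(y_test, y_pred)))
```

MAE is in the same units as the target, MSE penalizes large errors more heavily, and R-squared compares the model against always predicting the mean of `y_test`.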
