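## [Assignment focus]
By now you should have a clear picture of how the data in a dataset is structured, including the features and the labels. Keep in mind that in any future project, the data must be cleaned into this same format before it can be fed to a model for training.
Today's assignment introduces the decision tree, a very important model. Make sure you understand what each hyperparameter means, and try adjusting them to see how they affect the final predictions.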

## Assignment

1. Try adjusting the parameters of DecisionTreeClassifier(...) and observe whether the results change.


In [64]:
from sklearn import datasets, linear_model, metrics

# Use DecisionTreeClassifier for classification problems and DecisionTreeRegressor for regression problems
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd

In [65]:
# Load the iris dataset
iris = datasets.load_iris()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# Build the model
clf = DecisionTreeClassifier()

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.9736842105263158
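
To see how a single hyperparameter affects the result, one option is to sweep it while keeping the train/test split fixed. The sketch below reuses the same iris split; the candidate values, the `results` dictionary, and `random_state=0` on the classifier are illustrative additions, not part of the original notebook.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# Illustrative candidate values (an assumption, not from the notebook)
results = {}
for min_split in (2, 10, 40, 85):
    # random_state is fixed only to make repeated runs comparable
    clf = DecisionTreeClassifier(min_samples_split=min_split, random_state=0)
    clf.fit(x_train, y_train)
    results[min_split] = metrics.accuracy_score(y_test, clf.predict(x_test))

for min_split, acc in results.items():
    print(f"min_samples_split={min_split:>3}: accuracy={acc:.3f}")
```

The same loop can be pointed at `max_depth` or `min_samples_leaf` to compare their effect.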


**With min_samples_split set to 85, accuracy drops sharply to 0.68**

In [66]:
# Build the model
clf = DecisionTreeClassifier(max_depth=3, min_samples_split=85)
# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.6842105263157895
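
A plausible explanation for the drop: the training split holds only 112 samples, so after the root node is split, few or no child nodes still contain the 85 samples required to split again, and the tree stays very shallow. A minimal sketch for checking this with the fitted tree's `get_depth()` and `get_n_leaves()` methods; the names `default_clf` and `restricted_clf` and the `random_state=0` setting are hypothetical additions for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# Unrestricted tree versus the restricted one from the cell above
default_clf = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
restricted_clf = DecisionTreeClassifier(
    max_depth=3, min_samples_split=85, random_state=0
).fit(x_train, y_train)

print("training samples:", len(x_train))
for name, model in (("default", default_clf), ("restricted", restricted_clf)):
    # Classes the model actually predicts on the test set
    predicted_classes = np.unique(model.predict(x_test))
    print(name, "depth:", model.get_depth(),
          "leaves:", model.get_n_leaves(),
          "predicted classes:", predicted_classes)
```

If the restricted tree has fewer leaves than there are classes, it cannot predict every class, which would account for the lower accuracy.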


## Assignment

2. Switch to another dataset (boston, wine) and compare the results with those of a regression model.

In [93]:
wine = datasets.load_wine()
print(f'wine.data.shape: {wine.data.shape}')
print(f'wine.data.dtype: {wine.data.dtype}')
print(f'wine.keys: {wine.keys()}')

wine.data.shape: (178, 13)
wine.data.dtype: float64
wine.keys: dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])


In [68]:
print(wine['DESCR'])

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0

In [69]:
data = pd.DataFrame(wine.data, columns=wine.feature_names)
data['target'] = wine['target']
data.head()

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |


In [70]:
data.pivot_table(index='target')

| target | alcalinity_of_ash | alcohol | ash | color_intensity | flavanoids | hue | magnesium | malic_acid | nonflavanoid_phenols | od280/od315_of_diluted_wines | proanthocyanins | proline | total_phenols |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.037288 | 13.744746 | 2.455593 | 5.528305 | 2.982373 | 1.062034 | 106.338983 | 2.010678 | 0.29 | 3.157797 | 1.899322 | 1115.711864 | 2.840169 |
| 1 | 20.238028 | 12.278732 | 2.244789 | 3.08662 | 2.080845 | 1.056282 | 94.549296 | 1.932676 | 0.363662 | 2.785352 | 1.630282 | 519.507042 | 2.258873 |
| 2 | 21.416667 | 13.15375 | 2.437083 | 7.39625 | 0.781458 | 0.682708 | 99.3125 | 3.33375 | 0.4475 | 1.683542 | 1.153542 | 629.895833 | 1.67875 |


In [88]:
# Separate the input features from the target column
x, y = data.iloc[:, :-1], data['target']
# Split into training and test sets
train_x, test_x, train_y, test_y = \
    train_test_split(x, y, test_size=0.2, random_state=7)


In [89]:
# Build the model
clf_wine = DecisionTreeClassifier(max_depth=20, min_samples_split=20)
# Train the model
clf_wine.fit(train_x, train_y)

# Predict on the test set
pred_test_y = clf_wine.predict(test_x)

acc = metrics.accuracy_score(test_y, pred_test_y)
print("Accuracy: ", acc)

Accuracy:  0.9166666666666666
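
To see which features the fitted tree actually relies on, the `feature_importances_` attribute can be inspected. A small sketch, assuming the same split and hyperparameters as the cell above; `random_state=0` on the classifier and the `importances` Series are additions for illustration only.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()
x = pd.DataFrame(wine.data, columns=wine.feature_names)
y = wine.target
train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.2, random_state=7
)

# random_state is an assumption added so repeated runs give the same tree
clf_wine = DecisionTreeClassifier(max_depth=20, min_samples_split=20, random_state=0)
clf_wine.fit(train_x, train_y)

# Impurity-based importances, one value per feature, summing to 1
importances = pd.Series(clf_wine.feature_importances_, index=wine.feature_names)
print(importances.sort_values(ascending=False).head())
```

Features with an importance of zero were never used in a split.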


## Comparison with logistic regression

In [90]:
lg_wine = LogisticRegression(max_iter=10000, solver='lbfgs')
lg_wine.fit(train_x, train_y)

lg_pred_test_y = lg_wine.predict(test_x)

acc = metrics.accuracy_score(test_y, lg_pred_test_y)
print(acc)

0.9722222222222222
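
Because the test set here has only 36 samples, a comparison based on a single split is noisy. One way to get a steadier estimate is k-fold cross-validation. The sketch below is a variation under stated assumptions, not the notebook's own method: it uses 5 folds, fixes `random_state=0` for the tree, and wraps the logistic regression in a `StandardScaler` pipeline to help the solver converge.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()

# Same tree hyperparameters as above; random_state added for repeatability
tree = DecisionTreeClassifier(max_depth=20, min_samples_split=20, random_state=0)
# Scaling is an added step (assumption), not used in the original cell
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

# With an integer cv and a classifier, scikit-learn uses stratified folds
tree_scores = cross_val_score(tree, wine.data, wine.target, cv=5)
logreg_scores = cross_val_score(logreg, wine.data, wine.target, cv=5)

print(f"decision tree:       {tree_scores.mean():.3f} +/- {tree_scores.std():.3f}")
print(f"logistic regression: {logreg_scores.mean():.3f} +/- {logreg_scores.std():.3f}")
```

Comparing the fold means and their spread shows whether the gap seen on one split holds up across different partitions of the data.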
