
## Problem Definition

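The Iris dataset was originally collected by Edgar Anderson, who recorded morphological variation in iris flowers on the Gaspé Peninsula in Canada. Ronald Fisher later introduced it to statistics as an example for discriminant analysis.

The dataset contains 150 samples from 3 species of the genus Iris: Iris setosa, Iris versicolor, and Iris virginica. Each sample has 4 features, the length and width of the sepal and of the petal, which support quantitative analysis. Based on these features, Fisher developed a linear discriminant analysis that determines the species of a sample.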
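
As a rough illustration of the discriminant analysis mentioned above, the following sketch fits scikit-learn's `LinearDiscriminantAnalysis` to the same data. This is a modern library implementation, not a reproduction of Fisher's original procedure, and the variable names (`X_lda`, `y_lda`, `X_proj`) are chosen here for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load the four measurements and the species labels
X_lda, y_lda = load_iris(return_X_y=True)

# Fit on the full dataset (illustration only, no held-out split)
lda = LinearDiscriminantAnalysis()
lda.fit(X_lda, y_lda)

# Project onto the discriminant axes; with 3 classes there are at most 2
X_proj = lda.transform(X_lda)
print("Projected shape:", X_proj.shape)
print("Training accuracy:", lda.score(X_lda, y_lda))
```

The accuracy printed here is measured on the data used for fitting, so it is likely optimistic compared with a held-out estimate.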

## Data Collection

In [5]:
from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)

X, y = iris.data, iris.target
df = iris.frame
df

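| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |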


## Data Preprocessing

### Data Cleaning

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 6.0 KB
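
The `df.info()` output shows 150 non-null entries in every column. A slightly more explicit check could look like the sketch below, which counts missing values per column and fully duplicated rows. The names `df_check`, `missing_per_column`, and `n_duplicates` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_iris

df_check = load_iris(as_frame=True).frame

# Count missing values per column
missing_per_column = df_check.isna().sum()
print(missing_per_column)

# Count fully duplicated rows (same measurements and same label)
n_duplicates = int(df_check.duplicated().sum())
print("Duplicated rows:", n_duplicates)
```

Whether duplicated measurements should be dropped is a judgment call; two different flowers can legitimately share the same rounded measurements.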


### Exploratory Analysis

In [4]:
df_cor = df.corr()
df_cor

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| sepal length (cm) | 1.0 | -0.11757 | 0.871754 | 0.817941 | 0.782561 |
| sepal width (cm) | -0.11757 | 1.0 | -0.42844 | -0.366126 | -0.426658 |
| petal length (cm) | 0.871754 | -0.42844 | 1.0 | 0.962865 | 0.949035 |
| petal width (cm) | 0.817941 | -0.366126 | 0.962865 | 1.0 | 0.956547 |
| target | 0.782561 | -0.426658 | 0.949035 | 0.956547 | 1.0 |
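
Because `target` is a class code rather than a measured quantity, its Pearson coefficients in the matrix above should be read with some care. One complementary view, sketched below under the assumption that per-class means are informative enough, is to correlate only the four measurements and to compare feature means by species. The names `df_eda`, `feature_corr`, and `class_means` are illustrative.

```python
from sklearn.datasets import load_iris

iris_eda = load_iris(as_frame=True)
df_eda = iris_eda.frame

# Correlation among the four measurements only (label column excluded)
feature_corr = df_eda.drop(columns="target").corr()
print(feature_corr.round(2))

# Mean of each measurement per class, with readable class names
class_names = df_eda["target"].map(dict(enumerate(iris_eda.target_names)))
class_means = df_eda.drop(columns="target").groupby(class_names).mean()
print(class_means.round(2))
```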


### Data Splitting

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training samples:", len(X_train))
print("Test samples:", len(X_test))
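
A plain random split does not guarantee equal class counts in the test set. If balanced classes matter, a stratified split is one option. The sketch below is a variation on the split above, not the split the notebook actually uses, and the names `X_s`, `y_s`, `X_tr`, `X_te`, `y_tr`, `y_te` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_s, y_s = load_iris(return_X_y=True, as_frame=True)

# stratify=y_s keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_s, y_s, test_size=0.2, random_state=42, stratify=y_s
)

print(y_tr.value_counts().sort_index())
print(y_te.value_counts().sort_index())
```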

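### Label Encoding

The `target` column is already stored as integer codes (0, 1, 2), so no label conversion is needed.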

### Feature Scaling

In [7]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
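
To make the effect of `StandardScaler` concrete, the following sketch repeats the fit/transform steps on a fresh copy of the data and prints the column statistics. The names `demo_scaler`, `X_tr_scaled`, and `X_te_scaled` are illustrative; the split parameters mirror those used above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_d, y_d = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.2, random_state=42)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # statistics learned from training rows only
X_te_scaled = demo_scaler.transform(X_te)      # same statistics reused on the test rows

# Training columns end up with mean ~0 and standard deviation ~1
print("train mean:", X_tr_scaled.mean(axis=0).round(3))
print("train std: ", X_tr_scaled.std(axis=0).round(3))
# Test columns are generally close to, but not exactly, 0
print("test mean: ", X_te_scaled.mean(axis=0).round(3))
```

Fitting the scaler on the training rows only avoids leaking test-set statistics into preprocessing.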


## Model Training

In [8]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

In [10]:
knn.score(X_train, y_train)

0.9666666666666667
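
KNN is distance-based, so whether features are scaled can change which neighbors are nearest. One way to keep preprocessing identical at fit and predict time is to wrap both steps in a scikit-learn pipeline. The sketch below is an alternative arrangement, not the model trained above, and `knn_pipe` is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_p, y_p = load_iris(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_p, y_p, test_size=0.2, random_state=42)

# The pipeline scales inside fit() and predict(), so callers always pass raw features
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pipe.fit(X_tr, y_tr)

print("train accuracy:", knn_pipe.score(X_tr, y_tr))
print("test accuracy: ", knn_pipe.score(X_te, y_te))
```

A single pipeline object can also be saved and loaded as one file, which removes the need to keep a model and a scaler in sync.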

## Model Evaluation

In [11]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = knn.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification report:\n", classification_report(y_test, y_pred))

Accuracy: 1.0
Classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
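
A confusion matrix shows the same evaluation as raw counts, which can make individual misclassifications easier to spot than averaged metrics. The sketch below retrains an equivalent classifier on the same split so that it runs on its own; `clf` and `cm` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_c, y_c = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_c, y_c, test_size=0.2, random_state=42)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, clf.predict(X_te))
print(cm)
```

With only 30 test samples, a single misclassified flower moves accuracy by more than 3 percentage points, so a perfect score here should be interpreted cautiously.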



## Model Tuning

In [13]:
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']}
grid = GridSearchCV(knn, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

In [14]:
grid.best_params_    # best parameter combination

{'n_neighbors': 3, 'weights': 'uniform'}

In [15]:
grid.best_score_     # best mean cross-validated accuracy

np.float64(0.9583333333333334)
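
The best score alone hides how close the other candidates were. The sketch below, which reruns an equivalent search so that it is self-contained, tabulates every combination from `cv_results_` and then scores the refitted best estimator on the held-out test set. The names `search` and `results` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_g, y_g = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_g, y_g, test_size=0.2, random_state=42)

param_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=5)
search.fit(X_tr, y_tr)

# One row per parameter combination, best mean CV accuracy first
results = pd.DataFrame(search.cv_results_)[
    ["param_n_neighbors", "param_weights", "mean_test_score", "std_test_score"]
].sort_values("mean_test_score", ascending=False)
print(results)

# With refit=True (the default), best_estimator_ is retrained on all of X_tr
print("held-out accuracy:", search.best_estimator_.score(X_te, y_te))
```

When several combinations score within one standard deviation of each other, the choice between them is probably not significant on a dataset this small.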

## Model Deployment

### Saving the Model

In [16]:
# prompt: save the knn and scaler models

import joblib

# Save the trained KNN model
joblib.dump(knn, 'knn_model.pkl')

# Save the fitted scaler
joblib.dump(scaler, 'scaler.pkl')

['scaler.pkl']
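
A saved model is only useful if it loads back and behaves the same way. The following sketch shows a round trip with `joblib.dump` and `joblib.load` on a separately trained classifier. It writes to a temporary directory rather than to the `knn_model.pkl` file above, and `model`, `restored`, and `same_predictions` are illustrative names.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X_j, y_j = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5).fit(X_j, y_j)

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "knn_model.pkl")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X_j), restored.predict(X_j))
print("Predictions match:", same_predictions)
```

Pickle-based files should be loaded only from trusted sources, and ideally with the same scikit-learn version that produced them.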

### Inference

In [None]:
# prompt: predict the class of the following sample data

import pandas as pd

data = {
    'sepal length (cm)': [5.1],
    'sepal width (cm)': [3.5],
    'petal length (cm)': [1.4],
    'petal width (cm)': [0.2]
}
virtual_df = pd.DataFrame(data, columns=data.keys())

# The KNN model above was fit on the unscaled X_train, so predict on the raw
# features; passing scaler.transform(virtual_df) would not match the training data
prediction = knn.predict(virtual_df)

# Get the class name from the iris dataset target names
predicted_class_name = iris.target_names[prediction[0]]

print("Predicted class:", predicted_class_name)
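
Beyond the predicted label, `predict_proba` gives a rough sense of how unanimous the neighbors were. The sketch below trains a separate classifier on the full dataset purely for illustration, so its numbers need not match the model saved above. The names `iris_demo`, `model`, `sample`, and `proba` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris_demo = load_iris(as_frame=True)
model = KNeighborsClassifier(n_neighbors=5).fit(iris_demo.data, iris_demo.target)

sample = pd.DataFrame({
    "sepal length (cm)": [5.1],
    "sepal width (cm)": [3.5],
    "petal length (cm)": [1.4],
    "petal width (cm)": [0.2],
})

# For KNN with uniform weights, each probability is the share of the
# 5 nearest neighbors that belong to that class
proba = model.predict_proba(sample)[0]
for name, p in zip(iris_demo.target_names, proba):
    print(f"{name}: {p:.2f}")
```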