# **Week 4: Colab Experiment**

# I. Introduction
In this exercise, we load the Breast Cancer Wisconsin dataset and compare three classifiers on it.

# II. Methods

In [18]:
from sklearn.datasets import load_breast_cancer
import pandas as pd
from collections import Counter
from datetime import datetime
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import zero_one_loss


In [19]:
# Define the dependent and independent variables.
data = load_breast_cancer()
Y = data.target
X = data.data

In [20]:
# Create CV folds
num_folds = 5
kf = KFold(n_splits=num_folds, random_state=0, shuffle=True)
kfold_indices = {}

for i, (train_index, test_index) in enumerate(kf.split(X)):
  kfold_indices[f"fold_{i}"] = {'train': train_index, 'test': test_index}
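
As a quick sanity check on the fold construction above, the following sketch confirms that the train and test indices of each fold are disjoint and that every sample lands in exactly one test fold. The names `test_sizes` and `times_in_test` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold

X = load_breast_cancer().data
kf = KFold(n_splits=5, random_state=0, shuffle=True)

test_sizes = []                              # size of each test fold
times_in_test = np.zeros(len(X), dtype=int)  # how often each sample is tested

for train_index, test_index in kf.split(X):
    # Train and test indices of one fold must not overlap.
    assert len(np.intersect1d(train_index, test_index)) == 0
    times_in_test[test_index] += 1
    test_sizes.append(len(test_index))

print("Test fold sizes:", test_sizes)
print("Each sample tested exactly once:", bool(np.all(times_in_test == 1)))
```

With 569 samples and 5 splits the test folds cannot all be the same size, so the per-fold error rates are computed on slightly different numbers of samples.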

In [21]:
# Train models and apply them to the test set

Error_rate = {'logreg': [], 'svm': [], 'decision_tree': []}

scaler = StandardScaler()  # Standardize features so that differing feature ranges do not distort regularization

for fold_id in range(num_folds):
  X_train = X[kfold_indices[f"fold_{fold_id}"]['train']]
  Y_train = Y[kfold_indices[f"fold_{fold_id}"]['train']]
  X_test = X[kfold_indices[f"fold_{fold_id}"]['test']]
  Y_test = Y[kfold_indices[f"fold_{fold_id}"]['test']]

  X_train_scaled = scaler.fit_transform(X_train)
  X_test_scaled = scaler.transform(X_test)

  # Logistic regression
  ######################## TODO #####################################
  logreg = LogisticRegression(max_iter=10000, solver='liblinear')

  param_grid_logreg = {
    'C': [0.01, 0.1, 1, 10, 100],  # inverse regularization strength
    'penalty': ['l1', 'l2']        # regularization type
    }

  grid_logreg = GridSearchCV(logreg, param_grid_logreg, cv=5, scoring='accuracy')
  grid_logreg.fit(X_train_scaled, Y_train)
  best_logreg = grid_logreg.best_estimator_
  Y_pred_logreg = best_logreg.predict(X_test_scaled)
  error_logreg = zero_one_loss(Y_test, Y_pred_logreg)
  Error_rate['logreg'].append(error_logreg)
  #####################################################################


  # SVM
  ######################## TODO #####################################
  svm = SVC(kernel='linear')

  param_grid_svm = {
    'C': [0.1, 1, 10, 100],          # penalty parameter
    'kernel': ['linear', 'rbf'],     # kernel type
    # 'gamma': ['scale', 'auto']        # kernel coefficient
  }
  grid_svm = GridSearchCV(svm, param_grid_svm, cv=5, scoring='accuracy')
  grid_svm.fit(X_train_scaled, Y_train)
  best_svm = grid_svm.best_estimator_
  Y_pred_svm = best_svm.predict(X_test_scaled)
  error_svm = zero_one_loss(Y_test, Y_pred_svm)
  Error_rate['svm'].append(error_svm)
  #####################################################################


  # Decision tree
  ######################## TODO #####################################
  dt = DecisionTreeClassifier(random_state=0)

  param_grid_dt = {
    'max_depth': [None, 5, 10, 20, 30],  # maximum tree depth
    'min_samples_split': [2, 5, 10],      # minimum samples required to split a node
    'min_samples_leaf': [1, 2, 4]         # minimum samples required at a leaf node
  }
  grid_dt = GridSearchCV(dt, param_grid_dt, cv=5, scoring='accuracy')
  grid_dt.fit(X_train_scaled, Y_train)
  best_dt = grid_dt.best_estimator_
  Y_pred_dt = best_dt.predict(X_test_scaled)
  error_dt = zero_one_loss(Y_test, Y_pred_dt)
  Error_rate['decision_tree'].append(error_dt)
  #####################################################################
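
One possible refinement of the loop above: because the scaler is fit on the whole outer training set before `GridSearchCV` runs, the inner validation splits are scaled with statistics that include their own samples. A minimal sketch of an alternative, assuming the scaler and classifier are wrapped in a `Pipeline` so that scaling is refit inside every inner split, is shown for logistic regression only. The step names `scaler` and `clf` and the list `errors` are illustrative choices, not part of the original notebook, and the resulting numbers may differ slightly from those reported in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, Y = load_breast_cancer(return_X_y=True)
outer_cv = KFold(n_splits=5, random_state=0, shuffle=True)

# Scaling is a pipeline step, so it is refit on each inner training split.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=10000, solver="liblinear")),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {
    "clf__C": [0.01, 0.1, 1, 10, 100],
    "clf__penalty": ["l1", "l2"],
}

errors = []
for train_index, test_index in outer_cv.split(X):
    search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
    search.fit(X[train_index], Y[train_index])
    # predict() uses the best pipeline refit on the full outer training set.
    Y_pred = search.predict(X[test_index])
    errors.append(zero_one_loss(Y[test_index], Y_pred))

print("Per-fold error rates:", errors)
```

The same pattern would apply to the SVM and decision tree by swapping the `clf` step and its parameter grid.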

# III. Results

Here we report the mean and standard deviation of the error rates over 5 folds for each method.

In [22]:
######################## TODO #####################################
print(f"The error rate over {num_folds} folds in CV:")

for model in Error_rate:
  avg_error = np.mean(Error_rate[model])
  std_error = np.std(Error_rate[model])
  print(f"{model}: mean = {avg_error}, std = {std_error}")
#####################################################################


The error rate over 5 folds in CV:
logreg: mean = 0.02105263157894739, std = 0.01312862240973312
svm: mean = 0.026315789473684202, std = 0.014678246079545192
decision_tree: mean = 0.059711224965067554, std = 0.014990903335492592


# IV. Conclusion and Discussion


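This assignment applies three models (logistic regression, SVM, and decision tree) and uses the error rate as the evaluation metric. The sections below discuss the choice of hyperparameters, the experimental results, and how they relate to the goal of the study.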

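**Effect of hyperparameter choices**

1. Logistic Regression

The regularization parameter C and the penalty type (L1 or L2) affect the model's ability to generalize and, with L1, which features are effectively selected, since L1 can shrink coefficients to exactly zero. A smaller C means stronger regularization, which helps prevent overfitting.

2. SVM

The penalty parameter C and the kernel type (linear or rbf) have a significant effect on model performance. C controls the trade-off between classifying the training points correctly and keeping the model simple, while the kernel determines whether the model can capture nonlinear structure in the data.

3. Decision Tree

The maximum depth (max_depth), the minimum number of samples required to split a node (min_samples_split), and the minimum number of samples at a leaf node (min_samples_leaf) control the complexity of the tree and directly affect overfitting and generalization.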
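
To make the effect of two of these hyperparameters concrete, here is a small sketch (an illustration, not part of the original experiment). It counts how many logistic regression coefficients remain nonzero under an L1 penalty as C varies, and reports the depth reached and the training accuracy of a decision tree as `max_depth` varies. For simplicity it fits on the full dataset, so the accuracies are training accuracies only and say nothing about generalization. The names `nonzero_counts` and `tree_stats` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, Y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# L1-penalized logistic regression: count coefficients that are not exactly zero.
nonzero_counts = {}
for C in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(penalty="l1", C=C, solver="liblinear", max_iter=10000)
    model.fit(X_scaled, Y)
    nonzero_counts[C] = int(np.count_nonzero(model.coef_))
    print(f"C={C}: {nonzero_counts[C]} of {X.shape[1]} coefficients are nonzero")

# Decision tree: depth actually reached and accuracy on the training data.
tree_stats = {}
for max_depth in [1, 2, 5, None]:
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0).fit(X, Y)
    tree_stats[max_depth] = (tree.get_depth(), tree.score(X, Y))
    depth, train_acc = tree_stats[max_depth]
    print(f"max_depth={max_depth}: depth={depth}, training accuracy={train_acc:.3f}")
```

A training accuracy that keeps rising with depth while the cross-validated error does not improve is the usual sign of the overfitting discussed for the decision tree.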

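**Analysis of experimental results**

*   Logistic Regression: mean error 0.021, standard deviation 0.013.
*   SVM: mean error 0.026, standard deviation 0.015.
*   Decision Tree: mean error 0.060, standard deviation 0.015.

Logistic regression performed best in this experiment, with the lowest mean error and a slightly smaller standard deviation, which suggests the most stable performance across folds. SVM came second: its mean error is slightly higher than that of logistic regression, but the result is still good, indicating solid classification ability. The decision tree has a clearly higher mean error. Its standard deviation is close to that of SVM, so the gap lies mainly in accuracy rather than in fold-to-fold variability, possibly because this model overfits more easily.

**Predictive performance and purpose**

The main goal of this assignment is to build effective models for classifying breast cancer, in order to improve the accuracy of early diagnosis and, in turn, treatment outcomes. Logistic regression and SVM achieve the lower mean errors, indicating that both methods classify breast cancer with high accuracy. The decision tree's higher mean error limits its usefulness in this application; it would need further hyperparameter tuning, or replacement by an ensemble such as Random Forests, before it is well suited to breast cancer classification.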
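
The Random Forests suggestion can be tried with a short sketch like the one below. It is an illustration under stated assumptions, not a result of this assignment: both models use default hyperparameters apart from `random_state`, no grid search is performed, and the dictionary `mean_errors` is an illustrative name. Neither model needs feature scaling, since tree splits do not depend on feature scale.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, Y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=5, random_state=0, shuffle=True)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

mean_errors = {}
for name, model in models.items():
    # cross_val_score returns one accuracy per fold; error rate = 1 - accuracy.
    accuracy = cross_val_score(model, X, Y, cv=cv, scoring="accuracy")
    mean_errors[name] = 1.0 - accuracy.mean()
    print(f"{name}: mean error = {mean_errors[name]:.3f}, std = {accuracy.std():.3f}")
```

A fair comparison with the numbers reported in the Results section would tune the forest with the same nested cross-validation used for the other models.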