# **Week 4: Colab Experiment**

# I. Introduction
In this exercise, we load the Breast Cancer Wisconsin dataset and compare three classifiers on it using cross-validation.

# II. Methods

1. **Dataset Loading**:
   The Breast Cancer Wisconsin dataset is loaded with `load_breast_cancer` from `sklearn.datasets`; features are stored in `X` and labels (targets) in `Y`.

2. **Cross-Validation Setup**:
   - The data is split into 5 cross-validation (CV) folds using `KFold` with `shuffle=True` and `random_state=0` for reproducibility.
   - Indices for training and testing data are stored for each fold in the `kfold_indices` dictionary.

3. **Model Evaluation Setup**:
   - Three models are evaluated: Logistic Regression, Support Vector Machine (SVM), and Decision Tree.
   - For each model, a pipeline is constructed where features are standardized using `StandardScaler`.
   - A hyperparameter search is performed using `GridSearchCV` for each model to find the best combination of hyperparameters on the training data for each fold.

4. **Model Training and Prediction**:
   - For each fold, the training and testing data are extracted using the saved indices.
   - The models (Logistic Regression, SVM, and Decision Tree) are trained on the training data using their respective pipelines.
   - The best model for each fold is selected via `GridSearchCV`, and predictions are made on the test data.

5. **Error Rate Calculation**:
   - After making predictions on the test data, the classification error rate is computed using `zero_one_loss`, which measures the fraction of misclassified samples.
   - The error rates for each fold are stored in the `Error_rate` dictionary for each model.

6. **Results Aggregation**:
   - The mean and standard deviation of the error rates over the 5 cross-validation folds are computed for each model.
   - These statistics are printed for each model (Logistic Regression, SVM, and Decision Tree), summarizing the model performance across the folds.
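
As a small illustration of step 5, the sketch below checks that `zero_one_loss` returns the fraction of misclassified samples, which is one minus the accuracy. The label arrays are invented for the example and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, zero_one_loss

# Toy labels, invented purely for illustration (not from the dataset).
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 1])

# Fraction of positions where the prediction disagrees with the truth.
manual_error = np.mean(y_true != y_pred)

# zero_one_loss normalizes by the number of samples by default,
# so it should return the same fraction.
sk_error = zero_one_loss(y_true, y_pred)

print(manual_error, sk_error, 1 - accuracy_score(y_true, y_pred))
```

Two of the eight toy predictions are wrong, so all three quantities come out to 0.25.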



In [1]:
from sklearn.datasets import load_breast_cancer
import pandas as pd
from collections import Counter
from datetime import datetime
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import zero_one_loss


In [2]:
# Define the dependent and independent variables.
data = load_breast_cancer()
Y = data.target
X = data.data


In [3]:
# Create CV folds
num_folds = 5
kf = KFold(n_splits=num_folds, random_state=0, shuffle=True)
kfold_indices = {}

for i, (train_index, test_index) in enumerate(kf.split(X)):
  kfold_indices[f"fold_{i}"] = {'train': train_index, 'test': test_index}

In [4]:
# Train models and apply them to the test set

Error_rate = {'logreg': [], 'svm': [], 'decision_tree': []}

for fold_id in range(num_folds):
  fold = kfold_indices[f"fold_{fold_id}"]             # Saved indices for this fold
  X_train, Y_train = X[fold['train']], Y[fold['train']]
  X_test, Y_test = X[fold['test']], Y[fold['test']]

  # Logistic Regression Model
  ###############################################################
  # Pipeline with StandardScaler and Logistic Regression
  logreg_pipeline = Pipeline([
      ('scaler', StandardScaler()),                   # Standardizes the data
      ('logistic', LogisticRegression(random_state=0)) # Logistic Regression classifier
  ])

  # Define hyperparameter grid for Logistic Regression
  logreg_param_grid = {
    'logistic__C': [0.001, 0.01, 0.1, 1, 10, 100],    # Inverse of regularization strength
    'logistic__penalty': ['l1', 'l2'],                # Regularization type (l1 or l2)
    'logistic__solver': ['liblinear', 'saga']         # Solver algorithm for optimization
  }

  # Perform Grid Search Cross Validation to find the best hyperparameters
  logreg_gs = GridSearchCV(logreg_pipeline, logreg_param_grid, cv=3)
  logreg_gs.fit(X_train, Y_train)                     # Train on training data
  Y_pred_log = logreg_gs.predict(X_test)              # Predict on test data

  # Append the error rate for Logistic Regression
  Error_rate['logreg'].append(zero_one_loss(Y_test, Y_pred_log))
  ###############################################################

  # SVM Model
  ###############################################################
  # Pipeline with StandardScaler and SVM classifier
  svm_pipeline = Pipeline([
      ('scaler', StandardScaler()),                   # Standardizes the data
      ('svm', SVC(random_state=0))                    # SVM classifier
  ])

  # Define hyperparameter grid for SVM
  svm_param_grid = {
    'svm__C': [0.01, 0.1, 1, 10, 100],               # Inverse of regularization strength
    'svm__kernel': ['linear', 'rbf', 'poly'],         # Type of kernel to use
    'svm__gamma': ['scale', 'auto', 0.001, 0.01, 0.1] # Kernel coefficient
  }

  # Perform Grid Search Cross Validation to find the best hyperparameters
  svm_gs = GridSearchCV(svm_pipeline, svm_param_grid, cv=5)
  svm_gs.fit(X_train, Y_train)                       # Train on training data
  Y_pred_svm = svm_gs.predict(X_test)                # Predict on test data

  # Append the error rate for SVM
  Error_rate['svm'].append(zero_one_loss(Y_test, Y_pred_svm))
  ###############################################################

  # Decision Tree Model
  ###############################################################
  # Pipeline with StandardScaler and Decision Tree Classifier
  tree_pipeline = Pipeline([
      ('scaler', StandardScaler()),                   # Standardizes the data
      ('tree', DecisionTreeClassifier(random_state=0)) # Decision Tree classifier
  ])

  # Define hyperparameter grid for Decision Tree
  tree_param_grid = {'tree__max_depth': [3, 5, 10, None]}  # Maximum depth of the tree

  # Perform Grid Search Cross Validation to find the best hyperparameters
  tree_gs = GridSearchCV(tree_pipeline, tree_param_grid, cv=3)
  tree_gs.fit(X_train, Y_train)                       # Train on training data
  Y_pred_tree = tree_gs.predict(X_test)               # Predict on test data

  # Append the error rate for Decision Tree
  Error_rate['decision_tree'].append(zero_one_loss(Y_test, Y_pred_tree))
  ###############################################################




I tried two different approaches for scaling in my machine learning workflow:

**First Approach**: I performed scaling manually before tuning and then passed the model to `GridSearchCV`.

**Second Approach**: I used a `Pipeline` to integrate both `StandardScaler` and the model into a single process.

The `Pipeline` implementation performed better in my runs. Some likely reasons:

- **Consistent Scaling**: Inside `GridSearchCV`, the `Pipeline` re-fits `StandardScaler` on the training portion of each inner split and applies that same transformation to the held-out portion. The held-out data therefore never influences the scaling statistics, which prevents data leakage and keeps the cross-validated scores trustworthy.

- **Simplified Code**: By integrating preprocessing and model training into one structure, the code becomes cleaner and reduces the potential for errors associated with manual scaling and hyperparameter specification.

- **Best Practices**: Using a `Pipeline` is considered a best practice in machine learning, as it streamlines the process, reduces risks of inconsistent preprocessing, and simplifies tuning across multiple stages (preprocessing and modeling).
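
The two approaches can be sketched side by side as below. This is a minimal illustration, not the code used for the reported results: the exact form of the manual approach is an assumption (the original code for it is not shown), and the single train/test split, the reduced `C` grid, and `max_iter=1000` are choices made only to keep the example short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
c_values = [0.01, 0.1, 1, 10]

# Approach 1 (assumed form): scale the whole training set once, then tune.
# Every inner validation fold has already contributed to the scaler's
# mean and variance, so model selection sees slightly leaked statistics.
scaler = StandardScaler().fit(X_train)
manual_gs = GridSearchCV(LogisticRegression(max_iter=1000), {"C": c_values}, cv=3)
manual_gs.fit(scaler.transform(X_train), y_train)
manual_acc = manual_gs.score(scaler.transform(X_test), y_test)

# Approach 2: scaler and model in one Pipeline.
# GridSearchCV clones the pipeline and re-fits the scaler on the
# training portion of each inner split only.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logistic", LogisticRegression(max_iter=1000)),
])
pipe_gs = GridSearchCV(pipe, {"logistic__C": c_values}, cv=3)
pipe_gs.fit(X_train, y_train)
pipe_acc = pipe_gs.score(X_test, y_test)

print(f"Manual scaling: best C = {manual_gs.best_params_['C']}, accuracy = {manual_acc:.4f}")
print(f"Pipeline:       best C = {pipe_gs.best_params_['logistic__C']}, accuracy = {pipe_acc:.4f}")
```

On a dataset of this size the two numbers may well be close; the point of the sketch is where the scaler is fitted, not a guaranteed accuracy gap.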


# III. Results

Here we report the mean and standard deviation of the error rates over 5 folds for each method.

In [5]:
# Report the mean and standard deviation of the error rate for each model.
# Note: np.std uses the population formula (ddof=0) by default.
print("The error rate over 5 folds in CV:")

# For Logistic Regression
logreg_mean = np.mean(Error_rate['logreg'])
logreg_std = np.std(Error_rate['logreg'])
print(f"Logistic Regression: Mean = {logreg_mean:.4f}, Std = {logreg_std:.4f}")

# For SVM
svm_mean = np.mean(Error_rate['svm'])
svm_std = np.std(Error_rate['svm'])
print(f"SVM: Mean = {svm_mean:.4f}, Std = {svm_std:.4f}")

# For Decision Tree
tree_mean = np.mean(Error_rate['decision_tree'])
tree_std = np.std(Error_rate['decision_tree'])
print(f"Decision Tree: Mean = {tree_mean:.4f}, Std = {tree_std:.4f}")


The error rate over 5 folds in CV:
Logistic Regression: Mean = 0.0193, Std = 0.0140
SVM: Mean = 0.0246, Std = 0.0129
Decision Tree: Mean = 0.0527, Std = 0.0146


# IV. Conclusion and Discussion


### Conclusion:
**Best Performing Model:**

Logistic Regression has the lowest mean error rate (0.0193), indicating it makes the fewest mistakes on average across the five folds of cross-validation.

**SVM Performance:**

SVM has a slightly higher mean error rate (0.0246) than Logistic Regression, so it does not outperform Logistic Regression here, although it still performs well. Its standard deviation (0.0129) is slightly lower than Logistic Regression's (0.0140), suggesting marginally more consistent performance across the folds.

**Decision Tree Performance:**

The Decision Tree has the highest mean error rate (0.0527), more than twice that of either other model. Its standard deviation (0.0146) is similar to theirs, so the gap comes from lower average accuracy rather than from greater fold-to-fold variability.
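
To put the gaps between models in context, the following sketch compares them with a rough standard error of each mean, using only the numbers reported in the Results section. It assumes the five folds can be treated as independent, which is only approximately true for k-fold cross-validation, so the output is a sanity check rather than a formal test.

```python
import math

n_folds = 5

# (mean, std) of the error rate, copied from the Results section.
results = {
    "Logistic Regression": (0.0193, 0.0140),
    "SVM": (0.0246, 0.0129),
    "Decision Tree": (0.0527, 0.0146),
}

# Rough standard error of each mean: std / sqrt(number of folds).
# Assumption: folds are treated as independent samples.
std_err = {name: std / math.sqrt(n_folds) for name, (mean, std) in results.items()}

# Gaps in mean error rate relative to Logistic Regression.
gap_svm = results["SVM"][0] - results["Logistic Regression"][0]
gap_tree = results["Decision Tree"][0] - results["Logistic Regression"][0]

for name, (mean, std) in results.items():
    print(f"{name}: mean = {mean:.4f}, approx. standard error = {std_err[name]:.4f}")
print(f"SVM minus Logistic Regression:           {gap_svm:.4f}")
print(f"Decision Tree minus Logistic Regression: {gap_tree:.4f}")
```

Under these assumptions, the SVM gap (about 0.005) is smaller than one standard error (about 0.006), so five folds do not firmly separate Logistic Regression from SVM. The Decision Tree gap (about 0.033) is several standard errors wide.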

---

### Some Extra Tests for Results:
I compare each configuration against a baseline, treating a lower mean and a lower standard deviation of the error rate as better. Most of the SVM settings I tested worsened at least one of the two, so I ran several experiments to evaluate the results.

**First:**

svm_param_grid = {  
'svm__C': [0.01, 0.1, 1, 10],  
'svm__kernel': ['linear', 'rbf'],  
}  
svm_gs = GridSearchCV(svm_pipeline, svm_param_grid, cv=3)

The result is:

SVM: Mean = 0.0228, Std = 0.0143  
The mean is lower than the baseline, but the standard deviation is higher.

Next, I set `cv` to 5 while keeping the same `svm_param_grid`.

The results:  

SVM: Mean = 0.0263, Std = 0.0147  
This result is worse on both mean and standard deviation.

**Second:**

I expanded the parameter grid to add `C = 100`, the `poly` kernel, and several `gamma` values:

svm_param_grid = {  
'svm__C': [0.01, 0.1, 1, 10, 100],  
'svm__kernel': ['linear', 'rbf', 'poly'],  
'svm__gamma': ['scale', 'auto', 0.001, 0.01, 0.1]  
}  
svm_gs = GridSearchCV(svm_pipeline, svm_param_grid, cv=3)

SVM: Mean = 0.0246, Std = 0.0140  
Compared with the first grid at `cv=3`, the mean is worse (0.0246 vs. 0.0228), but the standard deviation improves slightly (0.0140 vs. 0.0143).

Set `cv` to 5:  
SVM: Mean = 0.0246, Std = 0.0129  
This is the configuration used in the main experiment. It gives the lowest standard deviation of the SVM settings tried, although not the lowest mean.

Set `cv` to 10:  
SVM: Mean = 0.0211, Std = 0.0153  
Here, the mean is lower, but the standard deviation is higher.


**Third:**

I standardized all three `GridSearchCV` implementations to `cv=5`. The error rates over 5 folds in CV are as follows:

Logistic Regression: Mean = 0.0211, Std = 0.0131  
SVM: Mean = 0.0246, Std = 0.0129  
Decision Tree: Mean = 0.0598, Std = 0.0129  

Across all of these experiments, neither SVM nor the Decision Tree outperformed Logistic Regression on mean error rate.
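
Experiments like these can be rerun in a loop rather than by editing the cell by hand. The sketch below is one possible way to do that; `outer_cv_error` is a hypothetical helper introduced here, and the grid shown is the smaller one from the first experiment. Exact numbers may differ across scikit-learn versions, so none are asserted.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, Y = load_breast_cancer(return_X_y=True)


def outer_cv_error(param_grid, inner_cv, num_folds=5):
    """Hypothetical helper: mean and std of the outer-fold error rate for one SVM setup."""
    kf = KFold(n_splits=num_folds, shuffle=True, random_state=0)
    errors = []
    for train_idx, test_idx in kf.split(X):
        pipeline = Pipeline([
            ("scaler", StandardScaler()),
            ("svm", SVC(random_state=0)),
        ])
        # Inner grid search sees only the outer training fold.
        gs = GridSearchCV(pipeline, param_grid, cv=inner_cv)
        gs.fit(X[train_idx], Y[train_idx])
        errors.append(zero_one_loss(Y[test_idx], gs.predict(X[test_idx])))
    return np.mean(errors), np.std(errors)


small_grid = {"svm__C": [0.01, 0.1, 1, 10], "svm__kernel": ["linear", "rbf"]}

# Sweep the inner cv setting while holding the grid and outer folds fixed.
sweep = {}
for inner_cv in (3, 5):
    sweep[inner_cv] = outer_cv_error(small_grid, inner_cv)
    mean, std = sweep[inner_cv]
    print(f"cv={inner_cv}: Mean = {mean:.4f}, Std = {std:.4f}")
```

Keeping the outer `KFold` fixed (`random_state=0`) means every configuration is scored on the same test folds, so differences reflect the inner search settings rather than a different split.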

---
### Discussion:

**Pros and Cons of the Evaluated Models:**

#### 1. **Logistic Regression**

**Pros:**
- **Simplicity and Interpretability:** Logistic Regression is easy to implement and interpret, making it a great choice for understanding the relationship between features and the target variable.
- **Performance:** It achieved the lowest mean error rate (0.0193), indicating strong predictive accuracy on the dataset.

**Cons:**
- **Linearity Assumption:** Logistic Regression assumes a linear relationship between the independent variables and the log-odds of the dependent variable, which may not hold true in all cases.
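
As an illustration of the interpretability point, the sketch below fits a logistic regression on standardized features and lists the largest coefficients. It is fitted on the full dataset with default regularization purely for illustration (an assumption of this example); it is not the tuned model from the experiment.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("logistic", LogisticRegression(max_iter=1000, random_state=0)),
])
pipeline.fit(data.data, data.target)

# One coefficient per feature; with standardized inputs, each is the change
# in log-odds of class 1 per one-standard-deviation increase in the feature.
coefs = pipeline.named_steps["logistic"].coef_.ravel()

# Rank features by absolute coefficient, largest first.
order = np.argsort(np.abs(coefs))[::-1]
top_features = [(data.feature_names[i], coefs[i]) for i in order[:5]]

print(f"Class 1 is '{data.target_names[1]}'")
for name, coef in top_features:
    print(f"{name:25s} {coef:+.3f}")
```

Because many features in this dataset are strongly correlated, individual coefficients should be read with some caution.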

#### 2. **SVM (Support Vector Machine)**

**Pros:**
- **Flexibility with Kernels:** SVM can handle non-linear relationships using kernel functions, allowing it to model complex data distributions.

**Cons:**

- **Parameter Sensitivity:** The performance of SVM heavily relies on the choice of hyperparameters (C, kernel type, gamma), requiring careful tuning.

#### 3. **Decision Tree**

**Pros:**

- **Handling Non-linearity:** Decision Trees can capture non-linear relationships without requiring data transformations.

**Cons:**
- **Instability:** Small changes in the data can lead to different tree structures, resulting in model instability.
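
One way to check for this instability on the present dataset is to fit an unconstrained tree on each training fold and compare simple structural summaries. The sketch below does that; it uses untuned trees for brevity and assumes no particular outcome.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, Y = data.data, data.target

kf = KFold(n_splits=5, shuffle=True, random_state=0)

tree_summaries = []
for train_idx, _ in kf.split(X):
    tree = DecisionTreeClassifier(random_state=0).fit(X[train_idx], Y[train_idx])
    # Node 0 is the root; tree_.feature holds the split feature index per node.
    root_feature = data.feature_names[tree.tree_.feature[0]]
    tree_summaries.append((root_feature, tree.get_depth(), tree.get_n_leaves()))

for i, (root, depth, leaves) in enumerate(tree_summaries):
    print(f"fold_{i}: root split on '{root}', depth = {depth}, leaves = {leaves}")
```

If the root feature, depth, or leaf count changes between folds, that is the instability described above: the training sets overlap heavily, yet the fitted trees differ.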

---

### Proposal for Next Steps

1. **Model Optimization:**
   - Conduct further hyperparameter tuning on SVM and Decision Tree models to identify optimal settings that may improve their performance.

2. **Cross-Validation Strategies:**
   - Implement different cross-validation strategies (e.g., stratified k-fold) to ensure that the training and testing sets represent the overall data distribution accurately.
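
A minimal sketch of the stratified option is shown below. It compares the class balance of the test folds produced by `KFold` and `StratifiedKFold`; the helper `positive_fraction_per_fold` is a hypothetical name introduced for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, StratifiedKFold

X, Y = load_breast_cancer(return_X_y=True)


def positive_fraction_per_fold(splitter):
    """Hypothetical helper: fraction of class-1 samples in each test fold."""
    return [Y[test_idx].mean() for _, test_idx in splitter.split(X, Y)]


plain = positive_fraction_per_fold(
    KFold(n_splits=5, shuffle=True, random_state=0)
)
stratified = positive_fraction_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)

print(f"Overall class-1 fraction: {Y.mean():.3f}")
print("KFold:          ", np.round(plain, 3))
print("StratifiedKFold:", np.round(stratified, 3))
```

`StratifiedKFold` keeps each fold's class ratio close to the overall ratio by construction. To use it in the experiment, the `KFold` object in the fold-creation cell would be swapped for `StratifiedKFold` and `kf.split(X)` changed to `kf.split(X, Y)`.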