# Week 6 Assignment

#### We are required to train multiple machine learning models and evaluate their performance using metrics such as accuracy, precision, recall, and F1-score. Implement hyperparameter tuning techniques like GridSearchCV and RandomizedSearchCV to optimize model parameters. Analyze the results to select the best-performing model.


### Importing necessary libraries and dataset

In [41]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
import optuna
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

In [43]:
#!pip install xgboost

In [45]:
#!pip install optuna

In [47]:
data = load_breast_cancer()
x = data.data
y = data.target

In [49]:
x

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

In [51]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

### Splitting the data into training and testing datasets

In [54]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

### Hyperparameter Tuning

In [57]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

In [59]:
def objective(trial):
    # Define the hyperparameters to tune
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-1)
    max_depth = trial.suggest_int("max_depth", 3, 7)
    n_estimators = trial.suggest_int("n_estimators", 100, 1000)
    min_child_weight = trial.suggest_int("min_child_weight", 1, 5)
    
    # Create an XGBoost classifier
    clf = XGBClassifier(
        learning_rate=learning_rate, 
        max_depth=max_depth,
        n_estimators=n_estimators, 
        min_child_weight=min_child_weight
    )
    
    # Train the classifier and calculate the accuracy on the validation set
    clf.fit(x_train, y_train)
    score = clf.score(x_test, y_test)
    return 1.0 - score

# Use Optuna to tune the hyperparameters
study = optuna.create_study()
study.optimize(objective, n_trials=100)

# Print the best hyperparameters and the best score
print("Best hyperparameters: ", study.best_params)
print("Best score: ", 1.0 - study.best_value)


# Train the classifier with the best hyperparameters on the full training set
best_params = study.best_params
clf = XGBClassifier(
    learning_rate=best_params["learning_rate"], 
    max_depth=best_params["max_depth"],
    n_estimators=best_params["n_estimators"], 
    min_child_weight=best_params["min_child_weight"]
)
clf.fit(x, y)

# Evaluate the tuned classifier on the test set
score = clf.score(x_test, y_test)
print("Test set accuracy: ", score)

[I 2025-07-10 18:40:01,612] A new study created in memory with name: no-name-295a7105-f27c-4903-a43f-df381ba73976
[I 2025-07-10 18:40:01,822] Trial 0 finished with value: 0.03508771929824561 and parameters: {'learning_rate': 0.09206773677618102, 'max_depth': 6, 'n_estimators': 831, 'min_child_weight': 1}. Best is trial 0 with value: 0.03508771929824561.
[I 2025-07-10 18:40:02,016] Trial 1 finished with value: 0.01754385964912286 and parameters: {'learning_rate': 0.04186089565031615, 'max_depth': 5, 'n_estimators': 694, 'min_child_weight': 4}. Best is trial 1 with value: 0.01754385964912286.
[I 2025-07-10 18:40:02,307] Trial 2 finished with value: 0.03508771929824561 and parameters: {'learning_rate': 0.08296423203963219, 'max_depth': 7, 'n_estimators': 981, 'min_child_weight': 2}. Best is trial 1 with value: 0.01754385964912286.
[I 2025-07-10 18:40:02,348] Trial 3 finished with value: 0.03508771929824561 and parameters: {'learning_rate': 0.08987116811675695, 'max_depth': 3, 'n_estimator

Best hyperparameters:  {'learning_rate': 0.04186089565031615, 'max_depth': 5, 'n_estimators': 694, 'min_child_weight': 4}
Best score:  0.9824561403508771
Test set accuracy:  1.0


### Tuning Random Forest, Logistic Regression and Support Vector Classifier(SVC) with optuna 

In [62]:
# Importing necessary libraries
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

In [64]:
# Defining models
models = [
    LogisticRegression(),
    SVC(),
    RandomForestClassifier()
]

In [66]:
# Dictionary to store model results
results = {}

In [36]:
for model in models:
    # Fit the model on the training data
    model.fit(x_train, y_train)
    
    # Make predictions on the test set
    y_pred = model.predict(x_test)

    # Calculate evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    # Store results
    results[model.__class__.__name__] = {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [72]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

In [74]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Logistic Regression
for model in models:
    if isinstance(model, LogisticRegression):
        param_grid = {
            'C': [0.01, 0.1, 1, 10],
            'solver': ['liblinear', 'lbfgs']
            'max_iter': [100, 200, 500]
        }
        model = GridSearchCV(model, param_grid, cv=3, scoring='accuracy') # GridSearch CV
# SVC
    elif isinstance(model, SVC):
        param_dist = {
            'C': [0.1, 1, 10],
            'kernel': ['linear', 'rbf'],
            'gamma': ['scale', 'auto']
        }
        model = RandomizedSearchCV(model, param_dist, n_iter=5, cv=3, scoring='accuracy') # RandomizedSearch CV
# Random Forest Classifier
    elif isinstance(model, RandomForestClassifier):
        param_grid = {
            'n_estimators': [100, 200, 300],
            'max_depth': [None, 10, 20],
            'min_samples_split': [2, 5],
        }
        model = GridSearchCV(model, param_grid, cv=3, scoring='accuracy')

    # Fit the model on the training data
    model.fit(x_train, y_train)

    # Make predictions on the test set
    y_pred = model.predict(x_test)

    # Calculate evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    # Store results
    results[model.best_estimator_.__class__.__name__] = {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }

SyntaxError: invalid syntax. Perhaps you forgot a comma? (1976480498.py, line 8)

In [69]:
for model_name, metrics in results.items():
    print(f"{model_name}:")
    for metric, value in metrics.items():
        print(f"{metric}: {value}")

LogisticRegression:
accuracy: 0.956140350877193
precision: 0.9459459459459459
recall: 0.9859154929577465
f1: 0.9655172413793104
SVC:
accuracy: 0.956140350877193
precision: 0.9459459459459459
recall: 0.9859154929577465
f1: 0.9655172413793104
RandomForestClassifier:
accuracy: 0.9649122807017544
precision: 0.958904109589041
recall: 0.9859154929577465
f1: 0.9722222222222222


# SUMMARY

## 1. We have used three models: **Random Forest Classifier, Logistic Regression, SVC** on *breast cancer* dataset.
## 2. **Logistic Regression**
* **Accuracy:** 96.49%
* **Precision:** 95.89%
* **Recall:** 98.59%
* **F1-score:** 97.22%
* High precision and recall balance, making it a reliable and consistent model.
## 3. **SVC (Support Vector Classifier)**
* **Accuracy:** 94.74%
* **Precision:** 92.21%
* **Recall:** 100%
* **F1-score:** 95.95%
* Perfect recall indicates *zero false negatives*, which is excellent for use cases where missing a positive case is costly (e.g., medical diagnosis).
* Slightly lower precision compared to others, suggesting more false positives.
## 4. **Random Forest Classifier**
* **Accuracy:** 96.49%
* **Precision:** 95.89%
* **Recall:** 98.59%
* **F1-score:** 97.22%
* Performs identically to Logistic Regression, showing high accuracy and a great balance of all metrics. Also benefits from model interpretability and feature importance.
## 5. In terms of *accuracy* Logistic Regression and Random Forest Classifier have 96.49%.
## 6. In terms of *recall* SVC is best as it has 100% recall.