# Week 6 Assignment

#### The task is to train multiple machine learning models and evaluate their performance using accuracy, precision, recall, and F1-score. Hyperparameters are to be tuned with techniques such as GridSearchCV and RandomizedSearchCV (this notebook uses Optuna for the tuning step). The results are then analyzed to select the best-performing model.


### Importing necessary libraries and dataset

In [61]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
import optuna
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

In [3]:
!pip install xgboost

Collecting xgboost
  Downloading xgboost-3.0.2-py3-none-win_amd64.whl.metadata (2.1 kB)
Downloading xgboost-3.0.2-py3-none-win_amd64.whl (150.0 MB)

In [12]:
!pip install optuna

Collecting optuna
  Downloading optuna-4.4.0-py3-none-any.whl.metadata (17 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.16.3-py3-none-any.whl.metadata (7.3 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.9.0-py3-none-any.whl.metadata (10 kB)
Collecting Mako (from alembic>=1.5.0->optuna)
  Downloading mako-1.3.10-py3-none-any.whl.metadata (2.9 kB)
Collecting typing-extensions>=4.12 (from alembic>=1.5.0->optuna)
  Downloading typing_extensions-4.14.1-py3-none-any.whl.metadata (3.0 kB)
Downloading optuna-4.4.0-py3-none-any.whl (395 kB)



In [20]:
data = load_breast_cancer()
x = data.data
y = data.target

In [24]:
x

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

In [26]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

### Splitting the data into training and testing datasets

In [6]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
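
The split above is purely random. As a hedged alternative, the sketch below passes `stratify=y` so that both splits keep roughly the same class ratio as the full dataset. The variable names ending in `_ratio` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

x, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps the share of each class (nearly) equal in both splits
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)

# Share of label 1 in the full data and in each split
full_ratio = y.mean()
train_ratio = y_train.mean()
test_ratio = y_test.mean()
print(f"label-1 share: full={full_ratio:.3f}, train={train_ratio:.3f}, test={test_ratio:.3f}")
```

Stratification matters most for small or imbalanced datasets, where a random split can shift the class balance of the test set.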

### Hyperparameter Tuning

In [68]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
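
`GridSearchCV` and `RandomizedSearchCV` are imported here but not used later in the notebook. The following is a minimal sketch of how they could be applied, assuming a scaled Logistic Regression pipeline; the search ranges for `C`, the `f1` scoring choice, and the 5-fold setting are illustrative assumptions, not values from the assignment.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Exhaustive search over a small, hand-picked grid (illustrative values)
grid = GridSearchCV(
    pipe,
    param_grid={"logisticregression__C": [0.01, 0.1, 1, 10]},
    scoring="f1",
    cv=5,
)
grid.fit(x_train, y_train)

# Random sampling of C from a log-uniform distribution (illustrative range)
rand = RandomizedSearchCV(
    pipe,
    param_distributions={"logisticregression__C": loguniform(1e-3, 1e2)},
    n_iter=10,
    scoring="f1",
    cv=5,
    random_state=42,
)
rand.fit(x_train, y_train)

print("Grid best:", grid.best_params_, "CV F1:", round(grid.best_score_, 4))
print("Random best:", rand.best_params_, "CV F1:", round(rand.best_score_, 4))

# The test split is used once, after the search has finished
print("Grid test F1:", round(grid.score(x_test, y_test), 4))
```

Both searches cross-validate on the training split only, so the test split stays untouched until the final line.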

In [41]:
def objective(trial):
    # Define the hyperparameters to tune
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-1)
    max_depth = trial.suggest_int("max_depth", 3, 7)
    n_estimators = trial.suggest_int("n_estimators", 100, 1000)
    min_child_weight = trial.suggest_int("min_child_weight", 1, 5)
    
    # Create an XGBoost classifier
    clf = XGBClassifier(
        learning_rate=learning_rate, 
        max_depth=max_depth,
        n_estimators=n_estimators, 
        min_child_weight=min_child_weight
    )
    
    # Train the classifier and calculate the accuracy on the test split.
    # Note: scoring trials on x_test leaks test information into the chosen hyperparameters.
    clf.fit(x_train, y_train)
    score = clf.score(x_test, y_test)
    return 1.0 - score

# Use Optuna to tune the hyperparameters
study = optuna.create_study()
study.optimize(objective, n_trials=100)

# Print the best hyperparameters and the best score
print("Best hyperparameters: ", study.best_params)
print("Best score: ", 1.0 - study.best_value)


# Refit the classifier with the best hyperparameters.
# Note: this fits on the full dataset (x, y), which includes the test rows.
best_params = study.best_params
clf = XGBClassifier(
    learning_rate=best_params["learning_rate"], 
    max_depth=best_params["max_depth"],
    n_estimators=best_params["n_estimators"], 
    min_child_weight=best_params["min_child_weight"]
)
clf.fit(x, y)

# Evaluate the tuned classifier on the test set (optimistic: these rows were seen during the refit)
score = clf.score(x_test, y_test)
print("Test set accuracy: ", score)

[I 2025-07-10 00:23:40,818] A new study created in memory with name: no-name-5987fef8-95d5-451e-bcb7-54d1139ddf36
[I 2025-07-10 00:23:41,343] Trial 0 finished with value: 0.03508771929824561 and parameters: {'learning_rate': 0.07064015466539039, 'max_depth': 6, 'n_estimators': 475, 'min_child_weight': 2}. Best is trial 0 with value: 0.03508771929824561.
[I 2025-07-10 00:23:41,491] Trial 1 finished with value: 0.03508771929824561 and parameters: {'learning_rate': 0.0839839444119306, 'max_depth': 7, 'n_estimators': 385, 'min_child_weight': 3}. Best is trial 0 with value: 0.03508771929824561.
[I 2025-07-10 00:23:41,659] Trial 2 finished with value: 0.01754385964912286 and parameters: {'learning_rate': 0.04953623673353179, 'max_depth': 3, 'n_estimators': 476, 'min_child_weight': 4}. Best is trial 2 with value: 0.01754385964912286.
[I 2025-07-10 00:23:41,757] Trial 3 finished with value: 0.04385964912280704 and parameters: {'learning_rate': 0.041293245324970786, 'max_depth': 5, 'n_estimator

Best hyperparameters:  {'learning_rate': 0.04953623673353179, 'max_depth': 3, 'n_estimators': 476, 'min_child_weight': 4}
Best score:  0.9824561403508771
Test set accuracy:  1.0
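
The 1.0 test accuracy above comes from a model fitted on `x, y`, which contains the test rows. The sketch below illustrates that effect and one possible remedy. It swaps in a `DecisionTreeClassifier` for XGBoost purely to keep the example dependency-free, so the numbers it prints are not comparable to the XGBoost results; the names `honest`, `leaky`, and `cv_error` are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Honest evaluation: the test rows are never seen during fitting
honest = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
honest_acc = honest.score(x_test, y_test)

# Leaky evaluation: fitted on all rows, including the test rows
leaky = DecisionTreeClassifier(random_state=0).fit(x, y)
leaky_acc = leaky.score(x_test, y_test)

print("honest test accuracy:", honest_acc)
print("leaky test accuracy: ", leaky_acc)

# A tuning objective can avoid the test split entirely by
# returning a cross-validated error computed on the training split
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), x_train, y_train, cv=5)
cv_error = 1.0 - cv_scores.mean()
print("5-fold CV error on training split:", cv_error)
```

An Optuna objective could return a quantity like `cv_error` instead of `1.0 - clf.score(x_test, y_test)`, leaving the test split for a single final check.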


### Training Random Forest, Logistic Regression and Support Vector Classifier (SVC) with default hyperparameters

In [53]:
# Importing necessary libraries
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

In [55]:
# Defining models
models = [
    LogisticRegression(),
    SVC(),
    RandomForestClassifier()
]

In [59]:
# Dictionary to store model results
results = {}

In [63]:
for model in models:
    # Fit the model on the training data
    model.fit(x_train, y_train)
    
    # Make predictions on the test set
    y_pred = model.predict(x_test)

    # Calculate evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    # Store results
    results[model.__class__.__name__] = {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
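
The warning above suggests scaling the data or raising `max_iter`. A minimal sketch of both remedies follows, assuming a `StandardScaler` in front of `LogisticRegression`; the `max_iter=1000` value is an illustrative choice, and the warning-recording code exists only to show whether the message reappears.

```python
import warnings

from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# The scaler is fitted on the training split only, then applied to the test split
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Record any warnings raised during fitting instead of printing them
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    pipe.fit(x_train, y_train)

n_convergence = sum(issubclass(w.category, ConvergenceWarning) for w in caught)
scaled_acc = pipe.score(x_test, y_test)

print("convergence warnings:", n_convergence)
print("test accuracy with scaling:", scaled_acc)
```

SVC is also sensitive to feature scale, so the same pipeline pattern may be worth trying there.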


In [65]:
for model_name, metrics in results.items():
    print(f"{model_name}:")
    for metric, value in metrics.items():
        print(f"{metric}: {value}")

LogisticRegression:
accuracy: 0.9649122807017544
precision: 0.958904109589041
recall: 0.9859154929577465
f1: 0.9722222222222222
SVC:
accuracy: 0.9473684210526315
precision: 0.922077922077922
recall: 1.0
f1: 0.9594594594594594
RandomForestClassifier:
accuracy: 0.9649122807017544
precision: 0.958904109589041
recall: 0.9859154929577465
f1: 0.9722222222222222
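
`confusion_matrix` is imported at the top of the notebook but never called. The sketch below shows how it could be used to see which kind of error each metric counts, using the default `SVC` as an example. In `load_breast_cancer`, label 0 is malignant and label 1 is benign, so the default precision and recall above treat *benign* as the positive class.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Position in target_names corresponds to the integer label (0, then 1)
print("label names:", list(data.target_names))

y_pred = SVC().fit(x_train, y_train).predict(x_test)

# Rows are true labels, columns are predicted labels, both ordered [0, 1]
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print("malignant predicted as benign (false positives):", fp)
print("benign predicted as malignant (false negatives):", fn)
```

To score malignant as the positive class instead, `precision_score` and `recall_score` accept `pos_label=0`.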


# SUMMARY

## 1. We compared three models with default hyperparameters, **Random Forest Classifier, Logistic Regression, SVC**, on the *breast cancer* dataset. An XGBoost classifier was tuned separately with Optuna.
## 2. **Logistic Regression**
* **Accuracy:** 96.49%
* **Precision:** 95.89%
* **Recall:** 98.59%
* **F1-score:** 97.22%
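* Precision and recall are well balanced, making it a consistent model on this split. The fit raised a convergence warning, so these numbers may change once the features are scaled.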
## 3. **SVC (Support Vector Classifier)**
* **Accuracy:** 94.74%
* **Precision:** 92.21%
* **Recall:** 100%
* **F1-score:** 95.95%
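* Perfect recall means *zero false negatives* for the positive class. In this dataset the positive label (1) is *benign*, so no benign case was flagged as malignant.
* Precision is slightly lower than for the other models, meaning more false positives. Here a false positive is a malignant case predicted as benign, which is the costlier error in medical diagnosis.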
## 4. **Random Forest Classifier**
* **Accuracy:** 96.49%
* **Precision:** 95.89%
* **Recall:** 98.59%
* **F1-score:** 97.22%
* Matches Logistic Regression on every metric for this split. Random Forest additionally exposes feature importances (`feature_importances_`), which helps with interpretation.
## 5. In terms of *accuracy*, Logistic Regression and Random Forest Classifier tie at 96.49%.
## 6. In terms of *recall*, SVC is best with 100%.