# Model Comparison and A/B Testing

In this notebook, we will:
- Compare the performance of Logistic Regression, SVM, and Random Forest on a classification task.
- Evaluate the models using accuracy, classification reports, and confusion matrices.
- Perform hyperparameter tuning to improve model performance.
- Conduct A/B testing between the tuned and untuned versions of each model to assess whether hyperparameter tuning leads to significantly better performance.


In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from scipy import stats

In [2]:
# Step 1: Load and Explore the Data
# -----------------------------------

# pandas (already imported above as pd) is used for data manipulation and exploration

# Load the dataset
# Replace 'data.csv' with the actual path to your dataset
file_path = 'data.csv'  # Example file path, update as needed
data = pd.read_csv(file_path)

# Display basic information about the dataset
# This helps us understand the structure, data types, and other metadata
print("Dataset Information:")
data.info()  # Prints columns, data types, and non-null counts (info() prints directly and returns None)

# Display the first few rows of the dataset
# It gives a quick look at the data to understand its structure and content
print("\nFirst 5 Rows of the Dataset:")
print(data.head())  # Display the first 5 rows to inspect the data

# Check for missing values
# This helps us identify if there are any missing values in the dataset
# It is important to address missing data before proceeding with analysis
print("\nMissing Values in the Dataset:")
print(data.isnull().sum())  # Count missing values per column

# Display basic statistics for numerical columns
# This will give us an overview of the distributions and ranges of numerical values in the dataset
print("\nSummary Statistics:")
print(data.describe())  # Display basic statistical measures (mean, std, min, max, etc.)

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perim

### Step 2: Clean the Data

In this step, we'll clean the dataset by removing unnecessary columns. We'll drop columns like 'id' and 'Unnamed: 32' as they don't provide meaningful information for our analysis. After cleaning, we'll display the updated dataset's basic information and show the first few rows to confirm the cleaning process was successful.


In [3]:
# Step 2: Clean the Data
# -----------------------

# Drop unnecessary columns
# We will remove 'id' and 'Unnamed: 32' columns since they don't provide meaningful information for our analysis
data_cleaned = data.drop(['id', 'Unnamed: 32'], axis=1)

# Display the updated dataset information
# After removing the unnecessary columns, we can check the new structure of the dataset
print("Updated Dataset Information:")
data_cleaned.info()  # Shows the updated structure, confirming the drop (info() prints directly and returns None)

# Display the first few rows of the cleaned dataset
# It's helpful to inspect the cleaned data to make sure the columns have been correctly removed
print("\nFirst 5 Rows of the Cleaned Dataset:")
print(data_cleaned.head())  # Preview the first 5 rows of the cleaned data

Updated Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   diagnosis                569 non-null    object 
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 1
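
The cleaning step above names the columns to drop by hand. As a minimal sketch, assuming 'Unnamed: 32' is an entirely empty column (a common artifact of a trailing delimiter in a CSV file), the empty columns can also be detected programmatically. The small `demo` frame and its values are made up for illustration and are not the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame that mimics the structure described above
demo = pd.DataFrame({
    'id': [1, 2, 3],
    'diagnosis': ['M', 'B', 'B'],
    'radius_mean': [17.99, 11.42, 12.30],
    'Unnamed: 32': [np.nan, np.nan, np.nan],
})

# Find columns in which every value is missing
empty_cols = demo.columns[demo.isna().all()].tolist()

# Drop the identifier column together with the all-empty columns
demo_cleaned = demo.drop(columns=['id'] + empty_cols)

print("Dropped empty columns:", empty_cols)
print("Remaining columns:", list(demo_cleaned.columns))
```

This keeps the cleaning step working if the file contains a differently named empty column, at the cost of being less explicit about what is removed.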

### Step 3: Encode the Target Variable

In this step, we'll encode the target variable 'diagnosis', which represents the class of the cancer (Benign or Malignant), into numerical values. We'll map 'B' to 0 (Benign) and 'M' to 1 (Malignant). After encoding, we'll verify the changes by checking the unique values in the 'diagnosis' column and displaying the first few rows of the dataset.


In [4]:
# Step 3: Encode the Target Variable
# -----------------------

# Map the diagnosis column to numerical values
# 'B' (Benign) will be mapped to 0, and 'M' (Malignant) will be mapped to 1
data_cleaned['diagnosis'] = data_cleaned['diagnosis'].map({'B': 0, 'M': 1})

# Check the unique values to confirm encoding
# We can inspect the unique values in the 'diagnosis' column to ensure the mapping was done correctly
print("Unique values in the 'diagnosis' column after encoding:")
print(data_cleaned['diagnosis'].unique())

# Display the first few rows to verify changes
# After encoding, let's preview the dataset to check the changes in the 'diagnosis' column
print("\nFirst 5 Rows After Encoding:")
print(data_cleaned.head())

Unique values in the 'diagnosis' column after encoding:
[1 0]

First 5 Rows After Encoding:
   diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0          1        17.99         10.38          122.80     1001.0   
1          1        20.57         17.77          132.90     1326.0   
2          1        19.69         21.25          130.00     1203.0   
3          1        11.42         20.38           77.58      386.1   
4          1        20.29         14.34          135.10     1297.0   

   smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0          0.11840           0.27760          0.3001              0.14710   
1          0.08474           0.07864          0.0869              0.07017   
2          0.10960           0.15990          0.1974              0.12790   
3          0.14250           0.28390          0.2414              0.10520   
4          0.10030           0.13280          0.1980              0.10430   

   symmetry_mean  ...  r
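
One caveat of encoding with `Series.map` and a dictionary is that any label missing from the dictionary silently becomes `NaN`. The sketch below shows one way to guard against that. The `labels` series, including the deliberately malformed `'m'`, is invented for illustration.

```python
import pandas as pd

# Hypothetical labels; 'm' is a deliberately malformed entry
labels = pd.Series(['M', 'B', 'B', 'm', 'M'])
mapping = {'B': 0, 'M': 1}

# Values not present in the mapping are turned into NaN by map()
encoded = labels.map(mapping)

# Collect the original labels that could not be encoded
unmapped = labels[encoded.isna()]
print("Unmapped labels:", unmapped.tolist())

# Only cast to integers once every label has been mapped
if unmapped.empty:
    encoded = encoded.astype(int)
```

In the notebook, the printed unique values `[1 0]` already confirm that no `NaN` was produced, so this check matters mainly when the same code is reused on other data.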

### Step 4: Split the Dataset into Training and Testing Sets

Now, we will split the dataset into training and testing sets so that model performance can be evaluated on unseen data. We'll separate the features (X) from the target variable (y), and then use `train_test_split` from scikit-learn to create the two sets. We will hold out 20% of the data for testing, set a random seed for reproducibility, and stratify on the target so that both sets keep the same class proportions. Finally, we'll check the shape of the training and testing datasets.


In [5]:
# Step 4: Split the Dataset into Training and Testing Sets
# -----------------------

# Import train_test_split for splitting the data
from sklearn.model_selection import train_test_split

# Separate features (X) and target variable (y)
# The target variable 'diagnosis' is separated from the features
X = data_cleaned.drop('diagnosis', axis=1)
y = data_cleaned['diagnosis']

# Split the data into training and testing sets
# We use 20% of the data for testing, and the rest (80%) for training
# The stratify parameter ensures that the target variable's distribution is maintained in both training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Check the shape of the training and testing sets
# We print the shape of features and labels for both the training and testing sets
print("Shape of Training Features:", X_train.shape)
print("Shape of Testing Features:", X_test.shape)
print("Shape of Training Labels:", y_train.shape)
print("Shape of Testing Labels:", y_test.shape)

Shape of Training Features: (455, 30)
Shape of Testing Features: (114, 30)
Shape of Training Labels: (455,)
Shape of Testing Labels: (114,)
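
The shapes confirm the 80/20 split, but not the effect of `stratify`. A quick way to check it is to compare the positive-class rate in each subset. The sketch below uses synthetic labels with an assumed class balance (357 negatives, 212 positives) rather than the notebook's data, and the names `X_demo`, `y_demo`, `y_tr`, and `y_te` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with an assumed class balance (illustrative only)
y_demo = np.array([0] * 357 + [1] * 212)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# Same split settings as in the cell above
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, the positive-class rate should be nearly identical everywhere
overall_rate = y_demo.mean()
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"Positive rate: overall={overall_rate:.3f}, train={train_rate:.3f}, test={test_rate:.3f}")
```

Running the same comparison on `y`, `y_train`, and `y_test` from the notebook would verify the real split.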


### Step 6: Scale the Features

In this step, we will scale the features using `StandardScaler` from scikit-learn. Feature scaling matters for algorithms that rely on distance metrics or gradient-based optimization, such as SVM and Logistic Regression. The scaler is fit on the training set only and then applied to the test set, so that no test-set statistics leak into training. We will then re-train the Logistic Regression model on the scaled data and display its accuracy, classification report, and confusion matrix.


In [7]:
# Step 6: Scale the Features
# ---------------------------

# Import StandardScaler from scikit-learn to scale the features
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data only, then apply the same transformation to the test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize the Logistic Regression model and re-train it on the scaled data
log_reg = LogisticRegression(max_iter=1000, random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Make predictions on the scaled test data
y_pred_scaled = log_reg.predict(X_test_scaled)

# Evaluate the model performance on the scaled data

# Calculate the accuracy score of the model on the scaled test set
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print("Logistic Regression Model Accuracy (Scaled):", accuracy_scaled)

# Display the classification report for the scaled model
# The classification report provides precision, recall, and f1-score for each class
print("\nClassification Report (Scaled):")
print(classification_report(y_test, y_pred_scaled))

# Display the confusion matrix for the scaled model
# The confusion matrix helps us understand the true positives, false positives, true negatives, and false negatives
print("\nConfusion Matrix (Scaled):")
print(confusion_matrix(y_test, y_pred_scaled))

Logistic Regression Model Accuracy (Scaled): 0.9649122807017544

Classification Report (Scaled):
              precision    recall  f1-score   support

           0       0.96      0.99      0.97        72
           1       0.97      0.93      0.95        42

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114


Confusion Matrix (Scaled):
[[71  1]
 [ 3 39]]
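
The figures in the classification report can be recomputed by hand from the confusion matrix, which helps when reading the two outputs side by side. The sketch below copies the matrix printed above and relies on scikit-learn's layout, where rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = true class (0, 1), columns = predicted class (0, 1)
cm = np.array([[71, 1],
               [3, 39]])

# For binary labels the flattened order is TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()       # correct predictions over all samples
precision = tp / (tp + fp)            # how many predicted malignant cases are malignant
recall = tp / (tp + fn)               # how many malignant cases were detected
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f}, precision={precision:.4f}, recall={recall:.4f}, f1={f1:.4f}")
```

The values for class 1 agree with the report after rounding. The 3 false negatives are malignant cases predicted as benign, which is the costlier error in a diagnostic setting.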


### Step 7: Train the Support Vector Machine (SVM) and Random Forest Models

In this step, we will train two additional machine learning models: Support Vector Machine (SVM) and Random Forest. Both models will be trained on the scaled features. Random Forest does not need scaling, since tree splits are unaffected by it, but using the same inputs keeps the comparison consistent. We will compute the test accuracy of each model and compare it with the Logistic Regression model. The results will be stored in a dictionary for easy comparison.


In [8]:
# Step 7: Train the Support Vector Machine (SVM) and Random Forest Models
# ------------------------------------------------------------------------

# Import SVC for Support Vector Machine and RandomForestClassifier for Random Forest
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train the Support Vector Machine (SVM)
svm = SVC(random_state=42)  # Initialize the SVM model
svm.fit(X_train_scaled, y_train)  # Train the SVM model
y_pred_svm = svm.predict(X_test_scaled)  # Make predictions on the test set
accuracy_svm = accuracy_score(y_test, y_pred_svm)  # Evaluate the accuracy of the SVM model

# Train the Random Forest model
rf = RandomForestClassifier(random_state=42)  # Initialize the Random Forest model
rf.fit(X_train_scaled, y_train)  # Train the Random Forest model
y_pred_rf = rf.predict(X_test_scaled)  # Make predictions on the test set
accuracy_rf = accuracy_score(y_test, y_pred_rf)  # Evaluate the accuracy of the Random Forest model

# Store the results in a dictionary for easy comparison
model_accuracies = {
    'Logistic Regression': accuracy_scaled,
    'Support Vector Machine': accuracy_svm,
    'Random Forest': accuracy_rf
}

# Print the accuracies for each model
print("Model Accuracies:", model_accuracies)

Model Accuracies: {'Logistic Regression': 0.9649122807017544, 'Support Vector Machine': 0.9736842105263158, 'Random Forest': 0.9736842105263158}


### Step 8: Perform Paired t-test to Compare Model Performance

In this step, we will use 5-fold cross-validation on the training set to generate multiple accuracy scores for each model (Logistic Regression, SVM, and Random Forest) and then perform a **paired t-test** to statistically compare them. The test is paired because every model is scored on the same folds, so the scores can be compared fold by fold. This will help us determine whether there is a significant difference in their accuracies.

We will:
1. Use **cross-validation** to obtain multiple accuracy scores for each model.
2. Perform the **paired t-test** to compare the models.
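3. Determine whether the differences are statistically significant based on the p-value (by convention, a p-value < 0.05 is treated as significant). With only five folds the test has limited power, so a non-significant result does not show that the models perform equally well.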
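
To make step 2 concrete, the paired t-statistic is the mean of the per-fold differences divided by its standard error, with n - 1 degrees of freedom. The sketch below computes it by hand and compares it with `scipy.stats.ttest_rel`. The fold scores in `scores_a` and `scores_b` are hypothetical values chosen for illustration, not results from this notebook.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies for two models (illustrative values only)
scores_a = np.array([0.96, 0.97, 0.98, 0.95, 0.97])
scores_b = np.array([0.96, 0.97, 0.98, 0.95, 0.98])

# Per-fold differences are what the paired test actually analyzes
diff = scores_a - scores_b
n = len(diff)

# t = mean(diff) / (sample std of diff / sqrt(n))
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# SciPy's implementation should agree with the manual calculation
t_scipy, p_scipy = stats.ttest_rel(scores_a, scores_b)
print(f"manual: t={t_manual:.4f}, p={p_manual:.4f}")
print(f"scipy:  t={t_scipy:.4f}, p={p_scipy:.4f}")
```

When two models differ on only one fold, as in this example, the differences carry very little information and the p-value stays far above 0.05.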

Now, let's proceed with the cross-validation and t-test.



In [10]:
# Step 8: Perform Paired t-test to Compare Model Performance

from sklearn.model_selection import cross_val_score
from scipy.stats import ttest_rel

# Perform cross-validation to get multiple accuracy scores for each model
log_reg_scores = cross_val_score(log_reg, X_train_scaled, y_train, cv=5, scoring='accuracy')
svm_scores = cross_val_score(svm, X_train_scaled, y_train, cv=5, scoring='accuracy')
rf_scores = cross_val_score(rf, X_train_scaled, y_train, cv=5, scoring='accuracy')

# Perform paired t-test between Logistic Regression and SVM
t_stat_lr_svm, p_value_lr_svm = ttest_rel(log_reg_scores, svm_scores)

# Perform paired t-test between Logistic Regression and Random Forest
t_stat_lr_rf, p_value_lr_rf = ttest_rel(log_reg_scores, rf_scores)

# Print results for both comparisons
print(f"Paired t-test between Logistic Regression and SVM: T-statistic = {t_stat_lr_svm}, P-value = {p_value_lr_svm}")
print(f"Paired t-test between Logistic Regression and Random Forest: T-statistic = {t_stat_lr_rf}, P-value = {p_value_lr_rf}")

# Check significance (commonly p-value < 0.05 is considered significant)
if p_value_lr_svm < 0.05:
    print("The difference in performance between Logistic Regression and SVM is statistically significant.")
else:
    print("The difference in performance between Logistic Regression and SVM is not statistically significant.")

if p_value_lr_rf < 0.05:
    print("The difference in performance between Logistic Regression and Random Forest is statistically significant.")
else:
    print("The difference in performance between Logistic Regression and Random Forest is not statistically significant.")


Paired t-test between Logistic Regression and SVM: T-statistic = -1.0, P-value = 0.373900966300059
Paired t-test between Logistic Regression and Random Forest: T-statistic = 0.6064784348631235, P-value = 0.5769327973042664
The difference in performance between Logistic Regression and SVM is not statistically significant.
The difference in performance between Logistic Regression and Random Forest is not statistically significant.
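
One refinement worth considering: the cross-validation above runs on `X_train_scaled`, which was scaled using statistics from the whole training set, so each validation fold has already influenced the scaler. Wrapping the scaler and the model in a pipeline refits the scaler inside every fold. The sketch below is a minimal illustration on synthetic data from `make_classification`. The names `X_demo`, `y_demo`, `lr_pipe`, and `svm_pipe` are assumptions for this example, and the explicit `StratifiedKFold` object is one way to make the shared folds visible.

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# One splitter with a fixed seed, so both models are scored on identical folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each pipeline refits the scaler on the training part of every fold
lr_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=42))
svm_pipe = make_pipeline(StandardScaler(), SVC(random_state=42))

lr_cv = cross_val_score(lr_pipe, X_demo, y_demo, cv=cv, scoring='accuracy')
svm_cv = cross_val_score(svm_pipe, X_demo, y_demo, cv=cv, scoring='accuracy')

# Paired t-test on the fold-wise scores (returns nan if all differences are zero)
t_stat, p_value = ttest_rel(lr_cv, svm_cv)
print("LR fold scores: ", lr_cv)
print("SVM fold scores:", svm_cv)
print(f"t={t_stat}, p={p_value}")
```

In the notebook, the same pattern would take the unscaled `X_train` and `y_train` as inputs. On a dataset this small the effect of the leakage is probably minor, but the pipeline version removes the question entirely.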
