# Comprehensive Machine Learning Models Comparison
## Breast Cancer Classification with Model Persistence

This notebook provides a comprehensive comparison of multiple machine learning algorithms for breast cancer classification using the Wisconsin Breast Cancer dataset. It includes:

- **Shared Data Processing Pipeline**: Consistent data preprocessing for all models
- **Multiple ML Algorithms**: Logistic Regression, KNN, SVM, Naive Bayes, Decision Tree, Random Forest
- **Model Persistence**: Save/Load functionality for trained models
- **Comprehensive Evaluation**: Performance metrics, visualizations, and cross-validation
- **Hyperparameter Tuning**: Optimization for best performing models
- **Feature Analysis**: Importance analysis and model interpretability

**Maintainable Code Structure**: All models use the same preprocessing pipeline and evaluation framework to ensure consistency and reduce code duplication.

## 1. Import Required Libraries

In [None]:
# Test import each module individually to find the exact issue
print("Testing imports step by step...")
print("=" * 40)

try:
    from data_processor import load_and_explore_data, preprocess_data
    print("✅ data_processor imported successfully")
except Exception as e:
    print(f"❌ data_processor Error: {e}")

try:
    from model_trainer import train_and_evaluate_model
    print("✅ train_and_evaluate_model imported successfully")
except Exception as e:
    print(f"❌ train_and_evaluate_model Error: {e}")

try:
    from model_trainer import analyze_feature_importance
    print("✅ analyze_feature_importance imported successfully")
except Exception as e:
    print(f"❌ analyze_feature_importance Error: {e}")

try:
    from model_trainer import optimize_knn_k_values
    print("✅ optimize_knn_k_values imported successfully")
    # Test alias
    optimize_knn_k = optimize_knn_k_values
    print("✅ optimize_knn_k alias created successfully")
except Exception as e:
    print(f"❌ optimize_knn_k_values Error: {e}")

# Check what's actually in the module
try:
    import model_trainer
    available_functions = [name for name in dir(model_trainer) if not name.startswith('_')]
    print(f"\n📋 Available functions in model_trainer: {available_functions}")
except Exception as e:
    print(f"❌ Cannot inspect model_trainer: {e}")

print("\n" + "=" * 40)

Testing imports step by step...
✅ data_processor imported successfully
✅ train_and_evaluate_model imported successfully
✅ analyze_feature_importance imported successfully
❌ optimize_knn_k Error: cannot import name 'optimize_knn_k' from 'model_trainer' (e:\ML_BreastCancer_Wisonsin_Original\ML_BreastCancerWisconsin_Prediction\Codes\model_trainer.py)

📋 Available functions in model_trainer: ['accuracy_score', 'analyze_feature_importance', 'analyze_logistic_coefficients', 'classification_report', 'confusion_matrix', 'datetime', 'f1_score', 'np', 'optimize_knn_k_values', 'pd', 'precision_score', 'recall_score', 'roc_auc_score', 'roc_curve', 'train_and_evaluate_model']



In [14]:
# Import our custom modules
from data_processor import load_and_explore_data, preprocess_data
from model_trainer import train_and_evaluate_model, analyze_feature_importance, optimize_knn_k_values as optimize_knn_k
from visualizer import (plot_confusion_matrix, plot_decision_boundary, plot_feature_importance,
                        plot_knn_analysis, plot_svm_comparison, plot_tree_models_comparison)
from model_persistence import save_model, load_model, save_all_models, list_saved_models, load_model_by_name
from model_comparison import (create_comparison_dataframe, plot_comprehensive_comparison,
                            generate_model_summary_report, create_performance_radar_chart)

# Basic libraries
import pandas as pd
import numpy as np
import warnings
from datetime import datetime
warnings.filterwarnings('ignore')

# Machine Learning Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for better visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("All libraries imported successfully!")
print(f"Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("=" * 50)

All libraries imported successfully!
Timestamp: 2025-07-16 09:15:01
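The step-by-step import check at the start of this section repeats the same `try`/`except` pattern for each name. A minimal sketch of the same idea as a loop is shown below. The helper `check_imports` is hypothetical (it is not part of `model_trainer` or any other project module), and standard-library names stand in for the project modules so the cell runs anywhere.

```python
import importlib


def check_imports(requirements):
    """Try to resolve each (module, attribute) pair and report the outcome.

    Returns a dict mapping "module.attribute" to True (found) or False.
    """
    status = {}
    for module_name, attr_name in requirements:
        key = f"{module_name}.{attr_name}"
        try:
            module = importlib.import_module(module_name)
            getattr(module, attr_name)  # raises AttributeError if missing
            status[key] = True
            print(f"OK      {key}")
        except (ImportError, AttributeError) as exc:
            status[key] = False
            print(f"FAILED  {key}: {exc}")
    return status


# Stand-in pairs; in this notebook they would be pairs such as
# ("model_trainer", "optimize_knn_k_values").
status = check_imports([
    ("json", "dumps"),
    ("math", "sqrt"),
    ("math", "no_such_function"),  # deliberately missing
])
```

Catching only `ImportError` and `AttributeError` keeps unrelated failures visible instead of hiding them behind a broad `except Exception`.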


## 2. Data Loading and Exploration

In [None]:
# Load and explore data using our module
dataset, dataset_info = load_and_explore_data('../Dataset/breast_cancer_wisconsin.csv')

print("✅ Data loaded and explored successfully!")
print(f"Dataset shape: {dataset.shape}")
print(f"Features: {dataset_info['n_features']}")
print(f"Samples: {dataset_info['n_samples']}")

Dataset Information:
Dataset shape: (699, 10)
Number of features: 8
Number of samples: 699

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Clump_thickness              699 non-null    int64  
 1   Uniformity_of_cell_size      699 non-null    int64  
 2   Uniformity_of_cell_shape     699 non-null    int64  
 3   Marginal_adhesion            699 non-null    int64  
 4   Single_epithelial_cell_size  699 non-null    int64  
 5   Bare_nuclei                  683 non-null    float64
 6   Bland_chromatin              699 non-null    int64  
 7   Normal_nucleoli              699 non-null    int64  
 8   Mitoses                      699 non-null    int64  
 9   Class                        699 non-null    int64  
dtypes: float64(1), int64(9)
memory usage: 54.7 KB

First 5 rows:


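| | Clump_thickness | Uniformity_of_cell_size | Uniformity_of_cell_shape | Marginal_adhesion | Single_epithelial_cell_size | Bare_nuclei | Bland_chromatin | Normal_nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 1 | 1 | 1 | 2 | 1.0 | 3 | 1 | 1 | 2 |
| 1 | 5 | 4 | 4 | 5 | 7 | 10.0 | 3 | 2 | 1 | 2 |
| 2 | 3 | 1 | 1 | 1 | 2 | 2.0 | 3 | 1 | 1 | 2 |
| 3 | 6 | 8 | 8 | 1 | 3 | 4.0 | 3 | 7 | 1 | 2 |
| 4 | 4 | 1 | 1 | 3 | 2 | 1.0 | 3 | 1 | 1 | 2 |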



Dataset Description:


| | Clump_thickness | Uniformity_of_cell_size | Uniformity_of_cell_shape | Marginal_adhesion | Single_epithelial_cell_size | Bare_nuclei | Bland_chromatin | Normal_nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 683.0 | 699.0 | 699.0 | 699.0 | 699.0 |
| mean | 4.41774 | 3.134478 | 3.207439 | 2.806867 | 3.216023 | 3.544656 | 3.437768 | 2.866953 | 1.589413 | 2.689557 |
| std | 2.815741 | 3.051459 | 2.971913 | 2.855379 | 2.2143 | 3.643857 | 2.438364 | 3.053634 | 1.715078 | 0.951273 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| 25% | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 |
| 50% | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 75% | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 6.0 | 5.0 | 4.0 | 1.0 | 4.0 |
| max | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 4.0 |



Null Values Check:
Clump_thickness                 0
Uniformity_of_cell_size         0
Uniformity_of_cell_shape        0
Marginal_adhesion               0
Single_epithelial_cell_size     0
Bare_nuclei                    16
Bland_chromatin                 0
Normal_nucleoli                 0
Mitoses                         0
Class                           0
dtype: int64
Total null values: 16

Target Variable Distribution:
Class
2    458
4    241
Name: count, dtype: int64
Class distribution: [458 241]
Balance ratio: 0.526

Feature names (8 features):
 1. Uniformity_of_cell_size
 2. Uniformity_of_cell_shape
 3. Marginal_adhesion
 4. Single_epithelial_cell_size
 5. Bare_nuclei
 6. Bland_chromatin
 7. Normal_nucleoli
 8. Mitoses

Data loading and exploration completed successfully!
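The exploration output reports 16 missing `Bare_nuclei` values, class counts of 458 and 241, and a balance ratio of 0.526. The sketch below shows how such figures can be computed with pandas. It assumes the balance ratio is minority count divided by majority count (241 / 458 does round to 0.526), and it runs on a tiny made-up frame rather than the real CSV; how `load_and_explore_data` computes these values internally is not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame with two of the real column names; values are made up.
toy = pd.DataFrame({
    "Bare_nuclei": [1.0, 10.0, np.nan, 4.0, 1.0, np.nan],
    "Class": [2, 2, 2, 4, 2, 4],
})

null_counts = toy.isnull().sum()            # missing values per column
class_counts = toy["Class"].value_counts()  # samples per class label

# Assumed definition: minority class count / majority class count
balance_ratio = class_counts.min() / class_counts.max()

# One simple way to handle missing values: drop incomplete rows
complete = toy.dropna()

# The same ratio recomputed from the counts reported in the output above
reported_ratio = round(241 / 458, 3)

print(null_counts)
print(f"Balance ratio (toy data): {balance_ratio:.3f}")
print(f"Rows before/after dropna: {len(toy)} -> {len(complete)}")
```

Dropping the 16 incomplete rows from 699 samples leaves 683, which matches the feature matrix shape reported in the preprocessing step.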


## 3. Data Preprocessing Pipeline

In [None]:
# Preprocess data using our module
X_train, X_test, y_train, y_test, feature_names = preprocess_data(dataset)

print("✅ Data preprocessing completed successfully!")
print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
print(f"Number of features: {len(feature_names)}")

# Store for later use
dataset_info = {
    'feature_names': feature_names,
    'n_features': len(feature_names),
    'n_samples': len(dataset)
}

Number of null data after processing:
__________________________
Clump_thickness                0
Uniformity_of_cell_size        0
Uniformity_of_cell_shape       0
Marginal_adhesion              0
Single_epithelial_cell_size    0
Bare_nuclei                    0
Bland_chromatin                0
Normal_nucleoli                0
Mitoses                        0
Class                          0
dtype: int64

Declaring features and dependent variables...
On the features, remove the 'Sample code number' because it is not relevant to the prediction
Features (X) shape: (683, 8)
Target (y) shape: (683,)

Splitting the dataset into Training set and Test set...
DataSet Splitting:
_______________________________
X_train:  4368
X_test:  1096
y_train: 546
y_test 137

Feature Scaling...
Feature Scaling Applied Successfully!
_____________________________________
X_train shape: (546, 8)
X_test shape: (137, 8)

Training set - First 5 samples after scaling:
array([[-0.69781134, -0.74152574, -0.63363747,
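The internals of `preprocess_data` are not shown here, so the following is only a sketch of a split-then-scale pipeline that reproduces the reported shapes. The 80/20 split, `random_state=0`, and `StandardScaler` are assumptions, and the data is random stand-in data rather than the real dataset. Two observations from the output above are worth checking in the module: the `X_train:  4368` line is consistent with an element count (546 × 8) rather than a row count, and the 8-feature list omits `Clump_thickness`, which suggests the first column may be dropped as if it were the sample ID.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Stand-in data: 683 complete rows, 8 features scored 1-10, labels 2 / 4.
X = rng.integers(1, 11, size=(683, 8)).astype(float)
y = rng.choice([2, 4], size=683)

# Assumed split parameters; 20% of 683 rows rounds up to 137 test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse training mean and std

print("Rows:", X_train.shape[0], X_test.shape[0])
print("Elements:", X_train.size, X_test.size)
```

Fitting the scaler on the training rows only keeps test-set statistics out of the model, which is the usual way to avoid leakage during scaling.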

## 4. Model Training and Evaluation Framework

In [None]:
# Initialize results storage
model_results = {}
all_models = {}

print("✅ Model training framework ready!")
print("Using functions from model_trainer.py:")
print("- train_and_evaluate_model(): Train and evaluate any model")
print("Using functions from visualizer.py:")  
print("- plot_confusion_matrix(): Visualize confusion matrix")
print("- plot_decision_boundary(): Show decision boundary")
print("Storage:")
print("- model_results: Dictionary to store all results")
print("- all_models: Dictionary to store trained models")
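Later cells read the keys `test_accuracy`, `f1_score`, `training_time`, and `overfitting` from the dictionary returned by `train_and_evaluate_model()`. The sketch below shows one way such a function could be written. It is a hypothetical stand-in, not the code in `model_trainer.py`: in particular, defining `overfitting` as train accuracy minus test accuracy and treating class 4 as the positive label are assumptions. The data is synthetic, relabelled to 2 / 4 to mirror the `Class` column.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split


def train_and_evaluate_sketch(model, model_name, X_train, X_test,
                              y_train, y_test, pos_label=4):
    """Hypothetical stand-in for train_and_evaluate_model()."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    training_time = time.perf_counter() - start

    train_acc = accuracy_score(y_train, model.predict(X_train))
    y_pred = model.predict(X_test)
    test_acc = accuracy_score(y_test, y_pred)

    return {
        "model_name": model_name,
        "train_accuracy": train_acc,
        "test_accuracy": test_acc,
        # Labels are 2 / 4, so the positive class must be named explicitly
        "f1_score": f1_score(y_test, y_pred, pos_label=pos_label),
        "training_time": training_time,
        "overfitting": train_acc - test_acc,  # assumed definition
    }


# Synthetic data with labels mapped from 0/1 to 2/4
X, y01 = make_classification(n_samples=300, n_features=8, random_state=0)
y = 2 + 2 * y01
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

results = train_and_evaluate_sketch(
    LogisticRegression(random_state=0), "Logistic Regression",
    X_tr, X_te, y_tr, y_te,
)
print({k: v for k, v in results.items() if k != "model_name"})
```

Without `pos_label=4`, `f1_score` would look for a positive class labelled 1 and raise an error on 2 / 4 labels.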

## 5. Logistic Regression Implementation

In [5]:
# Create and train Logistic Regression model
logistic_model = LogisticRegression(random_state=0)
lr_results = train_and_evaluate_model(
    logistic_model, "Logistic Regression", 
    X_train, X_test, y_train, y_test
)

# Store results
model_results['Logistic Regression'] = lr_results
all_models['Logistic Regression'] = logistic_model

NameError: name 'train_and_evaluate_model' is not defined

In [None]:
# Visualize Logistic Regression results using our modules
plot_confusion_matrix(lr_results, figsize=(6, 5))
plot_decision_boundary(logistic_model, "Logistic Regression", X_train, y_train, feature_names)

In [None]:
# Analyze feature importance using our module
feature_importance_lr = analyze_feature_importance(logistic_model, feature_names, 'Logistic Regression')

# Plot feature importance using our module
plot_feature_importance(feature_importance_lr, 'Logistic Regression', figsize=(12, 6))

print("\nLogistic Regression Summary:")
print(f"✓ Test Accuracy: {lr_results['test_accuracy']:.4f}")
print(f"✓ F1-Score: {lr_results['f1_score']:.4f}")
print(f"✓ Training Time: {lr_results['training_time']:.4f}s")
print(f"✓ Overfitting: {lr_results['overfitting']:.4f}")

## 6. K-Nearest Neighbors Implementation

In [None]:
# Create and train KNN model
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_results = train_and_evaluate_model(
    knn_model, "K-Nearest Neighbors", 
    X_train, X_test, y_train, y_test
)

# Store results
model_results['KNN'] = knn_results
all_models['KNN'] = knn_model

In [None]:
# Visualize KNN results using our modules
plot_confusion_matrix(knn_results, figsize=(6, 5))
plot_decision_boundary(knn_model, "K-Nearest Neighbors", X_train, y_train, feature_names)

In [None]:
# KNN K-value optimization using our module
k_results, optimal_k = optimize_knn_k(X_train, X_test, y_train, y_test)

In [None]:
# Plot KNN analysis using our module
plot_knn_analysis(k_results, knn_results, X_train, y_train, optimal_k)

print("\nKNN Summary:")
print(f"✓ Test Accuracy: {knn_results['test_accuracy']:.4f}")
print(f"✓ F1-Score: {knn_results['f1_score']:.4f}")
print(f"✓ Training Time: {knn_results['training_time']:.4f}s")
print(f"✓ Overfitting: {knn_results['overfitting']:.4f}")
print(f"✓ Optimal K: {optimal_k}")
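`optimize_knn_k` receives the test split as well as the training split, and its internals are not shown. If K is chosen by test accuracy, the reported test score for that K will tend to be optimistic. A minimal sketch of an alternative is to pick K by cross-validation on the training data only, as below. The K range, the 5 folds, and the synthetic data are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training split
X_train, y_train = make_classification(
    n_samples=300, n_features=8, random_state=0
)

k_values = range(1, 22, 2)  # odd K avoids tied votes with two classes
cv_scores = {}
for k in k_values:
    scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5
    )
    cv_scores[k] = scores.mean()  # mean accuracy over the 5 folds

best_k = max(cv_scores, key=cv_scores.get)
print(f"Best K by 5-fold CV: {best_k} (accuracy {cv_scores[best_k]:.4f})")
```

The test split is then used once, after K is fixed, to report the final score.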

## 7. Support Vector Machine Implementation

In [None]:
# Linear SVM
svm_linear = SVC(kernel='linear', random_state=0)
svm_linear_results = train_and_evaluate_model(
    svm_linear, "SVM (Linear)", 
    X_train, X_test, y_train, y_test
)

# Store results
model_results['SVM Linear'] = svm_linear_results
all_models['SVM Linear'] = svm_linear

In [None]:
# RBF SVM
svm_rbf = SVC(kernel='rbf', random_state=0)
svm_rbf_results = train_and_evaluate_model(
    svm_rbf, "SVM (RBF)", 
    X_train, X_test, y_train, y_test
)

# Store results
model_results['SVM RBF'] = svm_rbf_results
all_models['SVM RBF'] = svm_rbf

In [None]:
# SVM Models Comparison Visualization using our module
plot_svm_comparison(svm_linear_results, svm_rbf_results)

In [None]:
# SVM Decision Boundaries using our module  
plot_decision_boundary(svm_linear, "SVM (Linear)", X_train, y_train, feature_names)
plot_decision_boundary(svm_rbf, "SVM (RBF)", X_train, y_train, feature_names)

print(f"\nSVM Comparison Summary:")
print("=" * 30)
print(f"Linear SVM - Test Accuracy: {svm_linear_results['test_accuracy']:.4f}")
print(f"RBF SVM    - Test Accuracy: {svm_rbf_results['test_accuracy']:.4f}")
print(f"Linear SVM - Training Time: {svm_linear_results['training_time']:.4f}s")
print(f"RBF SVM    - Training Time: {svm_rbf_results['training_time']:.4f}s")
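Both SVMs above use default hyperparameters, so the linear versus RBF comparison reflects those defaults as much as the kernels themselves. A hedged sketch of tuning both kernels with `GridSearchCV` follows. The grid values, fold count, and synthetic data are illustrative assumptions, not settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training split
X_train, y_train = make_classification(
    n_samples=300, n_features=8, random_state=0
)

# Separate sub-grids, because gamma only applies to the RBF kernel
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.1]},
]

search = GridSearchCV(SVC(random_state=0), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

After fitting, `search.best_estimator_` is refit on the full training split and could be passed to the same evaluation function as the other models.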

## 8. Decision Tree, Random Forest, and Naive Bayes Implementation

In [None]:
# Decision Tree
dt_model = DecisionTreeClassifier(criterion='entropy', random_state=0)
dt_results = train_and_evaluate_model(
    dt_model, "Decision Tree", 
    X_train, X_test, y_train, y_test
)

# Store results
model_results['Decision Tree'] = dt_results
all_models['Decision Tree'] = dt_model

In [None]:
# Random Forest
rf_model = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=0)
rf_results = train_and_evaluate_model(
    rf_model, "Random Forest", 
    X_train, X_test, y_train, y_test
)

# Store results
model_results['Random Forest'] = rf_results
all_models['Random Forest'] = rf_model

In [None]:
# Naive Bayes
nb_model = GaussianNB()
nb_results = train_and_evaluate_model(
    nb_model, "Naive Bayes", 
    X_train, X_test, y_train, y_test
)

# Store results
model_results['Naive Bayes'] = nb_results
all_models['Naive Bayes'] = nb_model

In [None]:
# Feature Importance Analysis for Tree-based Models using our module
dt_importance = analyze_feature_importance(dt_model, feature_names, 'Decision Tree')
rf_importance = analyze_feature_importance(rf_model, feature_names, 'Random Forest')

In [None]:
# Tree Models Visualization using our module
plot_tree_models_comparison(dt_importance, rf_importance, dt_results, rf_results)

print("\nTree-based Models and Naive Bayes Summary:")
print("=" * 42)
print(f"Decision Tree - Test Accuracy: {dt_results['test_accuracy']:.4f}")
print(f"Random Forest - Test Accuracy: {rf_results['test_accuracy']:.4f}")
print(f"Naive Bayes   - Test Accuracy: {nb_results['test_accuracy']:.4f}")
print(f"Decision Tree - Overfitting: {dt_results['overfitting']:.4f}")
print(f"Random Forest - Overfitting: {rf_results['overfitting']:.4f}")
print(f"Naive Bayes   - Overfitting: {nb_results['overfitting']:.4f}")
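For tree-based estimators, scikit-learn exposes impurity-based importances through the fitted `feature_importances_` attribute. The sketch below builds a sorted importance table from it. This is an assumption about what `analyze_feature_importance` might return, not its actual implementation, and both the feature names and the data are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = [f"feature_{i}" for i in range(8)]  # placeholder names
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100, criterion="entropy", random_state=0
).fit(X, y)

# One row per feature, most important first
importance = (
    pd.DataFrame({
        "feature": feature_names,
        "importance": rf.feature_importances_,
    })
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importance)
```

`GaussianNB` has no `feature_importances_` attribute, which is why only the two tree models are passed to the importance analysis above.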

## 9. Model Persistence (Save/Load Models)

In [None]:
# Save all trained models using our module
all_results = {}
for model_name, results in model_results.items():
    all_results[model_name] = {
        'model': all_models[model_name],
        'results': results
    }

save_summary = save_all_models(all_results, save_dir="../Models")

print(f"\n✅ All {len(save_summary)} models saved successfully!")
print("Models can be loaded later using model_persistence.load_model()")

In [None]:
# Test Model Loading using our module
print(f"\nTesting Model Loading:")
print("=" * 25)

# Load one saved model (Random Forest) to verify that loading works
loaded_model, loaded_metadata = load_model_by_name('Random Forest', save_dir="../Models")

if loaded_model and loaded_metadata:
    print("✅ Model loaded successfully!")
    print(f"Model: {loaded_metadata['model_name']}")
    print(f"Test Accuracy: {loaded_metadata['results']['test_accuracy']:.4f}")
else:
    print("❌ No saved model found")

print(f"\n" + "=" * 50)
print("MODEL PERSISTENCE SUMMARY")
print("=" * 50)
print("✅ All models saved with complete metadata")
print("✅ Models can be loaded independently")
print("✅ Easy to use load_model_by_name() function")
print("✅ Model comparison metrics preserved")
print("=" * 50)
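The `model_persistence` module is not shown, so the following is a minimal sketch of saving a model together with its metadata and loading it back by name, using `joblib`. The function names `save_model_sketch` and `load_model_sketch`, the file naming scheme, and the metadata layout are all hypothetical. Returning `(None, None)` for a missing file mirrors the check used in the loading cell above.

```python
import tempfile
from datetime import datetime
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def _model_path(model_name, save_dir):
    """Hypothetical naming scheme: 'Random Forest' -> random_forest.joblib."""
    return Path(save_dir) / f"{model_name.replace(' ', '_').lower()}.joblib"


def save_model_sketch(model, model_name, results, save_dir):
    """Store the model and its metadata in a single joblib file."""
    Path(save_dir).mkdir(parents=True, exist_ok=True)
    payload = {
        "model": model,
        "metadata": {
            "model_name": model_name,
            "results": results,
            "saved_at": datetime.now().isoformat(),
        },
    }
    path = _model_path(model_name, save_dir)
    joblib.dump(payload, path)
    return path


def load_model_sketch(model_name, save_dir):
    """Return (model, metadata), or (None, None) if no file exists."""
    path = _model_path(model_name, save_dir)
    if not path.exists():
        return None, None
    payload = joblib.load(path)
    return payload["model"], payload["metadata"]


X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    save_model_sketch(rf, "Random Forest", {"train_accuracy": rf.score(X, y)}, tmp)
    loaded_model, loaded_metadata = load_model_sketch("Random Forest", tmp)
    missing = load_model_sketch("No Such Model", tmp)

print(loaded_metadata["model_name"], loaded_metadata["results"])
```

Files written by `joblib` are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that saved them.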

## 10. Comprehensive Model Comparison and Analysis

In [None]:
# Create comprehensive comparison using our module
comparison_df = create_comparison_dataframe(all_results)

print("COMPREHENSIVE MODEL COMPARISON")
print("=" * 60)
display(comparison_df.round(4))

# Generate detailed analysis using our module
generate_model_summary_report(comparison_df, all_results)

# Create comprehensive visualizations using our module
plot_comprehensive_comparison(comparison_df, figsize=(16, 12))

# Create performance radar chart using our module
create_performance_radar_chart(comparison_df, figsize=(12, 8))
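`create_comparison_dataframe` takes the nested `all_results` dictionary built in section 9, where each entry holds a `'model'` and a `'results'` dictionary. A sketch of how such a structure could be flattened into one row per model is shown below. The function name `comparison_frame_sketch` is hypothetical, and the numbers are placeholders chosen for illustration; they are not results from this notebook.

```python
import pandas as pd

METRICS = ("test_accuracy", "f1_score", "training_time", "overfitting")

# Placeholder values for illustration only (NOT results from this notebook)
all_results = {
    "Model A": {"model": None, "results": {
        "test_accuracy": 0.90, "f1_score": 0.88,
        "training_time": 0.01, "overfitting": 0.05}},
    "Model B": {"model": None, "results": {
        "test_accuracy": 0.95, "f1_score": 0.94,
        "training_time": 0.20, "overfitting": 0.02}},
}


def comparison_frame_sketch(all_results, metrics=METRICS):
    """Flatten {name: {'model', 'results'}} into one row per model."""
    rows = {
        name: {m: entry["results"][m] for m in metrics}
        for name, entry in all_results.items()
    }
    frame = pd.DataFrame.from_dict(rows, orient="index")[list(metrics)]
    # Best test accuracy first
    return frame.sort_values("test_accuracy", ascending=False)


comparison = comparison_frame_sketch(all_results)
print(comparison.round(4))
```

Keeping the metrics in a single tuple makes it straightforward to add columns such as precision or recall without touching the flattening logic.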