# Lab 4: Machine Learning Ensemble Methods
## Activity: Breast Cancer Classification with Ensemble Methods

This notebook builds a breast cancer classification pipeline: it loads and cleans the data, trains three classifiers (random forest, SVM, and logistic regression), and combines them in a soft-voting ensemble.

In [1]:
# 1. Load the dataset (5pts)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')

# Load the dataset
data = pd.read_csv('data.csv')
print("Dataset loaded successfully!")
print(f"Dataset has {len(data)} rows and {len(data.columns)} columns")

Dataset loaded successfully!
Dataset has 569 rows and 33 columns


In [2]:
# 2. Identify the shape of your data (5pts)
print("Shape of the dataset:")
print(f"Data shape: {data.shape}")
print(f"Number of rows: {data.shape[0]}")
print(f"Number of columns: {data.shape[1]}")

Shape of the dataset:
Data shape: (569, 33)
Number of rows: 569
Number of columns: 33


In [3]:
# 3. List the column names in your dataset (5pts)
print("Column names in the dataset:")
print("\nAll columns:")
for i, col in enumerate(data.columns, 1):
    print(f"{i:2d}. {col}")

print(f"\nTotal number of columns: {len(data.columns)}")

Column names in the dataset:

All columns:
 1. id
 2. diagnosis
 3. radius_mean
 4. texture_mean
 5. perimeter_mean
 6. area_mean
 7. smoothness_mean
 8. compactness_mean
 9. concavity_mean
10. concave points_mean
11. symmetry_mean
12. fractal_dimension_mean
13. radius_se
14. texture_se
15. perimeter_se
16. area_se
17. smoothness_se
18. compactness_se
19. concavity_se
20. concave points_se
21. symmetry_se
22. fractal_dimension_se
23. radius_worst
24. texture_worst
25. perimeter_worst
26. area_worst
27. smoothness_worst
28. compactness_worst
29. concavity_worst
30. concave points_worst
31. symmetry_worst
32. fractal_dimension_worst
33. Unnamed: 32

Total number of columns: 33
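
A trailing column named `Unnamed: 32` is what pandas typically produces when every CSV row ends with an extra delimiter. Whether that is the cause in `data.csv` is an assumption, since this notebook never shows that column's contents. The sketch below reproduces the effect on a small made-up CSV and drops any column that is entirely empty.

```python
import io

import pandas as pd

# Hypothetical three-row CSV whose header and rows all end with a trailing comma.
csv_text = "id,diagnosis,radius_mean,\n1,M,17.99,\n2,B,13.54,\n3,B,12.45,\n"

demo = pd.read_csv(io.StringIO(csv_text))
print(list(demo.columns))  # the last name is generated by pandas

# Drop every column that contains only missing values.
demo_clean = demo.dropna(axis=1, how="all")
print(list(demo_clean.columns))
```

Checking `data['Unnamed: 32'].isna().all()` before dropping the column would confirm that no real values are being discarded.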


In [4]:
# 4. Establish X and Y Matrix (5pts)
# Clean the data first - remove unnecessary columns
print("Data info before cleaning:")
print(data.info())

# Remove 'id' column and any unnamed columns
data_clean = data.drop(['id'], axis=1)
if 'Unnamed: 32' in data_clean.columns:
    data_clean = data_clean.drop(['Unnamed: 32'], axis=1)

# Create feature matrix X and target variable y
X = data_clean.drop(['diagnosis'], axis=1)  # Features (all columns except diagnosis)
y = data_clean['diagnosis']  # Target variable (diagnosis)

print(f"\nFeature matrix X shape: {X.shape}")
print(f"Target variable y shape: {y.shape}")
print(f"\nTarget variable unique values: {y.unique()}")
print(f"Target variable value counts:\n{y.value_counts()}")

Data info before cleaning:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14 

In [5]:
# 5. Perform 70/30 Data Split (5pts)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.30, 
    random_state=42, 
    stratify=y
)

print("Data split completed successfully!")
print("Training set: 70% of the data")
print("Testing set: 30% of the data")
print("Random state: 42 (for reproducibility)")
print("Stratified split: Yes (maintains class distribution)")

Data split completed successfully!
Training set: 70% of the data
Testing set: 30% of the data
Random state: 42 (for reproducibility)
Stratified split: Yes (maintains class distribution)
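
To illustrate what `stratify=y` does, the sketch below splits synthetic labels with the same class counts this notebook reports for the full dataset (357 benign, 212 malignant) and compares the malignant share in each part. The placeholder feature column and the helper name `malignant_share` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as this dataset (357 B, 212 M).
labels = np.array(["B"] * 357 + ["M"] * 212)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder feature column

_, _, lab_train, lab_test = train_test_split(
    features, labels, test_size=0.30, random_state=42, stratify=labels
)


def malignant_share(arr):
    """Return the fraction of samples labelled 'M'."""
    return float(np.mean(arr == "M"))


print(f"overall: {malignant_share(labels):.3f}")
print(f"train:   {malignant_share(lab_train):.3f}")
print(f"test:    {malignant_share(lab_test):.3f}")
```

Without `stratify`, the three shares could drift apart by chance, which matters more as the test set gets smaller.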


In [6]:
# 6. Provide data dimension (train and test) (5pts)
print("Data Dimensions After Split:")
print("=" * 40)
print(f"Training set features (X_train): {X_train.shape}")
print(f"Training set target (y_train): {y_train.shape}")
print(f"Testing set features (X_test): {X_test.shape}")
print(f"Testing set target (y_test): {y_test.shape}")

print("\nDetailed Information:")
print(f"• Total original samples: {len(X)}")
print(f"• Training samples: {len(X_train)} ({len(X_train)/len(X)*100:.1f}%)")
print(f"• Testing samples: {len(X_test)} ({len(X_test)/len(X)*100:.1f}%)")
print(f"• Number of features: {X_train.shape[1]}")

print("\nClass distribution in training set:")
print(y_train.value_counts())
print("\nClass distribution in testing set:")
print(y_test.value_counts())

Data Dimensions After Split:
Training set features (X_train): (398, 30)
Training set target (y_train): (398,)
Testing set features (X_test): (171, 30)
Testing set target (y_test): (171,)

Detailed Information:
• Total original samples: 569
• Training samples: 398 (69.9%)
• Testing samples: 171 (30.1%)
• Number of features: 30

Class distribution in training set:
diagnosis
B    250
M    148
Name: count, dtype: int64

Class distribution in testing set:
diagnosis
B    107
M     64
Name: count, dtype: int64
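
Because the classes are imbalanced, accuracy is easier to interpret against a majority-class baseline. The arithmetic below uses the test-set counts printed above (107 benign, 64 malignant); the variable names are illustrative.

```python
# Test-set class counts taken from the output above.
test_counts = {"B": 107, "M": 64}

total = sum(test_counts.values())
majority_class = max(test_counts, key=test_counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = test_counts[majority_class] / total
print(f"Always predicting {majority_class!r}: {baseline_accuracy:.4f}")
```

Any model evaluated on this split should be judged by how far it clears that baseline, not by its distance from zero.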


In [7]:
# 7. Define the Model (5pts)
# Define three different models, named Jan, Paul, and Llatuna
print("Defining three machine learning models:")
print("=" * 50)

# Model 1: Jan - Random Forest Classifier
Jan = RandomForestClassifier(
    n_estimators=100, 
    random_state=42, 
    max_depth=10
)
print("✓ Jan: Random Forest Classifier")
print("  - n_estimators: 100")
print("  - max_depth: 10")
print("  - random_state: 42")

# Model 2: Paul - Support Vector Machine
Paul = SVC(
    kernel='rbf', 
    random_state=42, 
    probability=True  # Enables predict_proba, which soft voting requires
)
print("\n✓ Paul: Support Vector Machine (SVM)")
print("  - kernel: rbf")
print("  - probability: True")
print("  - random_state: 42")

# Model 3: Llatuna - Logistic Regression
Llatuna = LogisticRegression(
    random_state=42, 
    max_iter=1000
)
print("\n✓ Llatuna: Logistic Regression")
print("  - max_iter: 1000")
print("  - random_state: 42")

print("\nAll three models defined successfully!")

Defining three machine learning models:
✓ Jan: Random Forest Classifier
  - n_estimators: 100
  - max_depth: 10
  - random_state: 42

✓ Paul: Support Vector Machine (SVM)
  - kernel: rbf
  - probability: True
  - random_state: 42

✓ Llatuna: Logistic Regression
  - max_iter: 1000
  - random_state: 42

All three models defined successfully!
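
RBF-kernel SVMs and regularized logistic regression are sensitive to feature scale, and the models above are defined to run on raw features. One common option is to standardize inside a `Pipeline`. The sketch below is not part of the lab. It uses scikit-learn's bundled `load_breast_cancer` data as a stand-in for `data.csv` (assumed, not verified, to be the same dataset), and the names `raw_svm` and `scaled_svm` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: scikit-learn's bundled breast cancer dataset (an assumption).
X_demo, y_demo = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

# Same SVM settings, once on raw features and once behind a StandardScaler.
raw_svm = SVC(kernel="rbf", random_state=42).fit(Xtr, ytr)
scaled_svm = make_pipeline(
    StandardScaler(), SVC(kernel="rbf", random_state=42)
).fit(Xtr, ytr)

raw_accuracy = raw_svm.score(Xte, yte)
scaled_accuracy = scaled_svm.score(Xte, yte)
print(f"SVM on raw features:          {raw_accuracy:.4f}")
print(f"SVM on standardized features: {scaled_accuracy:.4f}")
```

Keeping the scaler inside the pipeline means it is fit on training data only, and the pipeline can be passed to `VotingClassifier` like any other estimator.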


In [8]:
# 8. Build the training model (5pts)
import time

print("Training all three models...")
print("=" * 40)

# Train Jan (Random Forest)
start_time = time.time()
Jan.fit(X_train, y_train)
jan_training_time = time.time() - start_time
print(f"✓ Jan (Random Forest) trained in {jan_training_time:.3f} seconds")

# Train Paul (SVM)
start_time = time.time()
Paul.fit(X_train, y_train)
paul_training_time = time.time() - start_time
print(f"✓ Paul (SVM) trained in {paul_training_time:.3f} seconds")

# Train Llatuna (Logistic Regression)
start_time = time.time()
Llatuna.fit(X_train, y_train)
llatuna_training_time = time.time() - start_time
print(f"✓ Llatuna (Logistic Regression) trained in {llatuna_training_time:.3f} seconds")

total_training_time = jan_training_time + paul_training_time + llatuna_training_time
print(f"\nTotal training time: {total_training_time:.3f} seconds")
print("All models have been successfully trained!")

Training all three models...
✓ Jan (Random Forest) trained in 0.244 seconds
✓ Paul (SVM) trained in 0.072 seconds
✓ Llatuna (Logistic Regression) trained in 0.654 seconds

Total training time: 0.971 seconds
All models have been successfully trained!
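
`time.time()` reads the wall clock, which the system can adjust while the program runs; `time.perf_counter()` is the standard-library clock intended for measuring short durations. A small helper, sketched here under the hypothetical name `timed`, would also remove the repeated start/stop lines.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed seconds for the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("demo", timings):
    sum(i * i for i in range(100_000))  # stand-in for model.fit(X_train, y_train)

print(f"demo took {timings['demo']:.3f} seconds")
```

In this notebook the body of the `with` block would be a call such as `Jan.fit(X_train, y_train)`.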


In [9]:
# 9. Perform prediction on test data (5pts)
print("Making predictions on test data...")
print("=" * 40)

# Make predictions with each model
jan_predictions = Jan.predict(X_test)
paul_predictions = Paul.predict(X_test)
llatuna_predictions = Llatuna.predict(X_test)

print("✓ Jan (Random Forest) predictions completed")
print("✓ Paul (SVM) predictions completed")
print("✓ Llatuna (Logistic Regression) predictions completed")

print(f"\nPrediction details:")
print(f"• Test samples: {len(X_test)}")
print(f"• Jan predictions shape: {jan_predictions.shape}")
print(f"• Paul predictions shape: {paul_predictions.shape}")
print(f"• Llatuna predictions shape: {llatuna_predictions.shape}")

print(f"\nSample predictions (first 10 samples):")
print(f"Jan:     {jan_predictions[:10]}")
print(f"Paul:    {paul_predictions[:10]}")
print(f"Llatuna: {llatuna_predictions[:10]}")
print(f"Actual:  {y_test.iloc[:10].values}")

Making predictions on test data...
✓ Jan (Random Forest) predictions completed
✓ Paul (SVM) predictions completed
✓ Llatuna (Logistic Regression) predictions completed

Prediction details:
• Test samples: 171
• Jan predictions shape: (171,)
• Paul predictions shape: (171,)
• Llatuna predictions shape: (171,)

Sample predictions (first 10 samples):
Jan:     ['B' 'B' 'B' 'B' 'B' 'B' 'M' 'B' 'M' 'B']
Paul:    ['B' 'B' 'B' 'B' 'B' 'B' 'M' 'B' 'B' 'B']
Llatuna: ['B' 'B' 'B' 'B' 'B' 'B' 'M' 'B' 'B' 'B']
Actual:  ['B' 'B' 'B' 'B' 'B' 'B' 'M' 'B' 'M' 'B']
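
In these ten samples the models differ only at index 8, a malignant case that Paul and Llatuna both label benign. Disagreements like this are where a voting ensemble can differ from its best member. The sketch below recomputes this from the printed arrays; the variable names are illustrative.

```python
import numpy as np

# First ten test predictions, copied from the output above.
preds = {
    "Jan": np.array(list("BBBBBBMBMB")),
    "Paul": np.array(list("BBBBBBMBBB")),
    "Llatuna": np.array(list("BBBBBBMBBB")),
}
actual = np.array(list("BBBBBBMBMB"))

# Indices where at least one model disagrees with the true label.
wrong_somewhere = [
    i for i in range(len(actual))
    if any(p[i] != actual[i] for p in preds.values())
]
print("Samples with at least one error:", wrong_somewhere)

# Per-model count of correct predictions on these ten samples.
correct_counts = {name: int((p == actual).sum()) for name, p in preds.items()}
for name, n_correct in correct_counts.items():
    print(f"{name}: {n_correct}/{len(actual)} correct")
```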


In [11]:
# 10. Print Model Performance (5pts)
print("Individual Model Performance")
print("=" * 50)

# Jan (Random Forest) Performance
jan_accuracy = accuracy_score(y_test, jan_predictions)
print(f"\n🌟 Jan (Random Forest) Performance:")
print(f"Accuracy: {jan_accuracy:.4f} ({jan_accuracy*100:.2f}%)")
print("Classification Report:")
print(classification_report(y_test, jan_predictions))

# Paul (SVM) Performance
paul_accuracy = accuracy_score(y_test, paul_predictions)
print(f"\n🌟 Paul (SVM) Performance:")
print(f"Accuracy: {paul_accuracy:.4f} ({paul_accuracy*100:.2f}%)")
print("Classification Report:")
print(classification_report(y_test, paul_predictions))

# Llatuna (Logistic Regression) Performance
llatuna_accuracy = accuracy_score(y_test, llatuna_predictions)
print(f"\n🌟 Llatuna (Logistic Regression) Performance:")
print(f"Accuracy: {llatuna_accuracy:.4f} ({llatuna_accuracy*100:.2f}%)")
print("Classification Report:")
print(classification_report(y_test, llatuna_predictions))

# Summary
print(f"\n📊 Model Performance Summary:")
print(f"Jan (Random Forest):     {jan_accuracy:.4f}")
print(f"Paul (SVM):              {paul_accuracy:.4f}")
print(f"Llatuna (Log. Reg.):     {llatuna_accuracy:.4f}")

# Find best individual model
accuracies = {'Jan': jan_accuracy, 'Paul': paul_accuracy, 'Llatuna': llatuna_accuracy}
best_model = max(accuracies, key=accuracies.get)
print(f"Best individual model:   {best_model} ({accuracies[best_model]:.4f})")

Individual Model Performance

🌟 Jan (Random Forest) Performance:
Accuracy: 0.9649 (96.49%)
Classification Report:
              precision    recall  f1-score   support

           B       0.95      1.00      0.97       107
           M       1.00      0.91      0.95        64

    accuracy                           0.96       171
   macro avg       0.97      0.95      0.96       171
weighted avg       0.97      0.96      0.96       171


🌟 Paul (SVM) Performance:
Accuracy: 0.9006 (90.06%)
Classification Report:
              precision    recall  f1-score   support

           B       0.86      1.00      0.93       107
           M       1.00      0.73      0.85        64

    accuracy                           0.90       171
   macro avg       0.93      0.87      0.89       171
weighted avg       0.91      0.90      0.90       171


🌟 Llatuna (Logistic Regression) Performance:
Accuracy: 0.9474 (94.74%)
Classification Report:
              precision    recall  f1-score   support

      

In [12]:
# 11. Ensemble Method using VotingClassifier
print("🚀 Ensemble Method: VotingClassifier")
print("=" * 50)

# Create the ensemble model using VotingClassifier.
# fit() trains fresh clones of these estimators; the models fitted in step 8 are not reused.
ensemble_model = VotingClassifier(
    estimators=[
        ('Jan', Jan),           # Random Forest
        ('Paul', Paul),         # SVM
        ('Llatuna', Llatuna)    # Logistic Regression
    ],
    voting='soft'  # Average predicted class probabilities across models
)

print("✓ VotingClassifier created with the following models:")
print("  - Jan (Random Forest)")
print("  - Paul (SVM)")  
print("  - Llatuna (Logistic Regression)")
print("  - Voting method: Soft (probability-based)")

# Train the ensemble model
print("\n🔧 Training the ensemble model...")
start_time = time.time()
ensemble_model.fit(X_train, y_train)
ensemble_training_time = time.time() - start_time
print(f"✓ Ensemble model trained in {ensemble_training_time:.3f} seconds")

# Make predictions with the ensemble
print("\n🎯 Making predictions with ensemble...")
ensemble_predictions = ensemble_model.predict(X_test)
ensemble_accuracy = accuracy_score(y_test, ensemble_predictions)

print(f"✓ Ensemble predictions completed")
print(f"✓ Ensemble accuracy: {ensemble_accuracy:.4f} ({ensemble_accuracy*100:.2f}%)")

🚀 Ensemble Method: VotingClassifier
✓ VotingClassifier created with the following models:
  - Jan (Random Forest)
  - Paul (SVM)
  - Llatuna (Logistic Regression)
  - Voting method: Soft (probability-based)

🔧 Training the ensemble model...
✓ Ensemble model trained in 0.518 seconds

🎯 Making predictions with ensemble...
✓ Ensemble predictions completed
✓ Ensemble accuracy: 0.9591 (95.91%)
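
Soft voting with equal weights averages each model's class probabilities and picks the class with the highest mean, whereas hard voting counts one vote per model. The sketch below uses made-up probabilities (not values from this notebook) to show a case where the two rules disagree.

```python
import numpy as np

classes = np.array(["B", "M"])

# Hypothetical predict_proba rows [P(B), P(M)] from three models for one sample.
probas = np.array([
    [0.55, 0.45],
    [0.55, 0.45],
    [0.05, 0.95],
])

# Hard voting: each model casts one vote for its most likely class.
votes = classes[probas.argmax(axis=1)]
vote_labels, vote_counts = np.unique(votes, return_counts=True)
hard_vote = vote_labels[vote_counts.argmax()]

# Soft voting: average the probabilities, then take the most likely class.
mean_proba = probas.mean(axis=0)
soft_vote = classes[mean_proba.argmax()]

print(f"hard vote: {hard_vote}")
print(f"soft vote: {soft_vote} (mean P(M) = {mean_proba[1]:.4f})")
```

One confident model can therefore outweigh two lukewarm ones under soft voting, which makes the quality of each model's probability estimates matter.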


In [13]:
# 12. Ensemble Performance Analysis & Comparison
print("\n🏆 ENSEMBLE vs INDIVIDUAL MODELS COMPARISON")
print("=" * 60)

# Ensemble detailed performance
print(f"\n🔥 Ensemble Model Performance:")
print(f"Accuracy: {ensemble_accuracy:.4f} ({ensemble_accuracy*100:.2f}%)")
print("Classification Report:")
print(classification_report(y_test, ensemble_predictions))

# Complete comparison table
print(f"\n📊 FINAL PERFORMANCE COMPARISON:")
print("=" * 60)
print(f"{'Model':<25} {'Accuracy':<10} {'Percentage':<12}")
print("-" * 60)
print(f"{'Jan (Random Forest)':<25} {jan_accuracy:<10.4f} {jan_accuracy*100:<12.2f}%")
print(f"{'Paul (SVM)':<25} {paul_accuracy:<10.4f} {paul_accuracy*100:<12.2f}%")
print(f"{'Llatuna (Logistic Reg.)':<25} {llatuna_accuracy:<10.4f} {llatuna_accuracy*100:<12.2f}%")
print("-" * 60)
print(f"{'🏆 ENSEMBLE (Voting)':<25} {ensemble_accuracy:<10.4f} {ensemble_accuracy*100:<12.2f}%")
print("=" * 60)

# Analysis
all_accuracies = {
    'Jan': jan_accuracy,
    'Paul': paul_accuracy, 
    'Llatuna': llatuna_accuracy,
    'Ensemble': ensemble_accuracy
}

best_overall = max(all_accuracies, key=all_accuracies.get)
print(f"\n🎯 ANALYSIS:")
print(f"• Best performing model: {best_overall} ({all_accuracies[best_overall]:.4f})")
best_individual_accuracy = max(jan_accuracy, paul_accuracy, llatuna_accuracy)
print(f"• Ensemble accuracy compared to best individual:")
print(f"  {ensemble_accuracy:.4f} vs {best_individual_accuracy:.4f}")

# Differences are absolute (percentage points), not relative percentages.
if ensemble_accuracy > best_individual_accuracy:
    improvement = ensemble_accuracy - best_individual_accuracy
    print(f"  ✅ Ensemble improved by {improvement:.4f} ({improvement*100:.2f} percentage points)")
elif ensemble_accuracy < best_individual_accuracy:
    decrease = best_individual_accuracy - ensemble_accuracy
    print(f"  ⚠️ Ensemble decreased by {decrease:.4f} ({decrease*100:.2f} percentage points)")
else:
    print(f"  ➡️ Ensemble performed equally to best individual model")

print(f"\n✅ Lab 4 Activity Complete!")
print(f"✅ All models trained and evaluated successfully!")
print(f"✅ Ensemble method implemented with VotingClassifier!")


🏆 ENSEMBLE vs INDIVIDUAL MODELS COMPARISON

🔥 Ensemble Model Performance:
Accuracy: 0.9591 (95.91%)
Classification Report:
              precision    recall  f1-score   support

           B       0.94      1.00      0.97       107
           M       1.00      0.89      0.94        64

    accuracy                           0.96       171
   macro avg       0.97      0.95      0.96       171
weighted avg       0.96      0.96      0.96       171


📊 FINAL PERFORMANCE COMPARISON:
Model                     Accuracy   Percentage  
------------------------------------------------------------
Jan (Random Forest)       0.9649     96.49       %
Paul (SVM)                0.9006     90.06       %
Llatuna (Logistic Reg.)   0.9474     94.74       %
------------------------------------------------------------
🏆 ENSEMBLE (Voting)       0.9591     95.91       %

🎯 ANALYSIS:
• Best performing model: Jan (0.9649)
• Ensemble accuracy compared to best individual:
  0.9591 vs 0.9649
  ⚠️ Ensemble decreas
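
The gap between Jan and the ensemble is small in absolute terms. Converting the printed accuracies back to counts of correct predictions (a reconstruction from rounded figures, so treat it as approximate) suggests that the two differ by a single test sample.

```python
n_test = 171  # test-set size reported above

# Accuracies as printed in the comparison table.
reported = {"Jan": 0.9649, "Paul": 0.9006, "Llatuna": 0.9474, "Ensemble": 0.9591}

# Convert each accuracy back to a whole number of correct predictions.
correct = {name: round(acc * n_test) for name, acc in reported.items()}
for name, n_correct in correct.items():
    print(f"{name:<9} {n_correct}/{n_test} correct")

gap = correct["Jan"] - correct["Ensemble"]
print(f"Jan minus Ensemble: {gap} sample(s)")
```

A one-sample difference on a single 171-row split is weak evidence that either model is better; repeating the comparison with cross-validation would give a steadier picture.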