In [1]:
# Step 1: Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer  # Breast Cancer dataset
from sklearn.model_selection import train_test_split  # Train-test split
from sklearn.linear_model import LogisticRegression  # Logistic Regression
from sklearn.svm import SVC  # Support Vector Machine
from sklearn.tree import DecisionTreeClassifier  # Decision Tree
from sklearn.ensemble import VotingClassifier  # Voting Classifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report  # Evaluation metrics


In [2]:
# Step 2: Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data  # Features (numerical measurements of tumor cells)
y = data.target  # Target (0 = malignant, 1 = benign)

In [3]:
# Step 3: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [4]:
# Step 4: Create individual classifiers
log_clf = LogisticRegression(max_iter=200, random_state=42)
svm_clf = SVC(probability=True, random_state=42)  # 'probability=True' is needed for soft voting
tree_clf = DecisionTreeClassifier(random_state=42)


In [5]:
# Step 5: Create an ensemble model using VotingClassifier
# We use both hard and soft voting

# Hard voting: Based on majority class vote
voting_clf_hard = VotingClassifier(estimators=[
    ('lr', log_clf),
    ('svm', svm_clf),
    ('tree', tree_clf)
], voting='hard')

# Soft voting: Based on the averaged probability of class labels
voting_clf_soft = VotingClassifier(estimators=[
    ('lr', log_clf),
    ('svm', svm_clf),
    ('tree', tree_clf)
], voting='soft')

In [6]:
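# Step 6: Train the individual models and the ensemble models
# VotingClassifier fits its own clones of the estimators, so the individual
# fits below are needed only to evaluate each model on its own.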
log_clf.fit(X_train, y_train)
svm_clf.fit(X_train, y_train)
tree_clf.fit(X_train, y_train)
voting_clf_hard.fit(X_train, y_train)
voting_clf_soft.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
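
The warning above suggests scaling the data. A minimal sketch of that option follows, assuming the same split as Step 3. The name `scaled_log_clf` and the use of `StandardScaler` inside a pipeline are illustrative choices, not part of the original notebook, and whether the warning disappears should be checked in your own environment.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same data and split as in Steps 2 and 3
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Assumption: standardizing the features helps the solver converge.
# Keeping the scaler inside the pipeline means it is fit on training data only.
scaled_log_clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=200, random_state=42),
)
scaled_log_clf.fit(X_train, y_train)

scaled_accuracy = scaled_log_clf.score(X_test, y_test)
print("Scaled Logistic Regression Accuracy:", scaled_accuracy)
```

A pipeline like this can be passed to `VotingClassifier` in place of the bare estimator, so each model in the ensemble can have its own preprocessing.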

In [7]:
# Step 7: Make predictions and evaluate the models
# Predictions for hard voting
y_pred_hard = voting_clf_hard.predict(X_test)

# Predictions for soft voting
y_pred_soft = voting_clf_soft.predict(X_test)

In [8]:
# Evaluate the accuracy of individual models and the ensemble
print("Logistic Regression Accuracy:", accuracy_score(y_test, log_clf.predict(X_test)))
print("SVM Accuracy:", accuracy_score(y_test, svm_clf.predict(X_test)))
print("Decision Tree Accuracy:", accuracy_score(y_test, tree_clf.predict(X_test)))
print("Hard Voting Ensemble Accuracy:", accuracy_score(y_test, y_pred_hard))
print("Soft Voting Ensemble Accuracy:", accuracy_score(y_test, y_pred_soft))

# Display confusion matrix for ensemble models
print("\nConfusion Matrix for Hard Voting:\n", confusion_matrix(y_test, y_pred_hard))
print("\nConfusion Matrix for Soft Voting:\n", confusion_matrix(y_test, y_pred_soft))

Logistic Regression Accuracy: 0.9649122807017544
SVM Accuracy: 0.935672514619883
Decision Tree Accuracy: 0.9415204678362573
Hard Voting Ensemble Accuracy: 0.9707602339181286
Soft Voting Ensemble Accuracy: 0.9766081871345029

Confusion Matrix for Hard Voting:
 [[ 59   4]
 [  1 107]]

Confusion Matrix for Soft Voting:
 [[ 61   2]
 [  2 106]]


In [9]:

# Classification reports
print("\nClassification Report for Hard Voting:\n", classification_report(y_test, y_pred_hard))
print("\nClassification Report for Soft Voting:\n", classification_report(y_test, y_pred_soft))


Classification Report for Hard Voting:
               precision    recall  f1-score   support

           0       0.98      0.94      0.96        63
           1       0.96      0.99      0.98       108

    accuracy                           0.97       171
   macro avg       0.97      0.96      0.97       171
weighted avg       0.97      0.97      0.97       171


Classification Report for Soft Voting:
               precision    recall  f1-score   support

           0       0.97      0.97      0.97        63
           1       0.98      0.98      0.98       108

    accuracy                           0.98       171
   macro avg       0.97      0.97      0.97       171
weighted avg       0.98      0.98      0.98       171



## Individual Model Performance:

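*  Logistic Regression performs best as a standalone classifier (96.5% test accuracy), followed by the Decision Tree (94.2%).
*  SVM performs worst among the individual models (93.6%). A likely reason is that the features were not scaled, and the default RBF kernel is sensitive to feature scale.

## Ensemble Performance: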

*  The Hard Voting ensemble (97.1%) outperforms every individual model, including the best one (Logistic Regression, 96.5%).
*  The Soft Voting ensemble (97.7%) performs best overall. Compared with Hard Voting, it raises recall on class 0 (malignant) from 0.94 to 0.97, while recall on class 1 (benign) drops slightly from 0.99 to 0.98. The margin is one test sample out of 171, so the difference is small.
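
The accuracy gaps between these models amount to a few test samples out of 171, so a single split may not rank them reliably. One way to check is cross-validation, sketched below with the same estimator settings as the notebook. The helper `make_estimators`, the dictionary `cv_scores`, and the choice of 5 folds are assumptions made for illustration. The convergence warning is silenced only to keep the output readable.

```python
import warnings

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)


def make_estimators():
    """Return fresh (name, estimator) pairs with the notebook's settings."""
    return [
        ("lr", LogisticRegression(max_iter=200, random_state=42)),
        ("svm", SVC(probability=True, random_state=42)),
        ("tree", DecisionTreeClassifier(random_state=42)),
    ]


# Individual models plus both voting ensembles
models = dict(make_estimators())
models["hard"] = VotingClassifier(estimators=make_estimators(), voting="hard")
models["soft"] = VotingClassifier(estimators=make_estimators(), voting="soft")

cv_scores = {}
with warnings.catch_warnings():
    # Unscaled logistic regression hits max_iter; hide the repeated warning
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    for name, model in models.items():
        # For classifiers, an integer cv uses stratified folds
        cv_scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy")

for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

If the standard deviations are of the same size as the differences between the means, the ranking seen on one split should be read with caution.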

## Why Soft Voting Works Better:

*  Soft Voting averages the predicted class probabilities of the classifiers and picks the class with the highest mean, instead of counting one vote per classifier. A classifier that is highly confident can therefore outweigh classifiers that only narrowly favor another class, which is why Soft Voting often performs better than Hard Voting.
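
The difference between the two voting rules can be seen on a single sample. The probabilities below are made up for illustration and are not taken from the fitted models in this notebook.

```python
import numpy as np

# Hypothetical predicted probabilities for one sample:
# one row per classifier, columns are [P(class 0), P(class 1)]
probas = np.array([
    [0.55, 0.45],  # classifier A narrowly favors class 0
    [0.60, 0.40],  # classifier B narrowly favors class 0
    [0.05, 0.95],  # classifier C strongly favors class 1
])

# Hard voting: each classifier votes for its most likely class,
# and the class with the most votes wins
hard_votes = probas.argmax(axis=1)
hard_prediction = int(np.bincount(hard_votes, minlength=2).argmax())

# Soft voting: average the probabilities, then pick the most likely class
mean_proba = probas.mean(axis=0)
soft_prediction = int(mean_proba.argmax())

print("Hard votes:", hard_votes, "-> prediction:", hard_prediction)
print("Mean probabilities:", mean_proba, "-> prediction:", soft_prediction)
```

Here two weak votes for class 0 win under hard voting, while the one confident classifier pulls the averaged probability toward class 1 under soft voting.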

## Conclusion:

*  Ensemble learning shows the benefit of combining multiple models: the ensemble typically performs as well as, or better than, the best individual model. On this split, both ensembles beat the best individual model.
*  Soft Voting outperforms Hard Voting here, which is consistent with its ability to use the confidence (probability) of the individual classifiers.
*  Class imbalance: the test set is slightly imbalanced (63 malignant and 108 benign samples). Both ensembles keep recall at 0.94 or higher on both classes, and Soft Voting gives the more balanced result (0.97 and 0.98). These results come from a single train-test split, so they suggest rather than prove better generalization.
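
With imbalanced classes, per-class recall is more informative than accuracy alone. As a worked check, the sketch below recomputes both from the two confusion matrices printed in cell 8. The helper names `per_class_recall` and `accuracy_from_cm` are introduced here for illustration.

```python
import numpy as np

# Confusion matrices copied from the notebook output
# (rows = true class, columns = predicted class)
cm_hard = np.array([[59, 4], [1, 107]])
cm_soft = np.array([[61, 2], [2, 106]])


def per_class_recall(cm):
    """Correct predictions per class divided by the true count of that class."""
    return np.diag(cm) / cm.sum(axis=1)


def accuracy_from_cm(cm):
    """All correct predictions divided by the total number of samples."""
    return np.trace(cm) / cm.sum()


recall_hard = per_class_recall(cm_hard)
recall_soft = per_class_recall(cm_soft)

print("Hard voting recall per class:", recall_hard.round(2))
print("Soft voting recall per class:", recall_soft.round(2))
print("Hard voting accuracy:", accuracy_from_cm(cm_hard))
print("Soft voting accuracy:", accuracy_from_cm(cm_soft))
```

The rounded values match the classification reports: Hard Voting misses 4 of the 63 malignant samples, while Soft Voting misses 2.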