# Binary Classification on Breast Cancer Dataset

In this notebook, we will practice binary classification on the Breast Cancer dataset, which is widely used for testing classification algorithms. Our goal is to predict whether a tumor is malignant or benign based on various features.

### Dataset
The Breast Cancer dataset is available through `sklearn.datasets`. Each data point represents a tumor, with features such as radius, texture, smoothness, and other measurements. The target variable indicates whether the tumor is malignant (0) or benign (1).

Let's start by loading the dataset and exploring it.



## Loading and Exploring the Dataset

In [1]:
# TODO: Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Convert to DataFrame for easier exploration
df = pd.DataFrame(X, columns=data.feature_names)
df['target'] = y

# TODO: Display the first few rows of the DataFrame
# Hint: Use the .head() method
df.head()

# TODO: Check for missing values and data types
# Hint: Use .info() on the DataFrame
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

In [2]:
df.describe()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | ... | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 |
| mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.09636 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | ... | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 | 0.627417 |
| std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.07972 | 0.038803 | 0.027414 | 0.00706 | ... | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 | 0.483918 |
| min | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.106 | 0.04996 | ... | 12.02 | 50.41 | 185.2 | 0.07117 | 0.02729 | 0.0 | 0.0 | 0.1565 | 0.05504 | 0.0 |
| 25% | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.06492 | 0.02956 | 0.02031 | 0.1619 | 0.0577 | ... | 21.08 | 84.11 | 515.3 | 0.1166 | 0.1472 | 0.1145 | 0.06493 | 0.2504 | 0.07146 | 0.0 |
| 50% | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 0.09263 | 0.06154 | 0.0335 | 0.1792 | 0.06154 | ... | 25.41 | 97.66 | 686.5 | 0.1313 | 0.2119 | 0.2267 | 0.09993 | 0.2822 | 0.08004 | 1.0 |
| 75% | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 0.1304 | 0.1307 | 0.074 | 0.1957 | 0.06612 | ... | 29.72 | 125.4 | 1084.0 | 0.146 | 0.3391 | 0.3829 | 0.1614 | 0.3179 | 0.09208 | 1.0 |
| max | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 0.3454 | 0.4268 | 0.2012 | 0.304 | 0.09744 | ... | 49.54 | 251.2 | 4254.0 | 0.2226 | 1.058 | 1.252 | 0.291 | 0.6638 | 0.2075 | 1.0 |


In [3]:
# TODO: Check for missing values
# Hint: Use .isnull().sum() to verify if there are any missing values
df.isnull().sum()

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64
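
Since every metric later in the notebook depends on which integer stands for which class, it is worth confirming the label encoding directly from the dataset object. The sketch below reloads the data and prints the class names in label order together with the class counts; the names `cancer`, `label_map`, and `counts` are chosen here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# target_names is ordered by label value: index 0 names class 0, index 1 names class 1
label_map = {int(i): str(name) for i, name in enumerate(cancer.target_names)}
print(label_map)

# Count how many samples fall in each class
counts = np.bincount(cancer.target)
for label, n in enumerate(counts):
    print(f"{label} ({label_map[label]}): {n} samples")
```

The class counts also explain the `target` mean of about 0.627 in the `describe()` output: it is the share of samples labeled 1.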

## Data Preprocessing 

In [4]:


# Since there are no missing values, we can proceed to feature scaling.

# Standardize the feature values
from sklearn.preprocessing import StandardScaler

# TODO: Initialize the scaler and transform features
scaler = StandardScaler()

# TODO: Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.25, random_state= 12)

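# TODO: Fit the scaler on the training set, then transform both sets
# The test set reuses the training mean and standard deviation,
# so no information from the test data leaks into the scaling.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)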
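
When scaling is combined with cross-validation, one way to keep the scaler from seeing held-out data is to wrap it together with the model in a pipeline, so that the scaler is re-fit on the training portion of every fold. The following is a minimal sketch of that idea, assuming logistic regression as the estimator; the name `pipe` and the `max_iter=1000` setting are illustrative choices, not part of the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12
)

# Inside cross_val_score the scaler is fit only on the training part of each
# fold and then applied unchanged to that fold's validation part.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)
print("CV accuracy per fold:", cv_scores)

# Fitting on the full training set learns the scaling statistics from
# training data only; score() reuses them on the test set.
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

The pipeline takes the raw `X_train` and `X_test`, so there is no separate scaled copy of the data to keep in sync.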


## Implementing Classification Algorithms

We'll evaluate each model with 5-fold cross-validation, using the `cross_val_score` function with `cv=5`.

Cross-validation scores a model on several different train/validation splits of the training data. The spread of the fold scores therefore shows how stable a model is, not just how well it did on a single split.
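
To make concrete what `cross_val_score` does with `cv=5`, the sketch below reproduces it by hand: the data is split into five stratified folds, a fresh copy of the model is trained on four of them, and accuracy is measured on the fifth. It assumes the default scoring for classifiers (accuracy) and uses a scaled logistic regression as a stand-in estimator; `model`, `folds`, and `manual_scores` are names introduced here for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# For a classifier, an integer cv means stratified K-fold without shuffling
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in folds.split(X, y):
    fold_model = clone(model)                    # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])   # train on four folds
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))  # score on the fifth

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(model, X, y, cv=5)

print("Manual fold scores:", manual_scores)
print("cross_val_score:   ", library_scores)
print("Mean:", manual_scores.mean(), "Std:", manual_scores.std())
```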



## 1 - Logistic Regression with Cross-Validation

In [5]:
# TODO: Import necessary classes for Logistic Regression, evaluation, and cross-validation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score

# TODO: Initialize the Logistic Regression model
log_reg = LogisticRegression()

# TODO: Perform cross-validation on Logistic Regression model
log_reg_cv_scores = cross_val_score(log_reg, X_train_scaled, y_train, cv = 5)

# Print cross-validation scores and average accuracy
print("Logistic Regression Cross-Validation Scores:", log_reg_cv_scores)
print("Average Cross-Validation Score:", log_reg_cv_scores.mean())

# Train and evaluate Logistic Regression on the test set
log_reg.fit(X_train_scaled, y_train)
y_pred_log = log_reg.predict(X_test_scaled)

# Print Logistic Regression performance metrics
print("Logistic Regression Performance on Test Set:")
print(classification_report(y_test, y_pred_log))
print(confusion_matrix(y_test, y_pred_log))


Logistic Regression Cross-Validation Scores: [0.97674419 0.98823529 0.96470588 0.98823529 0.97647059]
Average Cross-Validation Score: 0.9788782489740082
Logistic Regression Performance on Test Set:
              precision    recall  f1-score   support

           0       0.96      0.94      0.95        53
           1       0.97      0.98      0.97        90

    accuracy                           0.97       143
   macro avg       0.96      0.96      0.96       143
weighted avg       0.96      0.97      0.96       143

[[50  3]
 [ 2 88]]
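
The figures in the classification report can be derived directly from the confusion matrix. As a worked example, the sketch below recomputes precision, recall, F1, and accuracy from the logistic regression matrix printed above, relying on scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class
cm = np.array([[50, 3],
               [2, 88]])

accuracy = np.trace(cm) / cm.sum()

# Per-class recall: correct predictions divided by the row total (true count)
recall = np.diag(cm) / cm.sum(axis=1)

# Per-class precision: correct predictions divided by the column total (predicted count)
precision = np.diag(cm) / cm.sum(axis=0)

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

for label in (0, 1):
    print(f"class {label}: precision={precision[label]:.2f} "
          f"recall={recall[label]:.2f} f1={f1[label]:.2f}")
print(f"accuracy={accuracy:.2f}")
```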


## 2 - Support Vector Machine (SVM) with Cross-Validation

In [6]:
# TODO: Import SVC from sklearn.svm
from sklearn.svm import SVC

# TODO: Initialize the SVM model with kernel='linear'
svm = SVC(kernel= 'linear')

# TODO: Perform cross-validation on the SVM model
svm_cv_scores = cross_val_score(svm, X_train_scaled, y_train, cv = 5)

# Print cross-validation scores and average accuracy
print("SVM Cross-Validation Scores:", svm_cv_scores)
print("Average Cross-Validation Score:", svm_cv_scores.mean())

# Train and evaluate SVM on the test set
svm.fit(X_train_scaled, y_train)
y_pred_svm = svm.predict(X_test_scaled)

# Print SVM performance metrics
print("SVM Performance on Test Set:")
print(classification_report(y_test, y_pred_svm))
print(confusion_matrix(y_test, y_pred_svm))


SVM Cross-Validation Scores: [0.98837209 0.98823529 0.97647059 0.98823529 0.97647059]
Average Cross-Validation Score: 0.9835567715458277
SVM Performance on Test Set:
              precision    recall  f1-score   support

           0       0.94      0.94      0.94        53
           1       0.97      0.97      0.97        90

    accuracy                           0.96       143
   macro avg       0.96      0.96      0.96       143
weighted avg       0.96      0.96      0.96       143

[[50  3]
 [ 3 87]]


## 3 - K-Nearest Neighbors (KNN) with Cross-Validation

In [7]:
# TODO: Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier

# TODO: Initialize the KNN model with n_neighbors=5
knn = KNeighborsClassifier(n_neighbors=5)

# TODO: Perform cross-validation on the KNN model
knn_cv_scores = cross_val_score(knn, X_train_scaled, y_train, cv =5)

# Print cross-validation scores and average accuracy
print("KNN Cross-Validation Scores:", knn_cv_scores)
print("Average Cross-Validation Score:", knn_cv_scores.mean())

# Train and evaluate KNN on the test set
knn.fit(X_train_scaled, y_train)
y_pred_knn = knn.predict(X_test_scaled)

# Print KNN performance metrics
print("KNN Performance on Test Set:")
print(classification_report(y_test, y_pred_knn))
print(confusion_matrix(y_test, y_pred_knn))


KNN Cross-Validation Scores: [0.96511628 0.97647059 0.96470588 0.98823529 0.96470588]
Average Cross-Validation Score: 0.9718467852257182
KNN Performance on Test Set:
              precision    recall  f1-score   support

           0       0.94      0.91      0.92        53
           1       0.95      0.97      0.96        90

    accuracy                           0.94       143
   macro avg       0.94      0.94      0.94       143
weighted avg       0.94      0.94      0.94       143

[[48  5]
 [ 3 87]]


## 4 - Decision Tree with Cross-Validation

In [8]:
# TODO: Import DecisionTreeClassifier from sklearn.tree
from sklearn.tree import DecisionTreeClassifier

# TODO: Initialize the Decision Tree model
dt = DecisionTreeClassifier()

# TODO: Perform cross-validation on the Decision Tree model
dt_cv_scores = cross_val_score(dt, X_train_scaled, y_train, cv = 5)

# Print cross-validation scores and average accuracy
print("Decision Tree Cross-Validation Scores:", dt_cv_scores)
print("Average Cross-Validation Score:", dt_cv_scores.mean())

# Train and evaluate Decision Tree on the test set
dt.fit(X_train_scaled, y_train)
y_pred_dt = dt.predict(X_test_scaled)

# Print Decision Tree performance metrics
print("Decision Tree Performance on Test Set:")
print(classification_report(y_test, y_pred_dt))
print(confusion_matrix(y_test, y_pred_dt))


Decision Tree Cross-Validation Scores: [0.93023256 0.95294118 0.96470588 0.96470588 0.91764706]
Average Cross-Validation Score: 0.9460465116279071
Decision Tree Performance on Test Set:
              precision    recall  f1-score   support

           0       0.90      0.85      0.87        53
           1       0.91      0.94      0.93        90

    accuracy                           0.91       143
   macro avg       0.91      0.90      0.90       143
weighted avg       0.91      0.91      0.91       143

[[45  8]
 [ 5 85]]
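
The tree above is created without a `random_state`, so its scores can differ slightly from run to run (scikit-learn permutes the features at each split, which matters when several splits are equally good). A minimal sketch of how a fixed seed makes the result repeatable follows; the seed value `0` and the names `tree_a` and `tree_b` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12
)

# Two independent trees built with the same seed
tree_a = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_b = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# With identical data and seed, both trees make identical predictions
same_predictions = np.array_equal(tree_a.predict(X_test), tree_b.predict(X_test))
print("Identical predictions with a fixed seed:", same_predictions)
```

The same parameter exists on `RandomForestClassifier` and `GradientBoostingClassifier`.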


## 5 - Random Forest with Cross-Validation

In [9]:
# TODO: Import RandomForestClassifier from sklearn.ensemble
from sklearn.ensemble import RandomForestClassifier

# TODO: Initialize the Random Forest model
rf = RandomForestClassifier()

# TODO: Perform cross-validation on the Random Forest model
rf_cv_scores = cross_val_score(rf, X_train_scaled, y_train, cv = 5)

# Print cross-validation scores and average accuracy
print("Random Forest Cross-Validation Scores:", rf_cv_scores)
print("Average Cross-Validation Score:", rf_cv_scores.mean())

# Train and evaluate Random Forest on the test set
rf.fit(X_train_scaled, y_train)
y_pred_rf = rf.predict(X_test_scaled)

# Print Random Forest performance metrics
print("Random Forest Performance on Test Set:")
print(classification_report(y_test, y_pred_rf))
print(confusion_matrix(y_test, y_pred_rf))


Random Forest Cross-Validation Scores: [0.94186047 0.97647059 0.97647059 0.98823529 0.96470588]
Average Cross-Validation Score: 0.9695485636114911
Random Forest Performance on Test Set:
              precision    recall  f1-score   support

           0       0.87      0.91      0.89        53
           1       0.94      0.92      0.93        90

    accuracy                           0.92       143
   macro avg       0.91      0.91      0.91       143
weighted avg       0.92      0.92      0.92       143

[[48  5]
 [ 7 83]]


## 6 - Gradient Boosting with Cross-Validation

In [10]:
# TODO: Import GradientBoostingClassifier from sklearn.ensemble
from sklearn.ensemble import GradientBoostingClassifier

# TODO: Initialize the Gradient Boosting model
gb = GradientBoostingClassifier()

# TODO: Perform cross-validation on the Gradient Boosting model
gb_cv_scores = cross_val_score(gb, X_train_scaled, y_train, cv = 5)

# Print cross-validation scores and average accuracy
print("Gradient Boosting Cross-Validation Scores:", gb_cv_scores)
print("Average Cross-Validation Score:", gb_cv_scores.mean())

# Train and evaluate Gradient Boosting on the test set
gb.fit(X_train_scaled, y_train)
y_pred_gb = gb.predict(X_test_scaled)

# Print Gradient Boosting performance metrics
print("Gradient Boosting Performance on Test Set:")
print(classification_report(y_test, y_pred_gb))
print(confusion_matrix(y_test, y_pred_gb))


Gradient Boosting Cross-Validation Scores: [0.95348837 0.97647059 0.97647059 0.98823529 0.92941176]
Average Cross-Validation Score: 0.9648153214774282
Gradient Boosting Performance on Test Set:
              precision    recall  f1-score   support

           0       0.91      0.91      0.91        53
           1       0.94      0.94      0.94        90

    accuracy                           0.93       143
   macro avg       0.93      0.93      0.93       143
weighted avg       0.93      0.93      0.93       143

[[48  5]
 [ 5 85]]
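
The six sections above all follow the same pattern: cross-validate on the training set, fit, then predict on the test set. That pattern can be sketched as a single loop over a dictionary of estimators. The block below is one possible arrangement, not the notebook's own code: the names `models` and `summary`, the `random_state=0` seeds, and the use of `scaler.transform` on the test set are assumptions made here, so its numbers need not match the outputs recorded above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12
)

# Scale using statistics learned from the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(),
    "SVM": SVC(kernel="linear"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

summary = {}
for name, model in models.items():
    # 5-fold cross-validation on the training set
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
    # Fit on the full training set and evaluate once on the held-out test set
    model.fit(X_train_scaled, y_train)
    test_acc = accuracy_score(y_test, model.predict(X_test_scaled))
    summary[name] = (cv_scores.mean(), test_acc)
    print(f"{name:<20} CV mean={cv_scores.mean():.4f}  test accuracy={test_acc:.4f}")
```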



## Model Comparison and Conclusion

Now that we have explored six classification algorithms, let’s compare their performance using the cross-validation scores as well as the accuracy, precision, recall, and F1-score on the test set.

Consider the stability and reliability of each model based on their cross-validation scores. Reflect on which models might be most suitable for predicting tumor malignancy and why, based on your observations.
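
One simple way to look at stability is the spread of each model's fold scores. The sketch below computes the mean, standard deviation, and range from the fold scores printed in the sections above (copied by hand into the hypothetical `cv_scores` dictionary), then lists the models from most to least stable.

```python
import numpy as np

# Fold scores copied from the cross-validation outputs printed earlier
cv_scores = {
    "Logistic Regression": [0.97674419, 0.98823529, 0.96470588, 0.98823529, 0.97647059],
    "SVM":                 [0.98837209, 0.98823529, 0.97647059, 0.98823529, 0.97647059],
    "KNN":                 [0.96511628, 0.97647059, 0.96470588, 0.98823529, 0.96470588],
    "Decision Tree":       [0.93023256, 0.95294118, 0.96470588, 0.96470588, 0.91764706],
    "Random Forest":       [0.94186047, 0.97647059, 0.97647059, 0.98823529, 0.96470588],
    "Gradient Boosting":   [0.95348837, 0.97647059, 0.97647059, 0.98823529, 0.92941176],
}

# (mean, standard deviation, max - min) for each model
stability = {
    name: (np.mean(scores), np.std(scores), np.max(scores) - np.min(scores))
    for name, scores in cv_scores.items()
}

# Sort from the smallest to the largest standard deviation across folds
for name, (mean, std, spread) in sorted(stability.items(), key=lambda kv: kv[1][1]):
    print(f"{name:<20} mean={mean:.4f}  std={std:.4f}  range={spread:.4f}")
```

On these recorded scores the linear SVM has the smallest standard deviation and Gradient Boosting the largest, although with only five folds such differences should be read cautiously.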


In [16]:
# Import necessary library for DataFrame
import pandas as pd
from sklearn.metrics import classification_report, accuracy_score

# Create a dictionary to hold the model names and their performance metrics
model_results = {
    "Model": ["Logistic Regression", "SVM", "KNN", "Decision Tree", "Random Forest", "Gradient Boosting"],
    "Avg. Cross-Validation Score": [
        log_reg_cv_scores.mean(), 
        svm_cv_scores.mean(), 
        knn_cv_scores.mean(), 
        dt_cv_scores.mean(), 
        rf_cv_scores.mean(), 
        gb_cv_scores.mean()
    ],
    "Test Accuracy": [
        accuracy_score(y_test, y_pred_log),
        accuracy_score(y_test, y_pred_svm),
        accuracy_score(y_test, y_pred_knn),
        accuracy_score(y_test, y_pred_dt),
        accuracy_score(y_test, y_pred_rf),
        accuracy_score(y_test, y_pred_gb)
    ],
    "Precision": [
        classification_report(y_test, y_pred_log, output_dict=True)['1']['precision'],
        classification_report(y_test, y_pred_svm, output_dict=True)['1']['precision'],
        classification_report(y_test, y_pred_knn, output_dict=True)['1']['precision'],
        classification_report(y_test, y_pred_dt, output_dict=True)['1']['precision'],
        classification_report(y_test, y_pred_rf, output_dict=True)['1']['precision'],
        classification_report(y_test, y_pred_gb, output_dict=True)['1']['precision']
    ],
    "Recall": [
        classification_report(y_test, y_pred_log, output_dict=True)['1']['recall'],
        classification_report(y_test, y_pred_svm, output_dict=True)['1']['recall'],
        classification_report(y_test, y_pred_knn, output_dict=True)['1']['recall'],
        classification_report(y_test, y_pred_dt, output_dict=True)['1']['recall'],
        classification_report(y_test, y_pred_rf, output_dict=True)['1']['recall'],
        classification_report(y_test, y_pred_gb, output_dict=True)['1']['recall']
    ],
    "F1-Score": [
        classification_report(y_test, y_pred_log, output_dict=True)['1']['f1-score'],
        classification_report(y_test, y_pred_svm, output_dict=True)['1']['f1-score'],
        classification_report(y_test, y_pred_knn, output_dict=True)['1']['f1-score'],
        classification_report(y_test, y_pred_dt, output_dict=True)['1']['f1-score'],
        classification_report(y_test, y_pred_rf, output_dict=True)['1']['f1-score'],
        classification_report(y_test, y_pred_gb, output_dict=True)['1']['f1-score']
    ]
}

# Convert the dictionary to a DataFrame for easy visualization
results_df = pd.DataFrame(model_results)

# Display the table
results_df


| | Model | Avg. Cross-Validation Score | Test Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.978878 | 0.965035 | 0.967033 | 0.977778 | 0.972376 |
| 1 | SVM | 0.983557 | 0.958042 | 0.966667 | 0.966667 | 0.966667 |
| 2 | KNN | 0.971847 | 0.944056 | 0.945652 | 0.966667 | 0.956044 |
| 3 | Decision Tree | 0.946047 | 0.909091 | 0.913978 | 0.944444 | 0.928962 |
| 4 | Random Forest | 0.969549 | 0.916084 | 0.943182 | 0.922222 | 0.932584 |
| 5 | Gradient Boosting | 0.964815 | 0.93007 | 0.944444 | 0.944444 | 0.944444 |


### Are the Results Robust?

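Consistency: When cross-validation and test scores are both high and close to each other, the model is likely generalizing well rather than overfitting the training data. Here every model scores somewhat lower on the test set than in cross-validation. The gap is smallest for Logistic Regression (about 0.014) and largest for Random Forest (about 0.053), so on this split the tree-based ensembles look less robust than their cross-validation scores alone suggest.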
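
The consistency check can be made explicit by subtracting test accuracy from the average cross-validation score for each model. The sketch below does this with the values from the comparison table above, copied by hand into a hypothetical `results` dictionary.

```python
# (average CV score, test accuracy) copied from the comparison table above
results = {
    "Logistic Regression": (0.978878, 0.965035),
    "SVM":                 (0.983557, 0.958042),
    "KNN":                 (0.971847, 0.944056),
    "Decision Tree":       (0.946047, 0.909091),
    "Random Forest":       (0.969549, 0.916084),
    "Gradient Boosting":   (0.964815, 0.930070),
}

# Positive gap: the model scored lower on the test set than in cross-validation
gaps = {name: cv - test for name, (cv, test) in results.items()}

# List models from the smallest to the largest gap
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name:<20} CV - test = {gap:+.4f}")
```

A single test split of 143 samples is small, so a gap of a few percentage points corresponds to only a handful of misclassified tumors.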

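In medical contexts, preference is often given to models with higher recall (the fraction of actual positive cases that the model finds), even at some cost in precision, because missing a malignant tumor is usually more serious than flagging a benign tumor as malignant. Note that scikit-learn encodes malignant as class 0 in this dataset, while the Precision, Recall, and F1-Score columns in the table above are taken from class 1 (benign). The quantity that matters for this argument is therefore the recall of class 0, shown in the first row of each classification report.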
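
Because malignant is encoded as 0 in this dataset, recall for the malignant class has to be requested explicitly. The small sketch below uses made-up labels (the arrays `y_true` and `y_pred` are toy data, not notebook results) to show how the `pos_label` argument of `recall_score` changes which class the recall refers to.

```python
from sklearn.metrics import recall_score

# Toy labels for illustration only: 0 = malignant, 1 = benign
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0]

# Default pos_label=1: fraction of benign cases correctly identified
recall_benign = recall_score(y_true, y_pred)

# pos_label=0: fraction of malignant cases correctly identified
recall_malignant = recall_score(y_true, y_pred, pos_label=0)

print(f"benign recall:    {recall_benign:.3f}")
print(f"malignant recall: {recall_malignant:.3f}")
```

In the notebook, the equivalent call for logistic regression would be `recall_score(y_test, y_pred_log, pos_label=0)`.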