# Binary Classification on Breast Cancer Dataset

In this notebook, we will practice binary classification on the Breast Cancer dataset, which is widely used for testing classification algorithms. Our goal is to predict whether a tumor is malignant or benign based on various features.

### Dataset
The Breast Cancer dataset is available through `sklearn.datasets`. Each data point represents a tumor, with features such as radius, texture, smoothness, and other measurements. The target variable indicates whether the tumor is malignant (0) or benign (1).

Let's start by loading the dataset and exploring it.
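Before building any model, it helps to confirm the label encoding directly from the dataset object instead of relying on memory. The sketch below is a minimal check; the names `cancer`, `label_names`, and `label_counts` are illustrative and are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# target_names is ordered by integer label: index 0 names label 0, index 1 names label 1
label_names = {label: str(name) for label, name in enumerate(cancer.target_names)}

# Number of samples carrying each integer label
label_counts = {label: int(n) for label, n in enumerate(np.bincount(cancer.target))}

print(label_names)   # {0: 'malignant', 1: 'benign'}
print(label_counts)
```

Knowing which integer stands for the malignant class matters later, when we read per-class precision and recall.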



## Loading and Exploring the Dataset

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Convert to DataFrame for easier exploration
df = pd.DataFrame(X, columns=data.feature_names)
df['target'] = y

# TODO: Display the first few rows of the DataFrame
# Hint: Use the .head() method


# TODO: Check for missing values and data types
# Hint: Use .info() on the DataFrame



## Data Preprocessing 

In [None]:
# TODO: Check for missing values
# Hint: Use .isnull().sum() to verify if there are any missing values


# Since there are no missing values, we can proceed to feature scaling.

# Standardize the feature values
from sklearn.preprocessing import StandardScaler

# TODO: Initialize the scaler


# TODO: Fit the scaler and transform the features


# TODO: Split the dataset into training and testing sets
# Hint: Use train_test_split and name the outputs X_train, X_test, y_train, y_test
from sklearn.model_selection import train_test_split



## Implementing Classification Algorithms

We'll evaluate each model with 5-fold cross-validation, using the `cross_val_score` function.

Cross-validation trains and scores each model on several different splits of the data. The spread of the fold scores tells us how stable a model is, and their mean gives a more reliable estimate of performance than a single split.
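As a rough illustration of the mechanics, here is a minimal sketch of 5-fold cross-validation. It is not the notebook's solution: it wraps the scaler and a classifier in a `make_pipeline` (an assumption on our part) so that scaling is refit inside each fold, and the names `X_demo`, `y_demo`, `demo_pipeline`, and `demo_scores` are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Scaler and classifier are chained so the scaler is fit on each training fold only
demo_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# cv=5 returns one score per fold; the default scoring for classifiers is accuracy
demo_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5)

print("Fold scores:", demo_scores)
print("Mean:", demo_scores.mean(), "Std:", demo_scores.std())
```

A small standard deviation across folds suggests the model behaves similarly regardless of which samples it was trained on.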



### 1 - Logistic Regression with Cross-Validation

In [None]:
# TODO: Import necessary classes for Logistic Regression, evaluation, and cross-validation


# TODO: Initialize the Logistic Regression model


# TODO: Perform cross-validation on Logistic Regression model


# Print cross-validation scores and average accuracy
print("Logistic Regression Cross-Validation Scores:", log_reg_cv_scores)
print("Average Cross-Validation Score:", log_reg_cv_scores.mean())

# Train and evaluate Logistic Regression on the test set
log_reg.fit(X_train, y_train)
y_pred_log = log_reg.predict(X_test)

# Print Logistic Regression performance metrics
print("Logistic Regression Performance on Test Set:")
print(classification_report(y_test, y_pred_log))
print(confusion_matrix(y_test, y_pred_log))


### 2 - Support Vector Machine (SVM) with Cross-Validation

In [None]:
# TODO: Import SVC from sklearn.svm


# TODO: Initialize the SVM model with kernel='linear'


# TODO: Perform cross-validation on the SVM model


# Print cross-validation scores and average accuracy
print("SVM Cross-Validation Scores:", svm_cv_scores)
print("Average Cross-Validation Score:", svm_cv_scores.mean())

# Train and evaluate SVM on the test set
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)

# Print SVM performance metrics
print("SVM Performance on Test Set:")
print(classification_report(y_test, y_pred_svm))
print(confusion_matrix(y_test, y_pred_svm))


### 3 - K-Nearest Neighbors (KNN) with Cross-Validation

In [None]:
# TODO: Import KNeighborsClassifier from sklearn.neighbors


# TODO: Initialize the KNN model with n_neighbors=5


# TODO: Perform cross-validation on the KNN model


# Print cross-validation scores and average accuracy
print("KNN Cross-Validation Scores:", knn_cv_scores)
print("Average Cross-Validation Score:", knn_cv_scores.mean())

# Train and evaluate KNN on the test set
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)

# Print KNN performance metrics
print("KNN Performance on Test Set:")
print(classification_report(y_test, y_pred_knn))
print(confusion_matrix(y_test, y_pred_knn))


### 4 - Decision Tree with Cross-Validation

In [None]:
# TODO: Import DecisionTreeClassifier from sklearn.tree


# TODO: Initialize the Decision Tree model


# TODO: Perform cross-validation on the Decision Tree model


# Print cross-validation scores and average accuracy
print("Decision Tree Cross-Validation Scores:", dt_cv_scores)
print("Average Cross-Validation Score:", dt_cv_scores.mean())

# Train and evaluate Decision Tree on the test set
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

# Print Decision Tree performance metrics
print("Decision Tree Performance on Test Set:")
print(classification_report(y_test, y_pred_dt))
print(confusion_matrix(y_test, y_pred_dt))


### 5 - Random Forest with Cross-Validation

In [None]:
# TODO: Import RandomForestClassifier from sklearn.ensemble


# TODO: Initialize the Random Forest model


# TODO: Perform cross-validation on the Random Forest model


# Print cross-validation scores and average accuracy
print("Random Forest Cross-Validation Scores:", rf_cv_scores)
print("Average Cross-Validation Score:", rf_cv_scores.mean())

# Train and evaluate Random Forest on the test set
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Print Random Forest performance metrics
print("Random Forest Performance on Test Set:")
print(classification_report(y_test, y_pred_rf))
print(confusion_matrix(y_test, y_pred_rf))


### 6 - Gradient Boosting with Cross-Validation

In [None]:
# TODO: Import GradientBoostingClassifier from sklearn.ensemble


# TODO: Initialize the Gradient Boosting model


# TODO: Perform cross-validation on the Gradient Boosting model


# Print cross-validation scores and average accuracy
print("Gradient Boosting Cross-Validation Scores:", gb_cv_scores)
print("Average Cross-Validation Score:", gb_cv_scores.mean())

# Train and evaluate Gradient Boosting on the test set
gb.fit(X_train, y_train)
y_pred_gb = gb.predict(X_test)

# Print Gradient Boosting performance metrics
print("Gradient Boosting Performance on Test Set:")
print(classification_report(y_test, y_pred_gb))
print(confusion_matrix(y_test, y_pred_gb))



## Model Comparison and Conclusion

Now that we have explored six classification algorithms, let’s compare their performance using the cross-validation scores as well as the accuracy, precision, recall, and F1-score on the test set.

Consider the stability and reliability of each model based on their cross-validation scores. Reflect on which models might be most suitable for predicting tumor malignancy and why, based on your observations.


In [None]:
# Import necessary library for DataFrame
import pandas as pd

# Gather each model's name, cross-validation scores, fitted estimator, and test predictions
fitted_models = [
    ("Logistic Regression", log_reg_cv_scores, log_reg, y_pred_log),
    ("SVM", svm_cv_scores, svm, y_pred_svm),
    ("KNN", knn_cv_scores, knn, y_pred_knn),
    ("Decision Tree", dt_cv_scores, dt, y_pred_dt),
    ("Random Forest", rf_cv_scores, rf, y_pred_rf),
    ("Gradient Boosting", gb_cv_scores, gb, y_pred_gb),
]

# Build one row of performance metrics per model
model_results = []
for name, cv_scores, model, y_pred in fitted_models:
    # Compute the report once per model. The '1' key reports metrics for the class
    # labeled 1 (benign in load_breast_cancer); use '0' for the malignant class.
    class_report = classification_report(y_test, y_pred, output_dict=True)['1']
    model_results.append({
        "Model": name,
        "Avg. Cross-Validation Score": cv_scores.mean(),
        "Test Accuracy": model.score(X_test, y_test),
        "Precision": class_report['precision'],
        "Recall": class_report['recall'],
        "F1-Score": class_report['f1-score'],
    })

# Convert the list of rows to a DataFrame for easy visualization
results_df = pd.DataFrame(model_results)

# Display the table
results_df


### Are the Results Robust?

Consistency: If the cross-validation scores and the test-set scores are both high and close to each other, the model is likely robust. A large gap between the two, or a wide spread across the folds, suggests overfitting or sensitivity to the particular split.
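One way to make this check concrete is to put the fold mean, the fold spread, and the test accuracy side by side. The sketch below assumes a hypothetical helper, `summarize_consistency`, which is not part of scikit-learn or of this notebook, and demonstrates it on a decision tree with an illustrative split (`random_state=0`, 20% test size).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier


def summarize_consistency(cv_scores, test_accuracy):
    """Hypothetical helper: compare cross-validation scores with a test-set accuracy."""
    cv_scores = np.asarray(cv_scores, dtype=float)
    return {
        "cv_mean": float(cv_scores.mean()),
        "cv_std": float(cv_scores.std()),
        "test_accuracy": float(test_accuracy),
        # Negative gap: the test score is lower than the cross-validation mean
        "gap": float(test_accuracy - cv_scores.mean()),
    }


X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0, stratify=y_all
)

tree = DecisionTreeClassifier(random_state=0)
tree_cv_scores = cross_val_score(tree, X_tr, y_tr, cv=5)  # 5-fold CV on the training set
tree.fit(X_tr, y_tr)

summary = summarize_consistency(tree_cv_scores, tree.score(X_te, y_te))
print(summary)
```

In the notebook itself, the same helper could be called with, for example, `log_reg_cv_scores` and `log_reg.score(X_test, y_test)`.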

In medical contexts, models with higher recall for the malignant class are often preferred, even at some cost in precision. Recall measures the fraction of actual positive cases that the model finds, and missing a malignant tumor is usually more serious than flagging a benign tumor as malignant. Note that `load_breast_cancer` encodes malignant as 0 and benign as 1, so the recall that matters most here is the recall for class 0.
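To read recall for a specific class, `recall_score` and `precision_score` accept a `pos_label` argument. The sketch below uses small made-up label lists purely for illustration (they are not results from this notebook), with 0 standing for malignant and 1 for benign as in `load_breast_cancer`.

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels for illustration only: 0 = malignant, 1 = benign
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

# Treat the malignant class (label 0) as the positive class
malignant_recall = recall_score(y_true_toy, y_pred_toy, pos_label=0)
malignant_precision = precision_score(y_true_toy, y_pred_toy, pos_label=0)

# For comparison, recall of the benign class (label 1)
benign_recall = recall_score(y_true_toy, y_pred_toy, pos_label=1)

print(malignant_recall)     # 0.75: 3 of the 4 malignant cases were found
print(malignant_precision)  # 0.75: 3 of the 4 malignant predictions were correct
print(benign_recall)
```

With the notebook's variables, the equivalent call would be along the lines of `recall_score(y_test, y_pred_log, pos_label=0)`.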