# Breast Cancer Detection using Classification Models

**Scenario**: You receive a dataset which includes various cytological features on breast masses and whether the mass is benign (non-cancerous) or malignant (cancerous).

**Goal**: Develop prediction models using all classification models introduced in the course to determine whether a new mass being diagnosed for breast cancer is benign or malignant.

**Results**: 

The dataset has 683 rows and nine cytological features, in which 65% of the masses are benign and the remaining 35% are malignant.

The dependent variable `Class` was label encoded to provide more intuitive values and because the XGBoost module cannot properly read the default values. The benign class value of `2` was relabeled `0` and the malignant class value of `4` was relabeled `1`.

All the classification models used and their results are shown below in descending order of Accuracy (w/ k-Fold CV).

| Model | Confusion matrix |  Accuracy Score<br>(single test set) | Accuracy<br>(w/ k-Fold CV) | Standard Deviation<br>(w/ k-Fold CV) |
| :-- | :--: | :--: | :--: | :--: |
| Support Vector Machine (SVM) | `[83 4]`<br>`[2 48]` | 0.9562 | 97.07% | 2.19% |
| XGBoost | `[84 3]`<br>`[1 49]` | 0.9708 | 96.89% | 2.17% |
| Kernel SVM | `[82 5]`<br>`[1 49]` | 0.9562 | 96.89% | 2.17% |
| K-Nearest Neighbors (K-NN) | `[83 4]`<br>`[2 48]` | 0.9562 | 96.70% | 1.79% |
| Logistic Regression | `[84 3]`<br>`[3 47]` | 0.9562 | 96.70% | 1.97% |
| CatBoost | `[84 3]`<br>`[0 50]` | 0.9781 | 96.53% | 2.50% |
| Naive Bayes | `[80 7]`<br>`[0 50]` | 0.9489 | 96.52% | 2.24% |
| Random Forest | `[83 4]`<br>`[3 47]` | 0.9489 | 96.34% | 2.16% |
| Decision Tree | `[84 3]`<br>`[3 47]` | 0.9562 | 94.33% | 2.65% |

Feature scaling was applied to every model for consistency. It matters for the logistic regression, K-NN, SVM, and kernel SVM models; the remaining models are insensitive to feature scaling, so their results are unaffected.

Comparing all the models, CatBoost had the highest accuracy on the single test set at 97.8%. CatBoost and Naive Bayes were the only models that predicted no false negatives. However, Naive Bayes also predicted the most false positives of any model (7).
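
The comparison above can be checked directly from the confusion matrices in the table. The following is a minimal sketch, assuming the scikit-learn layout where rows are true classes and columns are predicted classes, ordered benign (`0`) then malignant (`1`). The helper name `summarize` is hypothetical and not part of the notebook.

```python
def summarize(cm):
    """Return accuracy, error counts, and recall for a 2x2 confusion matrix.

    Assumed layout: [[TN, FP], [FN, TP]] with malignant (1) as the positive class.
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "false_positives": fp,   # benign masses flagged as malignant
        "false_negatives": fn,   # malignant masses missed
        "recall": tp / (tp + fn),
    }

# Confusion matrices copied from the results table
catboost_cm = [[84, 3], [0, 50]]
naive_bayes_cm = [[80, 7], [0, 50]]

catboost_summary = summarize(catboost_cm)
naive_bayes_summary = summarize(naive_bayes_cm)

print(catboost_summary)
print(naive_bayes_summary)
```

Both models reach a recall of 1.0 on the malignant class, but Naive Bayes pays for it with more benign masses flagged as malignant.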

With 10-fold cross-validation, the SVM model had the highest mean accuracy (97.07%). Every other model scores above 96% except Decision Tree, which is the lowest performer at 94.33%.

Note that none of the models were tuned for optimal parameter values, except CatBoost, which is self-tuning.

## Importing the libraries

In [1]:
import numpy as np
import pandas as pd
import warnings

warnings.filterwarnings("ignore")

## Importing the dataset

We disregard the first column `Sample code number` as an independent variable since it only identifies the patient and provides no substantial measure in assessing whether the cancer is benign or malignant. Therefore, the subset `X` does not include this column.

In [2]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values

In [3]:
len(dataset)

683

The dataset has 683 rows of data. Since this is a smaller dataset, we'll use an 80/20 split for the training set and test set.

## Label encoding the dependent variable 

The dependent variable `y` is the `Class` that indicates whether the cancer is benign (`Class` = `2`) or malignant (`Class` = `4`).

Because these values are not intuitive at first glance, and because the XGBoost model cannot read them properly (it expects class labels that start at `0`), we use label encoding to transform the dependent variable.

In [4]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [5]:
print(y[:10])

[0 0 0 0 0 1 0 0 0 0]


The encoding performed the following:
- benign `Class = 2` -> `Class = 0`
- malignant `Class = 4` -> `Class = 1`
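
The mapping above follows from how label encoding assigns codes: each distinct label receives its position in sorted order. Here is a dependency-free sketch of that idea. The function `label_encode` and the sample values are made up for illustration and are not the scikit-learn implementation.

```python
def label_encode(labels):
    """Map each distinct label to its index in sorted order."""
    classes = sorted(set(labels))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[label] for label in labels], classes

# Made-up sample of raw Class values (2 = benign, 4 = malignant)
raw = [2, 2, 4, 2, 4]
encoded, classes = label_encode(raw)

print(encoded)
print(classes)
```

In scikit-learn, the fitted encoder keeps the sorted original labels in `le.classes_`, and `le.inverse_transform` maps the codes back to them.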

## Splitting the dataset into the Training set and Test set

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [7]:
len(X_train)

546

In [8]:
len(X_test)

137
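
The two sizes are consistent with the test-set count being rounded up: 20% of 683 is 136.6, which becomes 137, and the remaining 546 rows form the training set. A small sketch of that arithmetic follows. The helper `split_sizes` is hypothetical and only mirrors the counts shown above.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), rounding the test count up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(683, 0.2)
print(n_train, n_test)
```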

## Feature scaling

For consistency, we apply feature scaling to all the classification models so that every feature is on the same scale and weighted equally. Note that several of the models, namely naive Bayes, decision tree, random forest, XGBoost, and CatBoost, are insensitive to feature scaling.

In [9]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
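
The cell above fits the scaler on the training set only and reuses the same statistics for the test set. The sketch below illustrates that pattern for a single feature using the standard library. It assumes standardization with the population standard deviation, and the function names and sample values are hypothetical.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training data."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Standardize values using statistics learned elsewhere."""
    return [(x - mean) / std for x in column]

# Made-up values for one feature
train_col = [1.0, 2.0, 3.0, 4.0, 5.0]
test_col = [3.0, 6.0]

mean, std = fit_scaler(train_col)               # statistics come from training data only
train_scaled = transform(train_col, mean, std)
test_scaled = transform(test_col, mean, std)    # test data reuses the training statistics

print(train_scaled)
print(test_scaled)
```

After scaling, the training column has mean 0 and standard deviation 1. The test column generally does not, because it is scaled with the training statistics rather than its own.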

## Defining each classification model as a function

To compare results between the classification models, we wrap each model in a function that fits it on the training set and returns the trained classifier.

In [10]:
# 1. Logistic Regression
def log_reg():
    from sklearn.linear_model import LogisticRegression
    classifier = LogisticRegression(random_state = 0)
    classifier.fit(X_train, y_train)
    return classifier

# 2. K-Nearest Neighbors
def k_nn():
    from sklearn.neighbors import KNeighborsClassifier
    # p = 2 for Euclidean (p = 1 for Manhattan)
    classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
    classifier.fit(X_train, y_train)
    return classifier

# 3. Support Vector Machine
def svm():
    from sklearn.svm import SVC
    classifier = SVC(kernel = 'linear', random_state = 0)
    classifier.fit(X_train, y_train)
    return classifier

# 4. Kernel SVM
def kernel_svm():
    from sklearn.svm import SVC
    classifier = SVC(kernel = 'rbf', random_state = 0)
    classifier.fit(X_train, y_train)   
    return classifier

# 5. Naive Bayes
def naive_bayes():
    from sklearn.naive_bayes import GaussianNB
    classifier = GaussianNB()
    classifier.fit(X_train, y_train)
    return classifier

# 6. Decision Tree
def decision_tree():
    from sklearn.tree import DecisionTreeClassifier
    classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
    classifier.fit(X_train, y_train)
    return classifier

# 7. Random Forest
def random_forest():
    from sklearn.ensemble import RandomForestClassifier
    classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
    classifier.fit(X_train, y_train)
    return classifier

# 8. XGBoost
def xgboost():
    from xgboost import XGBClassifier
    classifier = XGBClassifier()
    classifier.fit(X_train, y_train)
    return classifier

# 9. CatBoost
def catboost():
    from catboost import CatBoostClassifier
    # Suppress iteration output with 'Silent' logging level
    classifier = CatBoostClassifier(logging_level="Silent")
    classifier.fit(X_train, y_train)
    return classifier
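
The comment in the K-NN function notes that `p = 2` gives the Euclidean distance and `p = 1` the Manhattan distance. As a rough illustration of what the `p` parameter changes, here is a minimal sketch of the Minkowski distance. The function and the sample points are made up and are not the scikit-learn implementation.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

# Made-up points with three features each
u = [1.0, 5.0, 3.0]
v = [4.0, 1.0, 3.0]

manhattan = minkowski(u, v, p=1)  # 3 + 4 + 0
euclidean = minkowski(u, v, p=2)  # sqrt(3^2 + 4^2 + 0^2)

print(manhattan, euclidean)
```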

## Making the Confusion Matrix

In [11]:
def confusion_matrix(classifier):
    # Imported under an alias so the scikit-learn function is not confused
    # with this wrapper of the same name.
    from sklearn.metrics import confusion_matrix as sk_confusion_matrix, accuracy_score
    y_pred = classifier.predict(X_test)
    cm = sk_confusion_matrix(y_test, y_pred)
    print(f"Confusion matrix:\n{cm}")
    acc_score = accuracy_score(y_test, y_pred)
    print(f"Accuracy score: {acc_score}")

## Applying k-Fold Cross Validation

In [12]:
def k_fold_cv(classifier):
    cv = 10
    from sklearn.model_selection import cross_val_score
    accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = cv)
    print(f"With {cv}-fold cross val:")
    print(f"    Accuracy: {accuracies.mean()*100:.2f}")
    print(f"    Standard Deviation: {accuracies.std()*100:.2f}")
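
To make the procedure concrete, the sketch below shows how 546 training rows can be partitioned into 10 folds and how the per-fold accuracies are summarized. It is a simplified, unshuffled and unstratified version. For classifiers, `cross_val_score` with an integer `cv` uses stratified folds, so the actual fold membership differs. The function `k_fold_indices` and the accuracy values are hypothetical.

```python
from statistics import fmean, pstdev

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for plain, unshuffled k-fold."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(546, 10))
fold_sizes = [len(val_idx) for _, val_idx in folds]
print(fold_sizes)

# Hypothetical per-fold accuracies, only to show how the summary is computed
fold_accuracies = [0.96, 0.98, 0.95, 1.00, 0.96, 0.98, 0.95, 0.98, 0.96, 0.98]
mean_acc = fmean(fold_accuracies)
std_acc = pstdev(fold_accuracies)  # population std, like NumPy's default .std()
print(f"Accuracy: {mean_acc * 100:.2f}")
print(f"Standard Deviation: {std_acc * 100:.2f}")
```

Each row is used for validation exactly once and for training in the other nine folds, which is why the mean is a steadier estimate than a single test-set score.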

## Displaying performance results of each classification model

In [13]:
models = {
    "Logistic Regression": log_reg(), 
    "K-Nearest Neighbors": k_nn(),
    "Support Vector Machine": svm(), 
    "Kernel SVM": kernel_svm(),
    "Naive Bayes": naive_bayes(),
    "Decision Tree": decision_tree(),
    "Random Forest": random_forest(),
    "XGBoost": xgboost(),
    "CatBoost": catboost()
}

# Each dictionary value is an already-fitted classifier, not a function
for model_name, classifier in models.items():
    print(f"Model: {model_name}")
    confusion_matrix(classifier)
    k_fold_cv(classifier)
    print("-"*30)

Model: Logistic Regression
Confusion matrix:
[[84  3]
 [ 3 47]]
Accuracy score: 0.9562043795620438
With 10-fold cross val:
    Accuracy: 96.70
    Standard Deviation: 1.97
------------------------------
Model: K-Nearest Neighbors
Confusion matrix:
[[83  4]
 [ 2 48]]
Accuracy score: 0.9562043795620438
With 10-fold cross val:
    Accuracy: 96.70
    Standard Deviation: 1.79
------------------------------
Model: Support Vector Machine
Confusion matrix:
[[83  4]
 [ 2 48]]
Accuracy score: 0.9562043795620438
With 10-fold cross val:
    Accuracy: 97.07
    Standard Deviation: 2.19
------------------------------
Model: Kernel SVM
Confusion matrix:
[[82  5]
 [ 1 49]]
Accuracy score: 0.9562043795620438
With 10-fold cross val:
    Accuracy: 96.89
    Standard Deviation: 2.17
------------------------------
Model: Naive Bayes
Confusion matrix:
[[80  7]
 [ 0 50]]
Accuracy score: 0.948905109489051
With 10-fold cross val:
    Accuracy: 96.52
    Standard Deviation: 2.24
------------------------------
