# Breast Cancer Detection using XGBoost

**Scenario**: You receive a dataset containing cytological features of breast masses, along with a label indicating whether each mass is benign (non-cancerous) or malignant (cancerous).

**Goal**: Develop a prediction model using XGBoost to determine whether a new mass being diagnosed for breast cancer is benign or malignant.

**Results**: 

The dataset has 683 rows and nine cytological features. 65% of the masses are benign and the remaining 35% are malignant.

The dependent variable `Class` was label encoded for two reasons: the encoded values are more intuitive, and recent versions of `XGBClassifier` expect class labels to be consecutive integers starting at `0`. The benign class value of `2` was relabeled `0` and the malignant class value of `4` was relabeled `1`.

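Treating malignant (`1`) as the positive class, the predictions on the 137 rows of the single test set were:
- True Negative (benign predicted as benign): 61% (84 masses)
- True Positive (malignant predicted as malignant): 36% (49 masses)
- False Positive (benign predicted as malignant): 2% (3 masses)
- False Negative (malignant predicted as benign): 1% (1 mass)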

This resulted in an accuracy score of 97%.

Using 10-fold cross-validation on the training set, we obtain a mean accuracy of 96.89% with a standard deviation of 2.17%. This is consistent with the single test-set result and suggests the model's performance is stable across different subsets of the data.


## Importing the libraries

In [1]:
import numpy as np
import pandas as pd
import warnings

warnings.filterwarnings("ignore")

## Importing the dataset

We disregard the first column `Sample code number` as an independent variable since it only identifies the patient and provides no substantial measure in assessing whether the cancer is benign or malignant. Therefore, the subset `X` does not include this column.

In [2]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values

In [3]:
X.shape

(683, 9)

The dataset has 683 rows and nine independent features. Because the dataset is small, we use an 80/20 split between the training set and the test set.

In [4]:
diag = pd.DataFrame(y)
diag_counts = diag.value_counts()
diag_percents = diag.value_counts(normalize=True)
print(diag_counts)
print(diag_percents)

2    444
4    239
Name: count, dtype: int64
2    0.650073
4    0.349927
Name: proportion, dtype: float64


65% (444) of the masses in the dataset are benign, while the remaining 35% (239) are malignant.
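The class balance also sets a floor for judging accuracy. As a minimal sketch, the snippet below rebuilds the label column from the counts printed above (the array itself is reconstructed for illustration, not read from `Data.csv`) and computes the accuracy a trivial model would reach by always predicting the majority class.

```python
import numpy as np

# Rebuild the label column from the printed counts (illustrative reconstruction):
# 444 benign rows (Class = 2) and 239 malignant rows (Class = 4).
labels = np.array([2] * 444 + [4] * 239)

# Count each class and convert the counts to proportions.
values, counts = np.unique(labels, return_counts=True)
proportions = counts / counts.sum()

# Accuracy of a trivial model that always predicts the most common class.
majority_baseline = proportions.max()

for value, count, share in zip(values, counts, proportions):
    print(f"Class {value}: {count} rows ({share:.1%})")
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any model worth keeping should beat this baseline by a clear margin.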

## Label encoding the dependent variable 

The dependent variable `y` is the `Class` that indicates whether the mass is benign (`Class` = `2`) or malignant (`Class` = `4`).

Because these values are not intuitive at first glance, and because recent versions of `XGBClassifier` expect class labels to be consecutive integers starting at `0`, we use label encoding to transform the dependent variable.

In [5]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [6]:
print(y[:10])

[0 0 0 0 0 1 0 0 0 0]


The encoding performed the following:
- benign `Class = 2` -> `Class = 0`
- malignant `Class = 4` -> `Class = 1`
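To check the mapping rather than infer it from the output, one option is to inspect the fitted encoder. The sketch below uses a small illustrative array (`raw` is a made-up sample, not the real column): `classes_` lists the original labels in sorted order, so each label's position is its encoded value, and `inverse_transform` recovers the original labels.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative sample of raw Class values (not the real dataset).
raw = np.array([2, 2, 4, 2, 4])

encoder = LabelEncoder()
encoded = encoder.fit_transform(raw)

# classes_ holds the sorted original labels; index i is the encoded value.
print(encoder.classes_)                    # [2 4]
print(encoded)                             # [0 0 1 0 1]

# inverse_transform maps encoded values back to the original labels.
print(encoder.inverse_transform(encoded))  # [2 2 4 2 4]
```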

## Splitting the dataset into the Training set and Test set

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
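As a quick sanity check on the split sizes, the sketch below runs the same call on placeholder arrays with the dataset's shape (`X_demo` and `y_demo` are stand-ins filled with dummy values, not the real features). With 683 rows and `test_size=0.2`, scikit-learn rounds the test set up to 137 rows and leaves 546 for training.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the real data (dummy values).
X_demo = np.zeros((683, 9))
y_demo = np.array([0] * 444 + [1] * 239)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The test set size is ceil(0.2 * 683) = 137; the rest go to training.
print(X_tr.shape, X_te.shape)

# Share of malignant rows in each part; a plain random split only
# approximately preserves the overall 35% share.
print(f"train: {y_tr.mean():.3f}, test: {y_te.mean():.3f}")
```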

In [8]:
print(X_train)

[[10  1  1 ...  5  4  1]
 [ 1  1  1 ...  3  1  1]
 [ 5  1  1 ...  3  1  1]
 ...
 [ 1  1  1 ...  1  1  1]
 [ 3  1  1 ...  2  1  1]
 [10  9  7 ...  7  7  1]]


In [9]:
print(X_test)

[[ 1  1  1 ...  1  1  1]
 [ 3  1  1 ...  2  1  1]
 [ 5  5  5 ...  4  3  1]
 ...
 [ 4  1  1 ...  1  1  1]
 [ 4 10  4 ...  9 10  1]
 [ 2  1  1 ...  2  1  1]]


In [10]:
print(y_train)

[1 0 0 0 1 0 0 0 0 1 0 1 0 1 1 0 0 0 0 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0 1 1 1
 0 0 0 1 0 0 1 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0
 1 0 0 0 0 1 1 0 0 1 1 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 1 0 0 1
 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 1 1 0 1 0 0 1 0 0 0 0 0 0 0
 1 1 1 1 0 1 0 1 0 1 1 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0 0 1 0 0 0 0 1 1 0 0 0
 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 1 1 0 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0
 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0
 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 1 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 1
 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 1 0 1 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 1
 0 0 0 1 0 0 1 1 1 0 1 1 1 0 0 0 1 0 1 0 0 1 0 1 1 1 0 0 0 1 0 1 1 1 0 0 0
 1 0 1 0 0 0 0 1 1 0 0 0 1 1 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0
 0 0 1 1 0 0 1 1 0 0 1 1 1 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 0 1 0 1
 0 1 0 0 1 0 0 0 0 1 0 0 

In [11]:
print(y_test)

[0 0 1 1 0 0 0 1 0 0 1 0 1 0 0 0 1 1 1 0 0 0 1 0 1 1 0 0 0 1 0 1 1 0 0 0 1
 1 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0
 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1 0 1 0 1 1 1 0 1 1 1 0 0 0 1 1 0 0 1 1 0 0 1
 0 0 1 0 0 0 1 0 0 0 1 0 0 1 1 0 1 0 1 0 0 1 0 0 1 0]


## Training XGBoost on the Training set

In [12]:
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(X_train, y_train)

## Making the Confusion Matrix

In [13]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
# Rows are true classes, columns are predicted classes (0 = benign, 1 = malignant)
cm = confusion_matrix(y_test, y_pred)
print(f"Confusion matrix:\n{cm}")
acc_score = accuracy_score(y_test, y_pred)
print(f"Accuracy score: {acc_score}")

Confusion matrix:
[[84  3]
 [ 1 49]]
Accuracy score: 0.9708029197080292
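Accuracy alone hides which kind of error the model makes, which matters in a diagnostic setting. As a sketch, the snippet below re-enters the confusion matrix printed above and unpacks it with `ravel()`. It assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and treats malignant (`1`) as the positive class.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: true class (0 = benign, 1 = malignant); columns: predicted class.
cm = np.array([[84, 3],
               [1, 49]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
sensitivity = tp / (tp + fn)   # share of malignant masses that were caught
specificity = tn / (tn + fp)   # share of benign masses that were cleared
precision = tp / (tp + fp)     # share of malignant predictions that were right

print(f"Accuracy:    {accuracy:.4f}")
print(f"Sensitivity: {sensitivity:.4f}")
print(f"Specificity: {specificity:.4f}")
print(f"Precision:   {precision:.4f}")
```

Sensitivity is usually the figure to watch here, since a false negative means a malignant mass is reported as benign.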


In [14]:
len(X_test)

137

There are 137 rows in the test set.

We then normalize the confusion matrix by dividing each cell by the total number of test rows, so the breakdown of the predictions reads as proportions of the test set.

In [15]:
cm_norm = np.round(cm.astype('float') / cm.sum(), 2)
print("Normalized confusion matrix:")
print(cm_norm)

Normalized confusion matrix:
[[0.61 0.02]
 [0.01 0.36]]



## Applying k-Fold Cross Validation

In [16]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print(f"Accuracy: {accuracies.mean()*100:.2f}")
print(f"Standard Deviation: {accuracies.std()*100:.2f}")

Accuracy: 96.89
Standard Deviation: 2.17
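To make the fold mechanics visible, the sketch below spells out the splitter explicitly. When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds, so passing `StratifiedKFold(n_splits=10)` is equivalent. Two things here are assumptions for illustration only: the data is synthetic (generated with `make_classification` to match the training set's shape), and a `DecisionTreeClassifier` stands in for `XGBClassifier` so the example depends only on scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 546 rows and 9 features,
# the same shape as the training set in this notebook.
X_demo, y_demo = make_classification(
    n_samples=546, n_features=9, n_informative=5, random_state=0
)

# Stand-in model (assumption): any scikit-learn-compatible classifier,
# including XGBClassifier, can be passed in the same way.
model = DecisionTreeClassifier(random_state=0)

# Ten stratified folds: each fold keeps roughly the same class balance.
cv = StratifiedKFold(n_splits=10)
scores = cross_val_score(model, X_demo, y_demo, cv=cv)

# Each row is held out exactly once, so the fold sizes sum to 546.
fold_sizes = [len(test_idx) for _, test_idx in cv.split(X_demo, y_demo)]
print(f"Fold sizes: {fold_sizes}")
print(f"Accuracy: {scores.mean()*100:.2f}")
print(f"Standard Deviation: {scores.std()*100:.2f}")
```

The numbers printed by this sketch describe the synthetic data and the stand-in model, not the breast cancer dataset.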
