# Breast Cancer Detection using CatBoost

**Scenario**: You receive a dataset which includes various cytological features on breast masses and whether the mass is benign (non-cancerous) or malignant (cancerous).

**Goal**: Develop a prediction model using CatBoost to determine whether a new mass being diagnosed for breast cancer is benign or malignant.

**Results**: 

The dataset has 683 rows and nine cytological features. About 65% of the masses are benign and the remaining 35% are malignant.

The dependent variable `Class` was label encoded to provide more intuitive values and because the XGBoost model used for comparison cannot properly read the default values (it expects class labels that start at `0`). The benign class value of `2` was relabeled `0` and the malignant class value of `4` was relabeled `1`.

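Out of the 137 rows in the single test set, with malignant treated as the positive class, the predictions were:
- True Positive: 36% (50 malignant masses correctly identified)
- True Negative: 61% (84 benign masses correctly identified)
- False Positive: 2% (3 benign masses predicted as malignant)
- False Negative: 0% (no masses)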

This resulted in an accuracy score of 97.8%, which is higher than XGBoost's accuracy score of 97% on the same test set.

Applying k=`10` folds for cross validation, we obtain an accuracy score of 96.53% with a standard deviation of 2.50%. This is a high-performing model that provides accurate diagnoses; however, it scores slightly lower than XGBoost, which had an accuracy score of 96.89% with a standard deviation of 2.17%. The gap of 0.36 percentage points is well within one standard deviation of either model.
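
To show how a cross-validated accuracy and its standard deviation are produced, here is a minimal sketch. It is not the notebook's run: the data comes from `make_classification` (synthetic, not the breast cancer dataset) and `LogisticRegression` is a stand-in estimator chosen so the example runs without CatBoost installed. The names `X_demo`, `y_demo`, `model`, and `cv` are illustrative. When `cv` is an integer and the estimator is a classifier, scikit-learn already uses stratified folds; the sketch just makes that explicit.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (NOT the breast cancer dataset) with nine features.
X_demo, y_demo = make_classification(
    n_samples=546, n_features=9, n_informative=5, random_state=0
)

# Stand-in estimator; a CatBoostClassifier could be passed here instead.
model = LogisticRegression(max_iter=1000)

# Ten stratified folds: each fold keeps roughly the same class balance.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

# One accuracy per fold; the summary reports their mean and standard deviation.
print(f"Accuracy: {scores.mean() * 100:.2f}")
print(f"Standard Deviation: {scores.std() * 100:.2f}")
```

Because each fold is scored on rows the model did not train on, the mean is usually a steadier estimate than a single train/test split, and the standard deviation indicates how much the score moves from fold to fold.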


## Importing the libraries

In [1]:
import numpy as np
import pandas as pd
import warnings

warnings.filterwarnings("ignore")

## Importing the dataset

We disregard the first column `Sample code number` as an independent variable since it only identifies the patient and provides no substantial measure for assessing whether the mass is benign or malignant. Therefore, the subset `X` does not include this column.

In [2]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
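
To make the slicing concrete, here is a minimal sketch on a tiny made-up frame. Only `Sample code number` and `Class` are column names taken from the dataset; `feature_a`, `feature_b`, and every value are hypothetical.

```python
import pandas as pd

# Tiny illustrative frame; the values and the two feature names are made up.
toy = pd.DataFrame({
    "Sample code number": [1001, 1002, 1003],
    "feature_a": [5, 3, 8],
    "feature_b": [1, 1, 7],
    "Class": [2, 2, 4],
})

# 1:-1 keeps the feature columns: the ID column (first) and the label (last) are skipped.
X_toy = toy.iloc[:, 1:-1].values
# -1 selects the last column, which is the label.
y_toy = toy.iloc[:, -1].values

print(X_toy.shape)  # (3, 2)
print(y_toy)        # [2 2 4]
```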

In [3]:
X.shape

(683, 9)

The dataset has 683 rows of data and nine independent features. Since this is a smaller dataset, we'll use an 80/20 split for the training set and test set.

In [4]:
diag = pd.DataFrame(y)
diag_counts = diag.value_counts()
diag_percents = diag.value_counts(normalize=True)
print(diag_counts)
print(diag_percents)

2    444
4    239
Name: count, dtype: int64
2    0.650073
4    0.349927
Name: proportion, dtype: float64


65% (444) of the masses in the dataset are benign, while the remaining 35% (239) are malignant.

## Label encoding the dependent variable 

The dependent variable `y` is the `Class` column, which indicates whether the mass is benign (`Class` = `2`) or malignant (`Class` = `4`).

Because these values are non-intuitive at first glance, we use label encoding to transform the dependent variable.

In [5]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [6]:
print(y[:10])

[0 0 0 0 0 1 0 0 0 0]


The encoding performed the following:
- benign `Class = 2` -> `Class = 0`
- malignant `Class = 4` -> `Class = 1`
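
The mapping can also be read from the encoder itself rather than inferred from the output. The sketch below uses a short made-up label array (`raw`) and a separate encoder (`le_demo`) so it does not depend on the notebook's state.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Raw labels as they appear in the dataset: 2 = benign, 4 = malignant.
raw = np.array([2, 2, 4, 2, 4])

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(raw)

# classes_ holds the sorted original labels; each label's index is its encoded value.
mapping = {int(c): i for i, c in enumerate(le_demo.classes_)}
print(mapping)   # {2: 0, 4: 1}
print(encoded)   # [0 0 1 0 1]

# inverse_transform converts encoded values back to the original labels.
restored = le_demo.inverse_transform(encoded)
```

Keeping the fitted encoder around is useful when predictions need to be reported in the dataset's original `2`/`4` coding.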

## Splitting the dataset into the Training set and Test set

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
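
As a sanity check on the split sizes, here is a sketch using placeholder arrays with the same shape and class counts as the dataset (the feature values are zeros, not the real data). The `stratify` argument is an optional addition that the cell above does not use; it keeps the benign/malignant ratio roughly the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: same shape and class counts as the dataset, not the real values.
X_demo = np.zeros((683, 9))
y_demo = np.array([0] * 444 + [1] * 239)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# 20% of 683 is rounded up to 137 test rows, leaving 546 for training.
print(X_tr.shape, X_te.shape)  # (546, 9) (137, 9)

# With stratification, the malignant share stays close to 35% in both subsets.
print(round(float(y_tr.mean()), 2), round(float(y_te.mean()), 2))
```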

## Training CatBoost on the Training set

In [8]:
from catboost import CatBoostClassifier
classifier = CatBoostClassifier()
classifier.fit(X_train, y_train)

Learning rate set to 0.007956
0:	learn: 0.6773046	total: 60.9ms	remaining: 1m
1:	learn: 0.6597974	total: 61.3ms	remaining: 30.6s
2:	learn: 0.6444870	total: 61.7ms	remaining: 20.5s
3:	learn: 0.6292694	total: 62.1ms	remaining: 15.5s
4:	learn: 0.6167805	total: 62.6ms	remaining: 12.4s
5:	learn: 0.6006774	total: 63ms	remaining: 10.4s
6:	learn: 0.5863012	total: 63.4ms	remaining: 8.99s
7:	learn: 0.5719487	total: 63.8ms	remaining: 7.91s
8:	learn: 0.5584160	total: 64.1ms	remaining: 7.06s
9:	learn: 0.5444027	total: 64.5ms	remaining: 6.39s
10:	learn: 0.5324566	total: 64.9ms	remaining: 5.83s
11:	learn: 0.5230007	total: 65.2ms	remaining: 5.37s
12:	learn: 0.5107660	total: 65.6ms	remaining: 4.98s
13:	learn: 0.4993511	total: 66ms	remaining: 4.65s
14:	learn: 0.4890182	total: 66.4ms	remaining: 4.36s
15:	learn: 0.4798579	total: 66.7ms	remaining: 4.1s
16:	learn: 0.4688945	total: 67.1ms	remaining: 3.88s
17:	learn: 0.4603258	total: 67.5ms	remaining: 3.68s
18:	learn: 0.4500329	total: 67.9ms	remaining: 3.5s
...

<catboost.core.CatBoostClassifier at 0x12fa059d0>

## Making the Confusion Matrix

In [9]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(f"Confusion matrix:\n{cm}")
acc_score = accuracy_score(y_test, y_pred)
print(f"Accuracy score: {acc_score}")

Confusion matrix:
[[84  3]
 [ 0 50]]
Accuracy score: 0.9781021897810219
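
To read the matrix cell by cell, the sketch below re-enters the reported values by hand (as `cm_demo`) and derives the usual metrics. It assumes malignant (`1`) is the positive class; scikit-learn orders rows by true label and columns by predicted label.

```python
import numpy as np

# Confusion matrix reported above: rows are true labels, columns are predictions,
# ordered as [0 = benign, 1 = malignant].
cm_demo = np.array([[84, 3],
                    [0, 50]])

# With malignant (1) as the positive class, ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()   # share of all predictions that are correct
recall = tp / (tp + fn)                # malignant masses that were caught
precision = tp / (tp + fp)             # predicted-malignant masses that are malignant
specificity = tn / (tn + fp)           # benign masses correctly cleared

print(f"accuracy={accuracy:.4f} recall={recall:.4f} "
      f"precision={precision:.4f} specificity={specificity:.4f}")
# accuracy=0.9781 recall=1.0000 precision=0.9434 specificity=0.9655
```

For a diagnostic task, recall on the malignant class is often the metric of most interest, since a false negative means a missed cancer.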


In [10]:
len(X_test)

137

There are 137 rows in the test set.

We then normalize the confusion matrix by the total number of test rows to see the breakdown of the predictions as proportions.

In [11]:
cm_norm = np.round(cm.astype('float') / cm.sum(), 2)
print("Normalized confusion matrix:")
print(cm_norm)

Normalized confusion matrix:
[[0.61 0.02]
 [0.   0.36]]
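
Dividing by the grand total is one of several possible normalizations. As an alternative sketch (not part of the original analysis), each row can be divided by its own total so the entries read as per-class rates. The matrix is re-entered by hand as `cm_demo`.

```python
import numpy as np

cm_demo = np.array([[84, 3],
                    [0, 50]])

# Normalizing by the grand total: all four cells together sum to 1.
by_total = cm_demo / cm_demo.sum()

# Normalizing each row by its own total: each row sums to 1, so the entries
# read as per-class rates (row 0 = benign, row 1 = malignant).
by_row = cm_demo / cm_demo.sum(axis=1, keepdims=True)

print(np.round(by_row, 2))
```

Row-normalized, about 97% of benign masses and all malignant masses in the test set are classified correctly. scikit-learn's `confusion_matrix` accepts `normalize='true'` or `normalize='all'` to produce these two variants directly.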


## Applying k-Fold Cross Validation

In [12]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print(f"Accuracy: {accuracies.mean()*100:.2f}")
print(f"Standard Deviation: {accuracies.std()*100:.2f}")

Learning rate set to 0.007604
0:	learn: 0.6773604	total: 1.24ms	remaining: 1.24s
1:	learn: 0.6638264	total: 1.64ms	remaining: 817ms
2:	learn: 0.6487767	total: 2.07ms	remaining: 688ms
3:	learn: 0.6331716	total: 2.52ms	remaining: 629ms
4:	learn: 0.6208046	total: 2.83ms	remaining: 562ms
5:	learn: 0.6054732	total: 3.15ms	remaining: 521ms
6:	learn: 0.5903625	total: 3.54ms	remaining: 502ms
7:	learn: 0.5763344	total: 3.82ms	remaining: 473ms
8:	learn: 0.5653778	total: 4.1ms	remaining: 451ms
9:	learn: 0.5531434	total: 4.37ms	remaining: 433ms
10:	learn: 0.5417318	total: 4.64ms	remaining: 417ms
11:	learn: 0.5322945	total: 4.86ms	remaining: 400ms
12:	learn: 0.5185135	total: 5.62ms	remaining: 427ms
13:	learn: 0.5075524	total: 5.93ms	remaining: 417ms
14:	learn: 0.4967004	total: 6.21ms	remaining: 408ms
15:	learn: 0.4866503	total: 6.43ms	remaining: 395ms
16:	learn: 0.4758187	total: 6.7ms	remaining: 388ms
17:	learn: 0.4667132	total: 6.98ms	remaining: 381ms
18:	learn: 0.4566130	total: 7.42ms	remaining: 

146:	learn: 0.0893408	total: 48.9ms	remaining: 284ms
147:	learn: 0.0886783	total: 49.3ms	remaining: 284ms
148:	learn: 0.0879841	total: 49.6ms	remaining: 283ms
149:	learn: 0.0874665	total: 49.9ms	remaining: 283ms
150:	learn: 0.0867642	total: 50.2ms	remaining: 282ms
151:	learn: 0.0861648	total: 50.5ms	remaining: 282ms
152:	learn: 0.0855309	total: 50.8ms	remaining: 281ms
153:	learn: 0.0849681	total: 51.1ms	remaining: 281ms
154:	learn: 0.0845019	total: 51.4ms	remaining: 280ms
155:	learn: 0.0839200	total: 51.7ms	remaining: 280ms
156:	learn: 0.0833843	total: 52ms	remaining: 279ms
157:	learn: 0.0829767	total: 52.3ms	remaining: 279ms
158:	learn: 0.0825572	total: 52.6ms	remaining: 278ms
159:	learn: 0.0819991	total: 52.9ms	remaining: 278ms
160:	learn: 0.0814320	total: 53.3ms	remaining: 278ms
161:	learn: 0.0807874	total: 53.6ms	remaining: 277ms
162:	learn: 0.0802422	total: 53.8ms	remaining: 277ms
163:	learn: 0.0798162	total: 54.1ms	remaining: 276ms
164:	learn: 0.0791730	total: 54.4ms	remaining: 2