Author: **Johny Ijaq**

# BREAST CANCER CLASSIFICATION

### Dataset Description

This is a classic breast cancer dataset from UCI machinery repository. In this dataset each row corresponds to a patient and each column correspnds to different features that determine whether the tumor is benign or malignant. If the tumour is benign, class takes the value of two and if the tumour is malignant, class takes a value of four.

**Objective:** Main aim of this problem is to deploy different classification models and figure out which one performed better in predicting whether the given tumour is benign or malignant.  

## 1. Importing Libraries

In [63]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

## 2. Importing Dataset

In [2]:
dataset = pd.read_csv("Data.csv")

In [3]:
dataset.head()

Unnamed: 0,Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 683 entries, 0 to 682
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   Sample code number           683 non-null    int64
 1   Clump Thickness              683 non-null    int64
 2   Uniformity of Cell Size      683 non-null    int64
 3   Uniformity of Cell Shape     683 non-null    int64
 4   Marginal Adhesion            683 non-null    int64
 5   Single Epithelial Cell Size  683 non-null    int64
 6   Bare Nuclei                  683 non-null    int64
 7   Bland Chromatin              683 non-null    int64
 8   Normal Nucleoli              683 non-null    int64
 9   Mitoses                      683 non-null    int64
 10  Class                        683 non-null    int64
dtypes: int64(11)
memory usage: 58.8 KB


In [5]:
dataset.shape

(683, 11)

In [6]:
dataset.describe()

Unnamed: 0,Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
count,683.0,683.0,683.0,683.0,683.0,683.0,683.0,683.0,683.0,683.0,683.0
mean,1076720.0,4.442167,3.150805,3.215227,2.830161,3.234261,3.544656,3.445095,2.869693,1.603221,2.699854
std,620644.0,2.820761,3.065145,2.988581,2.864562,2.223085,3.643857,2.449697,3.052666,1.732674,0.954592
min,63375.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0
25%,877617.0,2.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,2.0
50%,1171795.0,4.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0,2.0
75%,1238705.0,6.0,5.0,5.0,4.0,4.0,6.0,5.0,4.0,1.0,4.0
max,13454350.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,4.0


In [7]:
dataset.isnull().sum()

Sample code number             0
Clump Thickness                0
Uniformity of Cell Size        0
Uniformity of Cell Shape       0
Marginal Adhesion              0
Single Epithelial Cell Size    0
Bare Nuclei                    0
Bland Chromatin                0
Normal Nucleoli                0
Mitoses                        0
Class                          0
dtype: int64

In [8]:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [9]:
print(X)

[[1000025       5       1 ...       3       1       1]
 [1002945       5       4 ...       3       2       1]
 [1015425       3       1 ...       3       1       1]
 ...
 [ 888820       5      10 ...       8      10       2]
 [ 897471       4       8 ...      10       6       1]
 [ 897471       4       8 ...      10       4       1]]


In [10]:
print(y)

[2 2 2 2 2 4 2 2 2 2 2 2 4 2 4 4 2 2 4 2 4 4 2 2 4 2 2 2 2 2 2 4 2 2 2 4 2
 4 4 4 4 4 4 2 4 2 2 4 4 4 4 4 4 4 4 4 4 4 4 2 4 4 2 4 2 4 4 2 2 4 2 4 4 2
 2 2 2 2 2 2 2 2 4 4 4 4 2 2 2 2 2 2 2 2 2 2 4 4 4 4 2 4 4 4 4 4 2 4 2 4 4
 4 2 2 2 4 2 2 2 2 4 4 4 2 4 2 4 2 2 2 4 2 2 2 2 2 2 2 2 4 2 2 4 2 2 4 2 4
 4 2 2 4 2 2 4 4 2 2 2 2 4 4 2 2 2 2 2 4 4 4 2 4 2 4 2 2 2 4 4 2 4 4 4 2 4
 4 2 2 2 2 2 2 2 2 4 4 2 2 2 4 4 2 2 2 4 4 2 4 4 4 2 2 4 2 2 4 4 4 4 2 4 4
 2 4 4 4 2 4 2 4 4 4 4 2 2 2 2 2 2 4 4 2 2 4 2 4 4 4 2 2 2 2 4 4 4 4 4 2 4
 4 4 2 4 2 4 4 2 2 2 2 4 2 2 4 4 4 4 4 2 4 4 2 2 4 4 2 2 4 4 2 4 2 4 4 2 2
 4 2 2 2 4 2 2 4 4 2 2 4 2 4 2 2 4 2 4 4 4 2 2 4 4 2 4 2 2 4 4 2 2 2 4 2 2
 2 4 4 2 2 2 4 2 2 4 4 4 4 4 4 2 2 2 2 4 4 2 2 2 2 2 2 2 2 2 2 2 2 2 4 2 2
 2 2 4 2 2 2 2 4 2 2 2 2 2 2 2 2 4 2 2 2 2 2 2 2 2 2 2 4 2 4 2 4 2 2 2 2 4
 2 2 2 4 2 4 2 2 2 2 2 2 2 4 4 2 2 2 4 2 2 2 2 2 2 2 2 4 2 2 2 4 2 4 4 4 2
 2 2 2 2 2 2 4 4 4 2 2 2 2 2 2 2 2 2 2 2 4 2 2 4 4 2 2 2 4 4 4 2 4 2 4 2 2
 2 2 2 2 2 2 2 2 2 2 4 2 

## 3. Splitting the Dataset into the Training set and Test set

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 40)

In [12]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(546, 10)
(137, 10)
(546,)
(137,)


## 4. Feature Scaling

In [13]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.fit_transform(X_test)

In [14]:
print(X_train)

[[ 0.58169832  2.02339612  2.2251362  ... -0.19098898  0.69366865
  -0.33423456]
 [ 0.58252832 -0.13858878 -0.70631669 ... -0.99623981 -0.61686105
  -0.33423456]
 [-2.70801546  0.22174204  2.2251362  ...  2.62738894  1.02130107
   1.97579166]
 ...
 [-0.10772048 -0.85925041 -0.70631669 ... -0.19098898 -0.61686105
  -0.33423456]
 [ 0.6109826  -1.21958122 -0.70631669 ... -0.5936144  -0.61686105
  -0.33423456]
 [-0.82630647  2.02339612 -0.05488272 ...  1.41951269  1.02130107
  -0.33423456]]


In [15]:
print(X_test)

[[ 2.72378104e-02 -2.25754806e-01 -6.85929021e-01 ... -5.78597638e-01
  -5.97316414e-01 -4.05688531e-01]
 [ 1.08528169e-02 -1.23428986e+00 -6.85929021e-01 ... -1.01413542e+00
  -5.97316414e-01  3.06801952e+00]
 [ 4.33939854e-03  1.79131531e+00  2.28161653e+00 ...  2.03462906e+00
   2.36047932e+00 -4.05688531e-01]
 ...
 [ 8.12199415e-02 -1.23428986e+00 -6.85929021e-01 ... -1.43059855e-01
  -5.97316414e-01 -4.05688531e-01]
 [ 2.83992661e-02  1.10423546e-01 -6.85929021e-01 ... -1.43059855e-01
  -5.97316414e-01 -4.05688531e-01]
 [ 9.80167883e+00 -1.23428986e+00 -6.85929021e-01 ... -5.78597638e-01
  -5.97316414e-01 -4.05688531e-01]]


## 5. Training

### 5.1. Logistic Regression

In [73]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(random_state = 0)
log_reg.fit(X_train, y_train)


LogisticRegression(random_state=0)

### 5.2. SVM

In [17]:
from sklearn.svm import SVC
svm_classifier = SVC(kernel = "linear", random_state = 0)
svm_classifier.fit(X_train, y_train)

SVC(kernel='linear', random_state=0)

### 5.3. KNN

In [18]:
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors = 5, metric = "minkowski", p = 2)
knn_classifier.fit(X_train, y_train)

KNeighborsClassifier()

### 5.4. Kernel SVM

In [19]:
from sklearn.svm import SVC
kernel_svm = SVC(kernel = "rbf", random_state = 0)
kernel_svm.fit(X_train, y_train)

SVC(random_state=0)

### 5.5. Naive Bayes

In [20]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X_train, y_train)

GaussianNB()

### 5.6. Decision Tree Classifier

In [21]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(criterion = "entropy", random_state = 0)
dt.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy', random_state=0)

### 5.7. Random Forest Classification

In [22]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 10, criterion = "entropy", random_state = 0)
rf.fit(X_train, y_train)

RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=0)

### 5.8. XGBoost

In [28]:
#conda install -c conda-forge xgboost

In [26]:
!pip install xgboost

Collecting xgboost
  Downloading xgboost-1.4.1-py3-none-win_amd64.whl (97.8 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.4.1


In [27]:
from xgboost import XGBClassifier
xgb = XGBClassifier() 
xgb.fit(X_train, y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=4, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

# 6.Testing

### 6.1. Logistic Regression

In [42]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred_log_reg = log_reg.predict(X_test)
cm = confusion_matrix(y_test, y_pred_log_reg)
print(cm)
accuracy_score(y_test, y_pred_log_reg)

[[87  0]
 [ 4 46]]


0.9708029197080292

### 6.2. SVM

In [30]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred_svm_classifier = svm_classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred_svm_classifier)
print(cm)
accuracy_score(y_test, y_pred_svm_classifier)

[[86  1]
 [ 4 46]]


0.9635036496350365

#### Applying K-Fold Cross-Validation

In [48]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = svm_classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %" .format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %" .format(accuracies.std()*100))

Accuracy: 96.70 %
Standard Deviation: 1.61 %


### 6.3. KNN

In [31]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred_knn_classifier = knn_classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred_knn_classifier)
print(cm)
accuracy_score(y_test, y_pred_knn_classifier)

[[86  1]
 [ 5 45]]


0.9562043795620438

### 6.4. Kernel SVM

In [32]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred_kernel_svm = kernel_svm.predict(X_test)
cm = confusion_matrix(y_test, y_pred_kernel_svm)
print(cm)
accuracy_score(y_test, y_pred_kernel_svm)

[[83  4]
 [ 4 46]]


0.9416058394160584

### 6.5. Naive Bayes

In [33]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred_nb = nb.predict(X_test)
cm = confusion_matrix(y_test, y_pred_nb)
print(cm)
accuracy_score(y_test, y_pred_nb)

[[83  4]
 [ 4 46]]


0.9416058394160584

### 6.6. Decision Tree Classifier

In [34]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred_dt = dt.predict(X_test)
cm = confusion_matrix(y_test, y_pred_dt)
print(cm)
accuracy_score(y_test, y_pred_dt)

[[85  2]
 [ 8 42]]


0.927007299270073

### 6.7. Random Forest

In [35]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred_rf = rf. predict(X_test)
cm = confusion_matrix(y_test, y_pred_rf)
print(cm)
accuracy_score(y_test, y_pred_rf)

[[86  1]
 [ 4 46]]


0.9635036496350365

### 6.8. XGBoost

In [36]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred_xgb = xgb.predict(X_test)
cm = confusion_matrix(y_test, y_pred_xgb)
print(cm)
accuracy_score(y_test, y_pred_xgb)

[[86  1]
 [ 5 45]]


0.9562043795620438

In [57]:
print("Accuracy from Logistic Regression:       {:.2f}" .format(accuracy_score(y_test, y_pred_log_reg)*100))
print("Accuracy from SVM Classifier:            {:.2f}" .format(accuracy_score(y_test, y_pred_svm_classifier)*100))
print("Accuracy from KNN Classifier:            {:.2f}" .format (accuracy_score(y_test, y_pred_knn_classifier)*100))
print("Accuracy from Kernel SVM:                {:.2f}" .format(accuracy_score(y_test, y_pred_kernel_svm)*100))
print("Accuracy from Naive Bayes Classifier:    {:.2f}" .format (accuracy_score(y_test, y_pred_nb)*100))
print("Accuracy from Decision Tree Classifier:  {:.2f}" .format(accuracy_score(y_test, y_pred_dt)*100))
print("Accuracy from Random Forest Classifier:  {:.2f}" .format(accuracy_score(y_test, y_pred_rf)*100))
print("Accuracy from XGBoost Classifier:        {:.2f}" .format(accuracy_score(y_test, y_pred_xgb)*100))

Accuracy from Logistic Regression:       97.08
Accuracy from SVM Classifier:            96.35
Accuracy from KNN Classifier:            95.62
Accuracy from Kernel SVM:                94.16
Accuracy from Naive Bayes Classifier:    94.16
Accuracy from Decision Tree Classifier:  92.70
Accuracy from Random Forest Classifier:  96.35
Accuracy from XGBoost Classifier:        95.62


Logistic regression has shown highest accuracy followed by SVM and Random Forest. 

## 7. Applying K-Fold Cross Validation

In [75]:
from sklearn.model_selection import cross_val_score

#logistic regression
accuracies_log_reg = cross_val_score(estimator = log_reg, X = X_train, y = y_train, cv = 10)

#SVM
accuracies_svm = cross_val_score(estimator = svm_classifier, X = X_train, y = y_train, cv = 10)

#knn
accuracies_knn = cross_val_score(estimator = knn_classifier, X = X_train, y = y_train, cv = 10)

#Kernel SVM
accuracies_kernel_svm = cross_val_score(estimator = kernel_svm, X = X_train, y = y_train, cv = 10)

#Naive Bayes
accuracies_nb = cross_val_score(estimator = nb, X = X_train, y = y_train, cv = 10)

#Decision Tree Classifier
accuracies_dt = cross_val_score(estimator = dt, X = X_train, y = y_train, cv = 10)

#Random Forest
accuracies_rf = cross_val_score(estimator = rf, X = X_train, y = y_train, cv = 10)

#XGBoost
accuracies_xgb = cross_val_score(estimator = xgb, X= X_train, y = y_train, cv = 10)

print("\n")
print("Accuracy from Logistic Regression:       {:.2f} %" .format(accuracies_log_reg.mean()*100))
print("Standard Deviation:                      {:.2f} %" .format(accuracies_log_reg.std()*100))
print("\n")
print("Accuracy from SVM Classifier:            {:.2f} %" .format(accuracies_svm.mean()*100))
print("Standard Deviation:                      {:.2f} %" .format(accuracies_svm.std()*100))
print("\n")
print("Accuracy from KNN Classifier:            {:.2f} %" .format(accuracies_knn.mean()*100))
print("Standard Deviation:                      {:.2f} %" .format(accuracies_knn.std()*100))
print("\n")
print("Accuracy from Kernel SVM:                {:.2f} %" .format(accuracies_kernel_svm.mean()*100))
print("Standard Deviation:                      {:.2f} %" .format(accuracies_kernel_svm.std()*100))
print("\n")
print("Accuracy from Naive Bayes Classifier:    {:.2f} %" .format(accuracies_nb.mean()*100))
print("Standard Deviation:                      {:.2f} %" .format(accuracies_nb.std()*100))
print("\n")
print("Accuracy from Decision Tree Classifier:  {:.2f} %" .format(accuracies_dt.mean()*100))
print("Standard Deviation:                      {:.2f} %" .format(accuracies_dt.std()*100))
print("\n")
print("Accuracy from Random Forest Classifier:  {:.2f} %" .format(accuracies_rf.mean()*100))
print("Standard Deviation:                      {:.2f} %" .format(accuracies_rf.std()*100))
print("\n")
print("Accuracy from XGBoost:                   {:.2f} %" .format(accuracies_xgb.mean()*100))
print("Standard Deviation:                      {:.2f} %" .format(accuracies_xgb.std()*100))


#print("Standard Deviation_Logistic Regression: {:.2f} %" .format(accuracies_log_reg.std()*100))
#print("Standard Deviation_SVM Classifier:      {:.2f} %" .format(accuracies_svm.std()*100))
#print("Standard Deviation_KNN Classifier:      {:.2f} %" .format(accuracies_knn.std()*100))
#print("Standard Deviation_Kernel_SVM:          {:.2f} %" .format(accuracies_kernel_svm.std()*100))
#print("Standard Deviation_Naive Bayes:         {:.2f} %" .format(accuracies_nb.std()*100))
#print("Standard Deviation_Decision Tree:       {:.2f} %" .format(accuracies_dt.std()*100))
#print("Standard Deviation_Random Forest:       {:.2f} %" .format(accuracies_rf.std()*100))
#print("Standard Deviation_ XGBoost:            {:.2f} %" .format(accuracies_xgb.std()*100))



Accuracy from Logistic Regression:       96.34 %
Standard Deviation:                      2.02 %


Accuracy from SVM Classifier:            96.70 %
Standard Deviation:                      1.61 %


Accuracy from KNN Classifier:            96.52 %
Standard Deviation:                      2.09 %


Accuracy from Kernel SVM:                96.70 %
Standard Deviation:                      1.81 %


Accuracy from Naive Bayes Classifier:    96.33 %
Standard Deviation:                      2.33 %


Accuracy from Decision Tree Classifier:  95.05 %
Standard Deviation:                      2.61 %


Accuracy from Random Forest Classifier:  95.60 %
Standard Deviation:                      1.68 %


Accuracy from XGBoost:                   96.52 %
Standard Deviation:                      1.72 %


Applying 10 fold cross validation has improved the accuracies of all the models, except logistic regression.  

Overall, **logistic regression model** has performed better in predicting whether the given tumor is beingn or malignant.