<b>Classification Case Study: Breast Cancer</b>

Dataset source: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)

We will classify whether a breast tumor is benign or malignant based on the following features, each graded from 1 to 10:


 <br>. Clump Thickness: 1 - 10
 <br>. Uniformity of Cell Size: 1 - 10
 <br>. Uniformity of Cell Shape: 1 - 10
 <br>. Marginal Adhesion: 1 - 10
 <br>. Single Epithelial Cell Size: 1 - 10
 <br>. Bare Nuclei: 1 - 10
 <br>. Bland Chromatin: 1 - 10
 <br>. Normal Nucleoli: 1 - 10
 <br>. Mitoses: 1 - 10

We'll use the following classification models: Logistic Regression, K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Kernel SVM, Naive Bayes, Decision Tree Classification, and Random Forest Classification.

<br><small>This project is based on the Udemy course "Machine learning with Python and R" by Kirill Eremenko and Hadelin de Ponteves.</small>


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [16]:
# Silence scikit-learn's FutureWarning and DataConversionWarning messages
# (all other warning categories are still shown)
import warnings
from sklearn.exceptions import DataConversionWarning
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

In [20]:
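# Importing the dataset
dataset = pd.read_csv("breast_cancer.csv")
# Features: every column except the first (skipped by the 1: slice) and the last
X = dataset.iloc[:,1:-1].values
# Target: the last column, Class (2 = benign, 4 = malignant)
y = dataset.iloc[:,-1].values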
dataset.head()

Unnamed: 0,Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2
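
The positional slice `iloc[:, 1:-1]` depends on the column order in the CSV. A minimal sketch of a name-based alternative follows, rebuilt from the five rows printed above so that it runs without `breast_cancer.csv`. The names `NON_FEATURES` and `LABELS` are illustrative, not part of the original notebook; the 2 = benign / 4 = malignant coding comes from the UCI dataset description. The `Unnamed: 0` header may simply be the row index of the rendered table, so the sketch drops it by name if it is present.

```python
import io
import pandas as pd

# The five rows shown by dataset.head(), reproduced as CSV text.
csv_text = """Unnamed: 0,Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Illustrative names: columns that identify or label a row rather than describe it.
NON_FEATURES = {"Unnamed: 0", "Sample code number", "Class"}
LABELS = {2: "benign", 4: "malignant"}  # coding per the UCI dataset description

# Select the nine cytological features by name instead of by position.
feature_names = [c for c in sample.columns if c not in NON_FEATURES]
X_sample = sample[feature_names].values
y_sample = sample["Class"].values

print(feature_names)
print(X_sample.shape)
print([LABELS[v] for v in y_sample])
```

Selecting by name keeps the sample ID out of the feature matrix even if a column is added or reordered in the file.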


<li><b>Logistic Regression</b></li>

In [21]:
# 1) Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state = 0)

# 2) Building the Logistic Regression Model
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train,y_train)

# 3) Predicting y_test
y_pred = classifier.predict(X_test)

# 4) How good are the results ?
# a) confusion matrix:
cf = confusion_matrix(y_test,y_pred)
print(cf)

# b) accuracy score (accuracy_score returns a fraction, so scale it to a percentage)
A = accuracy_score(y_test,y_pred)
print("Accuracy Score: {:.2f} %\n".format(A * 100))

# c) Accuracy with 10-fold cross-validation: train on 9 folds, score on the held-out fold,
#    repeat for each of the 10 folds, then average the 10 accuracies
accuracies = cross_val_score(classifier,X_train,y_train,cv=10)
print("K-folds Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

[[84  3]
 [ 4 46]]
Accuracy Score: 94.89 %

K-folds Accuracy: 96.70 %
Standard Deviation: 2.43 %
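
Accuracy alone does not say which kind of error the model makes, and for a cancer classifier a missed malignant case is usually the costlier one. The sketch below derives a few per-class metrics from the logistic regression confusion matrix printed above. It assumes scikit-learn's convention (rows are true classes, columns are predicted classes, labels sorted so that index 0 is class 2 and index 1 is class 4); the variable names are illustrative.

```python
# Confusion matrix printed above for logistic regression.
# Rows = true class, columns = predicted class; index 0 is class 2 (benign),
# index 1 is class 4 (malignant).
cm = [[84, 3],
      [4, 46]]

tn, fp = cm[0]  # benign samples: predicted benign / predicted malignant
fn, tp = cm[1]  # malignant samples: predicted benign / predicted malignant

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
recall = tp / (tp + fn)       # share of malignant cases that were caught
precision = tp / (tp + fp)    # share of malignant predictions that were correct
specificity = tn / (tn + fp)  # share of benign cases left alone

print("Accuracy:    {:.2f} %".format(accuracy * 100))
print("Recall:      {:.2f} %".format(recall * 100))
print("Precision:   {:.2f} %".format(precision * 100))
print("Specificity: {:.2f} %".format(specificity * 100))
```

With these counts, 4 of the 50 malignant test samples are missed, so recall on the malignant class is 92 % even though overall accuracy is close to 95 %.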


<li><b>K-Nearest Neighbors (KNN)</b></li>

In [22]:
# 1) Split the dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=0)

# 2) Feature Scaling
# Not strictly required here (every feature is already on a 1-10 scale), but KNN is
# distance-based, so standardizing the features is good practice
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# 3) Building the KNN Model
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train,y_train)

# 4) Predicting y_test
y_pred = classifier.predict(X_test)

# 5) How good are the results ?
# a) confusion matrix:
cf = confusion_matrix(y_test,y_pred)
print(cf)

# b) accuracy score (accuracy_score returns a fraction, so scale it to a percentage)
A = accuracy_score(y_test,y_pred)
print("Accuracy Score: {:.2f} %\n".format(A * 100))

[[103   4]
 [  5  59]]
Accuracy Score: 94.74 %
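
Only the logistic regression cell reports a k-fold estimate; the KNN cell relies on a single split. One way to cross-validate a model that needs scaling is to wrap the scaler and the classifier in a pipeline, so that the scaler is refit on the training folds only. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for `breast_cancer.csv`, so the printed numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 9 features, binary target.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=9, n_informative=5, n_redundant=2, random_state=0
)

# The pipeline refits StandardScaler inside every fold, so the held-out
# fold never influences the scaling statistics.
knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2),
)

scores = cross_val_score(knn, X_demo, y_demo, cv=10)
print("K-folds Accuracy: {:.2f} %".format(scores.mean() * 100))
print("Standard Deviation: {:.2f} %".format(scores.std() * 100))
```

On the real data, the same pipeline could be passed to `cross_val_score` together with the unscaled training split.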



<li><b>Support Vector Machines (SVM) (Linear)</b></li>

In [24]:
# 1) Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state = 0)

# 2) Feature Scaling
# Recommended for Support Vector Machines, which are sensitive to feature scale
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# 3) Building the SVM Model
classifier = SVC(kernel='linear',random_state=0)
classifier.fit(X_train,y_train)

# 4) Predicting y_test
y_pred = classifier.predict(X_test)

# 5) How good are the results ?
# a) confusion matrix:
cm = confusion_matrix(y_test,y_pred)
print(cm)

# b) accuracy score (accuracy_score returns a fraction, so scale it to a percentage)
A = accuracy_score(y_test,y_pred)
print("Accuracy Score: {:.2f} %\n".format(A * 100))

[[102   5]
 [  5  59]]
Accuracy Score: 94.15 %



<li><b>Kernel SVM (Non-Linear)</b></li>

In [25]:
# 1) Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state = 0)

# 2) Feature Scaling
# Recommended for Support Vector Machines, which are sensitive to feature scale
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# 3) Building Kernel SVM model
classifier = SVC(kernel='rbf',random_state=0)
classifier.fit(X_train,y_train)

# 4) Predicting y_test
y_pred = classifier.predict(X_test)

# 5) How good are the results ?
# a) confusion matrix:
cm = confusion_matrix(y_test,y_pred)
print(cm)

# b) accuracy score (accuracy_score returns a fraction, so scale it to a percentage)
A = accuracy_score(y_test,y_pred)
print("Accuracy Score: {:.2f} %\n".format(A * 100))

[[101   6]
 [  3  61]]
Accuracy Score: 94.74 %
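
With `kernel='rbf'`, the SVM compares samples through the radial basis function K(x, z) = exp(-gamma * ||x - z||^2) instead of a plain dot product. The sketch below computes that kernel by hand for three of the rows shown in the dataset preview. The function name `rbf_kernel_matrix` and the value `gamma=0.1` are illustrative assumptions; `SVC` chooses its own `gamma` by default and, in the cell above, operates on standardized rather than raw features.

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma):
    """Return K with K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    # Pairwise differences via broadcasting: shape (len(A), len(B), n_features).
    diff = A[:, None, :] - B[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    return np.exp(-gamma * sq_dist)

# First three samples from the preview table (nine feature values each),
# left unscaled for illustration.
points = np.array([
    [5, 1, 1, 1, 2, 1, 3, 1, 1],
    [5, 4, 4, 5, 7, 10, 3, 2, 1],
    [3, 1, 1, 1, 2, 2, 3, 1, 1],
], dtype=float)

K = rbf_kernel_matrix(points, points, gamma=0.1)
print(np.round(K, 4))
```

Each sample has similarity 1 with itself, and the similarity decays toward 0 as two samples move apart, which is what lets the classifier draw a non-linear boundary.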



<li><b>Naive Bayes</b></li>

In [26]:
# 1) Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state = 0)

# 2) Feature Scaling
# Appears to make no difference here: Gaussian Naive Bayes fits a separate Gaussian
# to each feature, so it is largely insensitive to per-feature rescaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# 3) Building the Naive Bayes Model
classifier = GaussianNB()
classifier.fit(X_train,y_train)

# 4) Predicting y_test
y_pred = classifier.predict(X_test)

# 5) How good are the results ?
# a) confusion matrix:
cm = confusion_matrix(y_test,y_pred)
print(cm)

# b) accuracy score (accuracy_score returns a fraction, so scale it to a percentage)
A = accuracy_score(y_test,y_pred)
print("Accuracy Score: {:.2f} %\n".format(A * 100))

[[99  8]
 [ 2 62]]
Accuracy Score: 94.15 %



<li><b>Decision Tree Classification</b></li>

In [27]:
# 1) Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state = 0)

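# 2) Feature Scaling
# Not needed for tree-based models (splits depend only on the ordering of feature values);
# kept for consistency with the other cells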
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# 3) Building the Decision Tree Model
classifier = DecisionTreeClassifier(criterion='entropy',random_state=0)
classifier.fit(X_train,y_train)

# 4) Predicting y_test
y_pred = classifier.predict(X_test)

# 5) How good are the results ?

# a) confusion matrix:
cm = confusion_matrix(y_test,y_pred)
print(cm)

# b) accuracy score (accuracy_score returns a fraction, so scale it to a percentage)
A = accuracy_score(y_test,y_pred)
print("Accuracy Score: {:.2f} %\n".format(A * 100))

[[104   3]
 [  4  60]]
Accuracy Score: 95.91 %
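
`criterion='entropy'` tells the tree to prefer splits that reduce Shannon entropy the most (information gain). The sketch below works through that arithmetic with counts taken from the confusion matrix above: the 171 test samples (107 benign, 64 malignant) act as a parent node, and the two groups formed by the tree's predicted class act as child nodes. Treating the final prediction as a single split is purely an illustration, and the helper names `entropy` and `information_gain` are hypothetical.

```python
from math import log2

def entropy(counts):
    """Shannon entropy, in bits, of a node with the given class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Counts are [benign, malignant].
parent = [107, 64]               # all test samples
children = [[104, 4], [3, 60]]   # predicted benign / predicted malignant

print("Parent entropy:   {:.3f} bits".format(entropy(parent)))
print("Information gain: {:.3f} bits".format(information_gain(parent, children)))
```

A perfectly mixed two-class node has an entropy of 1 bit and a pure node has 0 bits, so a split is more useful the closer its children are to pure.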



<li><b>Random Forest Classification</b></li>

In [28]:
# 1) Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state = 0)

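# 2) Feature Scaling
# Not needed for tree-based models (splits depend only on the ordering of feature values);
# kept for consistency with the other cells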
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# 3) Building the Random Forest Model
classifier = RandomForestClassifier(n_estimators=10,criterion='entropy',random_state=0)
classifier.fit(X_train,y_train)

# 4) Predicting y_test
y_pred = classifier.predict(X_test)

# 5) How good are the results ?
# a) confusion matrix:
cm = confusion_matrix(y_test,y_pred)
print(cm)

# b) accuracy score (accuracy_score returns a fraction, so scale it to a percentage)
A = accuracy_score(y_test,y_pred)
print("Accuracy Score: {:.2f} %\n".format(A * 100))

[[104   3]
 [  5  59]]
Accuracy Score: 95.32 %
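
The cells above repeat the same split, scale, fit and score steps for each model, and the logistic regression cell uses `test_size=0.2` while the others use `0.25`, so the scores are not strictly like-for-like. A minimal sketch of a single loop over all seven classifiers on one shared split follows. It runs on synthetic stand-in data from `make_classification` (an assumption, so the numbers it prints are not results for the breast cancer dataset), and the `_demo` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption): 9 features, binary target.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=9, n_informative=5, n_redundant=2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Same hyperparameters as the cells above.
models = {
    "Logistic Regression": LogisticRegression(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2),
    "SVM (linear)": SVC(kernel="linear", random_state=0),
    "Kernel SVM (rbf)": SVC(kernel="rbf", random_state=0),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "Random Forest": RandomForestClassifier(
        n_estimators=10, criterion="entropy", random_state=0
    ),
}

results = {}
for name, model in models.items():
    # Scale inside a pipeline so every model sees identically prepared data.
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, pipeline.predict(X_te))

# Print the models from most to least accurate on the shared test split.
for name, acc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print("{:<22} {:.2f} %".format(name, acc * 100))
```

Using one split and one loop makes the comparison easier to read and removes the risk of copy-paste slips between cells.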

