# **Comparative Analysis of Classification Algorithms on the Wisconsin Breast Cancer Dataset**

This notebook implements seven classification models (Logistic Regression, K-NN, SVM, Kernel SVM, Naive Bayes, Decision Tree, and Random Forest) on the Wisconsin Breast Cancer Dataset. It covers data loading, the train/test split, feature scaling, model training, prediction, and a comparison of each model's test-set accuracy.

## Importing the libraries

In [492]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, accuracy_score
import pandas as pd

## Importing the dataset

The **Wisconsin Breast Cancer Dataset (WBCD)** is a widely used dataset for binary classification: predicting whether a breast tumor is malignant or benign. The copy used here contains 683 instances, each with a sample code number, 9 integer-valued features scored from 1 to 10, and a class label. The features describe cytological characteristics of fine needle aspirates (FNA) of breast masses:

**Clump Thickness**: Cell clump thickness.

**Uniformity of Cell Size**: Consistency in cell size.

**Uniformity of Cell Shape**: Consistency in cell shape.

**Marginal Adhesion**: Adhesion between cells.

**Single Epithelial Cell Size**: Size of epithelial cells.

**Bare Nuclei**: Number of bare nuclei.

**Bland Chromatin**: Texture of the chromatin in the cell.

**Normal Nucleoli**: Nucleoli appearance.

**Mitoses**: Rate of cell division.

The target variable, Class, is labeled as:

**2 for benign tumors.**

**4 for malignant tumors.**

In [493]:
dataset = pd.read_csv('Data.csv')
dataset

Unnamed: 0,Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2
...,...,...,...,...,...,...,...,...,...,...,...
678,776715,3,1,1,1,3,2,1,1,1,2
679,841769,2,1,1,1,2,1,1,1,1,2
680,888820,5,10,10,3,7,3,8,10,2,4
681,897471,4,8,6,4,3,4,10,6,1,4


In [494]:
print('dataset length :', len(dataset))

dataset length : 683


In [495]:
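# Features: every column except the sample code number (first) and Class (last)
X = dataset.iloc[:, 1:-1].values
# Target: the Class column, kept here as a column vector of shape (n_samples, 1)
y = dataset.iloc[:, -1:].values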

In [496]:
print(X[:10])

[[ 5  1  1  1  2  1  3  1  1]
 [ 5  4  4  5  7 10  3  2  1]
 [ 3  1  1  1  2  2  3  1  1]
 [ 6  8  8  1  3  4  3  7  1]
 [ 4  1  1  3  2  1  3  1  1]
 [ 8 10 10  8  7 10  9  7  1]
 [ 1  1  1  1  2 10  3  1  1]
 [ 2  1  2  1  2  1  3  1  1]
 [ 2  1  1  1  2  1  1  1  5]
 [ 4  2  1  1  2  1  2  1  1]]


In [497]:
print(y[:10])

[[2]
 [2]
 [2]
 [2]
 [2]
 [4]
 [2]
 [2]
 [2]
 [2]]
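
The target printed above is a column vector of shape `(n_samples, 1)` because it was sliced with `-1:`. Most scikit-learn classifiers expect a 1-D target and emit a `DataConversionWarning` when given a column vector. A minimal sketch of the usual fix, using a small hypothetical array in place of the real `Class` column:

```python
import numpy as np

# Hypothetical stand-in for dataset.iloc[:, -1:].values (labels 2 and 4).
y_column = np.array([[2], [2], [4], [2]])  # shape (4, 1)

# ravel() flattens the column vector into the 1-D shape (n_samples,)
# that scikit-learn estimators expect for a single-output target.
y_flat = y_column.ravel()

print(y_column.shape)  # (4, 1)
print(y_flat.shape)    # (4,)
```

Selecting the column with `dataset.iloc[:, -1].values` (no trailing colon) should give the same 1-D result directly.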


## Splitting the dataset into the Training set and Test set

In [498]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

In [499]:
print("Length of the training dataset : ",len(X_train))
print("Length of the testing dataset  : ",len(X_test))

Length of the training dataset :  512
Length of the testing dataset  :  171
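
The 512/171 split can be checked by hand. To my understanding, `train_test_split` rounds the test count up when `test_size` is a fraction and assigns the remaining rows to training; the short sketch below reproduces the reported sizes under that assumption:

```python
import math

n_samples = 683   # rows reported by len(dataset)
test_size = 0.25  # fraction passed to train_test_split

# Test count is rounded up; training receives whatever is left.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 512 171
```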


## Feature Scaling

In [500]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
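
The scaler is fitted on the training split only and then reused on the test split, so no information from the test set leaks into the preprocessing. The sketch below shows what standardization computes, using hypothetical toy matrices (`X_train_toy`, `X_test_toy`) rather than the real data; it illustrates the idea and is not scikit-learn's implementation:

```python
import numpy as np

# Hypothetical toy feature matrices (rows = samples, columns = features).
X_train_toy = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 4.0], [7.0, 8.0]])
X_test_toy = np.array([[4.0, 8.0]])

# "fit": learn the per-column mean and population standard deviation
# (ddof=0) from the training rows only.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)

# "transform": apply the training statistics to both splits.
X_train_scaled = (X_train_toy - mean) / std
X_test_scaled = (X_test_toy - mean) / std

print(X_train_scaled.mean(axis=0))  # each column close to 0
print(X_train_scaled.std(axis=0))   # each column close to 1
```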

In [501]:
X_train[:10]

array([[ 0.91903747,  0.9407658 ,  2.30881719,  0.77265864, -0.10307335,
         1.80910082,  2.22576767,  2.27129602,  0.24623928],
       [ 1.27578287, -0.04290763,  1.63138773,  0.06811505,  0.3404019 ,
         1.53244024,  1.82407819,  1.94996317,  3.74830911],
       [ 1.27578287,  2.25233038,  2.30881719,  2.53401762,  1.22735239,
         1.80910082,  2.62745714,  2.27129602, -0.33743902],
       [-1.22143494, -0.69868992, -0.73961536, -0.63642854, -0.5465486 ,
        -0.68084439, -0.98774815, -0.62069958, -0.33743902],
       [-1.22143494, -0.69868992, -0.06218591, -0.63642854, -0.99002384,
        -0.68084439, -0.58605867, -0.62069958, -0.33743902],
       [-1.22143494, -0.69868992, -0.73961536, -0.63642854, -0.5465486 ,
        -0.68084439, -0.98774815, -0.62069958, -0.33743902],
       [ 0.20554667,  0.9407658 ,  0.95395828, -0.28415674,  0.3404019 ,
         1.80910082, -0.18436919,  0.98596464, -0.33743902],
       [-0.50794414, -0.69868992, -0.73961536, -0.63642854, -0

In [502]:
X_test[:10]

array([[-1.22143494, -0.69868992, -0.73961536, -0.63642854, -0.5465486 ,
         0.42579792, -0.98774815, -0.62069958, -0.33743902],
       [-0.50794414, -0.69868992, -0.73961536, -0.63642854, -0.5465486 ,
        -0.68084439, -0.58605867, -0.62069958, -0.33743902],
       [ 0.20554667,  0.61287466,  0.61524355, -0.28415674,  0.78387715,
         1.80910082,  0.21732028,  0.02196611, -0.33743902],
       [-0.15119873,  1.26865695,  1.63138773,  0.06811505,  0.3404019 ,
         1.80910082,  2.22576767, -0.62069958, -0.33743902],
       [-1.22143494, -0.69868992, -0.73961536, -0.63642854, -0.5465486 ,
        -0.68084439, -0.98774815, -0.62069958, -0.33743902],
       [-0.50794414, -0.04290763, -0.73961536, -0.63642854, -0.5465486 ,
        -0.68084439, -0.98774815, -0.62069958, -0.33743902],
       [ 0.20554667, -0.04290763, -0.73961536, -0.28415674, -0.5465486 ,
        -0.68084439, -0.58605867, -0.62069958, -0.33743902],
       [ 1.27578287,  2.25233038,  2.30881719,  2.53401762,  1

## **01 Logistic Regression model on the Training set**

In [503]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)


### Predicting the Test set results

In [504]:
y_pred = classifier.predict(X_test)
y_pred[:10]

array([2, 2, 4, 4, 2, 2, 2, 4, 2, 2])

In [505]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

accuracy_1 = accuracy_score(y_test, y_pred)
print('\nAccuracy :',accuracy_1)

[[103   4]
 [  5  59]]

Accuracy : 0.9473684210526315
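
The accuracy above follows directly from the confusion matrix: correct predictions sit on the diagonal. Rows are the true classes in sorted label order (2 = benign, 4 = malignant) and columns are the predictions. The sketch below recomputes accuracy and, treating malignant as the positive class, also derives sensitivity and specificity, which accuracy alone hides:

```python
# Confusion matrix copied from the Logistic Regression output.
cm = [[103, 4],
      [5, 59]]

# Unpack with malignant (label 4) as the positive class.
(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of all predictions that are correct
sensitivity = tp / (tp + fn)   # share of malignant cases that were caught
specificity = tn / (tn + fp)   # share of benign cases correctly cleared

print(f"accuracy    = {accuracy:.4f}")     # 0.9474
print(f"sensitivity = {sensitivity:.4f}")  # 0.9219
print(f"specificity = {specificity:.4f}")  # 0.9626
```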


## **02 K-NN model on the Training set**

In [506]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)

  return self._fit(X, y)


### Predicting the Test set results

In [507]:
y_pred = classifier.predict(X_test)
y_pred[:10]

array([2, 2, 4, 4, 2, 2, 2, 4, 2, 2])

In [508]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

accuracy_2 = accuracy_score(y_test, y_pred)
print('\nAccuracy :',accuracy_2)

[[103   4]
 [  5  59]]

Accuracy : 0.9473684210526315
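
With `metric = 'minkowski'` and `p = 2`, the distance used is the ordinary Euclidean distance, and `n_neighbors = 5` means each test point is labelled by majority vote among its five closest training points. The following is a minimal sketch of that idea on hypothetical toy data; `knn_predict` is an illustrative helper, not part of scikit-learn, and it makes no attempt at the optimized neighbor search the library performs:

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=5):
    """Classify one point by majority vote among its k nearest training points."""
    # Minkowski distance with p = 2 is the Euclidean distance.
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters labelled 2 and 4.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.0, 0.3],
                  [3.0, 3.0], [3.1, 2.9], [2.9, 3.2], [3.2, 3.1]])
y_toy = np.array([2, 2, 2, 2, 4, 4, 4, 4])

print(knn_predict(X_toy, y_toy, np.array([0.1, 0.1])))  # 2
print(knn_predict(X_toy, y_toy, np.array([3.0, 3.1])))  # 4
```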


## **03 SVM model on the Training set**

In [509]:
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)


### Predicting the Test set results

In [510]:
y_pred = classifier.predict(X_test)
y_pred[:10]

array([2, 2, 4, 4, 2, 2, 2, 4, 2, 2])

In [511]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

accuracy_3 = accuracy_score(y_test, y_pred)
print('\nAccuracy :',accuracy_3)

[[102   5]
 [  5  59]]

Accuracy : 0.9415204678362573


## **04 Kernel SVM model on the Training set**

In [512]:
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)


### Predicting the Test set results

In [513]:
y_pred = classifier.predict(X_test)
y_pred[:10]

array([2, 2, 4, 4, 2, 2, 2, 4, 2, 2])

In [514]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

accuracy_4 = accuracy_score(y_test, y_pred)
print('\nAccuracy :',accuracy_4)

[[101   6]
 [  3  61]]

Accuracy : 0.9473684210526315


## **05 Naive Bayes model on the Training set**

In [515]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)


### Predicting the Test set results

In [516]:
y_pred = classifier.predict(X_test)
y_pred[:10]

array([2, 2, 4, 4, 2, 2, 2, 4, 2, 2])

In [517]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

accuracy_5 = accuracy_score(y_test, y_pred)
print('\nAccuracy :',accuracy_5)

[[99  8]
 [ 2 62]]

Accuracy : 0.9415204678362573


## **06 Decision Tree Classification model on the Training set**

In [518]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

### Predicting the Test set results

In [519]:
y_pred = classifier.predict(X_test)
y_pred[:10]

array([2, 2, 4, 4, 2, 2, 2, 4, 2, 2])

In [520]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

accuracy_6 = accuracy_score(y_test, y_pred)
print('\nAccuracy :',accuracy_6)

[[104   3]
 [  4  60]]

Accuracy : 0.9590643274853801
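
The tree above is built with `criterion = 'entropy'`, meaning candidate splits are scored by how much they reduce Shannon entropy. As an illustration of the quantity involved, the sketch below computes the entropy of the test-set class distribution, using the row sums of the confusion matrix (107 benign, 64 malignant). The training-set counts are not shown in the notebook, so the test counts serve only as an example; `entropy` is a hypothetical helper:

```python
import math


def entropy(counts):
    """Shannon entropy in bits of a class-count distribution."""
    total = sum(counts)
    # Classes with zero count contribute nothing and are skipped.
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


# Row sums of the Decision Tree confusion matrix: 104 + 3 and 4 + 60.
test_class_counts = [107, 64]

h_test = entropy(test_class_counts)
print(f"{h_test:.3f}")     # roughly 0.954 bits
print(entropy([50, 50]))   # 1.0, the maximum for two classes
```

A node containing a single class has entropy 0, so a split that produces purer child nodes lowers the weighted entropy and is preferred.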


## **07 Random Forest Classification model on the Training set**

In [521]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

  return fit_method(estimator, *args, **kwargs)


### Predicting the Test set results

In [522]:
y_pred = classifier.predict(X_test)
y_pred[:10]

array([2, 2, 4, 4, 2, 2, 2, 4, 2, 2])

In [523]:
cm = confusion_matrix(y_test, y_pred)
print(cm)

accuracy_7 = accuracy_score(y_test, y_pred)
print('\nAccuracy :',accuracy_7)

[[104   3]
 [  5  59]]

Accuracy : 0.9532163742690059
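
The seven sections above repeat the same fit, predict, and score steps. One possible way to consolidate them is a loop over a dictionary of estimators, each wrapped in a pipeline with its own scaler. The sketch below runs on synthetic stand-in data from `make_classification`, not on the breast cancer data, so its accuracies are not comparable with those reported in this notebook; names such as `models` and `results` are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 683 rows, 9 features, 1-D target.
X_demo, y_demo = make_classification(
    n_samples=683, n_features=9, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Same estimators and hyperparameters as the sections above.
models = {
    "Logistic Regression": LogisticRegression(random_state=0),
    "K-NN": KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2),
    "SVM": SVC(kernel="linear", random_state=0),
    "Kernel SVM": SVC(kernel="rbf", random_state=0),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "Random Forest": RandomForestClassifier(
        n_estimators=10, criterion="entropy", random_state=0
    ),
}

results = {}
for name, model in models.items():
    # The scaler is fitted on the training split only, inside the pipeline.
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, pipeline.predict(X_te))

for name, acc in results.items():
    print(f"{name:<20} {acc:.4f}")
```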


## Comparing all the results

In [525]:
print("Logistic Regression model accuracy          :",accuracy_1)
print("K-NN model accuracy                         :",accuracy_2)
print("SVM model accuracy                          :",accuracy_3)
print("Kernel SVM model accuracy                   :",accuracy_4)
print("Naive Bayes model accuracy                  :",accuracy_5)
print("Decision Tree Classification model accuracy :",accuracy_6)
print("Random Forest Classification model accuracy :",accuracy_7)

Logistic Regression model accuracy          : 0.9473684210526315
K-NN model accuracy                         : 0.9473684210526315
SVM model accuracy                          : 0.9415204678362573
Kernel SVM model accuracy                   : 0.9473684210526315
Naive Bayes model accuracy                  : 0.9415204678362573
Decision Tree Classification model accuracy : 0.9590643274853801
Random Forest Classification model accuracy : 0.9532163742690059
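
Accuracy differences of this size correspond to only one or two test samples out of 171, and models with equal accuracy can make different kinds of errors. The sketch below ranks the models using the confusion matrices printed in this notebook and adds recall on the malignant class together with the number of malignant cases missed. It is a summary of the reported numbers, not a new experiment:

```python
# Confusion matrices copied from the outputs in this notebook.
# Rows are true classes (2 = benign, 4 = malignant); columns are predictions.
confusion = {
    "Logistic Regression": [[103, 4], [5, 59]],
    "K-NN": [[103, 4], [5, 59]],
    "SVM": [[102, 5], [5, 59]],
    "Kernel SVM": [[101, 6], [3, 61]],
    "Naive Bayes": [[99, 8], [2, 62]],
    "Decision Tree": [[104, 3], [4, 60]],
    "Random Forest": [[104, 3], [5, 59]],
}

rows = []
for name, ((tn, fp), (fn, tp)) in confusion.items():
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    recall = tp / (tp + fn)  # share of malignant cases detected
    rows.append((name, accuracy, recall, fn))

# Highest accuracy first; ties keep their original order.
rows.sort(key=lambda r: r[1], reverse=True)
for name, accuracy, recall, fn in rows:
    print(f"{name:<20} accuracy={accuracy:.4f}  "
          f"malignant recall={recall:.4f}  missed={fn}")
```

On this split the Decision Tree has the highest accuracy, while Naive Bayes misses the fewest malignant cases (2) at the cost of more false alarms on benign samples.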

