# Support Vector Machine

The goal is to predict whether a cell sample is benign or malignant, based on its cell characteristics.

Import necessary modules:

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import pickle

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report, jaccard_score

%matplotlib inline

Load the dataset and look at the first few rows:

In [2]:
df = pd.read_csv("../datasets/cell_samples.csv")
df.head()

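| | ID | Clump | UnifSize | UnifShape | MargAdh | SingEpiSize | BareNuc | BlandChrom | NormNucl | Mit | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |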


And summary statistics for the numeric columns:

In [3]:
df.describe()

| | ID | Clump | UnifSize | UnifShape | MargAdh | SingEpiSize | BlandChrom | NormNucl | Mit | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 |
| mean | 1071704.0 | 4.41774 | 3.134478 | 3.207439 | 2.806867 | 3.216023 | 3.437768 | 2.866953 | 1.589413 | 2.689557 |
| std | 617095.7 | 2.815741 | 3.051459 | 2.971913 | 2.855379 | 2.2143 | 2.438364 | 3.053634 | 1.715078 | 0.951273 |
| min | 61634.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| 25% | 870688.5 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 1.0 | 1.0 | 2.0 |
| 50% | 1171710.0 | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 75% | 1238298.0 | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | 4.0 | 1.0 | 4.0 |
| max | 13454350.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 4.0 |


As the solution notebook points out, one of the columns might contain some values that `sklearn` cannot handle. Let's check for ourselves:

In [4]:
df.dtypes

ID              int64
Clump           int64
UnifSize        int64
UnifShape       int64
MargAdh         int64
SingEpiSize     int64
BareNuc        object
BlandChrom      int64
NormNucl        int64
Mit             int64
Class           int64
dtype: object

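`BareNuc` has dtype `object`, so it contains entries that are not plain integers. Drop the rows whose `BareNuc` value cannot be parsed as a number, then cast the column to integers: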

In [5]:
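# Keep only the rows where BareNuc parses as a number; .copy() avoids
# a SettingWithCopyWarning on the assignment below.
df = df[pd.to_numeric(df['BareNuc'], errors='coerce').notnull()].copy()
df['BareNuc'] = df['BareNuc'].astype(int)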
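
To see what this filter does, here is a minimal sketch on a toy column. The sample values, including the `'?'` placeholder, and the names `toy`, `parsed` and `clean` are assumptions made for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Toy frame: the '?' entry is a hypothetical non-numeric placeholder.
toy = pd.DataFrame({"BareNuc": ["1", "10", "?", "4"]})

# errors='coerce' turns anything unparseable into NaN instead of raising.
parsed = pd.to_numeric(toy["BareNuc"], errors="coerce")

# Keep only the rows that parsed cleanly, then cast the column to int.
clean = toy[parsed.notnull()].copy()
clean["BareNuc"] = clean["BareNuc"].astype(int)

print(clean["BareNuc"].tolist())
print(len(toy) - len(clean), "row(s) dropped")
```

Comparing `len(df)` before and after the real filtering step shows, in the same way, how many samples were discarded.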

Separate the features and the label into two arrays:

In [6]:
X = np.asarray(df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']])
y = np.asarray(df['Class'].astype(int))

Split the dataset into training and test sets:

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
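
Note that the split above is not seeded, so the test set (and the scores further down) will differ from run to run. A minimal sketch of a reproducible, class-balanced split follows, assuming synthetic stand-in data; `X_demo`, `y_demo` and the seed value `42` are made up for illustration, while `random_state` and `stratify` are regular `train_test_split` parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 9 features, labels 2 and 4 (65/35).
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(100, 9))
y_demo = np.array([2] * 65 + [4] * 35)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
Xa, Xb, ya, yb = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(Xa.shape, Xb.shape)
print("share of class 4 in the test part:", (yb == 4).mean())
```

Stratifying matters most when one class is much rarer than the other; here it simply guarantees that both parts keep the 65/35 ratio.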

Initialize the model and fit it to the training data. As the solutions notebook points out, an SVM with a linear kernel gives slightly better results on this dataset:

In [8]:
model = SVC(kernel='linear')
model.fit(X_train, y_train)

SVC(kernel='linear')
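
One way to check a kernel choice like this for yourself is to cross-validate each kernel. The sketch below is only an illustration on synthetic data from `make_classification` (an assumption; it is not the cell dataset), so the ranking it prints says nothing about which kernel is best for the real data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic binary problem with 9 features, standing in for the real data.
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)

# Mean 5-fold cross-validated accuracy for each built-in SVC kernel.
scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    cv_scores = cross_val_score(SVC(kernel=kernel), X_demo, y_demo, cv=5)
    scores[kernel] = cv_scores.mean()

for kernel, score in scores.items():
    print(f"{kernel:8s} mean accuracy: {score:.3f}")
```

Swapping `X_demo, y_demo` for `X_train, y_train` would run the same comparison on the actual training set.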

Make some predictions from this model on the test data:

In [9]:
pred = model.predict(X_test)

And check how accurate these predictions are:

In [10]:
print(f"Confusion Matrix:\n\n{confusion_matrix(y_test, pred)}\n\n")
print(f"Classification Report:\n\n{classification_report(y_test, pred)}")

Confusion Matrix:

[[89  1]
 [ 2 45]]


Classification Report:

              precision    recall  f1-score   support

           2       0.98      0.99      0.98        90
           4       0.98      0.96      0.97        47

    accuracy                           0.98       137
   macro avg       0.98      0.97      0.98       137
weighted avg       0.98      0.98      0.98       137
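
The numbers in the report can be recomputed by hand from the confusion matrix. The sketch below does this for class 4, assuming the usual `sklearn` layout (rows are true classes, columns are predicted classes, in label order `[2, 4]`). It also computes the Jaccard index for that class, which is the quantity `jaccard_score` reports for a single positive label.

```python
import numpy as np

# Confusion matrix from the output above:
# rows = true class, columns = predicted class, label order [2, 4].
cm = np.array([[89, 1],
               [2, 45]])

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()

# Treat class 4 as the positive class.
tp = cm[1, 1]  # true 4, predicted 4
fp = cm[0, 1]  # true 2, predicted 4
fn = cm[1, 0]  # true 4, predicted 2

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
jaccard = tp / (tp + fp + fn)

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"f1-score  = {f1:.4f}")
print(f"jaccard   = {jaccard:.4f}")
```

Rounded to two decimals, precision, recall and f1-score agree with the row for class 4 in the classification report.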



The model classifies 134 of the 137 test samples correctly (about 98% accuracy). Let's save it:

In [11]:
# Use a context manager so the file is closed even if dumping fails.
with open('../saved_models/model_8.sav', 'wb') as f:
    pickle.dump(model, f)
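
A saved model can be loaded back with `pickle.load`. The following is a minimal round-trip sketch, assuming a tiny stand-in model and a temporary file; `model_demo`, `X_demo`, `y_demo` and the file name `model_demo.sav` are hypothetical and not part of this notebook.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.svm import SVC

# Tiny stand-in model trained on made-up, clearly separable points.
X_demo = np.array([[1, 1], [2, 1], [8, 9], [9, 8]])
y_demo = np.array([2, 2, 4, 4])
model_demo = SVC(kernel="linear").fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_demo.sav")  # hypothetical file name

    # Save the fitted model ...
    with open(path, "wb") as f:
        pickle.dump(model_demo, f)

    # ... and load it back into a new object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should give the same predictions as the original.
same = np.array_equal(model_demo.predict(X_demo), restored.predict(X_demo))
print("predictions match after reload:", same)
```

Only unpickle files from a source you trust, and preferably with the same `scikit-learn` version that was used to save the model.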