**SVM** (support vector machine) works by mapping data to a high-dimensional feature space so that data points can be categorized even when they are not linearly separable in their original space. The data are transformed in such a way that the separator between the categories can be drawn as a hyperplane. The characteristics of a new record can then be used to predict the group to which it belongs.
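
The mapping idea can be illustrated with a minimal sketch. It assumes a made-up one-dimensional dataset and a hand-picked feature map `x -> (x, x**2)`; neither comes from the cancer example, and a real SVM chooses the separator by optimization rather than by hand.

```python
import numpy as np

# Hypothetical 1-D data: class 1 lies outside [-1, 1], class 0 inside.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
labels = (np.abs(x) > 1).astype(int)

# No single threshold on x separates the classes, because class 1
# sits on both sides of class 0.
# Map each point into two dimensions: (x, x**2).
mapped = np.column_stack([x, x ** 2])

# In the mapped space the line "second coordinate = 1" (a hyperplane in 2-D)
# separates the two classes.
predicted = (mapped[:, 1] > 1).astype(int)
print(np.array_equal(predicted, labels))  # True
```
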

### Load the Cancer data:

The example is based on a dataset that is publicly available from the UCI Machine Learning Repository (Asuncion and Newman, 2007) [http://mlearn.ics.uci.edu/MLRepository.html]. The dataset consists of several hundred human cell sample records, each of which contains the values of a set of cell characteristics.

For the purposes of this example, we're using a dataset that has a relatively small number of predictors in each record.

In [1]:
import pandas as pd
import numpy as np

In [2]:
url = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/cell_samples.csv'
df = pd.read_csv(url)
df.head()

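| | ID | Clump | UnifSize | UnifShape | MargAdh | SingEpiSize | BareNuc | BlandChrom | NormNucl | Mit | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |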


In [3]:
df.dtypes

ID              int64
Clump           int64
UnifSize        int64
UnifShape       int64
MargAdh         int64
SingEpiSize     int64
BareNuc        object
BlandChrom      int64
NormNucl        int64
Mit             int64
Class           int64
dtype: object

In [22]:
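# Some 'BareNuc' entries are non-numeric, so the column is read as type 'object'; keep only the rows that parse as numbers.
df = df[pd.to_numeric(df['BareNuc'], errors='coerce').notnull()]
# Preview the column as floats. The result is not assigned, so df['BareNuc'] itself keeps its dtype.
df['BareNuc'].astype('float')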

0       1.0
1      10.0
2       2.0
3       4.0
4       1.0
       ... 
694     2.0
695     1.0
696     3.0
697     4.0
698     5.0
Name: BareNuc, Length: 683, dtype: float64
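
The filtering step can be sketched on a toy column. This is an illustration under assumptions: the frame `toy` and the `'?'` placeholder are hypothetical and are not taken from the dataset. It shows how `errors='coerce'` flags unparseable entries, and that `astype` returns a new Series rather than changing the column in place.

```python
import pandas as pd

# Hypothetical column: digits stored as strings plus a '?' placeholder.
toy = pd.DataFrame({'BareNuc': ['1', '10', '?', '4']})

# errors='coerce' turns anything unparseable into NaN instead of raising.
parsed = pd.to_numeric(toy['BareNuc'], errors='coerce')
print(parsed.isnull().tolist())  # [False, False, True, False]

# Keep only the rows that parsed; .copy() gives an independent frame to modify.
clean = toy[parsed.notnull()].copy()

# astype returns a new Series; the column changes only if it is assigned back.
clean['BareNuc'].astype('float')
print(pd.api.types.is_numeric_dtype(clean['BareNuc']))  # False

clean['BareNuc'] = clean['BareNuc'].astype('int')
print(pd.api.types.is_numeric_dtype(clean['BareNuc']))  # True
```
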

In [23]:
df.describe()

| | ID | Clump | UnifSize | UnifShape | MargAdh | SingEpiSize | BlandChrom | NormNucl | Mit | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 | 683.0 |
| mean | 1076720.0 | 4.442167 | 3.150805 | 3.215227 | 2.830161 | 3.234261 | 3.445095 | 2.869693 | 1.603221 | 2.699854 |
| std | 620644.0 | 2.820761 | 3.065145 | 2.988581 | 2.864562 | 2.223085 | 2.449697 | 3.052666 | 1.732674 | 0.954592 |
| min | 63375.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| 25% | 877617.0 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 1.0 | 1.0 | 2.0 |
| 50% | 1171795.0 | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 75% | 1238705.0 | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | 4.0 | 1.0 | 4.0 |
| max | 13454350.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 4.0 |


In [24]:
df.shape

(683, 11)

In [25]:
feature_df = df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]
X = np.asarray(feature_df)
X[0:5]

array([[5, 1, 1, 1, 2, '1', 3, 1, 1],
       [5, 4, 4, 5, 7, '10', 3, 2, 1],
       [3, 1, 1, 1, 2, '2', 3, 1, 1],
       [6, 8, 8, 1, 3, '4', 3, 7, 1],
       [4, 1, 1, 3, 2, '1', 3, 1, 1]], dtype=object)
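
The quoted values in the array above indicate that `BareNuc` is still stored as strings, which makes NumPy fall back to `dtype=object` for the whole matrix. A minimal sketch of the effect, assuming a hypothetical two-column frame `toy`, and of one way to obtain a purely numeric matrix:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one integer column and one column of digit strings.
toy = pd.DataFrame({'Clump': [5, 3], 'BareNuc': ['1', '10']})

# Mixed column types have no common numeric dtype, so the array is object.
mixed = np.asarray(toy)
print(mixed.dtype)  # object

# Casting the frame first yields a plain floating-point matrix.
numeric = np.asarray(toy.astype(float))
print(numeric.dtype)  # float64
```
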

In [27]:
df['Class'] = df['Class'].astype('int')
Y = np.asarray(df['Class'])
Y [0:5]

array([2, 2, 2, 2, 2])

In [29]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=6)  # Split the data into training and test sets
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (546, 9) (546,)
Test set: (137, 9) (137,)
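
The set sizes can be checked by hand. This sketch assumes that `train_test_split` rounds the test fraction up to a whole number of rows, which is consistent with the shapes printed above.

```python
import math

n_samples = 683   # rows left after cleaning, from df.shape
test_size = 0.2   # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)  # 136.6 rounds up to 137
n_train = n_samples - n_test               # the remaining rows
print(n_train, n_test)  # 546 137
```
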


In [40]:
from sklearn import svm # Modelling SVM with SKLearn
clf = svm.SVC(kernel='linear') # Other available kernels include 'poly', 'rbf' (the default), and 'sigmoid'.
clf.fit(X_train, y_train) 

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
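
To get a feel for the kernel choice, the built-in kernels can be compared on the same split. The sketch below runs on synthetic data from `make_classification`, not on the cell samples, so the scores say nothing about this dataset; the names `X_demo`, `y_demo`, and `scores` are placeholders.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 9 features, like the feature matrix above.
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Fit one classifier per kernel and record its accuracy on the held-out rows.
scores = {}
for kernel in ('linear', 'poly', 'rbf', 'sigmoid'):
    model = svm.SVC(kernel=kernel)
    model.fit(X_tr, y_tr)
    scores[kernel] = model.score(X_te, y_te)

print(scores)
```

Which kernel scores best depends on the data, so a comparison like this is usually repeated on the real training set, ideally with cross-validation.
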

In [41]:
yhat = clf.predict(X_test)
yhat [0:5]

array([4, 2, 4, 4, 2])
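
With a linear kernel, a prediction comes down to which side of the hyperplane a point falls on. The following sketch, run on synthetic two-class data from `make_blobs` rather than on the cell samples, recomputes the decision values from the fitted `coef_` and `intercept_` and compares them with scikit-learn's own output.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Synthetic two-class data in two dimensions (a stand-in, not the cell samples).
X_demo, y_demo = make_blobs(n_samples=100, centers=2, random_state=0)
model = svm.SVC(kernel='linear').fit(X_demo, y_demo)

# The hyperplane is w . x + b = 0.
w = model.coef_[0]
b = model.intercept_[0]
manual_scores = X_demo @ w + b

# Positive scores map to the second class label, negative scores to the first.
manual_labels = np.where(manual_scores > 0, model.classes_[1], model.classes_[0])

print(np.allclose(manual_scores, model.decision_function(X_demo)))
print(np.array_equal(manual_labels, model.predict(X_demo)))
```
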

In [42]:
from sklearn.metrics import classification_report
print(classification_report(y_test, yhat))  # argument order is (y_true, y_pred)

              precision    recall  f1-score   support

           2       0.98      0.99      0.98        81
           4       0.98      0.96      0.97        56

    accuracy                           0.98       137
   macro avg       0.98      0.98      0.98       137
weighted avg       0.98      0.98      0.98       137
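
The per-class columns of the report can be reproduced by hand. This minimal sketch uses made-up labels and a hypothetical helper, `per_class_scores`, to show how precision, recall, and F1 are derived. Precision and recall are not symmetric in the true and predicted labels, so the argument order of `classification_report` matters.

```python
import numpy as np

# Made-up labels using the same class codes as the dataset (2 and 4).
y_true = np.array([2, 2, 2, 2, 4, 4, 4, 4])
y_pred = np.array([2, 2, 4, 4, 4, 4, 4, 4])

def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class label."""
    tp = np.sum((y_pred == label) & (y_true == label))  # correctly predicted
    fp = np.sum((y_pred == label) & (y_true != label))  # predicted but wrong
    fn = np.sum((y_pred != label) & (y_true == label))  # missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for label in (2, 4):
    p, r, f = per_class_scores(y_true, y_pred, label)
    print(label, round(p, 2), round(r, 2), round(f, 2))
# 2 1.0 0.5 0.67
# 4 0.67 1.0 0.8
```
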



In [43]:
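from sklearn.metrics import jaccard_similarity_score  # deprecated in scikit-learn 0.21 and removed in 0.23
jaccard_similarity_score(y_test, yhat)  # for binary or multiclass labels like these, this is equivalent to accuracy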



0.9781021897810219
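
Accuracy and the Jaccard index are different quantities, and on recent scikit-learn versions the function above is no longer available. The sketch below computes both by hand with NumPy on made-up labels; `jaccard_for_label` is a hypothetical helper, not a library function. In current scikit-learn releases the corresponding functions are `accuracy_score` and `jaccard_score`.

```python
import numpy as np

# Made-up labels using the same class codes as the dataset (2 and 4).
y_true = np.array([2, 2, 2, 4, 4, 4])
y_pred = np.array([2, 2, 4, 4, 4, 4])

# Accuracy: the fraction of positions where prediction and truth agree.
accuracy = np.mean(y_true == y_pred)

def jaccard_for_label(y_true, y_pred, label):
    """Size of the intersection over size of the union for one class."""
    in_true = y_true == label
    in_pred = y_pred == label
    return np.sum(in_true & in_pred) / np.sum(in_true | in_pred)

print(round(accuracy, 3))                              # 0.833
print(round(jaccard_for_label(y_true, y_pred, 2), 3))  # 0.667
print(round(jaccard_for_label(y_true, y_pred, 4), 3))  # 0.75
```
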