# SVM

This notebook uses the breast cancer dataset built into sklearn. The goal is to classify each tumor as malignant or benign.

### Import Libraries

pandas and NumPy are imported for data handling; matplotlib and seaborn are imported for visualization.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Read data

In [2]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

The dataset is returned as a dictionary-like object. Its keys show what is available for building the DataFrame.

In [3]:
data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
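As a quick check of what these keys hold, the sketch below prints the size of the feature matrix, the class names, and the number of samples per class. It only uses keys listed above; the variable name 'counts' is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Shape of the feature matrix: (n_samples, n_features)
print(data["data"].shape)

# Class names; the position in this array is the integer label in data["target"]
print([str(name) for name in data["target_names"]])

# Number of samples for each integer label (index 0, then index 1)
counts = np.bincount(data["target"])
print(counts)
```

The position of a name in 'target_names' is the label used in 'target', so label 0 is the first name printed and label 1 the second.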

The feature values ('data') and the feature names ('feature_names') are combined into a DataFrame.

In [4]:
df = pd.DataFrame(data['data'],columns=data['feature_names'])
df.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


### Splitting the data into training and testing set

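The 'train_test_split' function from sklearn holds out 30% of the samples as a test set. Because no 'random_state' is set, the split, and therefore the scores reported below, will vary from run to run.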

In [5]:
from sklearn.model_selection import train_test_split
X = df
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
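If a repeatable split is wanted, one option is to fix the random seed and stratify on the labels so both sets keep the same class proportions. The sketch below is a variant, not the split used in this notebook: 'random_state=42' and 'stratify=y' are assumptions added for illustration, and the data is loaded with 'return_X_y=True' only to keep the example self-contained.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load features and labels directly as arrays
X, y = load_breast_cancer(return_X_y=True)

# Assumed settings: a fixed seed for repeatability and stratification on y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Number of rows in each set, and the share of label 1 in each
print(X_train.shape[0], X_test.shape[0])
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

Scores obtained with this variant will not match the outputs shown in the following cells, since those came from an unseeded split.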

### Build the model

The 'SVC' class is imported from sklearn, instantiated with its default parameters, and fitted on the training set.

In [6]:
from sklearn.svm import SVC

In [7]:
model = SVC()
model.fit(X_train,y_train)

SVC()

### Predictions

In [8]:
pred = model.predict(X_test)

### Evaluation metrics

In [9]:
from sklearn.metrics import classification_report,confusion_matrix

In [10]:
confusion_matrix(y_test,pred)

array([[ 53,  13],
       [  2, 103]], dtype=int64)

In [11]:
print(classification_report(y_test,pred))

              precision    recall  f1-score   support

           0       0.96      0.80      0.88        66
           1       0.89      0.98      0.93       105

    accuracy                           0.91       171
   macro avg       0.93      0.89      0.90       171
weighted avg       0.92      0.91      0.91       171
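The figures in the report follow directly from the confusion matrix, where rows are true classes and columns are predicted classes. The sketch below recomputes them with NumPy from the matrix printed above; the variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([[53, 13],
               [2, 103]])

correct = np.diag(cm)                 # correctly classified samples per class
precision = correct / cm.sum(axis=0)  # divide by column totals (predicted counts)
recall = correct / cm.sum(axis=1)     # divide by row totals (true counts, the support)
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / cm.sum()

print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))
print("f1-score: ", np.round(f1, 2))
print("accuracy: ", round(float(accuracy), 2))
```

For class 0, for example, precision is 53 / (53 + 2) and recall is 53 / (53 + 13), which round to the 0.96 and 0.80 shown in the report.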



### Improving the model

The model can be improved further by choosing better parameters. This can be done efficiently with sklearn's 'GridSearchCV'. Defining a 'grid' of parameter values and trying every possible combination is called a grid search.

'GridSearchCV' takes the model to train and a dictionary describing the grid: the keys are the parameter names and the values are the lists of settings to be tested.

In [12]:
from sklearn.model_selection import GridSearchCV

The parameter 'C' controls the cost of misclassifying training samples. A large 'C' gives low bias and high variance; a small 'C' gives high bias and low variance. 'gamma' is the parameter of the Radial Basis Function (RBF) kernel. A small 'gamma' leads to high bias and low variance, and a large 'gamma' to the opposite.
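One way to see the effect of 'gamma' is to compare training and test accuracy for a large and a small value. The sketch below is illustrative only: the seed 'random_state=0', the fixed 'C=1', and the two 'gamma' values are assumptions picked for the comparison, and the exact numbers depend on the split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0  # assumed seed, for repeatability only
)

results = {}
for gamma in (1, 0.0001):
    clf = SVC(kernel="rbf", C=1, gamma=gamma).fit(X_train, y_train)
    train_acc = clf.score(X_train, y_train)
    test_acc = clf.score(X_test, y_test)
    results[gamma] = (train_acc, test_acc)
    # A large gap between the two scores indicates overfitting (high variance)
    print(f"gamma={gamma}: train={train_acc:.3f}, test={test_acc:.3f}")
```

On these unscaled features, the large 'gamma' should fit the training set far better than the test set, while the small 'gamma' should give similar scores on both.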

So a range of values for 'C' and 'gamma' is listed in the grid parameters dictionary to be tested.

In [13]:
grid_param = {'C': [0.1,1, 10, 100, 1000], 
              'gamma': [1,0.1,0.01,0.001,0.0001], 
              'kernel': ['rbf']}
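To see which combinations this grid contains, it can be expanded with sklearn's 'ParameterGrid'. This is a small sketch for inspection only; 'candidates' is a name introduced here.

```python
from sklearn.model_selection import ParameterGrid

grid_param = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}

# Expand the dictionary into one dict per parameter combination
candidates = list(ParameterGrid(grid_param))

# 5 values of C x 5 values of gamma x 1 kernel
print(len(candidates))

# Show the first few combinations
for params in candidates[:3]:
    print(params)
```

With 5-fold cross-validation, each of these combinations is fitted five times during the search.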

'GridSearchCV' takes the model along with the grid parameters. The 'verbose' argument controls how much text describing the search is printed; the higher the number, the more output.

In [15]:
grid = GridSearchCV(SVC(),grid_param,verbose=3)

Next the grid is fit on the training data. First, every parameter combination is evaluated with cross-validation to find the best one. Then, because 'refit' defaults to True, the model is fit once more with the best combination on all the data passed to 'fit', without cross-validation, to build a single final model.

In [16]:
grid.fit(X_train,y_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] .......... C=0.1, gamma=1, kernel=rbf, score=0.637, total=   0.0s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] .......... C=0.1, gamma=1, kernel=rbf, score=0.637, total=   0.0s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] .......... C=0.1, gamma=1, kernel=rbf, score=0.625, total=   0.0s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] .......... C=0.1, gamma=1, kernel=rbf, score=0.633, total=   0.0s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV] .......... C=0.1, gamma=1, kernel=rbf, score=0.633, total=   0.0s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s


[CV] ........ C=0.1, gamma=0.1, kernel=rbf, score=0.637, total=   0.0s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ........ C=0.1, gamma=0.1, kernel=rbf, score=0.637, total=   0.3s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ........ C=0.1, gamma=0.1, kernel=rbf, score=0.625, total=   0.0s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ........ C=0.1, gamma=0.1, kernel=rbf, score=0.633, total=   0.0s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV] ........ C=0.1, gamma=0.1, kernel=rbf, score=0.633, total=   0.0s
[CV] C=0.1, gamma=0.01, kernel=rbf ...................................
[CV] ....... C=0.1, gamma=0.01, kernel=rbf, score=0.637, total=   0.0s
[CV] C=0.1, gamma=0.01, kernel=rbf ...................................
[CV] ....... C=0.1, gamma=0.01, kernel=rbf, score=0.637, total=   0.0s
[CV] C=0.1, gamma=0.01, kernel=rbf ...................................
[CV] .

[CV] ....... C=10, gamma=0.001, kernel=rbf, score=0.886, total=   0.0s
[CV] C=10, gamma=0.001, kernel=rbf ...................................
[CV] ....... C=10, gamma=0.001, kernel=rbf, score=0.924, total=   0.0s
[CV] C=10, gamma=0.0001, kernel=rbf ..................................
[CV] ...... C=10, gamma=0.0001, kernel=rbf, score=0.938, total=   0.0s
[CV] C=10, gamma=0.0001, kernel=rbf ..................................
[CV] ...... C=10, gamma=0.0001, kernel=rbf, score=0.938, total=   0.0s
[CV] C=10, gamma=0.0001, kernel=rbf ..................................
[CV] ...... C=10, gamma=0.0001, kernel=rbf, score=0.900, total=   0.0s
[CV] C=10, gamma=0.0001, kernel=rbf ..................................
[CV] ...... C=10, gamma=0.0001, kernel=rbf, score=0.937, total=   0.0s
[CV] C=10, gamma=0.0001, kernel=rbf ..................................
[CV] ...... C=10, gamma=0.0001, kernel=rbf, score=0.911, total=   0.0s
[CV] C=100, gamma=1, kernel=rbf ......................................
[CV] .

[Parallel(n_jobs=1)]: Done 125 out of 125 | elapsed:    4.3s finished


GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.1, 1, 10, 100, 1000],
                         'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
                         'kernel': ['rbf']},
             verbose=3)
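What 'grid.fit' does can be approximated by hand: score each combination with cross-validation, keep the best, then refit on the full training set. The sketch below is a simplified illustration rather than sklearn's actual implementation; it uses a reduced grid and an assumed 'random_state=0' to stay quick and repeatable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import ParameterGrid, cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0  # assumed seed
)

# Reduced grid (an assumption, to keep the example short)
small_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.0001], "kernel": ["rbf"]}

best_score, best_params = -1.0, None
for params in ParameterGrid(small_grid):
    # Mean accuracy over 5 cross-validation folds for this combination
    score = cross_val_score(SVC(**params), X_train, y_train, cv=5).mean()
    if score > best_score:
        best_score, best_params = score, params

# Refit once on the whole training set with the best combination
final_model = SVC(**best_params).fit(X_train, y_train)
print(best_params, round(best_score, 3))
```

'GridSearchCV' wraps this loop and additionally records per-fold scores and timings.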

The best parameter setting is available in the 'best_params_' attribute, and the model refit with that setting is available in the 'best_estimator_' attribute.

In [17]:
grid.best_params_

{'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}

In [18]:
grid.best_estimator_

SVC(C=1, gamma=0.0001)
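Beyond the single best setting, the fitted search also stores the score of every combination in 'cv_results_', which is convenient to view as a DataFrame. The sketch below refits a reduced grid so that it runs on its own; the grid values and 'random_state=0' are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0  # assumed seed
)

# Reduced grid (an assumption, to keep the example quick)
small_grid = {"C": [1, 10], "gamma": [0.001, 0.0001], "kernel": ["rbf"]}
grid = GridSearchCV(SVC(), small_grid)
grid.fit(X_train, y_train)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(grid.cv_results_)
cols = ["param_C", "param_gamma", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))

# Mean cross-validated score of the best combination
print(grid.best_score_)
```

Looking at the full table helps to judge whether the best setting is a clear winner or only marginally ahead of its neighbours.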

Predictions can now be made with the tuned model; calling 'predict' on the grid uses the refit best estimator.

In [19]:
pred2 = grid.predict(X_test)

In [21]:
confusion_matrix(y_test,pred2)

array([[ 59,   7],
       [  2, 103]], dtype=int64)

In [22]:
print(classification_report(y_test,pred2))

              precision    recall  f1-score   support

           0       0.97      0.89      0.93        66
           1       0.94      0.98      0.96       105

    accuracy                           0.95       171
   macro avg       0.95      0.94      0.94       171
weighted avg       0.95      0.95      0.95       171



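So using the grid search method, the model performance has increased: test accuracy rose from 0.91 to 0.95, and the number of class 0 samples misclassified as class 1 fell from 13 to 7.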