# Evaluating SVM on Breast Cancer Dataset


### Introduction

This notebook explores the Breast Cancer dataset using Support Vector Machine (SVM) classifiers and compares the results with logistic regression and kNN classifiers. The goal is to better understand SVM models and their relative advantages over other machine learning methods.

### Import Packages

In [37]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

from ipywidgets import *
from IPython.display import display

from sklearn.svm import SVC

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [38]:
from sklearn.svm import SVC, LinearSVC
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

### Load and prepare the data

- Are there any missing values? Impute or clean if so.
- Select a classification target and predictors.
- Determine the baseline for accuracy.
- Rescale the data.

In [14]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = pd.DataFrame(data.data,columns=data.feature_names)
y = data.target

In [16]:
X.head(2)

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
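The preparation steps listed above (missing values, baseline, rescaling) can be sketched as follows. This is a minimal illustration rather than the notebook's original cell; the names `n_missing`, `class_share` and `baseline` are introduced here for the example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

# Count missing values across all predictors.
n_missing = int(X.isnull().sum().sum())
print('Missing values: {}'.format(n_missing))

# Baseline accuracy: the score obtained by always predicting the most frequent class.
class_share = y.value_counts(normalize=True)
baseline = class_share.max()
print('Baseline accuracy: {:.4f}'.format(baseline))

# Rescale every predictor to zero mean and unit variance.
X_ss = StandardScaler().fit_transform(X)
```

Scaling matters for SVM and kNN in particular, because both rely on distances between samples and the raw features span very different ranges (for example `mean area` versus `mean smoothness`).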


### Build an SVM classifier on the data

For details on the SVM classifier, see [SVM-classifier](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

- Initialize and train a linear SVM with the default settings. What is the average accuracy score with 5-fold cross validation?
- Repeat using a radial basis function (rbf) classifier. Compare the scores. Which one is better?
- Print the confusion matrix and classification report for your models.

- [Classification report](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)

- Confusion matrix:

 ```python
df_confusion = pd.crosstab(y_true, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
```
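As a hedged illustration of the two reporting tools above, the sketch below builds out-of-fold predictions and passes them to `pd.crosstab` and `classification_report`. The scaling pipeline is an assumption added here to keep the linear SVM fast to fit; it is not part of the original exercise text.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X, y_true = data.data, data.target

# Out-of-fold predictions: each sample is predicted by a model that never saw it.
model = make_pipeline(StandardScaler(), SVC(kernel='linear'))
y_pred = cross_val_predict(model, X, y_true, cv=5)

# Rows are actual classes, columns are predicted classes, 'All' holds the totals.
df_confusion = pd.crosstab(y_true, y_pred, rownames=['Actual'],
                           colnames=['Predicted'], margins=True)
print(df_confusion)
print(classification_report(y_true, y_pred, target_names=data.target_names))
```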

### Comparing Kernels

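Here the linear kernel is considerably more accurate than the default kernel (`rbf`): about 94.6% versus 62.7% mean accuracy under 5-fold cross-validation. The default score matches the share of the majority class in the data, which suggests that the RBF model, fitted on unscaled features, predicts a single class for every sample. Given the large gap, I then test further parameters with a grid search.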

In [38]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn import svm, linear_model, datasets

clf = svm.SVC()
clf.fit(X,y)
print(np.mean(cross_val_score(clf, X, y, cv=5)))

0.627425933051


In [41]:
clf = svm.SVC(kernel='linear')
clf.fit(X,y)
print(np.mean(cross_val_score(clf, X, y, cv=5)))

0.945563678338
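One possible explanation for the weak RBF score is feature scale rather than the kernel itself. The sketch below compares an RBF SVM on raw and standardized features. It assumes `gamma='auto'`, the default shown in the estimator output recorded in this notebook; newer scikit-learn versions use a different default, so the raw score may differ there.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# gamma='auto' means 1 / n_features, regardless of how the features are scaled.
rbf_raw = SVC(kernel='rbf', gamma='auto')

# The pipeline refits the scaler inside every training fold, avoiding leakage.
rbf_scaled = make_pipeline(StandardScaler(), SVC(kernel='rbf', gamma='auto'))

raw_score = np.mean(cross_val_score(rbf_raw, X, y, cv=5))
scaled_score = np.mean(cross_val_score(rbf_scaled, X, y, cv=5))

print('RBF on raw features:    {:.3f}'.format(raw_score))
print('RBF on scaled features: {:.3f}'.format(scaled_score))
```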


### Tune the SVM classifiers with gridsearch


In [46]:
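# Note: 'gamma' is ignored by the linear kernel, so the two gamma values
# only duplicate each C candidate (8 candidates, 4 distinct models).
param_grid = [
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['linear']},
]
SVM_gridsearch = GridSearchCV(svm.SVC(), param_grid, cv=5, verbose=1)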

In [None]:
# Hold-out split (not used by the cells below, which cross-validate on the full data).
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.8)

In [47]:
SVM_gridsearch.fit(X,y)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed:  4.2min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'kernel': ['linear'], 'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)

In [49]:
SVM_gridsearch.best_params_

{'C': 100, 'gamma': 0.001, 'kernel': 'linear'}
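Because `gamma` has no effect on a linear kernel, a grid that mixes kernels is usually written as a list of dictionaries, one per kernel. The following is a sketch under assumptions: the step names `scale` and `svc`, the C and gamma values, and the use of a scaling pipeline are choices made for this example, not the notebook's original search.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Scaling inside the pipeline keeps each CV fold's scaler fitted on training data only.
pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC())])

# One dictionary per kernel, so gamma is only searched where it is used.
param_grid = [
    {'svc__kernel': ['linear'], 'svc__C': [0.1, 1, 10, 100]},
    {'svc__kernel': ['rbf'], 'svc__C': [0.1, 1, 10, 100],
     'svc__gamma': [0.01, 0.001, 0.0001]},
]

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print('Best CV accuracy: {:.3f}'.format(search.best_score_))
```

Standardizing the features first also tends to make the linear kernel converge much faster than the 4.2 minutes recorded above for unscaled data.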

### Model Best Parameters and Evaluate Results
The best model is fitted below on standardized features. In the confusion matrix and classification report, precision and recall are nearly equal within each class, meaning that the model is about as reliable at predicting class 0 as class 1. The benchmark is 62.7%, the share of samples in class 1 (benign, the majority class), so all of the scores are clear improvements over always predicting the majority class. Such a balanced benchmark is not always available, as positive cases are rare events in many disease datasets.

In [50]:
yhat = SVM_gridsearch.predict(X)

In [30]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()

X_ss = ss.fit_transform(X)

In [83]:
clf = svm.SVC(C = 100, gamma = 0.001, kernel='linear')
clf.fit(X_ss,y)

SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.001, kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [88]:
yhat = cross_val_predict(clf, X_ss, y, cv=5)

In [90]:
from sklearn.metrics import confusion_matrix

print(metrics.accuracy_score(y, yhat))
print(metrics.confusion_matrix(y, yhat))
print(metrics.classification_report(y, yhat))

0.959578207381
[[201  11]
 [ 12 345]]
             precision    recall  f1-score   support

          0       0.94      0.95      0.95       212
          1       0.97      0.97      0.97       357

avg / total       0.96      0.96      0.96       569



In [28]:
print('Benchmark: {}'.format(float(sum(y)) / len(y)))

Benchmark: 0.627416520211
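The figures in the classification report can be recovered by hand from the confusion matrix printed above. The short worked example below does the arithmetic with NumPy; the variable names are introduced here for illustration.

```python
import numpy as np

# Confusion matrix reported above: rows are actual classes, columns are predicted.
cm = np.array([[201, 11],
               [12, 345]])

total = float(cm.sum())

# Accuracy: correct predictions (the diagonal) over all samples.
accuracy = np.trace(cm) / total

# Precision per class: correct predictions over everything predicted as that class.
precision = np.diag(cm) / cm.sum(axis=0).astype(float)

# Recall per class: correct predictions over everything actually in that class.
recall = np.diag(cm) / cm.sum(axis=1).astype(float)

# Benchmark: share of the largest actual class.
baseline = cm.sum(axis=1).max() / total

print('accuracy:  {:.4f}'.format(accuracy))
print('precision: {}'.format(np.round(precision, 2)))
print('recall:    {}'.format(np.round(recall, 2)))
print('baseline:  {:.4f}'.format(baseline))
```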


### Compare kNN and logistic regression on the dataset.

This section compares the SVM scores with kNN and logistic regression. All three models score very similarly, with roughly 95-96% average precision and recall. The main differences lie in how well each model predicts the individual classes. For example, logistic regression has the lowest precision on class 0 (0.93), while kNN has the highest (0.97).

### k-Nearest-Neighbors

In [71]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [74]:
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')

In [75]:
knn.fit(X,y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [76]:
knn_params = {
    'n_neighbors':[1,3,5,9,15,21],
    'weights':['uniform','distance'],
    'metric' : ['euclidean','manhattan'] }

In [77]:
knn_gridsearch = GridSearchCV(
    KNeighborsClassifier(), knn_params, n_jobs=-1, cv=5, verbose=1)

In [78]:
knn_gridsearch.fit(X, y)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:    0.4s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'n_neighbors': [1, 3, 5, 9, 15, 21], 'metric': ['euclidean', 'manhattan'], 'weights': ['uniform', 'distance']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)

In [80]:
knn_gridsearch.best_params_

{'metric': 'manhattan', 'n_neighbors': 9, 'weights': 'uniform'}

In [31]:
knn = KNeighborsClassifier(n_neighbors=9, weights='uniform', metric='manhattan')
# cross_val_predict refits a clone of knn on each fold, so no separate fit is needed.
yhat = cross_val_predict(knn, X_ss, y, cv=5)

print(metrics.accuracy_score(y, yhat))
print(metrics.confusion_matrix(y, yhat))
print(metrics.classification_report(y, yhat))

0.957820738137
[[194  18]
 [  6 351]]
             precision    recall  f1-score   support

          0       0.97      0.92      0.94       212
          1       0.95      0.98      0.97       357

avg / total       0.96      0.96      0.96       569



### Logistic Regression

In [32]:
from sklearn.linear_model import LogisticRegression

In [34]:
# A very large C makes the L2 penalty negligible (effectively no regularization).
logreg = LogisticRegression(C=10**10, solver='lbfgs')
logreg.fit(X_ss, y)


LogisticRegression(C=10000000000, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)

In [35]:
# cross_val_predict refits a clone of logreg on each fold.
yhat = cross_val_predict(logreg, X_ss, y, cv=5)

print(metrics.accuracy_score(y, yhat))
print(metrics.confusion_matrix(y, yhat))
print(metrics.classification_report(y, yhat))

0.954305799649
[[201  11]
 [ 15 342]]
             precision    recall  f1-score   support

          0       0.93      0.95      0.94       212
          1       0.97      0.96      0.96       357

avg / total       0.95      0.95      0.95       569
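The three classifiers can also be compared in a single loop. The sketch below reuses the hyperparameters from the cells above but wraps each model in a scaling pipeline, which is an assumption of this example: the scaler is refitted per fold, so the scores may differ slightly from the outputs recorded in this notebook. The logistic regression may also emit a convergence warning on recent scikit-learn versions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Hyperparameters taken from the cells above.
models = {
    'SVM (linear)': SVC(C=100, kernel='linear'),
    'kNN': KNeighborsClassifier(n_neighbors=9, weights='uniform', metric='manhattan'),
    'Logistic regression': LogisticRegression(C=10**10, solver='lbfgs'),
}

results = {}
for name, model in models.items():
    # Standardize inside the pipeline so every fold is scaled on its training part only.
    pipe = make_pipeline(StandardScaler(), model)
    yhat = cross_val_predict(pipe, X, y, cv=5)
    results[name] = {
        'accuracy': accuracy_score(y, yhat),
        'precision_class_0': precision_score(y, yhat, pos_label=0),
    }
    print('{}: accuracy={:.3f}, class-0 precision={:.3f}'.format(
        name, results[name]['accuracy'], results[name]['precision_class_0']))
```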



### Conclusion

Support Vector Machine models provide a powerful alternative to models such as logistic regression and k-Nearest Neighbors. This project explored the parameters that can be used to tune SVMs and optimised a model for the Breast Cancer dataset. The SVM model was then compared to kNN and logistic regression models. While all three had high accuracy, each excelled at predicting a different part of the data. The choice of the ideal model therefore depends on how much the user values accuracy on each class.