# Support Vector Machines Lab

In this lab you will fit SVM classifiers to several datasets and compare them with logistic regression and kNN classifiers.

Your datasets folder has these four datasets to choose from for the lab:

**Breast cancer**

    ./DSI-SF-2/datasets/breast_cancer_wisconsin

**Spambase**

    ./DSI-SF-2/datasets/spam

**Car evaluation**

    ./DSI-SF-2/datasets/car_evaluation
    
**Mushroom**

    ./DSI-SF-2/datasets/car_evaluation


In [12]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

from ipywidgets import *
from IPython.display import display
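# GridSearchCV and cross_val_score live in sklearn.model_selection;
# the older grid_search and cross_validation modules have been removed.
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC, LinearSVC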
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## 1. Breast Cancer

### Load the Data

- Are there any missing values? Impute or clean if so.
- Select a classification target and predictors.
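A column that pandas reads as `object` dtype can hide missing values behind a placeholder string, which `isnull()` will not report. The sketch below shows one way to surface such placeholders on a toy frame; the frame and its values are made up for illustration and are not the lab data.

```python
import pandas as pd

# Toy frame standing in for a real dataset (values are illustrative only).
df = pd.DataFrame({
    'Bare_Nuclei': ['1', '10', '?', '3'],
    'Mitoses': [1, 2, 1, 1],
})

# isnull() finds nothing here because '?' is an ordinary string.
print(df.isnull().sum())

# Coerce to numeric: values that cannot be parsed become NaN and can be counted.
numeric = pd.to_numeric(df['Bare_Nuclei'], errors='coerce')
n_bad = int(numeric.isnull().sum())
print('unparseable values:', n_bad)

# Option 1: drop the affected rows.
dropped = df[numeric.notnull()].copy()

# Option 2: impute, here with the column median.
imputed = numeric.fillna(numeric.median())
```

Dropping is the simpler choice when only a few rows are affected; imputing keeps every row at the cost of inventing values.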

In [2]:
bc = pd.read_csv('../Datasets/breast_cancer.csv')

In [3]:
bc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
Sample_code_number             699 non-null int64
Clump_Thickness                699 non-null int64
Uniformity_of_Cell_Size        699 non-null int64
Uniformity_of_Cell_Shape       699 non-null int64
Marginal_Adhesion              699 non-null int64
Single_Epithelial_Cell_Size    699 non-null int64
Bare_Nuclei                    699 non-null object
Bland_Chromatin                699 non-null int64
Normal_Nucleoli                699 non-null int64
Mitoses                        699 non-null int64
Class                          699 non-null int64
dtypes: int64(10), object(1)
memory usage: 60.1+ KB


In [4]:
bc.isnull().sum()

Sample_code_number             0
Clump_Thickness                0
Uniformity_of_Cell_Size        0
Uniformity_of_Cell_Shape       0
Marginal_Adhesion              0
Single_Epithelial_Cell_Size    0
Bare_Nuclei                    0
Bland_Chromatin                0
Normal_Nucleoli                0
Mitoses                        0
Class                          0
dtype: int64

In [5]:
for col in bc.columns:
    print(bc[col].value_counts())

1182404    6
1276091    5
1198641    3
466906     2
1116116    2
1070935    2
385103     2
1293439    2
1240603    2
1277792    2
1168736    2
560680     2
1174057    2
822829     2
320675     2
897471     2
1114570    2
1339781    2
654546     2
704097     2
1017023    2
734111     2
1354840    2
769612     2
411453     2
1158247    2
1321942    2
1061990    2
733639     2
1218860    2
          ..
1096352    1
255644     1
1201870    1
1169049    1
1041043    1
1190546    1
1071760    1
797327     1
1293966    1
1214092    1
1184241    1
432809     1
1201834    1
1125035    1
888523     1
1182410    1
640712     1
1018561    1
1336798    1
1091262    1
1173216    1
1286943    1
1319609    1
1172152    1
558538     1
1207986    1
1302428    1
857774     1
1181356    1
625201     1
Name: Sample_code_number, dtype: int64
1     145
5     130
3     108
4      80
10     69
2      50
8      46
6      34
7      23
9      14
Name: Clump_Thickness, dtype: int64
1     384
10     67
3      52
2 

In [6]:
# Drop rows where Bare_Nuclei holds the '?' placeholder.
# .copy() makes bc_v2 an independent frame, which avoids SettingWithCopyWarning below.
bc_v2 = bc[bc['Bare_Nuclei'] != '?'].copy()
bc_v2['Bare_Nuclei'] = pd.to_numeric(bc_v2['Bare_Nuclei'])
bc_v2['Class'] = bc_v2['Class'].map(lambda x: 'Benign' if x == 2 else 'Malignant')
print(bc_v2['Class'].value_counts())
bc_v2.info()

Benign       444
Malignant    239
Name: Class, dtype: int64
<class 'pandas.core.frame.DataFrame'>
Int64Index: 683 entries, 0 to 698
Data columns (total 11 columns):
Sample_code_number             683 non-null int64
Clump_Thickness                683 non-null int64
Uniformity_of_Cell_Size        683 non-null int64
Uniformity_of_Cell_Shape       683 non-null int64
Marginal_Adhesion              683 non-null int64
Single_Epithelial_Cell_Size    683 non-null int64
Bare_Nuclei                    683 non-null int64
Bland_Chromatin                683 non-null int64
Normal_Nucleoli                683 non-null int64
Mitoses                        683 non-null int64
Class                          683 non-null object
dtypes: int64(10), object(1)
memory usage: 64.0+ KB


In [7]:
y = bc_v2['Class'].reset_index(drop=True)
X = bc_v2[[col for col in bc_v2.columns if col not in ['Class']]]

print(X.shape, y.shape)

(683, 10) (683,)


## 2. Modeling

For details on the SVM classifier, see here:

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

- What's the baseline for the accuracy?
- Initialize and train a linear SVM. What's the average accuracy score with 5-fold cross-validation?
- Repeat using a radial basis function (rbf) classifier. Compare the scores. Which one is better?
- Print a confusion matrix and classification report for your best model using training & testing data.
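The first three steps can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`, not the lab data; the names `X_demo`, `y_demo`, and `results` are assumptions introduced for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic, mildly imbalanced stand-in for a real dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5,
    weights=[0.65, 0.35], random_state=0)

# Baseline: accuracy obtained by always predicting the majority class.
baseline = np.bincount(y_demo).max() / len(y_demo)

# Mean 5-fold cross-validated accuracy for each kernel.
results = {}
for kernel in ['linear', 'rbf']:
    scores = cross_val_score(SVC(kernel=kernel), X_demo, y_demo, cv=5)
    results[kernel] = scores.mean()
    print(kernel, results[kernel])

print('baseline', baseline)
```

A model is only informative if its cross-validated accuracy clearly exceeds the baseline.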

Classification report:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

Confusion matrix:

```python
df_confusion = pd.crosstab(y_true, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
```
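As a small worked illustration of both tools, the sketch below uses hand-written labels (not lab results) so the counts are easy to verify by eye. `confusion_matrix` from `sklearn.metrics` is shown alongside `pd.crosstab` as an alternative that returns a plain array.

```python
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Hand-written labels for illustration only.
y_true = pd.Series(['Benign', 'Benign', 'Malignant', 'Malignant', 'Benign'])
y_pred = pd.Series(['Benign', 'Malignant', 'Malignant', 'Malignant', 'Benign'])

# Per-class precision, recall, and F1.
print(classification_report(y_true, y_pred))

# Labeled confusion table with row and column totals.
df_confusion = pd.crosstab(y_true, y_pred,
                           rownames=['Actual'], colnames=['Predicted'],
                           margins=True)
print(df_confusion)

# The same counts as an array; rows are actual, columns are predicted.
cm = confusion_matrix(y_true, y_pred, labels=['Benign', 'Malignant'])
print(cm)
```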

In [9]:
svc = LinearSVC()
svc_best = svc.fit(X,y)

In [None]:
svc = SVC(kernel='linear')
svc.fit(X,y)

In [10]:
# # Set up our GridSearch parameters
# search_parameters = {
#     'C':  [1,10,100,1000],
#     'kernel':  ['linear']
# }

# # Initialize a blank model object
# svc = SVC()

# # Initialize gridsearch
# grid = GridSearchCV(svc, search_parameters, cv=5, verbose=1)
# svc_best = grid.fit(X,y).best_estimator_

In [16]:
# Run cross-validation once and reuse the scores
scores = cross_val_score(svc_best, X, y, cv=5)
print('Cross-Val: ', scores)
print('SVC: ', np.mean(scores))
print('Baseline: ', len(bc_v2[bc_v2['Class']=='Benign'])/float(len(bc_v2)))

Cross-Val:  [ 0.64963504  0.64963504  0.64963504  0.64963504  0.34814815]
SVC:  0.590224384969
Baseline:  0.650073206442


In [18]:
svc = SVC()
svc_best = svc.fit(X,y)

In [19]:
# Run cross-validation once and reuse the scores
scores = cross_val_score(svc_best, X, y, cv=5)
print('Cross-Val: ', scores)
print('SVC: ', np.mean(scores))
print('Baseline: ', len(bc_v2[bc_v2['Class']=='Benign'])/float(len(bc_v2)))

Cross-Val:  [ 0.66423358  0.64963504  0.66423358  0.64963504  0.65185185]
SVC:  0.655917815626
Baseline:  0.650073206442
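Both cross-validated scores sit close to the baseline. Two things that may contribute, given how `X` was built, are that it still contains the `Sample_code_number` identifier and that the features are unscaled, which SVMs tend to be sensitive to. The sketch below shows one way to handle both, using synthetic data with a hypothetical `sample_id` column; its numbers say nothing about the lab data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic features plus a large-magnitude, uninformative ID column (hypothetical).
features, target = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=0)
frame = pd.DataFrame(features, columns=['f%d' % i for i in range(9)])
rng = np.random.RandomState(0)
frame['sample_id'] = rng.randint(100000, 1500000, size=len(frame))

# Drop the identifier so it cannot dominate distance-based kernels.
X_clean = frame.drop(columns=['sample_id'])

# Scaling inside the pipeline is refit on each training fold,
# so no information leaks from the validation fold.
model = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
scores_clean = cross_val_score(model, X_clean, target, cv=5)
print(scores_clean.mean())
```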


In [20]:
y_hat = svc_best.predict(X)

In [23]:
df_confusion = pd.crosstab(y, y_hat, rownames=['Actual'], colnames=['Predicted'], margins=True)
df_confusion

Predicted  Benign  Malignant  All
Actual
Benign        444          0  444
Malignant       0        239  239
All           444        239  683
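The confusion matrix above is built from predictions on the same rows the model was fit on, so it describes training fit rather than generalization. A minimal sketch of the training and testing comparison the exercise asks for, on synthetic data with assumed names (`X_demo`, `y_demo`, `clf`), could look like this:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a real dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=0)

# Hold out 30% of the rows; stratify to keep class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

clf = SVC().fit(X_train, y_train)

# Report on both splits; a large gap between them suggests overfitting.
for name, X_part, y_part in [('train', X_train, y_train),
                             ('test', X_test, y_test)]:
    pred = clf.predict(X_part)
    print(name)
    print(pd.crosstab(y_part, pred, rownames=['Actual'],
                      colnames=['Predicted'], margins=True))
    print(classification_report(y_part, pred))
```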


## 3. Perform the steps above with the car or mushroom dataset

Repeat each step.

## 4. Compare SVM, kNN and logistic regression using spam data

You should:

- Gridsearch optimal parameters for each model (for SVM, just gridsearch C and kernel).
- Cross-validate scores.
- Examine confusion matrices and classification reports.
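One possible way to organize the grid searches is to loop over the three estimators with a parameter grid each. The sketch below uses synthetic data and illustrative grid values; the names `candidates` and `best`, and the specific parameter ranges, are assumptions rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the spam data.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# One estimator and one illustrative grid per model.
candidates = {
    'svm': (SVC(), {'clf__C': [0.1, 1, 10], 'clf__kernel': ['linear', 'rbf']}),
    'knn': (KNeighborsClassifier(), {'clf__n_neighbors': [3, 5, 11]}),
    'logreg': (LogisticRegression(max_iter=1000), {'clf__C': [0.1, 1, 10]}),
}

best = {}
for name, (estimator, grid) in candidates.items():
    # Scale inside the pipeline so each CV fold is scaled independently.
    pipe = Pipeline([('scale', StandardScaler()), ('clf', estimator)])
    search = GridSearchCV(pipe, grid, cv=5)
    search.fit(X_demo, y_demo)
    best[name] = (search.best_params_, search.best_score_)
    print(name, search.best_params_, round(search.best_score_, 3))
```

`search.best_estimator_` can then be passed to the confusion matrix and classification report steps.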

Bonus: 

Plot "learning curves" for the best model of each type. This is a great way to see how training set size affects the training and testing scores. Look at the documentation for how to use this function in sklearn.

http://scikit-learn.org/stable/modules/learning_curve.html#learning-curves
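A minimal sketch of `sklearn.model_selection.learning_curve`, again on synthetic data with assumed names, might look like the following. It averages the per-fold scores at each training size and plots them.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Synthetic stand-in for a real dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Scores are returned with shape (n_train_sizes, n_cv_folds).
train_sizes, train_scores, test_scores = learning_curve(
    SVC(kernel='rbf'), X_demo, y_demo, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5))

# Average across folds and plot both curves against training set size.
plt.plot(train_sizes, train_scores.mean(axis=1), marker='o', label='Training score')
plt.plot(train_sizes, test_scores.mean(axis=1), marker='o', label='Cross-validation score')
plt.xlabel('Training examples')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
```

Curves that converge at a low score suggest underfitting, while a persistent gap between them suggests overfitting.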