# Support Vector Machines Lab

In this lab you will explore several datasets, comparing SVM classifiers against logistic regression and kNN classifiers.

Your datasets folder has these four datasets to choose from for the lab:

**Breast cancer**

    ./DSI-SF-2/datasets/breast_cancer_wisconsin

**Spambase**

    ./DSI-SF-2/datasets/spam

**Car evaluation**

    ./DSI-SF-2/datasets/car_evaluation
    
**Mushroom**

    ./DSI-SF-2/datasets/car_evaluation


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

from ipywidgets import *
from IPython.display import display

from sklearn.svm import SVC

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## 1. Breast Cancer



### Load the Data

- Are there any missing values? Impute or clean if so.
- Select a classification target and predictors.
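As a rough sketch of the missing-value check, the snippet below uses a small made-up frame (`toy` is hypothetical, not the lab data) in which `?` marks a missing entry. It coerces the column to numeric so the placeholder becomes `NaN`, counts the gaps, and imputes the median. Median imputation is only one option; a fixed fill value or dropping the rows are others.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data;
# '?' plays the role of a missing-value placeholder.
toy = pd.DataFrame({
    'Clump_Thickness': [5, 3, 6, 4],
    'Bare_Nuclei': ['1', '?', '10', '2'],
})

# An object dtype on a column that should be numeric hints at placeholders.
print(toy.dtypes)

# Coerce to numeric: anything unparseable (here '?') becomes NaN.
toy['Bare_Nuclei'] = pd.to_numeric(toy['Bare_Nuclei'], errors='coerce')
n_missing = toy['Bare_Nuclei'].isnull().sum()
print(n_missing)

# Impute the missing entries with the column median.
median = toy['Bare_Nuclei'].median()
toy['Bare_Nuclei'] = toy['Bare_Nuclei'].fillna(median)
```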

In [25]:
filepath ='/Users/tlee010/desktop/DSI-SF-2-timdavidlee/datasets/breast_cancer_wisconsin/breast_cancer.csv'
bc = pd.read_csv(filepath)

# Replace the '?' placeholder in the Bare_Nuclei column with 0 and cast to int
bc.Bare_Nuclei = bc.Bare_Nuclei.map(lambda x: int(x.replace('?','0')))
# Recode the target: 4 (malignant) -> 1, 2 (benign) -> 0
bc['Class'] = bc['Class'].map(lambda x: 1 if x == 4 else 0)
print(bc.info())
bc.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
Sample_code_number             699 non-null int64
Clump_Thickness                699 non-null int64
Uniformity_of_Cell_Size        699 non-null int64
Uniformity_of_Cell_Shape       699 non-null int64
Marginal_Adhesion              699 non-null int64
Single_Epithelial_Cell_Size    699 non-null int64
Bare_Nuclei                    699 non-null int64
Bland_Chromatin                699 non-null int64
Normal_Nucleoli                699 non-null int64
Mitoses                        699 non-null int64
Class                          699 non-null int64
dtypes: int64(11)
memory usage: 60.1 KB
None


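| | Sample_code_number | Clump_Thickness | Uniformity_of_Cell_Size | Uniformity_of_Cell_Shape | Marginal_Adhesion | Single_Epithelial_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 0 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 0 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 0 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 0 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 0 |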


In [26]:
#=====================
bc['Class'].unique()

array([0, 1])

### Modeling

For details on the SVM classifier, see here:

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

- What's the baseline accuracy?
- Initialize and train a linear SVM. What's the average accuracy score with 5-fold cross-validation?
- Repeat using a radial basis function (rbf) kernel. Compare the scores. Which one is better?
- Print a confusion matrix and classification report for your best model using training and testing data.

Classification report:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

Confusion matrix:

```python
df_confusion = pd.crosstab(y_true, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
```
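To make the two hints concrete, here is a minimal sketch on made-up labels (the `y_true` and `y_pred` values are invented for illustration). It prints a classification report, builds the crosstab shown above, and also computes scikit-learn's `confusion_matrix` for comparison.

```python
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels purely for illustration (not model output).
y_true = pd.Series([0, 0, 0, 1, 1, 1, 0, 1])
y_pred = pd.Series([0, 0, 1, 1, 1, 0, 0, 1])

# Per-class precision, recall, and F1.
print(classification_report(y_true, y_pred))

# Confusion matrix as a labeled table, with row/column totals.
df_confusion = pd.crosstab(y_true, y_pred, rownames=['Actual'],
                           colnames=['Predicted'], margins=True)
print(df_confusion)

# The same counts as a plain array: rows are actual, columns are predicted.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```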

In [28]:
# Share of positive cases: 1 = original class 4, 0 = original class 2.
# There are more 0s than 1s, so the baseline accuracy (always predicting 0) is 1 minus this mean.
bc['Class'].mean()

0.3447782546494993
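Since the mean of a 0/1 target is the share of 1s, the baseline accuracy is the share of the majority class. With the value above, that works out to roughly 1 - 0.345 = 0.655. A minimal sketch of the same arithmetic on a made-up target (`y_demo` is hypothetical):

```python
import numpy as np

# Hypothetical binary target with more 0s than 1s.
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])

# Share of 1s in the target.
positive_rate = y_demo.mean()

# Accuracy obtained by always predicting the majority class.
baseline = max(positive_rate, 1 - positive_rate)
print(baseline)
```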

In [33]:
import patsy
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Use every column except the ID and the target as a predictor
formula = 'Class ~ ' + '+'.join([x for x in bc.columns.values if x not in ('Sample_code_number', 'Class')])
print(formula)
y, X = patsy.dmatrices(formula, bc)

# Standardize the predictors, since SVMs are sensitive to feature scale
sclr = StandardScaler()
X = sclr.fit_transform(X)
y = np.ravel(y)  # flatten the (n, 1) target into a 1-D array





Class ~ Clump_Thickness+Uniformity_of_Cell_Size+Uniformity_of_Cell_Shape+Marginal_Adhesion+Single_Epithelial_Cell_Size+Bare_Nuclei+Bland_Chromatin+Normal_Nucleoli+Mitoses
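The remaining modeling steps (linear and rbf kernels scored with 5-fold cross-validation) might look roughly like the sketch below. It assumes a recent scikit-learn that provides `sklearn.model_selection`, and it uses synthetic data from `make_classification` as a stand-in for the prepared `X` and `y`. The names `X_demo`, `y_demo`, and `scores` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption): 9 features, like the lab's predictors.
X_demo, y_demo = make_classification(n_samples=300, n_features=9,
                                     n_informative=5, random_state=0)

scores = {}
for kernel in ('linear', 'rbf'):
    # Scaling inside the pipeline is refit on each training fold.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    scores[kernel] = cross_val_score(model, X_demo, y_demo, cv=5)
    print(kernel, scores[kernel].mean())
```

Putting the scaler inside the pipeline keeps each validation fold out of the scaling statistics, which is a slightly stricter setup than scaling the whole matrix up front.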


## 2. Perform the steps above with the car or mushroom dataset

Repeat each step.
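One extra step is likely needed here, assuming the predictors in these datasets are categorical strings: they have to be encoded numerically before an SVM can use them. A minimal sketch with `pd.get_dummies` on a made-up frame follows. The column names and values in `cars` are hypothetical and may not match the real files.

```python
import pandas as pd

# Hypothetical categorical data; real column names and levels may differ.
cars = pd.DataFrame({
    'buying': ['low', 'high', 'med', 'low'],
    'safety': ['high', 'low', 'med', 'high'],
    'acceptability': ['acc', 'unacc', 'acc', 'acc'],
})

# One-hot encode the predictors, dropping the first level of each column.
X_cat = pd.get_dummies(cars[['buying', 'safety']], drop_first=True)

# Binary target: 1 if acceptable, else 0.
y_cat = (cars['acceptability'] == 'acc').astype(int)

print(X_cat.columns.tolist())
```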

## 3. Compare SVM, kNN and logistic regression using spam data

You should:

- Gridsearch optimal parameters for each model (for SVM, just gridsearch C and kernel).
- Cross-validate scores.
- Examine confusion matrices and classification reports.
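The comparison could be organized along the lines of the sketch below. It assumes a recent scikit-learn with `sklearn.model_selection`, and it runs on synthetic data rather than the spam file. The parameter grids are small illustrative choices, not recommended values, and `candidates` and `results` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the spam features and labels (an assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

# One (estimator, parameter grid) pair per model; grids are illustrative.
candidates = {
    'svm': (SVC(), {'clf__C': [0.1, 1, 10], 'clf__kernel': ['linear', 'rbf']}),
    'knn': (KNeighborsClassifier(), {'clf__n_neighbors': [3, 5, 11]}),
    'logreg': (LogisticRegression(max_iter=1000), {'clf__C': [0.1, 1, 10]}),
}

results = {}
for name, (clf, grid) in candidates.items():
    pipe = Pipeline([('scale', StandardScaler()), ('clf', clf)])
    search = GridSearchCV(pipe, grid, cv=5)  # 5-fold CV over the grid
    search.fit(X_train, y_train)
    # Best parameters, mean CV accuracy, and held-out test accuracy.
    results[name] = (search.best_params_, search.best_score_,
                     search.score(X_test, y_test))
    print(name, results[name])
```

From here, `search.predict(X_test)` gives the predictions needed for a confusion matrix and classification report for each model.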

Bonus: 

Plot "learning curves" for the best models of each. This is a great way to see how training/testing size affects the scores. Look at the documentation for how to use the `learning_curve` function in sklearn.

http://scikit-learn.org/stable/modules/learning_curve.html#learning-curves
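A minimal sketch of computing the curve, assuming a recent scikit-learn where `learning_curve` lives in `sklearn.model_selection`. The data is synthetic and the SVC settings are arbitrary placeholders for whatever your best model turns out to be. Plotting is left out; the mean scores can be passed to `plt.plot` against `train_sizes`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Score the model at 5 training-set sizes, each with 5-fold cross-validation.
train_sizes, train_scores, test_scores = learning_curve(
    SVC(kernel='rbf', C=1.0), X_demo, y_demo,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

# Average over folds: one training and one validation score per size.
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
for n, tr, te in zip(train_sizes, train_mean, test_mean):
    print(n, round(tr, 3), round(te, 3))
```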