# Support Vector Machines Lab

In this lab you will fit SVM classifiers to several datasets and compare them with logistic regression and kNN classifiers.

Your datasets folder has these four datasets to choose from for the lab:

**Breast cancer**

    ./DSI-SF-2/datasets/breast_cancer_wisconsin

**Spambase**

    ./DSI-SF-2/datasets/spam

**Car evaluation**

    ./DSI-SF-2/datasets/car_evaluation
    
**Mushroom**

    ./DSI-SF-2/datasets/car_evaluation


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

from ipywidgets import *
from IPython.display import display

from sklearn.svm import SVC

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## 1. Breast Cancer



### Load the Data

- Are there any missing values? Impute or clean if so.
- Select a classification target and predictors.
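
As a rough sketch of the two steps above, the cell below checks for missing values and builds a binary target. It runs on a small hypothetical frame (`toy`) rather than the real file, and the median fill is only one possible imputation choice, not necessarily the best one for this data.

```python
import pandas as pd

# Hypothetical stand-in for two columns of the breast cancer data.
# In the real file, Bare_Nuclei marks missing values with '?'.
toy = pd.DataFrame({
    'Bare_Nuclei': ['1', '10', '?', '4', '2'],
    'Class': [2, 4, 2, 4, 2],
})

# Coerce to numeric: anything unparseable (here '?') becomes NaN.
toy['Bare_Nuclei'] = pd.to_numeric(toy['Bare_Nuclei'], errors='coerce')
n_missing = toy['Bare_Nuclei'].isnull().sum()

# One simple imputation choice (an assumption): fill with the column median.
median_value = toy['Bare_Nuclei'].median()
toy['Bare_Nuclei'] = toy['Bare_Nuclei'].fillna(median_value)

# Binary target: 1 for malignant (class 4), 0 for benign (class 2).
toy['Class'] = (toy['Class'] == 4).astype(int)

print('missing: {}, median used: {}'.format(n_missing, median_value))
```

Coercing with `errors='coerce'` keeps the missing entries visible as NaN, so the imputation step is explicit instead of being hidden inside a string replacement.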

In [25]:
filepath ='/Users/tlee010/desktop/DSI-SF-2-timdavidlee/datasets/breast_cancer_wisconsin/breast_cancer.csv'
bc = pd.read_csv(filepath)

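# Bare_Nuclei stores missing values as '?'; replace them with 0 and cast to int.
# (0 lies outside the 1-10 scale, so it acts as a "missing" flag rather than a real score.)
bc.Bare_Nuclei = bc.Bare_Nuclei.map(lambda x: int(x.replace('?','0')))
# Recode the target: 1 = malignant (class 4), 0 = benign (class 2).
bc['Class'] = bc['Class'].map(lambda x: 1 if x == 4 else 0)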
print bc.info()
bc.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
Sample_code_number             699 non-null int64
Clump_Thickness                699 non-null int64
Uniformity_of_Cell_Size        699 non-null int64
Uniformity_of_Cell_Shape       699 non-null int64
Marginal_Adhesion              699 non-null int64
Single_Epithelial_Cell_Size    699 non-null int64
Bare_Nuclei                    699 non-null int64
Bland_Chromatin                699 non-null int64
Normal_Nucleoli                699 non-null int64
Mitoses                        699 non-null int64
Class                          699 non-null int64
dtypes: int64(11)
memory usage: 60.1 KB
None


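| | Sample_code_number | Clump_Thickness | Uniformity_of_Cell_Size | Uniformity_of_Cell_Shape | Marginal_Adhesion | Single_Epithelial_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 0 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 0 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 0 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 0 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 0 |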


In [26]:
# Confirm the target is now binary
bc['Class'].unique()

array([0, 1])

### Modeling

For details on the SVM classifier, see here:

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

- What's the baseline accuracy?
- Initialize and train a linear SVM. What's the average accuracy score with 5-fold cross-validation?
- Repeat using a radial basis function (RBF) kernel. Compare the scores. Which one is better?
- Print a confusion matrix and classification report for your best model on both the training and testing data.

Classification report:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

Confusion matrix:

```python
df_confusion = pd.crosstab(y_true, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True)
```
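
To show how the two hints above fit together, here is a minimal sketch on hypothetical labels (`y_true` and `y_pred` are made up for illustration, not model output). It builds the crosstab from the hint and also calls sklearn's `confusion_matrix` and `classification_report`.

```python
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels, only to show the call shapes.
y_true = pd.Series([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = pd.Series([0, 1, 1, 1, 0, 0, 1, 0])

# Labeled confusion table with row/column totals (the "All" margins).
df_confusion = pd.crosstab(y_true, y_pred, rownames=['Actual'],
                           colnames=['Predicted'], margins=True)
print(df_confusion)

# Same counts as a plain array: rows are actual classes, columns are predictions.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class precision, recall, F1 and support.
print(classification_report(y_true, y_pred))
```

In the lab itself, `y_true` would be `y_train` or `y_test`, and `y_pred` would come from calling `predict` on the fitted model with the matching `X`.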

In [28]:
# Share of malignant samples (1 = original class 4, 0 = original class 2).
# Class 0 is the majority, so the baseline accuracy is 1 minus this value (about 0.655).
bc['Class'].mean()

0.3447782546494993
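
The mean of a 0/1 target is the share of positives, so the accuracy of always predicting the majority class is the larger of that share and its complement. A small sketch on a hypothetical target array (`y_toy` is made up):

```python
import numpy as np

# Hypothetical 0/1 target with more 0s than 1s, like Class above.
y_toy = np.array([0, 0, 0, 1, 0, 1, 0, 1, 0, 0])

positive_rate = y_toy.mean()

# Always guessing the majority class gets this fraction right.
baseline_accuracy = max(positive_rate, 1 - positive_rate)

print('positive rate: {:.3f}, baseline accuracy: {:.3f}'.format(
    positive_rate, baseline_accuracy))
```

Any cross-validated score should be judged against this number rather than against 0.5.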

In [50]:
import patsy
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
# sklearn.cross_validation was deprecated in 0.18 and removed in 0.20; use model_selection.
from sklearn.model_selection import cross_val_score, train_test_split

formula = 'Class ~ ' + '+'.join([x for x in bc.columns.values if x not in ('Sample_code_number', 'Class')])
print formula

# ========== splitting columns for X and Y
y, X = patsy.dmatrices(formula,bc)
sclr = StandardScaler()
X = sclr.fit_transform(X)
y = np.ravel(y)

# ========== splitting training and testing data set rows
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, stratify=y)

# ========== LINEAR kernel
svc = SVC(kernel = 'linear')
score = cross_val_score(svc, X_train,y_train, cv=5)
print '='*60
print 'Linear kernel model:'
print np.mean(score), score

# ========== RBF kernel
svc = SVC(kernel = 'rbf')
score = cross_val_score(svc, X_train,y_train, cv=5)
print '='*60
print 'RBF kernel model:'
print np.mean(score), score




Class ~ Clump_Thickness+Uniformity_of_Cell_Size+Uniformity_of_Cell_Shape+Marginal_Adhesion+Single_Epithelial_Cell_Size+Bare_Nuclei+Bland_Chromatin+Normal_Nucleoli+Mitoses
Linear kernel model:
0.963202188092 [ 0.98979592  0.93877551  0.94897959  0.96938776  0.96907216]
RBF kernel model:
0.967304860088 [ 0.97959184  0.93877551  0.95918367  0.97959184  0.97938144]
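
The cell above scales all rows before splitting, so the scaler sees the test rows too. One way to avoid that, sketched below, is to put the scaler and the SVM in a pipeline so scaling is refit inside each fold, then score the held-out test set once at the end. The data here is a synthetic stand-in from `make_classification` (an assumption, not the breast cancer file), and the sketch assumes sklearn 0.18 or later for `sklearn.model_selection`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with nine predictors, like the breast cancer features.
X, y = make_classification(n_samples=300, n_features=9, n_informative=5,
                           random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

results = {}
for kernel in ('linear', 'rbf'):
    # The scaler is fit only on the training part of each CV fold.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)

    # Refit on the full training set, then score the untouched test set.
    model.fit(X_train, y_train)
    results[kernel] = (cv_scores.mean(), model.score(X_test, y_test))
    print('{}: cv mean={:.3f}, test={:.3f}'.format(kernel, *results[kernel]))
```

Fixing `random_state` in the split also makes the scores repeatable from run to run.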


## 2. Perform the steps above with the car or mushroom dataset

Repeat each step.

In [87]:
filepath ='/Users/tlee010/desktop/DSI-SF-2-timdavidlee/datasets/car_evaluation/car.csv'
cars = pd.read_csv(filepath)
cars.head()
print cars.buying.value_counts()
# No missing values, but the levels are strings.
# Binarize the buying price: low/med -> 0, high/vhigh -> 1.
cars.buying = cars.buying.map({'low': 0, 'med': 0, 'high': 1, 'vhigh': 1})

med      432
high     432
low      432
vhigh    432
Name: buying, dtype: int64


In [88]:
print cars.buying.value_counts()

1    864
0    864
Name: buying, dtype: int64


In [91]:
# ('buying') is just a string, so `not in ('buying')` would be a substring test; compare directly.
formula = 'buying ~ ' + '+'.join([x for x in cars.columns.values if x != 'buying']) +'-1'
print formula
print cars.head()
# ========== splitting columns for X and Y
y, X = patsy.dmatrices(formula,cars)
sclr = StandardScaler()
X = sclr.fit_transform(X)
y = np.ravel(y)
print X.shape, y.shape

# ========== splitting training and testing data set rows
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, stratify=y)

# ========== LINEAR kernel
svc = SVC(kernel = 'linear')
score = cross_val_score(svc, X_train,y_train, cv=5)
print '='*60
print 'Linear kernel model:'
print np.mean(score), score

# ========== RBF kernel
svc = SVC(kernel = 'rbf')
score = cross_val_score(svc, X_train,y_train, cv=5)
print '='*60
print 'RBF kernel model:'
print np.mean(score), score




buying ~ maint+doors+persons+lug_boot+safety+acceptability-1
   buying  maint doors persons lug_boot safety acceptability
0       1  vhigh     2       2    small    low         unacc
1       1  vhigh     2       2    small    med         unacc
2       1  vhigh     2       2    small   high         unacc
3       1  vhigh     2       2      med    low         unacc
4       1  vhigh     2       2      med    med         unacc
(1728, 16) (1728,)
Linear kernel model:
0.636466942149 [ 0.66115702  0.66528926  0.64049587  0.65289256  0.5625    ]
RBF kernel model:
0.593477961433 [ 0.6446281   0.62396694  0.54545455  0.59917355  0.55416667]

## 3. Compare SVM, kNN and logistic regression using spam data

You should:

- Gridsearch optimal parameters for each model (for SVM, just gridsearch C and kernel).
- Cross-validate scores.
- Examine confusion matrices and classification reports.
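
A possible shape for the gridsearch step is sketched below. It uses synthetic data from `make_classification` in place of the spam file, and the parameter grids are small illustrative choices (assumptions), not recommended settings. It assumes sklearn 0.18 or later for `sklearn.model_selection`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the spam features and labels.
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# One (estimator, grid) pair per model; 'clf__' targets the pipeline step below.
searches = {
    'svm': (SVC(), {'clf__C': [0.1, 1, 10], 'clf__kernel': ['linear', 'rbf']}),
    'knn': (KNeighborsClassifier(), {'clf__n_neighbors': [3, 5, 11]}),
    'logreg': (LogisticRegression(max_iter=1000), {'clf__C': [0.1, 1, 10]}),
}

best = {}
for name, (estimator, grid) in searches.items():
    # Scale inside the pipeline so each CV fold is scaled on its own training part.
    pipe = Pipeline([('scale', StandardScaler()), ('clf', estimator)])
    search = GridSearchCV(pipe, grid, cv=5)
    search.fit(X_train, y_train)
    best[name] = search
    print('{}: {} (cv accuracy {:.3f})'.format(
        name, search.best_params_, search.best_score_))
```

Each fitted `search` can then call `predict(X_test)` to feed the confusion matrix and classification report.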

Bonus: 

Plot "learning curves" for the best model of each type. This is a great way to see how the training set size affects the training and validation scores. See the sklearn documentation for how to use the `learning_curve` function.

http://scikit-learn.org/stable/modules/learning_curve.html#learning-curves
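
As a starting point for the bonus, the sketch below calls `learning_curve` on synthetic data (a stand-in for the spam features) with an RBF SVM and averages the scores across folds. The five training sizes are an arbitrary choice for illustration, and the sketch assumes sklearn 0.18 or later.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Synthetic stand-in for the spam data.
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           random_state=0)

# Five training sizes from 20% to 100% of each fold's training portion.
train_sizes, train_scores, test_scores = learning_curve(
    SVC(kernel='rbf'), X, y, cv=5, train_sizes=np.linspace(0.2, 1.0, 5))

# Each row is one training size, each column one CV fold.
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)

for n, tr, te in zip(train_sizes, train_mean, test_mean):
    print('n={}: train={:.3f}, validation={:.3f}'.format(n, tr, te))
```

Plotting `train_mean` and `test_mean` against `train_sizes` with `plt.plot` gives the curve itself; a persistent gap between the two lines suggests overfitting.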

In [96]:

filepath ='/Users/tlee010/desktop/DSI-SF-2-timdavidlee/datasets/spam/spambase.csv'
spam = pd.read_csv(filepath)
spam.head()

| | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |


In [97]:
formula = 'class ~ ' + '+'.join([a for a in spam.columns if a != 'class' ])
print formula



class ~ word_freq_make+word_freq_address+word_freq_all+word_freq_3d+word_freq_our+word_freq_over+word_freq_remove+word_freq_internet+word_freq_order+word_freq_mail+word_freq_receive+word_freq_will+word_freq_people+word_freq_report+word_freq_addresses+word_freq_free+word_freq_business+word_freq_email+word_freq_you+word_freq_credit+word_freq_your+word_freq_font+word_freq_000+word_freq_money+word_freq_hp+word_freq_hpl+word_freq_george+word_freq_650+word_freq_lab+word_freq_labs+word_freq_telnet+word_freq_857+word_freq_data+word_freq_415+word_freq_85+word_freq_technology+word_freq_1999+word_freq_parts+word_freq_pm+word_freq_direct+word_freq_cs+word_freq_meeting+word_freq_original+word_freq_project+word_freq_re+word_freq_edu+word_freq_table+word_freq_conference+char_freq_;+char_freq_(+char_freq_[+char_freq_!+char_freq_$+char_freq_#+capital_run_length_average+capital_run_length_longest+capital_run_length_total


In [98]:
y, X = patsy.dmatrices(formula,spam)
sclr = StandardScaler()
X = sclr.fit_transform(X)
y = np.ravel(y)
print X.shape, y.shape

# ========== splitting training and testing data set rows
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, stratify=y)

PatsyError: error tokenizing input (maybe an unclosed string?)
    class ~ word_freq_make+word_freq_address+word_freq_all+word_freq_3d+word_freq_our+word_freq_over+word_freq_remove+word_freq_internet+word_freq_order+word_freq_mail+word_freq_receive+word_freq_will+word_freq_people+word_freq_report+word_freq_addresses+word_freq_free+word_freq_business+word_freq_email+word_freq_you+word_freq_credit+word_freq_your+word_freq_font+word_freq_000+word_freq_money+word_freq_hp+word_freq_hpl+word_freq_george+word_freq_650+word_freq_lab+word_freq_labs+word_freq_telnet+word_freq_857+word_freq_data+word_freq_415+word_freq_85+word_freq_technology+word_freq_1999+word_freq_parts+word_freq_pm+word_freq_direct+word_freq_cs+word_freq_meeting+word_freq_original+word_freq_project+word_freq_re+word_freq_edu+word_freq_table+word_freq_conference+char_freq_;+char_freq_(+char_freq_[+char_freq_!+char_freq_$+char_freq_#+capital_run_length_average+capital_run_length_longest+capital_run_length_total
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ^
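
The error above comes from the formula string: column names such as `char_freq_;` and `char_freq_(` are not valid Python names, so patsy cannot tokenize them. One way around it, sketched below on a small hypothetical frame (`spam_toy`), is to skip the formula and slice the DataFrame directly. Patsy's `Q("...")` quoting is another option if a formula is still wanted.

```python
import pandas as pd

# Hypothetical frame with the same kind of column names as spambase.
spam_toy = pd.DataFrame({
    'word_freq_make': [0.0, 0.21, 0.06],
    'char_freq_;': [0.0, 0.0, 0.01],
    'char_freq_(': [0.0, 0.132, 0.143],
    'class': [1, 1, 0],
})

# Select predictors by name; no formula parsing is involved,
# so punctuation in the column names is harmless.
feature_cols = [c for c in spam_toy.columns if c != 'class']
X = spam_toy[feature_cols].values
y = spam_toy['class'].values

print('X shape: {}, y shape: {}'.format(X.shape, y.shape))
```

From here the scaling, splitting, and modeling steps can proceed as in the earlier sections, since `X` and `y` are plain NumPy arrays.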