# Support Vector Machines Lab

In this lab we will explore several datasets with SVMs. The assets folder contains several datasets (in order of complexity):

1. Breast cancer

For each of these, a `.names` file is provided with details on the origin of the data.

In [1]:
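from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load the dataset bundled with scikit-learn
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)  # feature matrix
y = data.target                                          # target vector: 0 = malignant, 1 = benign
y_as_df = pd.DataFrame(data.target, columns=['benign'])  # same target as a one-column DataFrame
X.head()  # last expression in the cell, so the notebook displays it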

# Exercise 1: Breast Cancer



## 1.a: Load the Data
- Are there any missing values? (How are they encoded? Do we impute them?)
- Are the features categorical or numerical?
- Are the values normalized?
- How many classes are there in the target?

Perform what's necessary to get to a point where you have a feature matrix `X` and a target vector `y`, both with only numerical entries.

In [2]:
def eda(dataframe):
    """Print a quick exploratory summary of a DataFrame."""
    print("missing values \n{}".format(dataframe.isnull().sum()))   ## count number of null values per column
    print('')
    print("dataframe types \n{}".format(dataframe.dtypes))          ## list data type of each column
    print('')
    print("dataframe shape \n{}".format(dataframe.shape))           ## rows by columns
    print('')
    print("dataframe describe \n{}".format(dataframe.describe()))   ## stats -- mean, min, max, etc.
    print('')
    print('unique values in series:\n')
    for item in dataframe:                                           ## count number of unique values per column
        print(item, '\t\t\t', dataframe[item].nunique())
    print('')
    print('num duplicates:', dataframe.duplicated().sum())          ## df.drop_duplicates() to remove dupes


eda(X)
# no missing values


missing values 
mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
dtype: int64

dataframe types 
mean radius                float64
mean texture               float64
mean perimete

In [3]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()
X_minmax = pd.DataFrame(min_max_scaler.fit_transform(X),columns=X.columns)
eda(X_minmax)

missing values 
mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
dtype: int64

dataframe types 
mean radius                float64
mean texture               float64
mean perimete

In [32]:
y_pd = pd.DataFrame(y)
y_pd[0].value_counts()
## binary target

1    357
0    212
Name: 0, dtype: int64

## 1.b: Model Building

- What's the baseline for the accuracy?
- Initialize and train a linear SVM. What's the average accuracy score with 3-fold cross-validation?
- Repeat using an RBF-kernel classifier. Compare the scores. Which one is better?
- Are your features normalized? If not, try normalizing and repeat the test. Does the score improve?
- What's the best model?
- Print a confusion matrix and classification report for your best model using:
        train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)

**Check:** To decide which model is best, look at the average cross-validation score. Are the scores significantly different from one another?
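For the baseline question, a common reference point is the accuracy of a classifier that always predicts the majority class. The cell below is a minimal sketch of that idea, assuming the same `load_breast_cancer` data used earlier; the names `counts` and `baseline` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

y = load_breast_cancer().target

# Count how many samples fall in each class (0 = malignant, 1 = benign)
counts = np.bincount(y)

# Accuracy of a classifier that always predicts the most frequent class
baseline = counts.max() / counts.sum()
print('class counts:', counts)
print('baseline accuracy: {:.3f}'.format(baseline))
```

With 357 of the 569 samples in the larger class, this baseline is about 0.627, so any SVM worth keeping should score well above that.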

In [5]:
from sklearn.svm import SVC

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report



### LINEAR
model = SVC(kernel='linear')

# NOTE: the exercise asks for stratify=y; this split omits it, and the output below reflects that
X_train, X_test, y_train, y_test = train_test_split(X_minmax, y, test_size=0.33, random_state=42)
model.fit(X_train, y_train.ravel())

scores = cross_val_score(model, X_train, y_train,
                             cv=3,
                             n_jobs=-1)
print('Linear SVC')
print('SVC scores:', scores)
print('Avg SVC score = ', scores.mean())


### RBF
model = SVC()  # the default kernel is 'rbf'

# Same split as above (identical random_state)
X_train, X_test, y_train, y_test = train_test_split(X_minmax, y, test_size=0.33, random_state=42)
model.fit(X_train, y_train.ravel())

scores = cross_val_score(model, X_train, y_train,
                             cv=3,
                             n_jobs=-1)
print('RBF SVC')
print('SVC scores:', scores)
print('Avg SVC score = ', scores.mean())


# Evaluate the last fitted model (the RBF SVC) on the held-out test set
expected = y_test
predicted = model.predict(X_test)

print("Support Vector Machine Classifier")
# In load_breast_cancer, class 0 is malignant and class 1 is benign
print(classification_report(expected, predicted, target_names=data.target_names))
print(confusion_matrix(y_test, predicted))

Linear SVC
SVC scores: [ 0.96875     0.95275591  0.96825397]
Avg SVC score =  0.963253291255
RBF SVC
SVC scores: [ 0.9453125   0.94488189  0.92857143]
Avg SVC score =  0.939588606112
Support Vector Machine Classifier
             precision    recall  f1-score   support

  malignant       1.00      0.88      0.94        67
     benign       0.94      1.00      0.97       121

avg / total       0.96      0.96      0.96       188

[[ 59   8]
 [  0 121]]


**Check:** Are there more false positives or false negatives? Is this good or bad?
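To answer this from the numbers, the confusion matrix printed above can be unpacked directly. The sketch below hard-codes that matrix and assumes scikit-learn's layout (rows are true classes, columns are predicted classes), with class 1 (benign in `load_breast_cancer`) treated as the positive class. Treating malignant as the positive class instead would swap the two roles.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true class, columns = predicted class
cm = np.array([[59, 8],
               [0, 121]])

# For a binary problem with labels (0, 1), ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()

print('false positives (true 0, predicted 1):', fp)
print('false negatives (true 1, predicted 0):', fn)
```

Under these assumptions, all 8 errors are class-0 (malignant) samples predicted as class 1 (benign), which in a screening setting is usually the more costly kind of mistake.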

## 1.c: Grid Search

Use grid search (`GridSearchCV` in scikit-learn) to explore different kernels and values for the C parameter.

- Can you improve on your best previous score?
- Print the best parameters and the best score
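One way to set this up is sketched below, using `GridSearchCV` from `sklearn.model_selection`. The grid values and the choice to wrap `MinMaxScaler` and `SVC` in a `Pipeline` are assumptions made for illustration, not something the lab prescribes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X, y = data.data, data.target

# Scaling lives inside the pipeline so it is re-fit on each training fold
pipe = Pipeline([('scale', MinMaxScaler()), ('svc', SVC())])

# Illustrative grid: these kernels and C values are assumptions, not prescribed by the lab
param_grid = {
    'svc__kernel': ['linear', 'rbf'],
    'svc__C': [0.1, 1, 10, 100],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring='accuracy')
search.fit(X, y)

print('best parameters:', search.best_params_)
print('best score: {:.4f}'.format(search.best_score_))
```

Because the scaler is part of the pipeline, each cross-validation fold is scaled using only its own training portion, which keeps the reported score from being slightly optimistic.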

# Exercise 2
Now let's encapsulate a few things into functions so that it's easier to repeat the analysis.

## 2.a: Cross Validation
Implement a function `do_cv(model, X, y, cv)` that does the following:
- Calculates the cross validation scores
- Prints the model
- Prints and returns the mean and the standard deviation of the cross validation scores

> Answer: see above
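A possible shape for `do_cv` is sketched below. It follows the three bullet points literally; the output format and the usage lines at the bottom (a linear `SVC` on min-max scaled features) are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC


def do_cv(model, X, y, cv):
    """Cross-validate `model`, print a summary, and return (mean, std) of the scores."""
    scores = cross_val_score(model, X, y, cv=cv)
    mean, std = scores.mean(), scores.std()
    print(model)
    print('CV scores: mean = {:.3f}, std = {:.3f}'.format(mean, std))
    return mean, std


# Usage on the min-max scaled breast cancer features
data = load_breast_cancer()
X_scaled = MinMaxScaler().fit_transform(data.data)
mean, std = do_cv(SVC(kernel='linear'), X_scaled, data.target, cv=3)
```

Returning the standard deviation alongside the mean makes it easier to judge whether two models' scores really differ or just overlap within fold-to-fold noise.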

## 2.b (Optional): Confusion Matrix and Classification Report
Implement a function `do_cm_cr(model, X, y, names)` that automates the following:
- Split the data using `train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)`
- Fit the model
- Print the confusion matrix and classification report in a nice format

**Hint:** names is the list of target classes
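The steps above could be sketched as follows. Labelling the confusion matrix with a pandas `DataFrame` is one reading of "a nice format", and the usage lines (a default `SVC` on the raw features) are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def do_cm_cr(model, X, y, names):
    """Fit `model` on a stratified split, then print and return the confusion matrix and report."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.33, random_state=42)
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)

    # Label rows (true class) and columns (predicted class) for readability
    cm = pd.DataFrame(confusion_matrix(y_test, predicted),
                      index=['true ' + n for n in names],
                      columns=['pred ' + n for n in names])
    cr = classification_report(y_test, predicted, target_names=names)
    print(cm)
    print(cr)
    return cm, cr


# Usage: target_names for this dataset are ['malignant', 'benign']
data = load_breast_cancer()
cm, cr = do_cm_cr(SVC(), data.data, data.target, list(data.target_names))
```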
