# Support Vector Machines Lab

In this lab we will explore several datasets with SVMs. The assets folder contains several datasets (in order of complexity):

1. Breast cancer
2. Spambase
3. Car evaluation
4. Mushroom

For each of these, a `.names` file is provided with details on the origin of the data.

In [129]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.svm import SVC

# Exercise 1: Breast Cancer



## 1.a: Load the Data
Use `pandas.read_csv` to load the data and assess the following:
- Are there any missing values? (how are they encoded? do we impute them?)
- Are the features categorical or numerical?
- Are the values normalized?
- How many classes are there in the target?

Perform what's necessary to get to a point where you have a feature matrix `X` and a target vector `y`, both with only numerical entries.

In [102]:
df = pd.read_csv('/users/kristensu/dropbox/ga-dsi/dsi-copy/curriculum/week-09/1.3-lab-svms/assets/datasets/breast_cancer.csv')

In [103]:
df.head(2)

| | Sample_code_number | Clump_Thickness | Uniformity_of_Cell_Size | Uniformity_of_Cell_Shape | Marginal_Adhesion | Single_Epithelial_Cell_Size | Bare_Nuclei | Bland_Chromatin | Normal_Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |


In [104]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
Sample_code_number             699 non-null int64
Clump_Thickness                699 non-null int64
Uniformity_of_Cell_Size        699 non-null int64
Uniformity_of_Cell_Shape       699 non-null int64
Marginal_Adhesion              699 non-null int64
Single_Epithelial_Cell_Size    699 non-null int64
Bare_Nuclei                    699 non-null object
Bland_Chromatin                699 non-null int64
Normal_Nucleoli                699 non-null int64
Mitoses                        699 non-null int64
Class                          699 non-null int64
dtypes: int64(10), object(1)
memory usage: 60.1+ KB


In [105]:
df.isnull().sum()

Sample_code_number             0
Clump_Thickness                0
Uniformity_of_Cell_Size        0
Uniformity_of_Cell_Shape       0
Marginal_Adhesion              0
Single_Epithelial_Cell_Size    0
Bare_Nuclei                    0
Bland_Chromatin                0
Normal_Nucleoli                0
Mitoses                        0
Class                          0
dtype: int64

In [106]:
df.columns = [x.lower() for x in df.columns]
df.columns

Index([u'sample_code_number', u'clump_thickness', u'uniformity_of_cell_size',
       u'uniformity_of_cell_shape', u'marginal_adhesion',
       u'single_epithelial_cell_size', u'bare_nuclei', u'bland_chromatin',
       u'normal_nucleoli', u'mitoses', u'class'],
      dtype='object')

In [107]:
df['bare_nuclei'].value_counts()

1     402
10    132
5      30
2      30
3      28
8      21
4      19
?      16
9       9
7       8
6       4
Name: bare_nuclei, dtype: int64

In [108]:
# Convert bare_nuclei column to dtype int
def to_int(x):
    if x == '?':
        return 0
    else:
        return x

df['bare_nuclei'] = df['bare_nuclei'].apply(to_int)
df['bare_nuclei'] = pd.to_numeric(df['bare_nuclei'])
df['bare_nuclei'].value_counts()

1     402
10    132
5      30
2      30
3      28
8      21
4      19
0      16
9       9
7       8
6       4
Name: bare_nuclei, dtype: int64

In [109]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 11 columns):
sample_code_number             699 non-null int64
clump_thickness                699 non-null int64
uniformity_of_cell_size        699 non-null int64
uniformity_of_cell_shape       699 non-null int64
marginal_adhesion              699 non-null int64
single_epithelial_cell_size    699 non-null int64
bare_nuclei                    699 non-null int64
bland_chromatin                699 non-null int64
normal_nucleoli                699 non-null int64
mitoses                        699 non-null int64
class                          699 non-null int64
dtypes: int64(11)
memory usage: 60.1 KB


In [111]:
# Check if data needs to be scaled
for i in df.columns:
    print(i, max(df[i]))
    print(i, min(df[i]))
# All features share the same 1-10 scale (sample_code_number is an ID, not a feature).
# The 16 rows where bare_nuclei is 0 (originally '?') will have to be dropped.

sample_code_number 13454352
sample_code_number 61634
clump_thickness 10
clump_thickness 1
uniformity_of_cell_size 10
uniformity_of_cell_size 1
uniformity_of_cell_shape 10
uniformity_of_cell_shape 1
marginal_adhesion 10
marginal_adhesion 1
single_epithelial_cell_size 10
single_epithelial_cell_size 1
bare_nuclei 10
bare_nuclei 0
bland_chromatin 10
bland_chromatin 1
normal_nucleoli 10
normal_nucleoli 1
mitoses 10
mitoses 1
class 4
class 2


In [114]:
df = df[df['bare_nuclei'] > 0]
df['bare_nuclei'].value_counts()

1     402
10    132
5      30
2      30
3      28
8      21
4      19
9       9
7       8
6       4
Name: bare_nuclei, dtype: int64

In [115]:
df.shape

(683, 11)
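An alternative to replacing `?` with 0 and filtering afterwards is to let pandas treat `?` as missing at load time. The following is a minimal sketch on a tiny inline sample, assuming `?` is the only missing-value marker. The rows and the names `sample`, `df_sample` and `df_clean` are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Tiny made-up sample mimicking the file layout; not real rows from the dataset.
sample = io.StringIO(
    "Bare_Nuclei,Class\n"
    "1,2\n"
    "?,2\n"
    "10,4\n"
)

# na_values tells read_csv to parse '?' as NaN, so the column stays numeric.
df_sample = pd.read_csv(sample, na_values="?")
print(df_sample.isnull().sum())

# Drop incomplete rows (imputing, e.g. with the median, is the other option).
df_clean = df_sample.dropna().astype(int)
print(df_clean.shape)
```

With this approach `df.isnull().sum()` reports the missing entries directly, instead of hiding them inside an `object` column.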

## 1.b: Model Building

- What's the baseline accuracy?
- Initialize and train a linear SVM. What's the average accuracy score with 3-fold cross validation?
- Repeat using an RBF kernel. Compare the scores. Which one is better?
- Are your features normalized? If not, try normalizing and repeat the test. Does the score improve?
- What's the best model?
- Print a confusion matrix and classification report for your best model using:
        train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)

**Check:** To decide which model is best, look at the average cross validation score. Are the scores significantly different from one another?
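The first and fourth questions above can be approached as in the sketch below. It assumes the baseline means "always predict the most frequent class" and uses `StandardScaler` as one possible normalization. The data is a synthetic stand-in generated with `make_classification` (an assumption, not the lab's dataset); substitute your own `X` and `y`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; substitute the lab's X and y).
X_demo, y_demo = make_classification(n_samples=300, n_features=9,
                                     weights=[0.65, 0.35], random_state=0)

# Baseline: accuracy of always predicting the most frequent class.
baseline = pd.Series(y_demo).value_counts(normalize=True).max()
print("baseline accuracy: %.3f" % baseline)

# Scaling inside a pipeline means the scaler is refit on each training fold,
# so nothing from the held-out fold leaks into the preprocessing.
scaled_svc = make_pipeline(StandardScaler(), SVC(C=1, kernel="rbf"))
scores = cross_val_score(scaled_svc, X_demo, y_demo, cv=3, scoring="accuracy")
print("scaled rbf: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

Comparing the standard deviation of the fold scores with the gap between two models gives a rough sense of whether a difference is meaningful.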

In [None]:
# Features: every column except the ID (first) and the target (last)
X = df.iloc[:, 1:-1]
# Target: the class column
y = df.iloc[:, -1]
cv = 3

In [132]:
# With linear kernel
s = SVC(C=1, kernel='linear')
s.fit(X,y)

s_linear = cross_val_score(s, X, y, cv=cv, scoring='accuracy').mean()
s_linear

0.96489295927042285

In [133]:
# With rbf kernel
s = SVC(C=1, kernel='rbf')
s.fit(X,y)

s_rbf = cross_val_score(s, X, y, cv=cv, scoring='accuracy').mean()
s_rbf

0.95758301774995491

In [134]:
df.columns

Index([u'sample_code_number', u'clump_thickness', u'uniformity_of_cell_size',
       u'uniformity_of_cell_shape', u'marginal_adhesion',
       u'single_epithelial_cell_size', u'bare_nuclei', u'bland_chromatin',
       u'normal_nucleoli', u'mitoses', u'class'],
      dtype='object')

In [137]:
def fit_model(x):
    X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                        test_size=.33, random_state=7)   
    x.fit(X_train, y_train)
    y_pred = x.predict(X_test)
    cm = classification_report(y_test, y_pred)

def print_cm(y_true, y_pred, ):
    

**Check:** Are there more false positives or false negatives? Is this good or bad?
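To read the false positives and false negatives off a confusion matrix, the four cells can be unpacked explicitly. This is a small sketch on made-up labels, assuming class 4 is the "positive" (malignant) class; check the `.names` file before relying on that mapping.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration, using the dataset's class codes 2 and 4.
y_true_demo = np.array([2, 2, 2, 4, 4, 4, 4, 2])
y_pred_demo = np.array([2, 4, 2, 4, 2, 4, 4, 2])

# With labels=[2, 4], class 4 is treated as the positive class, so
# ravel() yields (tn, fp, fn, tp) in that order.
tn, fp, fn, tp = confusion_matrix(
    y_true_demo, y_pred_demo, labels=[2, 4]).ravel()
print("false positives:", fp, "false negatives:", fn)
```

In a screening setting a false negative (a missed malignant sample) is usually the more costly error, which is why the two counts are worth looking at separately rather than through accuracy alone.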

## 1.c: Feature Selection

Use any of the strategies offered by `sklearn` to select the most important features.

Repeat the cross validation with only those 5 features. Does the score change?
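One of the strategies `sklearn` offers is univariate selection with `SelectKBest`. The sketch below assumes the ANOVA F-score (`f_classif`) as the ranking criterion and runs on synthetic stand-in data; other options such as recursive feature elimination would be equally valid choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; substitute the lab's X and y).
X_demo, y_demo = make_classification(n_samples=300, n_features=9,
                                     n_informative=4, random_state=0)

# Rank features by ANOVA F-score and keep the 5 highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=5).fit(X_demo, y_demo)
kept = selector.get_support(indices=True)
print("selected column indices:", kept)

# Placing the selector in a pipeline repeats the selection inside each
# cross validation fold, so the held-out fold never influences it.
model = make_pipeline(SelectKBest(f_classif, k=5), SVC(C=1, kernel="linear"))
scores = cross_val_score(model, X_demo, y_demo, cv=3, scoring="accuracy")
print("accuracy with 5 features: %.3f" % scores.mean())
```

With a DataFrame, `X.columns[kept]` would give the names of the selected features.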

## 1.d: Learning Curves

Learning curves are useful to study the behavior of training and test errors as a function of the number of datapoints available.

- Plot learning curves for train sizes between 10% and 100% (use `StratifiedKFold` with 5 folds as cross validation)
- What can you say about the dataset? Do you need more data or do you need a better model?
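The numbers behind such a plot can be computed with `sklearn.model_selection.learning_curve`. This is a sketch on synthetic stand-in data; the linear kernel and the ten evenly spaced train sizes are assumptions, not requirements of the exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, learning_curve
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; substitute the lab's X and y).
X_demo, y_demo = make_classification(n_samples=300, n_features=9,
                                     random_state=0)

# 5-fold stratified cross validation, train sizes from 10% to 100%.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
train_sizes, train_scores, test_scores = learning_curve(
    SVC(C=1, kernel="linear"), X_demo, y_demo,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=folds, scoring="accuracy")

# Each row holds one train size, each column one fold: average over folds.
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
for n, tr, te in zip(train_sizes, train_mean, test_mean):
    print("%4d samples  train=%.3f  cv=%.3f" % (n, tr, te))
```

Plotting `train_mean` and `test_mean` against `train_sizes` with `plt.plot` gives the curves. If the two converge at a low score, a better model is the likelier fix; if a gap remains at the largest train size, more data may help.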

## 1.e: Grid Search

Use `GridSearchCV` to explore different kernels and values for the `C` parameter.

- Can you improve on your best previous score?
- Print the best parameters and the best score
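A grid search over kernel and `C` could look like the sketch below. The grid values are arbitrary choices for illustration and the data is a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; substitute the lab's X and y).
X_demo, y_demo = make_classification(n_samples=300, n_features=9,
                                     random_state=0)

# Every combination of kernel and C is scored with 3-fold cross validation.
param_grid = {"kernel": ["linear", "rbf"], "C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(), param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best score: %.3f" % search.best_score_)
```

After fitting, `search.best_estimator_` holds the best model refit on all of the data passed to `fit`.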

# Exercise 2
Now that you've completed steps 1.a through 1.e it's time to tackle some harder datasets. But before we do that, let's encapsulate a few things into functions so that it's easier to repeat the analysis.

## 2.a: Cross Validation
Implement a function `do_cv(model, X, y, cv)` that does the following:
- Calculates the cross validation scores
- Prints the model
- Prints and returns the mean and the standard deviation of the cross validation scores

> Answer: see above
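One possible shape for `do_cv` is sketched here, assuming accuracy as the scoring metric. The demo call uses synthetic stand-in data and is only there to show the function running.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def do_cv(model, X, y, cv):
    """Cross-validate the model, print a summary, return (mean, std)."""
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(model)
    mean, std = scores.mean(), scores.std()
    print("accuracy: %.3f +/- %.3f" % (mean, std))
    return mean, std


# Synthetic stand-in data (an assumption; substitute the lab's X and y).
X_demo, y_demo = make_classification(n_samples=300, n_features=9,
                                     random_state=0)
mean_demo, std_demo = do_cv(SVC(C=1, kernel="linear"), X_demo, y_demo, cv=3)
```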

## 2.b: Confusion Matrix and Classification report
Implement a function `do_cm_cr(model, X, y, names)` that automates the following:
- Split the data using `train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)`
- Fit the model
- Prints confusion matrix and classification report in a nice format

**Hint:** names is the list of target classes

> Answer: see above
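A sketch of `do_cm_cr` follows. Wrapping the confusion matrix in a DataFrame is one way to get a "nice format"; it assumes `names` is given in the sorted order of the class labels. The class names in the demo call are hypothetical and the data is synthetic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def do_cm_cr(model, X, y, names):
    """Fit on a stratified split, print and return a labeled confusion matrix."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.33, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Rows are true classes, columns are predicted classes.
    cm = pd.DataFrame(confusion_matrix(y_test, y_pred),
                      index=["true " + n for n in names],
                      columns=["pred " + n for n in names])
    print(cm)
    print(classification_report(y_test, y_pred, target_names=names))
    return cm


# Synthetic stand-in data and hypothetical class names, for illustration only.
X_demo, y_demo = make_classification(n_samples=300, n_features=9,
                                     random_state=0)
cm_demo = do_cm_cr(SVC(C=1, kernel="linear"), X_demo, y_demo,
                   ["benign", "malignant"])
```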

## 2.c: Learning Curves
Implement a function `do_learning_curve(model, X, y, sizes)` that automates drawing the learning curves:
- Allow for sizes input
- Use 5-fold StratifiedKFold cross validation

> Answer: see above

## 2.d: Grid Search
Implement a function `do_grid_search(model, parameters)` that automates the grid search by doing:
- Calculate grid search
- Print best parameters
- Print best score
- Return best estimator


> Answer: see above

# Exercise 3
Using the functions above, analyze the Spambase dataset.

Notice that now you have many more features. Focus your attention on step 1.c (feature selection).

- Load the data and get to X, y
- Select the 15 best features
- Perform grid search to determine best model
- Display learning curves

# Exercise 4
Repeat steps 1.a - 1.e for the car dataset. Notice that now features are categorical, not numerical.
- Find a suitable way to encode them
- How does this change our modeling strategy?

Also notice that the target variable `acceptability` has 4 classes. How do we encode them?
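One common way to handle both questions is to one-hot encode the features and map the target classes to integers. The sketch below uses made-up rows; the feature column names `buying` and `safety` are assumptions for illustration, and only `acceptability` comes from the text above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows for illustration; feature column names are assumptions.
cars = pd.DataFrame({
    "buying": ["vhigh", "low", "med", "low"],
    "safety": ["low", "high", "med", "high"],
    "acceptability": ["unacc", "vgood", "acc", "good"],
})

# One-hot encode the features: one indicator column per category value.
X_cars = pd.get_dummies(cars.drop("acceptability", axis=1))
print(X_cars.columns.tolist())

# Map the 4 target classes to integers 0..3 (assigned in alphabetical order).
encoder = LabelEncoder()
y_cars = encoder.fit_transform(cars["acceptability"])
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
```

The target stays a single integer column because `SVC` handles multiclass targets directly, so it does not need one-hot encoding. Since the categories here are ordered (for example low, med, high), an ordinal encoding of the features is a reasonable alternative to compare against.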


# Bonus
Repeat steps 1.a - 1.e for the mushroom dataset. Notice that now features are categorical, not numerical. This dataset is quite large.
- How does this change our modeling strategy?
- Can we use feature selection to improve this?
