# Lab 14 SVMs: Illuminating Advanced Classifiers

Justin Breucop

Today we'll cover Support Vector Machines and compare the linear and RBF kernels.

## SVMs

Support vector machines (SVMs) are powerful classifiers built on the idea that data which cannot be separated in its original space can often be separated in a higher-dimensional space, using an appropriate hyperplane in that space.
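To make the higher-dimension idea concrete, here is a minimal sketch on synthetic data (not part of the lab's datasets). It assumes scikit-learn's `make_circles` helper and hand-adds one extra feature, the squared distance from the origin; a kernel does something similar implicitly. The variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no straight line in 2-D separates them.
X_rings, y_rings = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Training accuracy of a linear SVM on the original two features.
flat_score = SVC(kernel='linear', C=1).fit(X_rings, y_rings).score(X_rings, y_rings)

# Add a third feature: the squared distance of each point from the origin.
radius_sq = (X_rings ** 2).sum(axis=1)
X_lifted = np.column_stack([X_rings, radius_sq])

# In 3-D the two rings sit at different heights, so a flat plane can split them.
lifted_score = SVC(kernel='linear', C=1).fit(X_lifted, y_rings).score(X_lifted, y_rings)

print(flat_score, lifted_score)
```

The linear model should score noticeably better on the lifted data than on the original two features.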

As always, we'll import our standard packages, as well as two new classes: `svm.SVC` and `tree.DecisionTreeClassifier`. SVC stands for Support Vector Classification. There is also an SVR class, but it applies SVMs to regression, which is out of scope for this lab.

In [None]:
import numpy as np
import pandas as pd

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import classification_report
# sklearn.cross_validation was replaced by sklearn.model_selection
from sklearn.model_selection import ShuffleSplit

from bokeh.plotting import figure,show,output_notebook
output_notebook()

%matplotlib inline

An SVM can also be used for categorical data. Because SVMs are more complex than most classification algorithms we've seen, there are many more parameters to tune and options to set for the SVC. Sklearn SVC documentation:

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
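As a quick way to see which options are available, the sketch below lists a few default settings of an untrained `SVC` using `get_params()`. The choice of which parameters to print is ours; the documentation above describes the full set.

```python
from sklearn.svm import SVC

# Inspect the default settings of an untrained classifier.
default_params = SVC().get_params()

# Print a few of the parameters most often tuned.
for key in ['C', 'kernel', 'gamma', 'degree']:
    print(key, default_params[key])
```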

### Load the data!
To demonstrate these classifiers clearly, we will use the Iris dataset again.

In [None]:
from sklearn import datasets

# import some data to play with
iris_port = datasets.load_iris()
iris = pd.DataFrame(iris_port.data,columns=iris_port.feature_names)
y = iris_port.target
X = iris

In [None]:
# Shuffle the row positions, then use the first 3/5 for training and the rest for testing
index = np.random.permutation(len(X))
train = index[:len(X)*3//5]
test = index[len(X)*3//5:]

In [None]:
model = SVC(kernel='linear',C=1).fit(X.iloc[train],y[train])
print(classification_report(y[test],model.predict(X.iloc[test])))

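A model fit with the linear kernel has a `coef_` attribute we can use to plot the weight of each feature. Because SVC handles multiple classes by training one classifier per pair of classes, each row of `coef_` corresponds to a pair of targets (0 vs. 1, 0 vs. 2, 1 vs. 2), not to a single target.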

We'll be able to visually see how important each feature is to our model.
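For a multiclass problem, scikit-learn's `SVC` trains one classifier per pair of classes, so it helps to check the shape of `coef_` before plotting it. The following is a small sketch that refits a linear model on the full Iris data; the `demo_` names and the pair labels are ours, chosen for illustration.

```python
from itertools import combinations
from sklearn import datasets
from sklearn.svm import SVC

demo_iris = datasets.load_iris()
demo_model = SVC(kernel='linear', C=1).fit(demo_iris.data, demo_iris.target)

# Three classes give 3 * 2 / 2 = 3 pairwise classifiers, each with 4 coefficients.
print(demo_model.coef_.shape)

# Label each row with the pair of classes it separates (0 vs 1, 0 vs 2, 1 vs 2).
pair_labels = ['%s vs %s' % pair for pair in combinations(demo_iris.target_names, 2)]
for label, row in zip(pair_labels, demo_model.coef_):
    print(label, row)
```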

In [None]:
names = iris_port.target_names

In [None]:
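from itertools import combinations

data = {}
# For a multiclass SVC, coef_ has one row per pair of classes,
# ordered 0 vs 1, 0 vs 2, 1 vs 2, so we label each row with its pair.
pairs = list(combinations(names, 2))
for (first, second), row in zip(pairs, model.coef_):
    data['%s vs %s' % (first, second)] = list(row)
data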


In [None]:
from bokeh.charts import Bar, show

p=Bar(data, cat=list(iris.columns), title="SVC Feature Importance",
        xlabel='Flowers', ylabel='Linear Coefficient', width=600, height=600, legend="top_right")
show(p)

### Aside: Charts as Higher Level Glyph
Bokeh has some high-level charting functions, such as `Bar`, which take data in a specific format and generate pretty charts with a small amount of code.

There are more: http://bokeh.pydata.org/en/latest/docs/user_guide/charts.html#userguide-charts

### End Aside

As a reminder, precision is only part of the story. The classification report also shows recall, F1 score, and support for each class, which gives us a much fuller picture of what's going on.
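To see why precision alone can mislead, consider this small sketch with made-up labels (hypothetical, not taken from the Iris data). A cautious model that flags only one sample can have perfect precision while missing most of the positives.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: four positives, six negatives.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# A cautious model that flags only one sample as positive.
toy_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

toy_precision = precision_score(toy_true, toy_pred)  # 1 of 1 flagged sample is correct
toy_recall = recall_score(toy_true, toy_pred)        # 1 of 4 actual positives is found
print(toy_precision, toy_recall)
```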

In [None]:
model = SVC(kernel='rbf',C=1).fit(X.iloc[train],y[train])
print(classification_report(y[test],model.predict(X.iloc[test])))

## Hands-on: Mushrooms!

Today we'll be working with a mushroom dataset. If you're lost in a forest and find a gill capped mushroom and have access to your SVM classifier, you'll hopefully be prepared to see if it's poisonous! Humor aside, we'll see the power of an SVM working with a large number of attributes to separate two classes of data.

The attributes are:
1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s 
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s 
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y 
4. bruises?: bruises=t,no=f 
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s 
6. gill-attachment: attached=a,descending=d,free=f,notched=n 
7. gill-spacing: close=c,crowded=w,distant=d 
8. gill-size: broad=b,narrow=n 
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y 
10. stalk-shape: enlarging=e,tapering=t 
11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=? 
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s 
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s 
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
16. veil-type: partial=p,universal=u 
17. veil-color: brown=n,orange=o,white=w,yellow=y 
18. ring-number: none=n,one=o,two=t 
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z 
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y 
21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y 
22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d



## Aside: String Processing for Fun And Profit

Because of the structure of the categories, I'm going to create a dictionary that maps each column to its categories. The first step is to put the data into a triple-quoted string, which is delimited by three apostrophes. The string can span multiple lines and ends only at the next three apostrophes.

In [None]:
attributes = '''cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s 
cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s 
cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y 
bruises?: bruises=t,no=f 
odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s 
gill-attachment: attached=a,descending=d,free=f,notched=n 
gill-spacing: close=c,crowded=w,distant=d 
gill-size: broad=b,narrow=n 
gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y 
stalk-shape: enlarging=e,tapering=t 
stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=? 
stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s 
stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s 
stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
veil-type: partial=p,universal=u 
veil-color: brown=n,orange=o,white=w,yellow=y 
ring-number: none=n,one=o,two=t 
ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z 
spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y 
population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y 
habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d'''
attributes_list = attributes.split('\n')
attributes_list

In [None]:
ordered_attributes = []
data_attributes = {}
for att in attributes_list:
    # Break each line into the column name and its categories
    col_name, cat_string = att.split(': ')
    # Next, split the categories into a list of 'name=value' strings
    cat_labels = cat_string.split(',')

    # Build a {value: name} dictionary for this column. strip() removes the stray
    # spaces around some entries so they don't end up in the names or values.
    cats = {}
    for label in cat_labels:
        name, value = label.strip().split('=')
        cats[value] = name

    # Populate our columns dictionary: the key is the column name, the value is cats
    data_attributes[col_name] = cats

    # We also want an ordered list to use as the column names for our dataframe
    ordered_attributes.append(col_name)


data_attributes

## End Aside

So we have a dictionary of more verbose categories. We'll now want to apply them to the dataset.

In [None]:
mush = pd.read_csv('../data/mushrooms.data',header=None,names=['edible?']+ordered_attributes)
mush.head()

Let's have verbose names for our data by using .map()

In [None]:
for col in mush.columns:
    if col == 'edible?':
        continue
    mush[col] = mush[col].map(data_attributes[col])
mush.head()
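One thing to watch with `.map()` and a dictionary: any value without a matching key becomes `NaN`. The sketch below uses a tiny made-up Series (the code `'q'` is hypothetical) to show the behavior and a quick way to count unmapped values.

```python
import pandas as pd

# Hypothetical codes; 'q' is deliberately absent from the lookup below.
toy_codes = pd.Series(['b', 'x', 'q'])
toy_lookup = {'b': 'bell', 'x': 'convex'}

toy_mapped = toy_codes.map(toy_lookup)
print(toy_mapped)

# Count the values that had no entry in the dictionary.
n_unmapped = int(toy_mapped.isnull().sum())
print(n_unmapped)
```

Running the same `isnull().sum()` check on the mushroom columns is a cheap way to confirm that every code was translated.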

Because of the sheer number of attributes in this dataset, we will work with a subset of the data.

In [None]:
mush = mush[['edible?','cap-shape','cap-color','cap-surface']]
mush.info()

We'll now convert them into binary features using the `pd.get_dummies` function.

In [None]:

pd.get_dummies(mush['cap-shape']).head()

For each new column, we'll preface it with the original column name

In [None]:
mush_code = pd.DataFrame(mush['edible?'].map({'p':0,'e':1}))

for column in mush.columns:
    if column == 'edible?':
        continue
    temp = pd.get_dummies(mush[column],prefix=column)
    mush_code[temp.columns] = temp
mush_code.head()
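As an alternative to the loop, `pd.get_dummies` can also take a whole DataFrame along with a `columns` list, in which case it prefixes the new columns with the original column name on its own. This is a sketch on a small made-up frame (the rows are hypothetical), not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical rows that mimic the structure of the mushroom subset.
toy = pd.DataFrame({
    'edible?': [1, 0, 1],
    'cap-shape': ['bell', 'convex', 'bell'],
    'cap-color': ['red', 'red', 'white'],
})

# Encode the listed columns; each new column is named <column>_<category>.
toy_encoded = pd.get_dummies(toy, columns=['cap-shape', 'cap-color'])
print(list(toy_encoded.columns))
```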
 

Awesome, we've prepared our data. Now we need to provide it to an SVM Classifier and see how we do.

In [None]:
X= mush_code.drop('edible?',axis=1)
y = mush_code['edible?']

In the interest of showing a variety of approaches and flexing your coding skills, what is this code below doing?

In [None]:
index = np.random.permutation(len(X))
train = index[:len(X)*4//5]
test = index[len(X)*4//5:]
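Yet another approach is scikit-learn's `train_test_split` helper, which shuffles and slices in one call. The sketch below runs it on small stand-in arrays (hypothetical data, with names chosen so they don't overwrite the lab's `X` and `y`); `random_state` is set only to make the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 rows, 2 features.
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([0, 1] * 5)

# test_size=0.2 keeps 80% of the rows for training.
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)
```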

We can compare two common kernels (note: the default kernel is rbf). 

In [None]:
model = SVC(C=1,kernel='linear').fit(X.iloc[train],y.iloc[train])
print(classification_report(y.iloc[test],model.predict(X.iloc[test])))

In [None]:
model = SVC(C=1,kernel='rbf').fit(X.iloc[train],y.iloc[train])
print(classification_report(y.iloc[test],model.predict(X.iloc[test])))
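When comparing kernels, it can be handy to fit them in a loop on the same split and collect a single score per kernel. The sketch below does this on the Iris data, since it ships with scikit-learn; the split settings and the `kernel_scores` dictionary are our own choices, and accuracy is used only as a quick summary next to the full classification report.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

demo = datasets.load_iris()
Xk_train, Xk_test, yk_train, yk_test = train_test_split(
    demo.data, demo.target, test_size=0.2, random_state=0)

# Fit one model per kernel on the same split and record its test accuracy.
kernel_scores = {}
for kernel in ['linear', 'rbf']:
    clf = SVC(C=1, kernel=kernel).fit(Xk_train, yk_train)
    kernel_scores[kernel] = clf.score(Xk_test, yk_test)
print(kernel_scores)
```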

### Exercise 1
Dummy-encode one column and add the 'edible?' column to it. Train an SVC with a linear kernel on it and generate a confusion matrix.

In [None]:
from sklearn.metrics import confusion_matrix

### Exercise 2
Plot the coefficients for the columns. Is this surprising? Share the results with your neighbor and identify the category of your column that best identifies an edible mushroom.

### Exercise 3

Train your SVM on every single category! What would you expect the score to do? (Hint: take your process for Exercise 1 and apply it to every column)

### Exercise 4
Cross-validate an SVC model. Remember `cross_val_score`.

In [None]:
# sklearn.cross_validation was replaced by sklearn.model_selection
from sklearn.model_selection import cross_val_score

### Exercise 5: Old Dog, New Data
Utilize the mushroom dataset and train a Decision Tree on it. Are you overfitting? How do you know?

### Advanced Topic: Kernel Play

http://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html

What kernel performs best for the mushroom data? Do we know why? (The kernels shown are 'linear', 'poly' and 'rbf'; there are others you can see in the documentation.)