## DAT18 Lab 10
## Support Vector Machines

Support vector machines (SVMs) are powerful classifiers built on the idea that data which cannot be separated in its original space can often be separated in a higher-dimensional space, via an appropriate hyperplane in that space.
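To make the "higher dimension" idea concrete, here is a minimal sketch on made-up one-dimensional points. The values, the labels, and the `lift` and `separable_by_threshold` helpers are illustrative assumptions, not part of the lab data. No single threshold on x separates the two classes, but after mapping each point to (x, x squared) a horizontal line does.

```python
# Toy 1-D data: class 1 sits on both sides of class 0, so no single
# threshold on x can separate them. (Values are made up for illustration.)
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, 0, 0, 0, 1, 1]


def lift(x):
    """Map a 1-D point into 2-D by adding x squared as a second feature."""
    return (x, x * x)


def separable_by_threshold(values, labels):
    """Return True if some threshold puts each class entirely on its own side."""
    zeros = [v for v, l in zip(values, labels) if l == 0]
    ones = [v for v, l in zip(values, labels) if l == 1]
    return max(zeros) < min(ones) or max(ones) < min(zeros)


# In the original space the classes interleave.
separable_1d = separable_by_threshold(xs, labels)

# In the lifted space the second coordinate alone separates them,
# i.e. a horizontal line is a separating hyperplane.
lifted = [lift(x) for x in xs]
separable_2d = separable_by_threshold([p[1] for p in lifted], labels)

print(separable_1d, separable_2d)
```

A kernel SVM does not build the lifted features explicitly, but the geometric intuition is the same.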

As always, we'll import our standard packages, as well as two new ones: svm.SVC and tree.DecisionTreeClassifier. SVC stands for Support Vector Classification. There is an SVR class as well, but it is for using SVMs in regression, which is out of scope for this lab.

An SVM can also be used for categorical data. Because SVMs are more complex than most classification algorithms we've seen, there are many more parameters to tune and options to set for the SVC. Sklearn SVC documentation:

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC

###Load the data!
To demonstrate these classifiers clearly, we will use the Iris dataset again.

In [2]:
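import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# import some data to play with
iris_port = datasets.load_iris()
iris = pd.DataFrame(iris_port.data, columns=iris_port.feature_names)
y = iris_port.target
X = iris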

In [3]:
# Shuffle the row positions, then use the first 60% for training
index = np.arange(len(X))
np.random.shuffle(index)
split = len(X) * 3 // 5
train = index[:split]
test = index[split:]

In [4]:
model = SVC(kernel='linear',C=1).fit(X.iloc[train],y[train])
print classification_report(y[test],model.predict(X.iloc[test]))

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        22
          1       1.00      0.94      0.97        18
          2       0.95      1.00      0.98        20

avg / total       0.98      0.98      0.98        60



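The linear kernel exposes a coef\_ attribute we can use to plot our features. Note that SVC handles multiclass problems with one-vs-one classifiers, so each row of coef\_ belongs to a pair of classes (0 vs 1, 0 vs 2, 1 vs 2) rather than to a single target. With three classes there happen to be three rows, which the cells below label with the three target names as a rough shorthand.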
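The row order of coef\_ for a multiclass SVC can be sketched as follows. As I understand the scikit-learn documentation, the model is fit one-vs-one, so there is one row per pair of classes. `row_labels` is a hypothetical name used only for this illustration.

```python
from itertools import combinations

# Target names in the order scikit-learn reports them for Iris.
names = ['setosa', 'versicolor', 'virginica']

# One-vs-one trains one binary classifier per unordered pair of classes.
# combinations() yields the pairs in the order (0, 1), (0, 2), (1, 2).
pairs = list(combinations(names, 2))

# Hypothetical labels for the rows of coef_, one per class pair.
row_labels = ['{} vs {}'.format(a, b) for a, b in pairs]
print(row_labels)
```

With three classes this gives three pairs, which is why coef\_ has three rows here; with four classes it would have six.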

In [5]:
names = iris_port.target_names
names

array(['setosa', 'versicolor', 'virginica'], 
      dtype='|S10')

In [6]:
data = {}
for i,row in enumerate(model.coef_):
    data[names[i]] = list(row)
data

{'setosa': [-0.00019101758678941039,
  0.59481596255038793,
  -0.96634512622701063,
  -0.44625523510288329],
 'versicolor': [-0.0074537730401550428,
  0.17900529162781531,
  -0.53817143524840017,
  -0.29260259710961933],
 'virginica': [0.54602913544673015,
  1.1574209058950693,
  -1.8437995086110348,
  -1.7365773352566727]}

In [7]:
from bokeh._legacy_charts import Bar, show

In [8]:
p=Bar(data, cat = list(iris.columns), title="SVC Feature Importance",xlabel='Flowers', ylabel='Linear Coefficient', width=600, height=600, legend="top_right")
show(p)

  warn("Instantiating a Legacy Chart from bokeh._legacy_charts")


In [9]:
model = SVC(kernel='rbf',C=1).fit(X.iloc[train],y[train])
print classification_report(y[test],model.predict(X.iloc[test]))

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        22
          1       0.94      0.94      0.94        18
          2       0.95      0.95      0.95        20

avg / total       0.97      0.97      0.97        60




Today we'll be working with a mushroom dataset. If you're lost in a forest and find a gill capped mushroom and have access to your SVM classifier, you'll hopefully be prepared to see if it's poisonous! Humor aside, we'll see the power of an SVM working with a large number of attributes to separate two classes of data.

The attributes are:
1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s 
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s 
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y 
4. bruises?: bruises=t,no=f 
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s 
6. gill-attachment: attached=a,descending=d,free=f,notched=n 
7. gill-spacing: close=c,crowded=w,distant=d 
8. gill-size: broad=b,narrow=n 
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y 
10. stalk-shape: enlarging=e,tapering=t 
11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=? 
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s 
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s 
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
16. veil-type: partial=p,universal=u 
17. veil-color: brown=n,orange=o,white=w,yellow=y 
18. ring-number: none=n,one=o,two=t 
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z 
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y 
21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y 
22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d


Because of the structure of the categories, I'm going to create a column:categories dictionary. The first step is to put the data into a triple-quoted string, which is opened by three apostrophes. The string can span multiple lines and ends only at the next three apostrophes.

In [10]:
attributes = '''cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s 
cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s 
cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y 
bruises?: bruises=t,no=f 
odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s 
gill-attachment: attached=a,descending=d,free=f,notched=n 
gill-spacing: close=c,crowded=w,distant=d 
gill-size: broad=b,narrow=n 
gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y 
stalk-shape: enlarging=e,tapering=t 
stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=? 
stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s 
stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s 
stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
veil-type: partial=p,universal=u 
veil-color: brown=n,orange=o,white=w,yellow=y 
ring-number: none=n,one=o,two=t 
ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z 
spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y 
population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y 
habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d'''
attributes_list = attributes.split('\n')
attributes_list

['cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s ',
 'cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s ',
 'cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y ',
 'bruises?: bruises=t,no=f ',
 'odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s ',
 'gill-attachment: attached=a,descending=d,free=f,notched=n ',
 'gill-spacing: close=c,crowded=w,distant=d ',
 'gill-size: broad=b,narrow=n ',
 'gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y ',
 'stalk-shape: enlarging=e,tapering=t ',
 'stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=? ',
 'stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s ',
 'stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s ',
 'stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y ',
 'stalk-color-below-ring: brown=n,buff=b,cin

In [11]:
ordered_attributes = []
data_attributes = {}
for att in attributes_list:
    #Break our string into the column name and categories
    col_data_split = att.split(': ')
    #next, we split our category labels into a list of name=value
    cat_labels = col_data_split[1].split(',')
    
    # Now extract the codes (single letters in the data) and their names with a
    # dict comprehension: the part after the '=' is the code, the part before it is the name.
    
    # The code is split on spaces again because some entries have trailing spaces.
    # Names that follow ', ' in the source keep their leading space (e.g. ' pink').
    cats = {x.split('=')[1].split(' ')[0]:x.split('=')[0] for x in cat_labels}
    # Populate the columns dictionary: the key is the column name and the value is
    # the code-to-name dict called cats here
    data_attributes[col_data_split[0]] = cats
    
    # we also want an ordered list to declare as the columns for our dataframe
    ordered_attributes.append(col_data_split[0])


data_attributes

{'bruises?': {'f': 'no', 't': 'bruises'},
 'cap-color': {'b': 'buff',
  'c': 'cinnamon',
  'e': 'red',
  'g': 'gray',
  'n': 'brown',
  'p': ' pink',
  'r': 'green',
  'u': 'purple',
  'w': 'white',
  'y': 'yellow'},
 'cap-shape': {'b': 'bell',
  'c': 'conical',
  'f': 'flat',
  'k': ' knobbed',
  's': 'sunken',
  'x': 'convex'},
 'cap-surface': {'f': 'fibrous', 'g': 'grooves', 's': 'smooth', 'y': 'scaly'},
 'gill-attachment': {'a': 'attached',
  'd': 'descending',
  'f': 'free',
  'n': 'notched'},
 'gill-color': {'b': 'buff',
  'e': 'red',
  'g': 'gray',
  'h': 'chocolate',
  'k': 'black',
  'n': 'brown',
  'o': 'orange',
  'p': 'pink',
  'r': ' green',
  'u': 'purple',
  'w': ' white',
  'y': 'yellow'},
 'gill-size': {'b': 'broad', 'n': 'narrow'},
 'gill-spacing': {'c': 'close', 'd': 'distant', 'w': 'crowded'},
 'habitat': {'d': 'woods',
  'g': 'grasses',
  'l': 'leaves',
  'm': 'meadows',
  'p': 'paths',
  'u': ' urban',
  'w': 'waste'},
 'odor': {'a': 'almond',
  'c': 'creosote',
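Some names in the output above carry a leading space (' pink', ' knobbed') because the source text has a space after some commas. One possible variant, sketched here with a hypothetical `parse_attribute` helper, strips each label before splitting. Using it would change the dummy column names that appear later in this lab (for example `cap-shape_ knobbed` would become `cap-shape_knobbed`), so treat it as an optional cleanup rather than what the lab ran.

```python
def parse_attribute(line):
    """Split 'column: name=code,name=code' into (column, {code: name})."""
    column, raw_labels = line.split(':', 1)
    categories = {}
    for label in raw_labels.split(','):
        # strip() removes the stray spaces around each 'name=code' entry
        name, code = label.strip().split('=')
        categories[code] = name
    return column.strip(), categories


line = 'cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s '
column, categories = parse_attribute(line)
print(column, categories)
```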
 

In [12]:
mush = pd.read_csv('../data/mushrooms.data',header=None,names=['edible?']+ordered_attributes)
mush.head()

Unnamed: 0,edible?,cap-shape,cap-surface,cap-color,bruises?,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


Let's give our data verbose names by using .map().

In [13]:
mush.columns

Index([u'edible?', u'cap-shape', u'cap-surface', u'cap-color', u'bruises?',
       u'odor', u'gill-attachment', u'gill-spacing', u'gill-size',
       u'gill-color', u'stalk-shape', u'stalk-root',
       u'stalk-surface-above-ring', u'stalk-surface-below-ring',
       u'stalk-color-above-ring', u'stalk-color-below-ring', u'veil-type',
       u'veil-color', u'ring-number', u'ring-type', u'spore-print-color',
       u'population', u'habitat'],
      dtype='object')

In [14]:
for col in mush.columns:
    if col == 'edible?':
        continue
    mush[col] = mush[col].map(data_attributes[col])
mush.head()

Unnamed: 0,edible?,cap-shape,cap-surface,cap-color,bruises?,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,convex,smooth,brown,bruises,pungent,free,close,narrow,black,...,smooth,white,white,partial,white,one,pendant,black,scattered,urban
1,e,convex,smooth,yellow,bruises,almond,free,close,broad,black,...,smooth,white,white,partial,white,one,pendant,brown,numerous,grasses
2,e,bell,smooth,white,bruises,anise,free,close,broad,brown,...,smooth,white,white,partial,white,one,pendant,brown,numerous,meadows
3,p,convex,scaly,white,bruises,pungent,free,close,narrow,brown,...,smooth,white,white,partial,white,one,pendant,black,scattered,urban
4,e,convex,smooth,gray,no,none,free,crowded,broad,black,...,smooth,white,white,partial,white,one,evanescent,brown,abundant,grasses


Because of the sheer number of attributes in this dataset, we will work with a subset of the data.

In [15]:
mush_small = mush[['edible?','cap-shape','cap-color','cap-surface']]
mush_small.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8124 entries, 0 to 8123
Data columns (total 4 columns):
edible?        8124 non-null object
cap-shape      8124 non-null object
cap-color      8124 non-null object
cap-surface    8124 non-null object
dtypes: object(4)
memory usage: 317.3+ KB


We'll now convert them into binary features using the `pd.get_dummies` function.

In [16]:
pd.get_dummies(mush_small['cap-shape']).head()

Unnamed: 0,knobbed,bell,conical,convex,flat,sunken
0,0,0,0,1,0,0
1,0,0,0,1,0,0
2,0,1,0,0,0,0
3,0,0,0,1,0,0
4,0,0,0,1,0,0
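Conceptually, one-hot encoding creates one 0/1 column per distinct category. The pure-Python sketch below applies that idea to the first five cap shapes shown earlier. The `one_hot` helper is hypothetical and only illustrates the concept; it is not how pandas implements `get_dummies`.

```python
def one_hot(values):
    """Return (sorted category names, list of 0/1 rows), one column per category."""
    categories = sorted(set(values))
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, rows


# First five cap shapes from the mapped mushroom table.
cap_shapes = ['convex', 'convex', 'bell', 'convex', 'convex']
columns, rows = one_hot(cap_shapes)
print(columns)
for row in rows:
    print(row)
```

Each row contains exactly one 1, in the column of the category that row belongs to.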


In [17]:
mush_small_code = pd.DataFrame(mush_small['edible?'].map({'p':0,'e':1}))

for column in mush_small.columns:
    if column == 'edible?':
        continue
    temp = pd.get_dummies(mush_small[column],prefix=column)
    mush_small_code[temp.columns] = temp
mush_small_code.head()
 

Unnamed: 0,edible?,cap-shape_ knobbed,cap-shape_bell,cap-shape_conical,cap-shape_convex,cap-shape_flat,cap-shape_sunken,cap-color_ pink,cap-color_brown,cap-color_buff,...,cap-color_gray,cap-color_green,cap-color_purple,cap-color_red,cap-color_white,cap-color_yellow,cap-surface_fibrous,cap-surface_grooves,cap-surface_scaly,cap-surface_smooth
0,0,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
1,1,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
2,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
4,1,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1


In [18]:
X= mush_small_code.drop('edible?',axis=1)
y = mush_small_code['edible?']

What is our baseline?

In [19]:
y.value_counts()/float(len(y))

1    0.517971
0    0.482029
Name: edible?, dtype: float64

Thus, a model that always predicts "Edible" has about 52% accuracy.
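The baseline is simply the share of the most frequent class. A minimal sketch, assuming class counts of 4208 edible and 3916 poisonous; these are back-computed from the proportions above and the 8124 rows reported by `info()`, not read from the data file.

```python
from collections import Counter

# Class counts implied by the proportions above (8124 rows in total);
# 1 = edible, 0 = poisonous. These counts are an assumption derived
# from the printed proportions.
y = [1] * 4208 + [0] * 3916

counts = Counter(y)
majority_class, majority_count = counts.most_common(1)[0]

# A classifier that always predicts the majority class is right
# exactly as often as that class occurs.
baseline_accuracy = majority_count / float(len(y))
print(majority_class, round(baseline_accuracy, 3))
```

Any model worth keeping should beat this number on held-out data.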

In [20]:
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

model = SVC(C=1,kernel='linear').fit(X_train,y_train)
print classification_report(y_test,model.predict(X_test))

             precision    recall  f1-score   support

          0       0.62      0.81      0.70      1275
          1       0.76      0.54      0.63      1406

avg / total       0.69      0.67      0.66      2681



In [21]:
model = SVC(C=1,kernel='rbf').fit(X_train,y_train)
print classification_report(y_test,model.predict(X_test))

             precision    recall  f1-score   support

          0       0.65      0.71      0.68      1275
          1       0.71      0.65      0.68      1406

avg / total       0.68      0.68      0.68      2681



###Exercise 1

Create a new dataset called mush_medium that includes both the cap- and gill-related features plus the edible? column. Train an SVC with a linear kernel on it and generate a confusion matrix. Use the train_test_split function provided by sklearn with test_size = 0.33 and random_state = 0.

In [22]:
mush_medium = mush[['edible?', 'cap-shape', 'cap-surface', 'cap-color', 
                    'gill-attachment', 'gill-spacing', 'gill-size',
                    'gill-color']]
mush_medium.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8124 entries, 0 to 8123
Data columns (total 8 columns):
edible?            8124 non-null object
cap-shape          8124 non-null object
cap-surface        8124 non-null object
cap-color          8124 non-null object
gill-attachment    8124 non-null object
gill-spacing       8124 non-null object
gill-size          8124 non-null object
gill-color         8124 non-null object
dtypes: object(8)
memory usage: 571.2+ KB


In [23]:
mush_code = pd.DataFrame(mush_medium['edible?'].map({'p':0,'e':1}))

for column in mush_medium.columns:
    if column == 'edible?':
        continue
    temp = pd.get_dummies(mush_medium[column],prefix=column)
    mush_code[temp.columns] = temp
mush_code.head()


Unnamed: 0,edible?,cap-shape_ knobbed,cap-shape_bell,cap-shape_conical,cap-shape_convex,cap-shape_flat,cap-shape_sunken,cap-surface_fibrous,cap-surface_grooves,cap-surface_scaly,...,gill-color_black,gill-color_brown,gill-color_buff,gill-color_chocolate,gill-color_gray,gill-color_orange,gill-color_pink,gill-color_purple,gill-color_red,gill-color_yellow
0,0,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1,1,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,1,0,1,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,1,...,0,1,0,0,0,0,0,0,0,0
4,1,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


In [24]:
mush_code.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8124 entries, 0 to 8123
Data columns (total 39 columns):
edible?                     8124 non-null int64
cap-shape_ knobbed          8124 non-null float64
cap-shape_bell              8124 non-null float64
cap-shape_conical           8124 non-null float64
cap-shape_convex            8124 non-null float64
cap-shape_flat              8124 non-null float64
cap-shape_sunken            8124 non-null float64
cap-surface_fibrous         8124 non-null float64
cap-surface_grooves         8124 non-null float64
cap-surface_scaly           8124 non-null float64
cap-surface_smooth          8124 non-null float64
cap-color_ pink             8124 non-null float64
cap-color_brown             8124 non-null float64
cap-color_buff              8124 non-null float64
cap-color_cinnamon          8124 non-null float64
cap-color_gray              8124 non-null float64
cap-color_green             8124 non-null float64
cap-color_purple            8124 non-null flo

In [25]:
X= mush_code.drop('edible?',axis=1)
y = mush_code['edible?']

In [26]:
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

In [27]:
model = SVC(C=1,kernel='linear').fit(X_train,y_train)
y_pred = model.predict(X_test)
print classification_report(y_test,y_pred)

             precision    recall  f1-score   support

          0       0.90      0.95      0.93      1275
          1       0.95      0.90      0.93      1406

avg / total       0.93      0.93      0.93      2681



In [28]:
from sklearn.metrics import confusion_matrix

print confusion_matrix(y_test, y_pred)

[[1215   60]
 [ 137 1269]]
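The precision and recall values in the classification report can be recovered from this confusion matrix by hand. The sketch below uses the four counts printed above and assumes scikit-learn's convention that rows are true classes and columns are predicted classes; the `precision` and `recall` helpers are written here for illustration.

```python
# Confusion matrix from the cell above: rows are true classes,
# columns are predicted classes (0 = poisonous, 1 = edible).
cm = [[1215, 60],
      [137, 1269]]


def precision(cm, k):
    """Of everything predicted as class k, the fraction that truly is k."""
    predicted_k = sum(row[k] for row in cm)
    return cm[k][k] / float(predicted_k)


def recall(cm, k):
    """Of everything that truly is class k, the fraction predicted as k."""
    return cm[k][k] / float(sum(cm[k]))


# Overall accuracy: correct predictions (the diagonal) over all samples.
accuracy = (cm[0][0] + cm[1][1]) / float(sum(map(sum, cm)))

for k in (0, 1):
    print(k, round(precision(cm, k), 2), round(recall(cm, k), 2))
print(round(accuracy, 2))
```

In the mushroom setting the off-diagonal 137 is the costly cell to watch if you care about missing edible mushrooms, while the 60 counts poisonous mushrooms predicted as edible.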


###Exercise 2
Plot the coefficients for the columns. Keep in mind that we now have only 2 categories, edible and not edible, so coef\_ has a single row.
Is the plot surprising? Share the results with your neighbor and identify the feature that best identifies an edible mushroom.

In [29]:
coef_dict = {k:v for k, v in zip(X.columns, model.coef_[0])}

p=Bar(coef_dict.values(), cat = coef_dict.keys(), title="SVC Feature Importance",xlabel='Features', ylabel='Linear Coefficient', width=600, height=600, legend="top_right")
show(p)

###Exercise 3
Build a model with only the top 6 features. Use features with large coefficients, both positive and negative. (The solution below keeps the 10 largest by absolute value.)

In [30]:
abs_coef_dict = {k:abs(v) for k, v in coef_dict.iteritems()}

In [31]:
import operator
sorted_coefs = sorted(abs_coef_dict.items(), key=operator.itemgetter(1), reverse = True)
sorted_coefs[0:10]

[('cap-shape_sunken', 4.6654338634591426),
 ('cap-color_purple', 4.5987504564286104),
 ('cap-color_green', 4.5987209334201991),
 ('cap-color_buff', 3.3990889268779849),
 ('gill-color_ green', 3.0001402661801841),
 ('gill-color_red', 2.9994602523768221),
 ('gill-size_narrow', 1.9994974119341471),
 ('gill-size_broad', 1.9994974119339699),
 ('gill-spacing_close', 1.9993692163404049),
 ('gill-spacing_crowded', 1.9993692163402725)]
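Ranking by absolute value keeps strongly negative coefficients in the running alongside strongly positive ones. A minimal sketch with made-up coefficients; the feature names, the values, and the `top_k_by_magnitude` helper are all hypothetical.

```python
# Hypothetical coefficients, made up for illustration only.
coefs = {
    'feature_a': -4.2,
    'feature_b': 0.3,
    'feature_c': 2.9,
    'feature_d': -0.1,
    'feature_e': 1.5,
}


def top_k_by_magnitude(coefs, k):
    """Return the k feature names with the largest absolute coefficient."""
    ranked = sorted(coefs, key=lambda name: abs(coefs[name]), reverse=True)
    return ranked[:k]


print(top_k_by_magnitude(coefs, 3))
```

In the ranking above, gill-size_narrow and gill-size_broad have nearly identical magnitudes, which is what one would expect when two dummy columns are exact complements of each other. Keeping both adds little information.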

In [32]:
selected_columns = ['edible?']
for i in xrange(0,10):
    selected_columns.append(sorted_coefs[i][0])
selected_columns

['edible?',
 'cap-shape_sunken',
 'cap-color_purple',
 'cap-color_green',
 'cap-color_buff',
 'gill-color_ green',
 'gill-color_red',
 'gill-size_narrow',
 'gill-size_broad',
 'gill-spacing_close',
 'gill-spacing_crowded']

In [34]:
mush_select = mush_code[selected_columns]
mush_select.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8124 entries, 0 to 8123
Data columns (total 11 columns):
edible?                 8124 non-null int64
cap-shape_sunken        8124 non-null float64
cap-color_purple        8124 non-null float64
cap-color_green         8124 non-null float64
cap-color_buff          8124 non-null float64
gill-color_ green       8124 non-null float64
gill-color_red          8124 non-null float64
gill-size_narrow        8124 non-null float64
gill-size_broad         8124 non-null float64
gill-spacing_close      8124 non-null float64
gill-spacing_crowded    8124 non-null float64
dtypes: float64(10), int64(1)
memory usage: 761.6 KB


In [35]:
X= mush_select.drop('edible?',axis=1)
y = mush_select['edible?']

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

In [37]:
model = SVC(C=1,kernel='linear').fit(X_train,y_train)
y_pred = model.predict(X_test)
print classification_report(y_test,y_pred)

             precision    recall  f1-score   support

          0       0.95      0.56      0.71      1275
          1       0.71      0.97      0.82      1406

avg / total       0.82      0.78      0.77      2681



# Learning Curves

###Exercise 4

Plot learning curves for train sizes between 5% and 100%.

Use StratifiedKFold with 5 folds as cross validation.

In [38]:
from sklearn.learning_curve import learning_curve
from sklearn.cross_validation import StratifiedKFold

train_sizes, train_scores, test_scores = learning_curve(model,
                                                        X,
                                                        y,
                                                        train_sizes=np.linspace(0.05, 1.0, 10),
                                                        cv = StratifiedKFold(y, n_folds=5, shuffle=True, random_state=0))
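Stratification means every fold keeps roughly the same class proportions as the full dataset. The sketch below is a simplified illustration of that idea, not scikit-learn's implementation: it does no shuffling, and the `stratified_folds` helper and the toy labels are assumptions made for the example.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds):
    """Assign each sample index to a fold, spreading every class evenly."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Deal the indices of one class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Toy labels: 10 samples of class 0 and 5 of class 1 (made up).
labels = [0] * 10 + [1] * 5
folds = stratified_folds(labels, 5)
for fold in folds:
    print(sorted(fold), [labels[i] for i in sorted(fold)])
```

Each of the five folds ends up with two samples of class 0 and one of class 1, matching the 2:1 ratio of the full list.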

In [39]:
from bokeh.plotting import figure

# Create our base figure
p = figure(title='Learning Curve',y_range=(0,1))

# Create our Training score line
p.line(x=train_sizes,y=train_scores.mean(axis=1),color='red',legend="Training Scores")

#Create our Testing score line
p.line(x=train_sizes,y=test_scores.mean(axis=1),color='blue',legend = "Test Scores")

#Move our legend around
p.legend.orientation = "bottom_right"

# Render the plot!!
show(p)

# Grid search

###Exercise 5

Use GridSearchCV from the grid_search module to explore different kernels and values for the C parameter on the mush_small_code data.

Can you improve on the previous score?

In [40]:
from sklearn import grid_search

parameters = {'kernel':('linear', 'rbf'), 'C':[0.1, 1, 10, 100]}
svc = SVC(kernel='linear')
model = grid_search.GridSearchCV(svc, parameters, n_jobs=4)

X= mush_small_code.drop('edible?',axis=1)
y = mush_small_code['edible?']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

model.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False),
       fit_params={}, iid=True, loss_func=None, n_jobs=4,
       param_grid={'kernel': ('linear', 'rbf'), 'C': [0.1, 1, 10, 100]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)
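A grid search evaluates every combination of the listed parameter values. The sketch below only enumerates those combinations for the `parameters` dictionary used above; the `candidates` name is hypothetical, and the cross-validated scoring that GridSearchCV performs for each candidate is left out.

```python
from itertools import product

parameters = {'kernel': ('linear', 'rbf'), 'C': [0.1, 1, 10, 100]}

# Build every combination of the listed values, one dict per candidate.
keys = sorted(parameters)
candidates = [dict(zip(keys, values))
              for values in product(*(parameters[k] for k in keys))]

print(len(candidates))
for candidate in candidates:
    print(candidate)
```

Two kernels times four values of C gives eight candidate models, each of which is fit once per cross-validation fold, so the cost grows quickly as the grid gets larger.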

In [41]:
print classification_report(y_test,model.predict(X_test))

             precision    recall  f1-score   support

          0       0.68      0.75      0.72      1275
          1       0.75      0.68      0.72      1406

avg / total       0.72      0.72      0.72      2681



In [42]:
model.best_params_

{'C': 100, 'kernel': 'rbf'}