## UCI Mushroom Data Set - Classification

The [UCI Mushroom Data Set](http://archive.ics.uci.edu/ml/datasets/Mushroom?ref=datanews.io) is used to train a model that predicts whether a mushroom is poisonous.

## Setup

In [13]:
import pandas as pd
import numpy as np

URL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data'

COL_NAMES = ['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 'gill-attachment',
             'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
             'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type',
             'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']

mushroom_df = pd.read_csv(URL, names=COL_NAMES)

mushroom_df.head()

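| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |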


## Preprocess Data

In [14]:
# One-hot encode the string features as numeric columns so they work with sklearn
mushroom_df = pd.get_dummies(mushroom_df)

mushroom_df.head()

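| | class_e | class_p | cap-shape_b | cap-shape_c | cap-shape_f | cap-shape_k | cap-shape_s | cap-shape_x | cap-surface_f | cap-surface_g | ... | population_s | population_v | population_y | habitat_d | habitat_g | habitat_l | habitat_m | habitat_p | habitat_u | habitat_w |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |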
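
To make the encoding step concrete, here is a minimal sketch on a made-up two-column frame. The `toy_df` name and its values are illustrative assumptions, not rows from the data set. `pd.get_dummies` creates one indicator column per category value, named `<column>_<value>`. Passing `dtype=int` is an addition here so that the indicators come out as 0/1 integers on recent pandas versions, which otherwise return booleans.

```python
import pandas as pd

# Hypothetical toy data: two categorical columns with single-letter codes
toy_df = pd.DataFrame({'class': ['p', 'e', 'e'], 'odor': ['p', 'a', 'l']})

# One indicator column per (column, value) pair; dtype=int gives 0/1 rather than booleans
toy_encoded = pd.get_dummies(toy_df, dtype=int)

print(list(toy_encoded.columns))
print(toy_encoded)
```

Each row has exactly one `1` among the columns derived from the same original feature, which is why `class_e` and `class_p` carry the same information.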


In [15]:
from sklearn.model_selection import train_test_split

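# Columns 0 and 1 are class_e and class_p; use class_p (1 = poisonous) as the target
X_mush = mushroom_df.iloc[:, 2:]
y_mush = mushroom_df.iloc[:, 1]

# Split into training and test sets (by default 25% of the rows are held out for testing)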
X_train, X_test, y_train, y_test = train_test_split(X_mush, y_mush, random_state=0)
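
The split above relies on the defaults of `train_test_split`. The following sketch, on synthetic stand-in data (`X_toy` and `y_toy` are hypothetical), shows the resulting 75/25 proportions. The `stratify` argument is an optional addition that the notebook does not use; it keeps the class balance roughly equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples, 4 binary features, balanced binary target
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(100, 4))
y_toy = np.array([0, 1] * 50)

# With neither test_size nor train_size given, 25% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=0, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print('share of positives in test set:', y_te.mean())
```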

## Feature Importance

A `DecisionTreeClassifier` with default parameters is used to estimate feature importance.

In [17]:
from sklearn.tree import DecisionTreeClassifier

dTreeClassifier = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
featureSeries = pd.Series(dTreeClassifier.feature_importances_, index=X_train.columns)

# top 5 features
list(featureSeries.sort_values(ascending=False)[:5].index)

['odor_n', 'stalk-root_c', 'stalk-root_r', 'spore-print-color_r', 'odor_l']
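
Because one-hot encoding splits each original feature into several indicator columns, the ranking above is per category value (for example `odor_n`), not per original feature. One possible way to get a per-feature view is to sum the importances of indicator columns that share a prefix. The sketch below uses made-up importance values (`toy_importances`) so that it runs on its own; in the notebook the same grouping could be applied to `featureSeries`.

```python
import pandas as pd

# Made-up importances for illustration only; not results from the fitted tree
toy_importances = pd.Series(
    {'odor_n': 0.6, 'odor_l': 0.1, 'stalk-root_c': 0.2, 'stalk-root_r': 0.1}
)

# Group by the original feature name: 'odor_n' -> 'odor' (strip the trailing '_<value>')
per_feature = (
    toy_importances
    .groupby(lambda col: col.rsplit('_', 1)[0])
    .sum()
    .sort_values(ascending=False)
)

print(per_feature)
```

Splitting on the last underscore works here because the original column names use hyphens, so the only underscore is the one `get_dummies` inserts.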

## SVC Model

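A support vector classifier with a radial basis function (RBF) kernel.

`gamma` - controls the width of the RBF kernel: a larger `gamma` gives a narrower kernel.

Values of `gamma` from `0.0001` to `10` are worth trying; the cell below fits a single model with `gamma=0.01`.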
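
To illustrate what `gamma` does, the sketch below evaluates the RBF kernel, K(x, x') = exp(-gamma * ||x - x'||^2), for one pair of points across the range mentioned above. The function name `rbf_kernel_value` and the two example rows are hypothetical and chosen only for illustration.

```python
import numpy as np

def rbf_kernel_value(x_a, x_b, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    diff = np.asarray(x_a, dtype=float) - np.asarray(x_b, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

# Two hypothetical one-hot rows that differ in two indicator columns (squared distance = 2)
x_a = [1, 0, 0, 1]
x_b = [0, 1, 0, 1]

gammas = np.logspace(-4, 1, 6)  # 0.0001, 0.001, ..., 10
kernel_values = [rbf_kernel_value(x_a, x_b, g) for g in gammas]

for g, k in zip(gammas, kernel_values):
    print(f'gamma={g:g}  K={k:.6f}')
```

With a small `gamma` the kernel stays close to 1 for every pair of points, which gives a smooth decision boundary. With a large `gamma` it drops towards 0 unless the points are nearly identical, so the model can fit the training data very closely and may overfit.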

In [80]:
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve

svClassifier = SVC(kernel='rbf', C=1, gamma=0.01)

svClassifier.fit(X_train, y_train)

print('Training:', svClassifier.score(X_train, y_train))
print('Testing:', svClassifier.score(X_test, y_test))

Training: 0.9986870178893813
Testing: 1.0
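
The cell above imports `validation_curve` but fits only one `gamma` value. A minimal sketch of how the `gamma` range from `0.0001` to `10` could be swept with cross-validation is shown below. It runs on synthetic data from `make_classification` so that it is self-contained; `X_demo`, `y_demo`, and the choice of `cv=3` are assumptions, and in the notebook `X_train` and `y_train` would take their place.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Synthetic stand-in data so the sketch runs on its own (not the mushroom data)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

param_range = np.logspace(-4, 1, 6)  # gamma = 0.0001, 0.001, ..., 10

# Each returned array has shape (number of gamma values, number of CV folds)
train_scores, valid_scores = validation_curve(
    SVC(kernel='rbf', C=1), X_demo, y_demo,
    param_name='gamma', param_range=param_range, cv=3,
)

# Mean accuracy per gamma value across the folds
for gamma, tr, va in zip(param_range, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f'gamma={gamma:g}  train={tr:.3f}  validation={va:.3f}')
```

A `gamma` where the training score keeps rising while the validation score falls is a sign of overfitting.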
