## Decision Tree Classification of Poisonous Mushrooms

In [None]:
import numpy as np
import pandas as pd

mushdf = pd.read_csv('../input/mushrooms.csv')

mushdf.head()

We need to create a new data frame of indicator variables because scikit-learn's Decision Trees and Random Forests do not accept strings, only numeric feature values. This is known as one-hot encoding or binarization.

NOTE: Integer encoding (for example scikit-learn's OrdinalEncoder, or LabelEncoder applied column by column) is an option for ordinal categorical features, but the labels in this dataset do not appear to have any natural order:

cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s  
cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s  
cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y  
bruises?: bruises=t, no=f  
odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s  
gill-attachment: attached=a, descending=d, free=f, notched=n  
gill-spacing: close=c, crowded=w, distant=d  
gill-size: broad=b, narrow=n  
gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y  
stalk-shape: enlarging=e, tapering=t  
stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?  
stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s  
stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s  
stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y  
stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y  
veil-type: partial=p, universal=u  
veil-color: brown=n, orange=o, white=w, yellow=y  
ring-number: none=n, one=o, two=t  
ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z  
spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y  
population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y  
habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d
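
To make the difference concrete, here is a minimal sketch on a made-up toy frame (the column names mimic the dataset, but the rows are hypothetical). It contrasts one-hot columns with plain integer codes, which would impose an arbitrary order on the levels.

```python
import pandas as pd

# Hypothetical toy rows; only the column names are borrowed from the dataset.
toy = pd.DataFrame({
    "odor": ["a", "n", "f", "n"],
    "gill-size": ["b", "n", "b", "b"],
})

# One-hot encoding: one indicator column per (feature, level) pair.
toy_onehot = pd.get_dummies(toy)
print(list(toy_onehot.columns))

# Integer codes for comparison: levels are numbered in sorted order,
# so a tree would treat them as if 'a' < 'f' < 'n' meant something.
odor_codes = toy["odor"].astype("category").cat.codes
print(odor_codes.tolist())
```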

In [None]:
mushdf = pd.get_dummies(mushdf)

mushdf.head()

In [None]:
from sklearn.model_selection import train_test_split

X_mush = mushdf.iloc[:,2:]
y_mush = mushdf.iloc[:,1] # class_p (0=edible, 1=poisonous)

X_train, X_test, y_train, y_test = train_test_split(X_mush, y_mush, random_state=650)

Let's fit a Decision Tree with default parameters to classify the data, then examine its feature importances to see which features are most informative for identifying poisonous mushrooms.

In [None]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier().fit(X_train, y_train)
    
fimportance = list(zip(X_train.columns, dtc.feature_importances_))
fimportance.sort(key = lambda x: x[1], reverse=True)

fimportance

## Feature Selection

Let's examine the dimensionality of our features.

In [None]:
X_mush.shape

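One-hot encoding with pd.get_dummies (default settings) creates one indicator variable per level, so an original feature with *i* levels becomes *i* columns; a dummy-variable parameterization that drops a reference level would give *i*-1 instead. Our original 22 features have expanded to 117 features. This matters for performance, because every extra column is another candidate split the tree has to evaluate.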
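
As a quick check of the column count, the sketch below encodes a made-up three-level feature with and without drop_first. The toy data is hypothetical and only illustrates the *i* versus *i*-1 difference.

```python
import pandas as pd

# Made-up single feature with three levels (i = 3).
toy = pd.DataFrame({"ring-number": ["n", "o", "t", "o"]})

full = pd.get_dummies(toy)                      # default: one column per level
reduced = pd.get_dummies(toy, drop_first=True)  # drops the first level of each feature

print(full.shape[1], reduced.shape[1])
```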

Using sklearn's [SelectFromModel](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel) we can drop features that are not informative as measured by the estimated feature importance from the Decision Tree classifier.

The selected features we will use for learning are:

In [None]:
from sklearn.feature_selection import SelectFromModel

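# Keep features whose importance is at least 1% of the mean importance
selected = SelectFromModel(dtc, prefit=True, threshold='.01*mean')

# Boolean mask over the columns of X_mush (True = feature is kept)
feature_mask = selected.get_support(indices=False)

X_mush_selected = X_mush[ X_mush.columns[feature_mask] ]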
X_mush_selected.columns

In [None]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_mush_selected, y_mush, random_state=650)

dtc2 = DecisionTreeClassifier().fit(X_train2,y_train2)

#feature importances
fimportance_selected = list(zip(X_train2.columns, dtc2.feature_importances_))
fimportance_selected.sort(key = lambda x: x[1], reverse=True)

fimportance_selected

In [None]:
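import matplotlib.pyplot as plt
import seaborn as sns

# Ten most important features; fimportance is already sorted in descending order
features = [name for name, _ in fimportance[:10]]
importances = [score for _, score in fimportance[:10]]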

plt.figure()
sns.barplot(x=features, y=importances)
plt.ylabel('Importance')
plt.xlabel('Feature')
plt.title('Selected Feature Importances from Default Decision Tree')

for item in plt.gca().xaxis.get_ticklabels():
    item.set_rotation(90)

It appears that the most informative features are odor and the stalk root.

Let's examine how well the Decision Tree with default parameters classifies the mushrooms.

In [None]:
print('Mushroom dataset: decision tree')
print('Accuracy of DT classifier on training set: {:.2f}'
     .format(dtc2.score(X_train2, y_train2)))
print('Accuracy of DT classifier on test set: {:.2f}'
     .format(dtc2.score(X_test2, y_test2)))

## Visualizing the Tree

In [None]:
import graphviz
from sklearn.tree import export_graphviz

def plot_decision_tree(dtc, feature_names, class_names):
    # out_file=None makes export_graphviz return the DOT source as a string
    dot_graph = export_graphviz(dtc, out_file=None, feature_names=feature_names,
                                class_names=class_names, filled=True, impurity=False)
    return graphviz.Source(dot_graph)

# class_names must follow the sorted order of dtc2.classes_: 0 = edible, 1 = poisonous
plot_decision_tree(dtc2, list(X_mush_selected.columns), ['edible', 'poisonous'])

## Tuning Hyperparameters with GridSearchCV

Can we tune hyperparameters after reducing dimensionality?

In [None]:
from sklearn.pipeline import Pipeline

# TODO: tune max_depth, min_samples_leaf
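
One possible way to approach this is sketched below, under assumptions: the feature-selection step and the tree are chained in a Pipeline, and GridSearchCV searches over max_depth and min_samples_leaf. The synthetic data, the step names ("select", "tree") and the grid values are illustrative choices, not results or settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; swap in X_mush / y_mush in the notebook.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=650)

# Step names are hypothetical; they only need to match the grid keys below.
pipe = Pipeline([
    ("select", SelectFromModel(DecisionTreeClassifier(random_state=0),
                               threshold="0.01*mean")),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Illustrative grid values, not tuned recommendations.
param_grid = {
    "tree__max_depth": [2, 4, 8, None],
    "tree__min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))
```

Putting the selector inside the pipeline means it is refit on each cross-validation training fold, so the held-out fold never influences which features are kept.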