# Mushroom: Poisonous or Edible?
 
In this project, the goal is to predict whether a mushroom is edible or poisonous using machine learning.

The model I chose is a Decision Tree, a powerful classification algorithm that is capable of fitting complex datasets such as the one I have.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns

## Load the data

In [2]:
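# Note: the file has no header row, so read_csv uses the first record
# as column names here (the p, x, s, ... labels in the output).
df= pd.read_csv('data/agaricus-lepiota.data')
df.head()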

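|   | p | x | s | n | t | p.1 | f | c | n.1 | k | ... | s.2 | w | w.1 | p.2 | w.2 | o | p.3 | k.1 | s.3 | u |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 1 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 2 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 3 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |
| 4 | e | x | y | y | t | a | f | c | b | n | ... | s | w | w | p | w | o | p | k | n | g |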


### Rename columns


In [3]:
new_cols=['classes', 'cap_shape', 'cap_surface', 'cap_color', 'bruises?',
         'odor', 'gill_attachment', 'gill_spacing',
         'gill_size', 'gill_color', 'stalk_shape', 'stalk_root',
         'stalk_surface_above_ring', 'stalk_surface_below_ring', 'stalk_color_above_ring',
         'stalk_color_below_ring', 'veil_type', 'veil_color', 'ring_number',
         'ring_type', 'spore_print_color', 'population', 'habitat']

df.columns= new_cols
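
The column labels in the `df.head()` output above (`p`, `x`, `s`, ...) look like a data record rather than a header. A minimal sketch of one way to avoid that, assuming the file really has no header row; the helper name `load_mushrooms` and the two-record sample are made up for illustration:

```python
import io
import pandas as pd

def load_mushrooms(source, columns):
    """Read a header-less CSV and assign column names in one step."""
    # header=None stops pandas from treating the first record as a header;
    # names=... labels the columns directly, so no separate rename is needed.
    return pd.read_csv(source, header=None, names=columns)

# Hypothetical two-record sample with three columns (not the real file).
sample = io.StringIO("p,x,s\ne,b,y\n")
demo = load_mushrooms(sample, ['classes', 'cap_shape', 'cap_surface'])
print(demo.shape)
```

With the real file the call would be `load_mushrooms('data/agaricus-lepiota.data', new_cols)`, which should keep the first record as data and so yield one more row than the load used in this notebook.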

### Basic information about the data


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8123 entries, 0 to 8122
Data columns (total 23 columns):
classes                     8123 non-null object
cap_shape                   8123 non-null object
cap_surface                 8123 non-null object
cap_color                   8123 non-null object
bruises?                    8123 non-null object
odor                        8123 non-null object
gill_attachment             8123 non-null object
gill_spacing                8123 non-null object
gill_size                   8123 non-null object
gill_color                  8123 non-null object
stalk_shape                 8123 non-null object
stalk_root                  8123 non-null object
stalk_surface_above_ring    8123 non-null object
stalk_surface_below_ring    8123 non-null object
stalk_color_above_ring      8123 non-null object
stalk_color_below_ring      8123 non-null object
veil_type                   8123 non-null object
veil_color                  8123 non-null object
ring_number                 8123 non-null object
...

Here we have 8123 entries and 23 columns. Every column has dtype object, so all the features are categorical and none are numerical. None of the columns contains null entries. Below I look at the categories each column takes.

In [5]:
for col in df.columns:
    print(f'{col} {df[col].unique()}')

classes ['e' 'p']
cap_shape ['x' 'b' 's' 'f' 'k' 'c']
cap_surface ['s' 'y' 'f' 'g']
cap_color ['y' 'w' 'g' 'n' 'e' 'p' 'b' 'u' 'c' 'r']
bruises? ['t' 'f']
odor ['a' 'l' 'p' 'n' 'f' 'c' 'y' 's' 'm']
gill_attachment ['f' 'a']
gill_spacing ['c' 'w']
gill_size ['b' 'n']
gill_color ['k' 'n' 'g' 'p' 'w' 'h' 'u' 'e' 'b' 'r' 'y' 'o']
stalk_shape ['e' 't']
stalk_root ['c' 'e' 'b' 'r' '?']
stalk_surface_above_ring ['s' 'f' 'k' 'y']
stalk_surface_below_ring ['s' 'f' 'y' 'k']
stalk_color_above_ring ['w' 'g' 'p' 'n' 'b' 'e' 'o' 'c' 'y']
stalk_color_below_ring ['w' 'p' 'g' 'b' 'n' 'e' 'y' 'o' 'c']
veil_type ['p']
veil_color ['w' 'n' 'o' 'y']
ring_number ['o' 't' 'n']
ring_type ['p' 'e' 'l' 'f' 'n']
spore_print_color ['n' 'k' 'u' 'h' 'w' 'r' 'o' 'y' 'b']
population ['n' 's' 'a' 'v' 'y' 'c']
habitat ['g' 'm' 'u' 'd' 'p' 'w' 'l']


The 'stalk_root' column contains a question mark, which is a placeholder for missing data. Because it is stored as an ordinary string, `df.info()` did not report it as null.
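
Reading through the printed categories works for 23 columns, but the same check can be done programmatically. A small sketch, assuming `'?'` is the only placeholder in use; `placeholder_counts` and the toy frame are hypothetical:

```python
import pandas as pd

def placeholder_counts(frame, placeholder='?'):
    """Count how often `placeholder` appears in each column."""
    counts = (frame == placeholder).sum()
    # Keep only the columns where the placeholder actually occurs.
    return counts[counts > 0]

# Hypothetical toy frame standing in for the mushroom data.
toy = pd.DataFrame({
    'stalk_root': ['b', '?', 'e', '?'],
    'odor': ['a', 'l', 'n', 'f'],
})
print(placeholder_counts(toy))
```

On the real data this would be `placeholder_counts(df)`.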

In [6]:
vals=df['stalk_root'].value_counts()
for idx, each in zip(vals.index, vals):
    print(f'{idx}  : {each}  |     {round(each/sum(vals)*100,2)}%')


b  : 3776  |     46.49%
?  : 2480  |     30.53%
e  : 1119  |     13.78%
c  : 556  |     6.84%
r  : 192  |     2.36%


2480 values are missing in this column, which is about 30% of all rows. That is too much to impute safely: filling the gaps with the most common value ('b') would put about 77% of the rows into a single category and could distort the results. For that reason, I will drop the column and work with the remaining ones.
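
The effect of mode imputation can be worked out directly from the counts printed above. A short sketch of that arithmetic (the variable names are illustrative):

```python
# Category counts for stalk_root, copied from the output above.
counts = {'b': 3776, '?': 2480, 'e': 1119, 'c': 556, 'r': 192}
total = sum(counts.values())

# Share of the most common real category before imputation.
share_before = counts['b'] / total

# Share of that category if every '?' were replaced by it.
share_after = (counts['b'] + counts['?']) / total

print(f'before: {share_before:.2%}, after: {share_after:.2%}')
```

The jump from under half of the rows to more than three quarters is what makes mode imputation unattractive here.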

In [7]:
df= df.drop('stalk_root', axis=1)


### Distribution of classes

In [8]:
sns.countplot(x=df['classes'])
plt.title('poisonous (p) or edible (e)')
plt.show()

The classes are fairly balanced here, so accuracy is a reasonable metric for this problem.
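
The balance can also be checked numerically rather than read off a plot. A minimal sketch with made-up labels; `class_shares` is a hypothetical helper:

```python
import pandas as pd

def class_shares(labels):
    """Return each class's share of the rows, largest first."""
    return labels.value_counts(normalize=True)

# Hypothetical labels; with the real data this would be df['classes'].
toy_labels = pd.Series(['e', 'e', 'p', 'e', 'p'])
shares = class_shares(toy_labels)
print(shares)
```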

## Label-encode the values
Since scikit-learn's decision tree expects numerical input, I will convert the categories in each column from strings to integers.

In [None]:
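from sklearn.preprocessing import LabelEncoder

# Encode every column, the target included, as integer codes.
# LabelEncoder sorts the categories, so in 'classes' e -> 0 and p -> 1.
le= LabelEncoder()
for col in df.columns:
    df[col]= le.fit_transform(df[col])

df.head()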
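
The loop above reuses a single `LabelEncoder`, so once it finishes only the mapping for the last column is still available. If the integer codes need to be turned back into letters later (for example, when reading the tree), one option is to keep an encoder per column. A sketch under that assumption; `encode_columns` and the toy frame are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_columns(frame):
    """Label-encode every column, returning the encoded copy and the encoders."""
    encoded = frame.copy()
    encoders = {}
    for col in frame.columns:
        enc = LabelEncoder()
        encoded[col] = enc.fit_transform(frame[col])
        encoders[col] = enc  # kept so the integer codes can be decoded later
    return encoded, encoders

# Hypothetical toy frame standing in for the mushroom data.
toy = pd.DataFrame({'classes': ['p', 'e', 'e'], 'odor': ['n', 'a', 'n']})
encoded, encoders = encode_columns(toy)

# classes_ lists the original categories in the order of their codes.
print(list(encoders['classes'].classes_))
print(list(encoders['odor'].inverse_transform(encoded['odor'])))
```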

### Split the data
To build the model and evaluate it, I will split the data so that I can test the performance of the model on data it has not seen yet.

In [None]:
y= df['classes']
X=df.drop('classes', axis=1)

X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.3, random_state=42)
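
The split above is purely random. Where the class ratio should be preserved in both parts, `train_test_split` accepts a `stratify` argument. A minimal sketch on made-up data (the toy frame and variable names are assumptions, not part of the notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 6 of class 0 and 4 of class 1.
X_toy = pd.DataFrame({'feature': range(10)})
y_toy = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# stratify=y_toy keeps the 60/40 class ratio in both halves.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=42, stratify=y_toy
)
print(y_tr.value_counts().to_dict(), y_te.value_counts().to_dict())
```

With classes as balanced as they are here, stratifying is unlikely to change much, but it removes one source of variation between runs.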

## Train the model

In [None]:
from sklearn.tree import DecisionTreeClassifier

tree_clf= DecisionTreeClassifier(max_depth=6)
tree_clf.fit(X_train, y_train)

In [None]:
y_preds= tree_clf.predict(X_test)

## Features

In [None]:
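from sklearn.tree import export_graphviz

# Write the fitted tree in Graphviz .dot format.
# class_names must follow the ascending order of the encoded labels (0 = e, 1 = p).
export_graphviz(tree_clf, out_file='mushroom_tree.dot',
                feature_names=list(X.columns),
                class_names=['edible', 'poisonous']
               )
# Convert to an image with Graphviz: dot -Tpng mushroom_tree.dot -o mushroom_tree.png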

<img src='mushroom_tree.png'>
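
Besides the drawing, a fitted scikit-learn tree exposes `feature_importances_`, which gives a quick ranking of the features it relied on. A sketch on made-up encoded data; the toy frame and `toy_tree` are hypothetical:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical encoded toy data: 'odor' separates the classes perfectly,
# while 'veil_type' is constant and therefore carries no information.
X_toy = pd.DataFrame({'odor': [0, 0, 1, 1, 0, 1],
                      'veil_type': [0, 0, 0, 0, 0, 0]})
y_toy = pd.Series([0, 0, 1, 1, 0, 1])

toy_tree = DecisionTreeClassifier(max_depth=6, random_state=0)
toy_tree.fit(X_toy, y_toy)

# Importances sum to 1; a higher value means the feature accounts for
# a larger share of the total impurity reduction in the tree.
importances = pd.Series(toy_tree.feature_importances_, index=X_toy.columns)
print(importances.sort_values(ascending=False))
```

For the model in this notebook the equivalent would be `pd.Series(tree_clf.feature_importances_, index=X.columns)`.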

### Performance metrics
Now that the model has been trained, it's time to see how well it generalizes to data that it has never seen before.

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, f1_score

# scikit-learn metrics take the true labels first, then the predictions.
print('confusion matrix')
print(confusion_matrix(y_test, y_preds))

print(f'accuracy: {accuracy_score(y_test, y_preds)}')
print(f'recall: {recall_score(y_test, y_preds)}')
print(f'f1 score: {f1_score(y_test, y_preds)}')

This is really impressive. The model appears to have learned the features that distinguish a poisonous mushroom from an edible one. A model like this would be great to take with you whenever you go out into the wild to pick mushrooms.
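
For this problem the costliest error is a poisonous mushroom predicted as edible, which is a false negative when poisonous is class 1. A small sketch, on made-up labels, of how to read that number out of the confusion matrix:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 0 = edible, 1 = poisonous.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# fn counts poisonous mushrooms predicted edible, the dangerous mistake.
print(f'false negatives: {fn}')
print(f'recall (poisonous): {recall_score(y_true, y_pred):.2f}')
```

Recall for the poisonous class is `tp / (tp + fn)`, so a recall of 1.0 means no poisonous mushroom in the test set was labelled edible.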

## ROC curve

In [None]:
from sklearn.metrics import roc_auc_score

# predict_proba returns one column per class; column 1 is P(class 1), i.e. poisonous.
y_scores= tree_clf.predict_proba(X_test)
print(f'The ROC AUC score is: {roc_auc_score(y_test, y_scores[:,1])}')

In [None]:
from sklearn.metrics import roc_curve

fpr, tpr, thresholds= roc_curve(y_test, y_scores[:,1])
plt.title('ROC (Receiver Operating Characteristic)')
plt.plot(fpr, tpr)
plt.plot([0,1], [0,1], 'k--')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate (recall)')
plt.show()

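The ROC curve plotted above shows that our model is perfect on the test set. There are no false positives: the curve rises straight to a true positive rate of 1 at a false positive rate of 0 and then runs along the top, so together with the dotted diagonal it forms a triangle. A useless, completely random model would lie on the dotted line. The area under our ROC curve is 1.0 (100%).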
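
To make the link between ranking quality and AUC concrete, here is a sketch on made-up scores: one set ranks every positive above every negative, the other gets one pair wrong. All names and values are illustrative:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and scores.
y_true = [0, 0, 1, 1]
perfect_scores = [0.1, 0.2, 0.8, 0.9]     # every positive outranks every negative
imperfect_scores = [0.1, 0.85, 0.8, 0.9]  # one negative outranks one positive

# A perfect ranking passes through the top-left corner (fpr=0, tpr=1).
toy_fpr, toy_tpr, _ = roc_curve(y_true, perfect_scores)
print(list(zip(toy_fpr, toy_tpr)))

auc_perfect = roc_auc_score(y_true, perfect_scores)
auc_imperfect = roc_auc_score(y_true, imperfect_scores)
print(auc_perfect, auc_imperfect)
```

AUC equals the fraction of positive/negative pairs that are ranked correctly, so the second set, with three of four pairs in the right order, scores 0.75.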

## Conclusion
The decision tree proved to be a very powerful learning algorithm for fitting complex data. On the test set, it differentiated between edible and poisonous mushrooms every single time. Still, don't go out into the wild and eat every mushroom that passes this test. More data would be needed to tell how the model really performs, as there are roughly 140,000 species of mushrooms in the world and fewer than 10% of them are edible.

