# **Capstone 2: Supervised learning**
### **Mushroom classification model**
Dataset: Mushroom records drawn from The Audubon Society Field Guide to North American Mushrooms (1981).


### 1. Go out and find a dataset of interest. It could be from one of the recommended resources or some other aggregation. Or it could be something that you scraped yourself. Just make sure that it has lots of variables, including an outcome of interest to you.

URL: https://archive.ics.uci.edu/ml/datasets/Mushroom

Dataset information:

This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.

In [None]:
# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import io
from google.colab import files

In [None]:
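# Load the dataset
# Note: agaricus-lepiota.data has no header row, so read_csv uses the first
# record as column names (hence labels such as 'p', 'x', 's' and 8123 rows instead of 8124).
uploaded = files.upload()
mushroom_df = pd.read_csv(io.StringIO(uploaded['agaricus-lepiota.data'].decode('utf-8')))
mushroom_df.head()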

Saving agaricus-lepiota.data to agaricus-lepiota.data


Unnamed: 0,p,x,s,n,t,p.1,f,c,n.1,k,e,e.1,s.1,s.2,w,w.1,p.2,w.2,o,p.3,k.1,s.3,u
0,e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
1,e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
2,p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
3,e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
4,e,x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,n,g
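The column labels in the preview above are actually the first data record, because the file ships without a header row. A minimal sketch of one way to avoid losing that record is to pass `header=None` together with explicit names. The `COLUMN_NAMES` list and the `load_mushrooms` helper are illustrative names, not part of the original notebook, and the two sample records are copied from the preview above.

```python
import io
import pandas as pd

# Attribute names taken from the UCI attribute description; the first column is the class label.
COLUMN_NAMES = [
    'poisonous', 'cap_shape', 'cap_surface', 'cap_color', 'bruises', 'odor',
    'gill_attachment', 'gill_spacing', 'gill_size', 'gill_color',
    'stalk_shape', 'stalk_root', 'stalk_surface_above_ring',
    'stalk_surface_below_ring', 'stalk_color_above_ring',
    'stalk_color_below_ring', 'veil_type', 'veil_color', 'ring_number',
    'ring_type', 'spore_print_color', 'population', 'habitat',
]

def load_mushrooms(buffer):
    """Read the headerless CSV so that every record is kept as data (hypothetical helper)."""
    return pd.read_csv(buffer, header=None, names=COLUMN_NAMES)

# Two records in the same layout as agaricus-lepiota.data, for illustration only.
sample = (
    "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u\n"
    "e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g\n"
)
sample_df = load_mushrooms(io.StringIO(sample))
print(sample_df.shape)
```

With this approach the later renaming step would no longer be needed, since the columns are named at load time.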


### 2. Explore the data. Get to know the data. Spend a lot of time going over its quirks. You should understand how it was gathered, what's in it, and what the variables look like.

In [None]:
# Data inspection & cleaning
mushroom_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8123 entries, 0 to 8122
Data columns (total 23 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   p       8123 non-null   object
 1   x       8123 non-null   object
 2   s       8123 non-null   object
 3   n       8123 non-null   object
 4   t       8123 non-null   object
 5   p.1     8123 non-null   object
 6   f       8123 non-null   object
 7   c       8123 non-null   object
 8   n.1     8123 non-null   object
 9   k       8123 non-null   object
 10  e       8123 non-null   object
 11  e.1     8123 non-null   object
 12  s.1     8123 non-null   object
 13  s.2     8123 non-null   object
 14  w       8123 non-null   object
 15  w.1     8123 non-null   object
 16  p.2     8123 non-null   object
 17  w.2     8123 non-null   object
 18  o       8123 non-null   object
 19  p.3     8123 non-null   object
 20  k.1     8123 non-null   object
 21  s.3     8123 non-null   object
 22  u       8123 non-null   object

In [None]:
# Print unique values for each column and map to attribute info: https://archive.ics.uci.edu/ml/datasets/Mushroom
for column in mushroom_df.columns:
  print(f'{column}:', mushroom_df[column].unique(), '\n')

p: ['e' 'p'] 

x: ['x' 'b' 's' 'f' 'k' 'c'] 

s: ['s' 'y' 'f' 'g'] 

n: ['y' 'w' 'g' 'n' 'e' 'p' 'b' 'u' 'c' 'r'] 

t: ['t' 'f'] 

p.1: ['a' 'l' 'p' 'n' 'f' 'c' 'y' 's' 'm'] 

f: ['f' 'a'] 

c: ['c' 'w'] 

n.1: ['b' 'n'] 

k: ['k' 'n' 'g' 'p' 'w' 'h' 'u' 'e' 'b' 'r' 'y' 'o'] 

e: ['e' 't'] 

e.1: ['c' 'e' 'b' 'r' '?'] 

s.1: ['s' 'f' 'k' 'y'] 

s.2: ['s' 'f' 'y' 'k'] 

w: ['w' 'g' 'p' 'n' 'b' 'e' 'o' 'c' 'y'] 

w.1: ['w' 'p' 'g' 'b' 'n' 'e' 'y' 'o' 'c'] 

p.2: ['p'] 

w.2: ['w' 'n' 'o' 'y'] 

o: ['o' 't' 'n'] 

p.3: ['p' 'e' 'l' 'f' 'n'] 

k.1: ['n' 'k' 'u' 'h' 'w' 'r' 'o' 'y' 'b'] 

s.3: ['n' 's' 'a' 'v' 'y' 'c'] 

u: ['g' 'm' 'u' 'd' 'p' 'w' 'l'] 
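Two quirks stand out in the listing above: `p.2` holds a single value, so it carries no information, and `e.1` contains a `'?'` marker that appears to stand for a missing value. The sketch below shows one way to surface both automatically. The helper `summarize_quirks` and the toy frame are hypothetical and only mimic the shape of the real data.

```python
import pandas as pd

def summarize_quirks(df, missing_marker='?'):
    """Return constant columns and per-column counts of the missing marker (illustrative helper)."""
    constant = [col for col in df.columns if df[col].nunique() == 1]
    missing = {
        col: int((df[col] == missing_marker).sum())
        for col in df.columns
        if (df[col] == missing_marker).any()
    }
    return constant, missing

# Toy frame using the raw column labels seen at this point in the notebook.
toy = pd.DataFrame({
    'p.2': ['p', 'p', 'p'],   # single value everywhere
    'e.1': ['c', '?', 'e'],   # contains the '?' marker
    'p.1': ['a', 'n', 'f'],
})
constant_cols, missing_counts = summarize_quirks(toy)
print(constant_cols, missing_counts)
```

Whether to drop the constant column or to treat `'?'` as its own category is a modelling choice; the notebook keeps both as they are.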



In [None]:
# Rename columns based on attribute info mapping
renamed_columns = {'p': 'poisonous',
                   'x': 'cap_shape',
                   's': 'cap_surface',
                   'n': 'cap_color',
                   't': 'bruises',
                   'p.1': 'odor',
                   'f': 'gill_attachment',
                   'c': 'gill_spacing',
                   'n.1': 'gill_size',
                   'k': 'gill_color',
                   'e': 'stalk_shape',
                   'e.1': 'stalk_root',
                   's.1': 'stalk_surface_above_ring',
                   's.2': 'stalk_surface_below_ring',
                   'w': 'stalk_color_above_ring',
                   'w.1': 'stalk_color_below_ring',
                   'p.2': 'veil_type',
                   'w.2': 'veil_color',
                   'o': 'ring_number',
                   'p.3': 'ring_type',
                   'k.1': 'spore_print_color',
                   's.3': 'population',
                   'u': 'habitat'}

mushroom_df = mushroom_df.rename(columns=renamed_columns)
mushroom_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8123 entries, 0 to 8122
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   poisonous                 8123 non-null   object
 1   cap_shape                 8123 non-null   object
 2   cap_surface               8123 non-null   object
 3   cap_color                 8123 non-null   object
 4   bruises                   8123 non-null   object
 5   odor                      8123 non-null   object
 6   gill_attachment           8123 non-null   object
 7   gill_spacing              8123 non-null   object
 8   gill_size                 8123 non-null   object
 9   gill_color                8123 non-null   object
 10  stalk_shape               8123 non-null   object
 11  stalk_root                8123 non-null   object
 12  stalk_surface_above_ring  8123 non-null   object
 13  stalk_surface_below_ring  8123 non-null   object
 14  stalk_color_above_ring    8123 non-null   object

In [None]:
# Convert all columns to dummies (dtype=int keeps the 0/1 output on newer pandas, which otherwise returns booleans)
dummies = pd.get_dummies(mushroom_df, dtype=int)
dummies.head()

Unnamed: 0,poisonous_e,poisonous_p,cap_shape_b,cap_shape_c,cap_shape_f,cap_shape_k,cap_shape_s,cap_shape_x,cap_surface_f,cap_surface_g,cap_surface_s,cap_surface_y,cap_color_b,cap_color_c,cap_color_e,cap_color_g,cap_color_n,cap_color_p,cap_color_r,cap_color_u,cap_color_w,cap_color_y,bruises_f,bruises_t,odor_a,odor_c,odor_f,odor_l,odor_m,odor_n,odor_p,odor_s,odor_y,gill_attachment_a,gill_attachment_f,gill_spacing_c,gill_spacing_w,gill_size_b,gill_size_n,gill_color_b,...,stalk_color_below_ring_n,stalk_color_below_ring_o,stalk_color_below_ring_p,stalk_color_below_ring_w,stalk_color_below_ring_y,veil_type_p,veil_color_n,veil_color_o,veil_color_w,veil_color_y,ring_number_n,ring_number_o,ring_number_t,ring_type_e,ring_type_f,ring_type_l,ring_type_n,ring_type_p,spore_print_color_b,spore_print_color_h,spore_print_color_k,spore_print_color_n,spore_print_color_o,spore_print_color_r,spore_print_color_u,spore_print_color_w,spore_print_color_y,population_a,population_c,population_n,population_s,population_v,population_y,habitat_d,habitat_g,habitat_l,habitat_m,habitat_p,habitat_u,habitat_w
0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,...,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0
1,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,1,0,0,...,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0
2,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,...,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0
3,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,1,1,0,0,...,0,0,0,1,0,1,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0
4,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,...,0,0,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0


In [None]:
# Absolute correlation between each dummy feature and the target (poisonous_p).
# iloc[:, 1:] drops the first column, poisonous_e, which is the exact complement of the target.
columns = dummies.columns
correlations = np.abs(dummies[columns].iloc[:,1:].corr().loc[:,'poisonous_p']).sort_values(ascending=False)
correlations.head(21)

poisonous_p                   1.000000
odor_n                        0.785534
odor_f                        0.623974
stalk_surface_above_ring_k    0.587794
stalk_surface_below_ring_k    0.573656
ring_type_p                   0.540670
gill_size_n                   0.539944
gill_size_b                   0.539944
gill_color_b                  0.538919
bruises_f                     0.501758
bruises_t                     0.501758
stalk_surface_above_ring_s    0.491460
spore_print_color_h           0.490333
ring_type_l                   0.451710
population_v                  0.443906
stalk_surface_below_ring_s    0.425592
spore_print_color_n           0.416609
spore_print_color_k           0.397173
spore_print_color_w           0.357499
gill_spacing_w                0.348358
gill_spacing_c                0.348358
Name: poisonous_p, dtype: float64
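Pairs such as `gill_size_n` / `gill_size_b` and `bruises_f` / `bruises_t` share the same absolute correlation because a two-valued attribute produces two dummy columns that are exact complements. As a hedged sketch, `drop_first=True` in `pd.get_dummies` keeps one fewer column per attribute and removes that redundancy. The toy frame below is made up for illustration.

```python
import pandas as pd

# Small made-up frame with the same kind of single-letter codes as the mushroom data.
toy = pd.DataFrame({
    'poisonous': ['p', 'e', 'e', 'p'],
    'bruises': ['t', 'f', 't', 'f'],
    'odor': ['p', 'a', 'n', 'f'],
})

full = pd.get_dummies(toy, dtype=int)
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

# The two bruises dummies always sum to 1, so one of them is redundant.
is_complement = bool((full['bruises_f'] + full['bruises_t'] == 1).all())
print(is_complement)
print(list(reduced.columns))
```

Dropping one level per attribute matters most for the linear models, where perfectly collinear columns make individual coefficients hard to interpret.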

### 3. Model your outcome of interest. You should try several different approaches and really work to tune a variety of models before using the model evaluation techniques to choose what you consider to be the best performer. Make sure to think about explanatory versus predictive power, and experiment with both.

In [None]:
Y = dummies.poisonous_p
X = dummies.drop(columns=['poisonous_e', 'poisonous_p'])
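`train_test_split` is imported at the top of the notebook but not used; the models below are compared with cross-validation only. If a final untouched test set is wanted, a stratified hold-out could look like the sketch below. The synthetic `X_toy` and `Y_toy` stand in for `X` and `Y`, and the 80/20 split is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the dummy-encoded features and the binary target.
rng = np.random.default_rng(45)
X_toy = pd.DataFrame(rng.integers(0, 2, size=(200, 5)),
                     columns=[f'feature_{i}' for i in range(5)])
Y_toy = pd.Series(rng.integers(0, 2, size=200), name='poisonous_p')

# stratify keeps the edible/poisonous ratio roughly equal in both parts.
X_train, X_test, Y_train, Y_test = train_test_split(
    X_toy, Y_toy, test_size=0.2, stratify=Y_toy, random_state=45
)
print(X_train.shape, X_test.shape)
```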

In [None]:
# Potential classification models: 

# 1. Logistic regression
# 2. KNN classifier
# 3. Random forest model
# 4. Support vector classifier
# 5. Gradient boosting classifier

In [None]:
# 1. Logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

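# multi_class is left at its default: the target is binary, and the argument is deprecated in recent scikit-learn releases
lr_clf = LogisticRegression(solver = 'lbfgs', random_state = 45, max_iter = 1000)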
lr_cv_scores = cross_val_score(lr_clf, X, Y, cv = 5)

print('Accuracy scores for the 5 folds: ', lr_cv_scores)
print('Mean cross validation score: {:.3f}\n'.format(np.mean(lr_cv_scores)))

Accuracy scores for the 5 folds:  [0.94892308 1.         0.98830769 1.         0.65948276]
Mean cross validation score: 0.919
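The fold scores above range from 0.659 to 1.000. One possible explanation, stated here as an assumption rather than a finding, is that `cv = 5` splits a classifier's data into stratified folds without shuffling, so if the rows of the file are grouped by species, each fold can contain mushrooms unlike those in the training folds. A minimal sketch of shuffled stratified folds follows; the synthetic data from `make_classification` only stands in for `X` and `Y`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data standing in for the mushroom features.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=45)

clf = LogisticRegression(max_iter=1000)

# shuffle=True randomizes row order before splitting; random_state makes it repeatable.
shuffled_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=45)
scores = cross_val_score(clf, X_demo, y_demo, cv=shuffled_cv)
print(scores, scores.mean())
```

Passing the same `shuffled_cv` object to every model would also make the comparison between models use identical folds.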



In [None]:
# 2. KNN classifier
from sklearn.neighbors import KNeighborsClassifier

for n in np.arange(1,6):
  knn = KNeighborsClassifier(n_neighbors = n)
  # cross_val_score clones and fits the estimator on each fold, so no separate fit is needed
  knn_cv_scores = cross_val_score(knn, X, Y, cv = 5)

  print(f'{n} neighbors:')
  print('Accuracy scores for the 5 folds: ', knn_cv_scores)
  print('Mean cross validation score: {:.3f}\n'.format(np.mean(knn_cv_scores)))

1 neighbors:
Accuracy scores for the 5 folds:  [0.84923077 1.         1.         1.         0.79371921]
Mean cross validation score: 0.929

2 neighbors:
Accuracy scores for the 5 folds:  [0.84492308 1.         0.99569231 1.         0.81280788]
Mean cross validation score: 0.931

3 neighbors:
Accuracy scores for the 5 folds:  [0.84676923 1.         0.99630769 1.         0.64716749]
Mean cross validation score: 0.898

4 neighbors:
Accuracy scores for the 5 folds:  [0.84369231 1.         0.99507692 1.         0.65517241]
Mean cross validation score: 0.899

5 neighbors:
Accuracy scores for the 5 folds:  [0.84676923 1.         0.99507692 1.         0.64470443]
Mean cross validation score: 0.897



In [None]:
# The highest accuracy occurs at n = 2 neighbors (0.931). This is better than our 1st model (0.919).
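The manual loop over `n_neighbors` can also be expressed with `GridSearchCV`, which runs the same cross-validation for each candidate and records the best one. This is a sketch on synthetic data, not a rerun of the results above; the parameter range 1 to 5 mirrors the loop.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the mushroom features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=45)

# Try n_neighbors = 1..5 with 5-fold cross-validation, scoring by accuracy.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': list(range(1, 6))},
    cv=5,
    scoring='accuracy',
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

The full per-candidate scores are available in `search.cv_results_` if the whole table is of interest.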

In [None]:
# 3. Random forest classifier
from sklearn.ensemble import RandomForestClassifier

for n in np.arange(1,11):
  rf_clf = RandomForestClassifier(n_estimators = n, random_state = 45)
  rf_cv_scores = cross_val_score(rf_clf, X, Y, cv = 5)

  print(f'{n} estimators:')
  print('Accuracy scores for the 5 folds: ', rf_cv_scores)
  print('Mean cross validation score: {:.3f}\n'.format(np.mean(rf_cv_scores)))

1 estimators:
Accuracy scores for the 5 folds:  [0.84246154 0.98646154 0.95876923 1.         0.90640394]
Mean cross validation score: 0.939

2 estimators:
Accuracy scores for the 5 folds:  [0.84246154 0.98646154 0.95446154 1.         0.90640394]
Mean cross validation score: 0.938

3 estimators:
Accuracy scores for the 5 folds:  [0.84246154 0.98646154 0.95446154 1.         0.90640394]
Mean cross validation score: 0.938

4 estimators:
Accuracy scores for the 5 folds:  [0.84246154 0.98646154 0.95446154 1.         0.90640394]
Mean cross validation score: 0.938

5 estimators:
Accuracy scores for the 5 folds:  [0.84246154 0.98646154 0.95446154 1.         0.90640394]
Mean cross validation score: 0.938

6 estimators:
Accuracy scores for the 5 folds:  [0.84246154 0.98646154 0.95446154 1.         0.90640394]
Mean cross validation score: 0.938

7 estimators:
Accuracy scores for the 5 folds:  [0.84246154 0.98646154 0.95446154 1.         0.90640394]
Mean cross validation score: 0.938

8 estimators:

In [None]:
# The highest accuracy occurs at n = 6 estimators (0.958). Processing time for this model is also much faster than KNN.
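Section 3 asks about explanatory as well as predictive power. For a random forest, one common starting point is the fitted model's `feature_importances_` attribute. The sketch below uses a made-up frame in which the target depends only on a column named `odor_n`; the column names are borrowed from the notebook, but the data are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic 0/1 features; only the column names echo the notebook.
rng = np.random.default_rng(45)
X_toy = pd.DataFrame(rng.integers(0, 2, size=(300, 4)),
                     columns=['odor_n', 'odor_f', 'bruises_t', 'habitat_d'])
# Hypothetical target: "poisonous" exactly when odor_n is 0.
y_toy = (X_toy['odor_n'] == 0).astype(int)

rf = RandomForestClassifier(n_estimators=6, random_state=45).fit(X_toy, y_toy)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.Series(rf.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be spread across correlated dummy columns, so they are best read as a rough ranking rather than an exact attribution.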

In [None]:
# 4. Support vector classifier
from sklearn.svm import SVC

svm = SVC(kernel = 'linear')
# cross_val_score fits a fresh clone on each fold, so no separate fit is needed
svm_cv_scores = cross_val_score(svm, X, Y, cv = 5)

print('Accuracy scores for the 5 folds: ', svm_cv_scores)
print('Mean cross validation score: {:.3f}\n'.format(np.mean(svm_cv_scores)))

Accuracy scores for the 5 folds:  [0.84246154 0.98584615 0.95446154 1.         0.99507389]
Mean cross validation score: 0.956



In [None]:
# The mean accuracy is 0.956, which is slightly lower than our random forest model (0.958).

In [None]:
# 5. Gradient boosting classifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn import ensemble

# The loss is left at its default (log loss); the older 'deviance' alias was removed in recent scikit-learn releases
params = {'n_estimators': 500, 'max_depth': 2}

# Initialize the model; cross_val_score fits a fresh clone on each fold.
gbc = ensemble.GradientBoostingClassifier(**params)
gbc_cv_scores = cross_val_score(gbc, X, Y, cv = 5)

print('Accuracy scores for the 5 folds: ', gbc_cv_scores)
print('Mean cross validation score: {:.3f}\n'.format(np.mean(gbc_cv_scores)))

Accuracy scores for the 5 folds:  [0.84246154 0.98646154 0.95446154 1.         0.99507389]
Mean cross validation score: 0.956



In [None]:
# The mean accuracy is 0.956, which is slightly lower than our random forest model (0.958).
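The gradient boosting cell imports `confusion_matrix`, `precision_score`, and `recall_score` without using them. Accuracy alone hides which kind of mistake a model makes, and for mushrooms a poisonous sample predicted as edible is the costly one. The sketch below shows how those metrics could be computed from out-of-fold predictions with `cross_val_predict`; the data are synthetic and the smaller `n_estimators` is chosen only to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data; class 1 plays the role of "poisonous".
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=45)

gbc_demo = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=45)

# Each sample is predicted by a model that did not see it during training.
y_pred = cross_val_predict(gbc_demo, X_demo, y_demo, cv=5)

cm = confusion_matrix(y_demo, y_pred)       # rows: true class, columns: predicted class
precision = precision_score(y_demo, y_pred)
recall = recall_score(y_demo, y_pred)       # share of class-1 samples that were caught
print(cm)
print(f'precision={precision:.3f}, recall={recall:.3f}')
```

In this layout `cm[1, 0]` counts class-1 samples predicted as class 0, which corresponds to poisonous mushrooms labelled edible.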

In [None]:
# Based on these results, the highest performing model is the random forest with an accuracy of 0.958.

# The models are ranked below by accuracy:

# 1. [0.958] Random forest classifier, n_estimators = 6
# 2. [0.956] Gradient boosting classifier
# 3. [0.956] Support vector classifier
# 4. [0.931] KNN classifier, n_neighbors = 2
# 5. [0.919] Logistic regression classifier
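The ranking above was assembled by hand from separate cells. A compact alternative, sketched here on synthetic data, is to keep the candidate models in a dictionary and score them all on the same folds. The hyperparameters echo the notebook where possible, but the smaller `n_estimators` for gradient boosting and the shuffled folds are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the mushroom features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=45)

models = {
    'Logistic regression': LogisticRegression(max_iter=1000, random_state=45),
    'KNN (n_neighbors=2)': KNeighborsClassifier(n_neighbors=2),
    'Random forest (n_estimators=6)': RandomForestClassifier(n_estimators=6, random_state=45),
    'Support vector (linear)': SVC(kernel='linear'),
    'Gradient boosting': GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=45),
}

# One shared splitter so every model is evaluated on identical folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=45)
results = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv).mean()
    for name, model in models.items()
}

# Print the models from highest to lowest mean accuracy.
for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f'[{score:.3f}] {name}')
```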