# Application of classification to determine edibility of a mushroom

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This last class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for poison oak and ivy.

The dataset was downloaded from Kaggle (https://www.kaggle.com/uciml/mushroom-classification).

In [1]:
# import necessary modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier 
from sklearn.model_selection import GridSearchCV



In [2]:
df = pd.read_csv("mushrooms.csv")

### Explore the data set

In [3]:
df.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [4]:
df.describe()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
count,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124,...,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124
unique,2,6,4,10,2,9,2,2,2,12,...,4,9,9,1,4,3,5,9,6,7
top,e,x,y,n,f,n,f,c,b,b,...,s,w,w,p,w,o,p,w,v,d
freq,4208,3656,3244,2284,4748,3528,7914,6812,5612,1728,...,4936,4464,4384,8124,7924,7488,3968,2388,4040,3148


### Get distribution of poisonous and edible mushrooms in the data set

In [5]:
print(df.groupby('class').size())
# e - edible
# p - poisonous

class
e    4208
p    3916
dtype: int64
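
To make the class balance explicit, the counts can be turned into proportions. This is a minimal sketch that uses the counts printed above; the names `counts` and `proportions` are illustrative and not part of the notebook. On the full DataFrame, `df['class'].value_counts(normalize=True)` should give the same ratios.

```python
import pandas as pd

# Class counts copied from the groupby output above.
counts = pd.Series({"e": 4208, "p": 3916}, name="class")

# Share of each class in the data set.
proportions = counts / counts.sum()
print(proportions.round(3))
```

The two classes are close to balanced, with slightly more edible than poisonous samples.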


### Check if there are null values

In [6]:
df.isnull().sum()

class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64

The data set does not contain any null values.

#### Check how many unique values each column contains

In [7]:
df.nunique().sort_values()

veil-type                    1
class                        2
bruises                      2
gill-attachment              2
gill-spacing                 2
gill-size                    2
stalk-shape                  2
ring-number                  3
cap-surface                  4
veil-color                   4
stalk-surface-below-ring     4
stalk-surface-above-ring     4
ring-type                    5
stalk-root                   5
cap-shape                    6
population                   6
habitat                      7
stalk-color-above-ring       9
stalk-color-below-ring       9
odor                         9
spore-print-color            9
cap-color                   10
gill-color                  12
dtype: int64

As we can see, the "veil-type" column contains the same value across the whole data set, so it carries no information and we can remove it.

In [8]:
df.drop("veil-type", axis=1, inplace=True)
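
The same idea can be applied without naming the column by hand. The sketch below uses a hypothetical toy frame (`toy`, `constant_cols` and `toy_reduced` are illustrative names) and drops every column that has a single distinct value.

```python
import pandas as pd

# Hypothetical toy frame; "veil-type" is constant, as in the mushroom data.
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "veil-type": ["p", "p", "p", "p"],
    "odor": ["p", "a", "l", "p"],
})

# Columns with a single distinct value cannot help a classifier.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()

toy_reduced = toy.drop(columns=constant_cols)
print(constant_cols)
print(toy_reduced.columns.tolist())
```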

In [9]:
# there are 8124 observations in the data set, each with 21 features and one label
df.shape

(8124, 22)

#### Explore which values the individual features take

In [10]:
for attr in df.columns:
    print('\n*', attr, '*')
    print(df[attr].value_counts())


* class *
e    4208
p    3916
Name: class, dtype: int64

* cap-shape *
x    3656
f    3152
k     828
b     452
s      32
c       4
Name: cap-shape, dtype: int64

* cap-surface *
y    3244
s    2556
f    2320
g       4
Name: cap-surface, dtype: int64

* cap-color *
n    2284
g    1840
e    1500
y    1072
w    1040
b     168
p     144
c      44
u      16
r      16
Name: cap-color, dtype: int64

* bruises *
f    4748
t    3376
Name: bruises, dtype: int64

* odor *
n    3528
f    2160
y     576
s     576
l     400
a     400
p     256
c     192
m      36
Name: odor, dtype: int64

* gill-attachment *
f    7914
a     210
Name: gill-attachment, dtype: int64

* gill-spacing *
c    6812
w    1312
Name: gill-spacing, dtype: int64

* gill-size *
b    5612
n    2512
Name: gill-size, dtype: int64

* gill-color *
b    1728
p    1492
w    1202
n    1048
g     752
h     732
u     492
k     408
e      96
y      86
o      64
r      24
Name: gill-color, dtype: int64

* stalk-shape *
t    4608
e    3516
Name: stalk-shape, dtype: int64

...

### Transform feature values from strings to integers with one-hot encoding, so that sklearn models can be used

In [11]:
# Create dummy variables with drop_first=True: df_encoded
df_encoded = pd.get_dummies(df, drop_first=True)

In [12]:
df_encoded.head()

Unnamed: 0,class_p,cap-shape_c,cap-shape_f,cap-shape_k,cap-shape_s,cap-shape_x,cap-surface_g,cap-surface_s,cap-surface_y,cap-color_c,...,population_n,population_s,population_v,population_y,habitat_g,habitat_l,habitat_m,habitat_p,habitat_u,habitat_w
0,1,0,0,0,0,1,0,1,0,0,...,0,1,0,0,0,0,0,0,1,0
1,0,0,0,0,0,1,0,1,0,0,...,1,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,1,0,0,0
3,1,0,0,0,0,1,0,0,1,0,...,0,1,0,0,0,0,0,0,1,0
4,0,0,0,0,0,1,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
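
To see what `drop_first=True` does, here is a small sketch on a hypothetical three-row frame (`toy` and `encoded` are illustrative names). For each column, the first category in sorted order is dropped and becomes the implicit baseline, which is why the target appears as a single `class_p` column. The `dtype=int` argument is an addition here: recent pandas versions return boolean dummies by default, while the output above shows 0/1 integers.

```python
import pandas as pd

# Hypothetical toy frame with three categorical columns.
toy = pd.DataFrame({
    "class": ["p", "e", "e"],
    "bruises": ["t", "t", "f"],
    "odor": ["p", "a", "l"],
})

# drop_first=True removes the first (sorted) category of every column;
# dtype=int forces 0/1 output instead of booleans.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

A column with k categories therefore yields k - 1 dummy columns, and a row of all zeros for that group means "baseline category".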


#### Separate features and target variable

In [13]:
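# class_p is the target (column 0). Note that the slice 1:-1 also leaves out
# the last dummy column (habitat_w); use iloc[:, 1:] to keep every feature.
X = df_encoded.iloc[:, 1:-1]
y = df_encoded.iloc[:, 0]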

#### Splitting data into training and test set

The classes are not perfectly balanced (4208 edible vs. 3916 poisonous), so we stratify the `train_test_split` to keep the same class ratio in the training and test sets.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=3)
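
The effect of `stratify` can be checked directly by comparing the class share in both parts of the split. The sketch below uses hypothetical placeholder data (`X_toy`, `y_toy`) with the same class counts as the mushroom data, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with the same class counts as the mushroom data.
y_toy = np.array([0] * 4208 + [1] * 3916)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=3
)

# With stratify=y, the share of class 1 is nearly identical in both parts.
print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```

With the default `test_size` of 0.25, the test set holds a quarter of the rows.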

#### Apply logistic regression

In [15]:
lr = LogisticRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_test)

#### Get model scores

In [16]:
print("train score:", lr.score(X_train, y_train))
print("test score:", lr.score(X_test, y_test))

train score: 0.9996717544723454
test score: 0.9980305268340719


In [17]:
accuracy_score(y_test, pred)

0.9980305268340719

__Accuracy score__ is the ratio of correctly predicted observations to the total number of observations. However, accuracy alone can be misleading when the classes are not equally distributed, so it is worth checking a second metric.

Let's check the __f1_score__ of the model, which is the harmonic mean of precision and recall.

In [18]:
f1_score(y_test, pred)

0.9979529170931423

Confusion matrix:

In [19]:
cm = confusion_matrix(y_test, pred)
cm

array([[1052,    0],
       [   4,  975]])
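
The accuracy and F1 values reported earlier can be recomputed from this matrix by hand. The sketch below copies the matrix from the output above and relies on scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class; 0 = edible, 1 = poisonous).
cm = np.array([[1052, 0],
               [4, 975]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

All four errors are false negatives, i.e. poisonous mushrooms predicted as edible, which is the more costly kind of mistake for this task.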

#### The test score and f1_score are very high, so the model may be overfitting

Let's check this with cross-validation and see which value of the regularisation parameter `C` gives the best results.

In [20]:
param_grid = [{'C': np.arange(0.1, 50, 5)}]

In [21]:
lr_cv = GridSearchCV(estimator=LogisticRegression(), param_grid=param_grid, cv=10)

In [22]:
lr_cv.fit(X, y)

GridSearchCV(cv=10, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'C': array([ 0.1,  5.1, 10.1, 15.1, 20.1, 25.1, 30.1, 35.1, 40.1, 45.1])}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [23]:
lr_cv.best_params_

{'C': 25.1}

In [24]:
lr_cv.best_score_

0.9623338257016248
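
In scikit-learn's `LogisticRegression`, `C` is the inverse of the regularisation strength, so a smaller `C` penalises large coefficients more heavily. The sketch below illustrates this on synthetic stand-in data; `X_demo`, `y_demo` and `norms` are assumptions for the example and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the mushroom data set).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=5000).fit(X_demo, y_demo)
    # Smaller C means a stronger L2 penalty, which shrinks the coefficients.
    norms[C] = float(np.linalg.norm(model.coef_))

print(norms)
```

A best value of `C` well above the default of 1.0, as found here, means the grid search preferred a comparatively weak penalty.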

In [25]:
from sklearn.model_selection import cross_val_score

In [26]:
scores = cross_val_score(estimator=lr_cv, X=X, y=y, cv=5, scoring='f1', n_jobs=-1)

In [27]:
scores

array([0.84621045, 1.        , 0.99872123, 1.        , 0.7398568 ])

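The scores vary considerably between folds: three are at or near 1.0, but the first and last drop to about 0.85 and 0.74. The model therefore does not perform equally well on every split. Because `cross_val_score` with an integer `cv` does not shuffle the rows, this may reflect an ordering in the data file rather than a weakness of the model itself.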
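
One way to test whether row order drives the spread between folds is to pass a shuffled splitter instead of an integer `cv`. This is a minimal sketch on synthetic stand-in data; `X_demo`, `y_demo`, `cv` and `scores_demo` are assumptions for the example, and the results on the mushroom data are not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption; not the mushroom data set).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# shuffle=True randomises the row order before the folds are formed.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores_demo = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring="f1"
)
print(scores_demo.mean(), scores_demo.std())
```

In the notebook, the same `cv` object could be passed to `cross_val_score` in place of `cv=5`.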

#### See which features are the most influential

In [28]:
final_lr = LogisticRegression(C=25.1)
final_lr.fit(X, y);

In [29]:
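# Collect the fitted coefficients in a DataFrame indexed by feature name.
# Note: this reuses (and overwrites) the name `df`.
df = pd.DataFrame(final_lr.coef_).T
df.index = X.columns
df = df.rename(columns={0: "param"})
# Sort ascending: most negative coefficients first, most positive last.
df = df.sort_values(by="param", ascending=True)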

These are the features with the most negative coefficients. Since the target `class_p` is 1 for poisonous mushrooms, they push the prediction towards edible:

In [30]:
df.head()

Unnamed: 0,param
odor_n,-6.135612
gill-spacing_w,-5.734165
spore-print-color_u,-4.433037
spore-print-color_n,-4.011501
ring-type_f,-3.573782


These are the features with the largest positive coefficients; mushrooms that have them tend to be predicted as poisonous:

In [31]:
df.tail()

Unnamed: 0,param
odor_f,4.951795
gill-size_n,5.942417
odor_p,5.992971
odor_c,7.22957
spore-print-color_r,9.230315
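
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below copies a few coefficients from the tables above; `coefs` and `odds_ratio` are illustrative names. Because the features were encoded with `drop_first=True`, each ratio is relative to the dropped baseline category of that feature.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the tables above.
coefs = pd.Series({
    "odor_n": -6.135612,
    "gill-spacing_w": -5.734165,
    "odor_c": 7.22957,
    "spore-print-color_r": 9.230315,
}, name="param")

# exp(coefficient) is the multiplicative change in the odds of class_p = 1
# when the dummy feature switches from 0 to 1, all else held fixed.
odds_ratio = np.exp(coefs).rename("odds_ratio")
print(odds_ratio)
```

Values below 1 lower the odds of the poisonous class, and values above 1 raise them.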
