# **Mushroom Edibility Study**

## Objectives

* Answer business requirement 1:
    * The client would like to better understand the patterns in the mushroom database in order to learn which variables most strongly indicate that a mushroom is edible.

## Inputs

* outputs/datasets/collection/mushrooms.csv

## Outputs

* Generate code and seaborn plots that answer business requirement 1 and can be used for the Streamlit App


---

# Change working directory

* We need to change the working directory from the current jupyter_notebooks folder to its parent folder in order to access the whole project
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/mushroom-safety/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/mushroom-safety'
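
As a small illustration of the step above, the sketch below shows what `os.path.dirname()` returns for a path shaped like the one printed earlier. The variable names are illustrative only, and the working directory is not changed.

```python
import os

# Path string mirroring the notebook folder printed above (used here only as an example)
notebook_dir = "/workspace/mushroom-safety/jupyter_notebooks"

# os.path.dirname() strips the last path component, giving the parent folder
project_root = os.path.dirname(notebook_dir)
print(project_root)
```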

# Load Data

Load in the dataset to a dataframe.

In [4]:
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/mushrooms.csv")
df.head()

Unnamed: 0,edible,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,0,convex,smooth,brown,bruises,pungent,free,close,narrow,black,...,smooth,white,white,partial,white,one,pendant,black,scattered,urban
1,1,convex,smooth,yellow,bruises,almond,free,close,broad,black,...,smooth,white,white,partial,white,one,pendant,brown,numerous,grasses
2,1,bell,smooth,white,bruises,anise,free,close,broad,brown,...,smooth,white,white,partial,white,one,pendant,brown,numerous,meadows
3,0,convex,scaly,white,bruises,pungent,free,close,narrow,brown,...,smooth,white,white,partial,white,one,pendant,black,scattered,urban
4,1,convex,smooth,gray,no,none,free,crowded,broad,black,...,smooth,white,white,partial,white,one,evanescent,brown,abundant,grasses


---

# Data Exploration #

We wish to become familiar with the dataset: check variable types and their distributions, check for any missing data, and understand what these variables mean in the business context.

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

There are no missing or incorrectly formatted values, so we are able to proceed.
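
A quick cross-check that does not rely on the profiling report is to count missing values per column directly in pandas. The sketch below uses a small made-up frame (the values are illustrative, not taken from the mushroom data); the same two calls could be applied to `df`.

```python
import pandas as pd

# Toy frame standing in for the mushroom data (illustrative values only)
toy = pd.DataFrame({
    "edible": [1, 0, 1, 0],
    "odor": ["none", "foul", None, "foul"],
    "gill-size": ["broad", "narrow", "broad", "narrow"],
})

# Count missing entries per column; all zeros would mean nothing is missing
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Column dtypes show whether each variable was read as text or as a number
print(toy.dtypes)
```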

---

## Correlation study - Which Mushroom Variable Categories Correlate Most to Edibility

We will use `OneHotEncoder` to find which variable categories (e.g. mushrooms with `odor=none`, mushrooms with `stalk-surface-above-ring=silky`) correlate most strongly with edibility.

In [5]:
from feature_engine.encoding import OneHotEncoder

cols = df.columns[df.dtypes=='object'].to_list()
df_ohe = df.filter(['edible'])
for col in cols:
    encoder = OneHotEncoder(variables=[col])
    df_ohe = pd.concat([df_ohe, encoder.fit_transform(df[col].to_frame())], axis=1)
df_ohe.head()

Unnamed: 0,edible,cap-shape_convex,cap-shape_bell,cap-shape_sunken,cap-shape_flat,cap-shape_knobbed,cap-shape_conical,cap-surface_smooth,cap-surface_scaly,cap-surface_fibrous,...,population_several,population_solitary,population_clustered,habitat_urban,habitat_grasses,habitat_meadows,habitat_woods,habitat_paths,habitat_waste,habitat_leaves
0,0,1,0,0,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
1,1,1,0,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
2,1,0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
3,0,1,0,0,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0
4,1,1,0,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0


We will now run Pearson and Spearman correlations to check which variable categories correlate most strongly with `edible`

In [6]:
corr_spearman = df_ohe.corr(method='spearman')['edible'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman

odor_none                         0.785557
odor_foul                        -0.623842
stalk-surface-above-ring_silky   -0.587658
stalk-surface-below-ring_silky   -0.573524
ring-type_pendant                 0.540469
gill-size_broad                   0.540024
gill-size_narrow                 -0.540024
gill-color_buff                  -0.538808
bruises_bruises                   0.501530
bruises_no                       -0.501530
Name: edible, dtype: float64

In [7]:
corr_pearson = df_ohe.corr(method='pearson')['edible'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson

odor_none                         0.785557
odor_foul                        -0.623842
stalk-surface-above-ring_silky   -0.587658
stalk-surface-below-ring_silky   -0.573524
ring-type_pendant                 0.540469
gill-size_narrow                 -0.540024
gill-size_broad                   0.540024
gill-color_buff                  -0.538808
bruises_bruises                   0.501530
bruises_no                       -0.501530
Name: edible, dtype: float64

According to this correlation study, a mushroom having no odor is the category most strongly correlated with edibility, with a Spearman correlation coefficient of 0.79 and a Pearson coefficient of 0.79. Hence mushrooms with no odor are typically edible. A mushroom having a silky stalk surface above the ring is strongly negatively correlated with edibility, with a Spearman correlation coefficient of -0.59 and a Pearson correlation coefficient of -0.59. Hence mushrooms with a silky stalk surface above the ring are typically poisonous.
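
The Spearman and Pearson outputs above agree to every printed decimal place. This is expected when both columns are 0/1 flags: ranking a binary column is only a linear rescaling of it, so the two methods give the same value. The sketch below checks this on a toy frame with made-up values.

```python
import pandas as pd

# Toy binary flags (illustrative values, not taken from the dataset)
toy = pd.DataFrame({
    "edible":    [1, 1, 1, 0, 0, 1, 0, 0],
    "odor_none": [1, 1, 0, 0, 0, 1, 1, 0],
})

# Same DataFrame.corr call pattern as used in the cells above
pearson = toy.corr(method="pearson").loc["edible", "odor_none"]
spearman = toy.corr(method="spearman").loc["edible", "odor_none"]
print(pearson, spearman)
```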

We will also check the percentage of mushrooms in each of these categories that are edible.

In [None]:
cat_edibility_flag_series = df_ohe['edible'][df_ohe['odor_none'].loc[lambda x: x==1].index] 

proportion_of_cat_edible = cat_edibility_flag_series.value_counts()[1]/len(cat_edibility_flag_series)
proportion_of_cat_edible 

In [None]:
cat_edibility_flag_series = df_ohe['edible'][df_ohe['stalk-surface-above-ring_silky'].loc[lambda x: x==1].index] 

proportion_of_cat_edible = cat_edibility_flag_series.value_counts()[1]/len(cat_edibility_flag_series)
proportion_of_cat_edible 

96.6% of all mushrooms with no odor are edible, and hence mushrooms of this category can be said to be typically edible. Only 6.06% of mushrooms with a silky stalk surface above the stalk ring are edible, and hence mushrooms of this category can be said to be typically poisonous.
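
An equivalent way to obtain such a proportion is to take the mean of the 0/1 target over the rows where the category flag is set. The sketch below uses a toy frame with made-up values. One possible advantage of this form is that it still returns a number when a category contains no edible mushrooms, whereas `value_counts()[1]` would raise a `KeyError` in that case.

```python
import pandas as pd

# Toy one-hot frame (illustrative values only)
toy = pd.DataFrame({
    "edible":    [1, 1, 1, 0, 0, 1, 0, 0],
    "odor_none": [1, 1, 0, 0, 0, 1, 1, 0],
})

# Keep only rows where the flag is set, then average the 0/1 target:
# the mean of a 0/1 column is the share of ones, i.e. the edible proportion
proportion_edible = toy.loc[toy["odor_none"] == 1, "edible"].mean()
print(f"{proportion_edible:.1%}")
```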

## Correlation Study - Which Mushroom Variables are Most Relevant for Plotting

We can use `OrdinalEncoder` to transform categorical variables into integer values, so that they may be numerically correlated with `edible`. This allows us to measure how an entire variable correlates with edibility (e.g. `odor`), rather than only individual variable categories as with `OneHotEncoder` previously (e.g. `odor=none`). First, we determine the categorical variables in the dataset and store their category labels in a list:

In [8]:
df_oe = df.copy()
cols = df.columns[df.dtypes=='object'].to_list()

cat_list=[]

for col in cols:
    print(col)
    print(df[col].unique())
    cat_list.append(list(df[col].unique()))

cap-shape
['convex' 'bell' 'sunken' 'flat' 'knobbed' 'conical']
cap-surface
['smooth' 'scaly' 'fibrous' 'grooves']
cap-color
['brown' 'yellow' 'white' 'gray' 'red' 'pink' 'buff' 'purple' 'cinnamon'
 'green']
bruises
['bruises' 'no']
odor
['pungent' 'almond' 'anise' 'none' 'foul' 'creosote' 'fishy' 'spicy'
 'musty']
gill-attachment
['free' 'attached']
gill-spacing
['close' 'crowded']
gill-size
['narrow' 'broad']
gill-color
['black' 'brown' 'gray' 'pink' 'white' 'chocolate' 'purple' 'red' 'buff'
 'green' 'yellow' 'orange']
stalk-shape
['enlarging' 'tapering']
stalk-root
['equal' 'club' 'bulbous' 'rooted' 'missing']
stalk-surface-above-ring
['smooth' 'fibrous' 'silky' 'scaly']
stalk-surface-below-ring
['smooth' 'fibrous' 'scaly' 'silky']
stalk-color-above-ring
['white' 'gray' 'pink' 'brown' 'buff' 'red' 'orange' 'cinnamon' 'yellow']
stalk-color-below-ring
['white' 'pink' 'gray' 'buff' 'brown' 'red' 'yellow' 'orange' 'cinnamon']
veil-type
['partial']
veil-color
['white' 'brown' 'orange' 'y

In order to pass these category names to the ordinal encoder, the nested list `cat_list` will be used.

In [9]:
print(cat_list)

[['convex', 'bell', 'sunken', 'flat', 'knobbed', 'conical'], ['smooth', 'scaly', 'fibrous', 'grooves'], ['brown', 'yellow', 'white', 'gray', 'red', 'pink', 'buff', 'purple', 'cinnamon', 'green'], ['bruises', 'no'], ['pungent', 'almond', 'anise', 'none', 'foul', 'creosote', 'fishy', 'spicy', 'musty'], ['free', 'attached'], ['close', 'crowded'], ['narrow', 'broad'], ['black', 'brown', 'gray', 'pink', 'white', 'chocolate', 'purple', 'red', 'buff', 'green', 'yellow', 'orange'], ['enlarging', 'tapering'], ['equal', 'club', 'bulbous', 'rooted', 'missing'], ['smooth', 'fibrous', 'silky', 'scaly'], ['smooth', 'fibrous', 'scaly', 'silky'], ['white', 'gray', 'pink', 'brown', 'buff', 'red', 'orange', 'cinnamon', 'yellow'], ['white', 'pink', 'gray', 'buff', 'brown', 'red', 'yellow', 'orange', 'cinnamon'], ['partial'], ['white', 'brown', 'orange', 'yellow'], ['one', 'two', 'none'], ['pendant', 'evanescent', 'large', 'flaring', 'none'], ['black', 'brown', 'purple', 'chocolate', 'white', 'green', 'or

The above can be input into the `categories` argument of `OrdinalEncoder`.

In [10]:
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(categories=cat_list)
encoded_df = encoder.fit_transform(df[cols])

for i, col in enumerate(cols):
    df_oe[col] = encoded_df[:,i]

df_oe.head()

Unnamed: 0,edible,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
2,1,1.0,0.0,2.0,0.0,2.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,2.0
3,0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1,0.0,0.0,3.0,1.0,3.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,2.0,1.0


We will now run the correlation methods on the encoded dataframe, using both Spearman and Pearson methods, in order to determine the variables in the dataset most relevant to the target, `edible`

In [11]:
corr_spearman = df_oe.corr(method='spearman')['edible'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman

odor                       -0.771088
ring-type                  -0.579335
spore-print-color          -0.555944
gill-size                   0.540024
stalk-surface-above-ring   -0.536555
bruises                    -0.501530
stalk-surface-below-ring   -0.500008
gill-color                 -0.399424
gill-spacing                0.348387
stalk-root                 -0.341438
Name: edible, dtype: float64

In [12]:
corr_pearson = df_oe.corr(method='pearson')['edible'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson

odor                       -0.582015
spore-print-color          -0.560715
ring-type                  -0.556515
stalk-surface-above-ring   -0.552044
gill-size                   0.540024
stalk-surface-below-ring   -0.532452
bruises                    -0.501530
gill-color                 -0.412869
gill-spacing                0.348387
stalk-root                 -0.337542
Name: edible, dtype: float64

There is an issue with extracting correlation coefficients from this method of encoding: the coefficients depend on the integer assigned to each category, which is determined purely by the order in which categories first appear in the original dataframe and is thus entirely arbitrary. To illustrate this, we will randomize the ordering of the list elements in `cat_list`:

In [13]:
import random
random.seed(123)
print(cat_list)
for cat in cat_list:
    random.shuffle(cat)
print(cat_list)

[['convex', 'bell', 'sunken', 'flat', 'knobbed', 'conical'], ['smooth', 'scaly', 'fibrous', 'grooves'], ['brown', 'yellow', 'white', 'gray', 'red', 'pink', 'buff', 'purple', 'cinnamon', 'green'], ['bruises', 'no'], ['pungent', 'almond', 'anise', 'none', 'foul', 'creosote', 'fishy', 'spicy', 'musty'], ['free', 'attached'], ['close', 'crowded'], ['narrow', 'broad'], ['black', 'brown', 'gray', 'pink', 'white', 'chocolate', 'purple', 'red', 'buff', 'green', 'yellow', 'orange'], ['enlarging', 'tapering'], ['equal', 'club', 'bulbous', 'rooted', 'missing'], ['smooth', 'fibrous', 'silky', 'scaly'], ['smooth', 'fibrous', 'scaly', 'silky'], ['white', 'gray', 'pink', 'brown', 'buff', 'red', 'orange', 'cinnamon', 'yellow'], ['white', 'pink', 'gray', 'buff', 'brown', 'red', 'yellow', 'orange', 'cinnamon'], ['partial'], ['white', 'brown', 'orange', 'yellow'], ['one', 'two', 'none'], ['pendant', 'evanescent', 'large', 'flaring', 'none'], ['black', 'brown', 'purple', 'chocolate', 'white', 'green', 'or

With the newly ordered list, rerun the encoding:

In [None]:
encoder = OrdinalEncoder(categories=cat_list)
encoded_array = encoder.fit_transform(df[cols])
df_new_oe = df.copy()

for i, col in enumerate(cols):
    df_new_oe[col] = encoded_array[:,i]

df_new_oe.head()

And rerun the correlation methods:

In [None]:
print(f"Old Spearman correlation coefficients:\n{corr_spearman}")
new_corr_spearman = df_new_oe.corr(method='spearman')['edible'].sort_values(key=abs, ascending=False)[1:].head(10)
print(f"New Spearman correlation coefficients:\n{new_corr_spearman}")

In [None]:
print(f"Old Pearson correlation coefficients:\n{corr_pearson}")
new_corr_pearson = df_new_oe.corr(method='pearson')['edible'].sort_values(key=abs, ascending=False)[1:].head(10)
print(f"New Pearson correlation coefficients:\n{new_corr_pearson}")
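
The same effect can be reproduced on a toy column using only pandas. In the sketch below, the data, the helper `ordinal_corr`, and the two category orderings are all made up for illustration; the point is only that the same column can correlate negatively or positively with the target depending on which integers the categories receive.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "edible": [1, 1, 0, 0, 1, 0],
    "odor":   ["none", "none", "foul", "foul", "almond", "pungent"],
})

def ordinal_corr(order):
    """Pearson correlation between `edible` and `odor` encoded by position in `order`."""
    codes = toy["odor"].map({cat: i for i, cat in enumerate(order)})
    return toy["edible"].corr(codes)

# Two arbitrary orderings of the same four categories
corr_a = ordinal_corr(["none", "almond", "foul", "pungent"])
corr_b = ordinal_corr(["foul", "none", "pungent", "almond"])
print(corr_a, corr_b)
```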

The correlation coefficients have clearly changed as a result of changing the order in which categories are encoded, which may lead us to select different variables as the most correlated to `edible`. As such, we would prefer an encoder that is wholly agnostic to the ordering of the dataset and is based on a fixed statistical property of the data. To do this, we use a `TargetEncoder`, which encodes each category by how frequently the target takes its positive value within that category. In the case of this dataset, this is how frequently a mushroom of each category is edible. For example, as 53.2823% of all mushrooms with `cap-shape=bell` have `edible=1`, `cap-shape=bell` is encoded as `cap-shape=0.532823`. For demonstration, view the output of the following 2 code cells.

In [None]:
df.head()

In [None]:
from category_encoders import TargetEncoder
cols = df.columns[df.dtypes=='object'].to_list()
encoder = TargetEncoder()
df_te = df.copy()

for col in cols:
    df_te[col] = encoder.fit_transform(df[col], df['edible'])

df_te.head()
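
The idea behind target encoding can be sketched in plain pandas as a per-category mean of the target. This is a simplified illustration on made-up data, not the library's implementation: the library encoder can also blend each category mean with the overall mean (smoothing), so its values need not match this sketch exactly, particularly for rare categories.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "edible":    [1, 1, 0, 1, 0, 0],
    "cap-shape": ["bell", "bell", "bell", "convex", "convex", "flat"],
})

# Share of edible rows per category (un-smoothed target mean)
category_means = toy.groupby("cap-shape")["edible"].mean()

# Replace each category label with its share
toy["cap-shape_encoded"] = toy["cap-shape"].map(category_means)
print(toy)
```

Because the encoded value depends only on the category and the target, shuffling the category order leaves it unchanged.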

We now repeat the correlation methods for the dataset with the new encoding method.

In [None]:
corr_spearman = df_te.corr(method='spearman')['edible'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman

In [None]:
corr_pearson = df_te.corr(method='pearson')['edible'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson

This encoding method appears to provide much higher correlation coefficients than ordinal encoding. The coefficients are also stable, as the encoding is derived from the data itself rather than from an arbitrary ordering of the categories. Hence these coefficients will be used to provide insight into which variables correlate most with edibility.

It appears that a small number of variables correlate strongly with whether the mushrooms are edible or poisonous, the strongest being `odor`, with a correlation coefficient of 0.92 when calculated by the Spearman method and 0.97 when calculated by the Pearson method. As there are no numerical variables in this dataset, such correlation coefficients do not show a relationship between two measured variables, but rather how strongly different categorical properties of the mushrooms can be said to indicate edibility.

In [None]:
top_n = 5
set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

Therefore we will study the following variables. We will investigate if:

* Mushrooms with `buff` for `gill-color` are most liable to be poisonous
* Mushrooms with `foul` for `odor` are most liable to be poisonous
* Mushrooms with `pendant` for `ring-type` have the best chance of being edible
* Mushrooms with `buff` for `spore-print-color` have the best chance of being edible
* Mushrooms with `silky` for `stalk-surface-above-ring` are most liable to be poisonous

In [None]:
vars_to_study = ['gill-color',
                'odor',
                'ring-type',
                'spore-print-color',
                'stalk-surface-above-ring']
vars_to_study

---

# EDA on selected variables

In [None]:
df_eda = df.filter(vars_to_study + ['edible'])
df_eda.head()

## Variables Distribution by Edibility

Plotting the distributions (categorical) coloured by `edible`, recalling that poisonous=0, edible=1:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')


def plot_categorical(df, col, target_var):
    """ Plots distribution of categorical variables with respect to a target variable """
    plt.figure(figsize=(12, 5))
    sns.countplot(data=df, x=col, hue=target_var, order=df[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()

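target_var = 'edible'
for col in vars_to_study:
    # All selected variables are categorical; anything else is skipped,
    # since no numerical plotting helper is defined in this notebook
    if df_eda[col].dtype == 'object':
        plot_categorical(df_eda, col, target_var)
        print("\n\n")
    else:
        print(f"Skipping {col}: not a categorical variable")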
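
Count plots show absolute numbers, so rare categories can be hard to read. As a complement, the per-category share of each target class can be tabulated with `pd.crosstab`. The sketch below uses made-up data and is only an illustration; the same call could be applied to `df_eda` for each variable in `vars_to_study`.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "edible": [1, 1, 0, 0, 0, 1],
    "odor":   ["none", "none", "foul", "foul", "none", "almond"],
})

# Rows: categories of the variable; columns: target classes (0 = poisonous, 1 = edible).
# normalize="index" makes each row sum to 1, giving per-category proportions
share_by_category = pd.crosstab(toy["odor"], toy["edible"], normalize="index")
print(share_by_category)
```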

## Parallel Plot

Create a multi-dimensional plot of the categorical data, with each line coloured by `edible`

In [None]:
import plotly.express as px
fig = px.parallel_categories(df_eda, color="edible")
fig.show()

---

# Conclusions

The correlations and plot interpretations converge, in that it can be observed that the selected categories appear to be significant indicators of whether mushrooms are edible or not. It was found that:

* Mushrooms with `buff` for `gill-color` are most liable to be poisonous
* Mushrooms with `foul` for `odor` are most liable to be poisonous
* Mushrooms with `pendant` for `ring-type` have the best chance of being edible
* Mushrooms with `buff` for `spore-print-color` have the best chance of being edible
* Mushrooms with `silky` for `stalk-surface-above-ring` are most liable to be poisonous

Furthermore, from the "Correlation study - Which Mushroom Variable Categories Correlate Most to Edibility" section, the project hypotheses have been validated:

* A mushroom having no odor is strongly correlated to edibility, with the flag `odor_none` having a Spearman correlation coefficient of 0.79 and a Pearson coefficient of 0.79 to the `edible` flag. It was also found that 96.6% of all mushrooms in the dataset with no odor are edible. Hence, it can be said mushrooms without an odor are typically edible.
* A mushroom having a silky consistency above the ring is strongly negatively correlated to a mushroom being edible, with the flag `stalk-surface-above-ring_silky` having a Spearman correlation coefficient of -0.59 and Pearson correlation coefficient of -0.59 to the `edible` flag. It was also found only 6.06% of mushrooms with a silky stalk surface above the stalk ring are edible. Hence mushrooms with a silky consistency on the stalk above the ring are typically poisonous.