# **Mushroom Edibility Study**

## Objectives

* Answer business requirement 1:
    * The client would like to better understand the patterns in the mushroom database in order to learn which variables of a mushroom are most indicative of it being edible.

## Inputs

* outputs/datasets/collection/mushrooms.csv

## Outputs

* Generate code and seaborn plots that answer business requirement 1 and can be used for the Streamlit App


---

# Change working directory

We need to change the working directory from the current `jupyter_notebooks` folder to its parent folder in order to access the whole project.
* We access the current directory with `os.getcwd()`

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Load Data

Load in the dataset to a dataframe.

In [None]:
import pandas as pd
df = pd.read_csv("outputs/datasets/collection/mushrooms.csv")
df.head(3)

---

# Data Exploration

We wish to become familiar with the dataset: check variable types and their distributions, check for any missing data, and understand what these variables mean in the business context.

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

`veil-type` is a clearly redundant variable, as all mushrooms in the dataset have the same value, 'p' - partial. As such it will be dropped.

In [None]:
df = df.drop(['veil-type'], axis=1)
df.head()
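Constant columns can also be found programmatically rather than by reading the profile report. The following is a minimal sketch on a hypothetical toy dataframe (`toy` and its values are illustrative assumptions, not rows from the real dataset):

```python
import pandas as pd

# Toy stand-in for the mushroom dataframe; values are illustrative only.
toy = pd.DataFrame({
    "class": [1, 0, 1, 0],
    "odor": ["n", "f", "a", "f"],
    "veil-type": ["p", "p", "p", "p"],
})

# A column with a single unique value carries no information about the target.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.to_list()
print(constant_cols)

# Drop every constant column in one step.
toy_reduced = toy.drop(columns=constant_cols)
```

The same two lines applied to `df` would list any column that could be dropped for the same reason as `veil-type`.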

---

# Correlation study

We can use `OrdinalEncoder` to transform categorical variables into integer values, so that they can be numerically correlated to `class`. This is preferred over `OneHotEncoder` because the dataset has a large number of categorical variables: one-hot encoding would produce a dataset with too many columns, leading to a ['Curse of Dimensionality'](https://en.wikipedia.org/wiki/Curse_of_dimensionality) scenario for the models. First, we determine the categorical variables in the dataset and store the unique labels of each one in a list:

In [None]:
cols = df.columns[df.dtypes=='object'].to_list()
df_oe = df.copy()

cat_list=[]

for col in cols:
    print(col)
    print(df[col].unique())
    cat_list.append(list(df[col].unique()))

`cat_list` holds one list of category names per variable, in the form the ordinal encoder expects.

In [None]:
cat_list

The above can be input into the `categories` argument of `OrdinalEncoder`.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(categories=cat_list)
encoded_array = encoder.fit_transform(df[cols])

for i, col in enumerate(cols):
    df_oe[col] = encoded_array[:,i]

df_oe.head(3)
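To put the column-count concern about one-hot encoding in numbers, the sketch below compares the two encodings on a hypothetical toy dataframe. The frame `toy` and its values are assumptions for illustration, and `pd.get_dummies` is used here as a stand-in for `OneHotEncoder`:

```python
import pandas as pd

# Toy stand-in with three categorical variables; values are illustrative only.
toy = pd.DataFrame({
    "odor": ["n", "f", "a", "f"],
    "ring-type": ["p", "e", "p", "l"],
    "gill-color": ["b", "n", "k", "w"],
})

# One-hot encoding creates one column per category of each variable.
one_hot = pd.get_dummies(toy)
n_one_hot_cols = one_hot.shape[1]

# Ordinal encoding keeps a single column per variable.
n_ordinal_cols = toy.shape[1]

print(n_one_hot_cols, n_ordinal_cols)
```

On the real dataframe, `df[cols].nunique().sum()` would give the number of columns that one-hot encoding would produce.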

We will now run the correlation methods on the encoded dataframe, using both the Spearman and Pearson methods, in order to determine the variables in the dataset most relevant to the target, `class`.

In [None]:
corr_spearman = df_oe.corr(method='spearman')['class'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman

In [None]:
corr_pearson = df_oe.corr(method='pearson')['class'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson

There is one issue with extracting correlation coefficients from this method of encoding: the result depends on the integer assigned to each category, which is determined purely by the order in which categories appear in the original dataframe and is thus entirely arbitrary. To illustrate this, we will randomize the ordering of the list elements in `cat_list`:

In [None]:
import random
print(cat_list)
for cat in cat_list:
    random.shuffle(cat)
print(cat_list)

With the newly ordered list, rerun the encoding:

In [None]:
encoder = OrdinalEncoder(categories=cat_list)
encoded_array = encoder.fit_transform(df[cols])
df_new_oe = df.copy()

for i, col in enumerate(cols):
    df_new_oe[col] = encoded_array[:,i]

df_new_oe.head(3)

And rerun the correlation methods:

In [None]:
print(f"Old Spearman correlation coefficients:\n{corr_spearman}")
new_corr_spearman = df_new_oe.corr(method='spearman')['class'].sort_values(key=abs, ascending=False)[1:].head(10)
print(f"New Spearman correlation coefficients:\n{new_corr_spearman}")

In [None]:
print(f"Old Pearson correlation coefficients:\n{corr_pearson}")
new_corr_pearson = df_new_oe.corr(method='pearson')['class'].sort_values(key=abs, ascending=False)[1:].head(10)
print(f"New Pearson correlation coefficients:\n{new_corr_pearson}")
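The same effect can be reproduced on a very small example. The sketch below is an illustration under stated assumptions: `toy` is a hypothetical six-row dataframe, and a plain dictionary mapping stands in for `OrdinalEncoder` so that the category order is explicit.

```python
import pandas as pd

# Toy data; values are illustrative only.
toy = pd.DataFrame({
    "odor": ["a", "a", "f", "f", "n", "n"],
    "class": [1, 1, 0, 0, 1, 0],
})


def ordinal_corr(order):
    """Encode `odor` by position in `order`, then correlate with `class`."""
    codes = toy["odor"].map({cat: i for i, cat in enumerate(order)})
    return codes.corr(toy["class"], method="pearson")


# Same data, two different (equally arbitrary) category orderings.
corr_a = ordinal_corr(["a", "f", "n"])
corr_b = ordinal_corr(["f", "n", "a"])
print(corr_a, corr_b)
```

In this toy case the coefficient changes sign between the two orderings, although the underlying data is identical.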

The correlation coefficients have clearly changed as a result of changing the order in which categories are encoded. As such, we would prefer an encoder which is wholly agnostic with respect to the ordering of the dataset, and is based on some statistical reality of the data. To do this, we use a `TargetEncoder`, which replaces each category with a value based on the mean of the target within that category. For this dataset, that is how frequently a mushroom of each category is edible, e.g. for encoding `cap-shape = b` (b - bell), how frequently such mushrooms have `class = 1`.

In [None]:
from category_encoders import TargetEncoder
encoder = TargetEncoder()
df_te = df.copy()

for col in cols:
    df_te[col] = encoder.fit_transform(df[col], df['class'])

df_te.head()
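The idea behind target encoding can be sketched by hand with pandas alone. This is an unsmoothed illustration on a hypothetical toy dataframe, not a reproduction of `category_encoders.TargetEncoder`, which additionally blends each category mean with the overall target mean, so its values will generally differ slightly from the raw means shown here.

```python
import pandas as pd

# Toy data; values are illustrative only.
toy = pd.DataFrame({
    "cap-shape": ["b", "b", "x", "x", "x", "f"],
    "class": [1, 1, 1, 0, 0, 0],
})

# Unsmoothed target encoding: replace each category by the mean of the target in it.
category_means = toy.groupby("cap-shape")["class"].mean()
toy["cap-shape-te"] = toy["cap-shape"].map(category_means)
print(toy)

# Shuffling the rows does not change the value assigned to each category.
shuffled = toy.sample(frac=1, random_state=0)
shuffled_means = shuffled.groupby("cap-shape")["class"].mean()
```

Because the encoded value is computed from the target rather than from the position of a category in a list, reordering the rows or the categories leaves it unchanged.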

We now repeat the correlation methods for the dataset with the new encoding method.

In [None]:
corr_spearman = df_te.corr(method='spearman')['class'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman

In [None]:
corr_pearson = df_te.corr(method='pearson')['class'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson

This encoding method appears to give much higher correlation coefficients than ordinal encoding. The coefficients also do not depend on the order in which categories appear, since each encoded value is derived from the target itself. Hence these coefficients will be used for insight on the data.

It appears that only a few variables are strongly correlated with whether the mushrooms are edible or poisonous, the strongest being `odor`, with a correlation coefficient of 0.92 when calculated by the Spearman method and 0.97 when calculated by Pearson. As there are no numerical variables in this dataset, such correlation coefficients do not show a relationship between two measured variables, but rather how strongly different properties of the mushrooms can be said to predict edibility.

In [None]:
top_n = 5
set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

Therefore we will study the following variables. We will investigate if:

* Mushrooms with `b` (buff) for `gill-color` are most liable to be poisonous
* Mushrooms with `f` (foul) for `odor` are most liable to be poisonous
* Mushrooms with `p` (pendant) for `ring-type` have the best chance of being edible
* Mushrooms with `b` (buff) for `spore-print-color` have the best chance of being edible
* Mushrooms with `k` (silky) for `stalk-surface-above-ring` are most liable to be poisonous

In [None]:
vars_to_study = ['gill-color',
                'odor',
                'ring-type',
                'spore-print-color',
                'stalk-surface-above-ring']
vars_to_study
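Each hypothesis above amounts to a statement about the share of edible mushrooms within one category, which can be tabulated directly. The sketch below uses a hypothetical toy dataframe (values are illustrative, not taken from the real dataset); the sample count is included because a rate computed on very few rows is weak evidence.

```python
import pandas as pd

# Toy data; values are illustrative only.
toy = pd.DataFrame({
    "odor": ["f", "f", "f", "n", "n", "a"],
    "class": [0, 0, 0, 1, 1, 1],
})

# Share of edible mushrooms (class = 1) and sample count within each category.
summary = toy.groupby("odor")["class"].agg(edible_rate="mean", count="size")
summary = summary.sort_values("edible_rate")
print(summary)
```

Running the same aggregation on `df` for each entry of `vars_to_study` would show which category has the lowest or highest edible rate.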

---

# EDA on selected variables

In [None]:
df_eda = df.filter(vars_to_study + ['class'])
df_eda.head()

## Variables Distribution by Class

Plotting the distributions (categorical) coloured by `class`, recalling that poisonous=0, edible=1:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')


def plot_categorical(df, col, target_var):

    plt.figure(figsize=(12, 5))
    sns.countplot(data=df, x=col, hue=target_var, order=df[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()


def plot_numerical(df, col, target_var):
    plt.figure(figsize=(8, 5))
    sns.histplot(data=df, x=col, hue=target_var, kde=True, element="step")
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()


target_var = 'class'
for col in vars_to_study:
    if df_eda[col].dtype == 'object':
        plot_categorical(df_eda, col, target_var)
        print("\n\n")
    else:
        plot_numerical(df_eda, col, target_var)
        print("\n\n")

---

## Parallel Plot

The parallel categories plot shows how the categories of the selected variables combine across the dataset, with each path coloured by `class`.

In [None]:
import plotly.express as px
fig = px.parallel_categories(df_eda, color="class")
fig.show()

---

# Conclusions

The correlations and plot interpretations converge, in that it can be observed that the selected categories appear to be significant predictors of whether mushrooms are edible or not. It was found that:

* Mushrooms with `b` (buff) for `gill-color` are most liable to be poisonous
* Mushrooms with `f` (foul) for `odor` are most liable to be poisonous
* Mushrooms with `p` (pendant) for `ring-type` have the best chance of being edible
* Mushrooms with `b` (buff) for `spore-print-color` have the best chance of being edible
* Mushrooms with `k` (silky) for `stalk-surface-above-ring` are most liable to be poisonous