# Shrooms: Edible or Poisonous?

'Shrooming', the act of foraging for edible mushrooms in the forest, is growing in popularity. However, the hobby carries real dangers: there are countless mushroom species, many of which are easily mistaken for one another. One small mistake in mushroom identification can have enormous consequences: accidentally ingesting a poisonous mushroom can lead to serious medical complications and even death. In this notebook, I'll attempt to use data science methods to classify a mushroom as either edible or poisonous.

In [1]:
# Start by making key imports: NumPy and Pandas
import numpy as np
import pandas as pd

In [2]:
# Read in CSV file of our data
shrooms = pd.read_csv('../datasets/mushrooms.csv')

In [3]:
# Use the .describe() method for an initial summary of each column
shrooms.describe()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
count,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124,...,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124
unique,2,6,4,10,2,9,2,2,2,12,...,4,9,9,1,4,3,5,9,6,7
top,e,x,y,n,f,n,f,c,b,b,...,s,w,w,p,w,o,p,w,v,d
freq,4208,3656,3244,2284,4748,3528,7914,6812,5612,1728,...,4936,4464,4384,8124,7924,7488,3968,2388,4040,3148


In [4]:
# Check the number of rows and columns in the dataset
shrooms.shape

(8124, 23)

In [5]:
# Count the number of nulls in the dataset
shrooms.isnull().sum()

# No NaN values! Note that isnull() does not catch missing values coded as a placeholder string such as '?'

class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64

In [6]:
# Check the data types of the dataset. Every column in the dataset has dtype object
shrooms.dtypes

class                       object
cap-shape                   object
cap-surface                 object
cap-color                   object
bruises                     object
odor                        object
gill-attachment             object
gill-spacing                object
gill-size                   object
gill-color                  object
stalk-shape                 object
stalk-root                  object
stalk-surface-above-ring    object
stalk-surface-below-ring    object
stalk-color-above-ring      object
stalk-color-below-ring      object
veil-type                   object
veil-color                  object
ring-number                 object
ring-type                   object
spore-print-color           object
population                  object
habitat                     object
dtype: object

In [7]:
# For each column, display the column name and the unique values it contains
for column in shrooms:
    print(column)
    print(shrooms[column].unique())

class
['p' 'e']
cap-shape
['x' 'b' 's' 'f' 'k' 'c']
cap-surface
['s' 'y' 'f' 'g']
cap-color
['n' 'y' 'w' 'g' 'e' 'p' 'b' 'u' 'c' 'r']
bruises
['t' 'f']
odor
['p' 'a' 'l' 'n' 'f' 'c' 'y' 's' 'm']
gill-attachment
['f' 'a']
gill-spacing
['c' 'w']
gill-size
['n' 'b']
gill-color
['k' 'n' 'g' 'p' 'w' 'h' 'u' 'e' 'b' 'r' 'y' 'o']
stalk-shape
['e' 't']
stalk-root
['e' 'c' 'b' 'r' '?']
stalk-surface-above-ring
['s' 'f' 'k' 'y']
stalk-surface-below-ring
['s' 'f' 'y' 'k']
stalk-color-above-ring
['w' 'g' 'p' 'n' 'b' 'e' 'o' 'c' 'y']
stalk-color-below-ring
['w' 'p' 'g' 'b' 'n' 'e' 'y' 'o' 'c']
veil-type
['p']
veil-color
['w' 'n' 'o' 'y']
ring-number
['o' 't' 'n']
ring-type
['p' 'e' 'l' 'f' 'n']
spore-print-color
['k' 'n' 'u' 'h' 'w' 'r' 'o' 'y' 'b']
population
['s' 'n' 'a' 'v' 'y' 'c']
habitat
['u' 'g' 'm' 'd' 'p' 'w' 'l']
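
The output above lists a `?` among the `stalk-root` values, which suggests that missing entries may be coded as a placeholder string rather than as `NaN`. The sketch below, on a hypothetical toy dataframe standing in for `shrooms`, shows one way to count such placeholders; the assumption that `?` means "missing" is mine and is not confirmed by the data itself.

```python
import pandas as pd

# Hypothetical toy frame standing in for `shrooms` (values are made up)
toy = pd.DataFrame({
    "stalk-root": ["e", "?", "b", "?"],
    "odor": ["p", "a", "n", "n"],
})

# isnull() only flags NaN/None, so a '?' string counts as a present value
nan_counts = toy.isnull().sum()

# Count the placeholder explicitly, column by column
placeholder_counts = (toy == "?").sum()

print(nan_counts)
print(placeholder_counts)
```

Running the same comparison on the full dataset would show how many rows are affected before deciding whether to keep `?` as its own category.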


#### Need for dummy variables

Looking at the unique values in each column of the dataset, we can see that every column in the shrooms dataset holds non-numerical category codes. This means we're going to have to create a new 0/1 column for each unique value of each column. These will be `dummy variables` that can be used for quantitative analysis.
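
As a minimal sketch of what a dummy variable looks like, the example below encodes a hypothetical four-row dataframe with `pd.get_dummies`. The toy values are made up for illustration; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns
toy = pd.DataFrame({
    "cap-shape": ["x", "b", "x", "s"],
    "bruises": ["t", "f", "f", "t"],
})

# One indicator column per (column, value) pair, e.g. 'cap-shape_x'
toy_dummies = pd.get_dummies(toy, dtype=int)
print(toy_dummies)
```

Each row has exactly one 1 among the `cap-shape_*` columns and one among the `bruises_*` columns, which is what lets a numerical model consume the categories.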

How many dummy variables, exactly? Let's take a look and see.

In [8]:
# Create a new dataframe that's simply a description of the shrooms dataset. 
foo = shrooms.describe()
foo

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
count,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124,...,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124
unique,2,6,4,10,2,9,2,2,2,12,...,4,9,9,1,4,3,5,9,6,7
top,e,x,y,n,f,n,f,c,b,b,...,s,w,w,p,w,o,p,w,v,d
freq,4208,3656,3244,2284,4748,3528,7914,6812,5612,1728,...,4936,4464,4384,8124,7924,7488,3968,2388,4040,3148


In [9]:
# Now we can create a column called 'sum' that adds up each summary row; its 'unique' entry is the total number of unique values in the dataset
foo['sum'] = foo.sum(axis = 1)
foo['sum']

count                      186852
unique                        119
top       exynfnfcbbtbsswwpwopwvd
freq                       108158
Name: sum, dtype: object

In [10]:
foo['sum']['unique'] * shrooms.shape[1]

2737

#### 119 Dummy Variables

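The `unique` row of the summary sums to 119: the total number of distinct values across all 23 columns, the target `class` included. That is the number of dummy variables a full encoding of this dataset would create; multiplying it by the number of columns, as in the previous cell, does not give a meaningful count. We'll create those variables when we come to the `Feature Engineering` portion of the notebook.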
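
The same total can be obtained more directly with `DataFrame.nunique()`. The sketch below uses a hypothetical three-column dataframe rather than the real data, and also shows how the count changes when the first category of each column is dropped (as `drop_first=True` does in `pd.get_dummies`). By the same arithmetic, 119 unique values over 23 columns would leave 119 - 23 = 96 columns under `drop_first=True`.

```python
import pandas as pd

# Hypothetical toy frame; 'veil-type' has a single value, as in the real data
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "cap-shape": ["x", "b", "s", "x"],
    "veil-type": ["p", "p", "p", "p"],
})

# Number of distinct values in each column
per_column = toy.nunique()

# Dummy columns with full one-hot encoding
total_values = int(per_column.sum())

# Dummy columns when one category per column is dropped
with_drop_first = int((per_column - 1).sum())

print(per_column)
print(total_values, with_drop_first)
```

A column with a single value, such as `veil-type`, contributes no dummy column at all once the first category is dropped, since it carries no information.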

**What exactly would be considered a good model?** This question is unique to each data science problem. Let's look at the percentage of edible mushrooms in our dataset:

In [11]:
# The following number is the proportion of edible mushrooms in our dataset:
shrooms[shrooms['class'] == 'e'].shape[0] / shrooms.shape[0]

0.517971442639094

#### 51.8% of mushrooms in our dataset are edible.

The value 0.5180 is the proportion of edible mushrooms in our dataset. This might lead some to say that any model with an accuracy score above 51.8% is good, since it outperforms the baseline of always predicting the majority class. However, the consequences of eating a poisonous mushroom are dire: even one poisonous mushroom misclassified as edible can result in death. Accuracy alone is therefore not enough; the model should miss as few poisonous mushrooms as possible. This is a good reminder that every data science problem is unique and should be approached as such.
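
To make the distinction concrete, here is a small sketch in plain Python that compares accuracy, the majority-class baseline, and the number of poisonous mushrooms wrongly labelled edible. The labels and predictions are hypothetical and are not results from any model in this notebook.

```python
# Hypothetical labels: 'e' = edible, 'p' = poisonous
y_true = ["e", "e", "e", "p", "p", "e", "p", "e", "p", "e"]
# Hypothetical predictions with a single mistake (index 4: poisonous called edible)
y_pred = ["e", "e", "e", "p", "e", "e", "p", "e", "p", "e"]

n = len(y_true)

# Share of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

# Baseline: always predict the most common class
majority = max(set(y_true), key=y_true.count)
baseline = y_true.count(majority) / n

# The dangerous error: a poisonous mushroom predicted as edible
missed_poisonous = sum(t == "p" and p == "e" for t, p in zip(y_true, y_pred))

# Recall on the poisonous class: share of poisonous mushrooms that were caught
poison_recall = 1 - missed_poisonous / y_true.count("p")

print(accuracy, baseline, missed_poisonous, poison_recall)
```

In this toy case the model beats the baseline comfortably on accuracy, yet it still lets one poisonous mushroom through, which is the error that matters most here.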

## Feature Engineering

The main (and perhaps only) piece of feature engineering required in this problem is the creation of the `dummy variables` for the mushroom dataset.

The easiest way to create all the `dummy variables` is to write a function that builds them for every column and returns a new dataframe containing only the dummies. Once the dummies have been created, the original columns are no longer needed.

In [None]:
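def dummify(df):
    # Collect the dummy columns in a new dataframe that shares df's index
    df2 = pd.DataFrame(index = df.index)
    for column in df:
        # One indicator column per category; drop_first removes the first category
        dummy = pd.get_dummies(df[column], prefix = column, drop_first = True)
        for i in dummy:
            # Copy the dummy values themselves, not the column name
            df2[i] = dummy[i]
    return df2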

In [None]:
shrooms = dummify(shrooms)
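
As an alternative sketch, `pd.get_dummies` can also encode a whole dataframe in one call, which makes it easy to keep the target column separate from the features. The toy dataframe and the names `X` and `y` below are hypothetical, and coding poisonous as 1 is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for the raw `shrooms` data
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["p", "a", "n", "p"],
    "bruises": ["t", "f", "f", "t"],
})

# Keep the target separate: 1 = poisonous, 0 = edible (an assumed coding)
y = (toy["class"] == "p").astype(int)

# Encode every remaining column at once, dropping the first category of each
X = pd.get_dummies(toy.drop(columns = "class"), drop_first = True, dtype = int)

print(X)
print(y)
```

Separating `y` before encoding avoids ending up with a `class_p` column mixed in among the features.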