Copyright 2020 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.


# Naive Bayes: Problem solving

In this session, we will use the `mushroom` dataset.
This dataset describes mushrooms along various nominal variables and labels them as poisonous or edible.
Because the original dataset is a fair bit larger, we've randomly sampled 2000 rows.

The goal is to predict `class`: whether the mushroom is poisonous or not.

| Variable                 | Type    | Description                                                                                         |
|:--------------------------|:---------|:-----------------------------------------------------------------------------------------------------|
| cap-shape                | Nominal | bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s                                                |
| cap-surface              | Nominal | fibrous=f,grooves=g,scaly=y,smooth=s                                                                |
| cap-color                | Nominal | brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y                    |
| bruises?                 | Nominal | bruises=t,no=f                                                                                      |
| odor                     | Nominal | almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s                        |
| gill-attachment          | Nominal | attached=a,descending=d,free=f,notched=n                                                            |
| gill-spacing             | Nominal | close=c,crowded=w,distant=d                                                                         |
| gill-size                | Nominal | broad=b,narrow=n                                                                                    |
| gill-color               | Nominal | black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y |
| stalk-shape              | Nominal | enlarging=e,tapering=t                                                                              |
| stalk-surface-above-ring | Nominal | fibrous=f,scaly=y,silky=k,smooth=s                                                                  |
| stalk-surface-below-ring | Nominal | fibrous=f,scaly=y,silky=k,smooth=s                                                                  |
| stalk-color-above-ring   | Nominal | brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y                            |
| stalk-color-below-ring   | Nominal | brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y                            |
| veil-type                | Nominal | partial=p,universal=u                                                                               |
| veil-color               | Nominal | brown=n,orange=o,white=w,yellow=y                                                                   |
| ring-number              | Nominal | none=n,one=o,two=t                                                                                  |
| ring-type                | Nominal | cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z                      |
| spore-print-color        | Nominal | black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y                      |
| population               | Nominal | abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y                                 |
| habitat                  | Nominal | grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d                                       |
| class                    | Nominal | edible or poisonous                                                                                 |

<div style="text-align:center;font-size: smaller">
 <b>Source:</b> This dataset was taken from the <a href="https://archive.ics.uci.edu/ml/datasets/Mushroom">UCI Machine Learning Repository</a>.</div>
<br>

## Load data

Import `pandas` so we can load a dataframe.

Load the dataframe with `datasets/mushroom.csv`.
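A minimal sketch of the loading step. Because `datasets/mushroom.csv` may not be present wherever this runs, the example reads a three-row, hypothetical stand-in from a string and shows the real call in a comment. The variable name `dataframe` is an assumption.

```python
import io

import pandas as pd

# Hypothetical stand-in for datasets/mushroom.csv (values invented for illustration).
csv_text = """cap-shape,odor,class
x,n,edible
f,f,poisonous
b,a,edible
"""

# With the real file, this would be: dataframe = pd.read_csv("datasets/mushroom.csv")
dataframe = pd.read_csv(io.StringIO(csv_text))

print(dataframe.head())
```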

## Explore data

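Check that the data makes sense with summary statistics from `describe`. Because every variable is nominal, the summary reports `count`, `unique`, `top`, and `freq` rather than a five-number summary.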
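As a rough illustration of what the summary looks like for nominal data, the sketch below calls `describe` on a hypothetical five-row frame. The column values are invented; only the shape of the output carries over to the real data.

```python
import pandas as pd

# Hypothetical five-row stand-in for the mushroom data (values invented).
dataframe = pd.DataFrame({
    "odor": ["n", "n", "f", "n", "a"],
    "class": ["edible", "edible", "poisonous", "edible", "edible"],
})

# include="all" makes sure the nominal (text) columns are summarized.
summary = dataframe.describe(include="all")
print(summary)
```

Here `top` is the most common level of each variable and `freq` is how many rows have that level.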

------------------
**QUESTION:**

Did any variables have NaN? How do you know?

**ANSWER: (click here to edit)**


<hr>

**QUESTION:**

Looking at `freq` for each variable, how do you think the levels of these variables are distributed?

**ANSWER: (click here to edit)**


<hr>

Plot each variable separately.

First import `plotly.express`.

Create an empty histogram figure.

Plot histograms of all the variables in a loop.

------------------
**QUESTION:**

What do you think about the distribution of levels of the variables now?

**ANSWER: (click here to edit)**


<hr>

## Prepare train/test sets

Separate our predictors (`X`) from our class label (`Y`), putting each into its own dataframe.
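A short sketch of the separation, using a hypothetical four-row frame. Selecting with double brackets keeps `Y` as a dataframe rather than a series.

```python
import pandas as pd

# Hypothetical stand-in for the mushroom data (values invented).
dataframe = pd.DataFrame({
    "cap-shape": ["x", "f", "b", "x"],
    "odor": ["n", "f", "a", "n"],
    "class": ["edible", "poisonous", "edible", "edible"],
})

X = dataframe.drop(columns=["class"])  # every column except the label
Y = dataframe[["class"]]               # double brackets keep a dataframe
```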

Convert the nominal variables in `X` to dummies, storing the result in `X`. 
Keep all levels. 
We're doing this because we will use Bernoulli naive Bayes.
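The dummy coding might look like the sketch below, on a hypothetical two-column `X`. `pd.get_dummies` keeps every level by default (`drop_first=False`), so each original variable becomes one binary column per level, which is the kind of input Bernoulli naive Bayes models.

```python
import pandas as pd

# Hypothetical predictors (values invented).
X = pd.DataFrame({
    "cap-shape": ["x", "f", "b", "x"],
    "odor": ["n", "f", "a", "n"],
})

# One binary column per level; no level is dropped.
X = pd.get_dummies(X)
print(X.columns.tolist())
```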

To split the data into train/test sets, import `model_selection`.

And do the actual splitting of data, using `random_state=1`.
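A sketch of the split on a hypothetical eight-row frame, assuming scikit-learn is installed. The names `Xtrain`, `Xtest`, `Ytrain`, and `Ytest` are assumptions. With no `test_size` given, `train_test_split` holds out 25% of the rows.

```python
import pandas as pd
from sklearn import model_selection

# Hypothetical stand-in for the mushroom data (values invented).
dataframe = pd.DataFrame({
    "odor": ["n", "f", "n", "f", "a", "p", "n", "f"],
    "class": ["edible", "poisonous", "edible", "poisonous",
              "edible", "poisonous", "edible", "poisonous"],
})
X = pd.get_dummies(dataframe.drop(columns=["class"]))
Y = dataframe[["class"]]

# random_state=1 makes the shuffle reproducible.
Xtrain, Xtest, Ytrain, Ytest = model_selection.train_test_split(
    X, Y, random_state=1
)
print(Xtrain.shape, Xtest.shape)
```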

### Fit model

Import libraries for:

- Naive Bayes
- Metrics
- Ravel

Create the Bernoulli naive Bayes model.

Train the model by calling `fit` on it.

Get and save predictions.
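The import, create, fit, and predict steps above can be sketched together as follows. The twelve-row frame is hypothetical, and the names `bnb` and `predictions` are assumptions. `np.ravel` flattens the one-column `Ytrain` dataframe into the 1-D array that `fit` expects.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection, naive_bayes

# Hypothetical stand-in for the mushroom data (values invented).
dataframe = pd.DataFrame({
    "odor": ["n", "a", "l", "n", "n", "a", "f", "p", "f", "f", "p", "f"],
    "gill-size": ["b", "b", "b", "n", "b", "b", "n", "n", "b", "n", "n", "n"],
    "class": ["edible"] * 6 + ["poisonous"] * 6,
})
X = pd.get_dummies(dataframe.drop(columns=["class"]))
Y = dataframe[["class"]]
Xtrain, Xtest, Ytrain, Ytest = model_selection.train_test_split(
    X, Y, random_state=1
)

bnb = naive_bayes.BernoulliNB()     # create the model
bnb.fit(Xtrain, np.ravel(Ytrain))   # train on the training rows
predictions = bnb.predict(Xtest)    # predicted labels for the test rows
print(predictions)
```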

### Evaluate the model

Get the accuracy.
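To show the call in isolation, the sketch below scores two short, hand-written label lists. These are hypothetical values, not output from the mushroom model.

```python
from sklearn import metrics

# Hypothetical true labels and predictions (invented for illustration).
Ytest = ["edible", "edible", "poisonous", "poisonous",
         "edible", "poisonous", "edible", "poisonous"]
predictions = ["edible", "edible", "poisonous", "edible",
               "edible", "poisonous", "poisonous", "poisonous"]

# Accuracy is the fraction of predictions that match the true label.
accuracy = metrics.accuracy_score(Ytest, predictions)
print(accuracy)
```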

And get the recall and precision.
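One way to get both measures per class is `classification_report`, sketched here on hypothetical hand-written labels rather than the mushroom model's output. Passing `output_dict=True` returns the same numbers as a dictionary, which is convenient for looking up a single value.

```python
from sklearn import metrics

# Hypothetical true labels and predictions (invented for illustration).
Ytest = ["edible"] * 4 + ["poisonous"] * 4
predictions = ["edible", "edible", "edible", "edible",
               "poisonous", "poisonous", "poisonous", "edible"]

# Text table of precision, recall, and F1 for each class.
print(metrics.classification_report(Ytest, predictions))

# Same numbers as a nested dictionary keyed by class label.
report = metrics.classification_report(Ytest, predictions, output_dict=True)
poisonous_recall = report["poisonous"]["recall"]
poisonous_precision = report["poisonous"]["precision"]
```

In this toy example one poisonous mushroom is labeled edible, which lowers recall for `poisonous` while its precision stays perfect.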

Performance is very good for both classes.

------------------
**QUESTION:**

With this level of accuracy, would you eat a mushroom that the classifier said wasn't poisonous?

**ANSWER: (click here to edit)**


<hr>

## Visualizing

### Feature importance

To see the feature importances, create a dataframe of the probabilities of the predictors given the class label, i.e. `feature_log_prob_`. Give that dataframe correct row/column names (using `index` and `columns`), and finally exponentiate it, because the stored values are natural-log probabilities.
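A minimal sketch of building that dataframe, on a hypothetical eight-row frame and fitting on all rows for brevity (the notebook fits on the training split). scikit-learn stores `feature_log_prob_` as natural logarithms, so this sketch undoes the log with `np.exp`. The names `log_probs` and `probabilities` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn import naive_bayes

# Hypothetical stand-in for the mushroom data (values invented).
dataframe = pd.DataFrame({
    "odor": ["n", "n", "a", "n", "f", "f", "p", "f"],
    "class": ["edible"] * 4 + ["poisonous"] * 4,
})
X = pd.get_dummies(dataframe.drop(columns=["class"]))
Y = dataframe[["class"]]

bnb = naive_bayes.BernoulliNB()
bnb.fit(X, np.ravel(Y))

# Rows are class labels, columns are the dummy-coded features.
log_probs = pd.DataFrame(
    bnb.feature_log_prob_, index=bnb.classes_, columns=X.columns
)
# Undo the natural log to get P(feature = 1 | class).
probabilities = np.exp(log_probs)
print(probabilities)
```

Because the model applies Laplace smoothing by default (`alpha=1.0`), no probability is exactly 0 or 1, even for a level that never occurs in a class.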

Plot these feature importances in a loop, where each is a bar plot.

**Note:** This will take some time. Once it's done, you should be able to scroll down to view.

------------------
**QUESTION:**

Are there any single features you think you'd trust to tell the difference between edible and poisonous?

**ANSWER: (click here to edit)**


<hr>