# Predictive Modeling

This notebook covers the predictive modeling component of the mushroom classification problem. The goal is 100% accuracy. That is not the goal of every data science problem, but because this classification problem concerns mushrooms that are either poisonous or edible, 100% accuracy is a must. Even one error could result in someone's death if they eat a poisonous mushroom that was falsely classified as edible.

In [1]:
# First step -- let's load in some necessary libraries

import pandas as pd
import numpy as np

In [2]:
# Now that we've imported pandas, let's read in the binarized mushroom dataframe
shrooms = pd.read_csv('../datasets/binarized_mushroom_data.csv')

# Drop the column 'Unnamed: 0', which is most likely the old index written out when the dataframe was saved to CSV
shrooms.drop(columns = 'Unnamed: 0', inplace = True)

# Preview the dataframe
shrooms

Unnamed: 0,class_edible,class_poisonous,cap-shape_bell,cap-shape_conical,cap-shape_convex,cap-shape_flat,cap-shape_knobbed,cap-shape_subnken,cap-surface_fibrous,cap-surface_grooves,...,population_scattered,population_several,population_solitary,habitat_grasses,habitat_leaves,habitat_meadows,habitat_paths,habitat_urban,habitat_waste,habitat_woods
0,0,1,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
1,1,0,0,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,0,1,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
4,1,0,0,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0
8120,1,0,0,0,1,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
8121,1,0,0,0,0,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
8122,0,1,0,0,0,0,1,0,0,0,...,0,1,0,0,1,0,0,0,0,0


#### Drop column "class_poisonous"

For this problem, the two columns "class_edible" and "class_poisonous" are redundant. Only one of them is required: a 0 in the "class_edible" column already indicates that the mushroom is poisonous. If `class_poisonous` stayed in the dataframe, it would end up among the features and reveal the answer to the model directly, so let's drop the column `class_poisonous`.
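Before dropping a column on the grounds that it is redundant, it can be worth confirming that the two columns really are exact complements. The cell below is a minimal sketch of that check. The `toy` dataframe is a made-up stand-in, not the real data; on the real data, the same expression would be applied to `shrooms` before the drop.

```python
import pandas as pd

# Hypothetical stand-in for the binarized data (the real frame is `shrooms`)
toy = pd.DataFrame({
    'class_edible':    [0, 1, 1, 0, 1],
    'class_poisonous': [1, 0, 0, 1, 0],
})

# If the two columns are exact complements, every row sums to exactly 1
is_redundant = bool((toy['class_edible'] + toy['class_poisonous'] == 1).all())
print(f'Columns are complementary: {is_redundant}')
```

If this check came back `False` on the real data, some rows would be labeled as both or neither, and dropping one column would lose information.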

In [3]:
shrooms.drop(columns = 'class_poisonous', inplace = True)
shrooms.head()

Unnamed: 0,class_edible,cap-shape_bell,cap-shape_conical,cap-shape_convex,cap-shape_flat,cap-shape_knobbed,cap-shape_subnken,cap-surface_fibrous,cap-surface_grooves,cap-surface_scaly,...,population_scattered,population_several,population_solitary,habitat_grasses,habitat_leaves,habitat_meadows,habitat_paths,habitat_urban,habitat_waste,habitat_woods
0,0,0,0,1,0,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,0,0,0,1,0,0,0,0,0,1,...,1,0,0,0,0,0,0,1,0,0
4,1,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


### So what kind of model should we use?

Right away, we can tell that this is a classification problem: the target is a `discrete` label rather than a `continuous` value. That makes a `Logistic Regression` appropriate here, as opposed to a `Linear Regression`, which predicts continuous quantities. As we explore other machine learning models in this notebook, we should likewise stick to models built for `discrete` targets. After all, this problem really is a "yes or no" problem, with "yes = edible" and "no = poisonous".
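To make the "yes or no" idea concrete, here is a small sketch of what a logistic regression produces. The data is a made-up toy example (one binary feature, six rows), not the mushroom dataset, and names such as `X_toy` and `toy_model` are only for illustration. The model estimates a probability for each class and then reports the more likely class as its prediction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: a single binary feature that separates the classes
X_toy = np.array([[0], [0], [0], [1], [1], [1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])  # 1 = edible, 0 = poisonous

toy_model = LogisticRegression()
toy_model.fit(X_toy, y_toy)

# predict_proba returns one column per class, ordered as in toy_model.classes_
probs = toy_model.predict_proba(X_toy)

# predict returns the class with the higher estimated probability
labels = toy_model.predict(X_toy)

print(toy_model.classes_)
print(probs.round(3))
print(labels)
```

A linear regression fit to the same data would return unbounded numbers rather than class probabilities, which is why it is a poor fit for this kind of target.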

In [4]:
# Let's import LogisticRegression from sklearn, which is used for classification
# problems such as this
from sklearn.linear_model import LogisticRegression

# Let's also import train_test_split, which will be crucial later when testing our model's effectiveness
from sklearn.model_selection import train_test_split

#### Preliminary Logistic Regression Model

To start, let's use every variable available in our dataset in our model.

In [16]:
# Define X and y
X = shrooms.drop(columns = 'class_edible')
y = shrooms['class_edible']

# We'll use the same random state each time so we can objectively compare different model results
# Use train_test_split to create train and test data
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.2,
                                                    random_state = 42)
# Instantiate a Logistic Regression model
lg = LogisticRegression()

# Fit the model with the train data
lg.fit(X_train, y_train);

**So how did this model do?** Let's import some tools for analyzing this model's metrics.

In [13]:
from sklearn.metrics import mean_squared_error

In [21]:
preds = lg.predict(X_test)

print(f'MSE is {mean_squared_error(y_test, preds)}')
print(f'RMSE is {np.sqrt(mean_squared_error(y_test, preds))}')
print(f'The train score of this model is {lg.score(X_train, y_train)}')
print(f'The test score of this model is {lg.score(X_test, y_test)}')

MSE is 0.0
RMSE is 0.0
The train score of this model is 1.0
The test score of this model is 1.0


The MSE and RMSE are 0 and the `train score` and `test score` of this model are both 1.0, meaning the model classified every mushroom in the training and test sets correctly. (Because the labels are 0 and 1, the MSE is simply the fraction of misclassified samples, so an MSE of 0 and an accuracy of 1.0 say the same thing.) But is this ideal? Our goal was to have 100% accuracy, yes, but it could be that this model is too perfect. A perfect score on a single 80/20 split of one dataset does not guarantee the same result elsewhere: the model may still be overfit to this data, meaning that it could do a poor job of classifying newly inputted mushrooms as either edible or poisonous.
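One common way to probe this concern is k-fold cross-validation, which repeats the train/test exercise on several different splits instead of relying on one. The cell below is a sketch under stated assumptions: it uses synthetic data (`X_demo`, `y_demo`) rather than the mushroom dataframe, and the choice of 5 stratified folds is arbitrary. It also builds a confusion matrix from the out-of-fold predictions, because for this problem the count that matters most is poisonous mushrooms predicted as edible.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score

# Synthetic stand-in data (assumption): 200 rows of 5 binary features,
# with the label copied from the first feature
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 2, size=(200, 5))
y_demo = X_demo[:, 0]  # 1 = edible, 0 = poisonous

# Stratified folds keep the class balance similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000)

# One accuracy score per fold, each computed on data the model did not train on
scores = cross_val_score(model, X_demo, y_demo, cv=cv)

# Out-of-fold predictions: every row is predicted by a model that never saw it
oof_preds = cross_val_predict(model, X_demo, y_demo, cv=cv)

# labels=[0, 1] fixes the layout: rows are actual classes, columns are predicted
tn, fp, fn, tp = confusion_matrix(y_demo, oof_preds, labels=[0, 1]).ravel()

print(f'Fold accuracies: {scores}')
print(f'Poisonous predicted as edible (false positives): {fp}')
print(f'Edible predicted as poisonous (false negatives): {fn}')
```

On the real data, `X` and `y` from the earlier cell would take the place of `X_demo` and `y_demo`. Consistently perfect fold scores and zero false positives would be stronger evidence than a single split, though they still only describe mushrooms similar to those in this dataset.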