#  Mushroom Classifier - RandomForestClassifier & Logistic Regression

## Name: Erin Moore

## Notebook Overview

In this notebook, I investigate the K-Nearest Neighbors (KNN) and PCA algorithms using sklearn's `KNeighborsClassifier` and `PCA` classes. I use a `RandomForestClassifier` and a `LogisticRegression` model to predict whether a mushroom is edible based on the mushroom dataset. KNN is used to fill in missing values in the dataset, and PCA is used to reduce the dimensionality of the data. The goal is to evaluate the effect of dimensionality reduction on two common models and to gain experience using KNN for imputation. This notebook is broken down into 7 sections:

1. Import Data
    - Import the .csv file containing the mushroom dataset
2. Investigate and Fix Data
    - Print out the data and fill in any missing values using the KNN algorithm
3. Train on Full Dataset
    - Encode the full dataset, train the `RandomForestClassifier` and `LogisticRegression`, and comment on training time
4. Evaluate Performance on Full Dataset
    - Calculate accuracy, precision, and recall scores for each model and comment on the results
5. Reduce Dimensionality
    - Apply PCA to the full dataset and display the old and new dimensions and the reduction percentage
6. Train on Reduced Dataset
    - Retrain the `RandomForestClassifier` and `LogisticRegression` on the reduced dataset and comment on training times
7. Compare Performance
    - Tabulate `RandomForestClassifier` and `LogisticRegression` performance data for both datasets and comment on the results

## Preliminaries

In [109]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d 
import matplotlib as mpl
from matplotlib import cm
import numpy as np
import pandas as pd
%matplotlib inline
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
import os


## Import Data

In [110]:
data = pd.read_csv('expanded.csv')

## Investigate and Fix Data

In [111]:
data.describe()

| | potability | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8416 | 8416 | 8416 | 8416 | 8416 | 8416 | 8416 | 8416 | 8416 | 8416 | ... | 8416 | 8416 | 8416 | 8416 | 8416 | 8416 | 8416 | 8416 | 8416 | 8416 |
| unique | 2 | 6 | 4 | 10 | 2 | 9 | 2 | 2 | 2 | 12 | ... | 4 | 9 | 9 | 1 | 4 | 3 | 5 | 9 | 6 | 7 |
| top | EDIBLE | CONVEX | SCALY | BROWN | NO | NONE | FREE | CLOSE | BROAD | BUFF | ... | SMOOTH | WHITE | WHITE | PARTIAL | WHITE | ONE | PENDANT | WHITE | SEVERAL | WOODS |
| freq | 4488 | 3796 | 3268 | 2320 | 5040 | 3808 | 8200 | 6824 | 5880 | 1728 | ... | 5076 | 4744 | 4640 | 8416 | 8216 | 7768 | 3968 | 2424 | 4064 | 3160 |


In [112]:
# showing categories and counts for each feature
for column in data.columns:
    print(data[column].value_counts())

EDIBLE       4488
POISONOUS    3928
Name: potability, dtype: int64
CONVEX     3796
FLAT       3292
KNOBBED     840
BELL        452
SUNKEN       32
CONICAL       4
Name: cap-shape, dtype: int64
SCALY      3268
SMOOTH     2684
FIBROUS    2460
GROOVES       4
Name: cap-surface, dtype: int64
BROWN       2320
GRAY        2096
RED         1500
YELLOW      1072
WHITE       1040
BUFF         168
PINK         144
CINNAMON      44
GREEN         16
PURPLE        16
Name: cap-color, dtype: int64
NO         5040
BRUISES    3376
Name: bruises, dtype: int64
NONE        3808
FOUL        2160
FISHY        576
SPICY        576
ANISE        400
ALMOND       400
PUNGENT      256
CREOSOTE     192
MUSTY         48
Name: odor, dtype: int64
FREE        8200
ATTACHED     216
Name: gill-attachment, dtype: int64
CLOSE      6824
CROWDED    1592
Name: gill-spacing, dtype: int64
BROAD     5880
NARROW    2536
Name: gill-size, dtype: int64
BUFF         1728
PINK         1556
WHITE        1232
BROWN        1112
CHOCOL
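
Before replacing the `?` placeholders, it can help to see which columns actually contain them. The following is a minimal sketch on a hypothetical toy frame (the `toy` values are illustrative and not taken from `expanded.csv`); the same expression could be applied to `data`.

```python
import pandas as pd

# Toy frame standing in for the mushroom data (values are illustrative only)
toy = pd.DataFrame({
    "stalk-root": ["BULBOUS", "?", "EQUAL", "?"],
    "odor": ["NONE", "FOUL", "NONE", "ANISE"],
})

# Count '?' placeholders per column; columns without any report 0
placeholder_counts = (toy == "?").sum()
print(placeholder_counts)
```

Columns with a non-zero count are the ones that need imputation.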

In [113]:
# replacing '?' placeholders with NaN so pandas' missing-value functions can be used
# (DataFrame.replace avoids the chained assignment data[column][mask] = ...)
data = data.replace('?', np.nan)

In [114]:
# encoding data
data_fill = data.drop('stalk-root', axis = 1)
data_fill = pd.get_dummies(data_fill)

# getting indices of nan values and non nan values for imputation
idx_nan = data.index[data['stalk-root'].isna()]
idx_else = data.index[data['stalk-root'].notna()]

# splitting data for imputation
data_tofill = data_fill.loc[idx_nan]
data_filler = data_fill.loc[idx_else]
y_filler = data['stalk-root'].loc[idx_else]

In [115]:
# using KNN to find missing values
from sklearn.neighbors import KNeighborsClassifier


params = {'n_neighbors':5,
          'weights':'uniform',
          'leaf_size':30,
          'p':2}

knn = KNeighborsClassifier(**params)

knn.fit(data_filler,y_filler)
missing_values = knn.predict(data_tofill)

print(missing_values)

['EQUAL' 'BULBOUS' 'EQUAL' ... 'EQUAL' 'EQUAL' 'EQUAL']
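
The imputation steps above (encode the other features, fit KNN on rows with a known `stalk-root`, predict the rest) can be sketched end to end on a tiny example. The `toy` frame and `n_neighbors=1` are assumptions chosen so the result is easy to follow; the notebook itself uses 5 neighbors on the real data.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy data: the last two rows are missing their stalk-root value
toy = pd.DataFrame({
    "odor": ["NONE", "NONE", "FOUL", "FOUL", "NONE", "FOUL"],
    "bruises": ["BRUISES", "BRUISES", "NO", "NO", "BRUISES", "NO"],
    "stalk-root": ["EQUAL", "EQUAL", "BULBOUS", "BULBOUS", None, None],
})

# One-hot encode every feature except the column being imputed
features = pd.get_dummies(toy.drop("stalk-root", axis=1), dtype=float)
known = toy["stalk-root"].notna()

# Fit on rows with a known label, then predict the missing ones
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(features[known], toy.loc[known, "stalk-root"])
toy.loc[~known, "stalk-root"] = knn.predict(features[~known])

print(toy)
```

Each missing row takes the label of the known row whose encoded features match it most closely.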


In [116]:
# filling missing values
data.loc[idx_nan, 'stalk-root'] = missing_values

## Train on Full Dataset

In [117]:
from sklearn.model_selection import train_test_split

# splitting up full dataset with imputed values for training and testing
y = data['potability']
y = np.where(y.str.contains("EDIBLE"), 1, 0)

X = data.drop('potability', axis = 1)
X = pd.get_dummies(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [118]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# creating models
rfr = RandomForestClassifier(random_state=42)
lr = LogisticRegression()


In [119]:
%%time

# training random forest classifier and timing
rfr.fit(X_train,y_train)

CPU times: user 416 ms, sys: 13 ms, total: 429 ms
Wall time: 475 ms


RandomForestClassifier(random_state=42)

In [120]:
%%time

# training logistic regression and timing
lr.fit(X_train,y_train)

CPU times: user 312 ms, sys: 25.9 ms, total: 338 ms
Wall time: 419 ms


LogisticRegression()

## Evaluate Performance on Full Dataset

In [121]:
# predicting
rfr_preds = rfr.predict(X_test)
lr_preds = lr.predict(X_test)


In [122]:
from sklearn.metrics import accuracy_score, recall_score, precision_score

# outputting performance metrics
acc_rfr_orig = round(accuracy_score(y_test, rfr_preds),4)
print("RandomForestClassifier Accuracy score (full dataset): ", acc_rfr_orig)

acc_lr_orig = round(accuracy_score(y_test, lr_preds),4)
print("Logistic Regression Accuracy score (full dataset): ", acc_lr_orig)

rec_rfr_orig = round(recall_score(y_test, rfr_preds),4)
print("RandomForestClassifier Recall score (full dataset): ", rec_rfr_orig)

rec_lr_orig = round(recall_score(y_test, lr_preds),4)
print("Logistic Regression Recall score (full dataset): ", rec_lr_orig)

prec_rfr_orig = round(precision_score(y_test, rfr_preds),4)
print("RandomForestClassifier Precision score (full dataset): ",  prec_rfr_orig)

prec_lr_orig = round(precision_score(y_test, lr_preds),4)
print("Logistic Regression Precision score (full dataset): ", prec_lr_orig)

RandomForestClassifier Accuracy score (full dataset):  1.0
Logistic Regression Accuracy score (full dataset):  1.0
RandomForestClassifier Recall score (full dataset):  1.0
Logistic Regression Recall score (full dataset):  1.0
RandomForestClassifier Precision score (full dataset):  1.0
Logistic Regression Precision score (full dataset):  1.0


Since every score is 1.0, both the RandomForestClassifier and the Logistic Regression classified every mushroom in the test set correctly when trained on the full dataset.
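
To make the three metrics concrete, here is a small worked example that derives them from a confusion matrix. The labels are hypothetical (1 = edible, 0 = poisonous, matching the encoding used for `y`) and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = edible, 0 = poisonous
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 1, 1, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted-edible that are truly edible
recall = tp / (tp + fn)                     # share of truly edible that were found

print(accuracy, precision, recall)
```

With edible as the positive class, a false positive is a poisonous mushroom labeled edible, so precision is the metric most directly tied to safety here.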

## Reduce Dimensionality

In [123]:
from sklearn.decomposition import PCA

# reducing dataset with PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# outputting reduction metrics
full_dims = X.shape[1]
reduced_dims = X_reduced.shape[1]
print('Original number of dimensions: ',full_dims)
print('Number of dimensions after PCA: ',reduced_dims)
print('Percentage decrease in dimensions: ',round((full_dims-reduced_dims)/full_dims*100,2),'%')

Original number of dimensions:  116
Number of dimensions after PCA:  2
Percentage decrease in dimensions:  98.28 %
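
The dimension count alone does not say how much information the two components keep. A fitted `PCA` exposes this through `explained_variance_ratio_`. The sketch below uses random binary data (`X_demo`, an assumption standing in for the one-hot encoded features), so its numbers say nothing about the mushroom dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random binary matrix standing in for one-hot encoded features (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 10)).astype(float)

pca_demo = PCA(n_components=2).fit(X_demo)

# Fraction of the total variance captured by each retained component
retained = pca_demo.explained_variance_ratio_.sum()
print("Variance retained by 2 components:", round(retained, 4))
```

In this notebook the equivalent check would be `pca.explained_variance_ratio_.sum()` after the cell above.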


## Train on Reduced Dataset

In [124]:
# splitting reduced data for training and testing
X_train_reduced, X_test_reduced, y_train, y_test = train_test_split(X_reduced, y, test_size=0.20, random_state=42)
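
One caveat: the PCA above is fit on all of `X` before the split, so the test rows influence the components. A possible alternative, sketched here on synthetic data (`X_demo` and `y_demo` are assumptions, not the mushroom data), is to split first and wrap PCA and the model in a pipeline so the components are learned from the training rows only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic binary features and a label derived from the first two columns
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 2, size=(300, 12)).astype(float)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Split first, so the test rows never reach PCA.fit
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

# The pipeline fits PCA on X_tr only and reuses that projection for X_te
pipe = make_pipeline(PCA(n_components=2), LogisticRegression())
pipe.fit(X_tr, y_tr)
score = pipe.score(X_te, y_te)
print("Test accuracy:", round(score, 4))
```

This changes the evaluation protocol, so scores obtained this way are not directly comparable to the ones reported in this notebook.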

In [125]:
# creating new models for reduced data
rfr_reduced = RandomForestClassifier(random_state=42)
lr_reduced = LogisticRegression()

In [126]:
%%time

# fitting and timing random forest classifier on reduced data
rfr_reduced = rfr_reduced.fit(X_train_reduced,y_train)

CPU times: user 613 ms, sys: 17.2 ms, total: 631 ms
Wall time: 705 ms


In [127]:
%%time

# fitting and timing logistic regression on reduced data
lr_reduced = lr_reduced.fit(X_train_reduced,y_train)

CPU times: user 22 ms, sys: 3.65 ms, total: 25.6 ms
Wall time: 46.3 ms
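
A single `%%time` measurement can vary noticeably from run to run, which matters when the timings being compared are only a few hundred milliseconds apart. One way to get a steadier number is to repeat the fit and take the median. The helper `median_fit_time` below is hypothetical (it is not part of sklearn), and the data is synthetic.

```python
import time

import numpy as np
from sklearn.linear_model import LogisticRegression


def median_fit_time(model, X, y, repeats=5):
    """Fit the model `repeats` times and return the median wall time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        model.fit(X, y)
        durations.append(time.perf_counter() - start)
    return float(np.median(durations))


# Synthetic two-feature problem, similar in shape to the PCA-reduced data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

fit_seconds = median_fit_time(LogisticRegression(), X_demo, y_demo)
print("Median fit time (ms):", round(fit_seconds * 1000, 2))
```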


## Compare Performance

In [128]:
# predicting
rfr_preds = rfr_reduced.predict(X_test_reduced)
lr_preds = lr_reduced.predict(X_test_reduced)

In [129]:
# outputting performance metrics
acc_rfr_red = round(accuracy_score(y_test, rfr_preds),4)
print("RandomForestClassifier Accuracy score (reduced dataset): ", acc_rfr_red)

acc_lr_red = round(accuracy_score(y_test, lr_preds),4)
print("Logistic Regression Accuracy score (reduced dataset): ", acc_lr_red)

rec_rfr_red = round(recall_score(y_test, rfr_preds),4)
print("RandomForestClassifier Recall score (reduced dataset): ", rec_rfr_red)

rec_lr_red = round(recall_score(y_test, lr_preds),4)
print("Logistic Regression Recall score (reduced dataset): ", rec_lr_red)

prec_rfr_red = round(precision_score(y_test, rfr_preds),4)
print("RandomForestClassifier Precision score (reduced dataset): ", prec_rfr_red)

prec_lr_red = round(precision_score(y_test, lr_preds),4)
print("Logistic Regression Precision score (reduced dataset): ", prec_lr_red)

RandomForestClassifier Accuracy score (reduced dataset):  0.9388
Logistic Regression Accuracy score (reduced dataset):  0.8818
RandomForestClassifier Recall score (reduced dataset):  0.9513
Logistic Regression Recall score (reduced dataset):  0.9712
RandomForestClassifier Precision score (reduced dataset):  0.9357
Logistic Regression Precision score (reduced dataset):  0.8352


In [131]:
import plotly.graph_objects as go

# tabulating results (times are the wall times in ms reported by the %%time cells above)
metrics = ['Accuracy', 'Precision', 'Recall', 'Wall time (ms)']
fig = go.Figure(data=[go.Table(
    header=dict(values=['Models', 'Item', 'Full Data', 'PCA Reduced']),
    cells=dict(values=[
        ['Random Forest'] * 4 + ['Logistic Regression'] * 4,
        metrics * 2,
        [acc_rfr_orig, prec_rfr_orig, rec_rfr_orig, 475, acc_lr_orig, prec_lr_orig, rec_lr_orig, 419],
        [acc_rfr_red, prec_rfr_red, rec_rfr_red, 705, acc_lr_red, prec_lr_red, rec_lr_red, 46.3],
    ]),
)])
fig.show()
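
Since the Plotly figure does not survive in a static export, the same comparison can also be held in a pandas DataFrame. The sketch below hard-codes the scores and wall times printed by the cells above; the `comparison` name and the index layout are my own choices.

```python
import pandas as pd

# Row labels: every metric for each model, in the same order as the values below
index = pd.MultiIndex.from_product(
    [["Random Forest", "Logistic Regression"],
     ["Accuracy", "Precision", "Recall", "Wall time (ms)"]],
    names=["Model", "Item"],
)

# Values copied from the printed outputs earlier in this notebook
comparison = pd.DataFrame(
    {
        "Full Data": [1.0, 1.0, 1.0, 475, 1.0, 1.0, 1.0, 419],
        "PCA Reduced": [0.9388, 0.9357, 0.9513, 705, 0.8818, 0.8352, 0.9712, 46.3],
    },
    index=index,
)
print(comparison)
```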

### Conclusions
Based on the table above, both models perform worse on the reduced dataset, but the loss is modest given that the number of dimensions dropped by about 98% (from 116 to 2). For the logistic regression, which is a fairly simple model, this means we can expect decent results with much less training time (46.3 ms versus 419 ms wall time). The RandomForestClassifier, however, takes longer to train on the reduced dataset (705 ms versus 475 ms wall time). A possible explanation, not verified here, is that the two continuous principal components separate the classes less cleanly than the one-hot encoded features, so the trees have to grow deeper.

Even so, the RandomForestClassifier performs very well on the reduced dataset, clearly outperforming the logistic regression in accuracy (0.9388 versus 0.8818) and precision (0.9357 versus 0.8352) while losing slightly in recall (0.9513 versus 0.9712). Since edible is the positive class, this means the RandomForestClassifier is more likely to make a correct prediction and less likely to produce a false positive (labeling a poisonous mushroom as edible), but slightly more likely to produce a false negative (labeling an edible mushroom as poisonous). Overall, the RandomForestClassifier is the better choice on the reduced data, especially because the logistic regression is much more likely to say a poisonous mushroom is edible.

I would also conclude that using the logistic regression on the full dataset is the best option overall, because it did not misclassify any mushrooms in the test set and had a lower training time than the RandomForestClassifier.