# DTSC 680: Assignment 6

### Marisa Love


# Description

This project uses the Agaricus-Lepiota mushroom data, which records many attributes of different mushroom samples that can be used to determine whether each sample is poisonous or edible. No single simple rule separates the two classes, so the dataset identifies several combinations of attributes that predict with high accuracy whether a mushroom is poisonous.

There are multiple steps to this project. First, several rows have missing values, all in the 'stalk-root' column, so a KNeighborsClassifier is used to predict the labels of the missing data from the other stalk-related features, such as 'stalk-shape'. Before the model can be trained, the features have to be one-hot-encoded and the response label-encoded, which is needed when using categorical data. Because the rows with missing values do not contain all of the same feature values as the rows without missing values, the categories have to be specified when one-hot-encoding. Otherwise the training and test datasets would end up with different numbers of features, and the classifier could not use the test dataset to make predictions. The predicted values are then imputed into the original data in place of the missing values.

With the newly completed data set, after one-hot-encoding and label-encoding as needed, two models are trained to predict whether a sample is poisonous or edible. The first is a RandomForestClassifier and the second a LogisticRegression model. After checking the training time, accuracy, precision, and recall for each model, a Principal Component Analysis is run to reduce the number of features used in training while still explaining 95% of the variance. The RandomForestClassifier and LogisticRegression models are retrained on the reduced dataset and compared to the full-data models on training time, accuracy, precision, and recall.

### Import Mushroom Data Set

In [1]:
#load mushrooms data with feature names and move poisonous/edible from index to a column

import numpy as np
import pandas as pd

data = pd.read_csv('agaricus-lepiota.data', header=None, names=(['cap-shape','cap-surface','cap-color','bruises','odor',
                           'gill-attachment','gill-spacing','gill-size','gill-color',
                           'stalk-shape','stalk-root','stalk-surface-above-ring','stalk-surface-below-ring',
                           'stalk-color-above-ring','stalk-color-below-ring','veil-type','veil-color',
                           'ring-number','ring-type','spore-print-color','population','habitat']))

data.reset_index(inplace=True)
data

,index,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,e,k,s,n,f,n,a,c,b,y,...,s,o,o,p,o,o,p,b,c,l
8120,e,x,s,n,f,n,a,c,b,y,...,s,o,o,p,n,o,p,b,v,l
8121,e,f,s,n,f,n,a,c,b,n,...,s,o,o,p,o,o,p,b,c,l
8122,p,k,y,n,f,y,f,c,n,b,...,k,w,w,p,w,o,e,w,v,l


# Impute Missing Values

In [2]:
#response data will be the column with missing values
#2480 missing values are all in the 'stalk-root' column and denoted with a '?'
#change '?' to na so we can locate rows with missing values

data['stalk-root'] = data['stalk-root'].replace('?', np.nan)
response = data['stalk-root']
response.info()

<class 'pandas.core.series.Series'>
RangeIndex: 8124 entries, 0 to 8123
Series name: stalk-root
Non-Null Count  Dtype 
--------------  ----- 
5644 non-null   object
dtypes: object(1)
memory usage: 63.6+ KB


In [3]:
#choose features that may be related to the stalk-root data to use for imputing
features = data[['stalk-shape','stalk-surface-above-ring','stalk-surface-below-ring','stalk-color-above-ring','stalk-color-below-ring']]

In [4]:
#create training data for KNN from rows without missing values

data_no_miss = data.dropna()
X_train = data_no_miss[['stalk-shape','stalk-surface-above-ring','stalk-surface-below-ring','stalk-color-above-ring','stalk-color-below-ring']]
X_train

,stalk-shape,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring
0,e,s,s,w,w
1,e,s,s,w,w
2,e,s,s,w,w
3,e,s,s,w,w
4,t,s,s,w,w
...,...,...,...,...,...
7986,e,y,y,n,n
8001,e,y,y,n,n
8038,e,s,s,w,w
8095,e,k,y,c,c


In [5]:
#non-missing labels from the response column to use in training

y_train = data_no_miss['stalk-root']

In [6]:
#rows with missing values to use when making predictions for the missing values

data_miss = data.copy()
data_miss = data_miss[data_miss['stalk-root'].isnull()]
X_test = data_miss[['stalk-shape','stalk-surface-above-ring','stalk-surface-below-ring','stalk-color-above-ring','stalk-color-below-ring']]
X_test

,stalk-shape,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring
3984,e,s,s,e,w
4023,t,k,s,w,w
4076,e,s,f,w,w
4100,t,k,s,p,p
4104,t,s,s,p,p
...,...,...,...,...,...
8119,e,s,s,o,o
8120,e,s,s,o,o
8121,e,s,s,o,o
8122,t,s,k,w,w


In [7]:
#one-hot-encode feature data
#specify categories so the training and test data match in number of features

from sklearn.preprocessing import OneHotEncoder

#note: scikit-learn 1.2 renamed sparse to sparse_output (sparse was removed in 1.4)
one_hot_enc = OneHotEncoder(drop='first', sparse=False,
                            categories=[['e','t'],['f','k','s','y'],['f','k','s','y'],
                                        ['b','c','e','g','n','o','p','w','y'],
                                        ['b','c','e','g','n','o','p','w','y']])
X_train_trans = one_hot_enc.fit_transform(X_train)

In [8]:
#transform the test data with the encoder
X_test_trans = one_hot_enc.transform(X_test)
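
The effect of listing the categories explicitly can be checked on a toy example. The sketch below is only an illustration: the frames `train` and `test`, their column names, and their values are made up and are not the mushroom data. It shows that an encoder fitted without a category list rejects a value it never saw in training, while an encoder given the full list produces the same number of columns for both frames.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the test rows contain a colour ('o') never seen in training.
train = pd.DataFrame({"stalk-shape": ["e", "t", "e"], "stalk-color": ["w", "n", "w"]})
test = pd.DataFrame({"stalk-shape": ["e", "t"], "stalk-color": ["o", "w"]})

# Without a category list, the encoder only knows the values present in `train`,
# so the unseen 'o' makes transform() raise a ValueError.
naive = OneHotEncoder(drop="first").fit(train)
unseen_raised = False
try:
    naive.transform(test)
except ValueError as err:
    unseen_raised = True
    print("unseen category:", err)

# Listing every possible category up front fixes the column layout,
# no matter which values happen to occur in the rows used for fitting.
categories = [["e", "t"], ["n", "o", "w"]]
encoder = OneHotEncoder(drop="first", categories=categories)

train_encoded = encoder.fit_transform(train).toarray()
test_encoded = encoder.transform(test).toarray()

print(train_encoded.shape, test_encoded.shape)
```

The example calls `.toarray()` on the default sparse result instead of passing `sparse=False`, so it does not depend on which name that argument has in the installed scikit-learn version.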

In [9]:
#label encode response data

from sklearn.preprocessing import LabelEncoder

label_enc = LabelEncoder()
y_train_trans = label_enc.fit_transform(y_train)

In [10]:
#train a KNN classifier on rows without missing data to predict labels for missing values

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

knn.fit(X_train_trans, y_train_trans)

KNeighborsClassifier()

In [11]:
predictions = knn.predict(X_test_trans)
predictions[:10]

array([1, 2, 1, 0, 0, 0, 0, 1, 1, 1])

In [12]:
#convert predicted labels from encoded data to corresponding original letter data
missing_values = label_enc.inverse_transform(predictions)
missing_values[:10]

array(['c', 'e', 'c', 'b', 'b', 'b', 'b', 'c', 'c', 'c'], dtype=object)

In [13]:
#impute predictions for missing values into dataset containing only rows with missing values
#combine with other rows that were not missing values to get full dataset
data_miss['stalk-root'] = missing_values
data_imputed = data_no_miss.merge(data_miss, how='outer')
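
Merging on the shared columns rebuilds the full table, but the merged frame gets a fresh integer index rather than keeping the original row labels. An alternative, sketched here on a made-up miniature frame (`toy` and `predicted` are hypothetical stand-ins, not objects from this notebook), is to write the predictions into the missing positions directly with a boolean mask. This assumes the predictions are in the same row order as the rows with missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the mushroom frame; NaN marks a missing stalk-root.
toy = pd.DataFrame({
    "stalk-shape": ["e", "t", "e", "t"],
    "stalk-root": ["b", np.nan, "c", np.nan],
})

# Stand-in for the KNN output: one predicted label per missing row, in row order.
predicted = np.array(["e", "b"], dtype=object)

# Boolean mask of the rows to fill; assigning through .loc leaves
# the row order and the index of the frame unchanged.
missing = toy["stalk-root"].isna()
toy_imputed = toy.copy()
toy_imputed.loc[missing, "stalk-root"] = predicted

print(toy_imputed)
```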

In [36]:
data_imputed

,index,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,e,k,s,n,f,n,a,c,b,y,...,s,o,o,p,o,o,p,b,c,l
8120,e,x,s,n,f,n,a,c,b,y,...,s,o,o,p,n,o,p,b,v,l
8121,e,f,s,n,f,n,a,c,b,n,...,s,o,o,p,o,o,p,b,c,l
8122,p,k,y,n,f,y,f,c,n,b,...,k,w,w,p,w,o,e,w,v,l


### Graded Concept Question #1

Why don't we one-hot encode the response data to train the KNN model instead?

One-hot-encoding is used to turn categorical data into binary features: each feature column becomes several new columns, each containing only a 0 or 1 to specify whether the value is or is not in a particular category for that feature. With the response data, we are trying to predict what its value is. If we one-hot-encoded it, the single 'stalk-root' target would become several 0/1 columns, so the estimator would have to support that multi-column (multilabel) target format and output an array of 0s and 1s for every sample. The label encoder instead changes each non-numerical label to an integer, so each possible value has its own label and the predictor only has to output one number per sample, which can be converted back to the original letter afterwards.
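
The difference in shape between the two encodings can be seen on a handful of labels. This is a minimal sketch with made-up stalk-root values, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Illustrative stalk-root labels (made up for this sketch).
labels = np.array(["b", "c", "e", "b", "r"])

# LabelEncoder: one integer per sample, in a 1-D array.
label_enc = LabelEncoder()
y_label = label_enc.fit_transform(labels)
print(y_label, y_label.shape)

# OneHotEncoder: one 0/1 column per class, in a 2-D array.
y_onehot = OneHotEncoder().fit_transform(labels.reshape(-1, 1)).toarray()
print(y_onehot.shape)

# The integer codes map straight back to the original letters.
recovered = label_enc.inverse_transform(y_label)
print(recovered)
```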

## RandomForestClassifier and LogisticRegression

In [15]:
#separate features and response
features = data_imputed.drop('index', axis=1)
response = data_imputed['index']

In [16]:
#one-hot-encode all features with imputed values
#note: use sparse_output=False instead of sparse=False on scikit-learn 1.2 and later
one_hot_enc2 = OneHotEncoder(drop='first', sparse=False)
features_trans = one_hot_enc2.fit_transform(features)

In [17]:
#label encode response data
label_enc2 = LabelEncoder()
response_trans = label_enc2.fit_transform(response)

In [18]:
#create train and test sets from transformed features and response
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features_trans, response_trans)

In [19]:
#train a RandomForestClassifier to predict if edible

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=42)
%time rfc.fit(X_train, y_train)

CPU times: total: 234 ms
Wall time: 241 ms


RandomForestClassifier(random_state=42)

In [20]:
#RandomForestClassifier predictions
y_pred_rfc = rfc.predict(X_test)
y_pred_rfc

array([0, 1, 1, ..., 1, 0, 1])

In [21]:
#train a LogisticRegression model to predict if edible

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
%time log_reg.fit(X_train, y_train)

CPU times: total: 250 ms
Wall time: 69.8 ms


LogisticRegression()

In [22]:
#LogisticRegression predictions
y_pred_log = log_reg.predict(X_test)
y_pred_log

array([0, 1, 1, ..., 1, 0, 1])

### Graded Concept Question #2

Could we train these two models by one-hot encoding the response data instead, being careful to specify that the drop parameter of the OneHotEncoder class is set to ‘first’? Why or why not?

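One-hot-encoding is not meant to be used on y labels. It creates a separate 0/1 column for each possible category of a single column. Because the response here has only two classes, edible and poisonous, dropping the first column would leave a single 0/1 column that carries the same information as the label-encoded response, only shaped as a two-dimensional column rather than the one-dimensional array that scikit-learn classifiers expect for a single target. So it could be made to work, but it gains nothing over label encoding. With more than two classes, several 0/1 columns would remain even after dropping the first, making it unclear what the predicted value would be indicating unless the predictor was outputting an array of 0s and 1s to specify which category is being predicted.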
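
For a two-class response such as edible/poisonous, the shapes produced by the two encoders can be checked directly. The sketch below uses a few made-up labels, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Illustrative edible/poisonous labels (made up for this sketch).
response = np.array(["p", "e", "e", "p", "e"])

# Label encoding gives a 1-D array of integer codes.
y_label = LabelEncoder().fit_transform(response)

# One-hot encoding with drop='first' leaves one column for a two-class response.
y_onehot = (
    OneHotEncoder(drop="first")
    .fit_transform(response.reshape(-1, 1))
    .toarray()
)

print(y_label.shape, y_onehot.shape)

# Flattening the single remaining column recovers the label-encoded values.
same_values = np.array_equal(y_onehot.ravel(), y_label)
print(same_values)
```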

# Accuracy, Precision, and Recall

In [23]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

rfc_acc = accuracy_score(y_test, y_pred_rfc)
rfc_prec = precision_score(y_test, y_pred_rfc)
rfc_recall = recall_score(y_test, y_pred_rfc)

log_acc = accuracy_score(y_test, y_pred_log)
log_prec = precision_score(y_test, y_pred_log)
log_recall = recall_score(y_test, y_pred_log)

print('RandomForestClassifier Accuracy: ', rfc_acc)
print('RandomForestClassifier Precision: ', rfc_prec)
print('RandomForestClassifier Recall: ', rfc_recall)

print('LogisticRegression Accuracy: ', log_acc)
print('LogisticRegression Precision: ', log_prec)
print('LogisticRegression Recall: ', log_recall)

RandomForestClassifier Accuracy:  1.0
RandomForestClassifier Precision:  1.0
RandomForestClassifier Recall:  1.0
LogisticRegression Accuracy:  0.999015263417036
LogisticRegression Precision:  1.0
LogisticRegression Recall:  0.9979777553083923


Overall, both models did very well in their predictions. The RandomForestClassifier made perfect predictions on the test set, scoring 100% on every metric. A perfect score can be a sign of overfitting, but because the predictions were made on a portion of the dataset that the model had not seen during training, we can be fairly confident that its ability to generalize to similar samples is strong. The logistic regression was not far behind, with over 99% on every metric, so the difference may not even be significant. Its precision was 100%, so every sample it labeled as poisonous actually was poisonous. Its recall was the lowest of the three scores, so it labeled a small number of poisonous samples as edible. It seems there are nuances in the decision rules that a simple logistic regression does not quite capture.
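
The reading of precision and recall above relies on poisonous being the positive class. LabelEncoder sorts the class letters, so 'e' becomes 0 and 'p' becomes 1, and the scikit-learn scorers treat 1 as the positive label by default. The sketch below uses made-up labels to show how a single missed poisonous sample lowers recall while leaving precision at 1.0.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 1 = poisonous, 0 = edible.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]   # one poisonous sample predicted as edible

# For binary labels the flattened matrix is ordered tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of predicted-poisonous samples that are poisonous
recall = tp / (tp + fn)      # share of truly poisonous samples that were found

print(precision, recall)
```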

# Dimensionality Reduction with PCA

In [24]:
#first check the total number of features being used in the full dataset
from sklearn.decomposition import PCA

pca = PCA()
pca.fit_transform(X_train)
pca.n_components_

94

In [25]:
#train a PCA model to reduce the number of training features but still retain 95% of the explained variance
pca2 = PCA(n_components=0.95)
X_train_reduced = pca2.fit_transform(X_train)
pca2.n_components_

38

In [26]:
(94 - 38)/94 * 100

59.57446808510638

This is a 59.6% reduction in the number of dimensions: 38 principal components remain of the 94 features we started with.
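
How `PCA(n_components=0.95)` arrives at its component count can be reproduced by hand from the cumulative explained variance ratio. The sketch below runs on synthetic random data (an assumption made for illustration, not the encoded mushroom features) and compares the manual count with the one PCA picks.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the encoded training matrix (random, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) * np.arange(1, 11)   # columns with unequal variance

# Fit all components, then accumulate the share of variance each one explains.
full = PCA().fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative share exceeds 95%.
k = int(np.argmax(cumulative > 0.95)) + 1

# Passing a float between 0 and 1 asks PCA to make the same choice itself.
auto = PCA(n_components=0.95).fit(X)
print(k, auto.n_components_)
```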

In [27]:
#reduce testing dataset to use in making predictions
X_test_reduced = pca2.transform(X_test)

### RandomForest and LogisticRegression with Reduced Dataset

In [28]:
#train new prediction models with the reduced training dataset
rfc2 = RandomForestClassifier(random_state=42)
%time rfc2.fit(X_train_reduced, y_train)

CPU times: total: 2.25 s
Wall time: 1.69 s


RandomForestClassifier(random_state=42)

In [29]:
#make predictions with newly trained model on reduced test dataset
y_pred_rfc2 = rfc2.predict(X_test_reduced)

In [30]:
log_reg2 = LogisticRegression()
%time log_reg2.fit(X_train_reduced, y_train)

CPU times: total: 62.5 ms
Wall time: 23.9 ms


LogisticRegression()

In [31]:
y_pred_log2 = log_reg2.predict(X_test_reduced)

In [32]:
rfc_acc2 = accuracy_score(y_test, y_pred_rfc2)
rfc_prec2 = precision_score(y_test, y_pred_rfc2)
rfc_recall2 = recall_score(y_test, y_pred_rfc2)

log_acc2 = accuracy_score(y_test, y_pred_log2)
log_prec2 = precision_score(y_test, y_pred_log2)
log_recall2 = recall_score(y_test, y_pred_log2)

print('RandomForestClassifier Accuracy: ', rfc_acc2)
print('RandomForestClassifier Precision: ', rfc_prec2)
print('RandomForestClassifier Recall: ', rfc_recall2)

print('LogisticRegression Accuracy: ', log_acc2)
print('LogisticRegression Precision: ', log_prec2)
print('LogisticRegression Recall: ', log_recall2)

RandomForestClassifier Accuracy:  1.0
RandomForestClassifier Precision:  1.0
RandomForestClassifier Recall:  1.0
LogisticRegression Accuracy:  0.9871984244214672
LogisticRegression Precision:  0.9868554095045501
LogisticRegression Recall:  0.9868554095045501


In [34]:
index = pd.MultiIndex.from_product([['Random Forest', 'Logistic Regression'], ['Accuracy','Precision','Recall','Time']],
                                   names = ['Models', 'Item'])

pd.DataFrame(data=[[rfc_acc, rfc_acc2], [rfc_prec, rfc_prec2], [rfc_recall, rfc_recall2],['234 ms', '2.25 s'], 
                   [log_acc, log_acc2], [log_prec, log_prec2], [log_recall, log_recall2], ['250 ms', '62.5 ms']],
            index = index, columns = ['Full Data', 'PCA Reduced'])

Models,Item,Full Data,PCA Reduced
Random Forest,Accuracy,1.0,1.0
Random Forest,Precision,1.0,1.0
Random Forest,Recall,1.0,1.0
Random Forest,Time,234 ms,2.25 s
Logistic Regression,Accuracy,0.999015,0.987198
Logistic Regression,Precision,1.0,0.986855
Logistic Regression,Recall,0.997978,0.986855
Logistic Regression,Time,250 ms,62.5 ms


The RandomForestClassifier did just as well on the reduced dataset as it did on the full dataset. Discarding the 56 principal components that together accounted for less than 5% of the variance did not affect the accuracy, precision, or recall of the Random Forest. This increases my confidence that the model is not overfitting the data, because even on the reduced data it still performs perfectly. The Random Forest model is powerful at classifying these samples and, surprisingly, trained faster on the full dataset (234 ms of CPU time) than on the reduced dataset (2.25 s), so there is no reason to reduce the dataset for this model since the accuracy, precision, and recall were not affected.

The Logistic Regression, however, lost over 1 percentage point on each of the 3 metrics. That isn't a huge difference, but it dropped all scores below 99%, meaning more poisonous mushrooms being misclassified. The precision score suffered the most: on the full dataset the model was always correct when it labeled something as poisonous, but on the reduced dataset it started to mislabel some edible samples as poisonous. On the other hand, the second Logistic Regression model trained much faster (62.5 ms of CPU time) than both Random Forest models and the Logistic Regression on the full dataset, so the reduction may be more useful on a larger dataset.
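
One plausible reason for the slower Random Forest fit on the reduced data, offered here as an assumption rather than something measured in this notebook, is that a one-hot column takes only two distinct values and so offers a tree a single place to split, while a principal component mixes many columns and takes many distinct values. The sketch below counts distinct values per column on synthetic 0/1 data before and after PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 0/1 matrix standing in for one-hot-encoded features (illustration only).
rng = np.random.default_rng(0)
X_binary = rng.integers(0, 2, size=(500, 20)).astype(float)

# Project onto principal components, as was done for the reduced dataset.
X_reduced = PCA(n_components=0.95).fit_transform(X_binary)

# Distinct values per column: each one-hot column has at most two,
# while each principal component takes many different values.
distinct_binary = [np.unique(col).size for col in X_binary.T]
distinct_reduced = [np.unique(col).size for col in X_reduced.T]

print(max(distinct_binary), min(distinct_reduced))
```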