# Mushroom Classification Using KNN, PCA, and Tree-Based Models

This notebook was developed as part of a graduate-level applied machine learning course (DTSC 680). The goal was to build a mushroom classification model from scratch using a real-world dataset, applying K-Nearest Neighbors for missing data imputation, Random Forest and Logistic Regression for binary classification, and PCA for dimensionality reduction.

No templates or starter code were provided. All model building, preprocessing, and documentation were written independently to fulfill the assignment and practice industry-standard machine learning workflows.

**Dataset:** UCI Mushroom Dataset  
**URL:** https://archive.ics.uci.edu/ml/datasets/Mushroom


In [2]:
#importing necessary libraries

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_score

In [3]:
#Import and load the mushroom data set.
#Open the .names file in a text editor to see the column names, then pass them in order.
mushroom1 = pd.read_csv("agaricus-lepiota.data", names = ['classes', 'cap-shape', 'cap-surface', 'cap-color', 'bruises?', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat'], na_values = "?")

In [4]:
#Will work off of a copy of the dataset so as not to accidentally permanently alter anything in the dataset.

mushroom = mushroom1.copy()
mushroom.head()

Unnamed: 0,classes,cap-shape,cap-surface,cap-color,bruises?,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [5]:
#Looking at my data here I can see that the stalk-root attribute has the most missing values

mushroom.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   classes                   8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises?                  8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                5644 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring  

In [6]:
#Create the features (X) and response (y) for the imputation step. The response here is 'stalk-root', the column
#with missing values, and every other column (including 'classes') is used as a feature to predict it.

X = mushroom.drop(columns=['stalk-root'])
y = mushroom['stalk-root']

In [7]:
#Creating a dataset for my response values

y_df = pd.DataFrame(y)


In [8]:
#Identifying which rows in the response dataset are missing a value for stalk-root.
y_df.loc[y_df['stalk-root'].isnull()]
    

Unnamed: 0,stalk-root
3984,
4023,
4076,
4100,
4104,
...,...
8119,
8120,
8121,
8122,


In [9]:
#Changing my NaNs to a float so it doesn't get one-hot-encoded, and so I can easily identify the values later that need to be filled with imputed data.
y_df.loc[y_df['stalk-root'].isnull()] = y_df.loc[y_df['stalk-root'].isnull()].astype(float).fillna(0.0)

In [10]:
#Checking my dataframe to make sure it filled NaNs with 0's.
y_df

Unnamed: 0,stalk-root
0,e
1,c
2,c
3,e
4,e
...,...
8119,0.0
8120,0.0
8121,0.0
8122,0.0


In [12]:
#Creating a dataframe with the missing values, so I can identify the indices. 

missing = mushroom.loc[mushroom['stalk-root'].isnull()]

In [13]:
dropindices = missing.index.tolist() 
#Create an object with a list of indices that correspond to rows with missing values

In [14]:
dropindices

[3984,
 4023,
 4076,
 4100,
 4104,
 4196,
 4200,
 4283,
 4291,
 4326,
 4329,
 4331,
 4357,
 4376,
 4380,
 4396,
 4419,
 4429,
 4459,
 4461,
 4494,
 4497,
 4498,
 4519,
 4522,
 4533,
 4534,
 4557,
 4610,
 4611,
 4647,
 4662,
 4663,
 4673,
 4717,
 4722,
 4777,
 4801,
 4806,
 4812,
 4814,
 4817,
 4824,
 4826,
 4831,
 4844,
 4846,
 4858,
 4859,
 4860,
 4869,
 4871,
 4884,
 4888,
 4898,
 4899,
 4903,
 4904,
 4911,
 4926,
 4931,
 4938,
 4939,
 4945,
 4946,
 4951,
 4964,
 4965,
 4966,
 4984,
 4986,
 4993,
 4996,
 4999,
 5001,
 5005,
 5010,
 5015,
 5023,
 5041,
 5051,
 5052,
 5053,
 5070,
 5071,
 5074,
 5085,
 5098,
 5101,
 5104,
 5105,
 5106,
 5113,
 5116,
 5131,
 5132,
 5133,
 5134,
 5135,
 5141,
 5142,
 5144,
 5145,
 5147,
 5149,
 5150,
 5151,
 5154,
 5155,
 5160,
 5162,
 5163,
 5166,
 5168,
 5169,
 5175,
 5176,
 5179,
 5186,
 5188,
 5192,
 5193,
 5194,
 5197,
 5202,
 5205,
 5206,
 5208,
 5210,
 5212,
 5216,
 5218,
 5219,
 5221,
 5223,
 5224,
 5226,
 5227,
 5229,
 5230,
 5233,
 5239,
 5241,

In [12]:
#Label Encode all of the response data 


label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)


In [13]:
#creating a new dataframe with the label encoded response data for easier manipulation.
y_df_encoded = pd.DataFrame(y_encoded)

In [14]:
#separating out my response training set by dropping the rows with missing values based on the indices identified above.
y_train = y_df_encoded.drop(dropindices)
y_train

Unnamed: 0,0
0,2
1,1
2,1
3,2
4,2
...,...
7986,0
8001,0
8038,0
8095,1


In [15]:
#One Hot Encoding the features.

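#sparse_output=False returns a dense array (scikit-learn 1.2+; older releases call this parameter sparse).
one_hot_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')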
X_encoded = one_hot_encoder.fit_transform(X)

In [16]:
#putting into a dataframe again for easier manipulation
X_df_encoded = pd.DataFrame(X_encoded)

In [17]:
#Creating a test set by indexing the rows with missing values based on the indices identified above.
X_test = X_encoded[dropindices]

In [18]:
#Creating my training set by dropping the rows with missing values based on the indices identified above.
X_train = X_df_encoded.drop(dropindices)


In [19]:
#Convert back into an array
X_train = np.array(X_train)

In [20]:
#scikit-learn expects a 1D target array, so flatten the single-column y_train with ravel
y_train = np.array(y_train).ravel()

In [21]:
#Check the shapes to make sure these make sense.
print('X_train shape is: ', X_train.shape)
print('y_train shape is: ', y_train.shape)
print('X_test shape is: ', X_test.shape)

X_train shape is:  (5644, 114)
y_train shape is:  (5644,)
X_test shape is:  (2480, 114)


In [22]:
#Instantiate an object with the KNeighborsClassifier class with default hyperparameters.

knn = KNeighborsClassifier()


#Fit the Model
knn.fit(X_train, y_train)

KNeighborsClassifier()

In [23]:
#Calculate predictions on the X_test set
predictions = knn.predict(X_test)

In [24]:
#Put those predictions back into a dataframe so I can easily see the indices that correspond to the predicted values.
predictions_df = pd.DataFrame(predictions, index=dropindices)

In [25]:
#Preparing to merge my new dataframe into the old one, so I have to change the column name to match.
predictions_df = predictions_df.rename(columns={0:'stalk-root'})
#Check to make sure that column name changed
predictions_df

Unnamed: 0,stalk-root
3984,0
4023,0
4076,2
4100,0
4104,0
...,...
8119,2
8120,2
8121,2
8122,0


In [26]:
#Knowing that LabelEncoder assigns integer codes in sorted (alphabetical) order, I wanted to see which values correspond to the numbers
print(label_encoder.classes_)


['b' 'c' 'e' 'r' nan]


In [27]:
#Making a dictionary with corresponding encoded value and original values. 
encoded_dict = {0 : 'b', 1:'c', 2:'e', 3:'r'}


In [28]:
#Replacing the encoded values with the original letter categorical values
predictions_df = predictions_df.replace(encoded_dict).sort_index()
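The hand-built dictionary can also be recovered from the fitted encoder. The following is a minimal sketch on a toy label array (the values are hypothetical, not the notebook's data), showing that `classes_` is stored in sorted order and that `inverse_transform` maps integer codes back to the original letters.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the observed stalk-root labels (hypothetical values).
labels = np.array(['e', 'c', 'c', 'b', 'r', 'e'])

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, so position i holds the label encoded as i.
mapping = dict(enumerate(encoder.classes_))

# inverse_transform maps integer codes back to the original letters.
decoded = encoder.inverse_transform(np.array([0, 2, 2, 1]))
print(mapping)
print(decoded)
```

Building the mapping from the encoder avoids typing the dictionary by hand, which matters if the set of categories changes.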


In [29]:
#Create an object that stores the missing values list.
missing_values = predictions_df['stalk-root'].tolist()
missing_values[:10]

['b', 'b', 'e', 'b', 'b', 'b', 'b', 'b', 'b', 'e']

When you have computed the missing values, create a data structure (i.e. a list) called
missing_values that contains all of the imputed values (in terms of the original
categorical/letter data and not the encoded numeric data) in order of increasing index
from the original data set. <b>You must then print the first 10 instances of
missing_values to the screen so we can check your work. </b>Finally, you will impute the
missing values back into the original data set before continuing, so that the next step
starts fresh with a complete data set in terms of the raw data values.

In [30]:
print('The first 10 missing values for stalk root column are: ', missing_values[:10])

The first 10 missing values for stalk root column are:  ['b', 'b', 'e', 'b', 'b', 'b', 'b', 'b', 'b', 'e']
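The imputation steps above can be condensed into a few lines. This is only a sketch on a hypothetical six-row frame, where `root` plays the role of `stalk-root`; it uses `pd.get_dummies` and `n_neighbors=1` for brevity, which are assumptions and not the notebook's settings.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy frame: 'root' plays the role of stalk-root.
toy = pd.DataFrame({
    'shape': ['x', 'x', 'b', 'b', 'x', 'b'],
    'color': ['n', 'n', 'w', 'w', 'n', 'w'],
    'root':  ['e', 'e', 'c', 'c', np.nan, np.nan],
})

# One-hot encode the predictor columns only.
X = pd.get_dummies(toy[['shape', 'color']]).to_numpy(dtype=float)
known = toy['root'].notna().to_numpy()

# Fit on rows where 'root' is observed, then predict the rows where it is missing.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X[known], toy.loc[known, 'root'].to_numpy())

imputed = toy.copy()
imputed.loc[~known, 'root'] = knn.predict(X[~known])
print(imputed['root'].tolist())
```

Using a boolean mask instead of a list of indices keeps the training rows and the rows to be imputed aligned without dropping anything by hand.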


In [31]:
#Looking back at my first dataframe with the response variable, I change the 0.0 placeholders back to NaN so the missing rows are visible again.
#A single vectorized replace covers the whole column, so no loop is needed.

y_df['stalk-root'] = y_df['stalk-root'].replace(0.0, np.nan)

In [32]:
#Checking to make sure my NaNs are back in the y_df 
y_df

Unnamed: 0,stalk-root
0,e
1,c
2,c
3,e
4,e
...,...
8119,
8120,
8121,
8122,


In [33]:
#Finally, I can add the new dataset with my original dataframe for the response variable. 
fulldf = predictions_df.combine_first(y_df)
fulldf

Unnamed: 0,stalk-root
0,e
1,c
2,c
3,e
4,e
...,...
8119,e
8120,e
8121,e
8122,b
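To make the merge step concrete, here is a small sketch of `combine_first` on hypothetical data: one frame holds the original column with gaps, the other holds imputed values indexed by the rows that were missing.

```python
import numpy as np
import pandas as pd

# Hypothetical original column with gaps at index 1 and 3.
original = pd.DataFrame({'stalk-root': ['e', np.nan, 'c', np.nan]})

# Hypothetical imputed values, indexed by the rows that were missing.
filled = pd.DataFrame({'stalk-root': ['b', 'e']}, index=[1, 3])

# combine_first keeps the caller's non-null values and falls back to the
# argument everywhere else, aligning on the index.
merged = filled.combine_first(original)
print(merged['stalk-root'].tolist())
```

Because the two frames are aligned on their index, the imputed values land on the correct rows regardless of the order in which they were predicted.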


In [34]:
#Create a new dataframe from the mushroom data set, replacing the 'stalk-root' column with the new column that has the missing values filled in.

newX = mushroom.drop(columns=['stalk-root'])
newX['stalk-root'] = fulldf['stalk-root']
newX

Unnamed: 0,classes,cap-shape,cap-surface,cap-color,bruises?,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat,stalk-root
0,p,x,s,n,t,p,f,c,n,k,...,w,w,p,w,o,p,k,s,u,e
1,e,x,s,y,t,a,f,c,b,k,...,w,w,p,w,o,p,n,n,g,c
2,e,b,s,w,t,l,f,c,b,n,...,w,w,p,w,o,p,n,n,m,c
3,p,x,y,w,t,p,f,c,n,n,...,w,w,p,w,o,p,k,s,u,e
4,e,x,s,g,f,n,f,w,b,k,...,w,w,p,w,o,e,n,a,g,e
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,e,k,s,n,f,n,a,c,b,y,...,o,o,p,o,o,p,b,c,l,e
8120,e,x,s,n,f,n,a,c,b,y,...,o,o,p,n,o,p,b,v,l,e
8121,e,f,s,n,f,n,a,c,b,n,...,o,o,p,o,o,p,b,c,l,e
8122,p,k,y,n,f,y,f,c,n,b,...,w,w,p,w,o,e,w,v,l,b


In [35]:
newmushroom = newX.copy()
newmushroom.info()

#Checking to make sure all of the missing values are filled in.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   classes                   8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises?                  8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-surface-above-ring  8124 non-null   object
 12  stalk-surface-below-ring  8124 non-null   object
 13  stalk-color-above-ring    8124 non-null   object
 14  stalk-color-below-ring  

### Why don't we one-hot encode the response variable for KNN?

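K-Nearest Neighbors measures distance between feature vectors only; the response is never part of the distance calculation. Each prediction is a majority vote over the labels of the nearest training rows, so the response only needs to identify the class. Label encoding gives scikit-learn the single 1D target vector it expects for a multiclass problem, whereas a one-hot encoded response would be a 2D indicator matrix that scikit-learn interprets as a multilabel target.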
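A short sketch can show where distances enter a KNN classifier. The arrays below are hypothetical; `kneighbors` reports the distances between feature rows, and `predict` then votes over the labels of those neighbors.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical one-hot style features and a 1D array of string labels.
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
y = np.array(['b', 'b', 'e', 'e'])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

query = np.array([[1.0, 0.0]])
# kneighbors returns distances between feature rows; labels play no part here.
distances, neighbor_idx = knn.kneighbors(query)
# predict returns the majority label among those neighbors.
prediction = knn.predict(query)
print(distances, y[neighbor_idx[0]], prediction)
```

The labels are plain strings in this sketch, which illustrates that the encoding of the response does not affect the distance calculation.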

In [36]:
#Creating a new set of features and responses. We want to now determine if a mushroom is edible or poisonous, so that falls under "classes."

features = newmushroom.drop(columns='classes')
response = newmushroom['classes']

In [37]:
#Encode first, then train_test_split

#Instantiate a new object of the One Hot Encoder class
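#sparse_output=False returns a dense array (scikit-learn 1.2+; older releases call this parameter sparse).
OHEnew = OneHotEncoder(sparse_output=False, handle_unknown='ignore')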

#Fit and transform the features.
features_coded = OHEnew.fit_transform(features)

In [38]:
#Instantiate a new object of the Label Encoder class.

LEnew = LabelEncoder()

#Fit and transform the response data
response_coded = LEnew.fit_transform(response)

In [39]:
#Split training and test set, 20% test size. Giving them the suffix "_new" to distinguish from previously used X_train/y_train used above for the KNN imputer.
X_train_new, X_test_new, y_train_new, y_test = train_test_split(features_coded, response_coded,
                                   test_size=0.20)
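The cells above encode the full data set and then split it. A common alternative is to split first and fit the encoder on the training rows only. The sketch below uses hypothetical toy data, and the `random_state` and `stratify` arguments are assumptions added for reproducibility; the notebook's split uses neither.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for the mushroom features and classes.
features = pd.DataFrame({'odor': list('nnaaffnnaaff'),
                         'gill': list('bnbnbnbnbnbn')})
response = pd.Series(list('eeeeppeeeepp'))

# random_state and stratify are assumptions added here for reproducibility.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, response, test_size=0.25, random_state=0, stratify=response)

# Fit the encoder on the training rows only, then reuse it on the test rows.
encoder = OneHotEncoder(handle_unknown='ignore')
X_tr_enc = encoder.fit_transform(X_tr).toarray()
X_te_enc = encoder.transform(X_te).toarray()
print(X_tr_enc.shape, X_te_enc.shape)
```

With `handle_unknown='ignore'`, a category that appears only in the test rows is encoded as all zeros instead of raising an error.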

In [40]:
#Instantiate objects of each class. The assignment states to only do a random state for the Random Forest Classifier.

rnd_clf = RandomForestClassifier(random_state=42) 
log_clf = LogisticRegression()

<b>When you train both the RandomForestClassifier and LogisticRegression models,
use the magic command %%time to time how long it takes to complete training.</b>

In [41]:
%%timeit
rnd_clf.fit(X_train_new, y_train_new)

#Calculate the time to fit the model.
#A previous timeit run showed: 337 ms ± 32.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each); that figure is used in the summary table.


313 ms ± 25.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [42]:
%%timeit
log_clf.fit(X_train_new, y_train_new)

#Calculate the time to fit the model.
#A previous timeit run showed: 72.2 ms ± 4.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each); that figure is used in the summary table.

#In both runs, Logistic Regression trains faster than the Random Forest Classifier.

99 ms ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
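The assignment text asks for `%%time`, which measures one execution of the cell, while `%%timeit` repeats the cell and reports a mean and standard deviation. Outside a notebook, a single-run measurement can be sketched with `time.perf_counter`; the synthetic data and its size below are arbitrary assumptions.

```python
import time
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data; the sizes are arbitrary.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] > 0).astype(int)

model = LogisticRegression()

# One wall-clock measurement of a single fit, similar in spirit to %%time.
start = time.perf_counter()
model.fit(X, y)
elapsed = time.perf_counter() - start
print(f'single fit: {elapsed * 1000:.1f} ms')
```

A single measurement is noisier than the repeated runs of `%%timeit`, so small differences between models should be read with caution.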


### Why not one-hot encode the response for Random Forest or Logistic Regression?

While technically possible, one-hot encoding the binary response would produce two columns that are exact complements of each other, so one of them would have to be dropped to recover a usable target. Label encoding avoids that step and gives both classifiers the 1D target vector (0 = edible, 1 = poisonous) that scikit-learn expects for binary classification.

In [43]:
#Make predictions based off the test data. 

pred_rnd = rnd_clf.predict(X_test_new)
pred_log = log_clf.predict(X_test_new)

<b>Compute the accuracy, precision, and recall scores for a test set.</b>

In [44]:
###For the Random Forest Classifier Model:

#Compute accuracy, precision, and recall. scikit-learn expects the true labels first, then the predictions.
#The average parameter is set to 'micro'.
acc_score = accuracy_score(y_test, pred_rnd)
prec_score = precision_score(y_test, pred_rnd, average='micro')
recall_score1 = recall_score(y_test, pred_rnd, average='micro')

#Print all three scores.
print('Accuracy=%s' % (acc_score))
print('Precision=%s' % (prec_score))
print('Recall=%s' % (recall_score1))

Accuracy=1.0
Precision=1.0
Recall=1.0


In [45]:
###For the Logistic Regression Model:

#Compute accuracy, precision, and recall. scikit-learn expects the true labels first, then the predictions.
#The average parameter is set to 'micro'.
acc_score2 = accuracy_score(y_test, pred_log)
prec_score2 = precision_score(y_test, pred_log, average='micro')
recall_score2 = recall_score(y_test, pred_log, average='micro')

#Print all three scores.
print('Accuracy=%s' % (acc_score2))
print('Precision=%s' % (prec_score2))
print('Recall=%s' % (recall_score2))

Accuracy=1.0
Precision=1.0
Recall=1.0
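With a single label per row, micro-averaged precision and recall both reduce to accuracy, which is why the three printed values always match. The sketch below uses hypothetical labels (1 = poisonous, 0 = edible) to contrast micro averaging with the default `average='binary'`, which scores the positive class only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 1 = poisonous, 0 = edible.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 1])

acc = accuracy_score(y_true, y_pred)
micro_p = precision_score(y_true, y_pred, average='micro')
micro_r = recall_score(y_true, y_pred, average='micro')

# The default average='binary' scores the positive class (label 1) only.
bin_p = precision_score(y_true, y_pred)
bin_r = recall_score(y_true, y_pred)
print(acc, micro_p, micro_r, bin_p, bin_r)
```

For a safety question such as edibility, the recall of the poisonous class is often the more informative number, since it counts how many poisonous mushrooms were missed.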


<b> Briefly discuss the performance of your models in terms of these values</b>

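Both models score 1.0 for accuracy, precision, and recall on the held-out test set. Perfect scores can be a warning sign of overfitting, but overfitting shows up as a gap between training and test performance, and these scores were computed on rows the models did not see during training. A more likely reading is that the one-hot encoded features separate the two classes very cleanly. The evidence is still limited: I used default hyperparameters, the data set is not particularly large (8,124 rows and 22 features), the evaluation rests on a single random 80/20 split, and the imputation step used the 'classes' column as a feature. Because the scores use average='micro' on a single-label problem, precision and recall are necessarily equal to accuracy.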
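One way to check whether a high score depends on a particular split is k-fold cross-validation. This is a sketch on hypothetical synthetic data in which one feature determines the label; `n_estimators=50` and `cv=5` are arbitrary choices, not settings from the assignment.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data in which one feature determines the label.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 8)).astype(float)
y = X[:, 0].astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
# Five held-out folds give a spread of scores instead of one number.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean(), scores.std())
```

If the fold scores are uniformly high and close together, the result is unlikely to be an artifact of one lucky split.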

<b> Perform dimensionality reduction using PCA and keep 95% of the variance. </b>
    
Q: By what percentage were you able to reduce the number of dimensions of the training set? 
<p><i> A: See below for work. PCA removed 75 of the 116 columns, so I reduced the number of dimensions in the training set by 64.66%. That is, what remains is 35.34% of the original number of columns. </i>

Q: How many features (i.e. dimensions) are you left with after reducing dimensionality?
    
<i> A: We are left with 41 features</i>
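The percentages follow directly from the two shapes reported in the cells below (116 columns before PCA, 41 after). A short arithmetic check:

```python
original_dims = 116  # columns before PCA (X_train_new.shape[1])
reduced_dims = 41    # columns after PCA (X_train2D.shape[1])

removed = original_dims - reduced_dims
pct_reduction = 100 * removed / original_dims
pct_remaining = 100 * reduced_dims / original_dims
print(f'removed {removed} columns: {pct_reduction:.2f}% reduction, '
      f'{pct_remaining:.2f}% remaining')
```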

In [46]:
#Perform dimensionality reduction using PCA.

#Instantiate an object of the PCA class with the 95% variance
pca = PCA(n_components=0.95)

#Fit and transform the training data
X_train2D = pca.fit_transform(X_train_new)

In [47]:
#Check old shape of data
X_train_new.shape

(6499, 116)

In [48]:
#Check new shape of data and this is how I can see we have 41 features. 
X_train2D.shape

(6499, 41)
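Passing a float to `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below illustrates this on hypothetical synthetic data (10 columns driven by 3 latent factors), not on the mushroom features.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 10 columns driven by 3 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

# n_components_ is the number of components kept; their explained variance
# ratios sum to at least the requested 0.95.
kept = pca.n_components_
explained = pca.explained_variance_ratio_.sum()
print(kept, round(explained, 4))
```

Inspecting `explained_variance_ratio_` shows how much each retained component contributes, which helps in judging whether 95% is a sensible threshold.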

<b>Train two new models, a RandomForestClassifier and a LogisticRegression model,
on this reduced dataset to predict whether a mushroom is edible or not. Again, time the
training of these two models on the reduced dataset. You will again use the default
hyperparameter values for both models, with a random state parameter of 42 for the
RandomForestClassifier.</b>

In [49]:
#Instantiate new objects of the Random Forest Classifier and Logistic Regression classes.

rnd_clf_pca = RandomForestClassifier(random_state=42) 
log_clf_pca = LogisticRegression()

In [50]:
%%timeit
rnd_clf_pca.fit(X_train2D, y_train_new)

#Timing again now that we have reduced data.
#A previous timeit run showed: 1.75 s ± 5.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each); that figure is used in the summary table.

1.81 s ± 79.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [51]:
%%timeit
log_clf_pca.fit(X_train2D, y_train_new)

#Timing again now that we have reduced data.
#A previous timeit run showed: 28.2 ms ± 357 µs per loop (mean ± std. dev. of 7 runs, 10 loops each); that figure is used in the summary table.

39.9 ms ± 6.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [52]:
#Perform dimension reduction on the test data set as well so it will work with the new prediction. 

Xtest2D = pca.transform(X_test_new)
Xtest2D.shape

(1625, 41)

In [53]:
#Make predictions with the new Random Forest Classifier model with the reduced test set.

pred_rnd_pca = rnd_clf_pca.predict(Xtest2D)

In [54]:
#Make predictions with the new Logistic Regression model with the reduced test set.

pred_log_pca = log_clf_pca.predict(Xtest2D)

In [55]:
###For the Random Forest Classifier model on the PCA-reduced data:
#Compute accuracy, precision, and recall (true labels first, then predictions). The average parameter is set to 'micro'.
acc_score3 = accuracy_score(y_test, pred_rnd_pca)
prec_score3 = precision_score(y_test, pred_rnd_pca, average='micro')
recall_score3 = recall_score(y_test, pred_rnd_pca, average='micro')

#Print all three scores.
print('Accuracy=%s' % (acc_score3))
print('Precision=%s' % (prec_score3))
print('Recall=%s' % (recall_score3))

Accuracy=1.0
Precision=1.0
Recall=1.0


In [56]:
###For the Logistic Regression model on the PCA-reduced data:
#Compute accuracy, precision, and recall (true labels first, then predictions). The average parameter is set to 'micro'.
acc_score4 = accuracy_score(y_test, pred_log_pca)
prec_score4 = precision_score(y_test, pred_log_pca, average='micro')
recall_score4 = recall_score(y_test, pred_log_pca, average='micro')

#Print all three scores.
print('Accuracy=%s' % (acc_score4))
print('Precision=%s' % (prec_score4))
print('Recall=%s' % (recall_score4))

Accuracy=0.9926153846153846
Precision=0.9926153846153846
Recall=0.9926153846153846


In [57]:
#Creating a multi-index dataframe to tabulate information.

idx = pd.MultiIndex.from_product([["Random Forest", "Logistic Regression"],
                                  ["Accuracy", "Precision", "Recall", "Time"]],
                                 names=['Models', 'Item'])
col = ['Full Data', 'PCA Reduced']

tabledata = [[acc_score, acc_score3],
    [prec_score, prec_score3],
    [recall_score1, recall_score3],
    ['337 ms ± 32.9 ms per loop', '1.75 s ± 5.22 ms per loop'],
    [acc_score2, acc_score4],
    [prec_score2, prec_score4],
    [recall_score2, recall_score4],
    ['72.2 ms ± 4.05 ms per loop', '28.2 ms ± 357 µs per loop']]

table = pd.DataFrame(tabledata, idx, col)
table

Models,Item,Full Data,PCA Reduced
Random Forest,Accuracy,1.0,1.0
Random Forest,Precision,1.0,1.0
Random Forest,Recall,1.0,1.0
Random Forest,Time,337 ms ± 32.9 ms per loop,1.75 s ± 5.22 ms per loop
Logistic Regression,Accuracy,1.0,0.992615
Logistic Regression,Precision,1.0,0.992615
Logistic Regression,Recall,1.0,0.992615
Logistic Regression,Time,72.2 ms ± 4.05 ms per loop,28.2 ms ± 357 µs per loop


## Conclusion

### Model Comparison and PCA Impact

Both the Random Forest and Logistic Regression models were evaluated on the full dataset and on a dimensionally reduced version using PCA (retaining 95% of the variance). The results show:

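Random Forest: Applying PCA did not improve training speed. Training time rose from about 337 ms to about 1.75 s, roughly a fivefold increase. Accuracy stayed at 1.0 on the test set with and without PCA. A perfect score on held-out data is not by itself evidence of overfitting, but it comes from a single split with default hyperparameters, so cross-validation and tuning would give a more reliable picture of generalization.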

Logistic Regression: PCA cut training time from about 72 ms to about 28 ms at the cost of a small drop in test accuracy (1.0 to 0.9926). The reduced feature set is therefore more efficient, but on this split it is slightly less accurate, not better at generalizing.

In summary, PCA offered a modest benefit for Logistic Regression by trading a small amount of accuracy for faster training, while for Random Forest it left accuracy unchanged and made training slower. Cross-validation, hyperparameter tuning, and feature engineering are the natural next steps for checking how well both models generalize.
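The size of the timing changes can be read off the summary table with a little arithmetic. The values below are the means copied from that table, converted to milliseconds; the variable names are chosen here for illustration.

```python
# Mean timings copied from the summary table, in milliseconds.
rf_full_ms, rf_pca_ms = 337.0, 1750.0
lr_full_ms, lr_pca_ms = 72.2, 28.2

rf_slowdown = rf_pca_ms / rf_full_ms   # > 1 means PCA made training slower
lr_speedup = lr_full_ms / lr_pca_ms    # > 1 means PCA made training faster
print(f'Random Forest: {rf_slowdown:.1f}x slower with PCA')
print(f'Logistic Regression: {lr_speedup:.1f}x faster with PCA')
```

These ratios come from single timing runs on one machine, so they indicate the direction and rough size of the effect, not a precise benchmark.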