Juliana Xu

####Mushroom Classification - Deliverable 2

#**Problem Statment**

My project will aim to classify a given mushroom as either poisonous or safe to eat based on a given set of physical characteristics. Each mushroom is classified based on a maximum of 21 features that range from odor, to cap colour, to habitat. The classes are poisonous(1) and edible(0). The final conceptualization will be realized through a simple web app.

#**Data Preprocessing**

I will be using the following dataset: https://www.kaggle.com/uciml/mushroom-classification. The dataset consists of 23 labels and 8124 samples. Out of the 8124 samples, 48% percent are classified as poisonous and 52% as edible. Furthermore, the first label represents the class and the subsequent 22 represent features. However, I will be removing the label "veil-type" as 100% of the samples share the same veil type. There are also other features which exhibit similar behaviour, having nearly 100% of the samples sharing the same chracteristics. These labels include "veil-color", "gill-attachment", and "ring-number". I have considered removing these labels as well but decided against it as they could still be crucial to determining the edibility of a mushroom. Instead, I will be implementing the L1 regularization method which will automatically remove labels that have little to no effect on the final output.

In addition to removing certain columns, I will need to convert the data in my dataset from categorical to numerical. This is to facilitate the process of implementing different classification algorithms. Currently, the dataset contains solely letters which are used to represent the class and features of each sample. To convert each letter into an integer I will be using the LabelEncoder() from sklearn.preprocessing.

#**Machine Learning Model**

For my project which involves a classification problem, I will be testing out the Support Vector Classifier and the Random Forest Classifer. I will not need to consider the methods for multiclass problems since my project deals with a two class problem. I chose not to implement the k-NN model because my dataset contains a large number of features and the k-NN model is highly sensitive to higher dimensionalities. 

As for training/validation/test set splits, I plan on more or less following the standard. However, the large amount of features in my dataset may call for more samples in order obtain a comprehensive training phase. Therefore, I have decided to split my data 65%/15%/20% for the training, validation, and test sets, respectively. Also due to the numerous features, I am choosing to use the L1 regularization technique. As aforementioned, using this method will shrink the coefficients of less important features to zero, and therefore, eliminating them completely. 

#**Preliminary Results**

In [215]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC 
from sklearn import preprocessing
import pandas as pd
import numpy as np

In [216]:
# Import Dataset
mushrooms = pd.read_csv("/Users/julianaxu/Downloads/mushrooms.csv")

# Delete the colums with label "veil-type". Mushrooms now has shape (8124, 22).
mushrooms = mushrooms.drop("veil-type", axis = 1)

# Create X and y. 
# y is an array of the first column. 
# X is an array of all the subsequent columns. 
X = np.array(mushrooms.iloc[:, 1:])
y = np.array(mushrooms.iloc[:, 0])

# Create training and test sets. Note: train/valid/test = .65/.15/.20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
print("Factor data (y_test): ", y_test)

# Create a validation set.
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size = 0.15)

# Use Label Encoder to turn categorical data into numerical.
le = preprocessing.LabelEncoder()
le.fit(y_train)
y_train = le.transform(y_train)

# Do the same for y_test. Note: 0 = edible, 1 = poisonous
le.fit(y_test)
y_test = le.transform(y_test)

# Do the same for y_valid.
le.fit(y_valid)
y_valid = le.transform(y_valid)

# Use Label Encoder on X_train, X_valid and _test
for i in range(21):
    # X_train
    le.fit(X_train[:, i])
    X_train[:, i] = le.transform(X_train[:, i])
    # X_valid
    le.fit(X_valid[:, i])
    X_valid[:, i] = le.transform(X_valid[:, i])
    # X_test
    le.fit(X_test[:, i])
    X_test[:, i] = le.transform(X_test[:, i])
    
print("Numerical data (y_test): ", y_test)

Factor data (y_test):  ['p' 'p' 'p' ... 'e' 'p' 'e']
Numerical data (y_test):  [1 1 1 ... 0 1 0]


In [217]:
# Using SVM
svm = LinearSVC(class_weight = 'balanced')
svm.fit(X_train, y_train)
svm.score(X_train, y_train)
svm.score(X_test, y_test)

print("Training Accuracy: ", svm_clf.score(X_train, y_train))
print("Test Accuracy: ", svm_clf.score(X_test, y_test))

Training Accuracy:  0.9476828385228095
Test Accuracy:  0.9489230769230769




Using the SVM method, the accuracy is fairly high so it could be a good choice to use this classification technique. The test accuracy is higher than the training accuracy so I do not think there are any issues regarding underfitting or overfitting the data. 

In [218]:
# Using RandomForests
rfc = RandomForestClassifier(n_estimators = 100, min_samples_split = 2, max_depth = None, max_features = 'auto')
rfc.fit(X_train, y_train)
rfc.score(X_train, y_train)
rfc.score(X_test, y_test)

print("Training Accuracy: ", rfc.score(X_train, y_train))
print("Test Accuracy: ", rfc.score(X_test, y_test))

Training Accuracy:  1.0
Test Accuracy:  1.0


Using the Random Forest Classifier, I received both training and test accuracies of 1.0. I am unsure of whether this means there is 100% accuracy when using when using this method or there was some sort of error made.

#**Next Steps**

I would like to continue to look into the Random Forest Classifer and whether or not it would be the most effective method to use in this situation. There might also be additional classification methods that we learn in future lectures whose implementation could be beneficial to my project. Furthermore, I would like to test the accuracy of the model given unknown features (i.e. if a user did not fill out all the features for a mushroom). I want to find out whether or not it would be feasible to allow users to leave some fields blank when using the application.  