# **Lab 5: Wide and Deep Networks**
### Authors: Will Lahners, Edward Powers, and Nino Castellano

## **Describing the Data**

The dataset we chose pertains to mushrooms, specifically whether or not they are poisonous, and is called *mushrooms.csv*. We obtained this data from [Kaggle](https://www.kaggle.com/datasets/uciml/mushroom-classification). The results of our model could benefit food production companies, farmers, or people who enjoy the outdoors. 

We chose this dataset because every feature column is a categorical variable, which makes it well suited to a wide and deep neural network. The dataset contains descriptions of samples corresponding to 23 species of gilled mushrooms from the Agaricus and Lepiota family. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility. In the CSV, the unknown-edibility class is merged with the poisonous class, so the label takes only two values.

## Preparation (4 points total)

> [1 points] Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis. Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created). You have the option of using tf.dataset for processing, but it is not required. 

In [15]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

# (1) Load the data into a pandas DataFrame (read_csv already returns a DataFrame)
data = pd.read_csv('./mushrooms.csv')

# Drop 'veil-type': it takes a single value in this dataset, so it carries no information
data.drop(columns=['veil-type'], inplace=True)

# Map the target to a boolean: edible ('e') -> True, poisonous ('p') -> False
class_mapping = {'e': True, 'p': False}
data['class'] = data['class'].map(class_mapping)

# Rename the target column to something more descriptive
data.rename(columns={'class': 'edible'}, inplace=True)

# Encode any string data as integers for now (Credits to ChatGPT)
le = LabelEncoder()
object_columns = data.select_dtypes(include=['object']).columns

for col in object_columns:
    data[col] = le.fit_transform(data[col])
    
pd.set_option('display.max_columns', None)    
data.head()

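|  | edible | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | 5 | 2 | 4 | 1 | 6 | 1 | 0 | 1 | 4 | 0 | 3 | 2 | 2 | 7 | 7 | 2 | 1 | 4 | 2 | 3 | 5 |
| 1 | True | 5 | 2 | 9 | 1 | 0 | 1 | 0 | 0 | 4 | 0 | 2 | 2 | 2 | 7 | 7 | 2 | 1 | 4 | 3 | 2 | 1 |
| 2 | True | 0 | 2 | 8 | 1 | 3 | 1 | 0 | 0 | 5 | 0 | 2 | 2 | 2 | 7 | 7 | 2 | 1 | 4 | 3 | 2 | 3 |
| 3 | False | 5 | 3 | 8 | 1 | 6 | 1 | 0 | 1 | 5 | 0 | 3 | 2 | 2 | 7 | 7 | 2 | 1 | 4 | 2 | 3 | 5 |
| 4 | True | 5 | 2 | 3 | 0 | 5 | 1 | 1 | 0 | 4 | 1 | 3 | 2 | 2 | 7 | 7 | 2 | 1 | 0 | 3 | 0 | 1 |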


In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 22 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   edible                    8124 non-null   bool 
 1   cap-shape                 8124 non-null   int32
 2   cap-surface               8124 non-null   int32
 3   cap-color                 8124 non-null   int32
 4   bruises                   8124 non-null   int32
 5   odor                      8124 non-null   int32
 6   gill-attachment           8124 non-null   int32
 7   gill-spacing              8124 non-null   int32
 8   gill-size                 8124 non-null   int32
 9   gill-color                8124 non-null   int32
 10  stalk-shape               8124 non-null   int32
 11  stalk-root                8124 non-null   int32
 12  stalk-surface-above-ring  8124 non-null   int32
 13  stalk-surface-below-ring  8124 non-null   int32
 14  stalk-color-above-ring    8124 non-null 

We started by integer-encoding each of our variables with `LabelEncoder`; the one-hot encoding itself is applied later by the Keras feature space. We also dropped 'veil-type' because that column contains only a single value across all samples, so it cannot help our model. Each remaining column is categorical, with each integer code corresponding to a different category of that variable. The categories of each variable are listed in the table below:

| Variable | Number of Classifications | Types of Classifications |
|:-------------|:--------------:|--------------:|
| edible        |       2    |           true or false  |
| cap-shape         |       6       |          bell, conical, flat, sunken, convex, or knobbed    |
| cap-surface        |     4       |        grooves, scaly, smooth, or fibrous   |
| cap-color        |     10       |        gray, green, brown, buff, cinnamon, pink, purple, red, white, or yellow   |
| bruises        |     2       |        true or false   |
| odor        |     9       |        fishy, foul, musty, pungent, spicy, anise, creosote, almond, or none   |
| gill-attachment        |     4       |        descending, free, attached, or notched   |
| gill-spacing        |     3       |        close, distant, or crowded   |
| gill-size        |     2       |        broad or narrow   |
| gill-color        |     12       |        white, black, brown, chocolate, gray, buff, green, yellow, orange, pink, purple or red   |
| stalk-shape        |     2       |        enlarging or tapering   |
| stalk-root        |     7       |        club, cup, equal, rhizomorphs, missing, bulbous, or rooted   |
| stalk-surface-above-ring        |     4       |        scaly, silky, smooth, or fibrous   |
| stalk-surface-below-ring        |     4       |        scaly, silky, smooth, or fibrous   |
| stalk-color-above-ring        |     8       |        gray, cinnamon, orange, pink, red, yellow, buff, or brown   |
| stalk-color-below-ring        |     8       |        gray, cinnamon, orange, pink, red, yellow, buff, or brown   |
| veil-type        |     2       |       partial or universal   |
| veil-color        |     4       |        brown, orange, yellow, or white   |
| ring-number        |     3       |        none, one, or two   |
| ring-type        |     8       |        cobwebby, evanescent, flaring, large, none, pendant, sheathing, or zone   |
| spore-print-color        |     9       |        brown, buff, black, green, orange, purple, white, yellow, or chocolate  |
| population        |     6       |        clustered, numerous, scattered, several, solitary, or abundant    |
| habitat        |     7       |        grasses, meadows, leaves, paths, urban, woods or waste   |
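
The integer encoding and the single-value column check described above can be sketched on a toy frame. The values below are made up for illustration and are not taken from *mushrooms.csv*; `toy`, `constant_cols`, and `odor_mapping` are hypothetical names introduced only for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the real data (illustrative values only)
toy = pd.DataFrame({
    "veil-type": ["p", "p", "p", "p"],
    "odor": ["a", "n", "f", "n"],
})

# A column with a single distinct value cannot help separate the classes
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

# LabelEncoder sorts the distinct strings and assigns codes 0..k-1 in that order
le = LabelEncoder()
odor_codes = le.fit_transform(toy["odor"])
odor_mapping = {str(label): code for code, label in enumerate(le.classes_)}

print(constant_cols)
print(odor_mapping)
print(odor_codes)
```

Keeping a mapping such as `odor_mapping` around makes it possible to translate the integer codes back into the category letters when interpreting the model later.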


> [1 points] Identify groups of features in your data that should be combined into cross-product features. Provide a compelling justification for why these features should be crossed (or why some features should not be crossed). 

In [17]:
import tensorflow as tf
from tensorflow import keras
from keras.utils import FeatureSpace

#Creating Feature Spaces
feature_space= FeatureSpace(
    features= {
        "cap-shape": FeatureSpace.integer_categorical(num_oov_indices=0),
        "cap-surface": FeatureSpace.integer_categorical(num_oov_indices=0),
        "cap-color": FeatureSpace.integer_categorical(num_oov_indices=0),
        "bruises": FeatureSpace.integer_categorical(num_oov_indices=0),
        "odor": FeatureSpace.integer_categorical(num_oov_indices=0),
        "gill-attachment": FeatureSpace.integer_categorical(num_oov_indices=0),
        "gill-spacing": FeatureSpace.integer_categorical(num_oov_indices=0),
        "gill-size": FeatureSpace.integer_categorical(num_oov_indices=0),
        "gill-color": FeatureSpace.integer_categorical(num_oov_indices=0),
        "stalk-shape": FeatureSpace.integer_categorical(num_oov_indices=0),
        "stalk-root": FeatureSpace.integer_categorical(num_oov_indices=0),
        "stalk-surface-above-ring": FeatureSpace.integer_categorical(num_oov_indices=0),
        "stalk-surface-below-ring": FeatureSpace.integer_categorical(num_oov_indices=0),
        "stalk-color-above-ring": FeatureSpace.integer_categorical(num_oov_indices=0),
        "stalk-color-below-ring": FeatureSpace.integer_categorical(num_oov_indices=0),
        "veil-color": FeatureSpace.integer_categorical(num_oov_indices=0),
        "ring-number": FeatureSpace.integer_categorical(num_oov_indices=0),
        "ring-type": FeatureSpace.integer_categorical(num_oov_indices=0),
        "spore-print-color": FeatureSpace.integer_categorical(num_oov_indices=0),
        "population": FeatureSpace.integer_categorical(num_oov_indices=0),
        "habitat": FeatureSpace.integer_categorical(num_oov_indices=0),
    }, crosses=[
        # Cap-Color and Cap Shape
        FeatureSpace.cross(
            feature_names= ('cap-color', 'cap-shape'),
            crossing_dim= 10*6),
        # Odor and Gill-Color
        FeatureSpace.cross(
            feature_names= ('gill-color', 'odor'),
            crossing_dim= 12*9),
        # Bruises and Stalk-Surface-above-ring
        FeatureSpace.cross(
            feature_names= ('bruises', 'stalk-surface-above-ring'),
            crossing_dim= 2*4),
        # Bruises and Stalk-Surface-below-ring
        FeatureSpace.cross(
            feature_names= ('bruises', 'stalk-surface-below-ring'),
            crossing_dim= 2*4),
        # Ring-Type and Stalk-Color-below ring
        FeatureSpace.cross(
            feature_names= ('ring-type', 'stalk-color-below-ring'),
            crossing_dim= 8*8),
        # Ring-Type and Stalk-Color-above ring
        FeatureSpace.cross(
            feature_names= ('ring-type', 'stalk-color-above-ring'),
            crossing_dim= 8*8),
        # Spore-Print Color and Habitat
        FeatureSpace.cross(
            feature_names= ('spore-print-color', 'habitat'),
            crossing_dim= 9*7)
    ],
    output_mode="concat"
)

In this section, we establish the feature space used by our network. First, we tell Keras that every integer column in the dataframe holds categorical codes rather than numeric quantities, which lets the feature space one-hot encode them in the subsequent section. We then declare the crossed features, setting each `crossing_dim` to the product of the two features' category counts.

After conducting preliminary research into the qualities and characteristics commonly associated with poisonous mushrooms, we found that some features in our dataset could be combined into cross-product features. These crosses may improve the model's ability to distinguish between edible and poisonous mushrooms.

We combine the following features into crosses in the feature space:

- **Cap-Color X Cap-Shape**: Certain combinations of cap color and shape might be more indicative of edible or poisonous mushrooms. For example, convex-shaped mushrooms with a brown cap color might be more likely to be edible, while flat-shaped mushrooms with a red cap color might be more likely to be poisonous.

- **Odor X Gill-Color**: The combination of odor and gill color can provide valuable information. For instance, mushrooms with a foul odor and black gills might be more likely to be poisonous, while mushrooms with an almond-like odor and white gills might be more likely to be edible.

- **Bruises X Stalk-Surface**: Combining bruises and stalk surface texture could capture interactions related to the mushroom's response to damage. For example, mushrooms that bruise easily and have a silky stalk surface might be more likely to be poisonous. We cross bruises separately with the stalk surface above the ring and the stalk surface below the ring.

- **Ring-Type X Stalk-Color**: Certain combinations of ring type and stalk color might be indicative of edible or poisonous mushrooms. For instance, mushrooms with an evanescent ring type and a brown stalk color might be more likely to be edible. We cross ring type separately with the stalk color above the ring and the stalk color below the ring.

- **Spore-Print-Color X Habitat**: Combining spore print color and habitat could capture interactions related to the mushroom's reproductive characteristics and preferred environment. For example, mushrooms with a brown spore print color found in wooded habitats might be more likely to be edible.
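
To make the idea of a cross-product feature concrete, here is a minimal NumPy sketch that turns a pair of integer-coded features into one one-hot feature with a slot per combination. It is an illustration only: Keras's `FeatureSpace.cross` hashes the pair into `crossing_dim` bins rather than using the exact index computed here, and `cross_one_hot` is a hypothetical helper, not part of any library.

```python
import numpy as np

def cross_one_hot(a, b, n_a, n_b):
    """One-hot encode each pair (a[i], b[i]) into a vector of n_a * n_b slots."""
    a = np.asarray(a)
    b = np.asarray(b)
    # Each distinct pair gets its own index in [0, n_a * n_b)
    pair_index = a * n_b + b
    encoded = np.zeros((len(pair_index), n_a * n_b), dtype=np.float32)
    encoded[np.arange(len(pair_index)), pair_index] = 1.0
    return encoded

# Illustrative integer codes for three mushrooms
cap_color = np.array([4, 9, 8])   # 10 possible colors
cap_shape = np.array([5, 5, 0])   # 6 possible shapes

crossed = cross_one_hot(cap_color, cap_shape, n_a=10, n_b=6)
print(crossed.shape)
print(crossed.argmax(axis=1))
```

With an exact index like this, the wide branch can learn one weight per color and shape combination, which is what allows it to memorize specific pairings that the individual features cannot express on their own.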


> [1 points] Choose and explain what metric(s) you will use to evaluate your algorithm’s performance. You should give a detailed argument for why this (these) metric(s) are appropriate on your data. That is, why is the metric appropriate for the task (e.g., in terms of the business case for the task). Please note: rarely is accuracy the best evaluation metric to use. Think deeply about an appropriate measure of performance.


Before we begin modeling, we must decide which evaluation metrics are appropriate for judging our network's performance as it pertains to our business case. For classifying mushrooms as edible or poisonous, the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) is a prudent choice, particularly given the critical nature of the classification task.

The primary concern in classifying mushrooms is the potentially severe health risk of incorrectly identifying a poisonous mushroom as edible. Treating poisonous as the positive class, the consequences of false negatives (wrongly predicting that a poisonous mushroom is edible) are far more severe than those of false positives (erroneously identifying an edible mushroom as poisonous). Because our label column `edible` encodes edible as `True`, this dangerous error appears as a false positive when metrics are computed directly on that column. The AUC-ROC metric provides a comprehensive measure of the model's ability to separate the two classes across all possible thresholds, summarizing both sensitivity (true positive rate) and specificity (true negative rate).

Since our primary goal is to avoid false negatives, the ROC curve (which plots the true positive rate against the false positive rate) helps in visualizing and choosing a model with the least number of false negatives at an acceptable false positive rate level.
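
As a rough sketch of how the ROC curve supports this choice, the example below computes the AUC on made-up scores and then finds the highest threshold at which no poisonous mushroom is missed. It treats poisonous as the positive class, which is an assumption of this sketch (the notebook's label column encodes edible as `True`); the arrays `y_poisonous` and `p_poisonous` are invented for illustration and are not model outputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Illustrative labels (1 = poisonous) and predicted probabilities of "poisonous"
y_poisonous = np.array([0, 0, 1, 1, 0, 1, 0, 1])
p_poisonous = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.60, 0.70])

auc = roc_auc_score(y_poisonous, p_poisonous)

# Keep every threshold so the first point with TPR == 1 is easy to locate
fpr, tpr, thresholds = roc_curve(y_poisonous, p_poisonous, drop_intermediate=False)

# Thresholds are sorted from high to low, so argmax finds the highest
# threshold that still catches every poisonous sample
first_full_recall = int(np.argmax(tpr >= 1.0))
safe_threshold = thresholds[first_full_recall]
fpr_at_safe_threshold = fpr[first_full_recall]

print(auc, safe_threshold, fpr_at_safe_threshold)
```

The false positive rate at that threshold is the price paid for missing no poisonous samples: it is the fraction of edible mushrooms that would be discarded unnecessarily.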

> [1 points] Choose the method you will use for dividing your data into training and testing (i.e., are you using Stratified 10-fold cross validation? Shuffle splits? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. Argue why your cross validation method is a realistic mirroring of how an algorithm would be used in practice. Use the method to split your data that you argue for. 


In [18]:
from sklearn.model_selection import ShuffleSplit

shuffle_split = ShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in shuffle_split.split(data):
  df_train= data.iloc[train_index]
  df_test= data.iloc[test_index]

To divide our dataset for training and testing, we use the `ShuffleSplit` class from Scikit-Learn's `model_selection` module. This method suits our dataset because the classes are nearly balanced, with 52% of the observations edible and 48% poisonous, so a random split is unlikely to produce a training or testing set with a badly skewed class ratio even without stratification. A single shuffle split is also computationally cheaper than k-fold cross-validation, since each network is trained only once. We opted for an 80-20 split between the training and testing sets, respectively. This ratio leaves most of the data for training while keeping the testing set (about 1,625 samples) large enough to estimate the model's performance reliably.
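
One way to check the claim that an unstratified split preserves the class balance is to compare the positive rate in each partition. The sketch below does this on synthetic labels drawn with a 52% positive rate; the labels are randomly generated for illustration and are not the mushroom data, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Synthetic boolean labels with roughly 52% positives (illustration only)
rng = np.random.default_rng(0)
labels = rng.random(1000) < 0.52

splitter = ShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(labels))

# Positive rate in each partition; these should be close for balanced data
train_rate = labels[train_idx].mean()
test_rate = labels[test_idx].mean()

print(len(train_idx), len(test_idx))
print(round(train_rate, 3), round(test_rate, 3))
```

If the two rates differed noticeably on the real data, `StratifiedShuffleSplit` from the same module would be a drop-in alternative that enforces the class ratio in both partitions.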

In [25]:
# Build tf.data datasets for the train and test splits
categorical_headers = data.drop(columns=['edible']).columns
batch_size = 64

def create_dataset_from_dataframe(df_input):
    # Labels are kept separate from the feature dictionary
    labels = df_input['edible'].values

    # One (n, 1) integer array per feature column, keyed by column name
    features = {key: value.values[:, np.newaxis]
                for key, value in df_input[categorical_headers].items()}

    # create the Dataset here
    ds = tf.data.Dataset.from_tensor_slices((features, labels))

    # now enable batching and prefetching
    ds = ds.batch(batch_size)
    ds = ds.prefetch(batch_size)

    return ds

ds_train = create_dataset_from_dataframe(df_train)
ds_test = create_dataset_from_dataframe(df_test)

# Adapt the feature space on the training features only (labels stripped),
# so it learns each column's vocabulary for one-hot encoding
ds_train_no_label = ds_train.map(lambda x, _: x)
feature_space.adapt(ds_train_no_label)


In this section, we start by converting our Pandas DataFrames into batched `tf.data.Dataset` objects. This conversion uses a function taken from example 10a on our class's GitHub repository. With the train and test datasets built, we adapt the Keras FeatureSpace object from the preceding section on the training features, which lets it one-hot encode the integer columns and crosses when it is applied to a batch.
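
For intuition about what a `"concat"` output looks like, the following NumPy sketch one-hot encodes two integer-coded columns and concatenates the results into a single feature vector per sample. It is a stand-in for the idea only, not the Keras implementation; in the real feature space the crossed features are appended as well. The helper `one_hot_concat` and the toy arrays are hypothetical.

```python
import numpy as np

def one_hot_concat(columns, sizes):
    """One-hot encode each integer column and concatenate along the feature axis."""
    blocks = [
        np.eye(sizes[name], dtype=np.float32)[codes]  # row i of eye = one-hot of code i
        for name, codes in columns.items()
    ]
    return np.concatenate(blocks, axis=1)

# Illustrative integer codes for three samples
toy_columns = {
    "odor": np.array([0, 2, 1]),     # 3 categories in this toy example
    "bruises": np.array([1, 0, 1]),  # 2 categories
}
toy_sizes = {"odor": 3, "bruises": 2}

encoded = one_hot_concat(toy_columns, toy_sizes)
print(encoded.shape)
print(encoded[0])
```

Each row contains exactly one active slot per original column, so the width of the encoded vector is the sum of the category counts.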

## Modeling (5 points total)


> [2 points] Create at least three combined wide and deep networks to classify your data using Keras (this total of "three" includes the model you will train in the next step of the rubric). Visualize the performance of the network on the training data and validation data in the same plot versus the training iterations.

> *Note: you can use the "history" return parameter that is part of Keras "fit" function to easily access this data.*

> [2 points] Investigate generalization performance by altering the number of layers in the deep branch of the network. Try at least two models (this "two" includes the wide and deep model trained from the previous step). Use the method of cross validation and evaluation metric that you argued for at the beginning of the lab to answer: What model with what number of layers performs superiorly? Use proper statistical methods to compare the performance of different models.

> [1 points] Compare the performance of your best wide and deep network to a standard multi-layer perceptron (MLP). Alternatively, you can compare to a network without the wide branch (i.e., just the deep network). For classification tasks, compare using the receiver operating characteristic and area under the curve. For regression tasks, use Bland-Altman plots and residual variance calculations.  Use proper statistical methods to compare the performance of different models.  


## Exceptional Work (1 points total)
