# Build Your Own Neural Network
In this section, we will build a simple neural network, train it, and evaluate it on held-out test data.
For this exercise, we will use the [Mushroom dataset from the Audubon Society Field Guide](https://archive.ics.uci.edu/dataset/73/mushroom).
This dataset includes 22 physical characteristics of ~8,000 mushrooms spanning 23 species of gilled mushrooms in the Agaricus and Lepiota Family.
Our task is to predict whether a mushroom is edible or poisonous based on its physical characteristics.

By the end of this exercise, participants will be able to:

1. Import the Mushroom dataset from the UCI Machine Learning Repository.
2. Examine and preprocess the data to be fed to the neural network.
3. Build a sequential model neural network using TensorFlow Keras.
4. Evaluate the model's performance on test data.

## Step 1: Importing required libraries and data
The Mushroom dataset is available in the University of California, Irvine Machine Learning Repository, which is a popular repository for machine learning datasets.
Conveniently, the ``ucimlrepo`` Python package provides a simple interface to download and load datasets directly from this repository.

First, we will import the Mushroom dataset using the ``ucimlrepo`` package:
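
The import step might look like the sketch below. It assumes the ``ucimlrepo`` package is installed and that the dataset ID is 73, as in the dataset URL; the helper name ``load_mushroom`` is ours and purely illustrative. The download requires network access, so the call itself is shown only as commented usage.

```python
def load_mushroom():
    """Download the Mushroom dataset (UCI ID 73) and return the dataset object."""
    # Imported inside the function so the package is only needed when called.
    from ucimlrepo import fetch_ucirepo

    return fetch_ucirepo(id=73)


# Typical usage (requires network access):
# mushroom = load_mushroom()
# print(mushroom.metadata)      # dataset-level information
# print(mushroom.variables)     # per-column roles, types, missing-value flags
# X = mushroom.data.features    # pandas DataFrame holding the features
# y = mushroom.data.targets     # pandas DataFrame holding the target column
```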

Let's inspect the metadata:

The metadata tells us that the Mushroom dataset has 8124 instances (samples) and 22 features (physical characteristics), and that some values are missing.
Now that we have loaded the dataset, let's separate the features (``X``) from the target variable and examine the structure of our feature data.

Next, let's isolate and examine our target variable ``y``:

In pandas, a Dtype (data type) specifies how the data in a column should be stored and interpreted.
When we see a Dtype of ``object``, it typically means the column contains strings or a mix of different data types. Let's examine our data further:
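
As a rough illustration of this kind of inspection, the sketch below uses a three-row toy table (the rows are made up; only the column names and codes come from the dataset) instead of the full feature table. Depending on the pandas version, string columns may be reported as ``object`` or as a dedicated string Dtype.

```python
import pandas as pd

# Toy stand-in for the feature table (hypothetical rows, real column names).
X_toy = pd.DataFrame({
    "cap-shape": ["x", "b", "f"],
    "cap-color": ["n", "y", "w"],
    "odor": ["p", "a", "l"],
})

print(X_toy.dtypes)     # Dtype of each column
print(X_toy.head())     # first rows, showing the single-character codes
print(X_toy.nunique())  # number of distinct categories per column
```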

In this dataset, the features are categorical variables stored as strings (which pandas represents as ``object`` Dtype). 
Each feature is encoded with single-character values that represent specific categories.

For a complete reference of all categorical values and their meanings, visit the [UCI Mushroom Dataset page](https://archive.ics.uci.edu/dataset/73/mushroom).

Here are a few examples of the categorical encodings:
 
 * **cap-shape**: 'x' (convex), 'b' (bell), 'f' (flat), etc.
 * **cap-color**: 'n' (brown), 'y' (yellow), 'w' (white), etc.
 * **odor**: 'p' (pungent), 'a' (almond), 'l' (anise), etc.


Next, let's take a look at the target variable:

The target variable contains two categorical labels: ``p`` (poisonous) and ``e`` (edible).
With this insight into our dataset's structure, our next step is to prepare the data for model training.


**Thought Challenge:** What are some things that you have noticed about the data that you think we will need to fix before feeding it to the neural network? Pause here and write down your thoughts before continuing.

## Step 2: Data pre-processing
Our exploration of the Mushroom dataset reveals a collection of 8124 samples with 22 features and a single target variable. Before proceeding with model development, several preprocessing challenges need to be addressed:

 1. The dataset contains missing values that require handling.
 2. All features are categorical, encoded as text strings (represented as ``object`` type in pandas).
 3. The target variable itself is categorical, using ``p`` to indicate poisonous mushrooms and ``e`` for edible ones.

First, let's handle the missing values. Let's see how many missing values are in the dataset, and where they are located:

The output shows that ``stalk-root`` is missing data for 2480 samples, while all other features have complete data.
Let's remove this column from the dataset:
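
Both steps, counting the missing values and dropping the affected column, can be sketched on a small toy table. The values below are made up for illustration; on the real data the same calls would be applied to the full feature table.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the feature table; the values are hypothetical.
X_toy = pd.DataFrame({
    "stalk-root": ["e", np.nan, "b", np.nan],
    "odor": ["p", "a", "l", "n"],
})

# Count missing values per column and show only columns that have any.
missing_per_column = X_toy.isna().sum()
print(missing_per_column[missing_per_column > 0])

# Drop the incomplete column; the original frame is left unchanged.
X_clean = X_toy.drop(columns=["stalk-root"])
print(X_clean.columns.tolist())
```

Dropping the column is the simplest option; imputing the missing values would be an alternative, at the cost of extra assumptions about the data.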

Now we need to encode our categorical variables into a format suitable for the neural network. We'll use one-hot encoding via ``pd.get_dummies()`` to transform each categorical feature into multiple binary columns. For example, if a feature has three possible values (A, B, C), it will be converted into three separate columns, where only one column will have a value of 1 (True) and the others 0 (False):
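
The A/B/C example can be reproduced directly. In this minimal sketch the column name ``feature`` is hypothetical; recent pandas versions return boolean columns, older ones return 0/1 integers.

```python
import pandas as pd

# One categorical column with three possible values.
X_toy = pd.DataFrame({"feature": ["A", "B", "C", "A"]})

# Each category becomes its own binary column: feature_A, feature_B, feature_C.
X_encoded = pd.get_dummies(X_toy)
print(X_encoded)
```

Every row has exactly one active column, which is what "one-hot" refers to.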

Now, instead of the 21 categorical features that remain after dropping ``stalk-root``, we have 112 features, each representing a binary True/False value for one categorical value in the original features.

Finally, let's encode the target variable. We will simply convert the string labels ``p`` and ``e`` into binary numeric values of 1 and 0, respectively.
In this case, 1 will represent a poisonous mushroom and 0 will represent an edible mushroom.

Now would be a good time to check the class distribution of our dataset:
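
The label encoding and the distribution check can be sketched together. The five labels below are made up, so the printed percentages will not match the real dataset.

```python
import pandas as pd

# Hypothetical target labels: 'p' = poisonous, 'e' = edible.
y_toy = pd.Series(["p", "e", "e", "p", "e"])

# Map the string labels to integers: poisonous -> 1, edible -> 0.
y_encoded = y_toy.map({"p": 1, "e": 0})

# Share of each class, as a fraction of all samples.
class_share = y_encoded.value_counts(normalize=True)
print((class_share * 100).round(1))
```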

We have a roughly balanced dataset with 51.8% of the samples being edible and 48.2% being poisonous.
We can now split the dataset into training and test sets:
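
A minimal sketch of the split, using scikit-learn's ``train_test_split`` with the parameter values described in the table that follows. The arrays here are synthetic stand-ins (100 samples, 4 binary features) rather than the mushroom data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 100 samples, 4 binary features, perfectly balanced labels.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(100, 4))
y_toy = np.array([0, 1] * 50)

# 30% of the rows go to the test set; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=123
)
print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())  # class balance in each part
```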

**Understanding the Train-Test Split**

This split divides our data into training and testing sets, creating four objects:
``X_train``, ``X_test``, ``y_train``, and ``y_test``.

| Parameter | Purpose | In Our Example |
|-----------|---------|----------------|
| `test_size` | Determines what portion of data is reserved for testing | 30% for testing, 70% for training |
| `stratify` | Maintains the same class distribution in both splits | Ensures balanced representation of poisonous/edible classes |
| `random_state` | Controls the randomization for reproducible results | Set to 123 for consistent splits across runs |

**Why These Parameters Matter:**

* **Test Size**: Finding the right balance between having enough data for training while reserving sufficient data for testing is crucial. Too little test data may not reliably assess model performance; too little training data may limit learning.

* **Stratification**: When working with classification problems, maintaining class proportions is essential. Without stratification, you might accidentally create a test set with disproportionate class representation, leading to misleading evaluation metrics.

* **Reproducibility**: Setting a random seed ensures you can reproduce your experiments exactly, which is fundamental for scientific rigor and debugging.

**Tip**: While our dataset has roughly balanced classes, stratification becomes especially important with imbalanced datasets. Always consider using ``stratify`` as a best practice.

## Step 3: Building a sequential model neural network
Now we'll create a simple neural network for our mushroom classification task. The model will consist of:

- An **input layer** that matches our feature dimensions
- A **hidden layer** with 10 neurons and ReLU activation
- An **output layer** with sigmoid activation for binary classification

This architecture provides a good starting point for understanding how neural networks learn from tabular data.
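
One way to express this architecture with TensorFlow Keras is sketched below. The layer sizes and activations follow the list above, and the loss is the binary cross-entropy discussed with the training output; the ``adam`` optimizer is an assumption, since the text does not name one.

```python
from tensorflow import keras

n_features = 112  # number of one-hot encoded columns from Step 2

model = keras.Sequential([
    keras.Input(shape=(n_features,)),             # input layer
    keras.layers.Dense(10, activation="relu"),    # hidden layer, 10 neurons
    keras.layers.Dense(1, activation="sigmoid"),  # output layer, one probability
])

model.compile(
    optimizer="adam",            # assumption: optimizer not stated in the text
    loss="binary_crossentropy",  # standard loss for binary classification
    metrics=["accuracy"],
)

model.summary()  # prints the layers and their parameter counts
```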

**Thought Challenge**: How many parameters does the model have? Can you calculate this manually and get the same result?

**Training the Neural Network**

With our model built and compiled, we can now train it on our data. Before executing the training code, let’s understand the key parameters we’ll use:

| Parameter | Description |
|-----------|-------------|
| **validation_split=0.2** | Reserves 20% of training data to evaluate performance during training, without affecting model weights |
| **epochs=5** | Number of complete passes through the dataset; more epochs allow for more learning iterations but risk overfitting |
| **batch_size=32** | Number of samples processed before weight update; affects memory usage, training speed, and convergence behavior |
| **verbose=2** | Controls output level (0=silent, 1=progress bar, 2=one line per epoch) |

**Thought Challenge**: How does the choice of ``batch_size`` affect the training process?

Now let's train our model with these parameters:
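
The training call might look like the sketch below. To keep it self-contained, it rebuilds the small model and trains on synthetic data (random binary features with a toy labeling rule), so its output will not match the numbers discussed here. The ``adam`` optimizer is again an assumption.

```python
import numpy as np
from tensorflow import keras

# Synthetic stand-in data (not the mushroom data): 256 samples, 112 binary features.
rng = np.random.default_rng(123)
X_train = rng.integers(0, 2, size=(256, 112)).astype("float32")
y_train = X_train[:, 0]  # toy rule: the label copies the first feature

model = keras.Sequential([
    keras.Input(shape=(112,)),
    keras.layers.Dense(10, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train with the parameters from the table above.
history = model.fit(
    X_train, y_train,
    validation_split=0.2,  # hold out 20% of the training rows for validation
    epochs=5,
    batch_size=32,
    verbose=2,             # one summary line per epoch
)
print(list(history.history.keys()))
```

The returned ``history`` object stores the per-epoch loss and accuracy values for both the training and validation portions.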

Let's understand what this output tells us:

1. **Progress metrics**:
   - ``143/143``: Shows progress through the training batches; all 143 batches were completed. Each batch contains up to 32 samples (as specified by ``batch_size=32``); the final batch may be smaller.
   - ``0s``: Indicates the time taken for the epoch; here, the first epoch took less than 1 second to complete.
   - ``2ms/step``: Indicates the average time per training step (one forward and backward pass through a single batch).

2. **Training metrics**:
   - ``accuracy: 0.8828``: Represents the accuracy of the model on the training dataset. The accuracy value of approximately 0.8828 indicates that the model correctly predicted 88.28% of the training samples.
   - ``loss: 0.4267``: Represents the training loss value (using binary cross-entropy loss function) on the training dataset. Higher loss values indicate that the model's predictions are further from the true labels.

3. **Validation metrics**:
   - ``val_accuracy: 0.9552``: Represents the accuracy of the model on the validation dataset. The accuracy value of approximately 0.9552 indicates that the model correctly predicted 95.52% of the validation samples.
   - ``val_loss: 0.2148``: Represents the validation loss value (using binary cross-entropy loss function) on the validation dataset. Lower loss values indicate that the model's predictions are closer to the true labels.
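
The figure of 143 batches can be checked with a little arithmetic. The sketch below assumes the 70/30 train-test split from Step 2, and assumes that the test set size is rounded up and the validation split is rounded down, which is our understanding of how scikit-learn and Keras behave.

```python
import math

n_samples = 8124         # total samples in the dataset
test_size = 0.3          # fraction held out for testing
validation_split = 0.2   # fraction of training rows used for validation
batch_size = 32

n_test = math.ceil(n_samples * test_size)                  # rows held out for testing
n_train_full = n_samples - n_test                          # rows passed to fit()
n_fit = math.floor(n_train_full * (1 - validation_split))  # rows used for weight updates
n_val = n_train_full - n_fit                               # rows used for validation
steps_per_epoch = math.ceil(n_fit / batch_size)            # batches per epoch
last_batch = n_fit - (steps_per_epoch - 1) * batch_size    # size of the final batch

print(n_test, n_train_full, n_fit, n_val, steps_per_epoch, last_batch)
# → 2438 5686 4548 1138 143 4
```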

Looking at our training results after 5 epochs, we can observe:

1. The model achieved excellent performance, with final training accuracy of 99.78% and validation accuracy of 99.82%.
2. Both training and validation loss steadily decreased across epochs, indicating consistent learning.
3. Validation metrics consistently tracked close to training metrics, suggesting the model generalizes well rather than memorizing the training data.

Let's visualize our training progress before moving on:
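
A plotting helper along these lines could be used. This is a sketch with matplotlib; the function name ``plot_history`` is hypothetical, and it expects the dictionary stored in ``history.history`` by Keras, with the keys ``loss``, ``val_loss``, ``accuracy``, and ``val_accuracy``.

```python
import matplotlib.pyplot as plt


def plot_history(history_dict):
    """Plot training vs. validation loss and accuracy, one point per epoch."""
    epochs = range(1, len(history_dict["loss"]) + 1)
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))

    # Left panel: loss curves.
    ax_loss.plot(epochs, history_dict["loss"], label="training")
    ax_loss.plot(epochs, history_dict["val_loss"], label="validation")
    ax_loss.set_xlabel("epoch")
    ax_loss.set_ylabel("loss")
    ax_loss.legend()

    # Right panel: accuracy curves.
    ax_acc.plot(epochs, history_dict["accuracy"], label="training")
    ax_acc.plot(epochs, history_dict["val_accuracy"], label="validation")
    ax_acc.set_xlabel("epoch")
    ax_acc.set_ylabel("accuracy")
    ax_acc.legend()

    fig.tight_layout()
    return fig


# Typical usage after training:
# plot_history(history.history)
# plt.show()
```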

This high performance is promising, but we should verify it on our completely separate test set, which the model has never seen during training. This will give us the most reliable measure of how well our model might perform in real-world scenarios.

## Step 4: Evaluate the model's performance on test data
The true test of our model's capabilities comes from evaluating it on our completely separate test dataset. Let's see how our neural network performs when classifying mushrooms it has never encountered before!

For a binary classification problem like ours (poisonous vs. edible), the model outputs a probability between 0 and 1 for each sample. Let's show the first sample's prediction:

This shows the probability for the first mushroom sample in the test set.
The output is a single value between 0 and 1, where:
 - Values closer to 1 indicate the model is more confident that the sample is poisonous.
 - Values closer to 0 indicate the model is more confident that the sample is edible.

For example, our output value is 0.00026, which means that the model assigns a probability of about 99.97% to the sample being edible.

The model outputs probability values, but for practical mushroom classification, we need definitive "edible" or "poisonous" predictions. We need to convert these continuous probability values into discrete class labels:
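
A minimal NumPy sketch of this conversion is shown below. Apart from 0.00026, the probabilities are made up; the array is shaped like Keras output, with one row per sample. The final ``ravel()`` call, which flattens the result to one dimension, is our addition for convenience.

```python
import numpy as np

# Hypothetical predicted probabilities, shape (n_samples, 1) like Keras output.
y_prob = np.array([[0.00026], [0.97], [0.50], [0.51]])

# Compare with the threshold, convert True/False to 1/0, and flatten to 1-D.
y_pred = (y_prob > 0.5).astype(int).ravel()
print(y_pred)
# → [0 1 0 1]
```

Note that a probability of exactly 0.5 is not greater than the threshold, so it is labeled 0 (edible).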

This conversion is called "thresholding":

1. First, we compare each probability to the threshold value (0.5)
   
   - If probability > 0.5, the result is True (model thinks it's more likely poisonous)
   - If probability ≤ 0.5, the result is False (model thinks it's more likely edible)

2. Then, we convert these True/False values to integers (1/0) with ``.astype(int)``
   
   - True becomes 1 (poisonous)
   - False becomes 0 (edible)

The 0.5 threshold represents the decision boundary, the point where the model considers both classes equally likely. We could adjust this threshold if we wanted to be more conservative about certain types of errors (e.g., lowering the threshold would classify more mushrooms as poisonous, reducing the chance of missing toxic ones).


Now, let's visualize the model's prediction accuracy with a **confusion matrix**. 
This will allow us to see how many correct vs incorrect predictions were made using the model above.
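
The counts behind such a plot can be computed with scikit-learn's ``confusion_matrix``. The sketch below uses eight made-up labels and leaves the plotting itself out; the commented ``seaborn.heatmap`` call is one possible way to draw it, not necessarily the one used here.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = edible, 1 = poisonous.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

# Optional plot (assumes seaborn is installed):
# import seaborn as sns
# sns.heatmap(cm, annot=True, fmt="d")
```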

The confusion matrix visualization shows how well our model classifies mushrooms as edible or poisonous. The matrix is a 2x2 grid where:

* The y-axis (Truth) shows the actual class of the mushrooms
* The x-axis (Predicted) shows what our model predicted
* Each cell contains the count of predictions falling into that category
* The heatmap coloring provides visual intensity, where lighter colors indicate higher counts

Reading the matrix:

* **Top-left**: True Negatives (TN) - Correctly identified edible mushrooms
* **Top-right**: False Positives (FP) - Edible mushrooms incorrectly classified as poisonous
* **Bottom-left**: False Negatives (FN) - Poisonous mushrooms incorrectly classified as edible
* **Bottom-right**: True Positives (TP) - Correctly identified poisonous mushrooms 

**Key Classification Metrics**

From these confusion matrix values, we can calculate several important evaluation metrics:

| Metric | Definition | Interpretation for Mushrooms |
|--------|------------|----------------------------|
| **Accuracy** | (TP + TN)/(TP + TN + FP + FN) | Percentage of all mushrooms correctly classified |
| **Precision** | TP/(TP + FP) | When model predicts "poisonous," how often is it right? |
| **Recall** | TP/(TP + FN) | Of all poisonous mushrooms, how many did we correctly identify? |
| **F1-Score** | 2 × (Precision × Recall)/(Precision + Recall) | Harmonic mean of precision and recall; useful when you need to balance both |
| **Specificity** | TN/(TN + FP) | Of all edible mushrooms, how many did we correctly identify? |
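
The formulas in the table translate directly into code. In this sketch the function name ``classification_metrics`` and the example counts are hypothetical; it assumes none of the denominators is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics from the table above, given confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    specificity = tn / (tn + fp)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": specificity,
    }


# Hypothetical counts: 3 true positives, 3 true negatives, 1 of each error type.
metrics = classification_metrics(tp=3, tn=3, fp=1, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```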

**Thought Challenge**: Which prediction metric is most important for this model? Why? 

Let's also print the full classification report for this model:
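
A report of this kind can be produced with scikit-learn's ``classification_report``. The labels below are made up for illustration, so the printed numbers will differ from the results on the real test set.

```python
from sklearn.metrics import classification_report

# Hypothetical labels: 0 = edible, 1 = poisonous.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# target_names follow the sorted label order: 0 first, then 1.
report = classification_report(
    y_true, y_pred, target_names=["edible", "poisonous"]
)
print(report)
```

The report lists precision, recall, and F1-score per class, along with the overall accuracy.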

The accuracy of our model is 99.79%: it predicted the correct label for 99.79% of the samples in the test data.

**Thought Challenge**: Did we build a successful model? Why or why not? Is there anything we can do to improve the model?