# 8. Adding more layers, using larger layers, or crafting more features?

### About this notebook

This notebook was used in the 50.039 Deep Learning course at the Singapore University of Technology and Design.

**Author:** Matthieu DE MARI (matthieu_demari@sutd.edu.sg)

**Version:** 1.1 (16/06/2023)

**Requirements:**
- Python 3 (tested on v3.11.4)
- Matplotlib (tested on v3.7.1)
- Numpy (tested on v1.24.3)
- Scikit-learn (imported below for `accuracy_score`)

### Imports

In [None]:
# Matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.lines import Line2D
# Numpy
import numpy as np
# Sklearn
from sklearn.metrics import accuracy_score
# Removing unnecessary warnings (optional, just makes notebook outputs more readable)
import warnings
warnings.filterwarnings("ignore")

We have also prepared a few functions to help you, in the utils.py file, which is given along with this notebook. Feel free to have a look at it.

In [None]:
from utils import *

### 1. New mock dataset generation - Nonlinearity with a mysterious equation

As before, we will generate a dataset with some non-linearity, whose boundary follows a mysterious equation.

We do not, however, provide the exact equation of the boundary. 

In [None]:
# Dataset Generation values
eps = 1e-5
min_val = -1 + eps
max_val = 1 - eps
n_points = 1000

In [None]:
# Generate dataset
np.random.seed(27)
val1_list, val2_list, inputs, outputs = create_dataset(n_points, min_val, max_val)
# Check a few entries of the dataset
print(val1_list.shape)
print(val2_list.shape)
print(inputs.shape)
print(outputs.shape)
print(inputs[0:5, :])
print(outputs[0:5])

As expected, and as the plot below shows, the dataset is not linearly separable, because its boundary follows a mysterious nonlinear function.

In [None]:
# Show the dataset (no classification)
show_dataset(val1_list, val2_list, outputs)

### 2. Template for our Shallow Neural Network with Activation functions from Notebook 4

As in Notebook 4, below is the template class for a neural network consisting of a single linear layer with a sigmoid activation function. The class has been coded for you in the utils file and is ready to be used, as shown below.

Let us try to train it and see its capabilities on the classification task.

In [None]:
# Define and train neural network structure
n_x = 2
n_y = 1
np.random.seed(37)
shallow_neural_net_act = ShallowNeuralNet_WithAct_OneLayer(n_x, n_y)
# Train and show final loss
shallow_neural_net_act.train(inputs, outputs, N_max = 10000, alpha = 1, delta = 1e-8, display = True)
print(shallow_neural_net_act.loss)

In [None]:
# Training curves
shallow_neural_net_act.show_losses_over_training()

The model trained, but we are not surprised to see that it struggles to classify... The model is too simple for the task.
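
One way to see why a single sigmoid layer is too simple: its output crosses 0.5 exactly where $ Wx + b = 0 $, which is a straight line in the input plane. The sketch below checks this on random points. The values of `W` and `b` are arbitrary illustrative choices, not the parameters learned above.

```python
import numpy as np

def sigmoid(val):
    return 1/(1 + np.exp(-val))

# Hypothetical parameters for a single-layer model (2 inputs, 1 output)
W = np.array([[1.5], [-2.0]])
b = np.array([[0.3]])

# Random points in the same range as the dataset
rng = np.random.default_rng(0)
points = rng.uniform(-1, 1, size = (1000, 2))

# Class predicted by thresholding the sigmoid output at 0.5
logits = np.matmul(points, W) + b
pred_class = sigmoid(logits) > 0.5

# Class obtained from the sign of the linear part only
linear_class = logits > 0

# Fraction of points on which both rules agree
agreement = np.mean(pred_class == linear_class)
print(agreement)
```

Since both rules agree, the model can only split the plane along one line, whatever the values of `W` and `b` are.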

In [None]:
# Show dataset and predictions made by model
show_dataset_and_predictions(inputs, val1_list, val2_list, outputs, shallow_neural_net_act)

### 3. How about two layers then?

As in Notebook 4, let us now try a model with two layers.

In [None]:
# Define and train neural network structure (with activation)
n_x = 2
n_h = 2
n_y = 1
np.random.seed(37)
shallow_neural_net_act2 = ShallowNeuralNet_WithAct_TwoLayers(n_x, n_h, n_y)
# Train and show final loss
shallow_neural_net_act2.train(inputs, outputs, N_max = 10000, alpha = 1, delta = 1e-8, display = True)
print(shallow_neural_net_act2.loss)

In [None]:
# Training curves
shallow_neural_net_act2.show_losses_over_training()

The model trained, but it still struggles to classify... Is the model still too simple for the task?

In [None]:
# Show dataset and predictions made by model
show_dataset_and_predictions(inputs, val1_list, val2_list, outputs, shallow_neural_net_act2)

### 4. How about two layers, but more neurons?

A thing that might help would be adding more neurons to the first layer, because using only two neurons might not be enough.

**Practice #1:** Let us go back to our original two-layer model, but this time increase the size of the first layer, for instance by using $ n_h = 10 $ instead of $ 2 $. Is that going to help?

In [None]:
# Define and train neural network structure (with activation)
"""
What needs to be modified here to make 10 neurons in the first layer?
"""
n_x = 2
n_h = 2
n_y = 1
np.random.seed(37)
shallow_neural_net_act3 = ShallowNeuralNet_WithAct_TwoLayers(n_x, n_h, n_y)
# Train and show final loss
shallow_neural_net_act3.train(inputs, outputs, N_max = 10000, alpha = 1, delta = 1e-8, display = True)
print(shallow_neural_net_act3.loss)

In [None]:
shallow_neural_net_act3.show_losses_over_training()

It trains and almost classifies perfectly!

That is another important lesson: you should always ensure that your layers have enough neurons in them.

What is a good number of neurons to use then? That is not an easy question to answer; we will discuss it in class together at the end of the lecture.
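
To get a feel for what changing $ n_h $ does to the model size, here is a small sketch that counts trainable parameters. The helper `count_parameters` is hypothetical, and it assumes the two-layer class in the utils file uses the same shapes as the template shown later in this notebook (`W1` of shape `(n_x, n_h)`, `b1` of shape `(1, n_h)`, and so on).

```python
def count_parameters(n_x, n_h, n_y):
    # First layer: W1 has shape (n_x, n_h), b1 has shape (1, n_h)
    first_layer = n_x*n_h + n_h
    # Second layer: W2 has shape (n_h, n_y), b2 has shape (1, n_y)
    second_layer = n_h*n_y + n_y
    return first_layer + second_layer

# Compare a few hidden layer sizes for our 2-input, 1-output problem
for n_h in [2, 10, 100]:
    print("n_h = {}: {} parameters".format(n_h, count_parameters(2, n_h, 1)))
```

Under these assumptions, going from $ n_h = 2 $ to $ n_h = 10 $ takes the model from 9 to 41 trainable parameters.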

In [None]:
show_dataset_and_predictions(inputs, val1_list, val2_list, outputs, shallow_neural_net_act3)

### 5. How about some feature engineering instead?

**Practice #2:** The mysterious equation, in the utils file, seems to be using the squared values of the inputs. How about reworking the inputs so each sample consists of 4 values instead of 2, by adding the squared value of each original input?

In other words, what we are suggesting here is to process the inputs and generate polynomial features, as in polynomial regression, so that each input sample $ i $ in the dataset is transformed as:

$$ (val_1, val_2) \rightarrow (val_1, val_2, val_1^2, val_2^2) $$

Do note that after this input transformation we will reuse our two-layer model, but with $ n_x = 4 $ instead of $ n_x = 2 $.

Let us also stick to two layers with $ n_h = 2 $, as in Section 3, rather than three: as we will see in Section 6, adding more layers does not solve the problem.

In [None]:
# Processing features so they also include squared values
def rework_inputs(inputs):
    """
    How would you rework your inputs to add some polynomial features of order 2 to the dataset?
    """
    inputs_processing = None
    return inputs_processing

In [None]:
# Processing features so they also include squared values
inputs_processing = rework_inputs(inputs)
print(inputs_processing)

In [None]:
# Define and train neural network structure (with activation)
n_x = 4
n_h = 2
n_y = 1
np.random.seed(37)
shallow_neural_net_act4 = ShallowNeuralNet_WithAct_TwoLayers(n_x, n_h, n_y)
# Train and show final loss
shallow_neural_net_act4.train(inputs_processing, outputs, N_max = 10000, alpha = 1, delta = 1e-8, display = True)
print(shallow_neural_net_act4.loss)

In [None]:
shallow_neural_net_act4.show_losses_over_training()

It trains, and this time, it seems to classify correctly!

That is an important lesson for us: when possible, feature engineering (that is, reworking your inputs to add relevant information that could help train a model) is best! It is, however, difficult, as it requires insights about the dataset, which rely on human expertise/intuition.
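
To build some intuition for why a well-chosen feature helps, here is a sketch on a toy 1-D problem, unrelated to the mysterious equation. Points are in class 1 when they are far from the origin. The dataset and the helper `best_threshold_accuracy` are hypothetical and only serve the illustration.

```python
import numpy as np

# Toy 1-D dataset: class 1 when the point is far from the origin
x = np.linspace(-1, 1, 201).reshape(-1, 1)
y = (x**2 > 0.25).astype(int)

def best_threshold_accuracy(feature, labels):
    # Best accuracy reachable with a single threshold on one feature,
    # trying both orientations of the rule
    best = 0.0
    for t in np.unique(feature):
        pred = (feature > t).astype(int)
        acc = np.mean(pred == labels)
        best = max(best, acc, 1 - acc)
    return best

# Thresholding the raw value: class 1 sits on both sides of class 0
acc_raw = best_threshold_accuracy(x, y)
# Thresholding the squared value: the two classes become separable
acc_squared = best_threshold_accuracy(x**2, y)

print(acc_raw)
print(acc_squared)
```

A single threshold on the raw value gets about three quarters of the points right at best, whereas a single threshold on the squared value separates the two classes perfectly. The squared feature turns a nonlinear boundary into a linear one.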

In [None]:
show_dataset_and_predictions(inputs_processing, val1_list, val2_list, outputs, shallow_neural_net_act4)

### 6. How about adding more layers?

What if we decided to stick to the original inputs (no feature processing), but added more layers, e.g. a third one?

**Practice #3: How would you modify the template below to have a third layer with the following successive sizes?**

$ n_x \rightarrow n_h \rightarrow n_{h2} \rightarrow n_y $

We already provide the `__init__` method and have marked the parts of the code that will probably need to be amended (parameter initialization, `forward` and `backward`).
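
Before amending the code, it can help to list the parameter shapes implied by a chain of layer sizes. The sketch below assumes the template's convention (weights of shape `(n_in, n_out)`, biases of shape `(1, n_out)`). The sizes in `layer_sizes` are hypothetical and chosen to be all different so that each shape is easy to recognize.

```python
# Hypothetical layer sizes for the chain n_x -> n_h -> n_h2 -> n_y
layer_sizes = [2, 5, 3, 1]

# One (weights shape, biases shape) pair per arrow in the chain
shapes = []
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    shapes.append(((n_in, n_out), (1, n_out)))

# Display the shapes, numbering the layers from 1
for k, (w_shape, b_shape) in enumerate(shapes, start = 1):
    print("W{}: {}, b{}: {}".format(k, w_shape, k, b_shape))
```

Checking these shapes first makes it easier to spot which matrix products in `forward` and `backward` have to change.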

In [None]:
class ShallowNeuralNet_WithAct_ThreeLayers():
    
    def __init__(self, n_x, n_h, n_h2, n_y):
        # Network dimensions
        self.n_x = n_x
        self.n_h = n_h
        self.n_h2 = n_h2
        self.n_y = n_y
        # Initialize parameters
        self.init_parameters_normal()
        # Loss, initialized as infinity before first calculation is made
        self.loss = float("Inf")
         
    def init_parameters_normal(self):
        # Weights and biases matrices (randomly initialized)
        """
        This will have to be updated as new parameters will be used for the third layer.
        """
        self.W1 = np.random.randn(self.n_x, self.n_h)*0.1
        self.b1 = np.random.randn(1, self.n_h)*0.1
        self.W2 = np.random.randn(self.n_h, self.n_y)*0.1
        self.b2 = np.random.randn(1, self.n_y)*0.1

    def sigmoid(self, val):
        return 1/(1 + np.exp(-val))
    
    def forward(self, inputs):
        # Wx + b operation for the first layer
        Z1 = np.matmul(inputs, self.W1)
        Z1_b = Z1 + self.b1
        A1 = self.sigmoid(Z1_b)
        # Wx + b operation for the second layer
        Z2 = np.matmul(A1, self.W2)
        Z2_b = Z2 + self.b2
        y_pred = self.sigmoid(Z2_b)
        """
        This will probably have new operations implemented, as a third layer will be added.
        """
        return y_pred
    
    def CE_loss(self, inputs, outputs):
        # CE loss function as before
        outputs_re = outputs.reshape(-1, 1)
        pred = self.forward(inputs)
        eps = 1e-10
        losses = outputs_re*np.log(pred + eps) + (1 - outputs_re)*np.log(1 - pred + eps)
        self.loss = -np.sum(losses)/outputs.shape[0]
        return self.loss
    
    def backward(self, inputs, outputs, alpha = 1e-5):
        # Get the number of samples in dataset
        m = inputs.shape[0]
        
        # Forward propagate
        Z1 = np.matmul(inputs, self.W1)
        Z1_b = Z1 + self.b1
        A1 = self.sigmoid(Z1_b)
        Z2 = np.matmul(A1, self.W2)
        Z2_b = Z2 + self.b2
        A2 = self.sigmoid(Z2_b)
        """
        The forward pass of the third layer will have to be added here too.
        """
        
        # Compute error term
        dL_dA2 = -outputs/A2 + (1 - outputs)/(1 - A2)
        dL_dZ2 = dL_dA2*A2*(1 - A2)
        dL_dA1 = np.dot(dL_dZ2, self.W2.T)
        dL_dZ1 = dL_dA1*A1*(1 - A1)
        """
        More partial derivatives need to be calculated for A3 and Z3.
        The derivatives above might no longer be correct and should probably be adjusted.
        """
        
        # Gradient descent update rules
        self.W2 -= (1/m)*alpha*np.dot(A1.T, dL_dZ2)
        self.W1 -= (1/m)*alpha*np.dot(inputs.T, dL_dZ1)
        self.b2 -= (1/m)*alpha*np.sum(dL_dZ2, axis = 0, keepdims = True)
        self.b1 -= (1/m)*alpha*np.sum(dL_dZ1, axis = 0, keepdims = True)
        """
        More gradient descent rules need to be calculated for W3 and b3.
        The formulas above might no longer be correct and should probably be adjusted.
        """
        
        # Update loss
        self.CE_loss(inputs, outputs)
    
    def train(self, inputs, outputs, N_max = 1000, alpha = 1e-5, delta = 1e-5, display = True):
        # List of losses, starts with the current loss
        self.losses_list = [self.loss]
        # Repeat iterations
        for iteration_number in range(1, N_max + 1):
            # Backpropagate
            self.backward(inputs, outputs, alpha)
            new_loss = self.loss
            # Update losses list
            self.losses_list.append(new_loss)
            # Display
            if(display and iteration_number % (N_max*0.05) == 1):
                print("Iteration {} - Loss = {}".format(iteration_number, new_loss))
            # Check for delta value and early stop criterion
            difference = abs(self.losses_list[-1] - self.losses_list[-2])
            if(difference < delta):
                if(display):
                    print("Stopping early - loss evolution was less than delta on iteration {}.".format(iteration_number))
                break
        else:
            # Else on for loop will execute if break did not trigger
            if(display):
                print("Stopping - Maximal number of iterations reached.")
    
    def show_losses_over_training(self):
        # Initialize matplotlib
        fig, axs = plt.subplots(1, 2, figsize = (15, 5))
        axs[0].plot(list(range(len(self.losses_list))), self.losses_list)
        axs[0].set_xlabel("Iteration number")
        axs[0].set_ylabel("Loss")
        axs[1].plot(list(range(len(self.losses_list))), self.losses_list)
        axs[1].set_xlabel("Iteration number")
        axs[1].set_ylabel("Loss (in logarithmic scale)")
        axs[1].set_yscale("log")
        # Display
        plt.show()
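
When adjusting the derivatives in `backward`, a useful habit is to compare hand-derived formulas against finite differences. The sketch below does this for the output-layer error term only, using a few made-up pre-activations and labels. The function `ce_loss_from_logits` is a hypothetical helper written for this check, not part of the utils file.

```python
import numpy as np

def sigmoid(val):
    return 1/(1 + np.exp(-val))

def ce_loss_from_logits(z, y):
    # Sum (not mean) of the cross-entropy terms, so that each entry
    # of the gradient only depends on its own sample
    a = sigmoid(z)
    return -np.sum(y*np.log(a) + (1 - y)*np.log(1 - a))

# Made-up pre-activations and labels for 5 samples
rng = np.random.default_rng(0)
z = rng.normal(size = (5, 1))
y = rng.integers(0, 2, size = (5, 1)).astype(float)

# Analytical error term, written as in the template
a = sigmoid(z)
dL_dA = -y/a + (1 - y)/(1 - a)
dL_dZ = dL_dA*a*(1 - a)

# Central finite differences, one entry at a time
h = 1e-6
numerical = np.zeros_like(z)
for i in range(z.shape[0]):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[i, 0] += h
    z_minus[i, 0] -= h
    numerical[i, 0] = (ce_loss_from_logits(z_plus, y) - ce_loss_from_logits(z_minus, y))/(2*h)

# Both checks should print True
print(np.allclose(dL_dZ, numerical, atol = 1e-6))
print(np.allclose(dL_dZ, a - y))
```

The second check confirms that, for a sigmoid output with cross-entropy loss, the error term simplifies to the prediction minus the label. The same finite-difference idea can be applied to the full three-layer network by perturbing one weight at a time.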

In [None]:
# Define and train neural network structure (with activation)
n_x = 2
n_h = 2
n_h2 = 2
n_y = 1
np.random.seed(37)
shallow_neural_net_act5 = ShallowNeuralNet_WithAct_ThreeLayers(n_x, n_h, n_h2, n_y)
# Train and show final loss
shallow_neural_net_act5.train(inputs, outputs, N_max = 10000, alpha = 1, delta = 1e-8, display = True)
print(shallow_neural_net_act5.loss)

In [None]:
shallow_neural_net_act5.show_losses_over_training()

The model trains, but it still struggles to classify... (We obtain an accuracy of roughly 66%.)

It seems adding more layers will not do the trick here, as the problem was elsewhere: we needed more neurons in the first layer.

It was still good practice nonetheless, and a good lesson for us: adding a layer to a neural network that is malfunctioning is most likely not going to fix its issues.
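
An accuracy figure such as the one quoted above can be computed by thresholding the sigmoid outputs and comparing them to the labels. Here is a minimal sketch, assuming a threshold of 0.5. The helper `accuracy_from_probabilities` and the arrays below are made up for illustration.

```python
import numpy as np

def accuracy_from_probabilities(pred, outputs, threshold = 0.5):
    # Turn sigmoid outputs into 0/1 classes, then count matches
    pred_class = (pred.reshape(-1) > threshold).astype(int)
    true_class = outputs.reshape(-1).astype(int)
    return np.mean(pred_class == true_class)

# Made-up predictions and labels, for illustration only
pred = np.array([[0.9], [0.2], [0.6], [0.4], [0.7], [0.1]])
outputs_demo = np.array([[1], [0], [0], [1], [1], [0]])

# 4 of the 6 made-up samples are classified correctly
acc = accuracy_from_probabilities(pred, outputs_demo)
print(acc)
```

With a trained model, `pred` would instead come from its `forward` method, for instance `shallow_neural_net_act5.forward(inputs)`.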

In [None]:
show_dataset_and_predictions(inputs, val1_list, val2_list, outputs, shallow_neural_net_act5)

### Typical homework questions would be

1. Show the ShallowNeuralNet_WithAct_ThreeLayers class you coded and explain the changes you made to the code.

2. Is the performance of the model improving when adding a third layer as suggested in Practice #3?

3. Show the rework_inputs function you produced for Practice #2 and explain the logic of your code.

4. Is the performance of the model improving after feature engineering? If so, how could we explain this performance improvement?

5. Practice #1 suggests that using a two-layer model with more neurons is sufficient to address our underfitting issue.
Why would this be a good approach? How could we explain this intuitively?

6. Challenge: If the issue was indeed more neurons needed for our model, does that mean I could make the 1-layer model work by simply adding more neurons?