# 4. PyTorch Basics - Advanced Backprop & Optimizers, Regularization, Initializers

### About this notebook

This notebook was used in the 50.039 Deep Learning course at the Singapore University of Technology and Design.

**Author:** Matthieu DE MARI (matthieu_demari@sutd.edu.sg)

**Version:** 1.0 (27/12/2022)

**Requirements:**
- Python 3 (tested on v3.9.6)
- Matplotlib (tested on v3.5.1)
- Numpy (tested on v1.22.1)
- Time
- Torch (tested on v1.13.0)
- Torchmetrics (tested on v0.11.0)

### Imports and CUDA

In [2]:
# Matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm
# Numpy
import numpy as np
from numpy.random import default_rng
# OS
import os
# Pickle
import pickle
# Time
from time import time
# Torch
import torch
from torchmetrics.classification import BinaryAccuracy

In [3]:
# Use GPU if available, else use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


### Mock dataset, with nonlinearity

As in the previous notebooks, we will reuse our nonlinear binary classification mock dataset and generate a training set with 1000 samples.

In [4]:
# All helper functions
eps = 1e-5
min_val = -1 + eps
max_val = 1 - eps
def val(min_val, max_val):
    return round(np.random.uniform(min_val, max_val), 2)
def class_for_val(val1, val2):
    k = np.pi
    return int(val2 >= -1/4 + 3/4*np.sin(val1*k))
def create_dataset(n_points, min_val, max_val):
    val1_list = np.array([val(min_val, max_val) for _ in range(n_points)])
    val2_list = np.array([val(min_val, max_val) for _ in range(n_points)])
    inputs = np.array([[v1, v2] for v1, v2 in zip(val1_list, val2_list)])
    outputs = np.array([class_for_val(v1, v2) for v1, v2 in zip(val1_list, val2_list)]).reshape(n_points, 1)
    return val1_list, val2_list, inputs, outputs

In [5]:
# Generate dataset (train)
np.random.seed(47)
n_points = 1000
train_val1_list, train_val2_list, train_inputs, train_outputs = create_dataset(n_points, min_val, max_val)

In [6]:
# Convert to tensors and send to device (CUDA or CPU)
train_inputs_pt = torch.from_numpy(train_inputs).to(device)
train_outputs_pt = torch.from_numpy(train_outputs).to(device)

### Using Torch loss and accuracy functions for simplicity

In order to make our model even simpler, we will use the loss functions from PyTorch and the evaluation functions from Torchmetrics.

Our **CE_loss()** and **accuracy()** methods will therefore be replaced with the **torch.nn.BCELoss()** and **BinaryAccuracy()** functions, respectively.

Feel free to have a look at the loss functions available in PyTorch, here: https://pytorch.org/docs/stable/nn.html#loss-functions.

Torchmetrics also provides a few functions, ready to use with PyTorch: https://torchmetrics.readthedocs.io/en/stable/all-metrics.html
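
To see what these two functions compute, here is a minimal sketch on hypothetical toy values (the tensors `pred` and `target` below are made up for illustration and are not part of the dataset). It checks that torch.nn.BCELoss(), with its default mean reduction, matches the binary cross-entropy formula written by hand, and computes an accuracy by hand, assuming a 0.5 decision threshold.

```python
import torch

# Hypothetical toy predictions (probabilities in (0, 1)) and binary targets
pred = torch.tensor([[0.9], [0.2], [0.7], [0.4]], dtype=torch.float64)
target = torch.tensor([[1.0], [0.0], [1.0], [1.0]], dtype=torch.float64)

# Built-in binary cross-entropy (mean reduction by default)
bce_builtin = torch.nn.BCELoss()(pred, target)

# Same quantity written out by hand
bce_manual = -(target * torch.log(pred) + (1 - target) * torch.log(1 - pred)).mean()

# Accuracy by hand: threshold the probabilities at 0.5 (assumed threshold)
# and compare the resulting classes to the targets
pred_class = (pred > 0.5).to(torch.float64)
acc_manual = (pred_class == target).to(torch.float64).mean()

print(bce_builtin.item(), bce_manual.item(), acc_manual.item())
```

The two loss values should agree up to floating-point error, and three of the four toy samples are classified correctly.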

In [20]:
# Our class will inherit from the torch.nn.Module
# used to write all models in PyTorch
class ShallowNeuralNet_PT(torch.nn.Module):
    
    def __init__(self, n_x, n_h, n_y):
        # Super __init__ for inheritance
        super().__init__()
        
        # Network dimensions (as before)
        self.n_x = n_x
        self.n_h = n_h
        self.n_y = n_y
        
        # Initialize parameters using the torch.nn.Parameter type (a subclass of Tensors).
        # We immediately initialize the parameters using a random normal.
        # The RNG is done using torch.randn instead of the NumPy RNG.
        # We add a conversion into float64 (the same float type used by Numpy to generate our data)
        # And send them to our GPU/CPU device
        self.W1 = torch.nn.Parameter(torch.randn(n_x, n_h, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.b1 = torch.nn.Parameter(torch.randn(1, n_h, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.W2 = torch.nn.Parameter(torch.randn(n_h, n_y, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.b2 = torch.nn.Parameter(torch.randn(1, n_y, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.W1.retain_grad()
        self.b1.retain_grad()
        self.W2.retain_grad()
        self.b2.retain_grad()
        
        # Loss and accuracy functions
        self.loss = torch.nn.BCELoss()
        self.accuracy = BinaryAccuracy()
        
    def forward(self, inputs):
        # Instead of using np.matmul(), we use its equivalent in PyTorch,
        # which is torch.matmul()!
        # (Most numpy matrix operations have their equivalent in torch, check it out!)
        # Wx + b operation for the first layer
        Z1 = torch.matmul(inputs, self.W1)
        Z1_b = Z1 + self.b1
        # Sigmoid is already implemented in PyTorch, feel free to reuse it!
        A1 = torch.sigmoid(Z1_b)
        
        # Wx + b operation for the second layer
        # (Same as first layer)
        Z2 = torch.matmul(A1, self.W2)
        Z2_b = Z2 + self.b2
        y_pred = torch.sigmoid(Z2_b)
        return y_pred
    
    def train(self, inputs, outputs, N_max = 1000, alpha = 1):
        # History of losses
        self.loss_history = []
        # Repeat gradient descent procedure for N_max iterations
        for iteration_number in range(1, N_max + 1):
            # Forward pass
            # This is equivalent to pred = self.forward(inputs)
            pred = self(inputs)
            # Compute loss
            loss_val = self.loss(pred, outputs.to(torch.float64))
            self.loss_history.append(loss_val.item())

            # Backpropagate
            # Compute differentiation of loss with respect to all
            # parameters involved in the calculation that have a flag
            # requires_grad = True (that is W2, W1, b2 and b1)
            loss_val.backward()

            # Update all weights
            # Note that this operation should not be tracked for gradients,
            # hence the torch.no_grad()!
            with torch.no_grad():
                self.W1 -= alpha*self.W1.grad
                self.W2 -= alpha*self.W2.grad
                self.b1 -= alpha*self.b1.grad
                self.b2 -= alpha*self.b2.grad

            # Reset gradients to 0
            self.W1.grad.zero_()
            self.W2.grad.zero_()
            self.b1.grad.zero_()
            self.b2.grad.zero_()
            
            # Display
            if(iteration_number % (N_max//20) == 1):
                # Compute accuracy for display
                acc_val = self.accuracy(pred, outputs)
                print("Iteration {} - Loss = {} - Accuracy = {}".format(iteration_number, loss_val.item(), acc_val.item()))

In [22]:
# Define a neural network structure
n_x = 2
n_h = 10
n_y = 1
np.random.seed(37)
shallow_neural_net_pt = ShallowNeuralNet_PT(n_x, n_h, n_y).to(device)
train_pred = shallow_neural_net_pt.train(train_inputs_pt, train_outputs_pt, N_max = 1001, alpha = 5)

Iteration 1 - Loss = 0.6815893678844995 - Accuracy = 0.6259999871253967
Iteration 51 - Loss = 0.26856284262640595 - Accuracy = 0.875
Iteration 101 - Loss = 0.2554499223809305 - Accuracy = 0.878000020980835
Iteration 151 - Loss = 0.2103705015487559 - Accuracy = 0.9020000100135803
Iteration 201 - Loss = 0.17958287411376456 - Accuracy = 0.9129999876022339
Iteration 251 - Loss = 0.15927524680187605 - Accuracy = 0.9279999732971191
Iteration 301 - Loss = 0.14429291254439877 - Accuracy = 0.9380000233650208
Iteration 351 - Loss = 0.13263530282019223 - Accuracy = 0.9470000267028809
Iteration 401 - Loss = 0.12317784533839293 - Accuracy = 0.9520000219345093
Iteration 451 - Loss = 0.11523437134002097 - Accuracy = 0.9599999785423279
Iteration 501 - Loss = 0.10832495056501382 - Accuracy = 0.9629999995231628
Iteration 551 - Loss = 0.10205194807838043 - Accuracy = 0.9700000286102295
Iteration 601 - Loss = 0.09608766375853714 - Accuracy = 0.972000002861023
Iteration 651 - Loss = 0.09022613737409999 - A

In [23]:
# Check accuracy after training
acc = shallow_neural_net_pt.accuracy(shallow_neural_net_pt(train_inputs_pt), train_outputs_pt).item()
print(acc)

0.9819999933242798


### Advanced backpropagation and optimizers

We can also define some advanced optimizers, e.g. Adam, as below.

Feel free to have a look at all the available optimizers, here: https://pytorch.org/docs/stable/optim.html

Three modifications are needed to use Adam instead of the vanilla gradient descent rule.

1. **Adam** has been added as an optimizer and its parameters can be passed to the train() method.

```
# Optimizer
# You can use self.parameters() to get the list of parameters for the model
# self.parameters() is therefore equivalent to [self.W1, self.b1, self.W2, self.b2]
optimizer = torch.optim.Adam(self.parameters(), # Parameters to be updated by gradient rule
                             lr = alpha, # Learning rate
                             betas = (beta1, beta2), # Betas used in Adam rules for V and S
                             eps = 1e-08) # Epsilon value used in normalization
optimizer.zero_grad()
```

2. **Optimizer step** must be performed to update the V and S parameters in Adam. This single call also replaces our manual gradient update rule entirely!

```
# Update all weights and optimizer step (will update the V 
# and S parameters in Adam) all at once!
optimizer.step()
```

3. **Reset gradients in optimizer to 0**, as you would for the parameter tensors. This replaces all four self.Xx.grad.zero_() operations!
```
# Reset gradients to 0
optimizer.zero_grad()
```        
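
To make the role of V and S more concrete, here is a minimal NumPy sketch of a single Adam update, following the standard Adam rule with bias correction. The helper `adam_step()` and the toy values are hypothetical and only meant for illustration; torch.optim.Adam handles all of this internally for every parameter.

```python
import numpy as np

def adam_step(param, grad, V, S, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter array (illustrative helper)."""
    # V: exponential moving average of the gradients (momentum term)
    V = beta1 * V + (1 - beta1) * grad
    # S: exponential moving average of the squared gradients (scaling term)
    S = beta2 * S + (1 - beta2) * grad**2
    # Bias correction, needed because V and S start at zero
    V_hat = V / (1 - beta1**t)
    S_hat = S / (1 - beta2**t)
    # Parameter update, normalized element-wise
    param = param - alpha * V_hat / (np.sqrt(S_hat) + eps)
    return param, V, S

# Toy example: f(w) = sum(w**2), whose gradient is 2*w
w = np.array([1.0, -3.0])
grad = 2 * w
V = np.zeros_like(w)
S = np.zeros_like(w)
w_new, V, S = adam_step(w, grad, V, S, t=1, alpha=0.1)
print(w_new)
```

On the very first step, each entry moves by roughly alpha toward zero, whatever the size of its gradient. This normalization is one reason why a learning rate that suits vanilla gradient descent does not necessarily suit Adam.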

In [54]:
# Our class will inherit from the torch.nn.Module
# used to write all models in PyTorch
class ShallowNeuralNet_PT(torch.nn.Module):
    
    def __init__(self, n_x, n_h, n_y):
        # Super __init__ for inheritance
        super().__init__()
        
        # Network dimensions (as before)
        self.n_x = n_x
        self.n_h = n_h
        self.n_y = n_y
        
        # Initialize parameters using the torch.nn.Parameter type (a subclass of Tensors).
        # We immediately initialize the parameters using a random normal.
        # The RNG is done using torch.randn instead of the NumPy RNG.
        # We add a conversion into float64 (the same float type used by Numpy to generate our data)
        # And send them to our GPU/CPU device
        self.W1 = torch.nn.Parameter(torch.randn(n_x, n_h, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.b1 = torch.nn.Parameter(torch.randn(1, n_h, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.W2 = torch.nn.Parameter(torch.randn(n_h, n_y, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.b2 = torch.nn.Parameter(torch.randn(1, n_y, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.W1.retain_grad()
        self.b1.retain_grad()
        self.W2.retain_grad()
        self.b2.retain_grad()
        
        # Loss and accuracy functions
        self.loss = torch.nn.BCELoss()
        self.accuracy = BinaryAccuracy()
        
    def forward(self, inputs):
        # Instead of using np.matmul(), we use its equivalent in PyTorch,
        # which is torch.matmul()!
        # (Most numpy matrix operations have their equivalent in torch, check it out!)
        # Wx + b operation for the first layer
        Z1 = torch.matmul(inputs, self.W1)
        Z1_b = Z1 + self.b1
        # Sigmoid is already implemented in PyTorch, feel free to reuse it!
        A1 = torch.sigmoid(Z1_b)
        
        # Wx + b operation for the second layer
        # (Same as first layer)
        Z2 = torch.matmul(A1, self.W2)
        Z2_b = Z2 + self.b2
        y_pred = torch.sigmoid(Z2_b)
        return y_pred
    
    def train(self, inputs, outputs, N_max = 1000, alpha = 1, beta1 = 0.9, beta2 = 0.999):
        # Optimizer
        # You can use self.parameters() to get the list of parameters for the model
        # self.parameters() is therefore equivalent to [self.W1, self.b1, self.W2, self.b2]
        optimizer = torch.optim.Adam(self.parameters(), # Parameters to be updated by gradient rule
                                     lr = alpha, # Learning rate
                                     betas = (beta1, beta2), # Betas used in Adam rules for V and S
                                     eps = 1e-08) # Epsilon value used in normalization
        optimizer.zero_grad()
        # History of losses
        self.loss_history = []
        # Repeat gradient descent procedure for N_max iterations
        for iteration_number in range(1, N_max + 1):
            # Forward pass
            # This is equivalent to pred = self.forward(inputs)
            pred = self(inputs)
            # Compute loss
            loss_val = self.loss(pred, outputs.to(torch.float64))
            self.loss_history.append(loss_val.item())

            # Backpropagate
            # Compute differentiation of loss with respect to all
            # parameters involved in the calculation that have a flag
            # requires_grad = True (that is W2, W1, b2 and b1)
            loss_val.backward()

            # Update all weights and optimizer step (will update the V 
            # and S parameters in Adam) all at once!
            optimizer.step()
            
            # Reset gradients to 0
            optimizer.zero_grad()
            
            # Display
            if(iteration_number % (N_max//20) == 1):
                # Compute accuracy for display
                acc_val = self.accuracy(pred, outputs)
                print("Iteration {} - Loss = {} - Accuracy = {}".format(iteration_number, loss_val.item(), acc_val.item()))

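Adam allows for a much faster convergence during training: in the run below, the loss at iteration 301 is about 0.024, compared to about 0.144 with vanilla gradient descent earlier.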

In [55]:
# Define a neural network structure
n_x = 2
n_h = 10
n_y = 1
np.random.seed(37)
shallow_neural_net_pt = ShallowNeuralNet_PT(n_x, n_h, n_y).to(device)
train_pred = shallow_neural_net_pt.train(train_inputs_pt, train_outputs_pt, N_max = 1001, \
                                         alpha = 1, beta1 = 0.9, beta2 = 0.999)

Iteration 1 - Loss = 0.6963595490783729 - Accuracy = 0.37400001287460327
Iteration 51 - Loss = 0.27495575265733646 - Accuracy = 0.8730000257492065
Iteration 101 - Loss = 0.25407400607104663 - Accuracy = 0.8659999966621399
Iteration 151 - Loss = 0.25298914524563143 - Accuracy = 0.8659999966621399
Iteration 201 - Loss = 0.22908243002852788 - Accuracy = 0.8849999904632568
Iteration 251 - Loss = 0.05799776142957951 - Accuracy = 0.9789999723434448
Iteration 301 - Loss = 0.023912304350808676 - Accuracy = 0.9959999918937683
Iteration 351 - Loss = 0.019605317364312973 - Accuracy = 0.9959999918937683
Iteration 401 - Loss = 0.01757331994992764 - Accuracy = 0.9959999918937683
Iteration 451 - Loss = 0.01629953531239529 - Accuracy = 0.9959999918937683
Iteration 501 - Loss = 0.015403979247254736 - Accuracy = 0.9959999918937683
Iteration 551 - Loss = 0.014727827297406149 - Accuracy = 0.9959999918937683
Iteration 601 - Loss = 0.014191901532099892 - Accuracy = 0.9959999918937683
Iteration 651 - Loss = 

In [56]:
# Check accuracy after training
acc = shallow_neural_net_pt.accuracy(shallow_neural_net_pt(train_inputs_pt), train_outputs_pt).item()
print(acc)

0.9959999918937683


### Stochastic Mini-Batches

We can also define Stochastic Mini-Batches as shown below. This is done with two modifications.

1. **Create a Dataloader** using the inputs and outputs provided to the train() method. We will learn more about these Dataset and Dataloader objects in the next notebook. For now, just consider that it allows us to conveniently zip the data into an object that is able to shuffle the data and randomly draw mini-batches for us.

```
# Create a PyTorch dataset object from the input and output data
dataset = torch.utils.data.TensorDataset(inputs, outputs)
# Create a PyTorch DataLoader object from the dataset, with the specified batch size
data_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
```

2. **Loop over the mini-batches of data**, instead of using the entire inputs/outputs at once, like in batch gradient descent.

```
# Loop over each mini-batch of data
for batch in data_loader:
    # Unpack the mini-batch data
    x_batch, y_batch = batch
    # Forward pass
    # This is equivalent to pred = self.forward(x_batch)
    pred = self(x_batch)
    # Compute loss
    loss_val = self.loss(pred, y_batch.to(torch.float64))
    self.loss_history.append(loss_val.item())
```
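
As a quick check of what the Dataloader produces, the minimal sketch below builds one on hypothetical random stand-in data (the tensors here are placeholders with the same shapes as our training set, not the actual dataset) and counts the mini-batches drawn in one pass.

```python
import math
import torch

# Hypothetical stand-in data: 1000 samples with 2 features and binary labels
n_points, batch_size = 1000, 32
inputs = torch.rand(n_points, 2, dtype=torch.float64)
outputs = torch.randint(0, 2, (n_points, 1))

dataset = torch.utils.data.TensorDataset(inputs, outputs)
data_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

# One pass over the loader (one epoch) visits every sample exactly once
batch_sizes = [x_batch.shape[0] for x_batch, y_batch in data_loader]

# Number of mini-batches per epoch, i.e. ceil(n_points / batch_size)
print(len(data_loader), math.ceil(n_points / batch_size))
# The last batch is smaller when batch_size does not divide n_points
print(batch_sizes[-1], sum(batch_sizes))
```

With 1000 samples and a batch size of 32, one pass over the data therefore performs 32 parameter updates instead of a single one.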

In [57]:
# Our class will inherit from the torch.nn.Module
# used to write all models in PyTorch
class ShallowNeuralNet_PT(torch.nn.Module):
    
    def __init__(self, n_x, n_h, n_y):
        # Super __init__ for inheritance
        super().__init__()
        
        # Network dimensions (as before)
        self.n_x = n_x
        self.n_h = n_h
        self.n_y = n_y
        
        # Initialize parameters using the torch.nn.Parameter type (a subclass of Tensors).
        # We immediately initialize the parameters using a random normal.
        # The RNG is done using torch.randn instead of the NumPy RNG.
        # We add a conversion into float64 (the same float type used by Numpy to generate our data)
        # And send them to our GPU/CPU device
        self.W1 = torch.nn.Parameter(torch.randn(n_x, n_h, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.b1 = torch.nn.Parameter(torch.randn(1, n_h, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.W2 = torch.nn.Parameter(torch.randn(n_h, n_y, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.b2 = torch.nn.Parameter(torch.randn(1, n_y, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.W1.retain_grad()
        self.b1.retain_grad()
        self.W2.retain_grad()
        self.b2.retain_grad()
        
        # Loss and accuracy functions
        self.loss = torch.nn.BCELoss()
        self.accuracy = BinaryAccuracy()
        
    def forward(self, inputs):
        # Instead of using np.matmul(), we use its equivalent in PyTorch,
        # which is torch.matmul()!
        # (Most numpy matrix operations have their equivalent in torch, check it out!)
        # Wx + b operation for the first layer
        Z1 = torch.matmul(inputs, self.W1)
        Z1_b = Z1 + self.b1
        # Sigmoid is already implemented in PyTorch, feel free to reuse it!
        A1 = torch.sigmoid(Z1_b)
        
        # Wx + b operation for the second layer
        # (Same as first layer)
        Z2 = torch.matmul(A1, self.W2)
        Z2_b = Z2 + self.b2
        y_pred = torch.sigmoid(Z2_b)
        return y_pred
    
    def train(self, inputs, outputs, N_max = 1000, alpha = 1, beta1 = 0.9, beta2 = 0.999, batch_size = 32):
        # Create a PyTorch dataset object from the input and output data
        dataset = torch.utils.data.TensorDataset(inputs, outputs)
        # Create a PyTorch DataLoader object from the dataset, with the specified batch size
        data_loader = torch.utils.data.DataLoader(dataset, batch_size = batch_size, shuffle = True)
    
        # Optimizer
        # You can use self.parameters() to get the list of parameters for the model
        # self.parameters() is therefore equivalent to [self.W1, self.b1, self.W2, self.b2]
        optimizer = torch.optim.Adam(self.parameters(), # Parameters to be updated by gradient rule
                                     lr = alpha, # Learning rate
                                     betas = (beta1, beta2), # Betas used in Adam rules for V and S
                                     eps = 1e-08) # Epsilon value used in normalization
        optimizer.zero_grad()
        # History of losses
        self.loss_history = []
        # Repeat gradient descent procedure for N_max iterations
        for iteration_number in range(1, N_max + 1):
            # Loop over each mini-batch of data
            for batch in data_loader:
                # Unpack the mini-batch data
                inputs_batch, outputs_batch = batch
                
                # Forward pass
                # This is equivalent to pred = self.forward(inputs_batch)
                pred = self(inputs_batch)
                # Compute loss
                loss_val = self.loss(pred, outputs_batch.to(torch.float64))
                self.loss_history.append(loss_val.item())

                # Backpropagate
                # Compute differentiation of loss with respect to all
                # parameters involved in the calculation that have a flag
                # requires_grad = True (that is W2, W1, b2 and b1)
                loss_val.backward()

                # Update all weights and optimizer step (will update the V 
                # and S parameters in Adam) all at once!
                optimizer.step()

                # Reset gradients to 0
                optimizer.zero_grad()
            
            # Display
            if(iteration_number % (N_max//20) == 1):
                # Compute accuracy for display
                pred = self(inputs)
                acc_val = self.accuracy(pred, outputs)
                print("Iteration {} - Loss = {} - Accuracy = {}".format(iteration_number, loss_val.item(), acc_val.item()))

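Using mini-batches allows for a much faster convergence in terms of passes over the dataset (only 250 iterations needed instead of 1000 before!). Note that each iteration is now a full pass over the data, which performs one parameter update per mini-batch instead of a single update.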

In [58]:
# Define a neural network structure
n_x = 2
n_h = 10
n_y = 1
np.random.seed(37)
shallow_neural_net_pt = ShallowNeuralNet_PT(n_x, n_h, n_y).to(device)
train_pred = shallow_neural_net_pt.train(train_inputs_pt, train_outputs_pt, N_max = 250, \
                                         alpha = 1, beta1 = 0.9, beta2 = 0.999, batch_size = 32)

Iteration 1 - Loss = 0.15743604845979645 - Accuracy = 0.8180000185966492
Iteration 13 - Loss = 0.02051996240903821 - Accuracy = 0.9729999899864197
Iteration 25 - Loss = 0.007127222721020081 - Accuracy = 0.9710000157356262
Iteration 37 - Loss = 0.005443105633340545 - Accuracy = 0.972000002861023
Iteration 49 - Loss = 0.013696338146403067 - Accuracy = 0.9879999756813049
Iteration 61 - Loss = 0.016204086614336637 - Accuracy = 0.9729999899864197
Iteration 73 - Loss = 0.008053626944567276 - Accuracy = 0.9769999980926514
Iteration 85 - Loss = 0.3933300796055345 - Accuracy = 0.9729999899864197
Iteration 97 - Loss = 0.001591107536451724 - Accuracy = 0.9800000190734863
Iteration 109 - Loss = 0.035385071610661935 - Accuracy = 0.984000027179718
Iteration 121 - Loss = 0.6749003095139646 - Accuracy = 0.9900000095367432
Iteration 133 - Loss = 0.0028621269205506275 - Accuracy = 0.9819999933242798
Iteration 145 - Loss = 0.004587757293009814 - Accuracy = 0.984000027179718
Iteration 157 - Loss = 0.00217

In [59]:
# Check accuracy after training
acc = shallow_neural_net_pt.accuracy(shallow_neural_net_pt(train_inputs_pt), train_outputs_pt).item()
print(acc)

0.953000009059906


### Adding a regularization term to the loss function

Another interesting concept is that we can add a regularization term to the loss function very easily. Not that it is necessary here, but if we wanted to, here is how we would do it.

Our first step would be to simply compute our regularization term by using the PyTorch functions, for instance the L1 regularization term, as follows:
```
L1_reg = lambda_val*sum(torch.abs(param).sum() for param in self.parameters())
```

We would then simply add it to the loss before backpropagating.

```
# Compute loss and regularization term
loss_val = self.loss(pred, outputs_batch.to(torch.float64))
L1_reg = lambda_val*sum(torch.abs(param).sum() for param in self.parameters())
total_loss = loss_val + L1_reg
self.loss_history.append(total_loss.item())

# Backpropagate
# Compute differentiation of loss with respect to all
# parameters involved in the calculation that have a flag
# requires_grad = True (that is W2, W1, b2 and b1)
total_loss.backward()
```

That is it.
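
To get a feel for what this extra term does during backpropagation, here is a minimal sketch on a single hypothetical parameter tensor (the values of `W` and `lambda_val` below are made up for illustration). It checks that the gradient of the L1 term is simply lambda times the sign of each weight.

```python
import torch

# Hypothetical toy parameter and regularization strength
lambda_val = 1e-3
W = torch.nn.Parameter(torch.tensor([[0.5, -2.0], [3.0, -0.1]], dtype=torch.float64))

# L1 regularization term for this single parameter
L1_reg = lambda_val * torch.abs(W).sum()
L1_reg.backward()

# The gradient of lambda*|w| is lambda*sign(w):
# a pull toward zero of the same size for every non-zero weight
expected_grad = lambda_val * torch.sign(W.detach())
print(W.grad)
print(expected_grad)
```

Since this pull does not shrink as a weight gets smaller, L1 regularization tends to push small weights all the way toward zero.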

In [81]:
# Our class will inherit from the torch.nn.Module
# used to write all models in PyTorch
class ShallowNeuralNet_PT(torch.nn.Module):
    
    def __init__(self, n_x, n_h, n_y):
        # Super __init__ for inheritance
        super().__init__()
        
        # Network dimensions (as before)
        self.n_x = n_x
        self.n_h = n_h
        self.n_y = n_y
        
        # Initialize parameters using the torch.nn.Parameter type (a subclass of Tensors).
        # We use xavier_uniform_ initialization, which fills an existing tensor in place.
        # Note: retain_grad() returns None, so its result must not be assigned to the attribute.
        W1 = torch.empty(n_x, n_h, dtype = torch.float64, device = device)
        torch.nn.init.xavier_uniform_(W1)
        self.W1 = torch.nn.Parameter(W1)
        b1 = torch.empty(1, n_h, dtype = torch.float64, device = device)
        torch.nn.init.xavier_uniform_(b1)
        self.b1 = torch.nn.Parameter(b1)
        W2 = torch.empty(n_h, n_y, dtype = torch.float64, device = device)
        torch.nn.init.xavier_uniform_(W2)
        self.W2 = torch.nn.Parameter(W2)
        b2 = torch.empty(1, n_y, dtype = torch.float64, device = device)
        torch.nn.init.xavier_uniform_(b2)
        self.b2 = torch.nn.Parameter(b2)
        
        # Loss and accuracy functions
        self.loss = torch.nn.BCELoss()
        self.accuracy = BinaryAccuracy()
        
    def forward(self, inputs):
        # Instead of using np.matmul(), we use its equivalent in PyTorch,
        # which is torch.matmul()!
        # (Most numpy matrix operations have their equivalent in torch, check it out!)
        # Wx + b operation for the first layer
        Z1 = torch.matmul(inputs, self.W1)
        Z1_b = Z1 + self.b1
        # Sigmoid is already implemented in PyTorch, feel free to reuse it!
        A1 = torch.sigmoid(Z1_b)
        
        # Wx + b operation for the second layer
        # (Same as first layer)
        Z2 = torch.matmul(A1, self.W2)
        Z2_b = Z2 + self.b2
        y_pred = torch.sigmoid(Z2_b)
        return y_pred
    
    def train(self, inputs, outputs, N_max = 1000, alpha = 1, beta1 = 0.9, beta2 = 0.999, \
              batch_size = 32, lambda_val = 1e-3):
        # Create a PyTorch dataset object from the input and output data
        dataset = torch.utils.data.TensorDataset(inputs, outputs)
        # Create a PyTorch DataLoader object from the dataset, with the specified batch size
        data_loader = torch.utils.data.DataLoader(dataset, batch_size = batch_size, shuffle = True)
    
        # Optimizer
        # You can use self.parameters() to get the list of parameters for the model
        # self.parameters() is therefore equivalent to [self.W1, self.b1, self.W2, self.b2]
        optimizer = torch.optim.Adam(self.parameters(), # Parameters to be updated by gradient rule
                                     lr = alpha, # Learning rate
                                     betas = (beta1, beta2), # Betas used in Adam rules for V and S
                                     eps = 1e-08) # Epsilon value used in normalization
        optimizer.zero_grad()
        # History of losses
        self.loss_history = []
        # Repeat gradient descent procedure for N_max iterations
        for iteration_number in range(1, N_max + 1):
            # Loop over each mini-batch of data
            for batch in data_loader:
                # Unpack the mini-batch data
                inputs_batch, outputs_batch = batch
                
                # Forward pass
                # This is equivalent to pred = self.forward(inputs_batch)
                pred = self(inputs_batch)
                # Compute loss and regularization term
                loss_val = self.loss(pred, outputs_batch.to(torch.float64))
                L1_reg = lambda_val*sum(torch.abs(param).sum() for param in self.parameters())
                total_loss = loss_val + L1_reg
                self.loss_history.append(total_loss.item())

                # Backpropagate
                # Compute differentiation of loss with respect to all
                # parameters involved in the calculation that have a flag
                # requires_grad = True (that is W2, W1, b2 and b1)
                total_loss.backward()

                # Update all weights and optimizer step (will update the V 
                # and S parameters in Adam) all at once!
                optimizer.step()

                # Reset gradients to 0
                optimizer.zero_grad()
            
            # Display
            if(iteration_number % (N_max//20) == 1):
                # Compute accuracy for display
                pred = self(inputs)
                acc_val = self.accuracy(pred, outputs)
                print("Iteration {} - Loss = {} - Accuracy = {}".format(iteration_number, \
                                                                        total_loss.item(), \
                                                                        acc_val.item()))

Regularization has an impact on training. Feel free to play with the value of lambda_val to see its effect on training!

In [82]:
# Define a neural network structure
n_x = 2
n_h = 10
n_y = 1
np.random.seed(37)
shallow_neural_net_pt = ShallowNeuralNet_PT(n_x, n_h, n_y).to(device)
train_pred = shallow_neural_net_pt.train(train_inputs_pt, train_outputs_pt, N_max = 250, \
                                         alpha = 1, beta1 = 0.9, beta2 = 0.999, batch_size = 32, lambda_val = 1e-5)

In [71]:
# Check accuracy after training
acc = shallow_neural_net_pt.accuracy(shallow_neural_net_pt(train_inputs_pt), train_outputs_pt).item()
print(acc)

0.9769999980926514


### Adding initializers to the model

Finally, the parameter initialization part below is a bit tedious.
```
# Initialize parameters using the torch.nn.Parameter type (a subclass of Tensors).
# We immediately initialize the parameters using a random normal.
# The RNG is done using torch.randn instead of the NumPy RNG.
# We add a conversion into float64 (the same float type used by Numpy to generate our data)
# And send them to our GPU/CPU device
self.W1 = torch.nn.Parameter(torch.randn(n_x, n_h, requires_grad = True, \
                             dtype = torch.float64, device = device)*0.1)
self.b1 = torch.nn.Parameter(torch.randn(1, n_h, requires_grad = True, \
                             dtype = torch.float64, device = device)*0.1)
self.W2 = torch.nn.Parameter(torch.randn(n_h, n_y, requires_grad = True, \
                             dtype = torch.float64, device = device)*0.1)
self.b2 = torch.nn.Parameter(torch.randn(1, n_y, requires_grad = True, \
                             dtype = torch.float64, device = device)*0.1)
self.W1.retain_grad()
self.b1.retain_grad()
self.W2.retain_grad()
self.b2.retain_grad()
```

We would love to replace it with something a bit simpler, which allows the user to choose how to initialize said parameters (by using Xavier, LeCun, He, etc.).

This can be done by using functions from the torch.nn.init module, e.g. xavier_uniform_().

```
# Initialize parameters using the torch.nn.Parameter type (a subclass of Tensors).
# We use xavier_uniform_ initialization, which fills an existing tensor in place.
# Note: retain_grad() returns None, so its result must not be assigned to the attribute.
W1 = torch.empty(n_x, n_h, dtype = torch.float64, device = device)
torch.nn.init.xavier_uniform_(W1)
self.W1 = torch.nn.Parameter(W1)
b1 = torch.empty(1, n_h, dtype = torch.float64, device = device)
torch.nn.init.xavier_uniform_(b1)
self.b1 = torch.nn.Parameter(b1)
W2 = torch.empty(n_h, n_y, dtype = torch.float64, device = device)
torch.nn.init.xavier_uniform_(W2)
self.W2 = torch.nn.Parameter(W2)
b2 = torch.empty(1, n_y, dtype = torch.float64, device = device)
torch.nn.init.xavier_uniform_(b2)
self.b2 = torch.nn.Parameter(b2)
```

Note that the exact initialization API may evolve in later versions of PyTorch, so stay tuned! For additional initializers, have a look at: https://pytorch.org/cppdocs/api/file_torch_csrc_api_include_torch_nn_init.h.html#file-torch-csrc-api-include-torch-nn-init-h
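The idea of letting the user choose the initializer can be sketched as follows. This is only a minimal illustration, not the notebook's implementation: the `INITIALIZERS` dictionary and the `make_parameter()` helper are hypothetical names introduced here, and only initializers that exist in torch.nn.init are listed (LeCun initialization has no dedicated function there, so it is left out).

```python
import math
import torch

# Hypothetical lookup table (not part of the notebook) mapping a name
# to an in-place initializer from torch.nn.init
INITIALIZERS = {
    "xavier_uniform": torch.nn.init.xavier_uniform_,
    "xavier_normal": torch.nn.init.xavier_normal_,
    "he_uniform": torch.nn.init.kaiming_uniform_,
    "he_normal": torch.nn.init.kaiming_normal_,
}

def make_parameter(shape, init_name = "xavier_uniform", device = "cpu"):
    # Hypothetical helper: allocate a float64 tensor, fill it in place
    # with the chosen initializer, then wrap it as a trainable Parameter
    tensor = torch.empty(shape, dtype = torch.float64, device = device)
    INITIALIZERS[init_name](tensor)
    return torch.nn.Parameter(tensor)

torch.manual_seed(0)
n_x, n_h = 2, 10
W1 = make_parameter((n_x, n_h), "xavier_uniform")

# Xavier uniform draws from U(-a, a), with a = sqrt(6/(fan_in + fan_out))
bound = math.sqrt(6/(n_x + n_h))
print(W1.shape, W1.requires_grad, W1.abs().max().item() <= bound)
```

Inside `__init__`, each of the four parameters could then be created with a single call such as `self.W1 = make_parameter((n_x, n_h), init_name, device)`, assuming an extra `init_name` argument is added to the constructor.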

In [83]:
# Our class will inherit from the torch.nn.Module
# used to write all models in PyTorch
class ShallowNeuralNet_PT(torch.nn.Module):
    
    def __init__(self, n_x, n_h, n_y):
        # Super __init__ for inheritance
        super().__init__()
        
        # Network dimensions (as before)
        self.n_x = n_x
        self.n_h = n_h
        self.n_y = n_y
        
        # Initialize parameters using the torch.nn.Parameter type (a subclass of Tensors).
        # We immediately initialize the parameters using a random normal.
        # The RNG is done using torch.randn instead of the NumPy RNG.
        # We add a conversion into float64 (the same float type used by Numpy to generate our data)
        # And send them to our GPU/CPU device
        self.W1 = torch.nn.Parameter(torch.randn(n_x, n_h, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.b1 = torch.nn.Parameter(torch.randn(1, n_h, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.W2 = torch.nn.Parameter(torch.randn(n_h, n_y, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.b2 = torch.nn.Parameter(torch.randn(1, n_y, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.W1.retain_grad()
        self.b1.retain_grad()
        self.W2.retain_grad()
        self.b2.retain_grad()
        
        # Loss and accuracy functions
        self.loss = torch.nn.BCELoss()
        self.accuracy = BinaryAccuracy()
        
    def forward(self, inputs):
        # Instead of using np.matmul(), we use its equivalent in PyTorch,
        # which is torch.matmul()!
        # (Most numpy matrix operations have their equivalent in torch, check it out!)
        # Wx + b operation for the first layer
        Z1 = torch.matmul(inputs, self.W1)
        Z1_b = Z1 + self.b1
        # Sigmoid is already implemented in PyTorch, feel free to reuse it!
        A1 = torch.sigmoid(Z1_b)
        
        # Wx + b operation for the second layer
        # (Same as first layer)
        Z2 = torch.matmul(A1, self.W2)
        Z2_b = Z2 + self.b2
        y_pred = torch.sigmoid(Z2_b)
        return y_pred
    
    def train(self, inputs, outputs, N_max = 1000, alpha = 1, beta1 = 0.9, beta2 = 0.999, \
              batch_size = 32, lambda_val = 1e-3):
        # Create a PyTorch dataset object from the input and output data
        dataset = torch.utils.data.TensorDataset(inputs, outputs)
        # Create a PyTorch DataLoader object from the dataset, with the specified batch size
        data_loader = torch.utils.data.DataLoader(dataset, batch_size = batch_size, shuffle = True)
    
        # Optimizer
        # You can use self.parameters() to iterate over the parameters of the model
        # Here, it yields the same tensors as [self.W1, self.b1, self.W2, self.b2]
        optimizer = torch.optim.Adam(self.parameters(), # Parameters to be updated by gradient rule
                                     lr = alpha, # Learning rate
                                     betas = (beta1, beta2), # Betas used in Adam rules for V and S
                                     eps = 1e-08) # Epsilon value used in normalization
        optimizer.zero_grad()
        # History of losses
        self.loss_history = []
        # Repeat gradient descent procedure for N_max iterations
        for iteration_number in range(1, N_max + 1):
            # Loop over each mini-batch of data
            for batch in data_loader:
                # Unpack the mini-batch data
                inputs_batch, outputs_batch = batch
                
                # Forward pass
                # This is equivalent to pred = self.forward(inputs_batch)
                pred = self(inputs_batch)
                # Compute loss and regularization term
                loss_val = self.loss(pred, outputs_batch.to(torch.float64))
                L1_reg = lambda_val*sum(torch.abs(param).sum() for param in self.parameters())
                total_loss = loss_val + L1_reg
                self.loss_history.append(total_loss.item())

                # Backpropagate
                # Compute differentiation of loss with respect to all
                # parameters involved in the calculation that have a flag
                # requires_grad = True (that is W2, W1, b2 and b1)
                total_loss.backward()

                # Update all weights and optimizer step (will update the V 
                # and S parameters in Adam) all at once!
                optimizer.step()

                # Reset gradients to 0
                optimizer.zero_grad()
            
            # Display
            if(iteration_number % (N_max//20) == 1):
                # Compute accuracy for display
                pred = self(inputs)
                acc_val = self.accuracy(pred, outputs)
                print("Iteration {} - Loss = {} - Accuracy = {}".format(iteration_number, \
                                                                        total_loss.item(), \
                                                                        acc_val.item()))

Regularization has an impact on training. Feel free to play with the value of lambda to see its effect on training!
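To get a feel for what lambda_val does, the sketch below computes the same L1 term as in the training loop, but on two small hand-written tensors instead of the model parameters (these values are illustrative assumptions, not taken from the trained network). It also shows that the gradient contributed by the penalty is lambda_val times the sign of each parameter, which is what pushes weights towards zero.

```python
import torch

# Two small hand-written parameters, standing in for the model parameters
params = [torch.nn.Parameter(torch.tensor([[1.0, -2.0], [0.5, 0.0]], dtype = torch.float64)),
          torch.nn.Parameter(torch.tensor([[-0.5, 3.0]], dtype = torch.float64))]

# Sum of absolute values over all parameters: 1 + 2 + 0.5 + 0 + 0.5 + 3 = 7
l1_norm = sum(torch.abs(p).sum() for p in params)

# The penalty added to the loss scales linearly with lambda_val
for lambda_val in [1e-5, 1e-3, 1e-1]:
    penalty = lambda_val*l1_norm
    print("lambda_val = {} - L1 penalty = {}".format(lambda_val, penalty.item()))

# The gradient of the penalty w.r.t. each parameter is lambda_val*sign(param)
penalty = 1e-3*sum(torch.abs(p).sum() for p in params)
penalty.backward()
print(params[0].grad)
```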

In [70]:
# Define a neural network structure
n_x = 2
n_h = 10
n_y = 1
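# Note: this seeds NumPy only. The parameters are drawn with torch.randn,
# so torch.manual_seed() would be needed to make the initialization reproducible.
np.random.seed(37)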
shallow_neural_net_pt = ShallowNeuralNet_PT(n_x, n_h, n_y).to(device)
train_pred = shallow_neural_net_pt.train(train_inputs_pt, train_outputs_pt, N_max = 250, \
                                         alpha = 1, beta1 = 0.9, beta2 = 0.999, batch_size = 32, lambda_val = 1e-5)

Iteration 1 - Loss = 0.08475216810090405 - Accuracy = 0.8619999885559082
Iteration 13 - Loss = 0.05901462414011047 - Accuracy = 0.9769999980926514
Iteration 25 - Loss = 0.016450984651784465 - Accuracy = 0.9829999804496765
Iteration 37 - Loss = 0.008222233924491921 - Accuracy = 0.984000027179718
Iteration 49 - Loss = 0.13328321891314257 - Accuracy = 0.9800000190734863
Iteration 61 - Loss = 0.009147019171303892 - Accuracy = 0.9810000061988831
Iteration 73 - Loss = 0.44096735252082253 - Accuracy = 0.9769999980926514
Iteration 85 - Loss = 0.009082679657959317 - Accuracy = 0.9819999933242798
Iteration 97 - Loss = 0.07022150526787058 - Accuracy = 0.972000002861023
Iteration 109 - Loss = 0.007796034371179035 - Accuracy = 0.9729999899864197
Iteration 121 - Loss = 0.03185199986167074 - Accuracy = 0.984000027179718
Iteration 133 - Loss = 0.10836565844049802 - Accuracy = 0.9269999861717224
Iteration 145 - Loss = 0.03817443400090914 - Accuracy = 0.9559999704360962
Iteration 157 - Loss = 0.01027958

In [71]:
# Check accuracy after training
acc = shallow_neural_net_pt.accuracy(shallow_neural_net_pt(train_inputs_pt), train_outputs_pt).item()
print(acc)

0.9769999980926514
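The loss values printed during training are those of the last mini-batch of each iteration, which is one reason why they fluctuate so much from one display to the next. A smoother picture can be obtained by averaging the entries of loss_history over each pass on the data. The sketch below assumes a hypothetical helper `epoch_mean_losses()` and uses a synthetic list of random values in place of the real loss_history, so the printed numbers carry no meaning beyond illustrating the reshaping.

```python
import math
import numpy as np

def epoch_mean_losses(loss_history, n_samples, batch_size):
    # Number of mini-batches per pass over the data (the last one may be smaller)
    batches_per_epoch = math.ceil(n_samples/batch_size)
    # Keep only complete passes, then average the per-batch losses of each pass
    n_epochs = len(loss_history)//batches_per_epoch
    losses = np.array(loss_history[:n_epochs*batches_per_epoch])
    return losses.reshape(n_epochs, batches_per_epoch).mean(axis = 1)

# Synthetic stand-in for shallow_neural_net_pt.loss_history (random values)
rng = np.random.default_rng(0)
n_samples, batch_size, n_epochs = 1000, 32, 5
batches_per_epoch = math.ceil(n_samples/batch_size)
fake_history = list(rng.uniform(0, 1, size = n_epochs*batches_per_epoch))

mean_losses = epoch_mean_losses(fake_history, n_samples, batch_size)
print(mean_losses)
```

This is an unweighted mean over mini-batches, so the smaller last batch counts as much as the others; for display purposes this is usually good enough.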


### What's next?

This concludes our PyTorch basics notebooks on how to implement a Shallow Neural Network in PyTorch for binary classification.

In the next notebooks, we will investigate a different task, which will introduce more advanced concepts that are variations of the current ones: Deep Neural Networks and multi-class classification. We will investigate this in a guided project approach.