# 5. PyTorch Basics - Advanced Backprop & Optimizers, Regularization, Initializers

### About this notebook

This notebook was used in the 50.039 Deep Learning course at the Singapore University of Technology and Design.

**Author:** Matthieu DE MARI (matthieu_demari@sutd.edu.sg)

**Version:** 1.2 (22/06/2023)

**Requirements:**
- Python 3 (tested on v3.11.4)
- Matplotlib (tested on v3.7.1)
- Numpy (tested on v1.24.3)
- Torch (tested on v2.0.1+cu118)
- Torchmetrics (tested on v0.11.4)

### Imports and CUDA

In [1]:
# Matplotlib
import matplotlib.pyplot as plt
# Numpy
import numpy as np
# Torch
import torch
import torch.nn as nn
from torchmetrics.classification import BinaryAccuracy

In [2]:
# Use GPU if available, else use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


### Mock dataset, with nonlinearity

As in the previous notebooks, we will reuse our nonlinear binary classification mock dataset and generate a training set with 1000 samples.

In [3]:
# All helper functions
eps = 1e-5
min_val = -1 + eps
max_val = 1 - eps
def val(min_val, max_val):
    return round(np.random.uniform(min_val, max_val), 2)
def class_for_val(val1, val2):
    k = np.pi
    return int(val2 >= -1/4 + 3/4*np.sin(val1*k))
def create_dataset(n_points, min_val, max_val):
    val1_list = np.array([val(min_val, max_val) for _ in range(n_points)])
    val2_list = np.array([val(min_val, max_val) for _ in range(n_points)])
    inputs = np.array([[v1, v2] for v1, v2 in zip(val1_list, val2_list)])
    outputs = np.array([class_for_val(v1, v2) for v1, v2 in zip(val1_list, val2_list)]).reshape(n_points, 1)
    return val1_list, val2_list, inputs, outputs

In [4]:
# Generate dataset (train)
np.random.seed(47)
n_points = 1000
train_val1_list, train_val2_list, train_inputs, train_outputs = create_dataset(n_points, min_val, max_val)

In [5]:
# Convert to tensors and send to device (CUDA or CPU)
train_inputs_pt = torch.from_numpy(train_inputs).to(device)
train_outputs_pt = torch.from_numpy(train_outputs).to(device)

### Using Torch loss and accuracy functions for simplicity

In order to make our model even simpler, we will use the loss functions and evaluation functions from PyTorch.

Our **CE_loss()** and **accuracy()** methods will therefore be replaced with the **nn.BCELoss()** function and the **BinaryAccuracy()** functions.

Feel free to have a look at the loss functions available in PyTorch, here: https://pytorch.org/docs/stable/nn.html#loss-functions.

Torchmetrics also provides a few functions, ready to use with PyTorch: https://torchmetrics.readthedocs.io/en/stable/all-metrics.html

In [6]:
# Our class will inherit from the torch.nn.Module
# used to write all model in PyTorch
class ShallowNeuralNet_PT(torch.nn.Module):
    
    def __init__(self, n_x, n_h, n_y):
        # Super __init__ for inheritance
        super().__init__()
        
        # Network dimensions (as before)
        self.n_x = n_x
        self.n_h = n_h
        self.n_y = n_y
        
        # Initialize parameters using the torch.nn.Parameter type (a subclass of Tensors).
        # We immediatly initialize the parameters using a random normal.
        # The RNG is done using torch.randn instead of the NumPy RNG.
        # We add a conversion into float64 (the same float type used by Numpy to generate our data)
        # And send them to our GPU/CPU device
        self.W1 = torch.nn.Parameter(torch.randn(n_x, n_h, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.b1 = torch.nn.Parameter(torch.randn(1, n_h, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.W2 = torch.nn.Parameter(torch.randn(n_h, n_y, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.b2 = torch.nn.Parameter(torch.randn(1, n_y, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.W1.retain_grad()
        self.b1.retain_grad()
        self.W2.retain_grad()
        self.b2.retain_grad()
        
        # Loss and accuracy functions
        self.loss = torch.nn.BCELoss()
        self.accuracy = BinaryAccuracy()
        
    def forward(self, inputs):
        # Instead of using np.matmul(), we use its equivalent in PyTorch,
        # which is torch.matmul()!
        # (Most numpy matrix operations ahve their equivalent in torch, check it out!)
        # Wx + b operation for the first layer
        Z1 = torch.matmul(inputs, self.W1)
        Z1_b = Z1 + self.b1
        # Sigmoid is already implemented in PyTorch, feel fre to reuse it!
        A1 = torch.sigmoid(Z1_b)
        
        # Wx + b operation for the second layer
        # (Same as first layer)
        Z2 = torch.matmul(A1, self.W2)
        Z2_b = Z2 + self.b2
        y_pred = torch.sigmoid(Z2_b)
        return y_pred
    
    def train(self, inputs, outputs, N_max = 1000, alpha = 1):
        # History of losses
        self.loss_history = []
        # Repeat gradient descent procedure for N_max iterations
        for iteration_number in range(1, N_max + 1):
            # Forward pass
            # This is equivalent to pred = self.forward(inputs)
            pred = self(inputs)
            # Compute loss
            loss_val = self.loss(pred, outputs.to(torch.float64))
            self.loss_history.append(loss_val.item())

            # Backpropagate
            # Compute differentiation of loss with respect to all
            # parameters involved in the calculation that have a flag
            # requires_grad = True (that is W2, W1, b2 and b1)
            loss_val.backward()

            # Update all weights
            # Note that this operation should not be tracked for gradients,
            # hence the torch.no_grad()!
            with torch.no_grad():
                self.W1 -= alpha*self.W1.grad
                self.W2 -= alpha*self.W2.grad
                self.b1 -= alpha*self.b1.grad
                self.b2 -= alpha*self.b2.grad

            # Reset gradients to 0
            self.W1.grad.zero_()
            self.W2.grad.zero_()
            self.b1.grad.zero_()
            self.W2.grad.zero_()
            
            # Display
            if(iteration_number % (N_max//20) == 1):
                # Compute accuracy for display
                acc_val = self.accuracy(pred, outputs)
                print("Iteration {} - Loss = {} - Accuracy = {}".format(iteration_number, \
                                                                        loss_val.item(), \
                                                                        acc_val.item()))

In [7]:
# Define a neural network structure
n_x = 2
n_h = 10
n_y = 1
np.random.seed(37)
shallow_neural_net_pt = ShallowNeuralNet_PT(n_x, n_h, n_y).to(device)
train_pred = shallow_neural_net_pt.train(train_inputs_pt, train_outputs_pt, N_max = 1001, alpha = 5)

Iteration 1 - Loss = 0.6846828376506631 - Accuracy = 0.6259999871253967
Iteration 51 - Loss = 0.26970442606825384 - Accuracy = 0.8740000128746033
Iteration 101 - Loss = 0.26588161768039037 - Accuracy = 0.8740000128746033
Iteration 151 - Loss = 0.24665687982057477 - Accuracy = 0.8759999871253967
Iteration 201 - Loss = 0.2012575848165803 - Accuracy = 0.9100000262260437
Iteration 251 - Loss = 0.17043867978016575 - Accuracy = 0.9179999828338623
Iteration 301 - Loss = 0.15055741754257038 - Accuracy = 0.9300000071525574
Iteration 351 - Loss = 0.1360626253532511 - Accuracy = 0.9449999928474426
Iteration 401 - Loss = 0.12446396702287565 - Accuracy = 0.9549999833106995
Iteration 451 - Loss = 0.11453266013374971 - Accuracy = 0.9610000252723694
Iteration 501 - Loss = 0.10562308618115626 - Accuracy = 0.9660000205039978
Iteration 551 - Loss = 0.09743487771921175 - Accuracy = 0.968999981880188
Iteration 601 - Loss = 0.08985124628490582 - Accuracy = 0.9729999899864197
Iteration 651 - Loss = 0.0828610

In [8]:
# Check accuracy after training
acc = shallow_neural_net_pt.accuracy(shallow_neural_net_pt(train_inputs_pt), train_outputs_pt).item()
print(acc)

0.9890000224113464


### Advanced backpropagation and optimizers

We can also define some advanced optimizers, e.g. Adam, as below.

Feel free to have a look at all the available optimizers, here: https://pytorch.org/docs/stable/optim.html

Three modifications are to be considered to use Adam instead of the Vanilla gradient descent rule.

1. **Adam** has been added as an optimizer and its parameters can be passed to the train() method.

```
# Optimizer
# You can use self.parameters() to get the list of parameters for the model
# self.parameters() is therefore equivalent to [self.W1, self.b1, self.W2, self.b2]
optimizer = torch.optim.Adam(self.parameters(), # Parameters to be updated by gradient rule
                                 lr = alpha, # Learning rate
                                 betas = (beta1, beta2), # Betas used in Adam rules for V and S
                                 eps = 1e-08) # Epsilon value used in normalization
optimizer.zero_grad()
```

2. **Optimizer step** must be performed to update the V and S parameters in Adam. This also replaces the gradient rule update entirely (damn!).

```
# Update all weights and optimizer step (will update the V 
# and S parameters in Adam) all at once!
optimizer.step()
```

3. **Reset gradients in optimizer to 0**, like you would in the parameters tensors. This replaces all four self.Xx.grad.zero_() operations!
```
# Reset gradients to 0
optimizer.zero_grad()
```        

In [9]:
# Our class will inherit from the torch.nn.Module
# used to write all model in PyTorch
class ShallowNeuralNet_PT(torch.nn.Module):
    
    def __init__(self, n_x, n_h, n_y):
        # Super __init__ for inheritance
        super().__init__()
        
        # Network dimensions (as before)
        self.n_x = n_x
        self.n_h = n_h
        self.n_y = n_y
        
        # Initialize parameters using the torch.nn.Parameter type (a subclass of Tensors).
        # We immediatly initialize the parameters using a random normal.
        # The RNG is done using torch.randn instead of the NumPy RNG.
        # We add a conversion into float64 (the same float type used by Numpy to generate our data)
        # And send them to our GPU/CPU device
        self.W1 = torch.nn.Parameter(torch.randn(n_x, n_h, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.b1 = torch.nn.Parameter(torch.randn(1, n_h, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.W2 = torch.nn.Parameter(torch.randn(n_h, n_y, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.b2 = torch.nn.Parameter(torch.randn(1, n_y, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.W1.retain_grad()
        self.b1.retain_grad()
        self.W2.retain_grad()
        self.b2.retain_grad()
        
        # Loss and accuracy functions
        self.loss = torch.nn.BCELoss()
        self.accuracy = BinaryAccuracy()
        
    def forward(self, inputs):
        # Instead of using np.matmul(), we use its equivalent in PyTorch,
        # which is torch.matmul()!
        # (Most numpy matrix operations ahve their equivalent in torch, check it out!)
        # Wx + b operation for the first layer
        Z1 = torch.matmul(inputs, self.W1)
        Z1_b = Z1 + self.b1
        # Sigmoid is already implemented in PyTorch, feel fre to reuse it!
        A1 = torch.sigmoid(Z1_b)
        
        # Wx + b operation for the second layer
        # (Same as first layer)
        Z2 = torch.matmul(A1, self.W2)
        Z2_b = Z2 + self.b2
        y_pred = torch.sigmoid(Z2_b)
        return y_pred
    
    def train(self, inputs, outputs, N_max = 1000, alpha = 1, beta1 = 0.9, beta2 = 0.999):
        # Optimizer
        # You can use self.parameters() to get the list of parameters for the model
        # self.parameters() is therefore equivalent to [self.W1, self.b1, self.W2, self.b2]
        optimizer = torch.optim.Adam(self.parameters(), # Parameters to be updated by gradient rule
                                     lr = alpha, # Learning rate
                                     betas = (beta1, beta2), # Betas used in Adam rules for V and S
                                     eps = 1e-08) # Epsilon value used in normalization
        optimizer.zero_grad()
        
        # History of losses
        self.loss_history = []
        
        # Repeat gradient descent procedure for N_max iterations
        for iteration_number in range(1, N_max + 1):
            # Forward pass
            # This is equivalent to pred = self.forward(inputs)
            pred = self(inputs)
            # Compute loss
            loss_val = self.loss(pred, outputs.to(torch.float64))
            self.loss_history.append(loss_val.item())

            # Backpropagate
            # Compute differentiation of loss with respect to all
            # parameters involved in the calculation that have a flag
            # requires_grad = True (that is W2, W1, b2 and b1)
            loss_val.backward()

            # Update all weights and optimizer step (will update the V 
            # and S parameters in Adam) all at once!
            optimizer.step()
            
            # Reset gradients to 0
            optimizer.zero_grad()
            
            # Display
            if(iteration_number % (N_max//20) == 1):
                # Compute accuracy for display
                acc_val = self.accuracy(pred, outputs)
                print("Iteration {} - Loss = {} - Accuracy = {}".format(iteration_number, \
                                                                        loss_val.item(), \
                                                                        acc_val.item()))

Adam allows for a much faster convergence during training now!

In [10]:
# Define a neural network structure
n_x = 2
n_h = 10
n_y = 1
np.random.seed(37)
shallow_neural_net_pt = ShallowNeuralNet_PT(n_x, n_h, n_y).to(device)
train_pred = shallow_neural_net_pt.train(train_inputs_pt, train_outputs_pt, N_max = 1001, \
                                         alpha = 1, beta1 = 0.9, beta2 = 0.999)

Iteration 1 - Loss = 0.7256579630424042 - Accuracy = 0.37400001287460327
Iteration 51 - Loss = 0.30131781315086026 - Accuracy = 0.859000027179718
Iteration 101 - Loss = 0.11356165433273784 - Accuracy = 0.9589999914169312
Iteration 151 - Loss = 0.0985226010264138 - Accuracy = 0.9639999866485596
Iteration 201 - Loss = 0.0563414400665118 - Accuracy = 0.9860000014305115
Iteration 251 - Loss = 0.023382629994158754 - Accuracy = 0.9940000176429749
Iteration 301 - Loss = 0.019522731813001487 - Accuracy = 0.9959999918937683
Iteration 351 - Loss = 0.016657622513671776 - Accuracy = 0.9980000257492065
Iteration 401 - Loss = 0.013820511464043868 - Accuracy = 1.0
Iteration 451 - Loss = 0.012054448371668235 - Accuracy = 1.0
Iteration 501 - Loss = 0.010806482608849734 - Accuracy = 1.0
Iteration 551 - Loss = 0.009854259763919498 - Accuracy = 1.0
Iteration 601 - Loss = 0.009092534416311786 - Accuracy = 1.0
Iteration 651 - Loss = 0.008462483956728852 - Accuracy = 1.0
Iteration 701 - Loss = 0.007928444747

In [11]:
# Check accuracy after training
acc = shallow_neural_net_pt.accuracy(shallow_neural_net_pt(train_inputs_pt), train_outputs_pt).item()
print(acc)

1.0


### Stochastic Mini-Batches

We can also define Stochastic Mini-Batches as shown below. This is done with two modifications.

1. **Create a Dataloader** using the inputs and outputs provided in the train(). We will leanr more about these Dataset and Dataloader objects in the next notebook. For now, just consider that it allows to conveniently zip the data in an object that is able to shuffle and draw randomly mini-batches of data for us.

```
# Create a PyTorch dataset object from the input and output data
dataset = torch.utils.data.TensorDataset(inputs, outputs)
# Create a PyTorch DataLoader object from the dataset, with the specified batch size
data_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
```

2. **Loop over the mini-batches of data**, instead of using the entire inputs/outputs at once, like in batch gradient descent.

```
# Loop over each mini-batch of data
for batch in data_loader:
    # Unpack the mini-batch data
    x_batch, y_batch = batch
    # Forward pass
    # This is equivalent to pred = self.forward(inputs)
    pred = self(x_batch)
    # Compute loss
    loss_val = self.loss(pred, y_batch.to(torch.float64))
    self.loss_history.append(loss_val.item())
```

In [12]:
# Our class will inherit from the torch.nn.Module
# used to write all model in PyTorch
class ShallowNeuralNet_PT(torch.nn.Module):
    
    def __init__(self, n_x, n_h, n_y):
        # Super __init__ for inheritance
        super().__init__()
        
        # Network dimensions (as before)
        self.n_x = n_x
        self.n_h = n_h
        self.n_y = n_y
        
        # Initialize parameters using the torch.nn.Parameter type (a subclass of Tensors).
        # We immediatly initialize the parameters using a random normal.
        # The RNG is done using torch.randn instead of the NumPy RNG.
        # We add a conversion into float64 (the same float type used by Numpy to generate our data)
        # And send them to our GPU/CPU device
        self.W1 = torch.nn.Parameter(torch.randn(n_x, n_h, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.b1 = torch.nn.Parameter(torch.randn(1, n_h, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.W2 = torch.nn.Parameter(torch.randn(n_h, n_y, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.b2 = torch.nn.Parameter(torch.randn(1, n_y, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.W1.retain_grad()
        self.b1.retain_grad()
        self.W2.retain_grad()
        self.b2.retain_grad()
        
        # Loss and accuracy functions
        self.loss = torch.nn.BCELoss()
        self.accuracy = BinaryAccuracy()
        
    def forward(self, inputs):
        # Instead of using np.matmul(), we use its equivalent in PyTorch,
        # which is torch.matmul()!
        # (Most numpy matrix operations ahve their equivalent in torch, check it out!)
        # Wx + b operation for the first layer
        Z1 = torch.matmul(inputs, self.W1)
        Z1_b = Z1 + self.b1
        # Sigmoid is already implemented in PyTorch, feel fre to reuse it!
        A1 = torch.sigmoid(Z1_b)
        
        # Wx + b operation for the second layer
        # (Same as first layer)
        Z2 = torch.matmul(A1, self.W2)
        Z2_b = Z2 + self.b2
        y_pred = torch.sigmoid(Z2_b)
        return y_pred
    
    def train(self, inputs, outputs, N_max = 1000, alpha = 1, beta1 = 0.9, beta2 = 0.999, batch_size = 32):
        # Create a PyTorch dataset object from the input and output data
        dataset = torch.utils.data.TensorDataset(inputs, outputs)
        # Create a PyTorch DataLoader object from the dataset, with the specified batch size
        data_loader = torch.utils.data.DataLoader(dataset, batch_size = batch_size, shuffle = True)
    
        # Optimizer
        # You can use self.parameters() to get the list of parameters for the model
        # self.parameters() is therefore equivalent to [self.W1, self.b1, self.W2, self.b2]
        optimizer = torch.optim.Adam(self.parameters(), # Parameters to be updated by gradient rule
                                     lr = alpha, # Learning rate
                                     betas = (beta1, beta2), # Betas used in Adam rules for V and S
                                     eps = 1e-08) # Epsilon value used in normalization
        optimizer.zero_grad()
        
        # History of losses
        self.loss_history = []
        
        # Repeat gradient descent procedure for N_max iterations
        for iteration_number in range(1, N_max + 1):
            # Loop over each mini-batch of data
            for batch in data_loader:
                # Unpack the mini-batch data
                inputs_batch, outputs_batch = batch
                
                # Forward pass
                # This is equivalent to pred = self.forward(inputs)
                pred = self(inputs_batch)
                # Compute loss
                loss_val = self.loss(pred, outputs_batch.to(torch.float64))
                self.loss_history.append(loss_val.item())

                # Backpropagate
                # Compute differentiation of loss with respect to all
                # parameters involved in the calculation that have a flag
                # requires_grad = True (that is W2, W1, b2 and b1)
                loss_val.backward()

                # Update all weights and optimizer step (will update the V 
                # and S parameters in Adam) all at once!
                optimizer.step()

                # Reset gradients to 0
                optimizer.zero_grad()
            
            # Display
            if(iteration_number % (N_max//20) == 1):
                # Compute accuracy for display
                pred = self(inputs)
                acc_val = self.accuracy(pred, outputs)
                print("Iteration {} - Loss = {} - Accuracy = {}".format(iteration_number, loss_val.item(), acc_val.item()))

Using mini-batches allows for a much faster convergence (only 250 iterations needed instead of 1000 before!).

In [13]:
# Define a neural network structure
n_x = 2
n_h = 10
n_y = 1
np.random.seed(37)
shallow_neural_net_pt = ShallowNeuralNet_PT(n_x, n_h, n_y).to(device)
train_pred = shallow_neural_net_pt.train(train_inputs_pt, train_outputs_pt, N_max = 250, \
                                         alpha = 1, beta1 = 0.9, beta2 = 0.999, batch_size = 32)

Iteration 1 - Loss = 0.14650448640410285 - Accuracy = 0.8679999709129333
Iteration 13 - Loss = 0.2175808240729715 - Accuracy = 0.8889999985694885
Iteration 25 - Loss = 0.03749913033918094 - Accuracy = 0.9620000123977661
Iteration 37 - Loss = 0.2670290737263268 - Accuracy = 0.9549999833106995
Iteration 49 - Loss = 0.013469152838245362 - Accuracy = 0.9549999833106995
Iteration 61 - Loss = 0.10986057991961877 - Accuracy = 0.9309999942779541
Iteration 73 - Loss = 0.09949791248444503 - Accuracy = 0.9599999785423279
Iteration 85 - Loss = 0.026708570440933905 - Accuracy = 0.9639999866485596
Iteration 97 - Loss = 0.06493889738594429 - Accuracy = 0.9580000042915344
Iteration 109 - Loss = 0.02904200021543981 - Accuracy = 0.9620000123977661
Iteration 121 - Loss = 0.2199134343409183 - Accuracy = 0.953000009059906
Iteration 133 - Loss = 0.12767312729963085 - Accuracy = 0.9599999785423279
Iteration 145 - Loss = 0.0520170804752565 - Accuracy = 0.9599999785423279
Iteration 157 - Loss = 0.1048044037652

In [14]:
# Check accuracy after training
acc = shallow_neural_net_pt.accuracy(shallow_neural_net_pt(train_inputs_pt), train_outputs_pt).item()
print(acc)

0.949999988079071


### Adding a regularization term to the loss function

Another interesting concept is that we can add a regularization term to the loss function very easily. Not that it is necessary here, but if we wanted to, here is how we would do it.

Our first step would be to simply compute our regularization term by using the PyTorch functions, for instance the L1 loss as follows:
```
L1_reg = lambda_l1*sum(torch.abs(param).sum() for param in self.parameters())
```

We would then simply add it to the loss before backpropagating.

```
# Compute loss and regularization term
loss_val = self.loss(pred, outputs_batch.to(torch.float64))
L1_reg = lambda_l1*sum(torch.abs(param).sum() for param in self.parameters())
total_loss = loss_val + L1_reg
self.loss_history.append(total_loss.item())

# Backpropagate
# Compute differentiation of loss with respect to all
# parameters involved in the calculation that have a flag
# requires_grad = True (that is W2, W1, b2 and b1)
total_loss.backward()
```

That is it.

In [15]:
# Our class will inherit from the torch.nn.Module
# used to write all model in PyTorch
class ShallowNeuralNet_PT(torch.nn.Module):
    
    def __init__(self, n_x, n_h, n_y):
        # Super __init__ for inheritance
        super().__init__()
        
        # Network dimensions (as before)
        self.n_x = n_x
        self.n_h = n_h
        self.n_y = n_y
        
        # Initialize parameters using the torch.nn.Parameter type (a subclass of Tensors).
        # We immediatly initialize the parameters using a random normal.
        # The RNG is done using torch.randn instead of the NumPy RNG.
        # We add a conversion into float64 (the same float type used by Numpy to generate our data)
        # And send them to our GPU/CPU device
        self.W1 = torch.nn.Parameter(torch.randn(n_x, n_h, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.b1 = torch.nn.Parameter(torch.randn(1, n_h, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.W2 = torch.nn.Parameter(torch.randn(n_h, n_y, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.b2 = torch.nn.Parameter(torch.randn(1, n_y, requires_grad = True, \
                                     dtype = torch.float64, device = device)*0.1)
        self.W1.retain_grad()
        self.b1.retain_grad()
        self.W2.retain_grad()
        self.b2.retain_grad()
        
        # Loss and accuracy functions
        self.loss = torch.nn.BCELoss()
        self.accuracy = BinaryAccuracy()
        
    def forward(self, inputs):
        # Instead of using np.matmul(), we use its equivalent in PyTorch,
        # which is torch.matmul()!
        # (Most numpy matrix operations ahve their equivalent in torch, check it out!)
        # Wx + b operation for the first layer
        Z1 = torch.matmul(inputs, self.W1)
        Z1_b = Z1 + self.b1
        # Sigmoid is already implemented in PyTorch, feel fre to reuse it!
        A1 = torch.sigmoid(Z1_b)
        
        # Wx + b operation for the second layer
        # (Same as first layer)
        Z2 = torch.matmul(A1, self.W2)
        Z2_b = Z2 + self.b2
        y_pred = torch.sigmoid(Z2_b)
        return y_pred
    
    def train(self, inputs, outputs, N_max = 1000, alpha = 1, beta1 = 0.9, beta2 = 0.999, \
              batch_size = 32, lambda_val = 1e-3):
        # Create a PyTorch dataset object from the input and output data
        dataset = torch.utils.data.TensorDataset(inputs, outputs)
        # Create a PyTorch DataLoader object from the dataset, with the specified batch size
        data_loader = torch.utils.data.DataLoader(dataset, batch_size = batch_size, shuffle = True)
    
        # Optimizer
        # You can use self.parameters() to get the list of parameters for the model
        # self.parameters() is therefore equivalent to [self.W1, self.b1, self.W2, self.b2]
        optimizer = torch.optim.Adam(self.parameters(), # Parameters to be updated by gradient rule
                                     lr = alpha, # Learning rate
                                     betas = (beta1, beta2), # Betas used in Adam rules for V and S
                                     eps = 1e-08) # Epsilon value used in normalization
        optimizer.zero_grad()
        # History of losses
        self.loss_history = []
        # Repeat gradient descent procedure for N_max iterations
        for iteration_number in range(1, N_max + 1):
            # Loop over each mini-batch of data
            for batch in data_loader:
                # Unpack the mini-batch data
                inputs_batch, outputs_batch = batch
                
                # Forward pass
                # This is equivalent to pred = self.forward(inputs)
                pred = self(inputs_batch)
                           
                # Compute regularization term
                L1_reg = lambda_val*sum(torch.abs(param).sum()
                                        for param in self.parameters())
                
                # Add regularization to loss
                loss_val = self.loss(pred, outputs_batch.to(torch.float64))
                total_loss = loss_val + L1_reg
                self.loss_history.append(total_loss.item())

                # Backpropagate
                # Compute differentiation of loss with respect to all
                # parameters involved in the calculation that have a flag
                # requires_grad = True (that is W2, W1, b2 and b1)
                # Here, combining loss and regularization term
                total_loss.backward()

                # Update all weights and optimizer step (will update the V 
                # and S parameters in Adam) all at once!
                optimizer.step()

                # Reset gradients to 0
                optimizer.zero_grad()
            
            # Display
            if(iteration_number % (N_max//20) == 1):
                # Compute accuracy for display
                pred = self(inputs)
                acc_val = self.accuracy(pred, outputs)
                print("Iteration {} - Loss = {} - Accuracy = {}".format(iteration_number, \
                                                                        total_loss.item(), \
                                                                        acc_val.item()))

Regularization has an impact on training. Feel free to play with the value of lambda to see its effect on training!

In [16]:
# Define a neural network structure
n_x = 2
n_h = 10
n_y = 1
np.random.seed(37)
shallow_neural_net_pt = ShallowNeuralNet_PT(n_x, n_h, n_y).to(device)
train_pred = shallow_neural_net_pt.train(train_inputs_pt, train_outputs_pt, N_max = 50, \
                                         alpha = 1e-1, beta1 = 0.9, beta2 = 0.999, batch_size = 32, lambda_val = 1e-5)

Iteration 1 - Loss = 0.17261779257275459 - Accuracy = 0.8709999918937683
Iteration 3 - Loss = 0.342984103976192 - Accuracy = 0.8730000257492065
Iteration 5 - Loss = 0.5132312795830537 - Accuracy = 0.8999999761581421
Iteration 7 - Loss = 0.08754411525564934 - Accuracy = 0.9139999747276306
Iteration 9 - Loss = 0.07182989475023617 - Accuracy = 0.925000011920929
Iteration 11 - Loss = 0.08903025257505413 - Accuracy = 0.9300000071525574
Iteration 13 - Loss = 0.26089040857903667 - Accuracy = 0.9470000267028809
Iteration 15 - Loss = 0.09109016622075831 - Accuracy = 0.9620000123977661
Iteration 17 - Loss = 0.18054698106189462 - Accuracy = 0.9620000123977661
Iteration 19 - Loss = 0.012311306849543808 - Accuracy = 0.9509999752044678
Iteration 21 - Loss = 0.06207002332709065 - Accuracy = 0.9789999723434448
Iteration 23 - Loss = 0.013176133427765767 - Accuracy = 0.9869999885559082
Iteration 25 - Loss = 0.03905963046214341 - Accuracy = 0.984000027179718
Iteration 27 - Loss = 0.018144369503289886 - A

In [17]:
# Check accuracy after training
acc = shallow_neural_net_pt.accuracy(shallow_neural_net_pt(train_inputs_pt), train_outputs_pt).item()
print(acc)

0.9900000095367432


### Adding initializers to the model

Finally, the part below, is a bit tedious.
```
# Initialize parameters using the torch.nn.Parameter type (a subclass of Tensors).
# We immediatly initialize the parameters using a random normal.
# The RNG is done using torch.randn instead of the NumPy RNG.
# We add a conversion into float64 (the same float type used by Numpy to generate our data)
# And send them to our GPU/CPU device
self.W1 = torch.nn.Parameter(torch.randn(n_x, n_h, requires_grad = True, \
                             dtype = torch.float64, device = device)*0.1)
self.b1 = torch.nn.Parameter(torch.randn(1, n_h, requires_grad = True, \
                             dtype = torch.float64, device = device)*0.1)
self.W2 = torch.nn.Parameter(torch.randn(n_h, n_y, requires_grad = True, \
                             dtype = torch.float64, device = device)*0.1)
self.b2 = torch.nn.Parameter(torch.randn(1, n_y, requires_grad = True, \
                             dtype = torch.float64, device = device)*0.1)
self.W1.retain_grad()
self.b1.retain_grad()
self.W2.retain_grad()
self.b2.retain_grad()
```

We would love to replace it with something a bit simpler, which allows the user to choose how to initialize said parameters (by using Xavier, LeCun, He, etc.).

This can be done, by using functions from the torch.nn.init, e.g. the xavier_uniform_() one.

```
# Initialize parameters using the torch.nn.Parameter type (a subclass of Tensors).
# We use xavier_uniform_ initialization.
self.W1 = torch.nn.Parameter(torch.zeros(size = (n_x, n_h), requires_grad = True, \
                             dtype = torch.float64, device = device))
torch.nn.init.xavier_uniform_(self.W1.data)
self.b1 = torch.nn.Parameter(torch.zeros(size = (1, n_h), requires_grad = True, \
                             dtype = torch.float64, device = device))
torch.nn.init.xavier_uniform_(self.b1.data)
self.W2 = torch.nn.Parameter(torch.zeros(size = (n_h, n_y), requires_grad = True, \
                             dtype = torch.float64, device = device))
torch.nn.init.xavier_uniform_(self.W2.data)
self.b2 = torch.nn.Parameter(torch.zeros(size = (1, n_y), requires_grad = True, \
                             dtype = torch.float64, device = device))
torch.nn.init.xavier_uniform_(self.b2.data)
```

Note that this is still a somewhat unstable feature, which might be added in later version of PyTorch, so stay tuned!

And have a look at this for additional initializers: https://pytorch.org/cppdocs/api/file_torch_csrc_api_include_torch_nn_init.h.html#file-torch-csrc-api-include-torch-nn-init-h

In [18]:
# Our class will inherit from the torch.nn.Module
# used to write all model in PyTorch
class ShallowNeuralNet_PT(torch.nn.Module):
    
    def __init__(self, n_x, n_h, n_y):
        # Super __init__ for inheritance
        super().__init__()
        
        # Network dimensions (as before)
        self.n_x = n_x
        self.n_h = n_h
        self.n_y = n_y
        
        # Initialize parameters using the torch.nn.Parameter type (a subclass of Tensors).
        # We use xavier_uniform_ initialization.
        self.W1 = torch.nn.Parameter(torch.zeros(size = (n_x, n_h), requires_grad = True, \
                                                 dtype = torch.float64, device = device))
        torch.nn.init.xavier_uniform_(self.W1.data)
        self.b1 = torch.nn.Parameter(torch.zeros(size = (1, n_h), requires_grad = True, \
                                                 dtype = torch.float64, device = device))
        torch.nn.init.xavier_uniform_(self.b1.data)
        self.W2 = torch.nn.Parameter(torch.zeros(size = (n_h, n_y), requires_grad = True, \
                                                 dtype = torch.float64, device = device))
        torch.nn.init.xavier_uniform_(self.W2.data)
        self.b2 = torch.nn.Parameter(torch.zeros(size = (1, n_y), requires_grad = True, \
                                                 dtype = torch.float64, device = device))
        torch.nn.init.xavier_uniform_(self.b2.data)
        
        # Loss and accuracy functions
        self.loss = torch.nn.BCELoss()
        self.accuracy = BinaryAccuracy()
        
    def forward(self, inputs):
        # Instead of using np.matmul(), we use its equivalent in PyTorch,
        # which is torch.matmul()!
        # (Most numpy matrix operations ahve their equivalent in torch, check it out!)
        # Wx + b operation for the first layer
        Z1 = torch.matmul(inputs, self.W1)
        Z1_b = Z1 + self.b1
        # Sigmoid is already implemented in PyTorch, feel fre to reuse it!
        A1 = torch.sigmoid(Z1_b)
        
        # Wx + b operation for the second layer
        # (Same as first layer)
        Z2 = torch.matmul(A1, self.W2)
        Z2_b = Z2 + self.b2
        y_pred = torch.sigmoid(Z2_b)
        return y_pred
    
    def train(self, inputs, outputs, N_max = 1000, alpha = 1, beta1 = 0.9, beta2 = 0.999, \
              batch_size = 32, lambda_val = 1e-3):
        # Create a PyTorch dataset object from the input and output data
        dataset = torch.utils.data.TensorDataset(inputs, outputs)
        # Create a PyTorch DataLoader object from the dataset, with the specified batch size
        data_loader = torch.utils.data.DataLoader(dataset, batch_size = batch_size, shuffle = True)
    
        # Optimizer
        # You can use self.parameters() to get the list of parameters for the model
        # self.parameters() is therefore equivalent to [self.W1, self.b1, self.W2, self.b2]
        optimizer = torch.optim.Adam(self.parameters(), # Parameters to be updated by gradient rule
                                     lr = alpha, # Learning rate
                                     betas = (beta1, beta2), # Betas used in Adam rules for V and S
                                     eps = 1e-08) # Epsilon value used in normalization
        optimizer.zero_grad()
        # History of losses
        self.loss_history = []
        # Repeat gradient descent procedure for N_max iterations
        for iteration_number in range(1, N_max + 1):
            # Loop over each mini-batch of data
            for batch in data_loader:
                # Unpack the mini-batch data
                inputs_batch, outputs_batch = batch
                
                # Forward pass
                # This is equivalent to pred = self.forward(inputs)
                pred = self(inputs_batch)
                # Compute loss and regularization term
                loss_val = self.loss(pred, outputs_batch.to(torch.float64))
                L1_reg = lambda_val*sum(torch.abs(param).sum() for param in self.parameters())
                total_loss = loss_val + L1_reg
                self.loss_history.append(total_loss.item())

                # Backpropagate
                # Compute differentiation of loss with respect to all
                # parameters involved in the calculation that have a flag
                # requires_grad = True (that is W2, W1, b2 and b1)
                total_loss.backward()

                # Update all weights and optimizer step (will update the V 
                # and S parameters in Adam) all at once!
                optimizer.step()

                # Reset gradients to 0
                optimizer.zero_grad()
            
            # Display
            if(iteration_number % (N_max//20) == 1):
                # Compute accuracy for display
                pred = self(inputs)
                acc_val = self.accuracy(pred, outputs)
                print("Iteration {} - Loss = {} - Accuracy = {}".format(iteration_number, \
                                                                        total_loss.item(), \
                                                                        acc_val.item()))

### Finally, using Linear layers prototypes from PyTorch to simplify even further

Below is the final version of our ShallowNeuralNet_PT class, which combines all the features discussed earlier together.

It will also make use of the Linear() layers prototype, which will allow us to simplify the code even further.

Using these Linear() layers, we can now remove the Parameters we had defined manually and simply use the layer object as a function in the forward method.

In [19]:
class ShallowNeuralNet_PT(torch.nn.Module):
    def __init__(self, n_x, n_h, n_y, device):
        super().__init__()
        self.n_x, self.n_h, self.n_y = n_x, n_h, n_y
        
        # Using the Linear() layer prototype
        self.linear1 = nn.Linear(n_x, n_h, dtype = torch.double)
        self.linear2 = nn.Linear(n_h, n_y, dtype = torch.double)
        
        self.loss = torch.nn.BCELoss()
        self.accuracy = BinaryAccuracy()
        
    def forward(self, inputs):
        # Reusing the layers as functions in the forward method
        Z1 = self.linear1(inputs)
        A1 = torch.sigmoid(Z1)
        Z2 = self.linear2(A1)
        A2 = torch.sigmoid(Z2)
        return A2
        
    def train(self, inputs, outputs, N_max = 1000, alpha = 1, beta1 = 0.9, beta2 = 0.999, \
              batch_size = 32, lambda_val = 1e-3):
        dataset = torch.utils.data.TensorDataset(inputs, outputs)
        data_loader = torch.utils.data.DataLoader(dataset, batch_size = batch_size, shuffle = True)
        optimizer = torch.optim.Adam(self.parameters(), lr = alpha, betas = (beta1, beta2), eps = 1e-08)
        optimizer.zero_grad()
        self.loss_history = []
        for iteration_number in range(1, N_max + 1):
            for batch in data_loader:
                inputs_batch, outputs_batch = batch
                total_loss = self.loss(self(inputs_batch), outputs_batch.to(torch.float64))\
                    + lambda_val*sum(torch.abs(param).sum() for param in self.parameters()).item()
                self.loss_history.append(total_loss)
                total_loss.backward()
                optimizer.step()
                optimizer.zero_grad()
            if(iteration_number % (N_max//20) == 1):
                pred = self(inputs)
                acc_val = self.accuracy(pred, outputs).item()
                print("Iteration {} - Loss = {} - Accuracy = {}".format(iteration_number, total_loss, acc_val))

It trains very nicely, but feel free to play with the model and its hyperparameters if you want!

In [20]:
# Define a neural network structure
n_x = 2
n_h = 10
n_y = 1
np.random.seed(37)
shallow_neural_net_pt = ShallowNeuralNet_PT(n_x, n_h, n_y, device).to(device)
train_pred = shallow_neural_net_pt.train(train_inputs_pt, train_outputs_pt, N_max = 200, \
                                         alpha = 1, beta1 = 0.9, beta2 = 0.999, batch_size = 128, lambda_val = 1e-5)

Iteration 1 - Loss = 0.3926024365960113 - Accuracy = 0.8610000014305115
Iteration 11 - Loss = 0.3101898580957781 - Accuracy = 0.8650000095367432
Iteration 21 - Loss = 0.12105549019381083 - Accuracy = 0.9769999980926514
Iteration 31 - Loss = 0.10358540003699022 - Accuracy = 0.9800000190734863
Iteration 41 - Loss = 0.07639140401980078 - Accuracy = 0.9760000109672546
Iteration 51 - Loss = 0.07327596635785066 - Accuracy = 0.9940000176429749
Iteration 61 - Loss = 0.0795085158846797 - Accuracy = 0.9909999966621399
Iteration 71 - Loss = 0.027989940810582925 - Accuracy = 0.9869999885559082
Iteration 81 - Loss = 0.04702639876190272 - Accuracy = 0.9779999852180481
Iteration 91 - Loss = 0.034392911520930616 - Accuracy = 0.9850000143051147
Iteration 101 - Loss = 0.02265342730897736 - Accuracy = 0.9929999709129333
Iteration 111 - Loss = 0.14702158635351717 - Accuracy = 0.9929999709129333
Iteration 121 - Loss = 0.048470289925868555 - Accuracy = 0.9850000143051147
Iteration 131 - Loss = 0.02397845759

In [21]:
# Check accuracy after training
acc = shallow_neural_net_pt.accuracy(shallow_neural_net_pt(train_inputs_pt), train_outputs_pt).item()
print(acc)

0.9900000095367432


### What's next?

This concludes our PyTorch basics notebooks on how to implement a Shallow Neural Network in PyTorch for binary classification.

In the next notebooks, we will investigate a different task, which will introduce more advanced concepts, that are variations of the current ones: Deep Neural Networks and Multi-class (non-binary) classification. We will investigate this in a guided project approach.

But first, let us have a quick look at Dataset and Dataloader objects.

Below is our final neural network class prototype.

In [22]:
class ShallowNeuralNet_PT(torch.nn.Module):
    def __init__(self, n_x, n_h, n_y, device):
        super().__init__()
        self.n_x, self.n_h, self.n_y = n_x, n_h, n_y
        self.linear1 = nn.Linear(n_x, n_h, dtype = torch.double)
        self.linear2 = nn.Linear(n_h, n_y, dtype = torch.double)
        self.loss = torch.nn.BCELoss()
        self.accuracy = BinaryAccuracy()
    def forward(self, inputs):
        return torch.sigmoid(self.linear2(torch.sigmoid(self.linear1(inputs))))
    def train(self, inputs, outputs, N_max = 1000, alpha = 1, beta1 = 0.9, beta2 = 0.999, \
              batch_size = 32, lambda_val = 1e-3):
        dataset = torch.utils.data.TensorDataset(inputs, outputs)
        data_loader = torch.utils.data.DataLoader(dataset, batch_size = batch_size, shuffle = True)
        optimizer = torch.optim.Adam(self.parameters(), lr = alpha, betas = (beta1, beta2), eps = 1e-08)
        optimizer.zero_grad()
        self.loss_history = []
        for iteration_number in range(1, N_max + 1):
            for batch in data_loader:
                inputs_batch, outputs_batch = batch
                total_loss = self.loss(self(inputs_batch), outputs_batch.to(torch.float64))\
                    + lambda_val*sum(torch.abs(param).sum() for param in self.parameters()).item()
                self.loss_history.append(total_loss)
                total_loss.backward()
                optimizer.step()
                optimizer.zero_grad()
            if(iteration_number % (N_max//20) == 1):
                pred = self(inputs)
                acc_val = self.accuracy(pred, outputs).item()
                print("Iteration {} - Loss = {} - Accuracy = {}".format(iteration_number, total_loss, acc_val))