# Activation Functions, Gradient Flow and Dead Neurons - Worked Example

<hr style="border:2px solid gray">

# Index: <a id='index'></a>
1. [Introduction and Setup](#intro)
1. [Common Activation Functions](#common-activation-functions)
1. [Visualising Activation Functions](#visualising-act)
1. [Fashion MNIST Dataset](#fashion)
1. [Gradient Flow](#gradient-flow)
1. [Training a Model](#training-a-model)
1. [Dead Neurons](#dead-neurons)
1. [Conclusion](#conclusion)
1. [Appendix](#appendix)
    1. [PyTorch backwards() Function](#backwards_app) 
    1. [Backpropagation with Matrices](#matrix_app)
    2. [Cross-Entropy](#cross)
    3. [Activation Functions Overview](#act_fn_app)

This notebook is heavily based on one from Universiteit van Amsterdam's deep learning course written by Phillip Lippe.

You can see his original filled notebook at:
[![View on Github](https://img.shields.io/static/v1.svg?logo=github&label=Repo&message=View%20On%20Github&color=lightgrey)](https://github.com/phlippe/uvadlc_notebooks/blob/master/docs/tutorial_notebooks/tutorial3/Activation_Functions.ipynb)

**Pre-trained models:** 
[![View files on Github](https://img.shields.io/static/v1.svg?logo=github&label=Repo&message=View%20On%20Github&color=lightgrey)](https://github.com/phlippe/saved_models/tree/main/tutorial3)
[![GoogleDrive](https://img.shields.io/static/v1.svg?logo=google-drive&logoColor=yellow&label=GDrive&message=Download&color=yellow)](https://drive.google.com/drive/folders/1sFpZUpDJVjiYEvIqISqfkFizfsTnPf4s?usp=sharing)   

<hr style="border:2px solid gray">

# Introduction and Setup [^](#index) <a id='intro'></a>

Last week, you were introduced to the basics of neural networks. In this tutorial we will focus on the **activation function**, which characterises the response at each neuron. 

As discussed previously in the Single Neuron Notebook, activation functions are a crucial part of deep learning models because they can add **non-linearity** to the network. Without non-linearity, increasing the number of layers will not affect our network's performance: the output would always be a linear combination of the input variables, and so could have been achieved with just a single layer.
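
A quick way to see this collapse is to stack two linear layers with no activation in between and compare them with a single layer built from the product of their weight matrices. The sketch below assumes small, arbitrary layer sizes (4, 3 and 2) and hypothetical variable names; it is only an illustration and not part of the course code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two linear layers applied back to back, with no activation in between
layer1 = nn.Linear(4, 3)
layer2 = nn.Linear(3, 2)

# Build one equivalent layer: W = W2 @ W1 and b = W2 @ b1 + b2
combined = nn.Linear(4, 2)
with torch.no_grad():
    combined.weight.copy_(layer2.weight @ layer1.weight)
    combined.bias.copy_(layer2.weight @ layer1.bias + layer2.bias)

x = torch.randn(5, 4)
stacked_out = layer2(layer1(x))
single_out = combined(x)

# The two outputs agree up to floating point rounding
print(torch.allclose(stacked_out, single_out, atol=1e-5))
```

Placing any non-linear activation between `layer1` and `layer2` breaks this equivalence, which is what allows deeper networks to represent more than a single linear map.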

There is a great variety of activation functions in the literature, and some are more beneficial than others.
The goal of this tutorial is to show the importance of choosing a good activation function (and how to do so), and what problems might occur if we do not.

Before we start, we should import our standard libraries and set up basic functions:

In [None]:
## Standard libraries
import os
import json
import math
import numpy as np 

## Imports for plotting
import matplotlib.pyplot as plt
%matplotlib inline 
from IPython.display import set_matplotlib_formats
import seaborn as sns
sns.set()

## Progress bar
from tqdm.notebook import tqdm

## PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as data
import torch.optim as optim

To make our training reproducible, we will set a seed for the random numbers that we generate from PyTorch and NumPy.

If you are using a department-provided laptop, or many other types of commercial machines, you will not have access to a GPU. GPUs can speed up training and are therefore very important when working with more complex models; however, using a CPU will be **sufficient** for this course.

_Unlike the CPU, the same seed on different GPU architectures can give different results. All the prebuilt models linked above have been trained on an NVIDIA GTX1080Ti._

The following cell also defines two file paths: `DATASET_PATH` and `CHECKPOINT_PATH`. The dataset path is the directory where we will download datasets used in the notebooks. It is recommended to store all datasets from PyTorch in one shared directory to prevent duplicate downloads. The checkpoint path is the directory where we will store the trained model weights and additional files. The files required will be automatically downloaded.

In [None]:
# Path to the folder where the datasets are/should be downloaded (e.g. MNIST)
DATASET_PATH = "../data"

# Path to the folder where the pretrained models are saved
CHECKPOINT_PATH = "../saved_models/tutorial3"

# Function for setting the seed
def set_seed(seed):
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available(): # GPU operations have a separate seed
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
set_seed(42)

# Additionally, some operations on a GPU are implemented stochastically for efficiency
# We want to ensure that all operations are deterministic on GPU (if used) for reproducibility
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Fetching the device that will be used throughout this notebook
device = torch.device("cpu") if not torch.cuda.is_available() else torch.device("cuda:0")
print("Using device", device)
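
As a small sanity check of what seeding buys us, the sketch below draws random numbers twice with the same seed and confirms that the draws match. The helper name `seed_everything` is a hypothetical, CPU-only stand-in for `set_seed` above.

```python
import numpy as np
import torch

def seed_everything(seed):
    # Hypothetical stand-in for set_seed above (CPU-only part)
    np.random.seed(seed)
    torch.manual_seed(seed)

seed_everything(42)
first_torch, first_np = torch.rand(3), np.random.rand(3)

# Re-seeding resets both generators, so the same numbers are drawn again
seed_everything(42)
second_torch, second_np = torch.rand(3), np.random.rand(3)

same_torch = torch.equal(first_torch, second_torch)
same_np = np.array_equal(first_np, second_np)
print(same_torch, same_np)
```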

The following cell downloads all pretrained models we will use in this notebook. These are copies from Phillip Lippe's course. The files are stored in a separate [repository](https://github.com/pjdunne/saved_models). Please let me (Patrick) know if an error occurs so it can be fixed for all students.

In [None]:
import urllib.request
from urllib.error import HTTPError
# Github URL where saved models are stored for this tutorial
base_url = "https://raw.githubusercontent.com/pjdunne/saved_models/main/tutorial3/"
# Files to download
pretrained_files = ["FashionMNIST_elu.config", "FashionMNIST_elu.tar", 
                    "FashionMNIST_leakyrelu.config", "FashionMNIST_leakyrelu.tar",
                    "FashionMNIST_relu.config", "FashionMNIST_relu.tar",
                    "FashionMNIST_sigmoid.config", "FashionMNIST_sigmoid.tar",
                    "FashionMNIST_swish.config", "FashionMNIST_swish.tar",
                    "FashionMNIST_tanh.config", "FashionMNIST_tanh.tar"]
# Create checkpoint path if it doesn't exist yet
os.makedirs(CHECKPOINT_PATH, exist_ok=True)

# For each file, check whether it already exists. If not, try downloading it.
for file_name in pretrained_files:
    file_path = os.path.join(CHECKPOINT_PATH, file_name)
    if not os.path.isfile(file_path):
        file_url = base_url + file_name
        print(f"Downloading {file_url}...")
        try:
            urllib.request.urlretrieve(file_url, file_path)
        except HTTPError as e:
            print("Something went wrong. Please try to download the file from the GDrive folder, or contact the author with the full output including the following error:\n", e)

<hr style="border:2px solid gray">

# Common Activation Functions [^](#index) <a id='common-activation-functions'></a>

We will first implement some common activation functions, writing our own functions to gain a better understanding and insight into how they operate. Most of these functions can also be found in the `torch.nn` package (see the [documentation](https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity) for an overview).

_The following code uses Object-Oriented Programming (OOP). If you are unfamiliar or out of practice with OOP, you may wish to read the 2nd Bootcamp Notebook for some guidance._


To make it easier to compare various activation functions, we will start by defining a base class from which all our future modules will inherit:

In [None]:
class ActivationFunction(nn.Module):
    def __init__(self):
        super().__init__()
        self.name = self.__class__.__name__
        self.config = {"name": self.name}

Every activation function will be an `nn.Module` so that we can integrate them nicely in a network. We will use the `config` dictionary to store adjustable parameters for some activation functions.

Next, we will implement two of the "oldest" activation functions that are still commonly used for various tasks: sigmoid and tanh. Both the sigmoid and tanh activation can be also found as PyTorch functions (`torch.sigmoid`, `torch.tanh`) or as modules (`nn.Sigmoid`, `nn.Tanh`). 

<div style="background-color:#C2F5DD">
    
### Exercise
    
Below, you should write the code for the `forward` function of two activation functions. 
     
If you are unsure of the form of either of the two functions, feel free to look them up online.

In [None]:
class Sigmoid(ActivationFunction):
    
    def forward(self, x):
        return 1 / (1 + torch.exp(-x))

##############################   
    
class Tanh(ActivationFunction):
    
    def forward(self, x):
        x_exp, neg_x_exp = torch.exp(x), torch.exp(-x)
        return (x_exp - neg_x_exp) / (x_exp + neg_x_exp)
    
##############################
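
To check hand-written implementations like these, one option is to compare them against the PyTorch built-ins on a grid of inputs. The sketch below re-implements the two formulas as plain functions (the names `manual_sigmoid` and `manual_tanh` are hypothetical) so that it runs on its own, and also illustrates one caveat of the naive tanh formula.

```python
import torch

def manual_sigmoid(x):
    return 1 / (1 + torch.exp(-x))

def manual_tanh(x):
    x_exp, neg_x_exp = torch.exp(x), torch.exp(-x)
    return (x_exp - neg_x_exp) / (x_exp + neg_x_exp)

x = torch.linspace(-5, 5, 101)
sigmoid_matches = torch.allclose(manual_sigmoid(x), torch.sigmoid(x), atol=1e-6)
tanh_matches = torch.allclose(manual_tanh(x), torch.tanh(x), atol=1e-6)
print(sigmoid_matches, tanh_matches)

# Caveat: exp(100) overflows to inf in float32, so the naive formula
# evaluates inf / inf and returns nan, while torch.tanh stays finite
large_input = torch.tensor([100.0])
print(manual_tanh(large_input), torch.tanh(large_input))
```

On the range plotted in this notebook the two versions agree, so the simple formula is fine for our purposes; the built-in versions are the safer choice for inputs of large magnitude.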

### Alternative Activation Functions

Another popular activation function that has allowed the training of deeper networks is the **Rectified Linear Unit (ReLU)**. This function returns the input for inputs greater than 0, and returns 0 for inputs less than or equal to 0. Despite its simplicity, ReLU has one major benefit compared to sigmoid and tanh: a strong, stable gradient for a large range of values.

Based on this idea, many variations of ReLU have been proposed. We will implement the following three:

**LeakyReLU:** In the negative region of the input space, LeakyReLU has the form of a linear function with a small gradient (alpha), rather than just being 0. This non-zero gradient means that gradients can also 'flow' in this part of the input. We'll see what this means later.

**ELU:** This is similar to the leaky ReLU but replaces the function in the negative input space with an exponential decay.

**Swish:** This is the most recently proposed of these activation functions, and is actually the result of a large experiment with the purpose of finding the "optimal" activation function. Swish returns the input multiplied by the sigmoid of the input.
Compared to the other activation functions, Swish is both smooth and non-monotonic (i.e. its gradient changes sign). Swish is designed to solve a problem called 'dead neurons', which can occur with standard ReLU. We'll see what this is later, but it is a particular problem with deep networks.
If interested, a more detailed discussion of the benefits of Swish can be found in [this paper](https://arxiv.org/abs/1710.05941) [1].
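
The non-monotonic shape of Swish can be checked numerically. The sketch below evaluates $x \cdot \sigma(x)$ on a grid (the grid range and resolution are arbitrary choices) and looks for both decreasing and increasing stretches, as well as the location of the minimum.

```python
import torch

x = torch.linspace(-5, 5, 1001, dtype=torch.float64)
swish = x * torch.sigmoid(x)

# Differences between neighbouring outputs: negative where the function decreases
steps = swish[1:] - swish[:-1]
has_decreasing_part = bool((steps < 0).any())
has_increasing_part = bool((steps > 0).any())

# Locate the dip on the negative side
min_index = torch.argmin(swish)
min_x, min_value = x[min_index].item(), swish[min_index].item()
print(f"decreasing: {has_decreasing_part}, increasing: {has_increasing_part}")
print(f"minimum of about {min_value:.3f} near x = {min_x:.2f}")
```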

<div style="background-color:#C2F5DD">
    
### Exercise
    
Given the descriptions above, implement these four activation functions.

In [None]:
##############################

class ReLU(ActivationFunction):
    
    def forward(self, x):
        return x * (x > 0).float()

##############################

class LeakyReLU(ActivationFunction):
    
    def __init__(self, alpha=0.1):
        super().__init__()
        self.config["alpha"] = alpha
        
    def forward(self, x):
        return torch.where(x > 0, x, self.config["alpha"] * x)

##############################
    
class ELU(ActivationFunction):
    
    def forward(self, x):
        return torch.where(x > 0, x, torch.exp(x)-1)

##############################
    
class Swish(ActivationFunction):
    
    def forward(self, x):
        return x * torch.sigmoid(x)
    
##############################
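
As with sigmoid and tanh, these formulas can be compared against their counterparts in `torch.nn.functional`. The sketch below repeats the four `forward` expressions inline so that it is self-contained; note that PyTorch calls Swish `silu`, and that `alpha = 0.1` matches the default of the `LeakyReLU` class above rather than PyTorch's own default slope.

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-5, 5, 101)
alpha = 0.1  # same default as the LeakyReLU class above

checks = {
    "relu": torch.allclose(x * (x > 0).float(), F.relu(x)),
    "leakyrelu": torch.allclose(torch.where(x > 0, x, alpha * x),
                                F.leaky_relu(x, negative_slope=alpha)),
    "elu": torch.allclose(torch.where(x > 0, x, torch.exp(x) - 1),
                          F.elu(x), atol=1e-6),
    "swish": torch.allclose(x * torch.sigmoid(x), F.silu(x), atol=1e-6),
}
print(checks)
```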

We can summarise all our activation functions in a dictionary, which maps the name of the function to the class object. If you wish to implement a new activation function, add it to this dictionary to include it in future comparisons.

In [None]:
act_fn_by_name = {
    "sigmoid": Sigmoid,
    "tanh": Tanh,
    "relu": ReLU,
    "leakyrelu": LeakyReLU,
    "elu": ELU,
    "swish": Swish
}

<hr style="border:2px solid gray">

# Visualising Activation Functions [^](#index) <a id='visualising-act'></a>

To understand what each activation function actually does, it may be useful to visualise both the function and its **gradient**. Understanding the gradient of the activation function is crucial for optimising the neural network. 

PyTorch allows us to compute the gradients by calling the `backward` function. The cell below demonstrates how this can be implemented. For more information on the `backward` function and how it operates, please see the Appendix section [PyTorch backwards() Function](#backwards_app).



In [None]:
def get_grads(act_fn, x):
    """
    Computes the gradients of an activation function at specified positions.
    
    Inputs:
        act_fn - An object of the class "ActivationFunction" with an implemented forward pass.
        x - 1D input tensor. 
    Output:
        A tensor with the same size as x containing the gradients of act_fn at x.
    """
    x = x.clone().requires_grad_() # Mark the input as tensor for which we want to store gradients
    out = act_fn(x)
    out.sum().backward() # Summing results in an equal gradient flow to each element in x
    return x.grad # Accessing the gradients of x by "x.grad"
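
To build some trust in this approach, we can compare the gradients returned by autograd with a derivative we know analytically. The sketch below does this for the sigmoid, whose derivative is $\sigma(x)(1 - \sigma(x))$, using the same sum-then-backward pattern as `get_grads`; the variable names are illustrative only.

```python
import torch

x = torch.linspace(-5, 5, 1001).requires_grad_()
out = torch.sigmoid(x)
# Each output depends only on its own input, so d(sum)/dx_i = sigmoid'(x_i)
out.sum().backward()

with torch.no_grad():
    analytic = torch.sigmoid(x) * (1 - torch.sigmoid(x))

grads_match = torch.allclose(x.grad, analytic, atol=1e-6)
max_grad = x.grad.max().item()  # largest slope, reached around x = 0
print(grads_match, max_grad)
```

The largest value is close to 1/4, which is the maximum gradient of the sigmoid.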

We can now visualise all our activation functions including their gradients.

<div style="background-color:#C2F5DD">
    
### Exercise

Complete the following functions in order to generate nice plots of the activation functions:

In [None]:
def vis_act_fn(act_fn, ax, x):
    # Run activation function
    y = act_fn(x)
    y_grads = get_grads(act_fn, x)
    # Push x, y and gradients back to cpu for plotting
    x, y, y_grads = x.cpu().numpy(), y.cpu().numpy(), y_grads.cpu().numpy()
    ## Plotting
    ax.plot(x, y, linewidth=2, label="ActFn")
    ax.plot(x, y_grads, linewidth=2, label="Gradient")
    ax.set_title(act_fn.name)
    ax.legend()
    ax.set_ylim(-1.5, x.max())

# Add activation functions if wanted
act_fns = [act_fn() for act_fn in act_fn_by_name.values()]
x = torch.linspace(-5, 5, 1000) # Range on which we want to visualize the activation functions
## Plotting
rows = math.ceil(len(act_fns)/2.0)
fig, ax = plt.subplots(rows, 2, figsize=(8, rows*4))
for i, act_fn in enumerate(act_fns):
    vis_act_fn(act_fn, ax[divmod(i,2)], x)
fig.subplots_adjust(hspace=0.3)
plt.show()

<hr style="border:2px solid gray">

# Fashion MNIST Dataset [^](#index) <a id='fashion'></a>

To see the impact each activation function has on training our neural network, we will look at a simple example NN trained on [FashionMNIST](https://github.com/zalandoresearch/fashion-mnist).

FashionMNIST is a more complex version of MNIST and contains black-and-white images of clothes instead of digits. The 10 classes include trousers, coats, shoes, bags and more. 

To load this dataset, we will make use of the PyTorch package `torchvision` ([documentation](https://pytorch.org/docs/stable/torchvision/index.html)). The `torchvision` package consists of popular datasets, model architectures, and common image transformations for computer vision. We will use the package for many of the notebooks in this course to simplify our dataset handling. 

FashionMNIST already has the training and test sets separated, but we need to split the training set into training and validation parts ourselves. Below we load in the dataset:

In [None]:
import torchvision
from torchvision.datasets import FashionMNIST
from torchvision import transforms

# Transformations applied on each image => first make them a tensor, then normalize them in the range -1 to 1

transform = transforms.Compose([transforms.ToTensor(), 
                                transforms.Normalize((0.5,), (0.5,))])

# Loading the training dataset. We need to split it into a training and validation part
train_dataset = FashionMNIST(root=DATASET_PATH, train=True, transform=transform, download=True)
train_set, val_set = torch.utils.data.random_split(train_dataset, [50000, 10000])

# Loading the test set
test_set = FashionMNIST(root=DATASET_PATH, train=False, transform=transform, download=True)

# We define a set of data loaders that we can use for various purposes later.
# Note that for actually training a model, we will use different data loaders
# with a lower batch size.
train_loader = data.DataLoader(train_set, batch_size=1024, shuffle=True, drop_last=False)
val_loader = data.DataLoader(val_set, batch_size=1024, shuffle=False, drop_last=False)
test_loader = data.DataLoader(test_set, batch_size=1024, shuffle=False, drop_last=False)
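
The two preprocessing steps above can be illustrated without downloading anything. The sketch below reproduces the arithmetic of `Normalize((0.5,), (0.5,))` by hand and applies `random_split` to a small stand-in dataset of random tensors; the dataset size of 60 and the 50/10 split are arbitrary choices that mirror the 50000/10000 split above.

```python
import torch
import torch.utils.data as data

# ToTensor() produces pixel values in [0, 1]; Normalize((0.5,), (0.5,)) then
# computes (pixel - mean) / std with mean = std = 0.5
pixels = torch.tensor([0.0, 0.25, 0.5, 1.0])
normalized = (pixels - 0.5) / 0.5
print(normalized)  # values now lie in [-1, 1]

# random_split on a small stand-in dataset of 60 random "images"
fake_dataset = data.TensorDataset(torch.rand(60, 1, 28, 28),
                                  torch.randint(0, 10, (60,)))
fake_train, fake_val = data.random_split(fake_dataset, [50, 10])
print(len(fake_train), len(fake_val))

# The two subsets never share a sample
no_overlap = set(fake_train.indices).isdisjoint(fake_val.indices)
print(no_overlap)
```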

<div style="background-color:#C2F5DD">

### Exercise
    
Visualize a few of the FashionMNIST images to get an impression of the data.

In [None]:
exmp_imgs = [train_set[i][0] for i in range(16)]
# Organize the images into a grid for nicer visualization
img_grid = torchvision.utils.make_grid(torch.stack(exmp_imgs, dim=0), nrow=4, normalize=True, pad_value=0.5)
img_grid = img_grid.permute(1, 2, 0)

plt.figure(figsize=(8,8))
plt.title("FashionMNIST examples")
plt.imshow(img_grid)
plt.axis('off')
plt.show()
plt.close()

### Neural Network Setup

The code below contains the framework to implement the required network: we must view the Fashion MNIST images as 1D tensors and then push them through a sequence of linear layers and a specified activation function. 

I've given the function signature below. Feel free to experiment with other network architectures.

<div style="background-color:#C2F5DD">
    
### Exercise

Complete the cell below by adding the code required to create the hidden layers.
   

In [None]:
class BaseNetwork(nn.Module):
    
    def __init__(self, act_fn, input_size=784, num_classes=10, hidden_sizes=[512, 256, 256, 128]):
        """
        Inputs:
            act_fn - Object of the activation function that should be used as non-linearity in the network.
            input_size - Size of the input images in pixels
            num_classes - Number of classes we want to predict
            hidden_sizes - A list of integers specifying the hidden layer sizes in the NN
        """
        super().__init__()
        
        # Create the network based on the specified hidden sizes
        layers = []
        layer_sizes = [input_size] + hidden_sizes
        for layer_index in range(1, len(layer_sizes)):
            layers += [nn.Linear(layer_sizes[layer_index-1], layer_sizes[layer_index]),
                       act_fn]
        layers += [nn.Linear(layer_sizes[-1], num_classes)]
        self.layers = nn.Sequential(*layers) # nn.Sequential summarizes a list of modules into a single module, applying them in sequence
        
        # We store all hyperparameters in a dictionary for saving and loading of the model
        self.config = {"act_fn": act_fn.config, "input_size": input_size, "num_classes": num_classes, "hidden_sizes": hidden_sizes} 
        
    def forward(self, x):
        x = x.view(x.size(0), -1) # Reshape images to a flat vector
        out = self.layers(x)
        return out    
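
To get a feel for what this construction produces, the sketch below builds the same stack of layers with a hypothetical helper `build_mlp` (a stand-alone copy of the loop in `BaseNetwork`, using `nn.ReLU` as an example activation) and pushes a batch of random stand-in images through it.

```python
import torch
import torch.nn as nn

def build_mlp(act_fn, input_size=784, num_classes=10, hidden_sizes=(512, 256, 256, 128)):
    # Hypothetical helper mirroring the layer construction in BaseNetwork
    layers = []
    layer_sizes = [input_size] + list(hidden_sizes)
    for in_size, out_size in zip(layer_sizes[:-1], layer_sizes[1:]):
        layers += [nn.Linear(in_size, out_size), act_fn]
    layers += [nn.Linear(layer_sizes[-1], num_classes)]
    return nn.Sequential(*layers)

mlp = build_mlp(nn.ReLU())
fake_imgs = torch.randn(8, 1, 28, 28)                   # batch of 8 stand-in images
logits = mlp(fake_imgs.view(fake_imgs.size(0), -1))     # flatten to (8, 784) first
num_params = sum(p.numel() for p in mlp.parameters())
print(logits.shape, num_params)
```

Each image is mapped to 10 raw scores (logits), one per class. Most of the parameters sit in the first linear layer, since it connects all 784 pixels to 512 hidden units.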

We also add functions for loading and saving the model. The hyperparameters are stored in a configuration file (simple json file):

In [None]:
def _get_config_file(model_path, model_name):
    # Name of the file for storing hyperparameter details
    return os.path.join(model_path, model_name + ".config")

def _get_model_file(model_path, model_name):
    # Name of the file for storing network parameters
    return os.path.join(model_path, model_name + ".tar")

def load_model(model_path, model_name, net=None):
    """
    Loads a saved model from disk.
    
    Inputs:
        model_path - Path of the checkpoint directory
        model_name - Name of the model (str)
        net - (Optional) If given, the state dict is loaded into this model. Otherwise, a new model is created.
    """
    config_file, model_file = _get_config_file(model_path, model_name), _get_model_file(model_path, model_name)
    assert os.path.isfile(config_file), f"Could not find the config file \"{config_file}\". Are you sure this is the correct path and you have your model config stored here?"
    assert os.path.isfile(model_file), f"Could not find the model file \"{model_file}\". Are you sure this is the correct path and you have your model stored here?"
    with open(config_file, "r") as f:
        config_dict = json.load(f)
    if net is None:
        act_fn_name = config_dict["act_fn"].pop("name").lower()
        act_fn = act_fn_by_name[act_fn_name](**config_dict.pop("act_fn"))
        net = BaseNetwork(act_fn=act_fn, **config_dict)
    net.load_state_dict(torch.load(model_file, map_location=device))
    return net
    
def save_model(model, model_path, model_name):
    """
    Given a model, we save the state_dict and hyperparameters.
    
    Inputs:
        model - Network object to save parameters from
        model_path - Path of the checkpoint directory
        model_name - Name of the model (str)
    """
    config_dict = model.config
    os.makedirs(model_path, exist_ok=True)
    config_file, model_file = _get_config_file(model_path, model_name), _get_model_file(model_path, model_name)
    with open(config_file, "w") as f:
        json.dump(config_dict, f)
    torch.save(model.state_dict(), model_file)
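
The save and load pattern above can be tried out on a toy model. The sketch below is a simplified round trip under assumed names (a three-entry `config` dictionary and files called `demo.config` and `demo.tar` in a temporary directory); it does not use `BaseNetwork`, but follows the same idea of storing hyperparameters as JSON and weights as a state dict.

```python
import json
import os
import tempfile

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 2))
config = {"input_size": 4, "hidden_size": 3, "num_classes": 2}  # hypothetical config

with tempfile.TemporaryDirectory() as tmp_dir:
    config_file = os.path.join(tmp_dir, "demo.config")
    model_file = os.path.join(tmp_dir, "demo.tar")

    # Save: hyperparameters as JSON, weights as a state dict
    with open(config_file, "w") as f:
        json.dump(config, f)
    torch.save(net.state_dict(), model_file)

    # Load: rebuild the architecture from the config, then restore the weights
    with open(config_file, "r") as f:
        loaded_config = json.load(f)
    restored = nn.Sequential(
        nn.Linear(loaded_config["input_size"], loaded_config["hidden_size"]),
        nn.Tanh(),
        nn.Linear(loaded_config["hidden_size"], loaded_config["num_classes"]),
    )
    restored.load_state_dict(torch.load(model_file, map_location="cpu"))

x = torch.randn(5, 4)
outputs_match = torch.equal(net(x), restored(x))
print(outputs_match)
```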

<hr style="border:2px solid gray">

# Gradient Flow [^](#index)  <a id='gradient-flow'></a>

An important aspect of neural network training is the **gradient** of each parameter (i.e. the partial derivative of the loss function with respect to that parameter). Each activation function results in different gradients, and it is important to understand what these differences mean for the training process.

<div style="background-color:#C2F5DD">
    
### Exercise

Take a freshly initialised network and, for a batch of 256 images, measure the gradients of the loss with respect to the weights of each linear layer, then plot their distributions. This should give you a feeling for how each activation function influences the gradients.

For your loss function, use **cross entropy** (please read the Appendix section [Cross-Entropy](#cross) or see the [documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html)).
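
As a reminder of what `F.cross_entropy` computes, the sketch below reproduces it by hand on random logits: apply a softmax, pick out the probability assigned to the true class, and average the negative logarithm over the batch. The batch size, labels and variable names are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # 4 samples, 10 classes (raw scores, no softmax applied)
labels = torch.tensor([3, 0, 9, 1])

# Manual version: softmax -> probability of the true class -> mean negative log
probs = torch.softmax(logits, dim=-1)
true_class_probs = probs[torch.arange(len(labels)), labels]
manual_loss = -torch.log(true_class_probs).mean()

# Built-in version expects the raw logits and applies log-softmax internally
builtin_loss = F.cross_entropy(logits, labels)
print(manual_loss.item(), builtin_loss.item())
```

Because the built-in function applies the softmax itself, the network's final layer should output raw logits, as `BaseNetwork` does.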

In [None]:
def visualize_gradients(net, color="C0"):
    """
    Inputs:
        net - Object of class BaseNetwork
        color - Color in which we want to visualize the histogram (for easier separation of activation functions)
    """
    net.eval()
    small_loader = data.DataLoader(train_set, batch_size=256, shuffle=False)
    imgs, labels = next(iter(small_loader))
    imgs, labels = imgs.to(device), labels.to(device)
    
    # Pass one batch through the network, and calculate the gradients for the weights
    
    net.zero_grad()
    preds = net(imgs)
    loss = F.cross_entropy(preds, labels)
    loss.backward()
    # We limit our visualization to the weight parameters and exclude the bias to reduce the number of plots
    
    grads = {name: params.grad.data.view(-1).cpu().clone().numpy() for name, params in net.named_parameters() if "weight" in name}
    net.zero_grad()
    
    ## Plotting
    columns = len(grads)
    fig, ax = plt.subplots(1, columns, figsize=(columns*3.5, 2.5))
    fig_index = 0
    for key in grads:
        key_ax = ax[fig_index%columns]
        sns.histplot(data=grads[key], bins=30, ax=key_ax, color=color, kde=True)  
        key_ax.set_title(str(key))
        key_ax.set_xlabel("Grad magnitude")
        fig_index += 1
    fig.suptitle(f"Gradient magnitude distribution for activation function {net.config['act_fn']['name']}", fontsize=14, y=1.05)
    fig.subplots_adjust(wspace=0.45)
    plt.show()
    plt.close() 

In [None]:
# Seaborn prints warnings if histogram has small values. We can ignore them for now
import warnings
warnings.filterwarnings('ignore')
## Create a plot for every activation function
for i, act_fn_name in enumerate(act_fn_by_name):
    set_seed(42) # Setting the seed ensures that we have the same weight initialization for each activation function
    act_fn = act_fn_by_name[act_fn_name]()
    net_actfn = BaseNetwork(act_fn=act_fn).to(device)
    visualize_gradients(net_actfn, color=f"C{i}")
    
# N.B. The layer index in each plot title counts both the linear layers and the activation function layers

To make sense of these plots, we must first understand the principles behind **gradient flow**.

**Gradient flow** describes how activation functions propagate the gradients of the loss function (referred to here as gradients) through the network. First, we should remind ourselves about how a multi-layered NN trains:

1. **Forward Pass:** The input data is fed into the neural network and passes through the successive layers. 
1. **Loss Calculation:** Using a specified loss function, the error of our predicted values compared to the true values is calculated. By training our model, we want to minimise this error.
1. **Backward Pass (Backpropagation):** This is the key concept in gradient flow, and was discussed in more detail in the Neural Networks notebook. 
- Just like in the linear models we considered in ML Basics, we can minimise the loss by moving in the direction of the negative gradient of the loss function. Unlike in linear models, the relationship between the loss function and the weights and biases is often not direct (there may be many NN layers with activation functions in between).
- To solve this, we calculate the gradients by using the chain rule and moving **backwards** through our network. 
- Consider the image below, where $\phi$ represents our activation function and the term in brackets is $z$ (z will have the same indexing notation as the other terms), as we saw before. From the chain rule, we can calculate the effects of weights in the final layer using:


$$\frac{\partial L}{\partial w^{(2)}_{11}} = \frac{\partial L}{\partial \phi  (z^{(2)} )}
                                              \frac{\partial \phi (z^{(2)})}{\partial z^{(2)}}
                                              \frac{\partial z^{(2)}}{\partial w^{(2)}_{11}}
                                              = \frac{\partial L}{\partial y}
                                              \frac{\partial y}{\partial w^{(2)}_{11}}$$
                                                                                           

- Working backwards makes this calculation more efficient. Since we have already calculated $\frac{\partial L}{\partial y}$, we can reuse the result in the chain rule for earlier layers. We can repeat this with $\frac{\partial L}{\partial a_1}$, and so on.

$$ \frac{\partial L}{\partial w^{(1)}_{11}} = \frac{\partial L}{\partial y}
                                              \frac{\partial y}{\partial \phi  (z_1^{(1)} )}
                                              \frac{\partial \phi (z_1^{(1)} )}{\partial z_1^{(1)}}
                                              \frac{\partial z_1^{(1)}}{\partial w^{(1)}_{11}}
                                              = \frac{\partial L}{\partial a_1}
                                              \frac{\partial a_1}{\partial w^{(1)}_{11}}$$
                                            
<img src="weight-and-bias.png" width="600" />

- For more complex networks, we would also have to sum the partial derivatives from each branch. [JeremyJordan](https://www.jeremyjordan.me/neural-networks-training/) includes these terms written out in full, as well as explaining how matrix calculations can decrease the complexity of the many derivatives required. To read a brief description of how these matrices operate, and the use of the `backward()` function, please see Appendix section [Backpropagation with Matrices](#matrix_app).

4. **Update:** Using these gradient calculations, we can then update our weights and biases using a gradient descent method. Note that the learning rate $\alpha$ is, in general, the same for all layers.
5. **Repeat:** The process repeats until a stopping condition is met (such as maximum number of iterations reached).
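
The five steps above can be sketched on a tiny network with one neuron per layer. The example assumes a sigmoid for $\phi$, a squared error loss and a learning rate of 0.1, all chosen only for simplicity, and checks that the chain rule worked out by hand agrees with what `backward()` computes.

```python
import torch

torch.manual_seed(0)
x, target = torch.tensor(0.5), torch.tensor(1.0)
w1, b1, w2, b2 = (torch.randn((), requires_grad=True) for _ in range(4))
lr = 0.1  # learning rate, shared by all parameters

# 1. Forward pass
z1 = w1 * x + b1
a1 = torch.sigmoid(z1)
z2 = w2 * a1 + b2
y = torch.sigmoid(z2)

# 2. Loss calculation (squared error, an assumption made for this sketch)
loss = (y - target) ** 2

# 3. Backward pass
loss.backward()

with torch.no_grad():
    # The same gradients by hand, working backwards with the chain rule
    dL_dy = 2 * (y - target)
    dL_dz2 = dL_dy * y * (1 - y)          # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    dL_dw2 = dL_dz2 * a1
    dL_da1 = dL_dz2 * w2                  # reused for the earlier layer
    dL_dw1 = dL_da1 * a1 * (1 - a1) * x
    manual_matches = torch.allclose(w1.grad, dL_dw1) and torch.allclose(w2.grad, dL_dw2)

    # 4. Update: one gradient descent step on every parameter
    for p in (w1, b1, w2, b2):
        p -= lr * p.grad
        p.grad.zero_()

    # 5. Repeat: a second forward pass gives the loss after the update
    new_loss = (torch.sigmoid(w2 * torch.sigmoid(w1 * x + b1) + b2) - target) ** 2

print(manual_matches, loss.item(), new_loss.item())
```

Note how `dL_dz2` is computed once and then reused for both `dL_dw2` and `dL_da1`; this reuse is what makes working backwards efficient.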

### Issues with Gradient Flow

Imagine we have a very deep neural network with more than 50 layers. Each layer we move back through the network introduces a further factor of $\frac{\partial \phi (z_j^{(i)} )}{\partial z_j^{(i)}}$. So to calculate the effect of the weights between the input and the first hidden layer, our equation would contain a product of more than 50 such factors.

**Vanishing Gradient Problem** 

If the expectation of this gradient term is considerably smaller than 1, this repeated multiplication means that $\frac{\partial L}{\partial w}$ becomes extremely close to 0 for any weights far from the output. This cannot be resolved through the learning rate either, since in general it is kept constant for all layers. The weights are updated in proportion to this gradient term, so a gradient close to 0 means that these earlier weights barely update from one iteration to the next. This results in **slow** or **stalled** learning.

Using an activation function whose gradient does not shrink towards 0 over most of its input range (such as ReLU, whose gradient is exactly 1 for all positive inputs) helps to mitigate this problem.
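
The effect of this repeated multiplication can be seen with a deliberately simplified experiment: apply the same activation 50 times to a single number and ask autograd for the derivative of the result with respect to the input. The sketch below leaves out weights and biases entirely (equivalent to setting every weight to 1), which is an assumption made to isolate the activation term; `chained_gradient` is a hypothetical helper name.

```python
import torch

def chained_gradient(act_fn, depth, x0=1.0):
    # Apply the same activation `depth` times to a scalar and return d(output)/d(input)
    x = torch.tensor(x0, dtype=torch.float64, requires_grad=True)
    out = x
    for _ in range(depth):
        out = act_fn(out)
    out.backward()
    return x.grad.item()

sigmoid_grad = chained_gradient(torch.sigmoid, depth=50)
relu_grad = chained_gradient(torch.relu, depth=50)
upper_bound = 0.25 ** 50  # each sigmoid factor is at most 1/4

print(f"sigmoid chain: {sigmoid_grad:.2e} (upper bound {upper_bound:.2e})")
print(f"relu chain:    {relu_grad:.2e}")
```

For a positive input, every ReLU factor is exactly 1, so the gradient passes through unchanged, whereas the product of 50 sigmoid factors is vanishingly small.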

**Exploding Gradient**

The opposite effect occurs if the expectation of $\frac{\partial \phi (z )}{\partial z}$ is much greater than one. In this instance, backpropagation will cause $\frac{\partial L}{\partial w}$ for weights in earlier layers to be extremely large, and thus cause wild fluctuations in the weights that make it difficult for the model to train effectively.
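
A similarly simplified sketch shows how quickly such a product can grow. Here each "layer" just multiplies its input by a constant factor of 1.5, a hypothetical stand-in for a per-layer gradient term greater than one; real networks combine weights and activation derivatives, so this is only meant to show the scale of the effect.

```python
import torch

depth, scale = 50, 1.5  # hypothetical per-layer gradient factor greater than one

x = torch.tensor(1.0, dtype=torch.float64, requires_grad=True)
out = x
for _ in range(depth):
    out = scale * out  # linear "layer" whose local derivative is `scale`
out.backward()

exploding_grad = x.grad.item()
expected = scale ** depth  # the chain rule gives the product of all local derivatives
print(f"gradient after {depth} layers: {exploding_grad:.3e}")
```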

### Interpreting our Gradient Plots

The sigmoid activation function shows clearly undesirable behaviour. While the gradients for the output layer are very large, with values up to 0.1, the input layer has the lowest gradient norm across all activation functions (only 1e-5). This is due to the sigmoid's small maximum gradient of 1/4, and means that finding a suitable learning rate across all layers is not possible with this setup.
All the other activation functions show similar gradient norms across all layers. Interestingly, the ReLU activation has a spike around 0. This is due to ReLU having zero gradient for negative inputs, which leads to dead neurons (we will take a closer look at this later on).

The initialisation of the weight parameters can also be crucial to creating an optimal model and avoiding problems with vanishing gradients. By default, PyTorch uses the [Kaiming](https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.kaiming_uniform_) initialisation for linear layers optimised for ReLU activations. We could do a whole extra class on initialisation, but assume for now that the Kaiming initialisation works for all activation functions reasonably well (the Universiteit van Amsterdam course has more information on initialisation in [Lecture 4](https://github.com/phlippe/uvadlc_notebooks/tree/master/docs/tutorial_notebooks/tutorial4)).
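
To get a rough idea of the scale of freshly initialised weights, the sketch below inspects a single `nn.Linear` layer and then re-initialises it explicitly with `nn.init.kaiming_uniform_` for ReLU. The layer sizes match the first layer of `BaseNetwork`; the stated bound for the default initialisation is what current PyTorch versions are understood to produce and should be treated as an assumption rather than a guarantee.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
fan_in, fan_out = 784, 512
layer = nn.Linear(fan_in, fan_out)

# Assumption: the default reset leaves the weights uniform in
# [-1/sqrt(fan_in), 1/sqrt(fan_in)]
default_bound = 1 / math.sqrt(fan_in)
default_max = layer.weight.abs().max().item()

# Explicit Kaiming uniform init tuned for ReLU: bound = sqrt(6 / fan_in)
nn.init.kaiming_uniform_(layer.weight, nonlinearity="relu")
relu_bound = math.sqrt(6 / fan_in)
relu_max = layer.weight.abs().max().item()

print(f"default: max |w| = {default_max:.4f}, bound = {default_bound:.4f}")
print(f"relu:    max |w| = {relu_max:.4f}, bound = {relu_bound:.4f}")
```

The wider range of the explicit ReLU variant compensates for ReLU setting roughly half of the activations to zero.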

<hr style="border:2px solid gray">

# Training a Model [^](#index) <a id='training-a-model'></a>

We now want to train our model with different activation functions on the FashionMNIST dataset and compare the performances. Our ultimate goal is to achieve the best possible performance on a dataset of our choice.

<div style="background-color:#C2F5DD">

### Exercise
    
Write a training loop in the next cell, including a validation after every epoch and a final test on the best model. Then try answering the following questions before continuing:
- Which activation functions perform well? 
- Is this what you were expecting from their gradient flow properties?

In [None]:
def train_model(net, model_name, max_epochs=50, patience=7, batch_size=256, overwrite=False):
    """
    Train a model on the training set of FashionMNIST
    
    Inputs:
        net - Object of BaseNetwork
        model_name - (str) Name of the model, used for creating the checkpoint names
        max_epochs - Number of epochs we want to (maximally) train for
        patience - If the performance on the validation set has not improved for #patience epochs, 
                    we stop training early
        batch_size - Size of batches used in training
        overwrite - Determines how to handle the case when there already exists a checkpoint.
                    If True, it will be overwritten. Otherwise, we skip training.
    """
    file_exists = os.path.isfile(_get_model_file(CHECKPOINT_PATH, model_name))
    if file_exists and not overwrite:
        print("Model file already exists. Skipping training...")
    else:
        if file_exists:
            print("Model file exists, but will be overwritten...")
            
        # Defining optimizer, loss and data loader
        optimizer = optim.SGD(net.parameters(), lr=1e-2, momentum=0.9) # Default parameters, feel free to change
        loss_module = nn.CrossEntropyLoss() 
        train_loader_local = data.DataLoader(train_set, batch_size=batch_size, shuffle=True, drop_last=True, pin_memory=True)

        val_scores = []
        best_val_epoch = -1
        for epoch in range(max_epochs):
            ############
            # Training #
            ############
            net.train()
            true_preds, count = 0., 0
            for imgs, labels in tqdm(train_loader_local, desc=f"Epoch {epoch+1}", leave=False):
                imgs, labels = imgs.to(device), labels.to(device) # To GPU if we have one
                optimizer.zero_grad() # Zero-grad can be placed anywhere before "loss.backward()"
                preds = net(imgs)
                loss = loss_module(preds, labels)
                loss.backward()
                optimizer.step()
                # Record statistics during training
                true_preds += (preds.argmax(dim=-1) == labels).sum().item()
                count += labels.shape[0]
            train_acc = true_preds / count

            ##############
            # Validation #
            ##############
            val_acc = test_model(net, val_loader)
            val_scores.append(val_acc)
            print(f"[Epoch {epoch+1:2d}] Training accuracy: {train_acc*100.0:05.2f}%, Validation accuracy: {val_acc*100.0:05.2f}%")

            if len(val_scores) == 1 or val_acc > val_scores[best_val_epoch]:
                print("\t   (New best performance, saving model...)")
                save_model(net, CHECKPOINT_PATH, model_name)
                best_val_epoch = epoch
            elif best_val_epoch <= epoch - patience:
                print(f"Early stopping due to no improvement over the last {patience} epochs")
                break

        # Plot a curve of the validation accuracy
        plt.plot([i for i in range(1,len(val_scores)+1)], val_scores)
        plt.xlabel("Epochs")
        plt.ylabel("Validation accuracy")
        plt.title(f"Validation performance of {model_name}")
        plt.show()
        plt.close()
    
    load_model(CHECKPOINT_PATH, model_name, net=net)
    test_acc = test_model(net, test_loader)
    print((f" Test accuracy: {test_acc*100.0:4.2f}% ").center(50, "=")+"\n")
    return test_acc
    

def test_model(net, data_loader):
    """
    Test a model on a specified dataset.
    
    Inputs:
        net - Trained model of type BaseNetwork
        data_loader - DataLoader object of the dataset to test on (validation or test)
    """
    net.eval()
    true_preds, count = 0., 0
    for imgs, labels in data_loader:
        imgs, labels = imgs.to(device), labels.to(device)
        with torch.no_grad():
            preds = net(imgs).argmax(dim=-1) #returns the indices of the maximum value of all elements
            # dim = the dimension to reduce
            true_preds += (preds == labels).sum().item()
            count += labels.shape[0]
    test_acc = true_preds / count
    return test_acc 
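
The early-stopping rule inside `train_model` can be studied in isolation. The sketch below reproduces only that rule on a list of validation accuracies. The function name `early_stopping_epoch` and the scores are hypothetical and chosen purely for illustration.

```python
def early_stopping_epoch(val_scores, patience=7):
    """Return (best_epoch, last_epoch_run) using the same rule as train_model."""
    best_epoch = -1
    for epoch, score in enumerate(val_scores):
        if epoch == 0 or score > val_scores[best_epoch]:
            best_epoch = epoch  # new best model, this is where we would save a checkpoint
        elif best_epoch <= epoch - patience:
            return best_epoch, epoch  # no improvement for `patience` epochs
    return best_epoch, len(val_scores) - 1

# Made-up validation accuracies: the best value is reached at epoch index 2
scores = [0.5, 0.6, 0.7, 0.69, 0.68, 0.69]
result = early_stopping_epoch(scores, patience=2)
print(result)
```

With `patience=2`, training stops two epochs after the best score, and the checkpoint from the best epoch is the one that gets evaluated on the test set.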

We train one model for each activation function.

*Runtime Warning - this may take over 15 minutes to run.*

In [None]:
for act_fn_name in act_fn_by_name:
    print(f"Training BaseNetwork with {act_fn_name} activation...")
    set_seed(42)
    act_fn = act_fn_by_name[act_fn_name]()
    net_actfn = BaseNetwork(act_fn=act_fn).to(device)
    train_model(net_actfn, f"FashionMNIST_{act_fn_name}", overwrite=True)

Unsurprisingly, the model using the sigmoid activation function fails and does not improve upon random performance (10 classes => 1/10 for random chance). This is because the gradient of the loss function with respect to the input layer parameters is negligible compared to the other layers.
All the other activation functions achieve similar performance. To reach a more reliable conclusion, we would have to train the models for multiple seeds and look at the averages. However, the "optimal" activation function also depends on many other factors (hidden sizes, number of layers, type of layers, task, dataset, optimizer, learning rate, etc.), meaning a thorough grid search would not be useful in our case.
In the literature, the activation functions that have been shown to work well with deep networks are all types of ReLU functions like those we experimented with here, with small gains for specific activation functions in specific networks.

### Visualising the activation distribution 

After we have trained the models, we can look at the actual activation values that are found inside the model. This may provide further insight into the model's performance.

<div style="background-color:#C2F5DD">
    
### Exercise
    
Write a simple function which takes a trained model, applies it to a batch of images, and plots the histogram of the activation function outputs inside the network. Use the histograms to answer the following questions:
    
- How many neurons are set to zero in ReLU? 
- Where do we find most values in Tanh?



In [None]:
def visualize_activations(net, color="C0"):
    activations = {}
    
    net.eval()
    small_loader = data.DataLoader(train_set, batch_size=1024)
    imgs, labels = next(iter(small_loader))
    with torch.no_grad():
        layer_index = 0
        imgs = imgs.to(device)
        imgs = imgs.view(imgs.size(0), -1)
        # We need to manually loop through the layers to save all activations
        for layer_index, layer in enumerate(net.layers[:-1]):
            imgs = layer(imgs)
            activations[layer_index] = imgs.view(-1).cpu().numpy() #reshapes tensor
    
    ## Plotting
    columns = 4
    rows = math.ceil(len(activations)/columns)
    fig, ax = plt.subplots(rows, columns, figsize=(columns*2.7, rows*2.5))
    fig_index = 0
    for key in activations:
        key_ax = ax[fig_index//columns][fig_index%columns]
        sns.histplot(data=activations[key], bins=50, ax=key_ax, color=color, kde=True, stat="density")
        key_ax.set_title(f"Layer {key} - {net.layers[key].__class__.__name__}")
        fig_index += 1
    fig.suptitle(f"Activation distribution for activation function {net.config['act_fn']['name']}", fontsize=14)
    fig.subplots_adjust(hspace=0.4, wspace=0.4)
    plt.show()
    plt.close() 
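
The function above loops through `net.layers` by hand. An alternative that does not rely on the model exposing its layers this way is PyTorch's forward hooks. The sketch below is a minimal illustration on a small stand-in network (`toy_net` and its sizes are hypothetical, not the `BaseNetwork` from this notebook).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Small stand-in model, used only to demonstrate the hook mechanism
toy_net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4), nn.Tanh())

captured = {}

def make_hook(name):
    # The hook is called after the module's forward pass with its output
    def hook(module, inputs, output):
        captured[name] = output.detach().flatten()
    return hook

handles = [
    module.register_forward_hook(make_hook(f"{i}-{module.__class__.__name__}"))
    for i, module in enumerate(toy_net)
]

with torch.no_grad():
    toy_net(torch.randn(32, 8))

for handle in handles:
    handle.remove()  # remove the hooks once the activations are recorded

for name, values in captured.items():
    print(name, tuple(values.shape), values.min().item(), values.max().item())
```

The `captured` dictionary plays the same role as `activations` in `visualize_activations` and could be passed to the same plotting code.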

In [None]:
for i, act_fn_name in enumerate(act_fn_by_name):
    net_actfn = load_model(model_path=CHECKPOINT_PATH, model_name=f"FashionMNIST_{act_fn_name}").to(device)
    visualize_activations(net_actfn, color=f"C{i}")

As the model with **sigmoid** activation was not able to train properly, the activation outputs (or 'activations') are also less informative and are all gathered around 0.5 (this is what the activation function returns for an input of 0).

**Tanh** shows more diverse behavior. 
- In the input layer, we see a larger number of neuron outputs close to -1 and 1 (the expected outputs when the gradients are close to 0), whereas the values in the two consecutive layers are closer to zero. This is probably because the input layers look for specific features in the input image, and the consecutive layers combine those together. 
- The activations for the last layer are again more biased to the extreme values. This is because neurons in the middle layers are simply identifying features and weighting their importance, and there can be many such features. However, the last layer is the true classifier layer. Each neuron must produce the probability that a data instance belongs to a particular set or not, and thus returning extreme values is more instructive.

**ReLU** has a strong peak at 0, as we initially expected. The effect of having no gradients for negative values is that the network does not have a Gaussian-like distribution after the linear layers, but instead has a longer tail towards the positive values. 
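
The origin of the peak at 0 can be sketched with synthetic data. The example below assumes standard normal pre-activations, which is an idealisation and not a measurement of our trained network. Under that assumption, ReLU maps roughly half of the values to exactly zero and leaves only the positive tail.

```python
import torch

torch.manual_seed(0)
z = torch.randn(100_000)   # idealised Gaussian pre-activations (assumption)
a = torch.relu(z)          # ReLU activations

zero_fraction = (a == 0).float().mean().item()
print(f"Fraction of activations that are exactly zero: {zero_fraction:.3f}")
print(f"Smallest activation: {a.min().item():.3f}, largest activation: {a.max().item():.3f}")
```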

**LeakyReLU** shows very similar behaviour, whereas **ELU** follows a more Gaussian-like distribution. The **Swish activation** seems to lie in between, although it is worth noting that Swish uses significantly higher activations than other activation functions (up to a value of 20).

We see that all activation functions show slightly different behaviour, while obtaining similar performance for our simple network. Thus it is apparent that the selection of the "optimal" activation function really depends on many factors, and is not the same for all possible networks.





<hr style="border:2px solid gray">

# Dead Neurons [^](#index) <a id='dead-neurons'></a>

One known drawback of the ReLU activation is the occurrence of "dead neurons". These are neurons with **no gradient** (or only a small gradient) for any training input. For ReLU, this is true if the neuron returns 0 for all the given input data.

A neural network trains by calculating the gradient of the loss function with respect to each layer's parameters. If the gradient calculated when backpropagating is close to 0, the associated parameters will update very slowly (or not at all). This can cause the training to slow or to stall.

A neuron first calculates `'z'`, the output of the linear function, before passing this value to the activation function. If this `'z'` is negative for every input in the training set, the ReLU function will always return 0 and its gradient will always be 0, resulting in a dead neuron.
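
A dead neuron can be constructed by hand to see this effect. In the sketch below, the weights and bias of a single neuron are deliberately set to negative values (a contrived setup for illustration), so that `z` is negative for every non-negative input. Both the output and the gradients of its parameters are then exactly zero.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand(64, 10)  # 64 inputs with values in [0, 1)

neuron = nn.Linear(10, 1)
with torch.no_grad():
    # Contrived parameters: z = -sum(x) - 1 is negative for every input
    neuron.weight.fill_(-1.0)
    neuron.bias.fill_(-1.0)

out = torch.relu(neuron(x))
out.sum().backward()

print("Largest output:", out.abs().max().item())
print("Largest weight gradient:", neuron.weight.grad.abs().max().item())
print("Bias gradient:", neuron.bias.grad.item())
```

Since every gradient is zero, a gradient-based optimiser has no signal with which to move this neuron's parameters.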

Considering the large number of neurons we have in a neural network, it is not unlikely for this to occur. 

To get a better understanding of how much of a problem this is, and when we need to be careful, we will measure how many dead neurons different networks have. 
    

<div style="background-color:#C2F5DD">

### Exercise
    
Implement a function which runs the network on the whole training set and records whether a neuron is exactly 0 for all data points or not.

We will then use this function to measure the number of dead neurons in an untrained ReLU activation function network:   

In [None]:
def measure_number_dead_neurons(net):

    # For each neuron, we create a boolean variable initially set to True. If it has a non-zero activation at any time,
    # we set this variable to False. After running through the whole training set, only dead neurons will still be True.
    neurons_dead = [
        torch.ones(layer.weight.shape[0], device=device, dtype=torch.bool) for layer in net.layers[:-1] if isinstance(layer, nn.Linear)
    ] # Same shapes as hidden size in BaseNetwork

    net.eval()
    with torch.no_grad():
        for imgs, labels in tqdm(train_loader, leave=False): # Run through whole training set
            layer_index = 0
            imgs = imgs.to(device)
            imgs = imgs.view(imgs.size(0), -1)
            for layer in net.layers[:-1]:
                imgs = layer(imgs)
                if isinstance(layer, ActivationFunction):
                    
                    # Are all activations == 0 in the batch, and we did not record the opposite in the last batches?
                    neurons_dead[layer_index] = torch.logical_and(neurons_dead[layer_index], (imgs == 0).all(dim=0))
                    layer_index += 1
    number_neurons_dead = [t.sum().item() for t in neurons_dead]
    print("Number of dead neurons:", number_neurons_dead)
    print("In percentage:", ", ".join([f"{(100.0 * num_dead / tens.shape[0]):4.2f}%" for tens, num_dead in zip(neurons_dead, number_neurons_dead)]))

In [None]:
set_seed(42)
net_relu = BaseNetwork(act_fn=ReLU()).to(device)
measure_number_dead_neurons(net_relu)

We see that only a small number of neurons are dead, but that the number of dead neurons increases with the depth of the layer.

This is not too great of an issue if we only have a small number of dead neurons, such as in this case. As weights are updated in the training process, the inputs to these later layers will change too. These new inputs could allow the dead neuron to become "alive"/active again. 

<div style="background-color:#C2F5DD">
    
### Exercise

Measure the number of dead neurons for different activation functions. Are the results what you expect?

In [None]:
set_seed(42)
net_elu = BaseNetwork(act_fn=ELU()).to(device)
measure_number_dead_neurons(net_elu)

Now let us look at this for a **trained** ReLU network (with the same initialisation):

In [None]:
net_relu = load_model(model_path=CHECKPOINT_PATH, model_name="FashionMNIST_relu").to(device)
measure_number_dead_neurons(net_relu)

The number of dead neurons has indeed decreased in the later layers. However, dead neurons are especially problematic in the input layer. The training data is fixed and thus the inputs to the input layer will not change over epochs. Training the network **cannot** make these neurons become active.

Usually, the input data has a sufficiently high standard deviation to reduce the risk of dead neurons (the range of inputs is quite varied).

<div style="background-color:#C2F5DD">

### Exercise
    
Investigate how the number of dead neurons changes with increasing layer depth. For instance, try building a 10-layer neural network with 256 nodes in the first 5 layers and 128 nodes in the last 5:

In [None]:
set_seed(42)
net_relu = BaseNetwork(act_fn=ReLU(), hidden_sizes=[256, 256, 256, 256, 256, 128, 128, 128, 128, 128]).to(device)
measure_number_dead_neurons(net_relu)

The number of dead neurons is significantly higher than before, which harms the gradient flow, especially in the first iterations. For instance, more than 56% of the neurons in the penultimate layer are dead, which creates a considerable bottleneck.
It is therefore advisable to use other nonlinear activation functions like Swish for very deep networks.
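
The reason Swish avoids this problem can be seen by comparing gradients at a few negative inputs. The sketch below uses hand-picked input values for illustration. ReLU passes no gradient for negative inputs, whereas Swish still passes a gradient there, so a neuron with negative pre-activations can continue to learn.

```python
import torch

z = torch.tensor([-2.0, -0.5, 1.0])  # hand-picked pre-activations

# Gradient of ReLU with respect to its input
z_relu = z.clone().requires_grad_(True)
torch.relu(z_relu).sum().backward()

# Gradient of Swish, f(z) = z * sigmoid(z), with respect to its input
z_swish = z.clone().requires_grad_(True)
(z_swish * torch.sigmoid(z_swish)).sum().backward()

print("ReLU gradients: ", z_relu.grad)
print("Swish gradients:", z_swish.grad)
```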

<hr style="border:2px solid gray">

# Conclusion [^](#index) <a id='conclusion'></a>

In this notebook, we have reviewed a set of six activation functions in neural networks (sigmoid, tanh, ReLU, LeakyReLU, ELU, and Swish), and discussed how they influence the gradient distribution across layers.

Sigmoid tends to fail in deep neural networks since the highest gradient it provides is 0.25. This leads to vanishing gradients in early layers. 

All ReLU-based activation functions performed well and, besides the original ReLU, do not have the issue of dead neurons. 

When implementing your own neural network, it is recommended to start with a ReLU-based network and select the specific activation function based on the properties of the network.

<hr style="border:2px solid gray">
<hr style="border:2px solid gray">

<div style="background-color:#efdff2">

# Appendix <a id='appendix'></a>

[Return to Index](#index)

In this section, you will find longer pieces of mathematics and code which are **non-examinable**. Please read at your own discretion.

- A. [PyTorch backwards() Function](#backwards_app)
- B. [Backpropagation with Matrices](#matrix_app)
- C. [Cross-Entropy](#cross)
- D. [Activation Functions Overview](#act_fn_app)

<hr style="border:2px solid gray">

<div style="background-color:#efdff2">

## A. PyTorch backwards() Function [^^](#appendix) <a id='backwards_app'></a>
    
[Return to Visualising Activation Functions](#visualising-act)     
    
_References: [tutorialspoint](https://www.tutorialspoint.com/how-to-compute-gradients-in-pytorch), PyTorch: [A Gentle Introduction to torch.autograd](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html), Medium: Abishek Bashyall - [Playing with .backward() method in Pytorch](https://abishekbashyall.medium.com/playing-with-backward-method-in-pytorch-bd34b58745a0)_
    
_For further explanation, click [here](https://stackoverflow.com/questions/57320830/why-torch-sum-before-doing-backward) and [here](https://stackoverflow.com/questions/57248777/backward-function-in-pytorch/57249287#57249287) to view two Stack Overflow discussions_
    
PyTorch is designed to create neural networks, where calculating the gradient in back propagation is vital. Therefore when we create a Torch tensor, we can specify if it is a variable from which we would like to calculate the derivatives. We do this by setting `requires_grad = True` (it is False by default). 
    
To calculate the gradient of a function with respect to this variable, we must then call the `autograd.backward()` (or just `backward()`) function on our output. Two such examples are given below.

In [None]:
x = torch.tensor(5., requires_grad=True)
y = x**2
y.backward()
print (x.grad) # Expect 2*x = 10

In [None]:
x = torch.tensor(3., ) # We would not need this gradient for back propagation

w = torch.tensor(2.0, requires_grad = True)
b = torch.tensor(5.0, requires_grad = True)

y = w * x + b
print("y:", y)

# Compute gradients by calling backward function for y
y.backward()

# Access and print the gradients w.r.t x, w, and b
dx = x.grad
dw = w.grad
db = b.grad
print("x.grad :", dx) # We did not require a gradient for this variable
print("w.grad :", dw) # Expect x = 3
print("b.grad :", db) # Expect 1

<div style="background-color:#efdff2">

We can see what happens if we have a function act on another, for example by multiplying our output y by 2. Unlike w and b, y is not a **leaf node**: it is calculated from the leaf nodes rather than created directly. PyTorch does not keep the gradients of non-leaf tensors by default, so `y.grad` returns `None` (with a warning) rather than the gradient. If we did want to find the gradient with respect to y, we would call the `retain_grad()` method on y before calling `backward()`.

In [None]:
x = torch.tensor(3., ) # We would not need this gradient for back propagation

w = torch.tensor(2.0, requires_grad = True)
b = torch.tensor(5.0, requires_grad = True)

y = w * x + b

# Uncomment the line below to retain the gradient
#y.retain_grad()

z = 2 * y

z.backward()

db = b.grad
dw = w.grad
dy = y.grad

print("w.grad :", dw) # Expect 2*x = 6
print("b.grad :", db) # Expect 2
print("y.grad :", dy) # None unless y.retain_grad() was called above, in which case expect 2
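
The distinction between leaf and non-leaf tensors can be checked directly with the `is_leaf` attribute. The sketch below repeats the calculation above with `retain_grad()` enabled, so that the gradient with respect to `y` is kept.

```python
import torch

x = torch.tensor(3.0)
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(5.0, requires_grad=True)

y = w * x + b
y.retain_grad()  # ask autograd to keep the gradient of this non-leaf tensor
z = 2 * y
z.backward()

print("w is a leaf:", w.is_leaf, "| y is a leaf:", y.is_leaf)
print("w.grad:", w.grad)  # dz/dw = 2 * x
print("b.grad:", b.grad)  # dz/db = 2
print("y.grad:", y.grad)  # dz/dy = 2
```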


<div style="background-color:#efdff2">

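`backward()` can only be called without arguments on a scalar tensor, since there is no single gradient to return for a non-scalar output. If the output is not a scalar, PyTorch raises a `RuntimeError` ("grad can be implicitly created only for scalar outputs") and the code will not run. An example of this is given below.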

In [None]:
x_arr = torch.tensor([1., 2., 3.], requires_grad=True)
y_arr= x_arr ** 2 
y_arr.backward()
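
If you would rather inspect the error than have the cell stop, the call can be wrapped in a `try`/`except` block. This is a small sketch of that pattern.

```python
import torch

x_arr = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y_arr = x_arr ** 2

try:
    y_arr.backward()  # y_arr is a vector, so no implicit gradient can be created
except RuntimeError as err:
    message = str(err)
    print("RuntimeError:", message)
```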

<div style="background-color:#efdff2">
    
One way to get past this would be to simply create a scalar tensor that would make `backward()` produce the required gradients - for example the sum of all the elements in our output: $\sum_{i,j} y_i^{(j)}$. Since this sum is unweighted, this method can be described as that of _equal gradient flow_. Note that each output element $y_i^{(j)} = f(x_i^{(j)})$ only depends on $x_i^{(j)}$, so every term in the sum except one vanishes:
    
\begin{equation}    
\frac{\partial}{\partial x_k^{(l)}} \sum_{i,j} y_i^{(j)} = \sum_{i,j} \frac{\partial f(x_i^{(j)})}{\partial x_k^{(l)}} = \frac{\partial f(x_k^{(l)})}{\partial x_k^{(l)}}
\end{equation}    
    
`x.grad` would then produce a torch tensor containing all of these derivatives, as required.

In [None]:
x = torch.tensor([[1.,2.,3.],[4.,5.,6.]], requires_grad=True)
print (x)
print ()

# Imagine that squaring our data is the activation function
# This would mean the gradient should be given by 2*x
out = x**2 
out.sum().backward()


print(x.grad) # This should be 2*input values

<div style="background-color:#efdff2">
    
An alternative way to achieve our desired outcome is by setting the gradient argument in the backward() function. Understanding the principles behind why this works requires knowledge of PyTorch's autograd function. 
    
First, imagine our function is of the form $\boldsymbol{y} = f (\boldsymbol{x})$, where **x** and **y** have size n and m respectively. 
   
When we call `backward()` on a scalar, we are actually calling the default `backward(gradient = torch.tensor(1.))`. We will refer to this gradient using $\boldsymbol{v}$. Autograd then performs the following calculations, where J is the **Jacobian**:
    
\begin{equation}
    J^T \cdot \boldsymbol{v} =
                              \begin{bmatrix} 
                              \frac{\partial y_1}{\partial x_1} & \dots  & \frac{\partial y_m}{\partial x_1}\\
                                    \vdots & \ddots & \vdots\\
                              \frac{\partial y_1}{\partial x_n} & \dots  & \frac{\partial y_m}{\partial x_n} 
                              \end{bmatrix} \boldsymbol{v}
\end{equation}
    
If **x** and **y** were the same size, and $y_i$ only depended on $x_i$, we would just have a diagonal matrix. By setting **v** to be the same size as **x** and only contain ones, we would get the equation we were after. We can also see how the simple 1D equation works:
    
$$
\begin{bmatrix} 
      \frac{\partial y_1}{\partial x_1} & \dots  & 0 \\
      \vdots & \ddots & \vdots\\
      0 & \dots  & \frac{\partial y_m}{\partial x_m} 
\end{bmatrix} 
\begin{pmatrix} 
      1 \\ \vdots \\ 1 
\end{pmatrix} = 
\begin{pmatrix} 
      \frac{\partial y_1}{\partial x_1} \\ \vdots \\ \frac{\partial y_m}{\partial x_m} 
\end{pmatrix}
$$
    
If X and Y are matrices rather than vectors, then we can imagine the autograd function working on each column (or row) in turn. The gradient passed to `backward()` must still have the same shape as the matrices X and Y. An example of this is shown below:

In [None]:
x = torch.tensor([[1.,2.,3.],[4.,5.,6.]], requires_grad=True)
print (x)
print ()

out = x**2 
out.backward(gradient = torch.ones_like(x))

print(x.grad) # This should be 2*input values
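
When $y_i$ depends on several elements of **x** and **v** is not all ones, the result is the full product $J^T \cdot \boldsymbol{v}$. The sketch below checks this numerically on a small made-up function `f`, using `torch.autograd.functional.jacobian` to build the Jacobian explicitly. The function and the values of `v` are arbitrary choices for illustration.

```python
import torch

def f(x):
    # Made-up function with 3 inputs and 3 outputs, each mixing several inputs
    return torch.stack([x[0] * x[1], x[1] ** 2, x[0] + x[2]])

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 0.5, 2.0])

# Vector-Jacobian product computed by autograd
y = f(x)
y.backward(gradient=v)

# The same product computed explicitly from the full Jacobian
J = torch.autograd.functional.jacobian(f, x.detach())
expected = J.T @ v

print("x.grad: ", x.grad)
print("J^T v:  ", expected)
```

Autograd never needs to build `J` itself, which is what makes backpropagation affordable for networks with many parameters.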


<hr style="border:2px solid gray">

<div style="background-color:#efdff2">

## B. Backpropagation with Matrices [^^](#appendix) <a id='matrix_app'></a>
    
[Return to Gradient Flow](#gradient-flow) 
    
_References: Machine Learning Mastery: [A Gentle Introduction to the Jacobian](https://machinelearningmastery.com/a-gentle-introduction-to-the-jacobian/), PyTorch: [A Gentle Introduction to torch.autograd](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html)_
    
We will use the previously considered backward() function to briefly examine how to use matrices for the chain rule.

In most instances, you will call `backward()` on your loss function in order to calculate the gradients required for optimisation. 

In this section, we will consider how autograd behaves if **v** is the gradient of some **scalar** function $l$: $\boldsymbol{v} = \nabla l(\boldsymbol{y})$, with grad calculated with respect to **y**. Therefore:

$$
J^T \cdot \boldsymbol{v} =
\begin{bmatrix} 
      \frac{\partial y_1}{\partial x_1} & \dots  & \frac{\partial y_m}{\partial x_1}\\
      \vdots & \ddots & \vdots\\
      \frac{\partial y_1}{\partial x_n} & \dots  & \frac{\partial y_m}{\partial x_n} 
\end{bmatrix}
\begin{pmatrix}  
      \frac{\partial l}{\partial y_1} \\ \vdots \\  \frac{\partial l}{\partial y_m} 
\end{pmatrix} = 
\begin{pmatrix}  
      \frac{\partial l}{\partial x_1} \\ \vdots \\  \frac{\partial l}{\partial x_n} 
\end{pmatrix}
$$

This means that for a loss function $l$, we can calculate all the partial derivatives with respect to the elements of **x** (which could be the weights, the biases etc) by calculating the Jacobian of **y** with respect to **x**, and having previously calculated how $l$ depends on **y**. We can then use this differential to calculate the next gradient, and so on - this is a key principle of backpropagation. If **v** contained only ones, as we saw previously, this would mean that a change to any element in **y** would have an equal impact on the loss function - this is why we described this ones tensor as having _equal gradient flow_.
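
This two-step view of the chain rule can be checked numerically. The sketch below uses a made-up linear map `W` and a made-up scalar function $l = \sum_i y_i^2$ (both are assumptions for illustration). It compares calling `backward()` on $l$ directly with first computing $\boldsymbol{v} = \partial l / \partial \boldsymbol{y}$ by hand and then passing it to `backward()` on **y**.

```python
import torch

W = torch.tensor([[1.0, 0.0, 2.0],
                  [0.0, 1.0, -1.0]])  # made-up linear map, so the Jacobian of y is W

# Route 1: let autograd handle the whole chain
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = W @ x
l = (y ** 2).sum()
l.backward()
grad_direct = x.grad.clone()

# Route 2: compute v = dl/dy by hand, then backpropagate it through y
x2 = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y2 = W @ x2
v = 2 * y2.detach()  # dl/dy for l = sum(y^2)
y2.backward(gradient=v)
grad_two_step = x2.grad.clone()

print("direct:  ", grad_direct)
print("two-step:", grad_two_step)
```

Both routes give the same gradient, since autograd applies exactly this product layer by layer when moving backwards through the network.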




<hr style="border:2px solid gray">

<div style="background-color:#efdff2">

## C. Cross-Entropy [^^](#appendix) <a id='cross'></a>

[Return to Gradient Flow](#gradient-flow)   

Cross-Entropy is a commonly used loss function for classification problems.

### Binary Cross-Entropy/Log Loss

If we have a binary classification problem, the cross entropy loss function is **the same** as that of the log loss. 

When this type of model makes a prediction about a data instance's classification, it is actually calculating the **probability** of the instance belonging to each category, and assigning it to the category where the probability is greatest. We use this probability value in our loss function calculations.

Imagine we assign a label, $y_i$, to our data points such that belonging to the category has a label value of 1, and not belonging has a label value of 0. The model's predicted probability of a data point belonging to the category is given by $p(x_i)$.

The formula for log loss is given by:

\begin{equation}
    H = - \frac{1}{N} \sum_{i=1}^N \left[ y_i \cdot \log( p(x_i)) + (1 - y_i) \cdot \log(1 - p(x_i)) \right]
\end{equation}

The label $y_i$ can only be 0 or 1, so for each instance only one of the log probability terms will be included in the sum. Also note that since this is a binary classification problem, $1 - p(x_i)$ is the model's predicted probability that the instance is not in the category.

$-\log(a)$ tends to infinity as $a$ tends to 0, and $-\log(1) = 0$. Therefore worse prediction values will contribute a larger amount to the loss function, and more accurate predictions will contribute amounts closer to 0. This is what we require for a loss function.
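
The behaviour described above can be verified with a direct implementation of the log loss formula. The sketch below uses plain Python, and the labels and probabilities are made up for illustration.

```python
import math

def binary_cross_entropy(labels, probs):
    """Log loss averaged over all instances."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)

labels = [1, 0, 1, 0]
confident_correct = [0.9, 0.1, 0.8, 0.2]   # made-up predictions close to the labels
uninformative = [0.5, 0.5, 0.5, 0.5]       # the model always predicts 50%
confident_wrong = [0.1, 0.9, 0.2, 0.8]     # made-up predictions far from the labels

loss_correct = binary_cross_entropy(labels, confident_correct)
loss_uninformative = binary_cross_entropy(labels, uninformative)
loss_wrong = binary_cross_entropy(labels, confident_wrong)

print(f"confident and correct: {loss_correct:.4f}")
print(f"uninformative:         {loss_uninformative:.4f}")
print(f"confident and wrong:   {loss_wrong:.4f}")
```

The uninformative model scores exactly $\log 2$, and the loss grows as the predictions move further from the labels.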

### More categories

It is easier to work with a binary model than when there are multiple outputs. Thus we choose to convert a multiple category model to a binary type. 

Let there be m possible categories for our dataset to fall into. We now give each data instance m labels, where the label $y_i^{(k)}$ is 1 or 0 depending on whether the instance is in the $k^{th}$ category or not. The categories are discrete, so only one label value will be non-zero. This data engineering method is called **one-hot encoding**. $p_k(x_i)$ denotes the model's predicted probability of the data instance belonging to category k.

To calculate the cross-entropy, we can then simply sum over each respective category:

\begin{equation}
    H = - \frac{1}{N} \sum_{k=1}^m \sum_{i=1}^N \left[ y_i^{(k)} \cdot \log( p_k(x_i)) + (1 - y_i^{(k)}) \cdot \log(1 - p_k(x_i)) \right]
\end{equation}

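
For comparison, the sketch below shows one-hot encoding in PyTorch and how it relates to the built-in cross-entropy. To our understanding, `nn.CrossEntropyLoss` (used in `train_model`) applies a softmax to the network outputs and keeps only the $y_i^{(k)} \cdot \log(p_k(x_i))$ terms, so only the predicted probability of the true class contributes. The scores and labels below are randomly generated placeholders.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 10)              # placeholder scores: 5 instances, 10 classes
labels = torch.tensor([3, 0, 9, 3, 7])   # placeholder class indices

# One-hot encoding: a single 1 per row, in the column of the true class
one_hot = F.one_hot(labels, num_classes=10).float()

# Cross-entropy written out by hand with the y * log(p) terms only
log_probs = F.log_softmax(logits, dim=-1)
manual = -(one_hot * log_probs).sum(dim=-1).mean()

# Built-in version, which takes class indices instead of one-hot vectors
builtin = F.cross_entropy(logits, labels)

print("manual: ", manual.item())
print("builtin:", builtin.item())
```
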
### Connection to Entropy

If you have previously studied thermodynamics or information theory, you will be able to see why this loss function is related to **entropy**. For example, consider the similarities of the loss function to the equation for Gibbs entropy: $S_G = -\sum p_i \log(p_i)$, where $p_i$ is the probability of the system being in a given microstate.

_References: Medium: Daniel Godoy - [Understanding binary cross-entropy/log loss](https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a), Medium: Vlastimil Martinek - [Cross-entropy for Classification](https://towardsdatascience.com/cross-entropy-for-classification-d98e7f974451)._

<div style="background-color:#efdff2">

## D. Activation Functions Overview [^^](#appendix) <a id='act_fn_app'></a>

 
### Sigmoid

The sigmoid function has the form: $$f(x) = \frac{1}{1+e^{-x}} = \frac{e^x}{e^x + 1}$$


### Tanh

The tanh function has the form: 
    
$$ f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

### ReLU

ReLU stands for "Rectified Linear Unit", and has the form: 
    
$$ f(x) = \max(0,x) =  \begin{cases}
x, & x > 0 \\
0, & x \leq 0
\end{cases}$$
    
### Leaky ReLU

The Leaky ReLU function has the form: 
    
$$ f(x) =  \begin{cases}
  x, & x > 0 \\
  ax, & x \leq 0
 \end{cases}, \textrm{ } (0<a<1).$$
 
 
### ELU

ELU stands for "Exponential Linear Unit". Notice that when $\alpha=0$, it reduces to the ReLU function. ELU has the form:
    
$$ f(x) = \begin{cases}
  x, & x > 0 \\
  \alpha(e^x-1), & x \leq 0
 \end{cases}$$
    

 
 
### Swish

The Swish function is the sigmoid function multiplied by $x$. It has the form:
$$f(x) = \frac{x}{1+e^{-x}}$$
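
As a sanity check, the formulas in this overview can be written out directly and compared against PyTorch's built-in functions. The values `a = 0.01` and `alpha = 1.0` are assumptions chosen to match the defaults of `F.leaky_relu` and `F.elu`, and `F.silu` is the PyTorch name for Swish.

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 13)
a, alpha = 0.01, 1.0  # assumed slope and ELU scale, matching the PyTorch defaults

# The six formulas from this overview, written out explicitly
sigmoid = 1 / (1 + torch.exp(-x))
tanh = (torch.exp(x) - torch.exp(-x)) / (torch.exp(x) + torch.exp(-x))
relu = torch.clamp(x, min=0)
leaky_relu = torch.where(x > 0, x, a * x)
elu = torch.where(x > 0, x, alpha * (torch.exp(x) - 1))
swish = x * sigmoid

# Compare each formula with the corresponding built-in function
checks = {
    "sigmoid": torch.allclose(sigmoid, torch.sigmoid(x), atol=1e-6),
    "tanh": torch.allclose(tanh, torch.tanh(x), atol=1e-6),
    "relu": torch.allclose(relu, F.relu(x), atol=1e-6),
    "leaky_relu": torch.allclose(leaky_relu, F.leaky_relu(x, negative_slope=a), atol=1e-6),
    "elu": torch.allclose(elu, F.elu(x, alpha=alpha), atol=1e-6),
    "swish": torch.allclose(swish, F.silu(x), atol=1e-6),
}
print(checks)
```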

## References

[1] Ramachandran, Prajit, Barret Zoph, and Quoc V. Le. "Searching for activation functions." arXiv preprint arXiv:1710.05941 (2017). [Paper link](https://arxiv.org/abs/1710.05941) 