# Notebook 05 - Loss Functions

__Last notebook, we finished by training our neural net on a small dataset with the following training loop:__

1) Feed our input samples into the model and perform a forward pass to get the predicted values

2) Calculate the loss value (which we want to minimize through training). For this example we'll use a simple maximum-margin hinge loss for binary classification. This is basically squared error where the labels are `-1` or `1`.

3) "Zero out" the gradients - if we don't manually set the gradients to zero, the gradients for each Scalar value will keep accumulating during model training, which may lead to undesired behaviour during training. The PyTorch equivalent is `zero_grad()`.

4) Perform a backward pass to calculate the gradients of every parameter (weight & bias) in the model

5) Update the parameter values of the model. Each parameter is adjusted by a `learning_rate` (set to 0.1) multiplied by the negative of its gradient. We'll go deeper into gradient descent next notebook.
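As a reminder of how the five steps fit together, here is a minimal sketch in plain Python. It is not KaiTorch code: the one-parameter model `y_pred = w * x`, the toy data, and the hand-derived gradient are all assumptions made just for illustration.

```python
# Hypothetical one-parameter model y_pred = w * x (not a KaiTorch model)
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # toy data generated by the "true" weight w = 2

w = 0.0                # the single parameter
grad = 0.0             # its gradient
learning_rate = 0.1

losses = []
for epoch in range(20):
    # 1) forward pass
    y_preds = [w * x for x in xs]

    # 2) loss value (mean of the squared errors in this sketch)
    loss = sum((y - y_pred) ** 2 for y, y_pred in zip(ys, y_preds)) / len(ys)
    losses.append(loss)

    # 3) zero out the gradient before accumulating a new one
    grad = 0.0

    # 4) backward pass: d(loss)/dw, derived by hand for this tiny model
    for x, y, y_pred in zip(xs, ys, y_preds):
        grad += 2 * (y_pred - y) * x / len(xs)

    # 5) update: step against the gradient
    w += learning_rate * -1 * grad

print(f"w = {w:.4f}, first loss = {losses[0]:.4f}, last loss = {losses[-1]:.8f}")
```

In the real training loop, step 4 is handled by `backward()` rather than a hand-written derivative.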

__In step 2, I kind of just threw a random loss function at you__, so in this notebook we'll be diving deeper into this step and building more functionality around our `Sequential` class.

# Loss Functions

Going back to notebook 1 for a second, remember a function maps inputs to outputs.

A loss function is a function that takes the __predicted output__ of our neural network and the __actual values__ from our training data as inputs, and outputs a __measure of difference between predicted and actual values__, called the __loss value__.

This is useful when it comes to neural networks because we want to get our predicted outputs to match the actual values closely, or in other words, __minimize the loss value__. We can achieve this during our training using gradient descent.

Since a loss function can be decomposed into basic differentiable operations, the loss function as a whole is differentiable. By calling `backward()` on the loss value, we can calculate the gradient of the loss value with respect to every parameter in the network through backpropagation, and train the network by updating the weights.
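To make "differentiable" a bit more concrete, here is a small sketch that compares a hand-derived gradient of a squared-error loss against a numerical estimate. The helper `mse_plain` and the sample values are assumptions for illustration; KaiTorch gets the same kind of gradient from `backward()` instead.

```python
def mse_plain(ys, y_preds):
    # mean of squared differences on plain floats (illustrative helper)
    return sum((y - p) ** 2 for y, p in zip(ys, y_preds)) / len(ys)

ys = [1.0, -1.0]
y_preds = [0.5, 0.2]

# Analytic gradient of the loss with respect to each prediction: 2 * (y_pred - y) / N
analytic = [2 * (p - y) / len(ys) for y, p in zip(ys, y_preds)]

# Numerical gradient: nudge one prediction at a time (central difference)
eps = 1e-6
numeric = []
for i in range(len(y_preds)):
    up = list(y_preds)
    up[i] += eps
    down = list(y_preds)
    down[i] -= eps
    numeric.append((mse_plain(ys, up) - mse_plain(ys, down)) / (2 * eps))

print(analytic)
print(numeric)
```

The two lists should agree to several decimal places, which is what lets gradient descent trust the gradients it is given.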

In this notebook we'll implement the 3 most common loss functions:

- Mean Squared Error (Regression)
- Binary Cross Entropy (Binary Classification)
- Categorical Cross Entropy (Multi-class Classification)

In [2]:
# kaitorch/losses.py

from kaitorch.utils import wrap
from kaitorch.core import Scalar

__all__ = ['mse', 'binary_crossentropy', 'categorical_crossentropy']

### Mean Squared Error

Mean Squared Error (MSE) is often used for regression problems, where we are trying to predict a continuous and unbounded output value. It measures the average of the squared differences between the predicted output and the actual value for a given set of training data.

__Breaking down MSE into its respective terms in reverse:__

- __Error__ - the error is the difference between the predicted output and the actual value

- __Squared__ - squaring the error penalizes larger differences more heavily, making the model more likely to correct large errors during training.

- __Mean__ - taking the mean of the squared error gives us a better estimate of the overall model prediction error (as compared to using only the error of a single sample), reducing the variance of our model training.

$$ \text{MSE} = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2 $$

- $N$ is the number of samples
- $y_i$ is the actual value (continuous unbounded)
- $\hat{y}_i$ is the predicted value (continuous unbounded)
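Before wiring this up with `Scalar` values, it can help to see the formula on plain floats. The sketch below uses a hypothetical helper `mse_plain` (not part of KaiTorch) and shows both the per-sample squared errors and their mean over the batch.

```python
def mse_plain(ys, y_preds):
    """Mean of squared differences over plain Python numbers."""
    n = len(ys)
    return sum((y - y_pred) ** 2 for y, y_pred in zip(ys, y_preds)) / n

ys = [0, 0, 0, 0]
y_preds = [0, 0.1, 1, 10]

# Squared error of each sample on its own
per_sample = [(y - y_pred) ** 2 for y, y_pred in zip(ys, y_preds)]

# Mean over the whole batch: the single large error dominates
batch_loss = mse_plain(ys, y_preds)

print(per_sample)
print(batch_loss)
```

Notice how the one prediction that is off by 10 contributes far more than the others combined, which is the "penalizes larger differences more heavily" behaviour described above.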


In [3]:
# kaitorch/losses.py

def mse():
    return MeanSquaredError()

class MeanSquaredError:

    def __init__(self):
        pass

    def __call__(self, ys: list, y_preds: list):

        ys, y_preds = wrap(ys), wrap(y_preds)

        # 1/N
        pred_length = len(ys)
        
        # Summation Term
        squared_error = sum(
            (y - y_pred)**2 for y, y_pred in zip(ys, y_preds))
        
        # Mean Squared Error
        mean_squared_error = squared_error/pred_length

        return mean_squared_error

    def __repr__(self):
        return 'MeanSquaredError()'

In [4]:
mse_fn = mse()  # separate name so we don't overwrite the `mse` factory function

ys = [0, 0, 0, 0]
y_preds = [0, 0.1, 1, 10]

for y, y_pred in zip(ys, y_preds):
    mse_loss = mse_fn(y, Scalar(y_pred))
    print(f'When y={y}, y_pred={y_pred:<3}, MSE loss is {mse_loss.data:.5}')

When y=0, y_pred=0  , MSE loss is 1e-16
When y=0, y_pred=0.1, MSE loss is 0.01
When y=0, y_pred=1  , MSE loss is 1.0
When y=0, y_pred=10 , MSE loss is 100.0


### Binary Cross Entropy

Binary Cross Entropy (BCE) is often used in binary classification problems, where we are trying to determine if a sample belongs to the positive (1) or negative (0) class. BCE is a measure of __how well a model is able to distinguish between the positive and negative class.__

$$ \text{BCE} = -\frac{1}{N} \sum_{i=1}^N [y_i * \ln \hat{y}_i + (1-y_i) * \ln (1-\hat{y}_i)] $$

- $N$ is the number of samples
- $y_i$ is the actual class (0 or 1)
- $\hat{y}_i$ is the predicted probability of the positive class (continuous between 0 and 1)

Splitting this up into its left and right hand terms.

__left term:__ $y_i * \ln \hat{y}_i$

- When the actual class is positive ($y_i=1$), then $y_i = 1, 1 - y_i = 0$ and the right hand term is 0.

__right term:__ $(1-y_i) * \ln (1-\hat{y}_i)$

- When the actual class is negative ($y_i=0$), then $y_i = 0, 1 - y_i = 1$ and the left hand term is 0.

So, for any data point, only half of this loss function is "active" (the other is 0), and it reduces to the negative logarithm of the predicted probability of the actual class.

The negative logarithm is used because it has the following properties:
- When $x=1, -\ln(x)=0$, so when we confidently predict the correct class, the error term is 0
- As $x \to 0, -\ln(x) \to \infty$, so the more confidently we predict the incorrect class, the larger the error term grows

By minimizing the BCE loss function, the model is encouraged to produce predicted probabilities that are closer to the true class.
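Here is the same formula on plain floats using `math.log`, as a rough sketch. The helper name `bce_plain` is an assumption for illustration, and it expects every prediction to be strictly between 0 and 1 (otherwise the logarithm is undefined).

```python
import math

def bce_plain(ys, y_preds):
    """Binary cross entropy on plain floats; predictions must lie strictly in (0, 1)."""
    total = 0.0
    for y, p in zip(ys, y_preds):
        # Only one of the two terms is non-zero for any given label
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(ys)

confident_right = bce_plain([1], [0.99])   # small loss
confident_wrong = bce_plain([1], [0.01])   # large loss

print(f"{confident_right:.5f} {confident_wrong:.5f}")
```

A confident correct prediction costs almost nothing, while a confident wrong one is punished heavily.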

In [5]:
# kaitorch/losses.py

def binary_crossentropy():
    return BinaryCrossentropy()

class BinaryCrossentropy:

    def __init__(self):
        pass

    def __call__(self, ys, y_preds):

        loss = 0.0
        ys, y_preds = wrap(ys), wrap(y_preds)

        # 1/N
        pred_length = len(ys)

        # Summation term - could've done this more concisely but wanted to make the logic clear
        for y, y_pred in zip(ys, y_preds):

            # active left term
            if y == 1:
                loss += -(y_pred).log()

            # active right term
            elif y == 0:
                loss += -(1 - y_pred).log()

        # Binary Cross Entropy
        binary_crossentropy_loss = loss / pred_length

        return binary_crossentropy_loss

    def __repr__(self):
        return 'BinaryCrossentropy()'

In [6]:
bce = BinaryCrossentropy()

ys = [1, 1, 0, 0]
y_preds = [0.99, 0.01, 0.99, 0.01]

for y, y_pred in zip(ys, y_preds):
    bce_loss = bce(y, Scalar(y_pred))
    print(f'When y={y}, y_pred={y_pred}, BCE loss is {bce_loss.data:.5}')

When y=1, y_pred=0.99, BCE loss is 0.01005
When y=1, y_pred=0.01, BCE loss is 4.6052
When y=0, y_pred=0.99, BCE loss is 4.6052
When y=0, y_pred=0.01, BCE loss is 0.01005


### Categorical Cross Entropy

Categorical Cross Entropy (CCE) is often used for multi-class classification problems, where we are trying to determine which class a sample belongs to. CCE is a measure of __how well a model is able to distinguish the correct class__.

CCE is the generalized version of BCE - instead of determining whether or not a sample belongs to a class, it determines which, out of multiple classes, a sample belongs to.

Class membership (or otherwise) is represented using __one-hot encodings__. For example, if we have 3 classes and a sample belongs to the second class, we represent this using the __one-hot vector__ `[0, 1, 0]`.
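A one-hot vector is easy to build by hand. The helper below is a hypothetical sketch (not a KaiTorch function) and assumes classes are numbered from 0, so the "second class" has index 1.

```python
def one_hot(class_index, num_classes):
    """Return a list with a 1 at class_index and 0 everywhere else (0-based index)."""
    return [1 if j == class_index else 0 for j in range(num_classes)]

second_of_three = one_hot(1, 3)
print(second_of_three)
```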

$$ \text{CCE} = -\frac{1}{N} \sum_{i=1}^N \sum_{j=1}^C [y_{ij} * \ln \hat{y}_{ij} + (1-y_{ij}) * \ln (1-\hat{y}_{ij})] $$

- $N$ is the number of samples
- $C$ is the number of classes
- $y_i$ is the OHE vector of the actual class (list of 0s and 1s)
- $y_{ij}$ is whether or not the actual class is `j` (0 or 1)
- $\hat{y}_i$ is the vector of predicted class probabilities (list of values between 0 and 1)
- $\hat{y}_{ij}$ is the predicted probability of class `j`

In plain terms, this is how BCE generalizes to CCE:

- With BCE, the question is whether or not a sample belongs to a class.
- With CCE, the question is whether or not a sample belongs to class A, whether or not it belongs to class B, ... etc.

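Categorical cross entropy, as implemented here, is essentially the sum of BCEs for every class in our problem, and as you'll see in the code below, the logic is essentially the same, with an added loop to account for the additional classes. (The more common textbook form keeps only the $y_{ij} * \ln \hat{y}_{ij}$ term; this version also penalizes probability assigned to the wrong classes.)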
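As a sanity check on the formula, here is a plain-float sketch of the per-class sum. The helper `cce_plain` is an assumption for illustration; it expects one-hot labels and predictions strictly between 0 and 1.

```python
import math

def cce_plain(ys, y_preds):
    """Sum of per-class binary cross entropies, averaged over samples."""
    total = 0.0
    for y_ohe, p_ohe in zip(ys, y_preds):      # outer sum over samples
        for y, p in zip(y_ohe, p_ohe):         # inner sum over classes
            total += -math.log(p) if y == 1 else -math.log(1 - p)
    return total / len(ys)

all_right = cce_plain([[1, 0, 0]], [[0.99, 0.01, 0.01]])
one_wrong = cce_plain([[1, 0, 0]], [[0.99, 0.01, 0.99]])
all_wrong = cce_plain([[1, 0, 0]], [[0.01, 0.99, 0.99]])

print(f"{all_right:.5f} {one_wrong:.5f} {all_wrong:.5f}")
```

Each confidently wrong class adds roughly 4.6 to the loss, so the total grows with the number of classes the model gets wrong.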

In [7]:
# kaitorch/losses.py

def categorical_crossentropy():
    return CategoricalCrossentropy()

class CategoricalCrossentropy:

    def __init__(self):
        pass

    def __call__(self, ys, y_preds):

        loss = 0.0
        if isinstance(ys[0], (int, float, Scalar)):
            ys, y_preds = [ys], [y_preds]

        # 1/N
        pred_length = len(ys)

        # Outer summation term - again, this could've been more concise but wanted to make the logic clear
        for y_ohe, y_pred_ohe in zip(ys, y_preds):

            # Inner summation term
            for y, y_pred in zip(y_ohe, y_pred_ohe):
                
                # if j is the actual class
                if y == 1:
                    loss += -(y_pred).log()
                    
                # if j is not the actual class
                elif y == 0:
                    loss += -(1 - y_pred).log()

        # Categorical Cross Entropy
        categorical_crossentropy_loss = loss / pred_length

        return categorical_crossentropy_loss

    def __repr__(self):
        return 'CategoricalCrossentropy()'

In [8]:
cce = CategoricalCrossentropy()

ys = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0]
]

y_preds = [
    [Scalar(0.99), Scalar(0.01), Scalar(0.01)],
    [Scalar(0.99), Scalar(0.01), Scalar(0.99)],
    [Scalar(0.01), Scalar(0.01), Scalar(0.99)],
    [Scalar(0.01), Scalar(0.99), Scalar(0.99)]
]

for y, y_pred in zip(ys, y_preds):
    cce_loss = cce(y, y_pred)
    print(f'When y={y}, y_pred={y_pred}, CCE loss is {cce_loss.data:.5}')

When y=[1, 0, 0], y_pred=[Scalar(data=0.99), Scalar(data=0.01), Scalar(data=0.01)], CCE loss is 0.030151
When y=[1, 0, 0], y_pred=[Scalar(data=0.99), Scalar(data=0.01), Scalar(data=0.99)], CCE loss is 4.6253
When y=[1, 0, 0], y_pred=[Scalar(data=0.01), Scalar(data=0.01), Scalar(data=0.99)], CCE loss is 9.2204
When y=[1, 0, 0], y_pred=[Scalar(data=0.01), Scalar(data=0.99), Scalar(data=0.99)], CCE loss is 13.816


__If we wanted to__ (which we don't) __, we could just plug BCE into our CCE :)__

In [9]:
def __call__(self, ys, y_preds):
    
    print("Using BCE") # just to prove we're using this __call__

    if isinstance(ys[0], (int, float, Scalar)):
        ys, y_preds = [ys], [y_preds]

    # 1/N
    pred_length = len(ys)

    loss = 0.0

    # Outer summation term - accumulate over samples instead of overwriting
    for y_ohe, y_pred_ohe in zip(ys, y_preds):

        # BCE divides by the number of classes C (3 here), but CCE has no 1/C term,
        # so multiply by C to undo that division
        loss += len(y_ohe) * BinaryCrossentropy()(y_ohe, y_pred_ohe)

    # Categorical Cross Entropy
    categorical_crossentropy_loss = loss / pred_length

    return categorical_crossentropy_loss

CategoricalCrossentropy.__call__ = __call__

In [10]:
# Verify we get the same result

cce = CategoricalCrossentropy()

for y, y_pred in zip(ys, y_preds):
    cce_loss = cce(y, y_pred)
    print(f'When y={y}, y_pred={y_pred}, CCE loss is {cce_loss.data:.5}')

Using BCE
When y=[1, 0, 0], y_pred=[Scalar(data=0.99), Scalar(data=0.01), Scalar(data=0.01)], CCE loss is 0.030151
Using BCE
When y=[1, 0, 0], y_pred=[Scalar(data=0.99), Scalar(data=0.01), Scalar(data=0.99)], CCE loss is 4.6253
Using BCE
When y=[1, 0, 0], y_pred=[Scalar(data=0.01), Scalar(data=0.01), Scalar(data=0.99)], CCE loss is 9.2204
Using BCE
When y=[1, 0, 0], y_pred=[Scalar(data=0.01), Scalar(data=0.99), Scalar(data=0.99)], CCE loss is 13.816


__Those are our 3 loss functions in KaiTorch!__ Now that we have proper loss functions, let's continue building functionality around our `Sequential` class.

This is where we left off last notebook:

In [11]:
from kaitorch.core import Module
from kaitorch.layers import Dense
from kaitorch.utils import unwrap

class Sequential(Module):

    def __init__(self, layers=None):
        self.built = False

        self.layers = layers if layers else []
        self.layer_sizes = [
            layer.nouts for layer in self.layers
            ] if self.layers else []
    
    def __repr__(self):
        print([layer.parameters() for layer in self.layers])
        return '\n'.join(str(layer) for layer in self.layers)
    
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return unwrap(x)
        
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
    
    def add(self, layer):
        self.layers.append(layer)
        self.layer_sizes.append(layer.nouts)
    
    def build(self, input_size):

        if self.built:
            return

        self.layer_sizes.insert(0, input_size)

        for idx, layer in enumerate(self.layers):
            layer.__build__(self.layer_sizes[idx])

        self.built = True

    def summary(self):
        print('_' * 115)
        print('Layer (params)' + ' '*59 + 'Output Shape' + ' '*5 + 'Params = Weights + Biases')
        print('=' * 115)
        for layer_num, layer in enumerate(self.layers):
            l_name = layer.__repr__()
            l_output = f'(None, {layer.nouts})'
            l_params = len(layer.parameters())
            l_w = l_params - layer.nouts if l_params > 0 else 0
            l_b = layer.nouts if l_params > 0 else 0

            print(f'{l_name:<73}{l_output:<17}{l_params:<9}{l_w:<10}{l_b:<6}')
            if layer_num != (len(self.layers) - 1):
                print('_' * 115)
        print('=' * 115)
        print(f'Total Params: {sum([len(layer.parameters()) for layer in self.layers])}')
        print('_' * 115)

    def plot(self, filename=None):

        if not self.built:
            raise Exception('[Model Not Built] - Use Sequential.build(input_size) to build model')
        empty_input = self.__call__([0]*self.layer_sizes[0])
        return plot_model(empty_input, filename=filename)

First, let's define the `Module` class that `Sequential` inherits from, which I introduced last notebook. Every layer and model will inherit from this class. It contains the `zero_grad` method, which should look familiar (step 3 of our training loop), and a `parameters` method as a reminder that this method should be implemented for every `Module`.

In [12]:
# kaitorch/core.py

class Module:

    def zero_grad(self):
        for p in self.parameters():
            p.grad = 0.0

    def parameters(self):
        return []
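To see why resetting gradients matters, here is a rough sketch with a hypothetical `Param` class standing in for a `Scalar` parameter. The `backward` function is an assumption too: it pretends the loss is `p.data ** 2` and adds the gradient to `p.grad` rather than overwriting it, mimicking how gradients accumulate.

```python
class Param:
    """Hypothetical stand-in for a Scalar parameter: just a value and a gradient."""
    def __init__(self, data):
        self.data = data
        self.grad = 0.0

def backward(p):
    # Pretend loss = p.data ** 2, so d(loss)/dp = 2 * p.data.
    # The gradient is added to p.grad, not assigned.
    p.grad += 2 * p.data

p = Param(3.0)

backward(p)
backward(p)
grad_without_reset = p.grad   # two passes stacked on top of each other

p.grad = 0.0                  # what zero_grad() does for every parameter
backward(p)
grad_with_reset = p.grad      # gradient of a single pass

print(grad_without_reset, grad_with_reset)
```

Without the reset, the second update would use a gradient twice as large as it should be.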

Revisiting our manual training loop from last notebook, let's start with the idea that we don't want to train the model every time we run it. To support that, let's introduce a new parameter `train` when we call our model. It's currently unused in `__call__`, but we'll need it in Notebook 7.

In [13]:
def __call__(self, x, train=False):  # default keeps existing calls such as plot() working
    for layer in self.layers:
        x = layer(x)
    return unwrap(x)

Sequential.__call__ = __call__

Let's incorporate our work on loss functions. We'll add a `compiled` attribute to `Sequential` and implement a `compile` method that sets the loss function of the model (as well as the optimizer, but that's next notebook).

In [14]:
# kaitorch/models.py

def __init__(self, layers=None):
    self.built = False
    self.compiled = False

    self.layers = layers if layers else []
    self.layer_sizes = [layer.nouts for layer in self.layers] if self.layers else []
    
Sequential.__init__ = __init__

In [15]:
# kaitorch/models.py

import kaitorch.losses

def compile(self, loss):

    def set_loss(loss):
        if isinstance(loss, str):
            if loss in kaitorch.losses.__all__:
                self.loss = getattr(kaitorch.losses, loss)()
            else:
                raise Exception(
                    f'[Undefined Loss Function] - Loss Function "{loss}" has not been implemented'
                )
        else:
            self.loss = loss
            
    if not self.compiled:
        if loss:
            set_loss(loss)
            self.compiled = True
        else:
            raise Exception(
                '[Unable to Compile] - Optimizer and Loss Function must be specified'
            )
            
Sequential.compile = compile
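The string lookup in `set_loss` boils down to `getattr` on a module plus a check against `__all__`. The sketch below mimics that with a `SimpleNamespace` called `fake_losses`, which is purely a hypothetical stand-in for `kaitorch.losses`, and a helper `resolve_loss` that is not part of KaiTorch.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the kaitorch.losses module
fake_losses = SimpleNamespace(
    __all__=['mse'],
    mse=lambda: 'MeanSquaredError()',   # factory returning a placeholder "loss object"
)

def resolve_loss(loss):
    """Turn a loss name into a loss object; pass anything else through untouched."""
    if isinstance(loss, str):
        if loss in fake_losses.__all__:
            # Look the factory up by name, then call it
            return getattr(fake_losses, loss)()
        raise Exception(
            f'[Undefined Loss Function] - Loss Function "{loss}" has not been implemented'
        )
    return loss

print(resolve_loss('mse'))
```

Passing a callable instead of a string skips the lookup entirely, which is how a custom loss function could be supplied.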

Next, let's write each iteration of the training loop from last notebook into a method `run_epoch`, and separate the functions that are only needed for training into an if statement.

In [16]:
from tqdm import tqdm

def run_epoch(self, x, y=None, epoch=1, epochs=1, train=False):

    postfix_type = 'Train' if train is True else ''

    # Progress bar looping through all inputs
    tqdm_x = tqdm(
        x,
        ncols=160,
        desc=f"Epoch {epoch:>3}/{epochs}", 
        postfix='',
        bar_format='{l_bar}{bar:40}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}{postfix}]'
    )

    # List to store model predictions
    y_pred = []

    # For every input
    for x_record in tqdm_x:
        
        # Replacing [model(x) for x in xs]
        y_pred.append(self.__call__(x_record, train=train))

        # If a true y value is provided, calculate the loss value and display it in the progress bar
        if y:
            run_loss = self.loss(y[:len(y_pred)], y_pred)
            tqdm_x.set_postfix_str(f"{postfix_type} Loss: {run_loss.data:.4f}")

        # Else, loss value is None, and do not display in progress bar
        else:
            run_loss = None
            tqdm_x.set_postfix_str(f"{postfix_type}")

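    # If this is a training run, backpropagate the loss and update the parameters
    if train:
        if run_loss is None:
            raise Exception('[Unable to Train] - y values must be provided when train=True')
        self.zero_grad() # using zero_grad() inherited from Module
        run_loss.backward()
        self.step() # we haven't defined this yet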

    return y_pred, run_loss

Sequential.run_epoch = run_epoch
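One detail worth spelling out: inside the loop, the loss is recomputed on `y[:len(y_pred)]`, i.e. only on the records seen so far. Here is a plain-float sketch of that running loss, where `mse_plain` and the `model_outputs` values are assumptions for illustration.

```python
def mse_plain(ys, y_preds):
    # mean of squared differences on plain floats (illustrative helper)
    return sum((y - p) ** 2 for y, p in zip(ys, y_preds)) / len(ys)

ys = [1.0, -1.0, -1.0]
model_outputs = [0.5, -0.5, 0.0]   # hypothetical predictions, one per record

y_pred = []
running = []
for out in model_outputs:
    y_pred.append(out)
    # Loss over the prefix of labels that matches the predictions made so far
    running.append(mse_plain(ys[:len(y_pred)], y_pred))

print(running)
```

The last entry is the loss over the full dataset, which is the value `backward()` is called on at the end of the epoch.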

We haven't defined the `step()` method yet, which has a dependency on which optimizer we use, but for now we'll use the same simple gradient descent algorithm we used last notebook.

In [17]:
def step(self, **optimizer_params):

    for p in self.parameters():
        p.data += 0.1 * -1 * p.grad
        
Sequential.step = step

### Updated Training Loop

In [18]:
model = Sequential()
model.add(Dense(3, activation='tanh', initializer='he_uniform'))
model.add(Dense(4, activation='tanh', initializer='he_uniform'))
model.add(Dense(4, activation='tanh', initializer='he_uniform'))
model.add(Dense(1, activation='tanh'))

model.compile(loss='mse')
model.build(3)
model.summary()

___________________________________________________________________________________________________________________
Layer (params)                                                           Output Shape     Params = Weights + Biases
Dense(units=3, activation=tanh, initializer=he_uniform)                  (None, 3)        12       9         3     
___________________________________________________________________________________________________________________
Dense(units=4, activation=tanh, initializer=he_uniform)                  (None, 4)        16       12        4     
___________________________________________________________________________________________________________________
Dense(units=4, activation=tanh, initializer=he_uniform)                  (None, 4)        20       16        4     
___________________________________________________________________________________________________________________
Dense(units=1, activation=tanh, initializer=glorot_uniform)             

In [19]:
xs = [[ 2.0,  3.0, -1.0],
      [ 3.0, -1.0,  0.5],
      [-0.5,  1.0,  1.0],
      [ 1.0,  1.0, -1.0],
      [ 2.5, -1.0, -1.0]]

ys = [1.0, -1.0, -1.0, 1.0, -1.0]

In [20]:
epochs = 50

for epoch in range(epochs):

    y_pred, run_loss = model.run_epoch(xs, ys, epoch, epochs, train=True)

Epoch   0/50: 100%|████████████████████████████████████████| 5/5 [00:00<00:00, Train Loss: 1.4682]
Epoch   1/50: 100%|████████████████████████████████████████| 5/5 [00:00<00:00, Train Loss: 1.1969]
Epoch   2/50: 100%|████████████████████████████████████████| 5/5 [00:00<00:00, Train Loss: 0.8670]
Epoch   3/50: 100%|████████████████████████████████████████| 5/5 [00:00<00:00, Train Loss: 0.8156]
Epoch   4/50: 100%|████████████████████████████████████████| 5/5 [00:00<00:00, Train Loss: 0.8044]
Epoch   5/50: 100%|████████████████████████████████████████| 5/5 [00:00<00:00, Train Loss: 0.7933]
Epoch   6/50: 100%|████████████████████████████████████████| 5/5 [00:00<00:00, Train Loss: 0.7746]
Epoch   7/50: 100%|████████████████████████████████████████| 5/5 [00:00<00:00, Train Loss: 0.7258]
Epoch   8/50: 100%|████████████████████████████████████████| 5/5 [00:00<00:00, Train Loss: 0.5894]
Epoch   9/50: 100%|████████████████████████████████████████| 5/5 [00:00<00:00, Train Loss: 0.4321]
Epoch  10/

Cleaner, right?