# Loss and Activation Functions


In [1]:
import torch
import torch.nn as nn

## Loss Functions

It's finally time to start talking about loss and activation functions (the final pieces of info we need to start building actual models).

For neural networks, there are 3 basic loss functions which are used based on the task at hand:
1. **Regression:** uses Mean Squared Error (MSE) as loss.
2. **Binary Classification:** uses Binary Cross-Entropy (BCE) as loss.
3. **Multi-Class Classification:** uses Cross-Entropy (CE) as loss.

Each of these loss functions is available as a class in the ```torch.nn``` module.

In [2]:
mse = nn.MSELoss()

bce = nn.BCELoss()

ce = nn.CrossEntropyLoss()
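As a quick sanity check, here is a minimal sketch of calling the regression and binary losses on hand-made tensors. The values are made up for illustration and everything runs on the CPU. The one thing worth noticing is that ```nn.BCELoss()``` expects probabilities between 0 and 1, so raw model outputs go through a sigmoid first (or we can use ```nn.BCEWithLogitsLoss()```, which does the sigmoid internally).

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()
bce = nn.BCELoss()

# Regression: predictions and targets are real-valued tensors of the same shape
y_pred_reg = torch.tensor([2.5, 0.0, 2.0])   # illustrative values
y_reg = torch.tensor([3.0, -0.5, 2.0])
mse_value = mse(y_pred_reg, y_reg)           # mean of the squared differences

# Binary classification: nn.BCELoss expects probabilities in [0, 1],
# so raw outputs (logits) are passed through a sigmoid first
logits = torch.tensor([2.0, -1.0, 0.5])      # illustrative values
y_bin = torch.tensor([1.0, 0.0, 1.0])        # targets must be floats
bce_value = bce(torch.sigmoid(logits), y_bin)

# nn.BCEWithLogitsLoss folds the sigmoid into the loss itself
bce_logits_value = nn.BCEWithLogitsLoss()(logits, y_bin)

print(mse_value.item())
print(bce_value.item(), bce_logits_value.item())
```

The two BCE values should agree up to floating-point error.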

<br>

We can also define custom loss functions if we need them to train a specific kind of network for a specific task. Usually, this comes up with Generative Deep Learning models like GANs and VAEs.

A loss function is just a function (duh!) but the key thing to keep in mind is that we eventually have to differentiate this function using PyTorch's autograd engine, so we need to make sure the loss is built entirely from differentiable operations on Torch tensors and that it reduces to a single scalar value.


In [4]:
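def mean_quartic_error(y_pred_batch, y_batch):
    # sum the fourth power of the errors, then average over the batch,
    # so that the loss is a single scalar we can call .backward() on
    return torch.sum((y_pred_batch - y_batch)**4) / y_batch.shape[0]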
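To check that a custom loss plays nicely with autograd, we can call it on a tensor that has ```requires_grad=True``` and then call ```.backward()```. The sketch below uses a scalar-valued version of the quartic loss (errors summed, then divided by the batch size) and made-up numbers.

```python
import torch

def mean_quartic_error(y_pred_batch, y_batch):
    # fourth power of the error, summed and averaged over the batch (a scalar)
    return torch.sum((y_pred_batch - y_batch) ** 4) / y_batch.shape[0]

# illustrative predictions that we want gradients for
y_pred = torch.tensor([1.0, 2.0, 4.0], requires_grad=True)
y_true = torch.tensor([1.0, 1.0, 2.0])

loss_value = mean_quartic_error(y_pred, y_true)
loss_value.backward()   # works because the loss is a scalar

print(loss_value.item())  # (0^4 + 1^4 + 2^4) / 3
print(y_pred.grad)        # 4 * error^3 / 3 for each element
```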

<br>

Let's test out PyTorch's built-in Cross Entropy Loss. The thing to note is that ```nn.CrossEntropyLoss()```:
1. applies a (log-)softmax to its input internally, so it expects raw logits and there is no need to call a softmax on our own.
2. takes the target as a class index, so we **DO NOT** one-hot encode the y labels.

In [7]:
loss = nn.CrossEntropyLoss()

# create an output whose class is "0"
y = torch.tensor([0], device='cuda')

# create an example of a "good" prediction that predicts class "0" with high confidence
y_pred_good = torch.tensor([[4.0, 0.2, 0.8]], device='cuda')

# create an example of a "bad" prediction that does not predict "0"
y_pred_bad = torch.tensor([[0.1, 3.6, 2.4]], device='cuda')

# compute the loss for the good and bad predictions.
# note that the loss function returns a tensor
loss_good = loss(y_pred_good, y)
loss_bad = loss(y_pred_bad, y)

# extract the scalar value from the returned tensor
print(loss_good.item())
print(loss_bad.item())


0.061220210045576096
3.786224603652954
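To see where the first number comes from, here is a small sketch that rebuilds the cross entropy by hand: a log-softmax over the logits followed by the negative log-likelihood of the true class. It uses the functional API (```torch.nn.functional```) and CPU tensors, which are choices made for this sketch only.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[4.0, 0.2, 0.8]])   # same "good" prediction, on the CPU
target = torch.tensor([0])                 # true class index

# step 1: log-softmax turns the logits into log-probabilities
log_probs = F.log_softmax(logits, dim=1)

# step 2: negative log-likelihood picks out -log p(true class)
manual_loss = F.nll_loss(log_probs, target)

# the built-in cross entropy does both steps at once
builtin_loss = F.cross_entropy(logits, target)

print(manual_loss.item(), builtin_loss.item())
```

Both values should match the loss printed for the good prediction.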


In [9]:
# we can also turn the logits (unnormalized scores, not probabilities) into a
# predicted class by taking the index of the largest value along the class dimension
_, predictions_good = torch.max(y_pred_good, dim = 1)
_, predictions_bad = torch.max(y_pred_bad, dim=1)

print(predictions_good.item())
print(predictions_bad.item())

0
1


- The good prediction predicts class "0".
- The bad prediction predicts class "1".
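If we want actual probabilities rather than just the winning class, we can apply a softmax to the logits ourselves. A minimal sketch on the CPU: softmax keeps the ordering of the logits, so the most probable class is the same one ```torch.max``` picks.

```python
import torch

y_pred_good = torch.tensor([[4.0, 0.2, 0.8]])   # logits from the example above

# softmax normalizes the logits into probabilities that sum to 1
probs = torch.softmax(y_pred_good, dim=1)

print(probs)
print(probs.sum(dim=1))

# the most probable class matches the largest logit
print(probs.argmax(dim=1).item(), y_pred_good.argmax(dim=1).item())
```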

Notice that the y label is not encoded as a one-hot vector; it is just the integer index of the correct class. Let's see how this works by looking at another example:

In [10]:
# create an output whose class is "7" out of possible classes 0-9
y = torch.tensor([7], device='cuda')

# create an example of a "good" prediction that predicts class "7" with high confidence
y_pred_good = torch.tensor([[0.1, 0.01, 0.01, 0.8, 0.2, 0.05, 0.4, 6, 0.2,0.1]], device='cuda')

# create an example of a "bad" prediction that does not predict "7"
y_pred_bad = torch.tensor([[0.4, 9.8, 0.01, 0.8, 1.2, 0.05, 0.4, 0.3, 0.2,0.1]], device='cuda')

# compute the loss for the good and bad predictions.
# note that the loss function returns a tensor
loss_good = loss(y_pred_good, y)
loss_bad = loss(y_pred_bad, y)

# extract the scalar value from the returned tensor
print(loss_good.item())
print(loss_bad.item())


0.02796681597828865
9.5007905960083


In [11]:
_, predictions_good = torch.max(y_pred_good, dim = 1)
_, predictions_bad = torch.max(y_pred_bad, dim=1)

print(predictions_good.item())
print(predictions_bad.item())

7
1


Basically: in Keras/TensorFlow, the ```CategoricalCrossentropy()``` loss expects one-hot encoded targets (its ```SparseCategoricalCrossentropy()``` counterpart takes integer labels instead). In PyTorch, ```CrossEntropyLoss()``` takes class-index labels, like the sparse variant.
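The two labelling schemes describe the same quantity. As a rough illustration (not how either library implements it), the sketch below builds the one-hot version by hand with ```F.one_hot``` and checks that it gives the same loss as the class-index version.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[0.1, 3.6, 2.4]])   # the "bad" prediction from earlier
class_index = torch.tensor([0])            # PyTorch-style label

# class-index labelling: what CrossEntropyLoss uses
index_loss = F.cross_entropy(logits, class_index)

# one-hot labelling, computed by hand: -sum(one_hot * log_softmax)
one_hot = F.one_hot(class_index, num_classes=3).float()
one_hot_loss = -(one_hot * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

print(one_hot)
print(index_loss.item(), one_hot_loss.item())
```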

---

## Activation Functions

Now it's time for the final piece of the puzzle: activation functions. Activation functions add non-linearity to the network, allowing it to approximate more complicated shapes. In fact, the beauty of neural networks comes from the fact that mixing a simple non-linear activation function into the otherwise linear operations is enough to approximate any continuous function on a closed, bounded domain as closely as we like, given enough hidden units (Universal Approximation Theorem).
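To see why the non-linearity matters, here is a small sketch showing that two linear layers stacked without an activation collapse into a single linear layer. The layer sizes and the random seed are arbitrary choices for the illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)            # arbitrary seed, just for repeatability
layer1 = nn.Linear(3, 4)        # arbitrary sizes
layer2 = nn.Linear(4, 2)
x = torch.randn(5, 3)           # batch of 5 inputs

with torch.no_grad():
    # two linear layers with nothing in between
    stacked = layer2(layer1(x))

    # the same map written as ONE linear layer:
    # W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
    W = layer2.weight @ layer1.weight
    b = layer2.weight @ layer1.bias + layer2.bias
    collapsed = x @ W.T + b

    # with a ReLU in between, the network is no longer a single linear map
    with_relu = layer2(torch.relu(layer1(x)))

print(torch.allclose(stacked, collapsed, atol=1e-5))
print(with_relu.shape)
```

So without activations, adding more layers buys us nothing beyond what one linear layer can already do.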

The most commonly used activation functions are:
1. ```Sigmoid()``` function. Squashes any real input into the range (0, 1). This was the forerunner to all activation functions, used for logistic regression which eventually evolved into neural networks.
2. ```Softmax()``` function (soft max). The softmax function takes a vector of values and normalizes them into probabilities that sum to 1. Mainly used for multi-class problems, although PyTorch's cross entropy loss already does this for us.
3. ```Tanh()``` function (hyperbolic tangent). Similar in shape to the sigmoid but with outputs in (-1, 1); nowadays mainly used with GANs and other generative models.
4. ```ReLU()``` function (Rectified Linear Unit). Created to address the vanishing gradient problem of the Sigmoid and Tanh functions, whose gradients shrink toward 0 for large $|x|$. However, since the ReLU function is 0 for all $x \le 0$, this causes an issue with "dying neurons": a neuron whose input is negative gets zero gradient, so its weights stop updating, and if that happens for every training example it stays dead forever.
5. ```LeakyReLU()``` function ("Leaky" ReLU). Created to address the dying neuron problem of ReLU. The idea is to take the ReLU function and give the negative branch a small non-zero slope (```negative_slope```, 0.01 by default in PyTorch) so that it does not stay 0 forever, allowing the gradient to "leak through" to the dead neuron, bringing it back to life.
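The "dying neuron" issue is easy to see with autograd. In this sketch (inputs are made up) we backpropagate through ReLU and LeakyReLU and look at the gradient that reaches a negative input.

```python
import torch
import torch.nn as nn

relu = nn.ReLU()
leaky = nn.LeakyReLU()   # default negative_slope is 0.01

# one negative and one positive input, tracked by autograd
x_relu = torch.tensor([-2.0, 3.0], requires_grad=True)
relu(x_relu).sum().backward()

x_leaky = torch.tensor([-2.0, 3.0], requires_grad=True)
leaky(x_leaky).sum().backward()

print(x_relu.grad)    # no gradient flows through the negative input
print(x_leaky.grad)   # a small gradient "leaks" through the negative input
```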

In [12]:
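# instantiate sigmoid
sigmoid = nn.Sigmoid()

# softmax over the class dimension (dim=1 for inputs shaped (batch, classes))
softmax = nn.Softmax(dim=1)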

# hyperbolic tangent
tanh = nn.Tanh()

# ReLU
relu = nn.ReLU()

# Leaky ReLU
leaky = nn.LeakyReLU()
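Each of these objects is callable, just like the loss functions. Here is a quick sketch applying them to the same made-up input so the different output ranges are easy to compare. The activations are instantiated again so the block runs on its own, and ```nn.Softmax(dim=1)``` is included for completeness.

```python
import torch
import torch.nn as nn

sigmoid, tanh, relu, leaky = nn.Sigmoid(), nn.Tanh(), nn.ReLU(), nn.LeakyReLU()
softmax = nn.Softmax(dim=1)   # normalize across the class dimension

x = torch.tensor([[-2.0, 0.0, 2.0]])   # illustrative input, shape (1, 3)

out_sigmoid = sigmoid(x)   # values in (0, 1)
out_tanh = tanh(x)         # values in (-1, 1)
out_relu = relu(x)         # negatives clipped to 0
out_leaky = leaky(x)       # negatives scaled by 0.01
out_softmax = softmax(x)   # each row sums to 1

for name, out in [("sigmoid", out_sigmoid), ("tanh", out_tanh),
                  ("relu", out_relu), ("leaky", out_leaky),
                  ("softmax", out_softmax)]:
    print(name, out)
```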