# Softmax and Loss

Let's define $z$ as the output of the last linear layer (before any activation).

The output $z$ can be converted to probability-like values $\hat{y}$ in two ways:

- through sigmoid, $\hat{y}_i = p(c_i) = \mathrm{sigmoid}(z_i) = \frac{1}{1 + \exp(-z_i)} = \frac{\exp(z_i)}{\exp(z_i)+\exp(0)}$ - the outputs are independent of each other; used for binary classifiers, and can also be used for multilabel categorisation
- through softmax, $\hat{y}_i = p(c_i) = \mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j{\exp(z_j)}}$ - all outputs sum to one; used for multiclass categorisation

where

- $c_i$ is the category assigned to the $i$-th output node
- $\hat{y}_i$ is the estimated probability of $c_i$
- $y_i$ is 1 if the actual category is $c_i$ and 0 otherwise. An alternative is to have $y$ as an integer representing the index of the actual category.
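
As a quick illustration of the difference between the two conversions, the sketch below applies both to the same pair of illustrative logits. It also checks the identity implied by the last form of the sigmoid formula: the sigmoid of $z_i$ equals the softmax of the pair $[z_i, 0]$. The variable names are chosen for this sketch only.

```python
import torch

z = torch.tensor([0.0, 5.0])  # illustrative logits for two output nodes

# Sigmoid treats each output independently, so the values need not sum to one
p_sigmoid = 1 / (1 + torch.exp(-z))

# Softmax normalises across the outputs, so the values always sum to one
p_softmax = torch.exp(z) / torch.exp(z).sum(dim=-1, keepdim=True)

# Sigmoid of a logit z_i equals the first entry of softmax([z_i, 0]),
# which matches exp(z_i) / (exp(z_i) + exp(0))
pair = torch.stack([z, torch.zeros_like(z)], dim=-1)  # shape (2, 2)
p_pair = torch.softmax(pair, dim=-1)[..., 0]

print(p_sigmoid, p_sigmoid.sum())
print(p_softmax, p_softmax.sum())
print(p_pair)
```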

In [None]:
import torch
from torch import tensor

In [None]:
c = ['male', 'female']
z = tensor([0, 5])


In [None]:
def sigmoid(x):
    return 1 / (1 + torch.exp(-x))

In [1]:
def softmax(x):
    # print('Input shape:', x.shape, 'Sum shape:', torch.exp(x).sum(dim=-1, keepdim=True).shape )
    return torch.exp(x) / torch.exp(x).sum(dim=-1, keepdim=True)

In [12]:
# Generate test data - outputs of the last linear layer
# Six items (N=6) and two output features (H=2)
z = tensor([[1, 10],
           [2, -2],
           [2, 2],
           [0, 2],
           [4.5, 5],
           [0, 0]
           ])
z, z.shape

(tensor([[ 1.0000, 10.0000],
         [ 2.0000, -2.0000],
         [ 2.0000,  2.0000],
         [ 0.0000,  2.0000],
         [ 4.5000,  5.0000],
         [ 0.0000,  0.0000]]),
 torch.Size([6, 2]))

In [13]:
softmax(z)

tensor([[1.2339e-04, 9.9988e-01],
        [9.8201e-01, 1.7986e-02],
        [5.0000e-01, 5.0000e-01],
        [1.1920e-01, 8.8080e-01],
        [3.7754e-01, 6.2246e-01],
        [5.0000e-01, 5.0000e-01]])

In [15]:
from fastcore.test import test_close

In [17]:
test_close(softmax(z), torch.softmax(z, dim=-1))

Cross-entropy loss can be defined for the binary case as follows:

$\mathbb{L} = - \sum_{i=1}^N{[ y_i \ln(\hat{y}_i) + (1 -y_i) \ln(1 -\hat{y}_i)]}$

$\mathbb{L} = - \sum_{i=1}^N{[ y_i \ln(p(c_i)) + (1 -y_i) \ln(1 -p(c_i))]}$

$\mathbb{L} = - \sum_{i=1}^N{[ y_i \ln(\mathrm{sigmoid}(z_i)) + (1 -y_i) \ln(1 -\mathrm{sigmoid}(z_i))]}$
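
The binary formula can be checked numerically. The following is a minimal sketch with made-up logits and targets; it compares the summed expression against `F.binary_cross_entropy_with_logits`, using `reduction='sum'` because the formula sums over items whereas PyTorch averages by default.

```python
import torch
import torch.nn.functional as F

z = torch.tensor([2.0, -1.0, 0.5, -3.0])  # hypothetical logits, one per item
y = torch.tensor([1.0, 0.0, 1.0, 1.0])    # hypothetical binary targets

# Direct transcription of the binary cross-entropy formula
y_hat = torch.sigmoid(z)
loss_manual = -(y * y_hat.log() + (1 - y) * (1 - y_hat).log()).sum()

# PyTorch's version works on the logits directly
loss_torch = F.binary_cross_entropy_with_logits(z, y, reduction='sum')

print(loss_manual, loss_torch)
```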

Cross-entropy loss can be defined for the multiclass case as follows:

$\mathbb{L} = - \sum_{i=1}^N \sum_{j=1}^K{y_{ij}\ln(\hat{y}_{ij}) }$

$\mathbb{L} = - \sum_{i=1}^N \sum_{j=1}^K{y_{ij} \ln(p(c_{ij})) }$

$\mathbb{L} = - \sum_{i=1}^N \sum_{j=1}^K{y_{ij} \ln(\mathrm{softmax}(z_i)_j) }$

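Because $y_{ij}$ is one-hot, only the term of the true category survives. With $t_i$ denoting the index of the true category of item $i$:

$\mathbb{L} = - \sum_{i=1}^N {\ln(\hat{y}_{i t_i}) } 
= - \sum_{i=1}^N { \ln(p(c_{i t_i})) } 
= - \sum_{i=1}^N { \ln(\mathrm{softmax}(z_i)_{t_i}) }$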
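
The double sum over one-hot targets and the single sum over the true classes should give the same number. The sketch below checks this on made-up data; `t` is a name chosen here for the integer index of the true class of each item. Both are compared against `F.cross_entropy` with `reduction='sum'`, since PyTorch averages over items by default.

```python
import torch
import torch.nn.functional as F

z = torch.tensor([[1.0, 10.0],
                  [2.0, -2.0],
                  [0.0, 2.0]])  # logits, N=3 items and K=2 classes
t = torch.tensor([1, 0, 0])     # hypothetical true class index per item
y = F.one_hot(t, num_classes=z.shape[-1]).float()  # one-hot targets

y_hat = torch.softmax(z, dim=-1)

# Double sum over items and classes with one-hot targets
loss_onehot = -(y * y_hat.log()).sum()

# Single sum: pick only the probability of the true class of each item
loss_index = -y_hat[torch.arange(len(t)), t].log().sum()

# PyTorch's version takes logits and integer class indices
loss_torch = F.cross_entropy(z, t, reduction='sum')

print(loss_onehot, loss_index, loss_torch)
```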

We can notice that:
- Only the softmax of the true class is needed, as the other outputs are multiplied by zero ($y_{ij}=0$ for every class $j$ other than the true class of item $i$)
- We need the logarithm of the softmax, so the expression contains $\ln(\exp(\cdot))$ and can be simplified
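
Spelling out the simplification from the second point, using the softmax definition given earlier:

$\ln(\mathrm{softmax}(z)_i) = \ln\frac{\exp(z_i)}{\sum_j{\exp(z_j)}} = \ln(\exp(z_i)) - \ln\sum_j{\exp(z_j)} = z_i - \ln\sum_j{\exp(z_j)}$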

In [2]:
def log_softmax(x):
    '''Logarithm of predicted probabilities calculated from the output'''
    return softmax(x).log()

In [None]:
test_close(log_softmax(z), torch.log_softmax(z, dim=-1))

In [3]:
def log_softmax2(x):
    return x - x.exp().sum(dim=-1, keepdim=True).log()

In [None]:
test_close(log_softmax(z), log_softmax2(z))

In [None]:
def logsumexp(x):
    # a = x.max(dim=-1, keepdim=True)[0]
    # return a + (x-a).exp().sum(dim=-1, keepdim=True).log()
    a = x.max(dim=-1)[0]
    return a + (x-a[...,None]).exp().sum(dim=-1).log()

def log_softmax3(x):
    # print(x.shape, logsumexp(x).shape)
    return x - logsumexp(x).unsqueeze(-1)

In [None]:
test_close(logsumexp(z), torch.logsumexp(z, dim=-1))
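
The reason for subtracting the maximum in `logsumexp` is presumably numerical stability: $\ln\sum_j{\exp(z_j)} = a + \ln\sum_j{\exp(z_j - a)}$ holds for any $a$, and choosing $a = \max_j z_j$ keeps every exponent at or below zero. The sketch below illustrates this with deliberately large, made-up logits; the function names are chosen for this example and the stable version mirrors the `logsumexp` cell above.

```python
import torch

def naive_logsumexp(x):
    # Direct form: exp() overflows to inf for large inputs
    return x.exp().sum(dim=-1).log()

def stable_logsumexp(x):
    # Subtract the row maximum first, so every exponent is <= 0
    a = x.max(dim=-1)[0]
    return a + (x - a[..., None]).exp().sum(dim=-1).log()

big = torch.tensor([[1000.0, 1001.0]])  # logits large enough to overflow float32

naive = naive_logsumexp(big)    # exp(1000) is inf in float32, so the result is inf
stable = stable_logsumexp(big)  # stays finite

print(naive, stable, torch.logsumexp(big, dim=-1))
```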