# Maximizing the likelihood of the labels of the data

Suppose we have $N$ training examples in a multi-class problem where each training example belongs to exactly one of $K$ possible classes. Let $C(i) \in \{1,\ldots, K\}$ be the correct class for the $i$-th training example and let $o^{[C(i)]}_{i}$ be the probability that a classifier assigns to the correct class for the $i$-th training example. We want this classifier to maximize: $$\prod_{i=1}^{N} o^{[C(i)]}_{i}$$ 

If the classifier assigns a probability of $1$ to the correct class for $N-1$ training examples but a probability of $0$ to the correct class of the $N$-th example, then the entire product shown above becomes zero. So to maximize this product of probabilities, the classifier has to assign a high probability to the correct class for each and every training example. 

Now, because the logarithm is strictly increasing, maximizing the product is equivalent to maximizing its logarithm: $$ln(\prod_{i=1}^{N}o^{[C(i)]}_{i}) = \sum_{i=1}^{N}ln(o^{[C(i)]}_{i})$$ 

This is the same as minimizing the sum of the negative log likelihoods $$-\sum_{i=1}^{N}ln(o^{[C(i)]}_{i})$$

The above can now serve as a loss function for an optimization routine.
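
As a small illustration of why the log form is convenient in practice, the sketch below uses made-up probabilities (they do not come from any model) to check that the negative log of the product equals the sum of the negative logs. It also shows that, for a long run of examples, the raw product underflows to zero in floating point while the sum of logs stays finite.

```python
import math

# Hypothetical probabilities assigned to the correct class of three examples.
probs = [0.9, 0.8, 0.7]

likelihood = math.prod(probs)             # product of probabilities
nll = -sum(math.log(p) for p in probs)    # sum of negative log likelihoods

# Both expressions describe the same quantity: -ln(product) == sum of -ln.
print(-math.log(likelihood), nll)

# With many examples the raw product underflows to 0.0 in double precision,
# while the sum of logs remains a finite, usable loss value.
many_probs = [0.5] * 2000
underflowed = math.prod(many_probs)
stable_nll = -sum(math.log(p) for p in many_probs)
print(underflowed, stable_nll)
```

The second half is only a numerical aside: the equivalence itself holds regardless of how many examples there are.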

Recall that Cross Entropy = $-\sum_{k=1}^{K}y^{[k]}ln(o^{[k]})$ where $y$ is the reference distribution over the $K$ classes and $o$ is our predicted distribution over the same $K$ classes. Observe that this summation collapses to a single term when the reference distribution $y$ assigns a probability of $1$ to exactly one class and $0$ to all the others.

Thus $-\sum_{i=1}^{N}ln(o^{[C(i)]}_{i})$ can be interpreted as the sum of cross entropy losses across all examples.
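
To make the collapse concrete, here is a minimal sketch with a hypothetical prediction over $K=3$ classes and a one-hot reference distribution. The helper name `cross_entropy` and the numbers are illustrative assumptions, not part of any library.

```python
import math

def cross_entropy(y, o):
    """Cross entropy between a reference distribution y and a prediction o."""
    # Terms with y_k == 0 contribute nothing to the sum, so they are skipped.
    return -sum(y_k * math.log(o_k) for y_k, o_k in zip(y, o) if y_k > 0)

o = [0.2, 0.7, 0.1]   # hypothetical predicted distribution over K = 3 classes
y = [0.0, 1.0, 0.0]   # one-hot reference: the correct class is the second one

ce = cross_entropy(y, o)
single_term = -math.log(o[1])   # negative log probability of the correct class

print(ce, single_term)
```

Both printed values agree, which is the single-term form used in the negative log likelihood above.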

In [None]:
#| include: false
!pip install git+https://github.com/fastai/fastai
!pip install git+https://github.com/fastai/fastcore

In [None]:
from fastai.vision.all import *

Pretend the following are the activations for a classification problem with $K=2$ classes. We have 6 examples, one per row, and each row holds the activation for each class the example could belong to.

In [None]:
activations = torch.randn((6,2))*2
activations

tensor([[-1.6453,  1.8893],
        [ 1.9800,  1.7681],
        [ 2.8183,  4.6643],
        [-0.3635, -0.0614],
        [ 0.4064, -0.4668],
        [-3.3801,  3.2484]])

Suppose the correct class of each example is as follows

In [None]:
targets = tensor([0,1,0,1,1,0])
targets

tensor([0, 1, 0, 1, 1, 0])

Take the softmax of the activations

In [None]:
sm_acts = torch.softmax(activations, dim=1)
sm_acts

tensor([[0.0283, 0.9717],
        [0.5528, 0.4472],
        [0.1363, 0.8637],
        [0.4250, 0.5750],
        [0.7054, 0.2946],
        [0.0013, 0.9987]])

Extract the probabilities predicted for the correct class.

In [None]:
idx = range(6)
list(idx)

[0, 1, 2, 3, 4, 5]

In [None]:
p_correct_class = sm_acts[idx, targets]
p_correct_class

tensor([0.0283, 0.4472, 0.1363, 0.5750, 0.2946, 0.0013])
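
The indexing step pairs row `i` with column `targets[i]`. As a hedged aside, the same selection can be written with `gather` along the class dimension. The sketch below re-enters the activations printed above (rounded to 4 decimals, so results match only to about 3 decimal places); the names `by_index` and `by_gather` are chosen here for illustration.

```python
import torch

# Activations and targets copied from the cells above (rounded to 4 decimals).
activations = torch.tensor([[-1.6453,  1.8893],
                            [ 1.9800,  1.7681],
                            [ 2.8183,  4.6643],
                            [-0.3635, -0.0614],
                            [ 0.4064, -0.4668],
                            [-3.3801,  3.2484]])
targets = torch.tensor([0, 1, 0, 1, 1, 0])
sm_acts = torch.softmax(activations, dim=1)

# Integer-array indexing: pick element (i, targets[i]) from every row i.
by_index = sm_acts[torch.arange(len(targets)), targets]

# Equivalent formulation: gather one column index per row, then drop the extra dim.
by_gather = sm_acts.gather(1, targets.unsqueeze(1)).squeeze(1)

print(by_index)
print(by_gather)
```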

Take the log of the softmax activations

In [None]:
torch.log(sm_acts)

tensor([[-3.5634e+00, -2.8753e-02],
        [-5.9281e-01, -8.0469e-01],
        [-1.9925e+00, -1.4659e-01],
        [-8.5559e-01, -5.5344e-01],
        [-3.4895e-01, -1.2222e+00],
        [-6.6298e+00, -1.3213e-03]])

Computing the softmax of the activations and then taking the log is mathematically equivalent to applying PyTorch's `log_softmax` function directly to the original activations. We prefer the latter because it is faster and more numerically stable.

In [None]:
torch.log_softmax(activations, dim=1)

tensor([[-3.5634e+00, -2.8753e-02],
        [-5.9281e-01, -8.0469e-01],
        [-1.9925e+00, -1.4659e-01],
        [-8.5559e-01, -5.5344e-01],
        [-3.4895e-01, -1.2222e+00],
        [-6.6298e+00, -1.3213e-03]])
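
The two routes agree on the activations above, but they can diverge for extreme values. The following sketch uses a deliberately exaggerated, hypothetical pair of activations to show the two-step route producing `-inf` while `log_softmax` stays finite. The `manual` line spells out the identity $\log o^{[j]} = z^{[j]} - \log\sum_l e^{z^{[l]}}$ using `torch.logsumexp`.

```python
import torch

# Hypothetical extreme activations, chosen so that one softmax output underflows to 0.
extreme = torch.tensor([[1000.0, 0.0]])

# Two-step route: the second probability underflows to 0, so its log is -inf.
two_step = torch.log(torch.softmax(extreme, dim=1))

# Fused route: works in log space and never forms the tiny probability.
fused = torch.log_softmax(extreme, dim=1)

# Same identity written by hand: log_softmax(z) = z - logsumexp(z).
manual = extreme - torch.logsumexp(extreme, dim=1, keepdim=True)

print(two_step)
print(fused)
print(manual)
```

An infinite log probability would turn into an infinite loss, which is the practical reason to stay in log space.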

Let's compute the mean of cross entropy losses across the training examples:

In [None]:
-1*torch.log(p_correct_class), (-1*torch.log(p_correct_class)).mean()

(tensor([3.5634, 0.8047, 1.9925, 0.5534, 1.2222, 6.6298]), tensor(2.4610))

We can also have PyTorch compute this directly from the raw activations, since `nn.CrossEntropyLoss` applies `log_softmax` internally:

In [None]:
nn.CrossEntropyLoss(reduction='none')(activations, targets), nn.CrossEntropyLoss()(activations, targets)

(tensor([3.5634, 0.8047, 1.9925, 0.5534, 1.2222, 6.6298]), tensor(2.4610))

or by using:

In [None]:
F.cross_entropy(activations, targets, reduction='none'), F.cross_entropy(activations, targets)

(tensor([3.5634, 0.8047, 1.9925, 0.5534, 1.2222, 6.6298]), tensor(2.4610))
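
To tie the pieces together, here is a small sketch showing that `F.cross_entropy` on raw activations matches `F.nll_loss` applied to the `log_softmax` output. The activations are re-entered from the printed values above, so the result should match the mean loss shown earlier only up to rounding.

```python
import torch
import torch.nn.functional as F

# Activations and targets copied from the cells above (rounded to 4 decimals).
activations = torch.tensor([[-1.6453,  1.8893],
                            [ 1.9800,  1.7681],
                            [ 2.8183,  4.6643],
                            [-0.3635, -0.0614],
                            [ 0.4064, -0.4668],
                            [-3.3801,  3.2484]])
targets = torch.tensor([0, 1, 0, 1, 1, 0])

# Step 1: log probabilities. Step 2: negative log likelihood of the correct class.
log_probs = torch.log_softmax(activations, dim=1)
two_step_loss = F.nll_loss(log_probs, targets)

# Single call that performs both steps on the raw activations.
direct_loss = F.cross_entropy(activations, targets)

print(two_step_loss, direct_loss)
```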

# Gradient of Cross Entropy

We follow the exposition in @markusthill_ce_note.

Let $z^{[1]},\ldots, z^{[K]}$ denote the activations corresponding to the $K$ classes. The softmax activation for each class is given by:

$$o^{[j]} = \frac{e^{z^{[j]}}}{\sum_{l=1}^{K} e^{z^{[l]}}}$$

The cross-entropy loss across the $K$ classes is given by: 

$$E=-\sum_{l=1}^{K}y^{[l]}ln(o^{[l]})$$

## Partial derivative of $o^{[j]}$ with respect to $z^{[i]}$ for $j \neq i$

$$
\frac{\partial}{\partial z^{[i]}} o^{[j]}  = \frac{\partial}{\partial z^{[i]}} \frac{e^{z^{[j]}}}{\sum_l e^{z^{[l]}}}
= e^{z^{[j]}} \frac{\partial}{\partial z^{[i]}} \Bigg(\sum_l e^{z^{[l]}} \Bigg)^{-1} \\
\qquad = -e^{z^{[j]}} \Bigg(\sum_l e^{z^{[l]}} \Bigg)^{-2} e^{z^{[i]}}
= -o^{[j]} \cdot o^{[i]}
$$

## Partial derivative of $o^{[i]}$ with respect to $z^{[i]}$

$$
\frac{\partial}{\partial z^{[i]}} o^{[i]} 
= \frac{\partial}{\partial z^{[i]}} \frac{e^{z^{[i]}}}{\sum_l e^{z^{[l]}}} 
= \frac{e^{z^{[i]}}}{\sum_{l} e^{z^{[l]}}} + e^{z^{[i]}} \frac{\partial}{\partial z^{[i]}} \Bigg(\sum_l e^{z^{[l]}} \Bigg)^{-1}\\
\quad \qquad \qquad \qquad = o^{[i]}-e^{z^{[i]}} \Bigg(\sum_l e^{z^{[l]}} \Bigg)^{-2} e^{z^{[i]}}
= o^{[i]} - o^{[i]} \cdot o^{[i]} 
= o^{[i]} \cdot (1 - o^{[i]})
$$
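
The two partial derivatives can be checked numerically. The sketch below, with hypothetical activations for $K=3$ classes, compares the Jacobian computed by autograd with the closed form: $o^{[i]}(1-o^{[i]})$ on the diagonal and $-o^{[j]} o^{[i]}$ off the diagonal, which together equal $\mathrm{diag}(o) - o\,o^{T}$.

```python
import torch

z = torch.tensor([0.5, -1.0, 2.0])   # hypothetical activations for K = 3 classes
o = torch.softmax(z, dim=0)

# Autograd Jacobian: entry [j, i] is the partial derivative of o[j] w.r.t. z[i].
jac = torch.autograd.functional.jacobian(lambda t: torch.softmax(t, dim=0), z)

# Closed form from the derivations: diag(o) - outer(o, o).
closed_form = torch.diag(o) - torch.outer(o, o)

print(jac)
print(closed_form)
```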

Let's compute the gradient of the cross-entropy loss with respect to the activation of the $i$-th class:

In [None]:
#| include: false
# The following renders fine as a markdown cell in Jupyter but not when said notebook is rendered by fastpages
# So I used https://latexeditor.lagrida.com to render the equation take a screenshot and then put it in the page :(
'''
\begin{eqnarray*}
\frac{\partial E}{\partial z^{[i]}} &=&- \sum_{l=1}^{K} y^{[l]}\cdot  \frac{\partial}{\partial z^{[i]}}ln(o^{[l]})
=- \sum_{j \neq i} y^{[j]}\cdot  \frac{\partial}{\partial z^{[i]}}ln(o^{[j]}) - y^{[i]}\cdot  \frac{\partial}{\partial z^{[i]}}ln(o^{[i]})\\
&=&- \sum_{j \neq i} y^{[j]}\cdot \frac{1}{o^{[j]}} \cdot \frac{\partial}{\partial z^{[i]}}o^{[j]} - y^{[i]}\cdot \frac{1}{o^{[i]}} \cdot \frac{\partial}{\partial z^{[i]}}o^{[i]} \\
&=& - \sum_{j \neq i} y^{[j]}\cdot \frac{1}{o^{[j]}} \cdot (-o^{[j]} \cdot o^{[i]}) - y^{[i]}\cdot \frac{1}{o^{[i]}} \cdot o^{[i]} \cdot (1 - o^{[i]}) \\
&=& - \sum_{j \neq i} y^{[j]} (-o^{[i]}) - y^{[i]} (1 - o^{[i]}) \\
&=& - \sum_{j \neq i} y^{[j]} (-o^{[i]}) + y^{[i]} o^{[i]} - y^{[i]} \\
&=& o^{[i]} \sum_{j} y^{[j]}  - y^{[i]} \\
&=& o^{[i]} - y^{[i]}
\end{eqnarray*}
'''

<div>
<img src="https://github.com/nasheqlbrm/blog/blob/main/images/ce_derivative.png?raw=1" width="500"/>
</div>

Per the _Sylvain says_ section (page 203 Chapter 5) of @fastbook2020, " _The gradient is proportional to the difference between the prediction and the target._... _Because the gradient is linear we won't see sudden jumps or exponential increases in gradients, which should lead to smoother training of models._"
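
The result $\frac{\partial E}{\partial z^{[i]}} = o^{[i]} - y^{[i]}$ can also be confirmed with autograd. This is a minimal sketch for a single example with hypothetical activations and a hypothetical target class; `expected_grad` is a name introduced here for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical activations for one example with K = 3 classes.
z = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)
target = torch.tensor(2)   # hypothetical correct class

# cross_entropy expects a batch dimension, so add one for the single example.
loss = F.cross_entropy(z.unsqueeze(0), target.unsqueeze(0))
loss.backward()

# Closed-form gradient: softmax output minus the one-hot target.
o = torch.softmax(z.detach(), dim=0)
y = F.one_hot(target, num_classes=3).float()
expected_grad = o - y

print(z.grad)
print(expected_grad)
```

Each gradient entry is bounded between $-1$ and $1$, which is consistent with the remark about the absence of sudden jumps.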