## Loss

This notebook demonstrates cross entropy loss, the loss function commonly used in language modeling.

* Data: For next-word prediction, we calculate the loss `context_size` * `batch_size` times.
* Log-softmax: It is simply `log(softmax(x))`.
* Negative log likelihood loss is the negative log likelihood of the correct class.
* Cross entropy loss is a convenient combination of the previous two steps: negative log likelihood loss applied to the log-softmax output.

### For next-word prediction, we calculate the loss `context_size` * `batch_size` times.

The training goal is to predict the next token. In data, it looks like this:

In [1]:
import torch
import torch.nn.functional as F

torch.set_printoptions(precision=4, sci_mode=False)
torch.manual_seed(538)

batch_size = 3
context_size = 4

data_batch = torch.randint(high=10, size=(batch_size, context_size + 1), dtype=torch.float32)
x_batch = data_batch[:, :context_size]
y_batch = data_batch[:, 1:context_size+1]

print(f"Input tokens:\n{data_batch}")

print("-" * 50)
print(f"| {'Seq': <6} | {'Context':<25} | {'Target'} |")
print("-" * 50)
for b in range(batch_size):
    for t in range(context_size):
        context = x_batch[b, : t + 1]
        target = y_batch[b, t]
        print(f"| {b: <6} | {str(context.tolist()):<25} | {target:<6} |")

Input tokens:
tensor([[7., 9., 4., 7., 4.],
        [9., 3., 0., 4., 9.],
        [0., 0., 7., 0., 5.]])
--------------------------------------------------
| Seq    | Context                   | Target |
--------------------------------------------------
| 0      | [7.0]                     | 9.0    |
| 0      | [7.0, 9.0]                | 4.0    |
| 0      | [7.0, 9.0, 4.0]           | 7.0    |
| 0      | [7.0, 9.0, 4.0, 7.0]      | 4.0    |
| 1      | [9.0]                     | 3.0    |
| 1      | [9.0, 3.0]                | 0.0    |
| 1      | [9.0, 3.0, 0.0]           | 4.0    |
| 1      | [9.0, 3.0, 0.0, 4.0]      | 9.0    |
| 2      | [0.0]                     | 0.0    |
| 2      | [0.0, 0.0]                | 7.0    |
| 2      | [0.0, 0.0, 7.0]           | 0.0    |
| 2      | [0.0, 0.0, 7.0, 0.0]      | 5.0    |


What this means: each sequence yields `context_size` training examples (one per position), so a batch contributes `batch_size` * `context_size` terms to the loss.
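As a quick check of that count, here is a minimal sketch that builds the inputs and targets by shifting the same batch by one position. It assumes integer token ids (the default dtype of `torch.randint`) and an arbitrary seed, so the values differ from the table above.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, chosen only for this sketch
batch_size, context_size = 3, 4

# Integer token ids, one extra column so every position has a "next token"
data_batch = torch.randint(high=10, size=(batch_size, context_size + 1))

x_batch = data_batch[:, :-1]  # inputs: every token except the last
y_batch = data_batch[:, 1:]   # targets: the same tokens shifted left by one

# Each (sequence, position) pair is one prediction, hence one loss term
n_loss_terms = y_batch.numel()
print(n_loss_terms == batch_size * context_size)
```

The target at position `t` is the input at position `t + 1`, which is why a single sequence of `context_size + 1` tokens supplies `context_size` predictions.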

### `F.log_softmax` is equivalent to `torch.log(F.softmax)`

In [2]:
batch_size = 3
context_size = 4
n_vocab = 8

logits = torch.randn(size=(batch_size * context_size, n_vocab), dtype=torch.float32, requires_grad=True)

output = torch.log(F.softmax(logits, dim=-1))
output2 = F.log_softmax(logits, dim=-1)
print(f"Outputs are the same: {torch.allclose(output, output2)}")

Outputs are the same: True


In practice, prefer `F.log_softmax`: it is more numerically stable than taking the log of a softmax output, where very small probabilities can underflow to zero.
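The difference only shows up with extreme values. The sketch below uses deliberately exaggerated logits (an assumption made for illustration, not something the random logits above would produce) to show where the two approaches diverge.

```python
import torch
import torch.nn.functional as F

# Exaggerated logits: the second class is astronomically unlikely
logits = torch.tensor([[1000.0, 0.0]])

# softmax rounds the tiny probability down to exactly 0, so log returns -inf
naive = torch.log(F.softmax(logits, dim=-1))

# log_softmax works in log space throughout and stays finite
stable = F.log_softmax(logits, dim=-1)

print(f"log(softmax): {naive}")
print(f"log_softmax:  {stable}")
```

An infinite log probability turns into an infinite loss as soon as that class is the target, which is the failure `F.log_softmax` avoids.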

### Negative log likelihood loss is the negative log probability of the correct class

* First, you take the neural network outputs (i.e., logits) and compute their probabilities (i.e., Softmax).
* Second, you take the log of these probabilities; these are the log likelihood values.
* Third, you retrieve the log likelihood value for the correct class and multiply by negative one.
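Written as a formula, the three steps give the loss for a single example. The symbols are introduced here for illustration: $z_i$ is the logit vector for example $i$, $y_i$ is the index of its correct class, and $N$ is the number of examples. The second line is the mean over examples, which is the default reduction of `F.nll_loss`.

$$
\ell_i = -\log \frac{\exp(z_{i, y_i})}{\sum_{c} \exp(z_{i, c})}
$$

$$
L = \frac{1}{N} \sum_{i=1}^{N} \ell_i
$$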

In [3]:

batch_size = 3
context_size = 4
n_vocab = 8

# This is the form of the data during training (i.e., from nn output and the dataloader)
logits = torch.randn(size=(batch_size, context_size, n_vocab), dtype=torch.float32, requires_grad=True)
targets = torch.randint(high=n_vocab, size=(batch_size, context_size), dtype=torch.long)

# This is the shape we reshape it to for computing the loss
# logits are unnormalized real-valued scores with shape (N = number of observations, C = number of classes)
# targets are integers 0...C-1 with shape (N = number of observations,)
logits = logits.view(batch_size * context_size, n_vocab)
targets = targets.view(batch_size * context_size)

output = F.nll_loss(F.log_softmax(logits, dim= -1), targets)
print(output)

n_examples = logits.shape[0]
log_probs = F.log_softmax(logits, dim= -1)
nll_losses = torch.zeros(n_examples)
for i in range(n_examples):
    nll_losses[i] = -log_probs[i, targets[i]]

output2 = nll_losses.mean()
print(output2)

tensor(2.3182, grad_fn=<NllLossBackward0>)
tensor(2.3182, grad_fn=<MeanBackward0>)
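The explicit loop is there for readability. A vectorized sketch of the same lookup, using either integer indexing or `gather`, might look like this. The seed and the flattened shapes are assumptions for this example, so the loss value differs from the cell above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, chosen only for this sketch
n_examples, n_vocab = 12, 8
logits = torch.randn(n_examples, n_vocab)
targets = torch.randint(high=n_vocab, size=(n_examples,))

log_probs = F.log_softmax(logits, dim=-1)

# Pick log_probs[i, targets[i]] for every row i at once
picked = log_probs[torch.arange(n_examples), targets]
loss_indexed = -picked.mean()

# The same selection with gather, which expects an index of shape (N, 1)
loss_gather = -log_probs.gather(dim=1, index=targets.unsqueeze(1)).squeeze(1).mean()

print(torch.allclose(loss_indexed, F.nll_loss(log_probs, targets)))
print(torch.allclose(loss_gather, loss_indexed))
```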


### Cross Entropy Loss combines three functions (Negative Log Likelihood Loss, Log, and Softmax) into one

In [4]:
batch_size = 3
context_size = 4
n_vocab = 8

# This is the form of the data during training (i.e., from nn output and the dataloader)
logits = torch.randn(size=(batch_size, context_size, n_vocab), dtype=torch.float32, requires_grad=True)
targets = torch.randint(high=n_vocab, size=(batch_size, context_size), dtype=torch.long)

# This is the shape we reshape it to for computing the loss
# logits are unnormalized real-valued scores with shape (N = number of observations, C = number of classes)
# targets are integers 0...C-1 with shape (N = number of observations,)
logits = logits.view(batch_size * context_size, n_vocab)
targets = targets.view(batch_size * context_size)

output = F.cross_entropy(logits, targets)
print(output)
output2 = F.nll_loss(F.log_softmax(logits, dim= -1), targets)
print(output2)

tensor(2.8195, grad_fn=<NllLossBackward0>)
tensor(2.8195, grad_fn=<NllLossBackward0>)
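Because the log of a softmax simplifies to the logit minus the log-sum-exp of all logits, the combined loss can also be written directly in terms of the logits. The following is a sketch of that identity under an arbitrary seed, not a description of how PyTorch implements `F.cross_entropy` internally.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, chosen only for this sketch
n_examples, n_vocab = 12, 8
logits = torch.randn(n_examples, n_vocab)
targets = torch.randint(high=n_vocab, size=(n_examples,))

# -log(softmax(z)[y]) == logsumexp(z) - z[y]
correct_logits = logits[torch.arange(n_examples), targets]
manual = (torch.logsumexp(logits, dim=-1) - correct_logits).mean()

print(torch.allclose(manual, F.cross_entropy(logits, targets), atol=1e-6))
```

In this form the loss shrinks as the correct class's logit grows relative to the others, and no probabilities are ever materialized.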


### References

* https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/
* https://cs231n.github.io/neural-networks-case-study/#grad