### RL Deep dive

Intro:

Language models have shown impressive capabilities in the past few years by generating diverse and compelling text from human input prompts. However, what makes a "good" text is inherently hard to define because it is subjective and context-dependent. Applications differ: story writing calls for creativity, informative text should be truthful, and code snippets should be executable.

Writing a loss function that captures these attributes seems intractable, so most language models are still trained with a simple next-token prediction loss (e.g. cross-entropy).

Autoregressive large language models such as GPT are trained with the cross-entropy loss. (BERT also uses cross-entropy, but to predict masked tokens rather than the next token.) During training, the model outputs a probability distribution over the vocabulary for the next token in a sequence. The cross-entropy loss measures how far this predicted distribution is from the actual next token, and minimizing it guides the model to adjust its parameters toward more accurate predictions.
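To make the "next token" part concrete, here is a minimal sketch of how predictions and targets are usually lined up for this loss. The token ids are made up, and random logits stand in for a real model's output, so only the shapes and the shift by one position are meaningful.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size = 5
# One toy sequence of token ids, shape (batch_size=1, sequence_length=4)
tokens = torch.tensor([[1, 3, 0, 4]])

# Stand-in for model output (assumption: random values instead of a real model),
# one logit vector per input position: (batch_size, sequence_length, vocab_size)
logits = torch.randn(1, 4, vocab_size)

# Position t predicts token t+1, so drop the last logit vector and the first token
pred_logits = logits[:, :-1, :]   # shape (1, 3, 5)
next_tokens = tokens[:, 1:]       # shape (1, 3)

# Flatten batch and sequence dimensions, then average the loss over all positions
loss = F.cross_entropy(
    pred_logits.reshape(-1, vocab_size),
    next_tokens.reshape(-1),
)
print(pred_logits.shape, next_tokens.shape, loss.item())
```

The last position has no following token to predict, which is why the two tensors end up one step shorter than the input sequence.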

In PyTorch, `torch.nn.CrossEntropyLoss` combines `LogSoftmax`, which converts output logits to log probabilities, with the negative log-likelihood (NLL) loss, `NLLLoss`.

Given a dataset with N observations, and a model that assigns a probability `p(y_i | x_i; \theta)` to each observation `(x_i, y_i)` based on parameters `\theta`, the NLL is defined as:

`\text{NLL}(\theta) = -\sum_{i=1}^{N} \log p(y_i | x_i; \theta)`

Here, `p(y_i | x_i; \theta)` represents the probability assigned by the model to the true outcome `y_i` given input `x_i` and parameters `\theta`. Minimizing the NLL is equivalent to maximizing the likelihood. 
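The relationship between the formula and the PyTorch losses can be checked numerically. The sketch below uses made-up logits for N=3 observations and compares a hand-computed NLL with `F.nll_loss` and `F.cross_entropy`; `reduction="sum"` is used so the result matches the summed form of the formula.

```python
import torch
import torch.nn.functional as F

# Toy logits for N=3 observations over 4 classes, with their true labels
logits = torch.tensor([[2.0, 0.5, 0.1, -1.0],
                       [0.3, 1.2, 0.0, 0.4],
                       [-0.5, 0.2, 1.5, 0.1]])
targets = torch.tensor([0, 1, 2])

# Step 1: LogSoftmax turns logits into log probabilities
log_probs = F.log_softmax(logits, dim=-1)

# Step 2: NLL = -sum_i log p(y_i | x_i), picking the log probability of each true label
manual_nll = -log_probs[torch.arange(len(targets)), targets].sum()

# The same value via the built-in losses. reduction="sum" matches the formula;
# the PyTorch default, reduction="mean", additionally divides by N.
nll = F.nll_loss(log_probs, targets, reduction="sum")
ce = F.cross_entropy(logits, targets, reduction="sum")

print(manual_nll.item(), nll.item(), ce.item())
```

All three printed values should agree up to floating-point error.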

In [4]:
import torch
import torch.nn as nn

# Sample logits tensor of shape (batch_size=2, sequence_length=3, vocab_size=5)
logits = torch.tensor([[[2.0, 1.0, 0.1, 0.5, 0.3],
                        [0.2, 1.5, 0.3, 0.7, 0.8],
                        [1.0, 0.2, 0.8, 0.5, 1.2]],

                       [[0.5, 1.0, 1.5, 0.2, 0.3],
                        [1.2, 0.7, 0.5, 1.0, 0.9],
                        [0.3, 0.8, 1.0, 1.5, 0.4]]])

# Corresponding target tensor of shape (batch_size=2, sequence_length=3)
targets = torch.tensor([[0, 3, 4],
                        [2, 1, 3]])

# Initialize the CrossEntropyLoss function (default reduction="mean")
criterion = nn.CrossEntropyLoss()

# Flatten to (batch_size * sequence_length, vocab_size) and (batch_size * sequence_length,);
# the result is a single scalar: the loss averaged over all 6 positions
loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))

print(loss.item())

To compensate for the shortcomings of the loss itself, people define metrics designed to better capture human preferences, such as BLEU or ROUGE. While better suited than the loss function for measuring performance, these metrics simply compare generated text to references using simple rules and are thus also limited. Wouldn't it be great if we could use human feedback on generated text as a measure of performance, or go one step further and use that feedback as a loss to optimize the model? That's the idea of Reinforcement Learning from Human Feedback (RLHF): use methods from reinforcement learning to directly optimize a language model with human feedback. RLHF has enabled language models trained on a general corpus of text data to begin to align with complex human values.
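The limitation of rule-based overlap metrics can be seen with a toy example. The helper below, `unigram_recall`, is a hypothetical, heavily simplified stand-in in the spirit of ROUGE-1 recall (whitespace tokenization, no stemming), not the official ROUGE implementation.

```python
from collections import Counter

def unigram_recall(candidate: str, reference: str) -> float:
    """Fraction of reference words that also appear in the candidate (clipped counts)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(count, cand_counts[word]) for word, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

reference = "the cat sat on the mat"
copy = "the cat sat on the mat"
paraphrase = "a feline rested upon a rug"
scrambled = "mat the on sat cat the"

print(unigram_recall(copy, reference))        # 1.0
print(unigram_recall(paraphrase, reference))  # 0.0
print(unigram_recall(scrambled, reference))   # 1.0
```

A reasonable paraphrase scores 0.0 while a scrambled, ungrammatical sentence scores 1.0, which illustrates why word overlap alone is a weak proxy for human judgment.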

**Reinforcement Learning from Human Feedback (also referred to as RL from human preferences) is a challenging concept because it involves a multiple-model training process and different stages of deployment.**

It can be broken down into three core steps:

1. Pretraining a language model (LM).

2. Gathering data and training a reward model.

3. Fine-tuning the LM with reinforcement learning.
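As an illustration of step 2, here is a minimal sketch of a pairwise preference loss, one common way to train a reward model from human comparisons. The text above does not specify a particular loss, so this choice is an assumption, and the reward values are made up rather than produced by a real model.

```python
import torch
import torch.nn.functional as F

# Hypothetical scalar rewards assigned by a reward model to three response pairs.
# In each pair, human annotators preferred the "chosen" response over the "rejected" one.
reward_chosen = torch.tensor([1.2, 0.3, -0.5])
reward_rejected = torch.tensor([0.4, 0.9, -1.5])

# Pairwise preference loss: -log sigmoid(r_chosen - r_rejected), averaged over pairs.
# It is small when the chosen response scores higher than the rejected one.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Fraction of pairs the reward model already ranks the way annotators did
accuracy = (reward_chosen > reward_rejected).float().mean()

print(loss.item(), accuracy.item())
```

In step 3, the scalar reward from such a model would serve as the signal that the reinforcement learning algorithm optimizes.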

References:

- https://huggingface.co/blog/rlhf