# Reinforcement Learning and Supervised Fine-tuning for LLMs

This notebook gives a high-level, educational overview of the current state of reinforcement learning and supervised fine-tuning for LLMs.

## Overview
- Pretrained LLMs learn broad world knowledge via next-token prediction, but they often fail on complex reasoning tasks when they must answer in a single attempt ("one-shot").
- For hard problems, one-shot correctness is often infeasible and does not reflect how humans solve them. Models need guidance to sustain multi-step reasoning, explore alternative paths, and use tools.
- Reinforcement learning (RL) and supervised fine-tuning (SFT) are methods for aligning LLMs with desired behaviors and improving their ability to solve tasks.
- While these methods do not inherently give the model new knowledge or information, the current state of the art (SOTA) on math and coding benchmarks is dominated by models that were fine-tuned to perform in these domains.
- Unlike prompt scaffolding such as chain-of-thought, which is an engineering technique applied to an already-trained model at inference time, reinforcement learning and supervised fine-tuning modify the model itself: the weights change, but the architecture stays the same.
- After pretraining, the model already contains a lot of latent knowledge, but it does not reliably use that knowledge in helpful ways.

## High-level Methodology

The process usually begins with a large pretrained language model that was trained on an Internet-scale corpus with the simple objective of predicting the next token. Supervised fine-tuning (SFT) is then applied to adapt it more closely to a particular domain. In this step, we gather a dataset of high-quality demonstrations: examples of the solutions or reasoning chains we want the model to produce. Training continues with the same next-token prediction objective, now computed on the demonstration data:

$$
\mathcal{L}_{\text{SFT}}(\theta) = -\sum_t \log P_\theta(y_t \mid y_{<t}, x)
$$

where $\theta$ are the model parameters, $x$ is the input, $y_t$ is the $t$-th demonstration token, and $y_{<t}$ are the demonstration tokens before it. This process does not add new factual knowledge, but it shifts the model's probability distribution toward the behaviors shown in the demonstrations.
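To make the SFT loss concrete, here is a minimal sketch in plain Python. It assumes we already have the model's logits at each position of one demonstration; the function names `log_softmax` and `sft_loss`, the toy vocabulary of four tokens, and the logit values are all hypothetical and chosen only for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sft_loss(step_logits, target_ids):
    """Sum over t of -log P(y_t | y_<t, x) for a single demonstration.

    step_logits[t] holds the (hypothetical) model logits at position t,
    already conditioned on x and y_<t; target_ids[t] is the index of y_t.
    """
    assert len(step_logits) == len(target_ids)
    return -sum(log_softmax(logits)[t] for logits, t in zip(step_logits, target_ids))

# Toy example: 3 demonstration tokens, vocabulary of 4 tokens (made-up numbers).
step_logits = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 0.0, 0.0, 0.0],
    [-1.0, 3.0, 0.2, 0.0],
]
target_ids = [0, 2, 1]
loss = sft_loss(step_logits, target_ids)
print(f"SFT loss: {loss:.4f}")
```

A real training loop would average this loss over a batch of demonstrations and backpropagate through the model that produced the logits; the sketch only shows how the scalar is formed.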

After the SFT model is trained, we define a reward function $R(x, y)$. The reward is designed to capture task-specific criteria of success, producing a scalar score for a candidate output $y$ given input $x$. This can be any measurable signal that indicates how well the output meets the task objective.
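As one possible illustration of $R(x, y)$, the sketch below scores a math completion by checking whether the last number it contains matches a known reference answer. The helper names (`extract_final_number`, `make_reward`) and the lookup table of reference answers are assumptions made for this example, not part of any particular library or of the method described here.

```python
import re

def extract_final_number(text):
    """Return the last number that appears in text, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None

def make_reward(reference_answers):
    """Build R(x, y) from a hypothetical dict mapping each input x to its numeric answer."""
    def reward(x, y):
        predicted = extract_final_number(y)
        if predicted is None:
            return 0.0  # no parsable answer in the completion
        return 1.0 if abs(predicted - reference_answers[x]) < 1e-9 else 0.0
    return reward

# Hypothetical task data, for illustration only.
reward = make_reward({"What is 6 * 7?": 42.0})
print(reward("What is 6 * 7?", "6 * 7 = 42"))
print(reward("What is 6 * 7?", "The answer is 41."))
```

A binary exact-match reward like this is easy to verify but sparse; graded signals (unit tests passed, a learned reward model's score) fit the same `reward(x, y)` interface.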

The final step is to use this reward function to further refine the policy (the language model). Outputs $y$ are sampled from the model distribution $P_\theta(\cdot \mid x)$ and scored with $R(x, y)$, and the scores are then used to update the model parameters by following the gradient of the objective:

$$
\max_\theta \; \mathbb{E}_{y \sim P_\theta(\cdot \mid x)} \Big[ R(x, y) - \beta \, \mathrm{KL}\big(P_\theta(\cdot \mid x) \,\|\, P_{\text{SFT}}(\cdot \mid x)\big) \Big]
$$

The first term encourages the model to generate outputs that achieve higher reward according to the task definition, while the KL penalty keeps the updated model close to the SFT baseline. The hyperparameter $\beta$ controls the trade-off between pursuing reward and staying close to that baseline: a larger $\beta$ gives a more conservative update. In practice, algorithms such as PPO and GRPO approximate this update using sampled completions, while DPO optimizes the same KL-regularized objective directly from preference pairs, without an explicit reward model.
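The trade-off controlled by $\beta$ can be seen on a toy problem where the "model" is just a categorical distribution over three candidate completions. The sketch below evaluates the KL-regularized objective for a few policies and compares them with the known closed-form maximizer of this objective, $P^*(y \mid x) \propto P_{\text{SFT}}(y \mid x)\, \exp(R(x, y) / \beta)$. The probabilities, rewards, and function names are made up for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) for two categorical distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def objective(p, p_sft, rewards, beta):
    """Expected reward under p minus beta * KL(p || p_sft)."""
    expected_reward = sum(pi * r for pi, r in zip(p, rewards))
    return expected_reward - beta * kl(p, p_sft)

def closed_form_optimum(p_sft, rewards, beta):
    """Maximizer of the objective: p*(y) proportional to p_sft(y) * exp(R(y) / beta)."""
    weights = [q * math.exp(r / beta) for q, r in zip(p_sft, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

# Toy setup: three candidate completions for a single prompt (made-up numbers).
p_sft = [0.7, 0.2, 0.1]      # SFT model's probabilities
rewards = [0.0, 1.0, 0.0]    # only the second completion is correct
beta = 0.5

p_star = closed_form_optimum(p_sft, rewards, beta)
near_greedy = [0.001, 0.998, 0.001]  # almost all mass on the best completion

for name, p in [("SFT", p_sft), ("near-greedy", near_greedy), ("optimum", p_star)]:
    print(f"{name:12s} objective = {objective(p, p_sft, rewards, beta):.4f}")
```

In this toy setting the optimum shifts probability toward the rewarded completion without collapsing onto it, which is the behavior the KL term is meant to produce. For a real LLM the distribution over completions is far too large to enumerate, which is why sampling-based approximations are used instead.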

Note that this is **policy-gradient reinforcement learning on a pretrained LLM**. We treat the LLM as a stochastic policy, sample actions (here, full completions), score them with a reward (from a reward model or a reward function), and update the parameters along the gradient so that higher-scoring completions become more probable.
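The sample, score, update loop can be sketched with a REINFORCE-style update on a toy policy. This is a deliberately simplified stand-in, not PPO or GRPO: the policy is a softmax over three fixed "completions" rather than an LLM, the KL penalty is omitted, and the rewards, learning rate, and function names are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(logits, rewards, lr, num_samples, rng):
    """One policy-gradient step: sample completions, score them, raise the
    log-probability of those that beat the batch-mean reward."""
    probs = softmax(logits)
    samples = rng.choices(range(len(logits)), weights=probs, k=num_samples)
    # Batch-mean baseline to reduce variance (a common simplification).
    baseline = sum(rewards[a] for a in samples) / num_samples
    grad = [0.0] * len(logits)
    for a in samples:
        advantage = rewards[a] - baseline
        for i in range(len(logits)):
            # d log pi(a) / d logit_i = 1[i == a] - pi(i) for a softmax policy
            indicator = 1.0 if i == a else 0.0
            grad[i] += advantage * (indicator - probs[i]) / num_samples
    return [l + lr * g for l, g in zip(logits, grad)]

def train(rewards, steps=200, lr=0.5, num_samples=16, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)  # start from a uniform policy
    for _ in range(steps):
        logits = reinforce_step(logits, rewards, lr, num_samples, rng)
    return softmax(logits)

# Made-up rewards for three candidate completions; index 1 is the best.
rewards = [0.0, 1.0, 0.2]
final_probs = train(rewards)
print([round(p, 3) for p in final_probs])
```

Starting from a uniform policy, the probability of the highest-reward completion grows over training. With an LLM, the logits come from the network and the same gradient flows into its weights.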