In [None]:
import torch
from torch.nn.functional import softmax
from shared import gpt, tokenizer, show_probabilities

### P(u)

![gpt1math](assets/pumath.png)

In [None]:
vocab = ['a', 'b', 'c', 'd', 'e']
logits = torch.tensor([2.5, 1.0, 0.5, -1.0, 3.0])
probabilities = softmax(logits, dim=0)

show_probabilities(logits, probabilities, vocab)
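
To make the softmax step less of a black box, here is a minimal sketch that recomputes the same probabilities by hand. It is written in plain Python rather than torch so that each step is visible. The helper name `softmax_by_hand` is made up for this illustration and is not part of `shared`. Subtracting the maximum logit first is a common numerical-stability trick and does not change the result.

```python
import math

def softmax_by_hand(logits):
    """Softmax over a list of floats: exp(x_i) / sum_j exp(x_j)."""
    m = max(logits)                           # shift by the max for numerical stability
    exps = [math.exp(x - m) for x in logits]  # exponentiate each shifted logit
    total = sum(exps)                         # normalizing constant
    return [e / total for e in exps]

vocab = ['a', 'b', 'c', 'd', 'e']
logits = [2.5, 1.0, 0.5, -1.0, 3.0]
probs = softmax_by_hand(logits)

for token, logit, p in zip(vocab, logits, probs):
    print(f"{token}: logit={logit:+.1f}  p={p:.4f}")
```

The probabilities sum to 1, and the token with the largest logit (`'e'`) receives the largest share.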

In [None]:
text = 'The sun rises in the'
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    outputs = gpt(**inputs)

# logits has shape (batch, sequence, vocab): take the last position of the first sequence
last_token_logits = outputs.logits[0, -1, :]

probabilities = softmax(last_token_logits, dim=0)

# get top 10 most likely next tokens
top_probs, top_indices = probabilities.topk(10)
top_logits = last_token_logits[top_indices]
top_tokens = [tokenizer.decode([idx]) for idx in top_indices.tolist()]

show_probabilities(top_logits, top_probs, top_tokens)

#### Sampling

Once we have probabilities, how do we actually pick the next token?

- **Greedy decoding**: Always pick the highest-probability token (deterministic)
- **Top-k**: Sample only from the k most likely tokens
- **Top-p (nucleus)**: Sample from the smallest set of tokens whose cumulative probability is >= p
- **Structured generation**: Restrict the choice to the part of the vocabulary that is valid in context (for example, tokens that keep the output valid JSON)
- **Temperature**: Divide the logits by a temperature T before softmax to control randomness (T > 1 flattens the distribution for more varied output, T < 1 sharpens it)
- **Beam search**: Heuristic search that tracks several candidate sequences at once. At each step it keeps only the highest-scoring partial sequences (the "beams"); the number kept is the beam width
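
The greedy, temperature, top-k, and top-p entries above can be sketched on the toy logits from earlier. This is an illustrative version in plain Python, not how any particular library implements these filters. All function names here (`apply_temperature`, `top_k_filter`, `top_p_filter`, `greedy`, `sample`) are hypothetical helpers introduced for the example.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    """Divide logits by T > 0. T < 1 sharpens the distribution, T > 1 flattens it."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return [x / temperature for x in logits]

def renormalize(probs, kept):
    """Zero out every index not in `kept`, then rescale so the rest sum to 1."""
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, set(order[:k]))

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability is >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, kept)

def greedy(probs):
    """Index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample(probs, rng):
    """Draw one index at random, weighted by probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

vocab = ['a', 'b', 'c', 'd', 'e']
logits = [2.5, 1.0, 0.5, -1.0, 3.0]
rng = random.Random(0)  # fixed seed so reruns draw the same tokens

probs = softmax(logits)
sharp = softmax(apply_temperature(logits, 0.5))
flat = softmax(apply_temperature(logits, 2.0))
top2 = top_k_filter(probs, 2)
nucleus = top_p_filter(probs, 0.9)

print("greedy pick:", vocab[greedy(probs)])
print("top-2 sample:", vocab[sample(top2, rng)])
print("nucleus keeps:", [t for t, q in zip(vocab, nucleus) if q > 0])
```

With these logits, top-k with k=2 keeps `'e'` and `'a'`, while top-p with p=0.9 also needs `'b'` to reach the threshold. In practice the filters are usually combined: temperature first, then top-k or top-p, then a random draw.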

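The choice of strategy shapes the output: greedy decoding and beam search are deterministic and tend toward safe, repetitive text, while sampling with a higher temperature or a larger k or p gives more varied text.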
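
Beam search is easier to follow on a tiny example than on a real model. The sketch below uses a made-up transition table (`TRANSITIONS`) in place of GPT, so the next-token probabilities depend only on the previous token. The table, the `<s>` and `</s>` markers, and the `beam_search` function are all assumptions for illustration. Real implementations typically add refinements such as length normalization, which are omitted here.

```python
import math

# Hypothetical toy "model": next-token probabilities given the previous token.
TRANSITIONS = {
    '<s>':  {'the': 0.6, 'a': 0.4},
    'the':  {'sun': 0.5, 'moon': 0.3, '</s>': 0.2},
    'a':    {'star': 0.9, '</s>': 0.1},
    'sun':  {'</s>': 1.0},
    'moon': {'</s>': 1.0},
    'star': {'</s>': 1.0},
}

def beam_search(beam_width, max_steps=5):
    """Return up to beam_width (log_prob, tokens) pairs, best first."""
    beams = [(0.0, ['<s>'])]  # each beam: (cumulative log-probability, tokens so far)
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == '</s>':
                candidates.append((score, tokens))  # finished: carry forward unchanged
                continue
            # Extend this beam with every possible next token.
            for tok, p in TRANSITIONS[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [tok]))
        # Keep only the beam_width highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(tokens[-1] == '</s>' for _, tokens in beams):
            break
    return beams

for width in (1, 2):
    score, tokens = beam_search(width)[0]
    print(f"width={width}: {' '.join(tokens)}  p={math.exp(score):.2f}")
```

With a beam width of 1 the search reduces to greedy decoding and commits to `the`, ending at `the sun` with probability 0.30. With a width of 2 it keeps `a` alive for one more step and finds `a star` with probability 0.36, a sequence that greedy decoding misses.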