# Beam Search

- 📺 **Video:** [https://youtu.be/wltqDbhlcJ0](https://youtu.be/wltqDbhlcJ0)

## Overview
- Decode sequences by keeping several hypotheses alive at each time step instead of committing to the single greedy choice.
- Understand how beam size trades computation for solution quality.
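To make the computation side of that trade-off concrete, here is a minimal sketch that counts how many candidate extensions get scored in one decoding run. The function name, the vocabulary size of 1000, and the 10 steps are illustrative assumptions. It also assumes every kept hypothesis is extended by every vocabulary token at every step, so treat the numbers as a rough upper bound rather than a measurement.

```python
def candidates_scored(beam_size: int, vocab_size: int, steps: int) -> int:
    """Count candidate extensions scored over one decoding run.

    Assumes the first step expands a single start hypothesis and every
    later step expands `beam_size` hypotheses over the full vocabulary.
    """
    if steps <= 0:
        return 0
    first_step = vocab_size
    later_steps = (steps - 1) * beam_size * vocab_size
    return first_step + later_steps


# Hypothetical setting: 1000-token vocabulary, 10 decoding steps.
for k in (1, 2, 4, 8):
    print(f"beam_size={k}: {candidates_scored(k, vocab_size=1000, steps=10)} candidates")
```

Under these assumptions the work grows roughly linearly with the beam size, which is why small beams are a common starting point.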

## Key ideas
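- **Hypothesis expansion:** at each step, extend every kept hypothesis by one token and keep the top-k partial sequences ranked by cumulative log probability.
- **Length normalization:** every added token contributes a non-positive log probability, so raw scores favor short outputs; normalizing by length counteracts this.
- **Diversity:** constraints or penalties push beams apart so they do not all share nearly the same prefix.
- **Trade-offs:** larger beams explore more candidates and tend to find higher-scoring sequences, but raise computation and latency.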
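The length normalization idea can be sketched as follows. This uses one common variant, dividing the cumulative log probability by `length ** alpha`; other formulas exist, and the helper name `length_normalized`, the `alpha` default, and the two hypotheses are assumptions made up for illustration.

```python
def length_normalized(score: float, length: int, alpha: float = 1.0) -> float:
    """Divide a cumulative log probability by length**alpha.

    alpha=0 leaves the score unchanged; alpha=1 gives the average
    log probability per token.
    """
    return score / (length ** alpha)


# Hypothetical hypotheses: (tokens, cumulative log probability).
hyps = [
    (['A', '</s>'], -0.9),
    (['A', 'B', 'C', '</s>'], -1.2),
]

# Rank once by raw score and once by length-normalized score.
best_raw = max(hyps, key=lambda h: h[1])
best_norm = max(hyps, key=lambda h: length_normalized(h[1], len(h[0])))

print('raw best:       ', ' '.join(best_raw[0]))
print('normalized best:', ' '.join(best_norm[0]))
```

With raw scores the two-token hypothesis wins (-0.9 against -1.2); per token, the four-token hypothesis is better (-0.3 against -0.45), so the ranking flips.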

## Demo
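Run beam search on a toy log-probability table for up to three decoding steps and print the highest-scoring hypotheses, paralleling the lecture (https://youtu.be/L369zuF6Lt8). Scores are raw cumulative log probabilities with no length normalization, so hypotheses that end early with `</s>` can outrank longer ones.

```python
# Toy vocabulary; '</s>' marks the end of a sequence.
vocab = ['A', 'B', 'C', '</s>']

# log_probs[prev][next] is the log probability of `next` given `prev`.
log_probs = {
    '<s>': {'A': -0.2, 'B': -1.0, 'C': -1.5},
    'A': {'A': -2.0, 'B': -0.5, 'C': -1.2, '</s>': -0.4},
    'B': {'A': -1.3, 'B': -0.8, 'C': -0.9, '</s>': -0.3},
    'C': {'A': -0.5, 'B': -1.5, 'C': -0.7, '</s>': -1.2}
}

beam_size = 2
max_len = 3
beams = [('<s>', 0.0)]  # (space-joined sequence, cumulative log probability)
finished = []

for _ in range(max_len):
    new_beams = []
    for seq, score in beams:
        last_token = seq.split()[-1]
        for token, lp in log_probs[last_token].items():
            new_seq = seq + ' ' + token
            new_score = score + lp
            if token == '</s>':
                # Completed hypotheses leave the beam and are not expanded again.
                finished.append((new_seq, new_score))
            else:
                new_beams.append((new_seq, new_score))
    # Log probabilities closer to zero are better, so sort in descending order.
    new_beams.sort(key=lambda x: x[1], reverse=True)
    beams = new_beams[:beam_size]

# Hypotheses still open after max_len steps compete with the completed ones.
finished.extend(beams)
finished.sort(key=lambda x: x[1], reverse=True)

print('Top hypotheses:')
for seq, score in finished[:3]:
    # Drop the start symbol for display.
    print(f"{seq.replace('<s> ', '')} | logP={score:.2f}")
```

```
Top hypotheses:
A </s> | logP=-0.60
A B </s> | logP=-1.00
B </s> | logP=-1.30
```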
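The diversity idea listed under Key ideas can be sketched as a single expansion step. This is one simple option among several: each candidate is charged a penalty proportional to its rank among the children of the same parent, so one strong parent is less likely to fill the whole beam. The function name, the `gamma` parameter, and the small table below are hypothetical and are not taken from the lecture.

```python
def expand_with_sibling_penalty(beams, log_probs, beam_size, gamma):
    """Run one beam-search step with a rank penalty among siblings.

    Each parent's children are ranked by log probability (best first) and
    the child at rank r is charged gamma * r when candidates are compared.
    The returned score is still the plain cumulative log probability.
    """
    candidates = []
    for seq, score in beams:
        children = sorted(log_probs[seq[-1]].items(),
                          key=lambda kv: kv[1], reverse=True)
        for rank, (token, lp) in enumerate(children):
            new_score = score + lp
            candidates.append((new_score - gamma * rank, seq + [token], new_score))
    # Select by the penalized score, best first.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [(seq, score) for _, seq, score in candidates[:beam_size]]


# Hypothetical table and beam state, chosen so that parent 'A' dominates.
toy_log_probs = {
    'A': {'A': -0.3, 'B': -0.4, 'C': -2.0},
    'B': {'A': -0.6, 'B': -1.5, 'C': -2.5},
}
toy_beams = [(['A'], -0.2), (['B'], -0.5)]

plain = expand_with_sibling_penalty(toy_beams, toy_log_probs, beam_size=2, gamma=0.0)
diverse = expand_with_sibling_penalty(toy_beams, toy_log_probs, beam_size=2, gamma=1.0)

print('gamma=0.0:', [' '.join(seq) for seq, _ in plain])
print('gamma=1.0:', [' '.join(seq) for seq, _ in diverse])
```

With `gamma=0.0` both survivors extend `A`; with `gamma=1.0` the second slot goes to the best child of `B`, so the beam keeps two different prefixes.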


## Try it
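- Modify the demo: set `beam_size` to 1 (greedy decoding) or 3, change `max_len`, and compare the top hypotheses.
- Add a tiny dataset or counter-example, such as a small transition table where greedy decoding misses the highest-probability sequence.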
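As one possible starting point for the counter-example exercise, the sketch below wraps beam search in a function and runs it on a hypothetical two-step table. The function signature, the tokens `X` and `Y`, and all log probabilities are made up for illustration. The table is built so that the locally best first token leads to a worse overall sequence.

```python
def beam_search(log_probs, beam_size, max_len, start='<s>', end='</s>'):
    """Return hypotheses as (tokens, cumulative log probability), best first."""
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            # Tokens without an entry in the table have no continuations.
            for token, lp in log_probs.get(seq[-1], {}).items():
                hyp = (seq + [token], score + lp)
                if token == end:
                    finished.append(hyp)
                else:
                    candidates.append(hyp)
        candidates.sort(key=lambda h: h[1], reverse=True)
        beams = candidates[:beam_size]
    return sorted(finished + beams, key=lambda h: h[1], reverse=True)


# Hypothetical table: 'A' looks best at step one, but 'B' has the better follow-up.
table = {
    '<s>': {'A': -0.5, 'B': -0.9},
    'A': {'X': -1.6, 'Y': -1.7},
    'B': {'X': -0.1, 'Y': -3.0},
}

greedy_best = beam_search(table, beam_size=1, max_len=2)[0]
beam_best = beam_search(table, beam_size=2, max_len=2)[0]

print('greedy:', ' '.join(greedy_best[0]), f"{greedy_best[1]:.2f}")
print('beam=2:', ' '.join(beam_best[0]), f"{beam_best[1]:.2f}")
```

Greedy decoding commits to `A` and ends at `<s> A X` with log probability -2.10, while a beam of two keeps `B` alive and finds `<s> B X` at -1.00.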




*Links only; we do not redistribute slides or papers.*