# Beam Search

- 📺 **Video:** [https://youtu.be/wltqDbhlcJ0](https://youtu.be/wltqDbhlcJ0)

## Overview
Introduces beam search, a decoding algorithm for generating sequences from models (such as translation or dialogue systems) that usually finds higher-probability outputs than greedy decoding. The video explains that at each time step, instead of keeping only the single best next word, beam search keeps the top B partial hypotheses (sequences), where B is the beam width, and extends each of them.
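
The procedure above can be written compactly. This is a sketch under the usual assumption that the model factorizes autoregressively; the symbols $x$ (input), $y_i$ (the $i$-th output token), $V$ (vocabulary), and $\mathcal{H}_t$ (the set of hypotheses kept at step $t$) are notation introduced here, not taken from the video. Each partial hypothesis is scored by its log-probability:

$$
\log P(y_{1:t} \mid x) = \sum_{i=1}^{t} \log P(y_i \mid y_{<i}, x)
$$

and each step keeps the $B$ highest-scoring one-token extensions of the previous hypotheses:

$$
\mathcal{H}_t = \operatorname*{top\text{-}B}_{\,y_{1:t-1} \in \mathcal{H}_{t-1},\; w \in V} \; \log P(y_{1:t-1}, w \mid x)
$$

With $B = 1$ this reduces to greedy decoding.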

In [None]:
import os, random
random.seed(0)
CI = os.environ.get('CI') == 'true'

## Key ideas
- After extending every hypothesis, beam search prunes the candidates to the top B and repeats.
- This often finds a higher-probability complete sequence than greedy decoding, because tracking several candidates in parallel lets it recover from an early choice that only looked best locally. It remains a heuristic and is not guaranteed to find the most probable sequence.
- A small example of the kind the lecture might give: suppose the model's first-word probabilities are “I” (0.4), “The” (0.3), “A” (0.2), ...; a beam of size 2 keeps “I” and “The”.
- At the next word, each kept hypothesis is expanded with several continuations, the best 2 two-word sequences overall are kept, and so on.
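
The example above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: `TOY_MODEL` and `beam_search` are hypothetical names, the first-step probabilities are the ones quoted above (the remaining mass belongs to words the example leaves out), and the second-step probabilities are invented so that greedy decoding and a beam of size 2 disagree.

```python
import math

# Hypothetical toy "model": maps a prefix (tuple of words) to next-word probabilities.
# First-step values come from the example above; second-step values are invented.
TOY_MODEL = {
    (): {"I": 0.4, "The": 0.3, "A": 0.2},
    ("I",): {"am": 0.3, "was": 0.25, "have": 0.2},
    ("The",): {"cat": 0.9, "dog": 0.05},
    ("A",): {"cat": 0.5, "dog": 0.4},
}

def beam_search(model, beam_width, num_steps):
    """Return up to beam_width (log_prob, sequence) pairs, best first."""
    beams = [(0.0, ())]  # start from the empty sequence with log-prob 0
    for _ in range(num_steps):
        candidates = []
        for log_prob, seq in beams:
            # Extend every kept hypothesis with every known next word.
            for word, p in model.get(seq, {}).items():
                candidates.append((log_prob + math.log(p), seq + (word,)))
        if not candidates:
            break  # no hypothesis can be extended further
        # Prune: keep only the beam_width highest-scoring extensions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams

for width in (1, 2):
    log_prob, seq = beam_search(TOY_MODEL, beam_width=width, num_steps=2)[0]
    print(f"B={width}: {' '.join(seq)} (p={math.exp(log_prob):.2f})")
# B=1: I am (p=0.12)
# B=2: The cat (p=0.27)
```

With `beam_width=1` the search commits to “I” and can do no better than 0.12, while `beam_width=2` keeps “The” alive long enough to find “The cat” at 0.27. Scores are summed log-probabilities rather than multiplied probabilities, which avoids numerical underflow on longer sequences.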

## Demo

In [None]:
print('Try the exercises below and follow the linked materials.')

## Try it
- Modify the demo
- Add a tiny dataset or counter-example



*Links only; we do not redistribute slides or papers.*