# LM Evaluation

- 📺 **Video:** [https://youtu.be/ImW4vJ5XZQc](https://youtu.be/ImW4vJ5XZQc)

## Overview
- Evaluate language models with log-likelihood, perplexity, and cross-entropy metrics.
- Understand how evaluation choices relate to downstream tasks.

## Key ideas
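- **Log-likelihood:** the sum of the log probabilities the model assigns to each token of held-out text; values closer to zero indicate a better fit.
- **Cross-entropy:** the average negative log-likelihood per token (in nats with natural logs, in bits with base-2 logs).
- **Perplexity:** the exponentiated cross-entropy, using the same base as the logarithm; lower is better.
- **Calibration:** a model with high perplexity may still yield useful relative scores for re-ranking candidates.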
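
The three metrics above are simple transformations of one another. The following is a minimal sketch of that relationship, assuming we already have the probability the model assigned to each observed token; the helper name `evaluate` and the probability values are made up for illustration.

```python
import math

def evaluate(token_probs):
    """Return (log-likelihood, cross-entropy, perplexity) for one sequence.

    `token_probs` is a hypothetical input: the probability the model assigned
    to each observed token, in order. Natural logs are used, so cross-entropy
    is measured in nats per token.
    """
    log_likelihood = sum(math.log(p) for p in token_probs)
    cross_entropy = -log_likelihood / len(token_probs)
    perplexity = math.exp(cross_entropy)
    return log_likelihood, cross_entropy, perplexity

# Made-up probabilities for a three-token sequence.
ll, ce, ppl = evaluate([0.6, 0.5, 0.6])
print(f"log-likelihood {ll:.4f} | cross-entropy {ce:.4f} nats | perplexity {ppl:.4f}")

# Sanity check: a model that is uniform over 8 outcomes has perplexity 8.
_, _, uniform_ppl = evaluate([1 / 8] * 5)
print(f"uniform over 8 outcomes -> perplexity {uniform_ppl:.4f}")
```

The uniform case gives one way to read perplexity: it is roughly the number of equally likely choices the model is hesitating between at each token.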

## Demo
Compute the perplexity of toy bigram and trigram language models on two held-out sentences, mirroring the evaluation checklist from the lecture (https://youtu.be/eAXNGkzLZd0).

```python
import math

# Toy models: each key is an n-gram (context tokens followed by the predicted
# token) and each value is the conditional probability P(token | context).
# The tables are illustrative and not fully normalized.
bigram_model = {
    ('<s>', 'the'): 0.6,
    ('<s>', 'a'): 0.4,
    ('the', 'cat'): 0.5,
    ('the', 'dog'): 0.5,
    ('cat', '</s>'): 0.6,
    ('dog', '</s>'): 0.7,
    ('dog', 'runs'): 0.3,
    ('runs', '</s>'): 1.0
}

trigram_model = {
    ('<s>', '<s>', 'the'): 0.8,
    ('<s>', 'the', 'cat'): 0.7,
    ('<s>', 'the', 'dog'): 0.3,
    ('the', 'cat', '</s>'): 0.9,
    ('the', 'dog', '</s>'): 0.4,
    ('the', 'dog', 'runs'): 0.6,
    ('dog', 'runs', '</s>'): 1.0
}

def perplexity(model, sentence, order):
    """Per-token perplexity of `sentence` under an n-gram `model` of size `order`."""
    # Pad with (order - 1) start symbols so the first word has a full context.
    tokens = ['<s>'] * (order - 1) + sentence.split() + ['</s>']
    log_prob = 0.0
    for i in range(order - 1, len(tokens)):
        # The lookup key is the full n-gram: the context plus the current token.
        ngram = tuple(tokens[i - order + 1:i + 1])
        # Unseen n-grams get a small floor probability instead of zero.
        prob = model.get(ngram, 1e-6)
        log_prob += math.log(prob)
    # Normalize by the number of predicted tokens (</s> counts, <s> padding does not).
    num_predicted = len(tokens) - (order - 1)
    return math.exp(-log_prob / num_predicted)

sentences = ['the cat', 'the dog runs']
for sent in sentences:
    ppl_bigram = perplexity(bigram_model, sent, order=2)
    ppl_trigram = perplexity(trigram_model, sent, order=3)
    print(f"Sentence '{sent}' | bigram perplexity {ppl_bigram:.2f} | trigram perplexity {ppl_trigram:.2f}")
```

```
Sentence 'the cat' | bigram perplexity 1.77 | trigram perplexity 1.26
Sentence 'the dog runs' | bigram perplexity 1.83 | trigram perplexity 1.62
```
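
The re-ranking point from Key ideas can be checked on toy numbers as well. The sketch below is an illustration under stated assumptions, not part of the lecture demo: the candidate sentences, their per-token probabilities, and the `flatten` function (which mimics a poorly calibrated model by mixing every probability with a uniform distribution over a hypothetical 10,000-word vocabulary) are all made up.

```python
import math

def log_likelihood(token_probs):
    """Sum of log probabilities for one candidate."""
    return sum(math.log(p) for p in token_probs)

def perplexity(token_probs):
    """Exponentiated average negative log-likelihood per token."""
    return math.exp(-log_likelihood(token_probs) / len(token_probs))

def flatten(token_probs, weight=0.05, vocab_size=10_000):
    """Hypothetical degraded model: mix each probability with a uniform one."""
    return [weight * p + (1 - weight) / vocab_size for p in token_probs]

# Made-up per-token probabilities (three words plus end-of-sentence)
# for three equal-length candidates.
candidates = {
    "the cat sat": [0.6, 0.5, 0.4, 0.7],
    "a cat sat": [0.4, 0.2, 0.3, 0.7],
    "the cat sad": [0.6, 0.5, 0.01, 0.3],
}

def rank(score_fn):
    """Candidates sorted from highest to lowest log-likelihood."""
    return sorted(
        candidates,
        key=lambda c: log_likelihood(score_fn(candidates[c])),
        reverse=True,
    )

sharp_ranking = rank(lambda probs: probs)  # original probabilities
flat_ranking = rank(flatten)               # degraded probabilities

for text, probs in candidates.items():
    print(f"{text!r}: sharp ppl {perplexity(probs):.2f} | flat ppl {perplexity(flatten(probs)):.2f}")
print("same ranking:", sharp_ranking == flat_ranking)
```

In this toy case the flattened scores have much higher perplexity yet order the candidates identically. That is not guaranteed in general; it only shows that perplexity and ranking quality can come apart.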


## Try it
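- Modify the demo, for example by changing the `1e-6` fallback probability for unseen n-grams and checking how the perplexities move.
- Add a tiny dataset or a counter-example, such as a sentence containing an n-gram that neither model has seen.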
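
One possible counter-example, sketched under assumptions: the bigram table below copies a few entries from the demo so the block runs on its own, and `bigram_perplexity` is a hypothetical helper that exposes the fallback probability as a `floor` parameter. The sentence `'a cat'` contains the bigram `('a', 'cat')`, which is missing from the table.

```python
import math

# Subset of the demo's bigram table, copied here so the sketch is self-contained.
bigram_model = {
    ('<s>', 'the'): 0.6,
    ('<s>', 'a'): 0.4,
    ('the', 'cat'): 0.5,
    ('cat', '</s>'): 0.6,
}

def bigram_perplexity(model, sentence, floor=1e-6):
    """Per-token bigram perplexity; `floor` is the probability of unseen bigrams."""
    tokens = ['<s>'] + sentence.split() + ['</s>']
    log_prob = sum(
        math.log(model.get((prev, cur), floor))
        for prev, cur in zip(tokens, tokens[1:])
    )
    # One prediction per token after the initial <s>.
    return math.exp(-log_prob / (len(tokens) - 1))

seen = bigram_perplexity(bigram_model, 'the cat')
unseen = bigram_perplexity(bigram_model, 'a cat')  # ('a', 'cat') is not in the table
print(f"'the cat' -> perplexity {seen:.2f}")
print(f"'a cat'   -> perplexity {unseen:.2f}")

# The score for the unseen case depends heavily on the arbitrary floor value.
for floor in (1e-3, 1e-6, 1e-9):
    print(f"floor={floor:g} -> perplexity {bigram_perplexity(bigram_model, 'a cat', floor):.2f}")
```

A single unseen bigram dominates the score, and the result is driven by the choice of floor rather than by the model, which is one reason smoothing choices should be reported alongside perplexity.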


## References
- [Eisenstein 6.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.2](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [[Blog] Understanding LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
- [[Blog] The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409)
- [The Impact of Positional Encoding on Length Generalization in Transformers](https://arxiv.org/abs/2305.19466)


*Links only; we do not redistribute slides or papers.*