# Neural Language Models

- 📺 **Video:** [https://youtu.be/59NrmwAdOWA](https://youtu.be/59NrmwAdOWA)

## Overview
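- Show how neural language models learn word embeddings together with a nonlinear predictor that maps a context to next-token probabilities.
- Highlight advantages over count-based models: shared parameters generalize across similar contexts, which makes longer contexts practical.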

## Key ideas
- **Distributed representations:** neural LMs learn dense word embeddings jointly with the next-token predictor.
- **Context windows:** feed-forward neural LMs take the concatenated embeddings of a fixed number of preceding tokens as input.
- **Optimization:** train with stochastic gradient descent on the cross-entropy loss.
- **Generalization:** neural LMs share parameters across contexts, so they can still score contexts never seen in training, which eases the sparsity problem of count-based models.
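
To make the sparsity point concrete, here is a minimal sketch of an unsmoothed count-based trigram model on a three-sentence toy corpus. The helper name `mle_prob` and the choice of two context words are illustrative assumptions, not something taken from the lecture. Each context gets its own independent count table, so a context that never occurred in training yields no estimate at all.

```python
from collections import Counter, defaultdict

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the mouse ate the cheese",
]
context_size = 2

# One independent count table per observed context (no parameter sharing).
counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] * context_size + sentence.split() + ["</s>"]
    for i in range(context_size, len(tokens)):
        context = tuple(tokens[i - context_size:i])
        counts[context][tokens[i]] += 1


def mle_prob(word, context):
    """Unsmoothed maximum-likelihood estimate (hypothetical helper).

    Returns 0.0 when the context was never observed.
    """
    seen = counts.get(tuple(context))
    if not seen:
        return 0.0
    return seen[word] / sum(seen.values())


p_seen = mle_prob("sat", ["the", "cat"])          # context observed twice
p_wrong_word = mle_prob("sat", ["the", "dog"])    # context observed, word not
p_unseen = mle_prob("on", ["dog", "sat"])         # context never observed

print(p_seen, p_wrong_word, p_unseen)  # 0.5 0.0 0.0
```

A neural LM has no per-context table: every context is scored through the same embeddings and weight matrices, so an unseen context such as `dog sat` still produces a full distribution over the vocabulary.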

## Demo
Implement a simple feed-forward neural language model with an embedding layer and one tanh hidden layer on a toy corpus, mirroring the lecture ([https://youtu.be/pa4eqR8eYp0](https://youtu.be/pa4eqR8eYp0)).

```python
import numpy as np

# Toy corpus of three short sentences.
corpus = [
    'the cat sat on the mat',
    'the dog chased the cat',
    'the mouse ate the cheese'
]

# Vocabulary: every word plus start (<s>) and end (</s>) markers.
vocab = sorted(set(['<s>'] + ' '.join(corpus).split() + ['</s>']))
word_to_id = {word: idx for idx, word in enumerate(vocab)}
context_size = 2
embed_dim = 8
hidden_dim = 16

# Build (context ids, target id) pairs, left-padding each sentence with <s>.
sequences = []
for sentence in corpus:
    tokens = sentence.split() + ['</s>']
    padded = ['<s>'] * context_size + tokens
    for i in range(context_size, len(padded)):
        context = padded[i-context_size:i]
        target = padded[i]
        sequences.append(([word_to_id[c] for c in context], word_to_id[target]))

# Parameters: embedding table, hidden layer (W1, B1), output layer (W2, B2).
rng = np.random.default_rng(4)
embeddings = rng.normal(scale=0.1, size=(len(vocab), embed_dim))
W1 = rng.normal(scale=0.1, size=(context_size * embed_dim, hidden_dim))
B1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.1, size=(hidden_dim, len(vocab)))
B2 = np.zeros(len(vocab))

lr = 0.1
for epoch in range(1, 301):
    total_loss = 0.0
    rng.shuffle(sequences)
    for context_ids, target_id in sequences:
        # Forward pass: concatenate context embeddings, tanh hidden layer, softmax.
        context_vec = embeddings[context_ids].reshape(-1)
        h = np.tanh(context_vec @ W1 + B1)
        logits = h @ W2 + B2
        exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
        probs = exp / exp.sum()
        loss = -np.log(probs[target_id] + 1e-8)
        total_loss += loss

        # Backward pass: d(loss)/d(logits) = probs - one_hot(target).
        grad_logits = probs.copy()  # copy so probs itself is not modified
        grad_logits[target_id] -= 1
        grad_W2 = np.outer(h, grad_logits)
        grad_B2 = grad_logits
        grad_h = grad_logits @ W2.T
        grad_pre = grad_h * (1 - h ** 2)  # derivative of tanh
        grad_W1 = np.outer(context_vec, grad_pre)
        grad_B1 = grad_pre
        grad_context = grad_pre @ W1.T
        grad_context = grad_context.reshape(context_size, embed_dim)

        # SGD update, one training example at a time.
        W2 -= lr * grad_W2
        B2 -= lr * grad_B2
        W1 -= lr * grad_W1
        B1 -= lr * grad_B1
        # Update row by row so a word repeated in the context accumulates both gradients.
        for idx, emb_grad in zip(context_ids, grad_context):
            embeddings[idx] -= lr * emb_grad
    if epoch % 100 == 0:
        print(f"epoch {epoch:3d} | loss {total_loss/len(sequences):.4f}")

# Evaluate next-word probabilities
context = ['the', 'cat']
context_ids = [word_to_id[c] for c in context]
context_vec = embeddings[context_ids].reshape(-1)
h = np.tanh(context_vec @ W1 + B1)
logits = h @ W2 + B2
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print()
print('Next-word distribution for context "the cat":')
for word, idx in word_to_id.items():
    print(f"P({word}|{' '.join(context)}) = {probs[idx]:.3f}")
```


```
epoch 100 | loss 0.3768
epoch 200 | loss 0.3337
epoch 300 | loss 0.3231

Next-word distribution for context "the cat":
P(</s>|the cat) = 0.633
P(<s>|the cat) = 0.000
P(ate|the cat) = 0.000
P(cat|the cat) = 0.000
P(chased|the cat) = 0.005
P(cheese|the cat) = 0.003
P(dog|the cat) = 0.000
P(mat|the cat) = 0.000
P(mouse|the cat) = 0.000
P(on|the cat) = 0.000
P(sat|the cat) = 0.359
P(the|the cat) = 0.000
```
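
The demo's backward pass starts from `grad_logits`, which is the softmax output with 1 subtracted at the target index. As a sanity check, the sketch below compares that analytic gradient with a central finite-difference estimate on random logits. The helper names `softmax` and `cross_entropy`, the vector length 5, and the step size `eps` are assumptions made for illustration, and the small `1e-8` term the demo adds inside the log is left out here.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def cross_entropy(logits, target_id):
    """Negative log-probability of the target under softmax(logits)."""
    return -np.log(softmax(logits)[target_id])


rng = np.random.default_rng(0)
logits = rng.normal(size=5)
target_id = 2

# Analytic gradient: probs - one_hot(target).
analytic = softmax(logits)
analytic[target_id] -= 1.0

# Central finite differences, one logit at a time.
eps = 1e-6
numeric = np.zeros_like(logits)
for k in range(logits.size):
    bump = np.zeros_like(logits)
    bump[k] = eps
    numeric[k] = (cross_entropy(logits + bump, target_id)
                  - cross_entropy(logits - bump, target_id)) / (2 * eps)

max_abs_diff = np.abs(analytic - numeric).max()
print(f"max |analytic - numeric| = {max_abs_diff:.2e}")
```

The same finite-difference idea can be applied to `W1`, `W2`, or an embedding row when changing the demo, which is a cheap way to catch a sign or shape error in a hand-written backward pass.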


## Try it
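- Modify the demo, for example by changing `context_size`, `embed_dim`, or `hidden_dim`, and compare the final loss.
- Add a tiny dataset or counter-example, such as a sentence that reuses a context with a different next word, and check how the predicted distribution shifts.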


## References
- [Eisenstein 6.1](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.2](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.3](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [Eisenstein 6.4](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
- [[Blog] Understanding LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
- [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
- [[Blog] The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409)
- [The Impact of Positional Encoding on Length Generalization in Transformers](https://arxiv.org/abs/2305.19466)


*Links only; we do not redistribute slides or papers.*