# HMMs: Parameter Estimation

- 📺 **Video:** [https://youtu.be/dVF7LZkbl9g](https://youtu.be/dVF7LZkbl9g)

## Overview
- Estimate HMM transition and emission parameters using Baum–Welch (EM).
- Alternate between expectation (E) and maximization (M) steps until convergence.

## Key ideas
- **Forward-backward:** compute posterior probabilities of states and transitions.
- **Expected counts:** use posteriors as fractional counts for transitions/emissions.
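- **Normalization:** re-estimate each parameter by dividing its expected count by the total expected count for the same conditioning state, so that every row of the transition and emission matrices sums to 1.
- **Convergence:** each EM iteration never decreases the data likelihood, but the procedure may converge to a local optimum rather than the global one, so the result depends on initialization.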
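The key ideas above can be written out as equations. The notation here is an assumption, not taken from the lecture: $\alpha_t(i)$ and $\beta_t(i)$ are the forward and backward quantities for state $i$ at time $t$, $a_{ij}$ is the transition probability, $b_j(o_t)$ is the probability that state $j$ emits observation $o_t$, and $\pi_i$ is the initial-state probability. The posteriors over states and transitions are

$$
\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{k}\alpha_t(k)\,\beta_t(k)},
\qquad
\xi_t(i,j) = \frac{\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}{\sum_{k,l}\alpha_t(k)\,a_{kl}\,b_l(o_{t+1})\,\beta_{t+1}(l)}
$$

and the re-estimated parameters for a single sequence of length $T$ are the normalized expected counts

$$
\hat{\pi}_i = \gamma_1(i),
\qquad
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1}\xi_t(i,j)}{\sum_{t=1}^{T-1}\sum_{k}\xi_t(i,k)},
\qquad
\hat{b}_j(v) = \frac{\sum_{t:\,o_t=v}\gamma_t(j)}{\sum_{t=1}^{T}\gamma_t(j)}
$$

With several training sequences, the numerators and denominators are each summed over sequences before dividing.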

## Demo
Run two Baum–Welch iterations on toy observation sequences to demonstrate parameter updates described in the lecture (https://youtu.be/KHzObZGJ8bA).

```python
import numpy as np

observations = ['walk', 'shop', 'clean']
obs_to_idx = {'walk': 0, 'shop': 1, 'clean': 2}
seqs = [['walk', 'shop', 'clean'], ['walk', 'walk', 'shop']]
states = 2
obs_dim = len(observations)

# Initial parameters: start distribution, transition matrix, emission matrix.
start = np.array([0.5, 0.5])
trans = np.array([[0.6, 0.4], [0.3, 0.7]])
emiss = np.array([[0.3, 0.4, 0.3], [0.4, 0.2, 0.4]])

for iteration in range(1, 3):
    # Expected-count accumulators, reset at the start of every EM iteration.
    gamma_sum = np.zeros(states)
    xi_sum = np.zeros((states, states))
    emit_sum = np.zeros((states, obs_dim))

    for seq in seqs:
        T = len(seq)
        obs_idx = [obs_to_idx[w] for w in seq]
        alpha = np.zeros((T, states))
        beta = np.zeros((T, states))

        # Forward pass, renormalized at each step to avoid underflow.
        alpha[0] = start * emiss[:, obs_idx[0]]
        alpha[0] /= alpha[0].sum()
        for t in range(1, T):
            alpha[t] = emiss[:, obs_idx[t]] * (alpha[t-1] @ trans)
            alpha[t] /= alpha[t].sum()

        # Backward pass, also renormalized at each step.
        beta[-1] = np.ones(states)
        for t in range(T-2, -1, -1):
            beta[t] = trans @ (emiss[:, obs_idx[t+1]] * beta[t+1])
            beta[t] /= beta[t].sum()

        # State posteriors; per-step scaling cancels in the row normalization.
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)
        gamma_sum += gamma.sum(axis=0)

        # Transition posteriors: xi[i, j] = P(state i at t, state j at t+1 | seq).
        for t in range(T-1):
            xi = np.outer(alpha[t], emiss[:, obs_idx[t+1]] * beta[t+1]) * trans
            xi /= xi.sum()
            xi_sum += xi

        # Fractional emission counts.
        for t, idx in enumerate(obs_idx):
            emit_sum[:, idx] += gamma[t]

    # M-step: normalize expected counts.
    # Note: `start` is re-estimated here from state occupancy pooled over all
    # time steps. The standard Baum-Welch update uses only the first-step
    # posterior gamma[0] of each sequence.
    start = gamma_sum / gamma_sum.sum()
    trans = xi_sum / xi_sum.sum(axis=1, keepdims=True)
    emiss = emit_sum / emit_sum.sum(axis=1, keepdims=True)

    print(f'Iteration {iteration} start:', start)
    print('Transition matrix:')
    print(trans)
    print('Emission matrix:')
    print(emiss)
```


```text
Iteration 1 start: [0.47897727 0.52102273]
Transition matrix:
[[0.64014467 0.35985533]
 [0.36432026 0.63567974]]
Emission matrix:
[[0.45432977 0.40332147 0.14234875]
 [0.54198473 0.26899309 0.18902217]]
Iteration 2 start: [0.49380369 0.50619631]
Transition matrix:
[[0.65503727 0.34496273]
 [0.39260503 0.60739497]]
Emission matrix:
[[0.45765089 0.38882289 0.15352622]
 [0.54131233 0.27920226 0.17948541]]
```
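The demo prints parameters but not the likelihood, so the convergence claim is not visible in its output. The following is a minimal sketch, not the lecture's implementation, that tracks the total log-likelihood across iterations on the same toy data. The function names `forward_backward` and `baum_welch` are hypothetical, observations are assumed to be pre-mapped to integer indices, and the initial-state update uses only the first-step posterior `gamma[0]`, which is the textbook choice and differs from the pooled estimate used in the demo.

```python
import numpy as np

def forward_backward(obs, start, trans, emiss):
    """Scaled forward-backward for one sequence of observation indices."""
    T, K = len(obs), len(start)
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    scale = np.zeros(T)

    # Forward pass; scale[t] is P(o_t | o_1..o_{t-1}).
    alpha[0] = start * emiss[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = emiss[:, obs[t]] * (alpha[t - 1] @ trans)
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Backward pass, reusing the forward scaling factors.
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emiss[:, obs[t + 1]] * beta[t + 1]) / scale[t + 1]

    gamma = alpha * beta  # each row sums to 1 under this scaling
    xi = np.zeros((T - 1, K, K))
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * trans
                 * (emiss[:, obs[t + 1]] * beta[t + 1])[None, :]) / scale[t + 1]
    return gamma, xi, np.log(scale).sum()

def baum_welch(seqs, start, trans, emiss, n_iter=10):
    """Run n_iter EM updates; log_liks[i] is measured before update i."""
    log_liks = []
    for _ in range(n_iter):
        start_sum = np.zeros_like(start)
        trans_sum = np.zeros_like(trans)
        emit_sum = np.zeros_like(emiss)
        total = 0.0
        for obs in seqs:
            gamma, xi, ll = forward_backward(obs, start, trans, emiss)
            total += ll
            start_sum += gamma[0]          # first-step posterior only
            trans_sum += xi.sum(axis=0)    # expected transition counts
            for t, o in enumerate(obs):
                emit_sum[:, o] += gamma[t]  # expected emission counts
        log_liks.append(total)
        start = start_sum / start_sum.sum()
        trans = trans_sum / trans_sum.sum(axis=1, keepdims=True)
        emiss = emit_sum / emit_sum.sum(axis=1, keepdims=True)
    return start, trans, emiss, log_liks

# Same toy data and initial parameters as the demo (walk=0, shop=1, clean=2).
seqs = [[0, 1, 2], [0, 0, 1]]
start0 = np.array([0.5, 0.5])
trans0 = np.array([[0.6, 0.4], [0.3, 0.7]])
emiss0 = np.array([[0.3, 0.4, 0.3], [0.4, 0.2, 0.4]])

start, trans, emiss, log_liks = baum_welch(seqs, start0, trans0, emiss0)
print(np.round(log_liks, 4))
```

The printed sequence should be non-decreasing. Because the start update differs from the demo, the parameter values from this sketch are not expected to match the output shown above.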


## Try it
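- Modify the demo: change the initial `start`, `trans`, or `emiss` values, or run more iterations, and compare the resulting estimates.
- Add a tiny dataset or counter-example, and check whether a different initialization converges to different estimates.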
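One possible way to build a tiny dataset for these exercises is to sample sequences from an HMM whose parameters are known, then check how close the re-estimated parameters come. The sketch below is an illustration only: the helper `sample_hmm` and the `true_*` parameter values are hypothetical and do not come from the lecture.

```python
import numpy as np

def sample_hmm(start, trans, emiss, length, rng):
    """Draw one (hidden states, observation indices) pair of the given length."""
    n_states, n_obs = emiss.shape
    states = np.zeros(length, dtype=int)
    obs = np.zeros(length, dtype=int)
    states[0] = rng.choice(n_states, p=start)
    obs[0] = rng.choice(n_obs, p=emiss[states[0]])
    for t in range(1, length):
        # The next state depends only on the previous state.
        states[t] = rng.choice(n_states, p=trans[states[t - 1]])
        # The observation depends only on the current state.
        obs[t] = rng.choice(n_obs, p=emiss[states[t]])
    return states, obs

# Hypothetical "true" parameters, chosen only for illustration.
true_start = np.array([0.6, 0.4])
true_trans = np.array([[0.7, 0.3], [0.4, 0.6]])
true_emiss = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])
names = ['walk', 'shop', 'clean']

rng = np.random.default_rng(0)
dataset = []
for _ in range(4):
    _, obs = sample_hmm(true_start, true_trans, true_emiss, 5, rng)
    dataset.append([names[o] for o in obs])

for seq in dataset:
    print(seq)
```

The resulting `dataset` has the same shape as `seqs` in the demo (a list of lists of observation names), so it can be substituted directly.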



*Links only; we do not redistribute slides or papers.*