In [1]:
#!/usr/bin/env python3

This file illustrates how you might experiment with the HMM interface.
You can paste these commands in at the Python prompt, or execute `test_en.py` directly.
A notebook interface is nicer than the plain Python prompt, so we provide
a notebook version of this file as `test_en.ipynb`, which you can open with
`jupyter` or with Visual Studio Code (`code`); run it with the `nlp-class` kernel.

In [2]:
import logging
import math
import os
from pathlib import Path

In [3]:
from corpus import TaggedCorpus
from eval import eval_tagging, model_cross_entropy, viterbi_error_rate
from hmm import HiddenMarkovModel
from crf import ConditionalRandomField

Set up logging.

In [4]:
logging.root.setLevel(level=logging.INFO)
log = logging.getLogger("test_en")       # For usage, see findsim.py in earlier assignment.
logging.basicConfig(format="%(levelname)s : %(message)s", level=logging.INFO)  # could change INFO to DEBUG

Switch working directory to the directory where the data live.  You may need to edit this line.

In [5]:
os.chdir("../data")

In [6]:
entrain = TaggedCorpus(Path("ensup"), Path("enraw"))                               # all training
ensup =   TaggedCorpus(Path("ensup"), tagset=entrain.tagset, vocab=entrain.vocab)  # supervised training
endev =   TaggedCorpus(Path("endev"), tagset=entrain.tagset, vocab=entrain.vocab)  # evaluation
print(f"{len(entrain)=}  {len(ensup)=}  {len(endev)=}")

INFO : Read 191873 tokens from ensup, enraw
INFO : Created 26 tag types
INFO : Created 18461 word types


len(entrain)=8064  len(ensup)=4051  len(endev)=996


In [7]:
known_vocab = TaggedCorpus(Path("ensup")).vocab    # words seen with supervised tags; used in evaluation
log.info(f"Tagset: {list(entrain.tagset)}")   # no stray "f" before the braces

INFO : Read 95936 tokens from ensup
INFO : Created 26 tag types
INFO : Created 12466 word types
INFO : Tagset: f['W', 'J', 'N', 'C', 'V', 'I', 'D', ',', 'M', 'P', '.', 'E', 'R', '`', "'", 'T', '$', ':', '-', '#', 'S', 'F', 'U', 'L', '_EOS_TAG_', '_BOS_TAG_']


Make an HMM.  Let's do some pre-training to approximately maximize the
regularized log-likelihood on supervised training data.  In other words, the
probabilities at the M step will just be supervised count ratios.

On each epoch, you will see two progress bars: first it collects counts from
all the sentences (E step), and then after the M step, it evaluates the loss
function, which is the (unregularized) cross-entropy on the corpus passed as
`eval_corpus` (below, that is the held-out `endev`, not the training set).
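
The logger reports cross-entropy in nats per token together with the
corresponding perplexity.  Here is a minimal sketch of how those units relate;
the helper name `describe_cross_entropy` and the toy numbers are made up for
illustration and are not part of the assignment code.

```python
import math

def describe_cross_entropy(total_logprob: float, num_tokens: int) -> dict:
    """Convert a corpus log-probability (natural log) into per-token metrics."""
    nats = -total_logprob / num_tokens      # cross-entropy per token, in nats
    bits = nats / math.log(2)               # the same quantity in bits
    perplexity = math.exp(nats)             # equivalently 2 ** bits
    return {"nats": nats, "bits": bits, "perplexity": perplexity}

# Hypothetical corpus: 4 tokens, each given probability 1/8 by the model.
metrics = describe_cross_entropy(total_logprob=4 * math.log(1 / 8), num_tokens=4)
print(metrics)

# Sanity check against a logged line such as
# "Cross-entropy: 7.5993 nats (= perplexity 1996.789)":
print(math.exp(7.5993))   # close to the logged perplexity
```

So a perplexity in the thousands just means that the model is about as
uncertain per token as a uniform choice among that many options.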

The parameters don't actually matter during the E step because there are no
hidden tags to impute.  The first M step will jump right to the optimal
solution.  The code will try a second epoch with the revised parameters, but
the result will be identical, so it will detect convergence and stop.

We arbitrarily choose λ=1 for our add-λ smoothing at the M step, but it would
be better to search for the best value of this hyperparameter.
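
To make "supervised count ratios" with add-λ smoothing concrete, here is a
small sketch on made-up counts.  The function `add_lambda_probs` and the toy
tagset are assumptions for illustration; the real M step lives in `hmm.py` and
may be organized differently.

```python
from collections import Counter

def add_lambda_probs(counts: Counter, outcomes: list, lam: float) -> dict:
    """Return p(outcome) = (count + lam) / (total + lam * number of outcomes)."""
    total = sum(counts[o] for o in outcomes)
    denom = total + lam * len(outcomes)
    return {o: (counts[o] + lam) / denom for o in outcomes}

# Hypothetical toy data: how often each tag followed the tag "D".
next_tag_counts = Counter({"N": 8, "J": 2})
tagset = ["N", "J", "V"]

unsmoothed = add_lambda_probs(next_tag_counts, tagset, lam=0.0)  # plain count ratios
smoothed = add_lambda_probs(next_tag_counts, tagset, lam=1.0)    # add-1 smoothing
print(unsmoothed)
print(smoothed)
```

With λ=0 the unseen transition D→V gets probability 0; with λ=1 it gets 1/13,
at the cost of shaving a little probability off the transitions that were
actually observed.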

In [8]:
log.info("*** Hidden Markov Model (HMM)")
hmm = HiddenMarkovModel(entrain.tagset, entrain.vocab)  # randomly initialized parameters  
loss_sup = lambda model: model_cross_entropy(model, eval_corpus=endev)
hmm.train(corpus=ensup, loss=loss_sup, λ=1.0,
          save_path="ensup_hmm.pkl") 

INFO : *** Hidden Markov Model (HMM)
100%|██████████| 996/996 [00:01<00:00, 937.29it/s]
INFO : Cross-entropy: 12.6501 nats (= perplexity 311807.579)
100%|██████████| 4051/4051 [00:11<00:00, 355.76it/s]
100%|██████████| 996/996 [00:01<00:00, 959.96it/s]
INFO : Cross-entropy: 7.5993 nats (= perplexity 1996.789)
100%|██████████| 4051/4051 [00:11<00:00, 365.06it/s]
100%|██████████| 996/996 [00:01<00:00, 948.97it/s]
INFO : Cross-entropy: 7.5993 nats (= perplexity 1996.791)
INFO : Saved model to ensup_hmm.pkl


In [9]:
log.info("*** Hidden Markov Model (HMM)")
hmm = HiddenMarkovModel(entrain.tagset, entrain.vocab, unigram=True)  # randomly initialized parameters  
loss_sup = lambda model: model_cross_entropy(model, eval_corpus=endev)
hmm.train(corpus=ensup, loss=loss_sup, λ=1.0,
          save_path="ensup_hmm_unigram.pkl") 

INFO : *** Hidden Markov Model (HMM)
100%|██████████| 996/996 [00:01<00:00, 874.08it/s]
INFO : Cross-entropy: 12.6486 nats (= perplexity 311331.512)
100%|██████████| 4051/4051 [00:10<00:00, 374.18it/s]
100%|██████████| 996/996 [00:01<00:00, 930.48it/s]
INFO : Cross-entropy: 8.1047 nats (= perplexity 3310.069)
100%|██████████| 4051/4051 [00:10<00:00, 371.32it/s]
100%|██████████| 996/996 [00:01<00:00, 965.34it/s]
INFO : Cross-entropy: 8.1047 nats (= perplexity 3310.135)
INFO : Saved model to ensup_hmm_unigram.pkl


In [12]:
look_at_your_data(hmm, endev, 10)

INFO : Gold:    ``/` We/P 're/V strongly/R _OOV_/V that/I anyone/N who/W has/V eaten/V in/I the/D cafeteria/N this/D month/N have/V the/D shot/N ,/, ''/' Mr./N Mattausch/N added/V ,/, ``/` and/C that/D means/V virtually/R everyone/N who/W works/V here/R ./.
INFO : Viterbi: ``/` We/P 're/V strongly/N _OOV_/N that/I anyone/N who/W has/V eaten/V in/I the/D cafeteria/N this/D month/N have/V the/D shot/V ,/, ''/' Mr./N Mattausch/N added/V ,/, ``/` and/C that/I means/V virtually/R everyone/N who/W works/V here/R ./.
INFO : Loss:    4/34


INFO : Cross-entropy: 11.735761642456055 nats (= perplexity 3410.486320483463)
---
INFO : Gold:    I/P was/V _OOV_/V to/T read/V the/D _OOV_/N of/I facts/N in/I your/P Oct./N 13/C editorial/N ``/` _OOV_/N 's/P _OOV_/N _OOV_/N ./. ''/'
INFO : Viterbi: I/P was/V _OOV_/N to/T read/V the/D _OOV_/N of/I facts/N in/I your/P Oct./N 13/C editorial/J ``/` _OOV_/N 's/P _OOV_/N _OOV_/N ./. ''/'
INFO : Loss:    2/21
INFO : Cross-entropy: 11.800881385803223 nats (= perplexity 3567.9539295037325)
---
INFO : Gold:    It/P is/V the/D _OOV_/J guerrillas/N who/W are/V aligned/V with/I the/D drug/N traffickers/N ,/, not/R the/D left/J _OOV_/N ./.
INFO : Viterbi: It/P is/V the/D _OOV_/N guerrillas/N who/W are/V aligned/N with/I the/D drug/N traffickers/N ,/, not/R the/D left/V _OOV_/N ./.
INFO : Loss:    3/18
INFO : Cross-entropy: 11.129284858703613 nats (= perplexity 2240.002911809472)
---
INFO : Gold:    This/D information/N was/V _OOV_/V from/I your/P own/J news/N stories/N on/I the/D region/N ./.
INFO

Now let's throw in the unsupervised training data as well, and continue
training as before, in order to increase the regularized log-likelihood on
this larger, semi-supervised training set.  It's now the *incomplete-data*
log-likelihood.
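
If the term is unfamiliar: the incomplete-data likelihood of an untagged
sentence sums the complete-data probability p(tags, words) over every possible
tagging.  The sketch below checks that the forward algorithm computes the same
sum as brute-force enumeration.  The toy HMM and its probabilities are made up,
and it ignores details such as the BOS/EOS tags and log-space arithmetic that a
real implementation would need.

```python
from itertools import product

# Hypothetical toy HMM with two tags; all numbers are invented for illustration.
TAGS = ["N", "V"]
start = {"N": 0.7, "V": 0.3}                       # p(first tag)
trans = {"N": {"N": 0.4, "V": 0.6},                # p(next tag | tag)
         "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"fish": 0.6, "swim": 0.4},           # p(word | tag)
        "V": {"fish": 0.3, "swim": 0.7}}

def joint_prob(tags, words):
    """Complete-data probability p(tags, words)."""
    p = start[tags[0]] * emit[tags[0]][words[0]]
    for prev, cur, word in zip(tags, tags[1:], words[1:]):
        p *= trans[prev][cur] * emit[cur][word]
    return p

def forward_prob(words):
    """Incomplete-data probability p(words), summed over all taggings."""
    alpha = {t: start[t] * emit[t][words[0]] for t in TAGS}
    for word in words[1:]:
        alpha = {t: sum(alpha[s] * trans[s][t] for s in TAGS) * emit[t][word]
                 for t in TAGS}
    return sum(alpha.values())

words = ["fish", "swim", "fish"]
brute_force = sum(joint_prob(tags, words)
                  for tags in product(TAGS, repeat=len(words)))
print(forward_prob(words), brute_force)   # the two values agree
```

For a supervised sentence there is only one tagging to sum over, so the same
quantity reduces to `joint_prob` of the gold tags.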

This time, we'll use a different evaluation loss function: we'll stop when the
*tagging error rate* on a held-out dev set stops getting better.  Also, the
implementation of this loss function (`viterbi_error_rate`) includes a helpful
side effect: it logs the *cross-entropy* on the held-out dataset as well, just
for your information.

We hope that held-out tagging accuracy will go up for a little bit before it
goes down again (see Merialdo 1994). (Log-likelihood on training data will
continue to improve, and that improvement may generalize to held-out
cross-entropy.  But getting accuracy to increase is harder.)
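
The stopping rule can be pictured roughly as follows.  This is only a sketch
under assumptions: the function name, the relative-improvement test, and the
default `tolerance` are invented here, and the actual `train` method may decide
convergence differently.

```python
def train_until_no_improvement(step, loss, tolerance=0.001, max_epochs=50):
    """Call step() repeatedly; stop once loss() stops improving enough.

    Returns the losses observed before training and after each epoch.
    """
    history = [loss()]
    for _ in range(max_epochs):
        step()                               # e.g., one EM epoch
        history.append(loss())
        old, new = history[-2], history[-1]
        if new >= old * (1 - tolerance):     # not enough relative improvement
            break
    return history

# Hypothetical held-out error rates after successive epochs:
# they improve at first, then get worse.
fake_losses = iter([0.20, 0.15, 0.13, 0.14, 0.12])
history = train_until_no_improvement(step=lambda: None,
                                     loss=lambda: next(fake_losses))
print(history)   # [0.2, 0.15, 0.13, 0.14]
```

In the run logged below, accuracy on `endev` drops from 88.663% to 87.031%
after a single EM epoch, so a rule of this kind stops right away.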

In [13]:
hmm = HiddenMarkovModel.load("ensup_hmm.pkl")  # reset to supervised model (in case you're re-executing this bit)
loss_dev = lambda model: viterbi_error_rate(model, eval_corpus=endev, 
                                            known_vocab=known_vocab)
hmm.train(corpus=entrain, loss=loss_dev, λ=1.0,
          save_path="entrain_hmm.pkl")

INFO : Loaded model from ensup_hmm.pkl
100%|██████████| 996/996 [00:01<00:00, 800.76it/s]
INFO : Cross-entropy: 7.5993 nats (= perplexity 1996.791)
100%|██████████| 996/996 [00:01<00:00, 800.53it/s]
INFO : Tagging accuracy: all: 88.663%, known: 93.059%, seen: 44.108%, novel: 42.734%
100%|██████████| 8064/8064 [00:23<00:00, 347.73it/s]
100%|██████████| 996/996 [00:01<00:00, 858.24it/s]
INFO : Cross-entropy: 7.3485 nats (= perplexity 1553.842)
100%|██████████| 996/996 [00:01<00:00, 831.62it/s]
INFO : Tagging accuracy: all: 87.031%, known: 91.397%, seen: 45.791%, novel: 40.225%
INFO : Saved model to entrain_hmm.pkl


In [None]:
hmm = HiddenMarkovModel.load("ensup_hmm_unigram.pkl")  # reset to supervised model (in case you're re-executing this bit)
loss_dev = lambda model: viterbi_error_rate(model, eval_corpus=endev, 
                                            known_vocab=known_vocab)
hmm.train(corpus=entrain, loss=loss_dev, λ=1.0,
          save_path="entrain_hmm_unigram.pkl")

INFO : Loaded model from ensup_hmm_unigram.pkl
100%|██████████| 996/996 [00:01<00:00, 719.41it/s]
INFO : Cross-entropy: 8.1047 nats (= perplexity 3310.103)
100%|██████████| 996/996 [00:01<00:00, 816.37it/s]
INFO : Tagging accuracy: all: 91.210%, known: 94.602%, seen: 54.209%, novel: 56.803%
100%|██████████| 8064/8064 [00:23<00:00, 342.06it/s]
100%|██████████| 996/996 [00:01<00:00, 775.63it/s]
INFO : Cross-entropy: 7.8815 nats (= perplexity 2647.866)
100%|██████████| 996/996 [00:01<00:00, 652.25it/s]
INFO : Tagging accuracy: all: 90.697%, known: 94.039%, seen: 54.209%, novel: 56.803%
INFO : Saved model to entrain_hmm_unigram.pkl


You can also retry the above workflow where you start with a worse supervised
model (like Merialdo).  Does EM help more in that case?  It's easiest to rerun
exactly the code above, but first make the `ensup` file smaller by copying
`ensup-tiny` over it.  `ensup-tiny` is only 25 sentences (that happen to cover
all tags in `endev`).  Back up your old `ensup` and your old `*.pkl` models
before you do this.

Let's take a more detailed look at the first 10 sentences in the held-out
corpus, including their Viterbi taggings.
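
As a reminder of what Viterbi tagging computes, here is a small sketch that
finds the single most probable tagging and checks it against brute-force
search.  The toy HMM is made up, and the sketch skips the BOS/EOS handling and
log-space arithmetic of a real `viterbi_tagging` method.

```python
from itertools import product

# Hypothetical toy HMM; all probabilities are invented for illustration.
TAGS = ["N", "V"]
start = {"N": 0.7, "V": 0.3}
trans = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"fish": 0.6, "swim": 0.4}, "V": {"fish": 0.3, "swim": 0.7}}

def joint_prob(tags, words):
    """p(tags, words) under the toy HMM."""
    p = start[tags[0]] * emit[tags[0]][words[0]]
    for prev, cur, word in zip(tags, tags[1:], words[1:]):
        p *= trans[prev][cur] * emit[cur][word]
    return p

def viterbi(words):
    """Return the most probable tag sequence for words."""
    # best[t] = probability of the best tagging of the prefix that ends in tag t
    best = {t: start[t] * emit[t][words[0]] for t in TAGS}
    backpointers = []
    for word in words[1:]:
        pointers, new_best = {}, {}
        for t in TAGS:
            prev = max(TAGS, key=lambda s: best[s] * trans[s][t])
            pointers[t] = prev
            new_best[t] = best[prev] * trans[prev][t] * emit[t][word]
        backpointers.append(pointers)
        best = new_best
    # Start from the best final tag and follow the backpointers right to left.
    tag = max(TAGS, key=lambda t: best[t])
    tags = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        tags.append(tag)
    return list(reversed(tags))

words = ["fish", "swim", "fish"]
print(viterbi(words))   # ['N', 'V', 'N']
```

Brute force would score all 2^3 taggings here; Viterbi gets the same answer
while only keeping the best path into each tag at each position.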

In [11]:
def look_at_your_data(model, dev, N):
    """Log gold vs. Viterbi taggings for the first N sentences of corpus `dev`."""
    for m, sentence in enumerate(dev):
        if m >= N: break
        # Use the `dev` argument rather than the global `endev`.
        viterbi = model.viterbi_tagging(sentence.desupervise(), dev)
        counts = eval_tagging(predicted=viterbi, gold=sentence, 
                              known_vocab=known_vocab)
        num = counts['NUM', 'ALL']      # correctly tagged tokens
        denom = counts['DENOM', 'ALL']  # all evaluated tokens
        
        log.info(f"Gold:    {sentence}")
        log.info(f"Viterbi: {viterbi}")
        log.info(f"Loss:    {denom - num}/{denom}")
        xent = -model.logprob(sentence, dev) / len(sentence)  # measured in nats
        # Dividing by log(2) converts nats to bits, so the printed value is in bits.
        log.info(f"Cross-entropy: {xent/math.log(2)} bits (= perplexity {math.exp(xent)})\n---")

In [None]:
look_at_your_data(hmm, endev, 10)

INFO : Gold:    ``/` We/P 're/V strongly/R _OOV_/V that/I anyone/N who/W has/V eaten/V in/I the/D cafeteria/N this/D month/N have/V the/D shot/N ,/, ''/' Mr./N Mattausch/N added/V ,/, ``/` and/C that/D means/V virtually/R everyone/N who/W works/V here/R ./.
INFO : Viterbi: ``/` We/P 're/V strongly/R _OOV_/V that/I anyone/N who/W has/V eaten/V in/I the/D cafeteria/N this/D month/N have/V the/D shot/N ,/, ''/' Mr./N Mattausch/T added/V ,/, ``/` and/C that/I means/V virtually/R everyone/, who/W works/V here/R ./.
INFO : Loss:    3/34
INFO : Cross-entropy: 10.617786407470703 nats (= perplexity 1571.34741886324)
---
INFO : Gold:    I/P was/V _OOV_/V to/T read/V the/D _OOV_/N of/I facts/N in/I your/P Oct./N 13/C editorial/N ``/` _OOV_/N 's/P _OOV_/N _OOV_/N ./. ''/'
INFO : Viterbi: I/P was/V _OOV_/V to/T read/V the/D _OOV_/N of/I facts/N in/I your/P Oct./N 13/C editorial/, ``/` _OOV_/P 's/V _OOV_/D _OOV_/N ./. ''/'
INFO : Loss:    4/21
INFO : Cross-entropy: 10.876279830932617 nats (= perplex

Now let's try supervised training of a CRF (this doesn't use the unsupervised
part of the data, so it is comparable to the supervised pre-training we did
for the HMM).  We will use SGD to approximately maximize the regularized
log-likelihood. 

As with the semi-supervised HMM training, we'll periodically evaluate the
tagging accuracy (and also print the cross-entropy) on a held-out dev set.
We use the default `eval_interval` and `tolerance`.  If you want to stop
sooner, then you could increase the `tolerance` so the training method decides
sooner that it has converged.

We arbitrarily choose reg = 1.0 for L2 regularization, learning rate = 0.05,
and a minibatch size of 10, but it would be better to search for the best
values of these hyperparameters.

Note that the logger reports the CRF's *conditional* cross-entropy, -log p(tags
| words) / n.  This is much lower than the HMM's *joint* cross-entropy -log
p(tags, words) / n, but that doesn't mean the CRF is better at tagging.  The
CRF is just predicting less information.
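
The gap between the two numbers follows from the chain rule:
-log p(tags, words) = -log p(tags | words) - log p(words), so the joint
cross-entropy also pays for predicting the words.  A tiny numeric sketch, with
a made-up distribution over two taggings of one sentence:

```python
import math

# Hypothetical joint probabilities p(tags, words) for one two-word sentence.
p_joint = {("N V", "fish swim"): 0.006,
           ("V N", "fish swim"): 0.002}

words = "fish swim"
gold = "N V"
p_words = sum(p for (_, w), p in p_joint.items() if w == words)  # marginal p(words)
p_cond = p_joint[gold, words] / p_words                           # p(tags | words)

n = 2  # tokens in the sentence
joint_xent = -math.log(p_joint[gold, words]) / n   # what the HMM logger reports
cond_xent = -math.log(p_cond) / n                  # what the CRF logger reports
words_xent = -math.log(p_words) / n                # cost of predicting the words
print(joint_xent, cond_xent, words_xent)
```

Here `joint_xent` equals `cond_xent + words_xent`, and almost all of it comes
from `words_xent`, which a conditional model never has to pay.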

In [None]:
log.info("*** Conditional Random Field (CRF)\n")
crf = ConditionalRandomField(entrain.tagset, entrain.vocab)  # randomly initialized parameters  
crf.train(corpus=ensup, loss=loss_dev, reg=1.0, lr=0.05, minibatch_size=10,
          save_path="ensup_crf.pkl")

INFO : *** Conditional Random Field (CRF)

100%|██████████| 996/996 [00:00<00:00, 1342.20it/s]
INFO : Cross-entropy: 2.7777 nats (= perplexity 16.082)
100%|██████████| 996/996 [00:00<00:00, 998.49it/s] 
INFO : Tagging accuracy: all: 3.265%, known: 3.045%, seen: 4.377%, novel: 6.011%
100%|██████████| 500/500 [00:02<00:00, 196.00it/s]
100%|██████████| 996/996 [00:00<00:00, 1366.93it/s]
INFO : Cross-entropy: 1.8501 nats (= perplexity 6.361)
100%|██████████| 996/996 [00:00<00:00, 1008.48it/s]
INFO : Tagging accuracy: all: 69.811%, known: 70.652%, seen: 55.387%, novel: 63.342%
100%|██████████| 500/500 [00:03<00:00, 157.25it/s]
100%|██████████| 996/996 [00:00<00:00, 1354.62it/s]
INFO : Cross-entropy: 2.1312 nats (= perplexity 8.425)
100%|██████████| 996/996 [00:01<00:00, 724.70it/s]
INFO : Tagging accuracy: all: 74.262%, known: 75.729%, seen: 54.209%, novel: 60.964%
100%|██████████| 500/500 [00:02<00:00, 181.38it/s]
100%|██████████| 996/996 [00:00<00:00, 1352.75it/s]
INFO : Cross-entropy: 2.

Let's examine how the CRF does on individual sentences. 
(Do you see any error patterns here that would inspire additional CRF features?)

In [None]:
look_at_your_data(crf, endev, 10)

INFO : Gold:    ``/` We/P 're/V strongly/R _OOV_/V that/I anyone/N who/W has/V eaten/V in/I the/D cafeteria/N this/D month/N have/V the/D shot/N ,/, ''/' Mr./N Mattausch/N added/V ,/, ``/` and/C that/D means/V virtually/R everyone/N who/W works/V here/R ./.
INFO : Viterbi: ``/` We/P 're/V strongly/N _OOV_/N that/I anyone/N who/W has/V eaten/N in/I the/D cafeteria/V this/D month/N have/V the/D shot/N ,/, ''/' Mr./N Mattausch/N added/N ,/, ``/` and/C that/I means/N virtually/N everyone/N who/W works/V here/R ./.
INFO : Loss:    8/34
INFO : Cross-entropy: 4.870564937591553 nats (= perplexity 29.2540645842741)
---
INFO : Gold:    I/P was/V _OOV_/V to/T read/V the/D _OOV_/N of/I facts/N in/I your/P Oct./N 13/C editorial/N ``/` _OOV_/N 's/P _OOV_/N _OOV_/N ./. ''/'
INFO : Viterbi: I/P was/V _OOV_/N to/T read/V the/D _OOV_/N of/I facts/N in/I your/N Oct./N 13/N editorial/V ``/` _OOV_/N 's/P _OOV_/N _OOV_/N ./. ''/'
INFO : Loss:    4/21
INFO : Cross-entropy: 4.379086494445801 nats (= perplexit