This notebook aims to confirm that the HMM's internal emission probabilities favour "also" over " ".

We will go through each hidden state in the HMM and get the "also" token emission probability, as well as the " " token emission probability. Then we will plot a histogram of the probabilities.

In [29]:
import matplotlib.pyplot as plt
import pickle
import numpy as np
import tokenizer_lib
from hmmlearn import hmm
import random
from tqdm import tqdm

In [None]:
# load the hmm
model_name = "600-word-hmm-300.pkl"
fname = f"/n/holyscratch01/sham_lab/summer_2024/models/{model_name}"

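with open(fname, "rb") as f:
    # use a name other than `hmm` so the hmmlearn module imported above is not shadowed
    trained_hmm = pickle.load(f)

In [9]:
# verify that the emission matrix has shape (300 hidden states, 600 emissions)
trained_hmm.emissionprob_.shape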

(300, 583)

This is a problem. The shape is (300, 583) rather than (300, 600), so the token ids may have been remapped, in which case the token id for " " would no longer index the column it should.

I will now run an experiment to see whether, given only the tokens (0, 1, 100, 101), a 2-hidden-state HMM collapses the emission dimensions from (2, 102) to (2, 4).

In [33]:
# consider a simple model with two hidden states
# hidden state 1 is a fair coin emitting 0 and 1 with equal chance
# hidden state 0 emits 100 with probability 0.9 and 101 with probability 0.1
# both states are highly stable: each switches to the other with probability 0.1


def generate_sample(length: int):
    """
    Returns hidden_states and emissions given the model specs above
    """
    # hidden_states are 0 or 1
    # initial hidden state can be either
    hidden_states = [int(random.random() < 0.5)]
    while len(hidden_states) < length:
        switch_flag = int(random.random() < 0.1)
        if switch_flag:
            next_hidden_state = 1 - hidden_states[-1]
        else:
            next_hidden_state = hidden_states[-1]
        hidden_states.append(next_hidden_state)
    emissions = []
    for hidden_state in hidden_states:
        if hidden_state:
            # 0 or 1 equal chance
            emission = int(random.random() < 0.5)
        else:
            # 101 with 10% and 100 with 90%
            emission = 101 if random.random() < 0.1 else 100
        # wrap it with [] since the package supports multi-dimensional emissions
        emissions.append([emission])
    return hidden_states, emissions

In [46]:
# as required by the package, the data will have all sequences concatenated together
sample_length = 500
num_samples = 10000
samples = []
for _ in tqdm(range(num_samples)):
    samples += generate_sample(sample_length)[1]

100%|██████████| 10000/10000 [00:03<00:00, 3251.90it/s]


In [47]:
# the HMM will tell the sequences apart by using the lengths array
lengths = [sample_length] * num_samples
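
To make the data layout concrete, here is a minimal sketch of how a concatenated observation list and a `lengths` list map back to individual sequences. The helper `split_by_lengths` is hypothetical and not part of hmmlearn; it only illustrates the convention, using toy data rather than the samples generated above.

```python
def split_by_lengths(observations, lengths):
    """Split a concatenated list of observations back into sequences.

    Hypothetical helper for illustration; not an hmmlearn function.
    """
    if sum(lengths) != len(observations):
        raise ValueError("lengths must sum to the number of observations")
    sequences, start = [], 0
    for n in lengths:
        sequences.append(observations[start:start + n])
        start += n
    return sequences


# three toy sequences of lengths 2, 3 and 1, concatenated into one list
toy_observations = [[0], [1], [100], [100], [101], [0]]
toy_lengths = [2, 3, 1]
toy_sequences = split_by_lengths(toy_observations, toy_lengths)
```

In this notebook every sequence has the same length, so `lengths` is simply `sample_length` repeated `num_samples` times.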

In [48]:
# fit the HMM
n_components = 2
model = hmm.CategoricalHMM(n_components=n_components).fit(samples, lengths)

In [None]:
# look at the emission shape
model.emissionprob_.shape

(2, 102)

In [53]:
model.emissionprob_[0]

array([0.00117238, 0.00136042, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     

In [54]:
model.emissionprob_[1]

array([4.97983409e-01, 4.98511841e-01, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
      

In [55]:
model.n_features

102

In [56]:
model.transmat_

array([[0.9043841 , 0.0956159 ],
       [0.09544971, 0.90455029]])

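The fit succeeds: the learned transition matrix is close to the true switching probability of 0.1, and hidden state 1 emits 0 and 1 with a probability of about 0.5 each.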
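
One way to make the quality of the fit concrete is to compare the fitted transition matrix against the generating one. The sketch below hard-codes the values printed by `model.transmat_` above and the true switching probability of 0.1; the tolerance of 0.01 is an arbitrary choice for illustration.

```python
# fitted transition matrix, copied from the output of model.transmat_ above
fitted_transmat = [
    [0.9043841, 0.0956159],
    [0.09544971, 0.90455029],
]
# generating process: stay with probability 0.9, switch with probability 0.1
true_transmat = [
    [0.9, 0.1],
    [0.1, 0.9],
]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(
        abs(x - y)
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )


# the true matrix is unchanged by swapping the two states, so no
# relabelling of hidden states is needed for this particular comparison
transmat_error = max_abs_diff(fitted_transmat, true_transmat)
print(f"max abs error: {transmat_error:.4f}")
assert transmat_error < 0.01  # tolerance is an assumption, not a standard threshold
```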

The question is also answered: the emission dimension is not collapsed. The model keeps 102 features even though only four distinct tokens appear in the data.

This suggests that the HMM shape is (300, 583) simply because the largest token id seen in training is 582.
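
The toy experiment is consistent with the number of emission columns being one more than the largest token id in the training data. The sketch below mimics that rule in plain Python. `infer_n_features` and `unseen_token_ids` are hypothetical helpers, not hmmlearn functions, and the toy observations stand in for the real training data.

```python
def infer_n_features(observations):
    """Number of emission columns implied by the data: largest token id + 1."""
    return max(obs[0] for obs in observations) + 1


def unseen_token_ids(observations):
    """Token ids below the maximum that never occur in the data."""
    seen = {obs[0] for obs in observations}
    return [i for i in range(infer_n_features(observations)) if i not in seen]


# same four tokens as the toy experiment, in the [[token], ...] layout
toy_observations = [[0], [1], [100], [101], [100]]
n_features = infer_n_features(toy_observations)
gaps = unseen_token_ids(toy_observations)
print(n_features)  # 102
print(len(gaps))   # 98
```

Applied to the real training data, the same check would show whether 582 is indeed the largest token id. If it is, each token id still indexes its own column in the emission matrix.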

Next steps:
- Plot the histogram of emission probabilities for the " " token and the "also" token
- Inspect the hidden states that have high emission probabilities for the space token
    - e.g. look at their transition probabilities
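
A minimal sketch of the first next step, assuming the loaded model exposes `emissionprob_` as in the cells above. The token ids `SPACE_ID` and `ALSO_ID` are placeholders, since the real ids come from the tokenizer and are not shown in this notebook, and the toy matrix stands in for the real (300, 583) one.

```python
def token_column(emissionprob, token_id):
    """Emission probability of one token in every hidden state."""
    return [float(row[token_id]) for row in emissionprob]


def plot_token_histograms(emissionprob, token_ids, labels, bins=30):
    """Overlay one histogram per token of its per-state emission probability."""
    # imported lazily so token_column has no plotting dependency
    import matplotlib.pyplot as plt

    for token_id, label in zip(token_ids, labels):
        plt.hist(token_column(emissionprob, token_id), bins=bins, alpha=0.5, label=label)
    plt.xlabel("emission probability")
    plt.ylabel("number of hidden states")
    plt.legend()
    plt.show()


# placeholder ids; look the real ones up in the tokenizer
SPACE_ID, ALSO_ID = 0, 1
# toy 3-state, 2-token emission matrix standing in for the model's emissionprob_
toy_emissionprob = [[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]]
space_probs = token_column(toy_emissionprob, SPACE_ID)
also_probs = token_column(toy_emissionprob, ALSO_ID)
# number of hidden states that favour "also" over " "
n_favouring_also = sum(a > s for a, s in zip(also_probs, space_probs))
```

With the real model, passing its `emissionprob_` array and the two real token ids to `plot_token_histograms` would produce the planned histogram, and the count of states favouring "also" gives a single number to go with it.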