In this notebook, we will generate poems in the style of Robert Frost using a second-order Markov model. This means that the next token depends on the previous *two* tokens.

One important difference from 'markov_text_classifier.ipynb' is that here we will use dictionaries, rather than lists, to store the transition probabilities A_ijk. The reason is efficiency, mostly in memory. As we increase the order of the Markov model to $N$, the size of the transition matrix grows as $V^{N+1}$, where $V$ is the vocabulary size. The amount of training data does not grow with it, so an ever smaller fraction of the entries is non-zero (the curse of dimensionality); that is, the matrix becomes increasingly sparse. Storing only the non-zero values in a dictionary is more efficient. We will need a nested dict. In this case it is nested twice, giving three levels of keys, e.g.

A['my']['cute']['dog'] = 0.3
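
As a minimal sketch of how such a nested dict could be filled, the block below counts token triples in a small made-up corpus and then normalises the innermost dicts into probabilities. The corpus and the name `counts` are assumptions for illustration only; they are not taken from the Frost data.

```python
from collections import defaultdict

# Toy corpus (made up for illustration, not from the Frost data).
corpus = [
    ["my", "cute", "dog"],
    ["my", "cute", "cat"],
    ["my", "cute", "dog"],
    ["my", "own", "dog"],
]

# counts[t0][t1][t2] = number of times t2 followed the pair (t0, t1)
counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
for line in corpus:
    for t0, t1, t2 in zip(line, line[1:], line[2:]):
        counts[t0][t1][t2] += 1

# Normalise each innermost dict so that its values sum to 1.
A = {}
for t0, inner in counts.items():
    A[t0] = {}
    for t1, nxt in inner.items():
        total = sum(nxt.values())
        A[t0][t1] = {t2: c / total for t2, c in nxt.items()}

print(A["my"]["cute"])
```

Only the pairs and triples that actually occur get an entry, so the memory used scales with the amount of data rather than with the vocabulary size cubed.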

----------
Another thing that is different here is that we will not be doing classification, so we won't be calculating probabilities of provided sequences. Rather, we will be sampling random tokens according to their probability distributions encoded in the nested dict. For example, if:

A['my'] = {'cute': 0.2, 'own': 0.5, 'son': 0.3}

then we draw a random float $x$ uniformly from $[0.0, 1.0]$: if $x < 0.2$ we sample 'cute', if $0.2 \le x < 0.7$ we sample 'own', and otherwise ($x \ge 0.7$) we sample 'son'.
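
The rule above amounts to walking through the cumulative probabilities until they exceed $x$. Here is a minimal sketch of that idea; the helper name `sample_token` and its `rng` parameter are hypothetical and chosen only for this example.

```python
import random

def sample_token(dist, rng=random):
    """Draw one key from a {token: probability} dict."""
    x = rng.random()  # uniform float in [0.0, 1.0)
    cumulative = 0.0
    for token, prob in dist.items():
        cumulative += prob
        if x < cumulative:
            return token
    # Fallback for floating-point round-off: return the last token.
    return token

dist = {"cute": 0.2, "own": 0.5, "son": 0.3}
rng = random.Random(0)  # seeded generator so the run is repeatable
draws = [sample_token(dist, rng) for _ in range(10_000)]

# Empirical frequencies should be close to the probabilities in dist.
print({tok: draws.count(tok) / len(draws) for tok in dist})
```

Because the innermost level of the nested dict is already a `{token: probability}` mapping, the same helper could be applied to it directly.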

----------
A final note: we still need the first-order transition matrix A_ij, but it should not be trained on all the transitions. It needs to be trained only on the transitions from the first token to the second one, because that is the only step at which a single previous token is available. From the third token onward, the second-order transition matrix is used.
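
To make the split concrete, the sketch below routes each position of a line to a different table: position 0 to the initial counts, position 1 to the first-order counts, and every later position to the second-order counts. The toy lines and the names `initial`, `first_order` and `second_order` are assumptions for illustration, and the counts are left unnormalised to keep the example short.

```python
from collections import defaultdict

# Toy tokenised lines (made up for illustration).
lines = [
    ["my", "cute", "dog", "runs"],
    ["my", "own", "dog"],
    ["the", "cute", "dog"],
]

# Counts for the first token of each line.
initial = defaultdict(int)
# Counts for first token -> second token only.
first_order = defaultdict(lambda: defaultdict(int))
# Counts for (previous two tokens) -> next token.
second_order = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

for tokens in lines:
    for i, tok in enumerate(tokens):
        if i == 0:
            initial[tok] += 1
        elif i == 1:
            first_order[tokens[0]][tok] += 1
        else:
            second_order[tokens[i - 2]][tokens[i - 1]][tok] += 1

print(dict(initial))
print({k: dict(v) for k, v in first_order.items()})
```

Note that 'cute' never appears as a key of `first_order` here, even though it is followed by 'dog' in the data: that transition happens later in the line, so it is recorded only in `second_order`.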

In [2]:
import numpy as np
import pandas as pd

import string
# from sklearn.model_selection import train_test_split
from nltk import word_tokenize

We will generate poems in the style of Robert Frost by training a second-order Markov model on his poems and then sampling tokens by drawing random numbers from the uniform distribution on $[0, 1]$.

In [3]:
!wget -nc https://raw.githubusercontent.com/lazyprogrammer/machine_learning_examples/master/hmm_class/robert_frost.txt

Nothing to do - goodbye

In [4]:
input_files = [
  'edgar_allan_poe.txt',
  'robert_frost.txt',
]

In [5]:
# collect the non-empty lines, lower-cased and without punctuation
input_text = []

with open('robert_frost.txt') as f:
  for line in f:
    line = line.rstrip().lower()
    if line:
      # remove punctuation
      line = line.translate(str.maketrans('', '', string.punctuation))
      input_text.append(line)

In [8]:
vocab = set()
X_train = []

for line in input_text:
    # word_tokenize relies on NLTK tokenizer data (e.g. 'punkt') being installed
    tokenised_line = word_tokenize(line)
    X_train.append(tokenised_line)
    # a set gives constant-time membership checks, unlike a list
    vocab.update(tokenised_line)

V = len(vocab)
print(f'Vocab length: {V}')

Vocab length: 2197


We don't convert tokens to indices because we will store our data in nested dictionaries instead.
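
For a rough sense of scale, the sketch below estimates what a dense second-order matrix would cost for the vocabulary size printed above. The 8 bytes per entry is an assumption (64-bit floats); the point is only the order of magnitude.

```python
V = 2197                  # vocabulary size printed above
dense_entries = V ** 3    # one entry per (i, j, k) triple
bytes_per_float = 8       # assumption: 64-bit floats
dense_gb = dense_entries * bytes_per_float / 1e9

print(f"Dense A_ijk entries: {dense_entries:,}")
print(f"Approximate size: {dense_gb:.1f} GB")
```

This comes to about $1.06 \times 10^{10}$ entries, or roughly 85 GB. The number of non-zero entries, by contrast, cannot exceed the number of tokens in the corpus, which is why the nested dictionaries stay small.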

In [None]:
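pi_i = {}   # initial distribution over the first token
A_ij = {}   # first-order transitions, used only for first token -> second token
A_ijk = {}  # second-order transitions: A_ijk[i][j][k] = p(k | i, j)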