# **Markov Model for Text Generation**

What should the input data look like? Markov models and RNNs can be trained with different datasets similar to any neural network model. You could train a character-based, n-gram based, word-based, or document-based solution.

In [1]:
import nltk
nltk.download('gutenberg')
import re
from nltk.corpus import gutenberg

# Load text from NLTK Gutenberg corpus
text = gutenberg.raw("carroll-alice.txt")

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


In [2]:
text = text.replace('\t', ' ').replace('\n', ' ')
text



Note that when we are generating text, we are going to do minimal preprocessing. This is because we want the text generated to make as much sense as possible without post processing. So things like stopwords removal and stemming or lemmatization can be disregarded. You may want to do further preprocessing for other symbols or other patterns you don't want to show up in a sequence of text.

In [3]:
from collections import defaultdict

def markov_chain(text):
  #tokenize the text by word, though including punctuation
  words = text.split(' ')

  #initialize a default dictionary to hold all of the words and next words
  m_dict = defaultdict(list)

  #create a zipped list of all of the word pairs and put them in word: list of next words format
  for current_word, next_word in zip(words[0:-1], words[1:]):
      m_dict[current_word].append(next_word)

  #convert the default dict back into a dictionary
  m_dict = dict(m_dict)
  return m_dict

The resulting dictionary represents a Markov chain model, where each word in the input text is a state and the possible next words are the transitions from one state to another.

In [4]:
alice_dict = markov_chain(text)
alice_dict

{"[Alice's": ['Adventures'],
 'Adventures': ['in', 'of'],
 'in': ['Wonderland',
  'it,',
  'her',
  'that;',
  'time',
  'the',
  'her',
  'the',
  'the',
  'a',
  'hand',
  'a',
  'sight,',
  'time',
  'a',
  'the',
  'waiting',
  'large',
  'a',
  'fact,',
  'crying',
  'a',
  'it',
  'currants.',
  'the',
  'fact',
  'this',
  'the',
  'one',
  'the',
  'a',
  'a',
  'the',
  'the',
  'such',
  'ringlets',
  '',
  'that',
  'time',
  'existence;',
  'another',
  'salt',
  'that',
  'her',
  'the',
  'the',
  'the',
  'my',
  'the',
  'like',
  "trying.'",
  'her',
  'her',
  'a',
  'a',
  'the',
  'a',
  'a',
  'the',
  'a',
  'a',
  'the',
  'a',
  'an',
  'a',
  'which',
  'the',
  'silence.',
  'a',
  'despair',
  'her',
  'your',
  'a',
  'a',
  'the',
  'reply.',
  'chorus,',
  'particular.',
  'a',
  "bed!'",
  'a',
  'the',
  'the',
  'a',
  'a',
  'the',
  'an',
  'the',
  'without',
  'great',
  'a',
  'the',
  'the',
  'another',
  'the',
  'a',
  'here?',
  'at',
  'the',

In [5]:
import random

def generate_sentence(chain, count=20):
  '''Input a dictionary in the format of key = current word, value = list of next words
      along with the number of words you would like to see in your generated sentence.'''

  #capitalize the first word
  word1 = random.choice(list(chain.keys()))
  sentence = word1.capitalize()

  #generate the second word from the value list. Set the new word as the first word. Repeat.
  for i in range(count-1):
      word2 = random.choice(chain[word1])
      word1 = word2
      sentence += ' ' + word2

  return(sentence)

In [7]:
generate_sentence(alice_dict)

"Ask: perhaps they were', said to begin with; and a fan and she swallowed one of the pool was very"

There are also libraries that can be used like markovify in order to make use of optimized text generation using these models.

In [8]:
!pip install markovify

Collecting markovify
  Downloading markovify-0.9.4.tar.gz (27 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting unidecode (from markovify)
  Downloading Unidecode-1.3.8-py3-none-any.whl (235 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m235.5/235.5 kB[0m [31m6.2 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: markovify
  Building wheel for markovify (setup.py) ... [?25l[?25hdone
  Created wheel for markovify: filename=markovify-0.9.4-py3-none-any.whl size=18606 sha256=f48b6a735047dc48de15379579638d6262e6bd0b9a4e3eaa877e57f6ac50d475
  Stored in directory: /root/.cache/pip/wheels/ca/8c/c5/41413e24c484f883a100c63ca7b3b0362b7c6f6eb6d7c9cc7f
Successfully built markovify
Installing collected packages: unidecode, markovify
Successfully installed markovify-0.9.4 unidecode-1.3.8


The state_size specifies how many previous words are considered when predicting the next word. For example, if state_size=2, the next word in the sequence is predicted based on the two previous words. If state_size=3, the next word is predicted based on the three previous words, and so on.

The state_size parameter can affect the quality and coherence of the generated text. A smaller state size may result in text that is less coherent and more random, while a larger state size may result in text that is more coherent but less varied. The optimal state size depends on the specific text corpus and the desired style of the generated text.

In [9]:
import markovify

#Markov model
model = markovify.Text(text, state_size=2)

#generate new text
generated_text = model.make_sentence()
print(generated_text)


However, it was neither more nor less than no time to go, for the pool as it was talking in a languid, sleepy voice.


Slightly nonsensical, though it is sometimes hard to tell with Alice in Wonderland, but it is doing an okay job generating words that likely occur one after another in a sentence. Let's compare it to something simple we build with a RNN and LSTM.

# **Recurrent Neural Networks**

*Please note that RNNs, LSTMs, etc. can be used for classification tasks as well. They are not specific to NLP either and can be used with a wide array of data modalities. Also, we would also likely increase the number of epochs but for demonstration purposes, the number of epochs has been limited.

The model takes three arguments in the constructor:

- input_size: the number of features in the input vectors
- hidden_size: the number of features in the hidden state of the RNN
- output_size: the number of features in the output vectors

The model consists of an RNN layer followed by a linear layer and a log-softmax activation function. The RNN layer processes sequential input data, updating its hidden state at each time step. The linear layer fc takes the final hidden state of the RNN as input and produces the output vector. The output vector is then passed through the log-softmax activation function, generating a probability distribution over the potential outputs.

The forward method accepts two arguments:

- input: a tensor representing the input vector.
- hidden: a tensor representing the hidden state of the RNN.

Within the method, the input and the hidden state tensors are concatenated. The concatenated tensor is then passed through the RNN layer, producing the output vector. This output vector is further processed by the linear layer, and the log-softmax activation function is applied. Subsequently, the method returns the resulting output vector and the updated hidden state.

In [18]:
import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=2)

    def forward(self, input, hidden):
        batch_size = input.size(0)
        output, hidden = self.rnn(input.view(batch_size, 1, -1), hidden)
        output = self.fc(output)
        output = self.softmax(output)
        return output.view(batch_size, -1), hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size)

Here, we prepare the corpus. We are only going to take a subset of sentences for demonstration purposes.

In [10]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

#our corpus
corpus = sent_tokenize(text)[0:1000]
corpus

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


["[Alice's Adventures in Wonderland by Lewis Carroll 1865]  CHAPTER I.",
 "Down the Rabbit-Hole  Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'",
 'So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.',
 "There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!",
 'Oh dear!',
 "I shall be late!'",
 '(when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all se

In [11]:
#tokenize the corpus
tokenizer = lambda x: x.split()
tokenized_corpus = [tokenizer(doc) for doc in corpus]

#make vocabulary and dictionary of indices
vocab = list(set([word for doc in tokenized_corpus for word in doc]))
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}
vocab_size = len(vocab)

#corpus to indices
corpus_idx = [[word_to_idx[word] for word in doc] for doc in tokenized_corpus]

In [19]:
hidden_size = 32
batch_size = 32
model = RNN(vocab_size, hidden_size, vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Train the RNN model
for epoch in range(20):
    for batch_start in range(0, len(corpus_idx), batch_size):
        model.zero_grad()
        batch = corpus_idx[batch_start:batch_start + batch_size]
        hidden = model.initHidden()
        loss = 0

        for doc in batch:
            for i in range(len(doc)-1):
                input = torch.zeros(1, vocab_size)
                input[0, doc[i]] = 1
                target = torch.tensor([doc[i+1]], dtype=torch.long)
                output, hidden = model(input, hidden)
                loss += nn.functional.nll_loss(output, target)

        loss /= len(batch)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)
        optimizer.step()

    print('Epoch {}:, Loss: {:.2f}'.format(epoch, loss.item()))

Epoch 0:, Loss: 43.39
Epoch 1:, Loss: 39.72
Epoch 2:, Loss: 37.73
Epoch 3:, Loss: 35.64
Epoch 4:, Loss: 32.88
Epoch 5:, Loss: 30.43
Epoch 6:, Loss: 28.15
Epoch 7:, Loss: 25.17
Epoch 8:, Loss: 22.51
Epoch 9:, Loss: 19.91
Epoch 10:, Loss: 17.64
Epoch 11:, Loss: 15.66
Epoch 12:, Loss: 14.95
Epoch 13:, Loss: 13.45
Epoch 14:, Loss: 11.21
Epoch 15:, Loss: 9.21
Epoch 16:, Loss: 7.79
Epoch 17:, Loss: 6.67
Epoch 18:, Loss: 5.93
Epoch 19:, Loss: 5.83


In [22]:
def generate_text(model, start_word, length):
  """
  This function generates text of given length using a trained PyTorch model and a starting word.

    Args:
    model: A trained PyTorch model
    start_word (str): The word from which the text generation starts
    length (int): The number of words to generate after the start_word

    Returns:
    output_text (str): The generated text

  """
  #we need to disable the gradient calculation since we are not training the model
  with torch.no_grad():
      #input with all zeros except for the start_word
      input = torch.zeros(1, vocab_size)
      input[0, word_to_idx[start_word]] = 1
      #initialize hidden state of the model
      hidden = model.initHidden()
      output_text = start_word
      #loop to generate text of a particular length
      for i in range(length):
          #the input and hidden state are passed through the model
          #to get the output and new hidden state
          output, hidden = model(input, hidden)
          #convert the output to a probability distribution
          output_dist = torch.exp(output)
          #select the word with highest probability as the next word
          top_prob, top_idx = output_dist.topk(1)
          next_word = idx_to_word[top_idx.item()]
          #append the next_word to the output text
          output_text += ' ' + next_word
          #input for the next iteration
          #2D PyTorch tensor is initialized with zeros with a shape of (1, vocab_size)
          input = torch.zeros(1, vocab_size)
          #update the tensor input at the specified position
          input[0, word_to_idx[next_word]] = 1
  return(output_text)


In [23]:
#Generate text using the trained RNN model
generated_text = generate_text(model, start_word='The', length=20)
print('Generated text:', generated_text)

Generated text: The Dormouse shook his head mournfully. fellow? into the March Hare went on the Hatter with the other side of the


# **Long Short-Term Memory**

Below is a basic LSTM model that can be used for text generation.

An LSTM layer is created using nn.LSTM, which takes the input_size and hidden_size as its arguments. A fully connected linear layer is created using nn.Linear, which takes the hidden_size and output_size as its arguments. Finally, a softmax activation function is created using nn.LogSoftmax, which applies the log of the softmax function along the specified dimension.

Input is the input vector, hidden is the previous hidden state, and cell is the previous cell state. The input vector is first reshaped into a 3D tensor using view and then passed through the LSTM layer along with the previous hidden and cell states. The output of the LSTM layer is then passed through the linear layer and the softmax activation function. The output, hidden state, and cell state are returned.

The initHidden method initializes the hidden and cell states with zeros. It returns two tensors of size (1, 1, self.hidden_size).

In [12]:
import torch
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(LSTM, self).__init__()
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(input_size, hidden_size)
        self.fc = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden, cell):
        output, (hidden, cell) = self.lstm(input.view(1, 1, -1), (hidden, cell))
        output = self.fc(output.view(1, -1))
        output = self.softmax(output)
        return output, hidden, cell

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size), torch.zeros(1, 1, self.hidden_size)


We need to adapt the training loop slightly to take into account we have more parameters to keep track of (cell state). However, other than that the training is very similar.

In [13]:
# Initialize the LSTM model and optimizer
hidden_size = 32
model = LSTM(vocab_size, hidden_size, vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
batch_size = 32

# Train the LSTM model
for epoch in range(20):
  for batch_start in range(0, len(corpus_idx), batch_size):
      batch = corpus_idx[batch_start:batch_start + batch_size]
      hidden, cell = model.initHidden()
      loss = 0
      for doc in batch:
          for i in range(len(doc)-1):
              input = torch.zeros(1, vocab_size)
              input[0, doc[i]] = 1
              target = torch.tensor([doc[i+1]], dtype=torch.long)
              output, hidden, cell = model(input, hidden, cell)
              loss += nn.functional.nll_loss(output, target)
      loss /= batch_size
      loss.backward()
      torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)
      optimizer.step()
  print('Epoch {}:, Loss: {:.2f}'.format(epoch, loss.item()))

Epoch 0:, Loss: 10.66
Epoch 1:, Loss: 9.91
Epoch 2:, Loss: 9.56
Epoch 3:, Loss: 9.24
Epoch 4:, Loss: 8.83
Epoch 5:, Loss: 8.44
Epoch 6:, Loss: 7.69
Epoch 7:, Loss: 7.13
Epoch 8:, Loss: 6.52
Epoch 9:, Loss: 6.06
Epoch 10:, Loss: 5.59
Epoch 11:, Loss: 5.04
Epoch 12:, Loss: 4.50
Epoch 13:, Loss: 3.99
Epoch 14:, Loss: 3.55
Epoch 15:, Loss: 3.10
Epoch 16:, Loss: 2.84
Epoch 17:, Loss: 2.48
Epoch 18:, Loss: 2.20
Epoch 19:, Loss: 2.10


Likewise, we need to update the text generation function.

In [16]:
def generate_text(model, start_word, length):
    with torch.no_grad():
        #initialize hidden and cell state
        hidden, cell = model.initHidden()
        #convert the start word to a tensor
        start_tensor = torch.zeros(1, vocab_size)
        start_tensor[0, word_to_idx[start_word]] = 1
        #the initial hidden and cell state using the start word
        output, hidden, cell = model(start_tensor, hidden, cell)

        #next word is sampled based on the output probabilities
        _, predicted = output.topk(1)
        word_idx = predicted.item()

        #generate the rest of the text
        output_text = [start_word]
        for i in range(length - 1):
            #convert previous predicted word to tensor
            input_tensor = torch.zeros(1, vocab_size)
            input_tensor[0, word_idx] = 1
            #generate the next hidden and cell state using the previous predicted word
            output, hidden, cell = model(input_tensor, hidden, cell)
            #next word based on the output probabilities
            _, predicted = output.topk(1)
            word_idx = predicted.item()

            #predicted word index converted to a string and added to the generated text
            output_text.append(idx_to_word[word_idx])
    return(' '.join(output_text))

In [17]:
# Generate text using the trained LSTM model
generated_text = generate_text(model, start_word='The', length=20)
print('Generated text:', generated_text)

Generated text: The were all like such a little pattering of feet at the March Hare said the March Hare took the
