# Text generation using Markov Models

## Imports <a name="im"></a>

In [1]:
import os
import re
import sys
from collections import Counter, defaultdict
from urllib.request import urlopen

import numpy as np
import numpy.random as npr
import pandas as pd

## 1: Warm-up <a name="1"></a>

Build a letter or character-based Markov model of language with a tiny corpus. The goal of such a model is to predict next character given a sequence of characters. 

In NLP, a Markov model of language is also referred to as **an n-gram language model**. In this exercise we'll focus on a character-based bigram (2-gram) language model. We will use the variable `n=1` to denote that we are only considering the current character in the sequence to predict the next character. In the next exercise, you'll explore different values for `n` in your n-gram language model. 
 
To build a bigram model, we need bigram frequency counts, i.e., counts of all 2-letter sequences in your corpus. Below code creates bigram frequency counts of letters in a tiny corpus with 6 words. Our goal is to build a character-based bigram language model with this corpus, where the set of states is the unique letters in the corpus.  

> When we say n-gram we're now referring to the last `n` characters, not the last `n` words. In general, the term n-gram can refer to characters or word; see the [Wikipedia article](https://en.wikipedia.org/wiki/N-gram).

In [2]:
corpus = "to be or not to be"  # our tiny corpus
n = 1  # for bigrams
circ_corpus = corpus + corpus[:n]
frequencies = defaultdict(Counter)
for i in range(len(circ_corpus) - n):
    frequencies[circ_corpus[i : i + n]][circ_corpus[i + n]] += 1
frequencies

defaultdict(collections.Counter,
            {'t': Counter({'o': 2, ' ': 1}),
             'o': Counter({' ': 2, 'r': 1, 't': 1}),
             ' ': Counter({'b': 2, 'o': 1, 'n': 1, 't': 1}),
             'b': Counter({'e': 2}),
             'e': Counter({' ': 1, 't': 1}),
             'r': Counter({' ': 1}),
             'n': Counter({'o': 1})})

<br><br>

### 1.1 Visualizing character bigram counts as a co-occurrence matrix
Show the bigram frequencies in `frequencies` variable above as a pandas DataFrame, sorting the column and row labels alphabetically, where
    * column labels and row indices are unique characters in the corpus, and 
    * the value in each cell $a_{ij}$ represents how often the character $i$ precedes character $j$ in our corpus.
    
> Note: Fill in the NaN values with zeros. 

In [4]:
freq_df = pd.DataFrame(frequencies).transpose()
freq_df = freq_df.fillna(0)
freq_df

Unnamed: 0,o,Unnamed: 2,r,t,b,n,e
t,2.0,1.0,0.0,0.0,0.0,0.0,0.0
o,0.0,2.0,1.0,1.0,0.0,0.0,0.0
,1.0,0.0,0.0,1.0,2.0,1.0,0.0
b,0.0,0.0,0.0,0.0,0.0,0.0,2.0
e,0.0,1.0,0.0,1.0,0.0,0.0,0.0
r,0.0,1.0,0.0,0.0,0.0,0.0,0.0
n,1.0,0.0,0.0,0.0,0.0,0.0,0.0


<br><br>

### 1.2 Transition matrix - the conditional probability distribution for every possible bigram

> Recall that the transition matrix $T$ is a square matrix and the number of rows/columns is equal to the number of states. Each row is a discrete probability distribution summing to 1. The element $T_{ij}$ is the probability of transitioning from state $i$ to state $j$. 

In [7]:
trans_df = freq_df.div(freq_df.sum(axis=1), axis=0)
trans_df

Unnamed: 0,o,Unnamed: 2,r,t,b,n,e
t,0.666667,0.333333,0.0,0.0,0.0,0.0,0.0
o,0.0,0.5,0.25,0.25,0.0,0.0,0.0
,0.2,0.0,0.0,0.2,0.4,0.2,0.0
b,0.0,0.0,0.0,0.0,0.0,0.0,1.0
e,0.0,0.5,0.0,0.5,0.0,0.0,0.0
r,0.0,1.0,0.0,0.0,0.0,0.0,0.0
n,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
states = np.unique(list(frequencies.keys()))
print("States:", states)
S = len(states)
print("Number of states:", S)

States: [' ' 'b' 'e' 'n' 'o' 'r' 't']
Number of states: 7


### 1.3 Probability of sequences

Suppose the probability of starting a sequence with letter $b$ is $0.4$ (i.e., $\pi_0$ of state $b$ = 0.4). Given the transition matrix in 1.2, calculate the probabilities for the following sequences. 

1. "be or"
2. "beet"

$P($"be or"$) = P(b|b=0.4) \times P(e|b) \times P($ $|e) \times P(o|$ $) \times P(r|o)$
<br> $\implies 0.4 \times 1 \times 0.5 \times 0.2 \times 0.25 = 0.01$

$P($"beet"$) = P(b|b=0.4) \times P(e|b) \times P(e|e) \times P(t|e)$
<br> $\implies 0.4 \times 1 \times 0 \times 0.5 = 0$

<br><br>

### 1.4 Generate sequences

> Starting at state `t`, we would sample the next letter from the letter's probability distribution, computed in the transition matrix. Once the next state in sampled, we can transition into it. Once we're in the new state -- suppose it's the letter "e" since it has the highest probability of getting sampled -- we can use the Markov assumption and forget the past states. That is, we now need to sample from the probability distribution of state `e`, a sampling methof independent of our previous states. I would repeat this process until the generated sequence of text reaches `seq_len`.

> A few applications for word- or character-level text generation may include:
<br> - iMessages autocomplete text suggester 
<br> - Google home 
<br> - Autofill search bars
<br> - Automatic replies in Emails
<br> - Discord chatbots

<br><br><br><br>

## Character-based Markov model of language <a name="2"></a>

The code below uses the hyperparameter `n`, where our "state" of the Markov chain is the last `n` characters of a give string. Previously, we worked with `n=1` (bigram model) and our "state" of the Markov chain was a single character and each character was dependent on the last one character. When `n=3`, it means that the probability distribution over the _next character_ only depends on the _preceding 3 characters_. 

> Note that `n` in n-gram doesn't exactly correspond to the variable `n` we are using in the implementation below. For 2-gram (bigram) the value of the variable `n` is 1, and for 4-gram the value of `n=3`. 

We train our model from data by recording every occurrence of every n-gram and, in each case, recording what the next letter is. Then, for each n-gram, we normalize these counts into probabilities just like we did with naive Bayes. The `fit` function below implements these steps.

To generate a new sequence, we start with some initial seed at least of length `n` (here, we will just use the first `n` characters in the training text, which are saved at the end of the `fit` function). Then, for the current n-gram we will look up the probability distribution over next characters and sample a random character according to this distribution.

Attribution: assignment adapted with permission from Princeton COS 126, [_Markov Model of Natural Language_]( http://www.cs.princeton.edu/courses/archive/fall15/cos126/assignments/markov.html). Original assignment was developed by Bob Sedgewick and Kevin Wayne.

In [10]:
class MarkovModel:
    def __init__(self, n):
        """
        Initialize the Markov model object.

        Parameters:
        ----------
        n : int
            the size of the ngram
        """
        self.n = n

    def fit(self, text):
        """
        Fit a Markov chain and create a transition matrix.

        Parameters
        ----------
        text : str
            text to fit the Markov chain
        """

        # make text circular so markov chain doesn't get stuck
        circ_text = text + text[: self.n]

        # count the number of occurrences of each letter following a given n-gram
        frequencies = defaultdict(Counter)
        for i in range(len(text)):
            frequencies[circ_text[i : i + self.n]][circ_text[i + self.n]] += 1.0

        # normalize the frequencies into probabilities (separately for each n-gram)
        self.probabilities_ = defaultdict(dict)
        for ngram, counts in frequencies.items():
            self.probabilities_[ngram]["symbols"] = list(counts.keys())
            probs = np.array(list(counts.values()))
            probs /= np.sum(probs)
            self.probabilities_[ngram]["probs"] = probs

        # store the first n characters of the training text, as we will use these
        # to seed our `generate` function
        self.starting_chars = text[: self.n]
        self.frequencies_ = frequencies  # you never know when this might come in handy

    def generate(self, seq_len):
        """
        Using self.starting_chars, generate a sequence of length seq_len
        using the transition matrix created in the fit method.

        Parameters
        ----------
        seq_len : int
            the desired length of the sequence

        Returns:
        ----------
        str
            the generated sequence
        """
        s = self.starting_chars
        while len(s) < seq_len:
            probs = self.probabilities_[s[-self.n :]]
            s += npr.choice(probs["symbols"], p=probs["probs"])
        return s

<br><br>

Let's test our code. 

In [11]:
# Grimms' Fairy Tales by Jacob Grimm and Wilhelm Grimm
data_url = "http://www.gutenberg.org/files/2591/2591-0.txt"

corpus = urlopen(data_url).read().decode("utf-8")
corpus = corpus[2000:]
model = MarkovModel(n=3)
model.fit(corpus)
print(model.generate(500))

E WORK


Thes, she king, and the more
toward to gready’s, who seven to the we cart he was but all pace
swimmed sprey, ‘Of with a carribly, plack if you gale, but you with, and girl, and to beased to
the went lame ask the came to
was time appertailor in it wered,
  And be as a tain.’ ‘I withough forman by the stan-her that for for waitingdom of on he
wishe his a need his eyest. One on a have has batter no go the garder letell.’ Then you with the said the dared: ‘Will it, who disteppensel


<br><br>

### 2.1 "Circular" version of the text

**Question:** Why do we need to use a "circular" version of the text in the `fit` function? What could go wrong if we didn't do that, and instead used the original `text` (instead of `circ_text`) and made the loop:  
```
for i in range(len(text)-self.n):
```
which allow `fit` to run without crashing?

> The main reason we use a "circular" version `circ_text` is that the Markov chain does not get stuck in any state with no characters. That is, we ensure the ability to transition with having non-zero transition probabilities. This also imposes the irreducibity assumption which assumes the graoh does not get stuck at any state, and possibly the assumption of aperiodicity, which suggests no repeation of the same sequences occur. Omitting the use of this text version, we are exposing the chain in getting stuck in a loop when generating the text since the length of the generated text would be smaller than the length of the desired length of the sequence.  

> Remark: Analogous to Laplace smoothing, we could impose some filters and additive smoothing into the Markov model using interpolation or absolute discounting. The effect could be observed in the smoothness when testing various window sizes and n-grams.

<br><br>

### 2.3 States, state space, and transition matrix
Let's consider the Markov chain interpretation of what we've just done. Let's define a state as the previous $n$ characters "emitted" by our chain. Let's assume our vocabulary size (the number of possible characters) is $V$; for example, $V=26$ if we were only using lowercase letters. We can compute this in our case:

In [12]:
print("Vocabulary size = %d" % len(np.unique(list(corpus))))

Vocabulary size = 86


But let's just stick with a general $V$ for now. Call the number of possible states $S$. Let's consider the _transition matrix_ $T$ of this Markov chain. The transition matrix is a square matrix and the number of rows/columns is equal to the number of states, so in our case it is $S \times S$. Each row is a discrete probability distribution summing to 1. The element $T_{ij}$ is the probability of transitioning from state $i$ to state $j$. 

> - The number of possible states is $V^n$
> - Example states in the state space may include {aaa, abc}
> - When $n>1$ not all transitions are possible since we cannot transition some states to others. Suppose we wanted to switch from state $abc$ to state $aaa$. This would not be possible since we can only transition $bc\alpha$ for some $\alpha$ from all permutations of $bc\alpha$ in the corpus.
> - The maximum number of nonzero elements that $T$ could have is $V^{n+1}$. The number of nonzero elements in every row is all the possible states to transtition into, which is equivalent to the number of states. That is, $S=V^n$, the number of columns. Therefore, nonzero elements will be at maximum $S\times V = V^n \times V = V^{n+1}$. This makes up a very small proportion of the total number of elements $V^{2n}$.
> - The state space is **discrete**.

<br><br><br><br>

### Stationary distribution and other fun stuff <a name="3"></a>

The code below computes the transition matrix for you for the Shakespeare data with `n=1`. Consider this transition matrix for the next few questions.

In [13]:
# The Project Gutenberg eBook of The Complete Works of William Shakespeare, by William Shakespeare
data_url = "http://www.gutenberg.org/files/100/100-0.txt"
shakespeare_text = urlopen(data_url).read().decode("utf-8")
shakespeare_text = shakespeare_text[4000:]

In [14]:
states = np.unique(list(shakespeare_text))
print("States:", states)
S = len(np.unique(list(shakespeare_text)))
print("Number of states:", S)

model = MarkovModel(n=1)
model.fit(shakespeare_text)

# implementation note: since len(model.probabilities_[state]["probs"]) might not equal S
# for all letters, we need to be careful and do a reverse-lookup for the actual letters
# the rest of the transition probabilities are just zero (if they don't appear)
lookup = dict(zip(states, list(np.arange(S, dtype=int))))

T = np.zeros((S, S))  # transition matrix
for i, state in enumerate(states):
    for j, letter in enumerate(model.probabilities_[state]["symbols"]):
        T[i, lookup[letter]] = model.probabilities_[state]["probs"][j]

print("Number of nonzero transition probabilities: %d/%d" % (np.sum(T > 0), T.size))

States: ['\t' '\n' '\r' ' ' '!' '"' '$' '%' '&' "'" '(' ')' '*' ',' '-' '.' '/'
 '0' '1' '2' '3' '4' '5' '6' '7' '8' '9' ':' ';' '?' '@' 'A' 'B' 'C' 'D'
 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V'
 'W' 'X' 'Y' 'Z' '[' '\\' ']' '_' '`' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i'
 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z' '|'
 '}' 'Æ' 'É' 'à' 'â' 'æ' 'ç' 'è' 'é' 'ê' 'ë' 'î' 'œ' '—' '‘' '’' '“' '”']
Number of states: 107
Number of nonzero transition probabilities: 2647/11449


<br><br>

### 3.1 Stationary distribution conditions 
Under mild assumptions, a Markov chain has a _stationary distribution_ which is the probability of finding yourself in the same state after the chain is run for a long time. These assumptions are:

- "Irreducibility" (doesn’t get stuck in part of the graph)
- "Aperiodicity" (doesn’t keep repeating same sequence).

**Question:** why might out Markov chain might satisfy these assumptions?

> I would suspect that Markoc chain is irreducible and aperiodic since I'm assuming the text is written in proper English. As such, it's probably unlikely the chain gets stuck at a given state. Also, the states probability distributions seem to have a lot of variation. Looking at the transition matrix $T$, it seems we're able to transition freely between all 107 states, which would satisfy the irreduciblity assumption. However, the third state \r is always followed by \n (with probability 1). Inspecting the shakespearen text, we see that \r is indeed only ever followed by \n. Though seeing the following transition is a little unfeasible, we assume an exit is possible, to a space character perhaps. Thus based on the transition matrix, once if we're in that state (state 2) and cannot leave, this markov chain would be reducible. Seeing how this may be an anomoly, we can make the assumption that the Markov chain is indeed irreducible. We can also check the diagonals of the transition matrix are 1 but that aline may not enough to prove irreduciblity.

<br><br>

### 3.2 Stationary distribution for the `shakespeare_text`
It's not true in general but in this particular case we actually already know the stationary distribution -- it is just the relative frequency that these n-grams occur in the training data (you are welcome to think about this deeply, but you don't have to). (Recall that we are assuming n=1 here.)

In this section, we will perform the following:
1. Calculate the stationary distribution for the `shakespeare_text` by calculating the relative frequencies for each state in the corpus (unique letter in this case, as we are assuming `n=1`).
2. Show empirically that this is a stationary distribution. 

In [15]:
text_len = len(shakespeare_text)
freq_df = np.array([shakespeare_text.count(i) / text_len for i in states])
pd.DataFrame(freq_df, columns=['Freq'])

Unnamed: 0,Freq
0,5.251253e-07
1,2.983482e-02
2,2.983482e-02
3,1.845093e-01
4,1.490131e-03
...,...
102,2.888189e-04
103,6.336512e-05
104,2.995840e-03
105,8.191955e-05


In [16]:
def stationary_dist(pi_0, T, time_step=20):
    print('pi_0 =', pi_0)
    pi_time_step = pi_0@np.linalg.matrix_power(T,time_step)
    print('pi_%d = %s'%(time_step,pi_time_step))
    if not np.allclose(pi_time_step@T, pi_time_step): 
        print('Not steady state yet: pi_%d@T != pi_%d'
              %(time_step,time_step))
    else:     
        print('Steady state: pi_%d@T == pi_%d'%(time_step,time_step))
               
stationary_dist(freq_df, T, time_step = 1)

pi_0 = [5.25125321e-07 2.98348201e-02 2.98348201e-02 1.84509258e-01
 1.49013062e-03 1.40033419e-05 3.50083547e-07 1.75041774e-07
 1.15527571e-05 2.49189469e-03 5.74137018e-05 5.72386600e-05
 4.55108612e-06 1.62132443e-02 1.01104129e-03 1.48939544e-02
 2.10050128e-06 2.87068509e-05 6.16147043e-05 3.97344826e-05
 2.87068509e-05 3.37830623e-05 2.03048458e-05 2.71314749e-05
 1.54036761e-05 2.59061825e-05 1.61038432e-05 8.24796838e-04
 2.99093879e-03 1.94313873e-03 1.75041774e-07 7.78095693e-03
 2.32087888e-03 3.28413376e-03 2.30232445e-03 6.31375678e-03
 1.98112279e-03 1.82165974e-03 2.94070180e-03 9.06016221e-03
 3.25052574e-04 1.00929087e-03 3.81328504e-03 2.53985614e-03
 4.35959042e-03 4.84515630e-03 1.84424013e-03 2.06199209e-04
 4.22288279e-03 5.34472552e-03 6.74610996e-03 2.24858663e-03
 5.58033175e-04 2.91129478e-03 6.42403310e-05 1.27045319e-03
 9.87235604e-05 6.09670498e-04 1.75041774e-07 6.09845540e-04
 9.38048865e-04 1.75041774e-07 4.64706152e-02 8.88109447e-03
 1.28183091e-02 2

<br><br>

> Note that we may chooose to calculate stationary distribution using eigenvalue decomposition of the transition matrix.

<br><br>

### 3.4 Finding probability of occurrences of patterns 
Let's consider the conditional probability that a lower case vowel comes 3 characters later given that the current letter is "a". In other words, we're searching for the pattern `axxv` where `v` is a lower case vowel (defined as a,e,i,o,u,y) and `x` is any character and `a` is literally "a". 

Let's use `n=1`.

**Your tasks:**
1. It turns out we can estimate this probability directly from the transition matrix. While $T$ gives us the probabilities one step ahead, $T\times T$ gives us the probabilities 2 steps ahead, etc. (If you want to think about this, you should be able to convince yourself it's true by contemplating what matrix multiplication really means, but this is optional.) So, taking $T^3$ gives us the transition probabilities for 3 steps later. Compute $T^3$ and find the estimated probability that we're looking for. 
2. We could also estimate this probability directly from the data, by just seeing how frequently this pattern occurs. Do this. How well does it match your above results? What does this tell us about how reasonable our  Markov assumption is?
3. What if we increased `n` and repeated part (1), would you expect to get closer to the answer from part (2)? You are welcome to try this but you don't have to - the goal is just to think about it.

> Note for 3.4.1: you should NOT use `T**3`, as that is element-wise exponentiation rather than a matrix power. You can get $T^3$ using `numpy.linalg.matrix_power(T,3)` or just `T@T@T`.

In [18]:
vowels = ['a', 'e', 'i', 'o', 'u','y']
unique_states = list(states)
T_df = pd.DataFrame(T, columns=unique_states, index=unique_states)
T_df_3 = T_df @ T_df @ T_df
T_df_3.head()

Unnamed: 0,\t,\n,\r,Unnamed: 4,!,"""",$,%,&,',...,é,ê,ë,î,œ,—,‘,’,“,”
\t,1.382155e-07,0.000963,0.016311,0.144041,0.001256,1.1e-05,4.901805e-07,0.0,8e-06,0.002744,...,6e-06,5.325955e-07,1.810708e-07,1.108755e-07,3.487511e-06,0.000143,2.9e-05,0.003378,3.7e-05,3.9e-05
\n,1.523956e-07,0.002177,0.055919,0.171032,0.000947,8e-06,2.654925e-07,4.506554e-07,6e-06,0.002131,...,5e-06,3.974693e-07,1.507668e-07,6.466956e-08,1.403737e-06,0.000205,0.000136,0.002883,0.000216,6.1e-05
\r,2.977522e-07,0.17225,0.002177,0.087209,0.000134,2.3e-05,2.533663e-06,0.0,1.9e-05,0.001342,...,4e-06,1.194201e-06,1.078453e-08,7.291188e-09,4.261705e-08,2.3e-05,7.3e-05,0.0016,0.000104,8e-06
,5.009547e-07,0.006664,0.025838,0.178863,0.001576,1.2e-05,2.520116e-07,3.491315e-07,1e-05,0.002564,...,7e-06,7.597677e-07,2.190216e-07,1.151768e-07,1.449376e-06,0.000294,3.7e-05,0.003121,4.1e-05,7.3e-05
!,8.592789e-08,0.010453,0.094844,0.240122,0.000706,6e-06,3.844704e-07,9.672549e-09,8e-06,0.001662,...,3e-06,3.255726e-07,1.08877e-07,1.045138e-07,4.821439e-07,9.3e-05,0.000368,0.002761,0.000601,1.1e-05


In [19]:
T_df_3.loc['a', vowels].sum().round(4)

0.2551

In [20]:
# using 4-gram method

tot = 0
letter_a = 0
for i in range(len(shakespeare_text)-3):
    if shakespeare_text[i] == 'a':
        letter_a +=1
        if shakespeare_text[i+3] in vowels:
            tot +=1
round(tot/letter_a, 4)

0.2548

> 2. The result matches the above results almost perfectly. This indicates the Markov chain assumptions were pretty reasonable as the model is able to predict the future, regardless of the past, at high accuracy.

> 3. Yes, if we increased `n`, I would expect to get closer to the answer from part (2) because that would mean incorporating more context and information into the Markov chain computations. We could get better predictions as the model would be closer to memorizing the data, though it would take longer.

<br><br><br><br>

## Markov model of language with words <a name="4"></a>

In this exercise we'll continue with the Markov model of language, but we'll work with _words_ instead of _characters_. The `MarkovModel` code stays the same. Just remember that now we are dealing with words and not characters. 

When we say n-gram we're now referring to `n` words, not `n` characters.

One concern with words is that, in a sense, we have less examples to work with. For example, with character n-grams and `n=3` we have a lot of examples of seeing the character combination "per" and checking what character follows it, but with words even with `n=1` (the smallest nontrivial value of `n`), we might only have one example of the word "persuade" in our corpus. You can imagine that the number of "training examples" only gets smaller if `n`>1. This is something to keep in mind when deciding what values of `n` might be reasonable.

Another issue is that the number of states could explode when you are working with words. For example, with character bigram model our states were all unique characters in the text (107 in our example above). For word bigram model, it's going to be the number of unique words in the corpus, which would be a much bigger number. You can imagine that for a large corpus and bigger values of `n`, the number of states could explode.  

Finally, we need to preprocess the text (i.e., tokenization) before creating a word-based n-gram model. To accomplish this preprocessing, we will use the popular python package [NLTK](http://www.nltk.org/) (Natural Language ToolKit), which we have seen in class. If you are using the course conda environment, you should already have the package in your conda environment. 

You might need to install the NLTK data files with the following command in your conda environment in the terminal.

```
python -m nltk.downloader 'punkt'
```

### 4.1 Word-based n-gram language model

The first step is to break apart the text string into words. NLTK's `word_tokenize` will turn our text into a lists of "tokens" which we will use as our language units. This tokenizing process removes whitespace. So, in `generate`, you should add a space character back in between tokens. This won't look quite right because you'll have spaces before punctuations, but it's good enough for now.

> Note: The way `MarkovModel` is implemented, the patterns we're conditioning on (the last $n$ characters/words) are used as keys in python dictionaries. For characters that was easy because we could just use a string as a key. However, for words with $n>1$ we now have a collection of words, and we want to use this as a dict key. `nltk.tokenize.word_tokenize` outputs a list, which can't be used as a key. I've casted the list to a tuple in the code below, because a tuple _can_ be used as a dict key (if you case: I believe this is because it's immutable whereas a list is not). This tiny change allows reuse of the old code.

#### Implement a word-based Markov model of language

In [21]:
import nltk
from nltk.tokenize import word_tokenize

text_tok = word_tokenize(shakespeare_text)
text_tok = tuple(text_tok)

In [22]:
class MarkovModelWords(MarkovModel):

    def generate(self, seq_len):
        """
        Using self.starting_chars, generate a sequence of length seq_len
        using the transition matrix created in the fit method.

        Parameters
        ----------
        seq_len : int
            the desired length of the sequence

        Returns:
        ----------
        str
            the generated sequence
        """
        s = self.starting_chars
        while len(s) < seq_len:
            probs = self.probabilities_[s[-self.n :]]
            s += (npr.choice(probs["symbols"], p=probs["probs"]), )
        return " ".join(s)

#### Train your word-based Markov model on `shakespeare_text` above with your choice of `n` and generate some text

In [23]:
char_model = MarkovModel(1)
char_model.fit(shakespeare_text)
char_model.generate(80)

's  womou  the, ier syesa ARLOl asimever.\r\natas artrishexigu ando, t. bleomyil  m'

In [24]:
word_model = MarkovModelWords(1)
word_model.fit(text_tok)
word_model.generate(80)

"serv 'd masque tonight . So she would incline to play with the matter , we may stop a neutral , her fan , though you mad-headed ape his princely peers , Lads . LEONATO . This Philoten : But , For not asham 'd . I come in my government of wealth , I confess he shall file Where be asham ’ s true and revenge . ’ s successive heir ; But makes civil . Now , Thou"

In [25]:
char_model = MarkovModel(5)
char_model.fit(shakespeare_text)
char_model.generate(80)

'serv’d there, my lord, like a daughter;\r\n           Enter this arrest to motion.'

In [26]:
word_model = MarkovModelWords(5)
word_model.fit(text_tok)
word_model.generate(80)

'serv ’ d thy beauty ’ s use , If thou couldst answer ‘ This fair child of mine Shall sum my count , and make my old excuse , ’ Proving his beauty by succession thine . This were to be new made when thou art old , And see thy blood warm when thou feel ’ st it cold . 3 Look in thy glass and tell the face thou viewest , Now is the time that face'

#### Discuss the differences between character-based and word-based Markov models in terms of generated text, the number of states, and the time taken to generate the text

> We see that for the character based Markov model with a small value of n performed poorly, with words not indentified as proper English such as *notomiglinend*. Some generated words from sequences of characters are still interpreble though. The word based model is fairly slower than the character based model, which makes sense since the word based model has a larger vocabulary and more states, thus the markov chain will need to compute a higher number of conditional probabilities. 

<br><br>

### 4.2 Dealing with punctuation
If you've generated text from 4.1, you probably noticed that the way whitespace is handled is not ideal. In particular, whitespace is added after every token, including right before punctuation. We shall modify the code to fix this and show a generated sequence. Next, generate text using different values of `n` on `shakespeare_text`. Then we'll be ready to 
train the word-based Markov models on another corpus of our choice and generate text with different values of `n`. 

In [27]:
import string

class MarkovModelWords(MarkovModel):
        """
        Using self.starting_chars, generate a sequence of length seq_len
        using the transition matrix created in the fit method.

        Parameters
        ----------
        seq_len : int
            the desired length of the sequence

        Returns:
        ----------
        str
            the generated sequence
        """
        
        def generate(self, seq_len):
            n = self.n
            seq = self.starting_chars
            while len(seq) < seq_len:
                probs = self.probabilities_[seq[-n:]]
                seq += (npr.choice(probs["symbols"], p=probs["probs"]),)
            output = seq[0]
            for s in seq[1:]:
                if s[0] in string.punctuation:
                    output += s
                else:
                    output += " " + s
            return output

In [28]:
model = MarkovModelWords(1)
model.fit(text_tok)
model.generate(200)

"serv ’ d: if she did level At least I should not the more care, enchantment, Titus Sons, Norfolk say, here behind thee by th' afternoon. PUCELLE. ROMEO.[ Enter Agrippa! GREMIO. Fear not knock till I have locks the turf of apple-johns before? SON. Come, now, my brother, Tell her face; And from imprisonment. Cut off one present parting)[_Exit Biondello._] KING HENRY. TROILUS. A shining arms I pity. YORK. HORATIO. O, if you deny it with straw outburneth; condemn our fortunes I. And as she ’ tis in debt to understand.[_Exeunt._] SCENE II. True, I could O'ermount the neck, To fast, I ’ s-head with her mistress, sir, and a black was brief; that name, as you are our intelligence hath sever ’ s spoils FIRST BANDIT. WESTMORELAND. Enter MENENIUS. An if thou obeyed'st thirty thousand part of it. Exeunt SCENE: I lose a valiant Clifford; I do everything seems true as"

In [29]:
model = MarkovModelWords(5)
model.fit(text_tok)
model.generate(200)

'serv ’ d thy beauty ’ s use, If thou couldst answer ‘ This fair child of mine Shall sum my count, and make my old excuse, ’ Proving his beauty by succession thine. This were to be new made when thou art old, And see thy blood warm when thou feel ’ st it cold. 3 Look in thy glass and tell the face thou viewest, Now is the time that face should form another, Whose fresh repair if now thou not renewest, Thou dost beguile the world, unbless some mother. For where is she so fair whose uneared womb Disdains the tillage of thy husbandry? Or who is he so fond will be the tomb Of his self-love to stop posterity? Thou art thy mother ’ s glass and she in thee Calls back the lovely April of her prime, So thou through windows of thine age shalt see, Despite of wrinkles this thy golden time. But if thou live remembered not to be, Die single and thine image dies with thee. 4 Unthrifty loveliness why dost thou spend,'

In [30]:
model = MarkovModelWords(10)
model.fit(text_tok)
model.generate(200)

'serv ’ d thy beauty ’ s use, If thou couldst answer ‘ This fair child of mine Shall sum my count, and make my old excuse, ’ Proving his beauty by succession thine. This were to be new made when thou art old, And see thy blood warm when thou feel ’ st it cold. 3 Look in thy glass and tell the face thou viewest, Now is the time that face should form another, Whose fresh repair if now thou not renewest, Thou dost beguile the world, unbless some mother. For where is she so fair whose uneared womb Disdains the tillage of thy husbandry? Or who is he so fond will be the tomb Of his self-love to stop posterity? Thou art thy mother ’ s glass and she in thee Calls back the lovely April of her prime, So thou through windows of thine age shalt see, Despite of wrinkles this thy golden time. But if thou live remembered not to be, Die single and thine image dies with thee. 4 Unthrifty loveliness why dost thou spend,'

In [31]:
url = "http://www.gutenberg.org/files/64982/64982-0.txt"
bunny_tale = urlopen(data_url).read().decode("utf-8")

# sample 5000 
bunny_tale = bunny_tale[5000:]
bunny_text_tok = word_tokenize(bunny_tale)
bunny_text_tok = tuple(bunny_text_tok)  

In [32]:
model = MarkovModelWords(1)
model.fit(bunny_text_tok)
model.generate(200)

"f thy power; I have, my followers at a knave or at a most loving kiss. It is in my head. O ’ s day, in his power, Have glowed like an Egyptian That wounds? MESSENGER. I found, Being nurse of eyes would he comes weeping die, and knights Alow'st no day's now. HUME HUME] PHILARIO, lest I said—I will not, good education promises In whom you pardon, Even so valiant as his trade, So could not? CURTIS. During the foe as wantonly, this moon-calf ’ s there? MRS. FORD. Thanks, like this Before Prospero above all: the Florentine? MACBETH. They scatter his leaves the world? BOLINGBROKE. SICINIUS. That we fear. Enter DENNIS, today. DUCHESS. Was this while we ’ s heart good lord. Therefore I, in their tears! PRINCE. I think me. SURREY, bristle up, the Prince, that you: I will my revenge shall to me, I guess by? This onely in that by her"

In [33]:
model = MarkovModelWords(5)
model.fit(bunny_text_tok)
model.generate(200)

'f thy beauty ’ s legacy? Nature ’ s bequest gives nothing but doth lend, And being frank she lends to those are free: Then beauteous niggard why dost thou abuse, The bounteous largess given thee to give? Profitless usurer why dost thou use So great a sum of sums yet canst not live? For having traffic with thy self alone, Thou of thy self thy sweet self dost deceive, Then how when nature calls thee to be gone, What acceptable audit canst thou leave? Thy unused beauty must be tombed with thee, Which used lives th ’ executor to be. 5 Those hours that with gentle work did frame The lovely gaze where every eye doth dwell Will play the tyrants to the very same, And that unfair which fairly doth excel: For never-resting time leads summer on To hideous winter and confounds him there, Sap checked with frost and lusty leaves quite gone, Beauty o ’ er-snowed and bareness every where: Then were not summer ’ s distillation left A liquid prisoner pent in walls of glass, Beauty ’ s'

In [34]:
model = MarkovModelWords(10)
model.fit(bunny_text_tok)
model.generate(200)

'f thy beauty ’ s legacy? Nature ’ s bequest gives nothing but doth lend, And being frank she lends to those are free: Then beauteous niggard why dost thou abuse, The bounteous largess given thee to give? Profitless usurer why dost thou use So great a sum of sums yet canst not live? For having traffic with thy self alone, Thou of thy self thy sweet self dost deceive, Then how when nature calls thee to be gone, What acceptable audit canst thou leave? Thy unused beauty must be tombed with thee, Which used lives th ’ executor to be. 5 Those hours that with gentle work did frame The lovely gaze where every eye doth dwell Will play the tyrants to the very same, And that unfair which fairly doth excel: For never-resting time leads summer on To hideous winter and confounds him there, Sap checked with frost and lusty leaves quite gone, Beauty o ’ er-snowed and bareness every where: Then were not summer ’ s distillation left A liquid prisoner pent in walls of glass, Beauty ’ s'

> Remark: As we increase n, the model is performing better as we condition on more information. As such, we would expect the quality of the generated text to improve as we've seen above. However, increasing n *too much* would lead the model will start to generate data too similar to the training data.

<br><br>

### 4.3 Other improvements

> Other improvements I would consider including into my word-based Markov model so that the generated text looks like proper English include perhaps considering other tokenizers, stemming techniques, possibly add word embeddings to include information about context and grammer, add part of speech tags to words, etc. In this case, this seems maybe removing numbers and apostrephes from the vocabulary would make the quality of the generated text better.