# Bigram Shakespeare Language Model

This notebook builds a very naive bigram language model from term statistics in the Tiny Shakespeare text data.

## Build the Language Model

### 1. Unigram

Load the text data, tokenize it, and compute term frequencies. The unigram language model is simply a list of unique tokens and their probabilities, where the probability $p(t)$ of term $t$ is computed by:

$p(t) = \frac{tf(t)}{T}$ 

where $tf(t)$ is the term frequency of $t$ and $T$ is the total sum of term frequencies in the text.

Example: 
* `t=hope`: $p(t)$ is the likelihood that `hope` will occur.
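
As a minimal sketch of the unigram estimate, the snippet below applies $p(t) = tf(t)/T$ to a tiny made-up token list. The corpus and the names `toy_tokens`, `toy_tf`, and `toy_unigram` are illustrative assumptions and are not part of the notebook's data.

```python
from collections import Counter

# Hypothetical toy corpus, used only to illustrate the formula
toy_tokens = "i hope you hope i know".split()

# Term frequencies tf(t) and their total T
toy_tf = Counter(toy_tokens)
toy_total = sum(toy_tf.values())

# Unigram probabilities p(t) = tf(t) / T
toy_unigram = {t: count / toy_total for t, count in toy_tf.items()}

print(toy_unigram["hope"])
```

Because every count is divided by the same total, the probabilities over the vocabulary sum to 1.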

### 2. Bigram

$p(t|t_0) = \frac{tf(t_0\to t)}{tf(t_0)}$

where $tf(t_0\to t)$ is the frequency of $t_0$ followed by $t$.

Example: 
* `t_0=I`, `t=hope`: $p(t|t_0)$ is how likely `hope` is to occur immediately after `I`.
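
The conditional estimate can be sketched on a tiny made-up token list as follows. The corpus and the helper `bigram_prob` are hypothetical names introduced for illustration only.

```python
from collections import Counter

# Hypothetical toy corpus, used only to illustrate the formula
toy_tokens = "i hope you hope i know".split()

# tf(t0): how often each token occurs
toy_tf = Counter(toy_tokens)
# tf(t0 -> t): how often t0 is immediately followed by t
toy_bigram_tf = Counter(zip(toy_tokens, toy_tokens[1:]))

def bigram_prob(t0, t):
    """Return p(t | t0) = tf(t0 -> t) / tf(t0)."""
    return toy_bigram_tf[(t0, t)] / toy_tf[t0]

print(bigram_prob("i", "hope"))
```

A pair that never occurs in the text gets probability 0 under this estimate, so a pure bigram model can never produce it.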


### 3. Mixture of Unigram and Bigram

$\hat{p}(t|t_0) = r\cdot p(t|t_0) + (1-r)\cdot p(t)$

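where $r$ is a constant between 0 and 1 that weights the bigram estimate, e.g. $r=0.5$. Mixing in the unigram term gives every word in the vocabulary a nonzero probability, even when the pair $t_0\to t$ never occurs in the text.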
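
A small sketch of the mixture on a made-up token list is shown below. The corpus and the helper `mixed_prob` are assumptions for illustration, with $r=0.5$ chosen arbitrarily.

```python
from collections import Counter

# Hypothetical toy corpus, used only to illustrate the formula
toy_tokens = "i hope you hope i know".split()

toy_tf = Counter(toy_tokens)
toy_total = sum(toy_tf.values())
toy_bigram_tf = Counter(zip(toy_tokens, toy_tokens[1:]))

def mixed_prob(t0, t, r=0.5):
    """Return r * p(t | t0) + (1 - r) * p(t)."""
    p_unigram = toy_tf[t] / toy_total
    p_bigram = toy_bigram_tf[(t0, t)] / toy_tf[t0]
    return r * p_bigram + (1 - r) * p_unigram

# Mixed distribution over the whole vocabulary, given the previous word "i"
row = {t: mixed_prob("i", t) for t in toy_tf}
print(row)
print(sum(row.values()))
```

In this toy corpus `you` never follows `i`, yet it still receives a nonzero probability from the unigram term. The row also sums to 1 here because both component distributions do.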

In [36]:
import csv
import urllib.request
from collections import defaultdict, Counter

import nltk
import numpy as np
from nltk import word_tokenize

# Note: word_tokenize relies on NLTK's Punkt tokenizer data,
# which may need a one-time download via nltk.download

# Load the tiny Shakespeare text
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
response = urllib.request.urlopen(url)
long_txt = response.read().decode('utf-8-sig')

# Tokenize the text
tokens = word_tokenize(long_txt.lower())

## UNIGRAM
# Build a frequency distribution of the tokens
freq_dist = nltk.FreqDist(tokens)
# Normalize the frequencies to get probabilities
total_frequency = sum(freq_dist.values())
probabilities = {word: freq/total_frequency for word, freq in freq_dist.items()}

## BIGRAM
# Compute the bigrams
bigrams = list(nltk.bigrams(tokens))
# Compute the frequency distribution of the bigrams
bigram_freq = nltk.FreqDist(bigrams)
# Initialize a two-dimensional default dictionary. This will store our model.
bi_model = defaultdict(Counter)

# Populate the model with the conditional probabilities
# p(t|t_0) = tf(t_0 -> t) / tf(t_0), and write one CSV row per unique bigram
with open('bigram_probabilities.txt', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for (word1, word2), count in bigram_freq.items():
        bi_model[word1][word2] = count / freq_dist[word1]
        writer.writerow([word1, word2, bi_model[word1][word2]])


## Sample a Word

Combine the unigram and bigram probabilities, then create functions that sample one word at a time from the resulting probability distribution.
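
To illustrate what sampling from a probability distribution means, here is a minimal sketch over a made-up three-word distribution. It uses the standard library's `random.choices` rather than NumPy so that it runs without extra dependencies; `toy_dist` and `sample_from` are hypothetical names.

```python
import random

def sample_from(dist, rng):
    """Draw one word, with probability proportional to its weight in dist."""
    words = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(words, weights=weights, k=1)[0]

# Hypothetical distribution over three words
toy_dist = {"hope": 0.5, "know": 0.3, "see": 0.2}

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
draws = [sample_from(toy_dist, rng) for _ in range(1000)]

# Empirical frequencies should be close to the probabilities in toy_dist
print({w: draws.count(w) / len(draws) for w in toy_dist})
```

With enough draws, the observed frequencies approach the stated probabilities, which is a quick sanity check for any sampler.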

In [37]:
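# COMBINE the two probabilities: r * p(t|t_0) + (1-r) * p(t)
r = 0.7
for word1 in bi_model: 
    # Rescale the stored bigram values so they sum to 1 for this word1
    prob_sum = sum(bi_model[word1].values())
    for word, freq in freq_dist.items():
        if word in bi_model[word1]: 
            bi_model[word1][word] = bi_model[word1][word]*r/prob_sum + freq*(1-r)/total_frequency
        else: 
            # Unseen bigram: p(t|t_0) is 0, so only the unigram term contributes
            bi_model[word1][word] = freq*(1-r)/total_frequency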

# normalize probability to add up to 1
for word1 in bi_model: 
    prob_sum = sum(bi_model[word1].values())
    for word, freq in freq_dist.items():
        bi_model[word1][word] = bi_model[word1][word]/prob_sum


In [38]:
# Sample a word from the probability distribution
def sample_word(probabilities):
    words = list(probabilities.keys())
    probs = list(probabilities.values())
    word = np.random.choice(words, p=probs)
    return word

def sample_bi_gram(bi_model, word1): 
    probabilities = bi_model[word1]
    return sample_word(probabilities)
    

In [39]:
# Test the function
print(sample_word(probabilities))

honest


In [61]:
print("to", sample_bi_gram(bi_model, "to"))

to exceed


In [64]:
print("what", sample_bi_gram(bi_model, "what"))

what sweet


In [65]:
print("i", sample_bi_gram(bi_model, "i"))

i assist


## Repeat Word Sample

Repeat the word sampling to generate the next word, and the next, and so on, until a sentence-ending token (`.`, `!`, or `?`) is produced.
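
The loop can be sketched on a tiny hand-written model as follows. The model `toy_model`, the helper `generate_tokens`, and the `max_len` cap are assumptions added for illustration; the cap simply guards against very long sentences.

```python
import random

# Hypothetical next-word distributions, keyed by the previous word
toy_model = {
    "i": {"hope": 0.6, "know": 0.4},
    "hope": {"you": 0.5, ".": 0.5},
    "know": {"you": 0.5, ".": 0.5},
    "you": {"hope": 0.3, "know": 0.3, ".": 0.4},
}

def generate_tokens(model, start, rng, max_len=20):
    """Sample word after word until an end token or max_len tokens."""
    tokens = [start]
    while tokens[-1] not in {".", "!", "?"} and len(tokens) < max_len:
        next_dist = model[tokens[-1]]
        next_word = rng.choices(
            list(next_dist.keys()), weights=list(next_dist.values()), k=1
        )[0]
        tokens.append(next_word)
    return tokens

print(" ".join(generate_tokens(toy_model, "i", random.Random(0))))
```

Each step looks only at the most recent word, which is why bigram output tends to be locally plausible but globally incoherent.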

In [45]:
import time
import re
import string
    
# Repeat word sampling from bi-grams to generate a sentence
def generate_bigram_sentence(model, start):
    word = start
    while word not in ['.', '!', '?']:
        if re.fullmatch(r'['+re.escape(string.punctuation)+']*$', word):
            print(f"{word}", end="")
        else:
            print(f" {word}", end="")
        word = sample_bi_gram(model, word)
        time.sleep(0.5)
    print(word)

## Does it generate anything meaningful? 

Does the output make sense or not? Does it at least resemble Shakespeare's language and vocabulary?

In [66]:
generate_bigram_sentence(bi_model, "what")

 what hath sent to as i pardon me well a to my reason with you elements to him may two worthy then is the butcher which with the trial, how, york to brittany me grave sir in the than good, not, itself against?


In [68]:
generate_bigram_sentence(bi_model, "romeo")

 romeo: o why sister.


In [69]:
generate_bigram_sentence(bi_model, "to")

 to urge the pale at your brother did with:?


In [71]:
generate_bigram_sentence(bi_model, "all")

 all depart to upon henry bolingbroke.


In [72]:
generate_bigram_sentence(bi_model, "thy")

 thy, we person from be the, and:, i heavens call i.


In [73]:
generate_bigram_sentence(bi_model, "hail")

 hail, let thou art thou leave.


In [50]:
generate_bigram_sentence(bi_model, "to")

 to hear no us and be good will not an, story; a gentleman so reputed in a bark what law, the treasons.


In [31]:
generate_bigram_sentence(bi_model, "hail")

 hail bidding task get live part me: whose gratitude this to and fly leave to his and the your enough tewksbury: 's ear incline, loss herself another love we make us!


In [35]:
generate_bigram_sentence(bi_model, "the")

 the stuff the 'd a: man, that he twenty of newness sebastian mercy?


In [75]:
generate_bigram_sentence(bi_model, "why")

 why present or so which ears 'll you for thou hast hit bone tailor for ireland, of why,!


In [76]:
generate_bigram_sentence(bi_model, "why")

 why flowers the abhorr my to want you, to speak be the her heavy of their very not say o asleep with a 't is into as on them!
