# Text Generation using NLP 

I will build sentence generators using two approaches: Markov chains and LSTMs. I could also use GPT-2, but inference alone takes roughly 5 minutes and training would take 2-4 hours.

## Markov Chains 

![image.png](attachment:image.png)

In [57]:
# Import necessary libraries
import numpy as np
import pandas as pd
import nltk

In [58]:
# Open the Trump speech corpus
with open('../data/trump_speeches.txt', encoding='utf8') as f:
    text = f.read()

print(len(text))

896270


In [59]:
# Text Cleaning
text = text.replace('\n',' ')
text = text.lower()
text = text.replace('\t',' ')
text = text.replace('...',' ')
text = text.replace('“', ' " ')
text = text.replace('”', ' " ')
for spaced in ['.','-',',','!','?','(','—',')']:
    text = text.replace(spaced, ' {0} '.format(spaced))

print(len(text))

948962


In [60]:
text[:3000]

"\ufeffspeech 1    thank you so much .   that's so nice .   isn't he a great guy .   he doesn't get a fair press; he doesn't get it .   it's just not fair .   and i have to tell you i'm here ,  and very strongly here ,  because i have great respect for steve king and have great respect likewise for citizens united ,  david and everybody ,  and tremendous resect for the tea party .   also ,  also the people of iowa .   they have something in common .   hard - working people .   they want to work ,  they want to make the country great .   i love the people of iowa .   so that's the way it is .   very simple .  with that said ,  our country is really headed in the wrong direction with a president who is doing an absolutely terrible job .   the world is collapsing around us ,  and many of the problems we've caused .   our president is either grossly incompetent ,  a word that more and more people are using ,  and i think i was the first to use it ,  or he has a completely different agenda 

In [61]:
# Tokenize the text 
tokens = text.split()
len(tokens)

195264
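
The cleaning step pads each punctuation mark with spaces so that `split()` treats it as its own token. A minimal sketch of that effect on a short made-up string follows; `clean_text` is a hypothetical helper name that simply wraps the same replacements the notebook applies inline.

```python
# Sketch only: `clean_text` is a hypothetical wrapper around the notebook's
# inline replacements, and `sample` is a made-up string, not corpus text.
def clean_text(raw):
    text = raw.replace('\n', ' ').replace('\t', ' ').lower()
    text = text.replace('...', ' ')
    text = text.replace('“', ' " ').replace('”', ' " ')
    # Surround punctuation with spaces so it survives as a separate token
    for mark in ['.', '-', ',', '!', '?', '(', '—', ')']:
        text = text.replace(mark, ' {0} '.format(mark))
    return text

sample = "Thank you, Iowa!\nHard-working people."
sample_tokens = clean_text(sample).split()
print(sample_tokens)
```

Because punctuation becomes a token, the chain can learn where sentences tend to end as well as which words follow each other.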

Define a function that gives us every pair of consecutive words in the speeches. It uses lazy evaluation: calling it returns a generator that yields one pair at a time, instead of filling up memory with every pair up front.

In [62]:
# Make a generator function
def make_pairs(corpus):
    for i in range(len(corpus)-1):
        yield (corpus[i], corpus[i+1])
        
pairs = make_pairs(tokens)
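
To see what the generator produces, here is a small sketch on a four-word toy list (the toy tokens are an assumption for illustration, not taken from the corpus). It also shows that a generator can only be consumed once, which matters because `pairs` is used up by the first loop that iterates over it.

```python
def make_pairs(corpus):
    for i in range(len(corpus) - 1):
        yield (corpus[i], corpus[i + 1])

toy_tokens = ['make', 'america', 'great', 'again']  # illustrative input
toy_pairs = make_pairs(toy_tokens)

first_pair = next(toy_pairs)       # pairs are produced one at a time, on demand
remaining_pairs = list(toy_pairs)  # consuming the rest exhausts the generator
leftover = list(toy_pairs)         # a second pass yields nothing

print(first_pair, remaining_pairs, leftover)
```

If the pairs are needed again, call `make_pairs(tokens)` a second time to get a fresh generator.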

If the first word of the pair is already a key in the dictionary, simply append the next word to the list of words that follow that word. Otherwise, initialize a new entry in the dictionary with the key equal to the first word and the value a list of length one:

In [63]:
word_dict = {}
for word_1, word_2 in pairs:
    if word_1 in word_dict:
        word_dict[word_1].append(word_2)
    else:
        word_dict[word_1] = [word_2]
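
Because each follower is appended once per occurrence, the lists keep duplicates, and picking uniformly from a list is the same as sampling from the empirical transition probabilities. The sketch below makes that explicit on a toy sentence; `transition_probabilities` and the toy tokens are hypothetical names and data introduced only for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus (an assumption for illustration, not the speech data)
toy_tokens = 'we will win . we will make america great . we win'.split()

# Same structure as word_dict: word -> list of every word that followed it
toy_dict = defaultdict(list)
for word_1, word_2 in zip(toy_tokens, toy_tokens[1:]):
    toy_dict[word_1].append(word_2)

def transition_probabilities(word_dict, word):
    """Relative frequency of each follower of `word`."""
    counts = Counter(word_dict[word])
    total = sum(counts.values())
    return {nxt: c / total for nxt, c in counts.items()}

probs_we = transition_probabilities(toy_dict, 'we')
print(probs_we)
```

Here "we" is followed by "will" twice and "win" once, so "will" gets probability 2/3 and "win" gets 1/3.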

Finally, we pick a random word to start the chain and choose how many words to generate:

In [67]:
first_word = np.random.choice(tokens)
chain = [first_word]
n_words = 30

In [68]:
for i in range(n_words):
    chain.append(np.random.choice(word_dict[chain[-1]]))

In [69]:
' '.join(chain)

'so much larger convention center for a total mess . i was right , he gets too high school choice , "i’m going to win" because they going to make america'
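
One edge case in the loop above: a word that only ever appears as the very last token has no entry in the dictionary, so the lookup can raise a `KeyError`. A minimal sketch of one way to guard against that follows. `generate_chain` is a hypothetical helper, it uses the standard-library `random` module rather than `np.random.choice`, and restarting from a random token at a dead end is an assumption, not the notebook's behaviour.

```python
import random

def generate_chain(word_dict, tokens, n_words, seed=None):
    """Walk the chain for n_words steps, restarting at dead ends."""
    rng = random.Random(seed)  # seeded for reproducibility
    chain = [rng.choice(tokens)]
    for _ in range(n_words):
        followers = word_dict.get(chain[-1])
        if not followers:
            # Dead end: this word never had a successor in the corpus.
            # Assumed fallback: jump to a random token and carry on.
            followers = tokens
        chain.append(rng.choice(followers))
    return ' '.join(chain)

toy_tokens = 'a b c'.split()
toy_dict = {'a': ['b'], 'b': ['c']}  # 'c' has no followers
sentence = generate_chain(toy_dict, toy_tokens, n_words=10, seed=0)
print(sentence)
```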

In [76]:
distinct_words = list(set(tokens))
word_idx_dict = {word: i for i, word in enumerate(distinct_words)}

In [78]:
from scipy.sparse import dok_matrix

k = 2 # adjustable
# Every run of k consecutive words that is followed by at least one more word
sets_of_k_words = [' '.join(tokens[i:i+k]) for i in range(len(tokens) - k)]

distinct_sets_of_k_words = list(set(sets_of_k_words))
k_words_idx_dict = {word: i for i, word in enumerate(distinct_sets_of_k_words)}

# Rows: distinct k-word sequences; columns: distinct words.
# Entry [r, c] counts how often word c follows sequence r.
sets_count = len(distinct_sets_of_k_words)
next_after_k_words_matrix = dok_matrix((sets_count, len(distinct_words)))

for i, word in enumerate(sets_of_k_words):
    word_sequence_idx = k_words_idx_dict[word]
    next_word_idx = word_idx_dict[tokens[i+k]]
    next_after_k_words_matrix[word_sequence_idx, next_word_idx] += 1
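
Each row of the count matrix above holds, for one k-word sequence, how many times each word came next. The sketch below shows the same bookkeeping on a toy sentence using a dictionary of `Counter` objects as a standard-library stand-in for the sparse matrix; `build_k_gram_counts` and the toy tokens are hypothetical and only meant to illustrate the idea.

```python
from collections import Counter, defaultdict

def build_k_gram_counts(tokens, k=2):
    """Map each k-word sequence to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - k):
        state = ' '.join(tokens[i:i + k])   # the k-word sequence (a "row")
        counts[state][tokens[i + k]] += 1   # the next word (a "column")
    return counts

# Toy corpus (an assumption for illustration, not the speech data)
toy_tokens = 'we will win . we will make america great . we will win'.split()
toy_counts = build_k_gram_counts(toy_tokens, k=2)
print(toy_counts['we will'])
```

Conditioning on two words instead of one makes the generated text more coherent, at the cost of many more distinct states, which is why a sparse representation is a reasonable choice for the full corpus.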

In [80]:
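def sample_next_word_after_sequence(word_sequence, alpha = 0):
    # Dense 1-D vector of next-word counts for this k-word sequence
    row = next_after_k_words_matrix[k_words_idx_dict[word_sequence]]
    next_word_vector = row.toarray().ravel() + alpha  # alpha > 0 adds smoothing
    likelihoods = next_word_vector/next_word_vector.sum()
    
    # Draw one word with probability proportional to its (smoothed) count
    return np.random.choice(distinct_words, p=likelihoods)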
    
def stochastic_chain(seed, chain_length=15, seed_length=2):
    current_words = seed.split(' ')
    if len(current_words) != seed_length:
        raise ValueError(f'wrong number of words, expected {seed_length}')
    sentence = seed

    for _ in range(chain_length):
        sentence+=' '
        next_word = sample_next_word_after_sequence(' '.join(current_words))
        sentence+=next_word
        current_words = current_words[1:]+[next_word]
    return sentence

# example use: the seed must be a single space-separated string of seed_length words
stochastic_chain('the world')

AttributeError: 'list' object has no attribute 'split'

## LSTM Based Text Generation

In [71]:
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils