# Text Generation using N-grams

Project by Mark Kim (commenting and some code cleanup)

As implemented by Namrata Kapoor on
[Numpy Ninja](https://www.numpyninja.com/post/n-gram-and-its-use-in-text-generation)

This is a simple program that generates sentences from an n-gram language model
trained on a corpus of Donald Trump tweets. First, we download the corpus from a
[Kaggle](https://www.kaggle.com) dataset as a csv. The csv is then loaded into a
dataframe, tokenized, and broken into n-grams (the code provided uses trigrams).
A Maximum Likelihood Estimation (MLE) model is then fit to the pre-processed
training set; fitting amounts to counting how often each n-gram follows its
context. Finally, the model is used along with a detokenizer to generate
sentences based on the history of tweets sent by Trump.
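
To make the estimate concrete, here is a minimal pure-Python sketch of what a maximum likelihood n-gram estimate computes: the count of an n-gram divided by the count of its context. It is not the NLTK implementation; the `mle_estimates` helper and the `toy_corpus` name are hypothetical, and bigrams are used instead of trigrams only to keep the numbers small.

```python
from collections import Counter

# Two toy sentences, already tokenized (illustrative data only).
toy_corpus = [
    ['I', 'need', 'to', 'book', 'ticket', 'to', 'Australia'],
    ['I', 'want', 'to', 'read', 'a', 'book', 'of', 'Shakespeare'],
]


def mle_estimates(sentences, n):
    """Return {ngram: count(ngram) / count(context)} for padded sentences."""
    ngram_counts = Counter()
    context_counts = Counter()
    for sent in sentences:
        # Pad so that sentence starts and ends are modeled too.
        padded = ['<s>'] * (n - 1) + sent + ['</s>'] * (n - 1)
        for i in range(len(padded) - n + 1):
            gram = tuple(padded[i:i + n])
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return {gram: count / context_counts[gram[:-1]]
            for gram, count in ngram_counts.items()}


toy_probs = mle_estimates(toy_corpus, n=2)

# 'to' is followed by 'book', 'Australia' and 'read', once each.
print(toy_probs[('to', 'book')])
# Both sentences start with 'I'.
print(toy_probs[('<s>', 'I')])
```

Any n-gram that never occurs in the corpus is simply absent from the result, which corresponds to a probability of zero under this estimate.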

In [1]:
# import all packages used in this project

from nltk import word_tokenize, sent_tokenize, download

from nltk.util import pad_sequence
from nltk.util import bigrams
from nltk.util import ngrams
from nltk.util import everygrams

from nltk.lm import MLE
from nltk.lm.preprocessing import pad_both_ends
from nltk.lm.preprocessing import flatten
from nltk.lm.preprocessing import padded_everygram_pipeline

from nltk.tokenize.treebank import TreebankWordDetokenizer

import pandas as pd

In [2]:
# a quick test of a toy corpus to extract bigrams

text = [['I', 'need', 'to', 'book', 'ticket', 'to', 'Australia'],
        ['I', 'want', 'to', 'read', 'a', 'book', 'of', 'Shakespeare']]
list(bigrams(text[0]))

[('I', 'need'),
 ('need', 'to'),
 ('to', 'book'),
 ('book', 'ticket'),
 ('ticket', 'to'),
 ('to', 'Australia')]

In [3]:
# a test of the same toy corpus to extract trigrams

list(ngrams(text[1], n=3))

[('I', 'want', 'to'),
 ('want', 'to', 'read'),
 ('to', 'read', 'a'),
 ('read', 'a', 'book'),
 ('a', 'book', 'of'),
 ('book', 'of', 'Shakespeare')]

### Preprocess Trump Corpus

This section of code loads the csv of Donald Trump tweets into a Pandas
DataFrame. The `content` column is tokenized into a list of tweets (a list of
lists of tokens). The resulting corpus is then passed to
`padded_everygram_pipeline`, which returns a tuple of two lazy iterators: one
over the padded text as n-grams of every order up to `n` (the training data) and
one over the padded text as a flat stream of tokens (the vocabulary data).

In [4]:
# Preprocess csv of Donald Trump tweets

df = pd.read_csv('./files/realdonaldtrump.csv')
trump_corpus = list(df['content'].apply(word_tokenize))
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, trump_corpus)
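
To see what the pipeline produces, the following sketch runs it on a one-sentence corpus with `n = 2`. The `toy_*` names are hypothetical and exist only for this illustration; the real corpus is far too large to print this way.

```python
from nltk.lm.preprocessing import padded_everygram_pipeline

# Hypothetical one-sentence corpus, used only for illustration.
toy_corpus = [['a', 'b']]

toy_train, toy_vocab = padded_everygram_pipeline(2, toy_corpus)

# Both results are lazy, so materialize them once to look inside.
toy_ngrams = [list(sent_ngrams) for sent_ngrams in toy_train]
toy_tokens = list(toy_vocab)

print(toy_ngrams[0])  # unigrams and bigrams of the padded sentence
print(toy_tokens)     # the padded sentence as a flat token stream

# The iterators are now exhausted: a second pass yields nothing.
leftover = list(toy_train)
print(leftover)
```

Because both iterators can be consumed only once, the pipeline has to be rerun to get fresh iterators before a model is fit again.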

In [5]:
df.head()

|   | id | link | content | date | retweets | favorites | mentions | hashtags |
|---|---|---|---|---|---|---|---|---|
| 0 | 1698308935 | https://twitter.com/realDonaldTrump/status/169... | Be sure to tune in and watch Donald Trump on L... | 2009-05-04 13:54:25 | 510 | 917 | | |
| 1 | 1701461182 | https://twitter.com/realDonaldTrump/status/170... | Donald Trump will be appearing on The View tom... | 2009-05-04 20:00:10 | 34 | 267 | | |
| 2 | 1737479987 | https://twitter.com/realDonaldTrump/status/173... | Donald Trump reads Top Ten Financial Tips on L... | 2009-05-08 08:38:08 | 13 | 19 | | |
| 3 | 1741160716 | https://twitter.com/realDonaldTrump/status/174... | New Blog Post: Celebrity Apprentice Finale and... | 2009-05-08 15:40:15 | 11 | 26 | | |
| 4 | 1773561338 | https://twitter.com/realDonaldTrump/status/177... | "My persona will never be that of a wallflower... | 2009-05-12 09:07:28 | 1375 | 1945 | | |


### Train Maximum Likelihood Estimator N-gram Model

Here, we instantiate the n-gram model and then train it with the preprocessed
data from the previous section.

In [6]:
# Instantiate and train (fit) the model with preprocessed data

trump_model = MLE(n)
trump_model.fit(train_data, padded_sents)
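
Once a model is fit, its counts and probabilities can be inspected directly. The sketch below does this on the toy corpus from earlier rather than on the tweets, so the numbers are easy to check by hand. The `toy_*` and `p_*` names are hypothetical; `counts` and `score` are part of the `nltk.lm` model interface.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Toy corpus for illustration (the same two sentences as the n-gram tests).
toy_corpus = [
    ['I', 'need', 'to', 'book', 'ticket', 'to', 'Australia'],
    ['I', 'want', 'to', 'read', 'a', 'book', 'of', 'Shakespeare'],
]

order = 3
toy_train, toy_vocab = padded_everygram_pipeline(order, toy_corpus)
toy_model = MLE(order)
toy_model.fit(toy_train, toy_vocab)

# How many times 'book' follows 'to'.
bigram_count = toy_model.counts[['to']]['book']

# P(book | to): 'to' is followed by three different words, once each.
p_book_given_to = toy_model.score('book', ['to'])

# P(to | I want): the only continuation of 'I want' in the corpus.
p_to_given_i_want = toy_model.score('to', ['I', 'want'])

# A word never seen in training gets probability zero under MLE.
p_unseen = toy_model.score('Trump', ['to'])

print(bigram_count, p_book_given_to, p_to_given_i_want, p_unseen)
```

The zero probability for unseen words is a property of the unsmoothed maximum likelihood estimate: the model can only reproduce word sequences whose n-grams occur in the training corpus.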

### Create a Sentence Generation Function

This next section of code instantiates a detokenizer and defines a function
that takes the tokens generated by the model, drops the padding symbols, and
returns the resulting sentence as a string.

In [7]:
detokenize = TreebankWordDetokenizer().detokenize

def generate_sent(model, num_words, random_seed=42):
    """
    :param model: An ngram language model from `nltk.lm.model`.
    :param num_words: Max no. of words to generate.
    :param random_seed: Seed value for random.
    """
    content = []
    for token in model.generate(num_words, random_seed=random_seed):
        if token == '<s>':
            continue
        if token == '</s>':
            break
        content.append(token)
    return detokenize(content)
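
`generate` also accepts a `text_seed` argument that supplies the initial context. Without it, the first token is drawn with no context at all, so a sentence can begin mid-thought or with a padding symbol. The sketch below, again on the toy corpus and with hypothetical `toy_*` names, shows one possible way to start generation at a sentence boundary or to continue a prompt; it is an illustration, not part of the original project.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Toy corpus for illustration only.
toy_corpus = [
    ['I', 'need', 'to', 'book', 'ticket', 'to', 'Australia'],
    ['I', 'want', 'to', 'read', 'a', 'book', 'of', 'Shakespeare'],
]

order = 3
toy_train, toy_vocab = padded_everygram_pipeline(order, toy_corpus)
toy_model = MLE(order)
toy_model.fit(toy_train, toy_vocab)

# Seed with the start padding so the first word is a sentence opener.
from_start = toy_model.generate(
    4, text_seed=['<s>'] * (order - 1), random_seed=7)

# Seed with a prompt; each context here has a single continuation,
# so the result does not depend on the random seed.
continued = toy_model.generate(3, text_seed=['I', 'want'], random_seed=7)

print(from_start)
print(continued)
```

The seed tokens themselves are not included in the returned list, so a caller who wants the prompt in the final sentence has to prepend it before detokenizing.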

### Test the Sentence Generation function

This generates 15 sentences with the generation function above, each from a
different random seed.

In [60]:
import random
for i in range(15):
    print(i, " ", generate_sent(trump_model, num_words=40, random_seed=random.randint(1,10000)))

0   
1   to meet with the Clinton Campaign Organized Potential VPs By Race And Gender: http: //bit.ly/fLEtcl
2   job in the Polls . There will be on the President? The answer is--clean it with Presidential Privilege, I wonder what the next Miss USA pageant tomorrow night at Madison Square Garden I get more documentation
3   them...but he just loose to Michelle - dummy, never even discussed this with them and being violent, and ALL!
4   "@ BackOnTrackUSA: Not only did Egypt destroy its nuclear capability until such time as illegal migrants coming through their country ......
5   to the idea of Trump Attaché, @ lisarinna and @ BretBaier #WakeUpAmerica https: //www.donaldjtrump.com/iowa/caucus-finder/ …pic.twitter.com/1vjCHYYlzU
6   //bit.ly/100qRgF
7   with . Respect . #2020 because they think they can enter our Country, and totally showed their cards for everyone!
8   CelebApprentice @ realDonaldTrump Please run for president please"
9   you both in very close, dummy Jon Stewart . She w