# NLP Assignment: Generating Trump Tweets with N-Gram Models

In this assignment, you will use n-gram language models (LMs) to model tweets (social media statements) from or about the former U.S. president Donald Trump. The goal is then to generate new tweets, or to autocomplete partial ones, in the writing style of Trump's tweets. The tweets were scraped from the social media platform Twitter (since renamed "X").

The appended `NLP_ngram_cheatsheet.ipynb` notebook provides a tutorial on n-grams and LM basics using the `nltk` package; consult it before starting this assignment.

Please code the necessary steps in Python, and provide answers in Markdown format in this notebook, under the corresponding instructions and questions below.

Please rename your final file `NLP_Assignment_STUDENTID.ipynb` for submission on Moodle, and make sure you "run all" with a fresh kernel, so that outputs display correctly and in order in your submission.

**STUDENT ID:** 19-320-563

### Provided Packages

In [1]:
import numpy as np
import pandas as pd
import nltk
nltk.download('popular', quiet=True)

True

### Additional Packages

In [20]:
import preprocessor as p  # pip install tweet-preprocessor
import plotly.graph_objects as go

## Part 1: Import, inspect and preprocess the text data

- Import the provided dataset, `Trump_tweets.csv`. We are interested in the variable `Tweet_Text`, which gives the content of each tweet. 
- Before tokenizing, start by cleaning the tweets' format. You should at least normalize the different types of apostrophes and quotes (e.g. `` ’, ”, ` ``) to the corresponding ` ' ` or ` " `, replace line breaks `\n` (taking care not to merge adjacent words), and collapse repeated spaces. Also make sure URLs (e.g. `https://t.co/wPk7QWpK8Z`) are not split into many meaningless tokens.
- (Optional) Feel free to perform additional cleaning steps that you believe will improve the tokenization or the downstream LMs (in which case, briefly explain why).
- Tokenize the `Tweet_Text` corpus into a list of tokenized tweets (documents). The result should be a list of lists containing word-level tokens (e.g. words, punctuation, and other "special words").
- Show the result for the first five tweets of the corpus.
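
The cleaning and tokenization steps listed above could be sketched as follows. This is a minimal illustration using only the standard library; the names `clean_tweet`, `tokenize_tweet`, `QUOTE_MAP` and `TOKEN_RE` are hypothetical, and the regular expression is a simplified stand-in for a dedicated tweet tokenizer such as `nltk.tokenize.TweetTokenizer`.

```python
import re

# Map typographic quotes, apostrophes and backticks to their ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'", "`": "'", "\u00b4": "'",
    "\u201c": '"', "\u201d": '"',
})

# Alternatives are tried in order, so URLs, mentions and hashtags
# are matched as single tokens before plain words and punctuation.
TOKEN_RE = re.compile(
    r"https?://\S+"      # URLs kept as one token
    r"|[@#]\w+"          # @mentions and #hashtags
    r"|\w+(?:'\w+)*"     # words, including contractions such as don't
    r"|[^\w\s]"          # any other single non-space character
)

def clean_tweet(text):
    """Normalize quotes, replace line breaks by spaces, collapse spacing."""
    text = text.translate(QUOTE_MAP)
    text = text.replace("\n", " ")  # replace rather than delete, so words do not merge
    return re.sub(r"\s+", " ", text).strip()

def tokenize_tweet(text):
    """Return the list of word-level tokens of a cleaned tweet."""
    return TOKEN_RE.findall(clean_tweet(text))

example = "Don\u2019t miss it!\nWatch:  https://t.co/wPk7QWpK8Z @IvankaTrump #MAGA"
tokens = tokenize_tweet(example)
print(tokens)
```

Applied to the whole corpus, something like `[tokenize_tweet(t) for t in trump_tweet_text]` would give the required list of lists. Note that a URL immediately followed by punctuation keeps that punctuation in this simplified pattern.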

##### Answer

### Import CSV

In [40]:
trump_data = pd.read_csv('Trump_tweets.csv')

fig = go.Figure(data=[go.Table(
    header=dict(values=list(trump_data.columns),
                fill_color='#536B78',
                font=dict(color='white', size=12),
                align='left',
                line_color='black',
                line_width=1),
    cells=dict(values=[trump_data[col].tolist() for col in trump_data.columns],
                fill_color='white',
                line_color='black',
                line_width=1),)
])

fig.show()

In [18]:
trump_data.count()

Date                                         7375
Time                                         7375
Tweet_Text                                   7375
Type                                         7375
Media_Type                                   1225
Hashtags                                     2031
Tweet_Id                                     7375
Tweet_Url                                    7375
twt_favourites_IS_THIS_LIKE_QUESTION_MARK    7375
Retweets                                     7375
Unnamed: 10                                    26
Unnamed: 11                                    13
dtype: int64

### Select *Tweet_Text* Only

In [47]:
trump_tweet_text = trump_data["Tweet_Text"].copy()

# A single-column table: one header label, one list of cell values
fig = go.Figure(data=[go.Table(
    header=dict(values=["Tweet_Text"],
                fill_color='#536B78',
                font=dict(color='white', size=12),
                align='left',
                line_color='black',
                line_width=1),
    cells=dict(values=[trump_tweet_text.tolist()],
                fill_color='white',
                align='left',
                line_color='black',
                line_width=1),)
])

fig.show()

In [48]:
trump_tweet_text

0       Today we express our deepest gratitude to all ...
1       Busy day planned in New York. Will soon be mak...
2       Love the fact that the small groups of protest...
3       Just had a very open and successful presidenti...
4       A fantastic day in D.C. Met with President Oba...
                              ...                        
7370    I loved firing goofball atheist Penn @pennjill...
7371    I hear @pennjillette show on Broadway is terri...
7372    Irrelevant clown @KarlRove sweats and shakes n...
7373    "@HoustonWelder: Donald Trump is one of the se...
7374    RT @marklevinshow: Trump: Rove is a clown and ...
Name: Tweet_Text, Length: 7375, dtype: object

### Drop NA and Duplicates 

In [17]:
trump_tweet_text = trump_tweet_text.dropna()
trump_tweet_text = trump_tweet_text.drop_duplicates()
trump_tweet_text

0       Today we express our deepest gratitude to all ...
1       Busy day planned in New York. Will soon be mak...
2       Love the fact that the small groups of protest...
3       Just had a very open and successful presidenti...
4       A fantastic day in D.C. Met with President Oba...
                              ...                        
7370    I loved firing goofball atheist Penn @pennjill...
7371    I hear @pennjillette show on Broadway is terri...
7372    Irrelevant clown @KarlRove sweats and shakes n...
7373    "@HoustonWelder: Donald Trump is one of the se...
7374    RT @marklevinshow: Trump: Rove is a clown and ...
Name: Tweet_Text, Length: 7364, dtype: object

### Check one tweet

In [7]:
sample1 = trump_tweet_text.iloc[8]
sample1

'RT @IvankaTrump: Such a surreal moment to vote for my father for President of the United States! Make your voice heard and vote! #Election2_'

### Remove Tweets That Are Not Trump's Own Words

### Preprocess

In [11]:
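# Clean every tweet with tweet-preprocessor; with its default options,
# p.clean() removes URLs, mentions, hashtags and reserved words such as RT
trump_tweet_clean = trump_tweet_text.apply(p.clean)
trump_tweet_clean.iloc[8]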

': Such a surreal moment to vote for my father for President of the United States! Make your voice heard and vote!'

## Part 2: Fitting and Accessing a Trump Tweet LM

### Ex. 2.1: LM fitting function
Create a function that takes as arguments (at least) the desired order $n$ of the model and a tokenized training corpus, and that returns the "simple" Maximum Likelihood Estimator (MLE) language model, fitted on the given training corpus.  
Then, use your function to fit an MLE language model of order $n=3$ to the Trump Tweets corpus.
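
One possible shape for such a fitting function is sketched below, assuming the `nltk.lm` API covered in the cheatsheet. The function name `fit_mle_lm` and the two-tweet toy corpus are illustrative assumptions, not part of the assignment data.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

def fit_mle_lm(n, tokenized_corpus):
    """Fit a simple MLE language model of order n on a list of token lists."""
    # Pads each document with <s> / </s> and yields all 1..n-grams,
    # together with the flattened padded text used to build the vocabulary.
    train_ngrams, padded_vocab = padded_everygram_pipeline(n, tokenized_corpus)
    model = MLE(n)
    model.fit(train_ngrams, padded_vocab)
    return model

# Toy corpus (hypothetical) standing in for the tokenized tweets.
toy_corpus = [["I", "will", "never", "forget"], ["I", "will", "win"]]
toy_lm = fit_mle_lm(3, toy_corpus)
print(toy_lm.order, toy_lm.counts["I"], toy_lm.score("will", ["I"]))
```

For the assignment itself, the toy corpus would be replaced by the tokenized tweets from Part 1.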

##### Answer

### Ex. 2.2: Vocabulary
- How many distinct tokens are in the model's vocabulary? Is that the same number of distinct tokens that appear in the tokenized corpus?
- Look up the tokens of the sentence `"I love UNIGE students!"` in the model's vocabulary. Explain what you observe, and why.
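
The vocabulary questions above can be explored along the following lines. This is a sketch on a hypothetical toy corpus, assuming the `vocab` attribute and its `lookup` method of a fitted `nltk.lm` model; the counts obtained on the real corpus will of course differ.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Toy corpus (hypothetical) standing in for the tokenized tweets.
toy_corpus = [["I", "love", "America", "!"], ["I", "love", "winning"]]
train_ngrams, padded_vocab = padded_everygram_pipeline(3, toy_corpus)
toy_lm = MLE(3)
toy_lm.fit(train_ngrams, padded_vocab)

# Distinct tokens observed in the corpus versus entries in the model vocabulary.
corpus_types = {tok for tweet in toy_corpus for tok in tweet}
print(len(corpus_types), len(toy_lm.vocab))

# Tokens absent from the training data are mapped to the unknown label.
looked_up = toy_lm.vocab.lookup(["I", "love", "UNIGE", "students", "!"])
print(looked_up)
```

In this toy setting the vocabulary is larger than the set of corpus tokens because it also holds the padding symbols `<s>` and `</s>` and the unknown label `<UNK>`.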

##### Answer

### Ex. 2.3: Token probabilities
- For n-gram models, training boils down to counting the n-grams in the training corpus. Using your fitted model, report how many times each of the following appears in the training data: ``'America', 'Trump', 'I will', 'will never forget'``.
- Then, compute the following word occurrence probabilities ('scores') in the Trump Tweets corpus, and briefly explain what the returned numbers mean about the training data:
    - $\mathbb{P}($'America'$)$,
    - $\mathbb{P}($'Trump'$)$,
    - $\mathbb{P}($'will'$\vert $'I'$)$,
    - $\mathbb{P}($'forget'$\vert $'will never'$)$.
- Briefly explain, with a formula, how those probabilities are obtained from the n-gram counts.
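
To make the link between counts and probabilities concrete, the sketch below computes the maximum likelihood estimate $\mathbb{P}(w \vert h) = C(h, w) / \sum_{w'} C(h, w')$ by hand with `collections.Counter`. The helper names `ngram_counts` and `mle_prob` and the toy corpus are assumptions made for illustration. With a fitted `nltk.lm` model, the corresponding quantities are available as `model.counts[...]` and `model.score(word, context)`; note that a model fitted with `padded_everygram_pipeline` also counts the padding symbols as unigrams, so its unigram probabilities can differ from the unpadded ones shown here.

```python
from collections import Counter

def ngram_counts(corpus, n):
    """Count all n-grams of order n, padding each document on both sides."""
    counts = Counter()
    for tokens in corpus:
        padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"] * (n - 1)
        counts.update(tuple(padded[i:i + n]) for i in range(len(padded) - n + 1))
    return counts

def mle_prob(word, context, corpus):
    """MLE of P(word | context): count(context + word) / count(context + anything)."""
    context = tuple(context)
    counts = ngram_counts(corpus, len(context) + 1)
    numerator = counts[context + (word,)]
    denominator = sum(c for gram, c in counts.items() if gram[:-1] == context)
    return numerator / denominator if denominator else 0.0

# Toy corpus (hypothetical) standing in for the tokenized tweets.
toy_corpus = [
    ["I", "will", "never", "forget"],
    ["I", "will", "win"],
    ["America", "will", "win"],
]

p_america = mle_prob("America", [], toy_corpus)            # 1 of 10 tokens
p_will_given_i = mle_prob("will", ["I"], toy_corpus)       # 2 of 2 continuations of "I"
p_win_given_will = mle_prob("win", ["will"], toy_corpus)   # 2 of 3 continuations of "will"
print(p_america, p_will_given_i, p_win_given_will)
```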

##### Answer

## Part 3: Generation using N-gram Language Model

### Ex. 3.1: Tweet generator
Create a Python function to generate new Trump Tweets. It should:
- take as input arguments: a fitted `nltk.lm.model`, a maximum number of words (integer), a text seed (initial context tokens), and a random "RNG" seed for generation,
- output a newly generated Trump Tweet, according to the input arguments, post-processed as a single text string that is formatted like a tweet.

*Hints:* `nltk.tokenize.treebank.TreebankWordDetokenizer()` and its `.detokenize()` method can help with post-processing. Take care to render elements such as `@user` mentions, URLs and punctuation in a "correct" format.
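
A generator along these lines might look like the sketch below. It assumes the `generate` method of a fitted `nltk.lm` model and the detokenizer named in the hint; the function name `generate_tweet`, its default values and the one-tweet toy corpus are illustrative assumptions.

```python
import re
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.tokenize.treebank import TreebankWordDetokenizer

def generate_tweet(model, max_words=30, text_seed=None, random_seed=42):
    """Generate one tweet as a formatted string from a fitted nltk.lm model."""
    if text_seed is None:
        # Start-of-tweet context: n-1 start symbols, as seen during training.
        text_seed = ["<s>"] * (model.order - 1)
    generated = model.generate(max_words, text_seed=text_seed, random_seed=random_seed)
    if isinstance(generated, str):  # generate() returns a bare string when max_words == 1
        generated = [generated]
    tokens = []
    for tok in list(text_seed) + generated:
        if tok == "</s>":           # stop at the first end-of-tweet symbol
            break
        if tok != "<s>":            # drop start padding from the output
            tokens.append(tok)
    text = TreebankWordDetokenizer().detokenize(tokens)
    # Re-attach @ or # to the following word if the tokenizer split them.
    return re.sub(r"([@#])\s+(?=\w)", r"\1", text)

# Toy corpus (hypothetical): with a single training tweet, generation is deterministic.
toy_corpus = [["Make", "America", "great", "again", "!"]]
train_ngrams, padded_vocab = padded_everygram_pipeline(3, toy_corpus)
toy_lm = MLE(3)
toy_lm.fit(train_ngrams, padded_vocab)

tweet = generate_tweet(toy_lm, max_words=10, random_seed=1)
print(tweet)
```

Truncating at the first `</s>` is one design choice among several; it keeps each output to a single tweet even when `max_words` is larger than the generated tweet.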

##### Answer

### Ex. 3.2: Initial context
To generate a full tweet from an LM of order $n$, explain what the text seed (i.e. the initial context tokens) should be. Set the default value of the relevant argument of your function in 3.1 accordingly.

##### Answer

### Ex. 3.3: Generate tweets
Generate a few new tweets using your new function and the LM fitted in Part 2. For reproducibility, fix the RNG seed used to produce each of them.

*Optional:* show a few examples that you find interesting, representative or funny.

##### Answer

## Part 4: Smoothing and model comparison

### Ex. 4.1: Smoothed LM alternatives to simple MLE
Modify the function that you defined in 2.1 by adding an argument that allows changing the `nltk.lm` language model that is fitted in the function (e.g. to fit a Laplace or a Lidstone model instead of the simple MLE). 
Also briefly explain the differences between the Laplace, Lidstone and simple MLE language models.

*Hint:* Your function might need more than a single additional argument, since some LMs have hyperparameters.
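
One way to parameterize the fitting function is sketched below, assuming the `MLE`, `Laplace` and `Lidstone` classes of `nltk.lm`. The name `fit_lm`, the `**hyperparams` pass-through and the toy corpus are illustrative assumptions. The printed scores show the effect of additive smoothing on an unseen bigram: the estimate becomes $(C(h, w) + \gamma) / (C(h) + \gamma V)$, with $\gamma = 1$ for Laplace and $V$ the vocabulary size.

```python
from nltk.lm import MLE, Laplace, Lidstone
from nltk.lm.preprocessing import padded_everygram_pipeline

def fit_lm(n, tokenized_corpus, model_class=MLE, **hyperparams):
    """Fit an nltk.lm model of the given class and order on a list of token lists."""
    train_ngrams, padded_vocab = padded_everygram_pipeline(n, tokenized_corpus)
    # Extra keyword arguments carry model-specific hyperparameters, e.g. gamma.
    model = model_class(order=n, **hyperparams)
    model.fit(train_ngrams, padded_vocab)
    return model

# Toy corpus (hypothetical) standing in for the tokenized tweets.
toy_corpus = [["Make", "America", "great"], ["America", "first"]]

mle = fit_lm(3, toy_corpus)
laplace = fit_lm(3, toy_corpus, model_class=Laplace)
lidstone = fit_lm(3, toy_corpus, model_class=Lidstone, gamma=0.1)

# "Make first" never occurs in the toy corpus.
for name, lm in [("MLE", mle), ("Laplace", laplace), ("Lidstone", lidstone)]:
    print(name, lm.score("first", ["Make"]))
```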

##### Answer

### Ex. 4.2: Qualitative model comparison 
With $n=1,2,3,4$, fit the simple MLE and the Laplace LMs of order $n$, and generate new tweets from each of them. 
- Compare the results between the different $n$ values and between the two models. 
- What are the main differences for generation? Which model(s) do you think might be the best options for generating new realistic tweets?
- Do you see hints of those differences in the generated tweets?

##### Answer

### Ex. 4.3: Quantitative evaluation and comparison
- Split the tokenized Trump Tweets corpus into a (reproducible) training set (80%) and a test set (20%). 
- Compute the train and test 3-gram perplexity scores of a simple MLE LM, a Laplace LM, and a Lidstone LM with $\gamma=0.1$. Use model order $n=3$ for each.
- Compare and discuss the obtained train and test perplexity scores of the three models. Argue which model might represent the Trump Tweets data best.

*Hint:* To compute the perplexity correctly, you might need to preprocess the relevant corpus documents to a list of padded $n$-grams.
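
The split and the perplexity computation could be organized as in the sketch below. It assumes the `perplexity` method of `nltk.lm` models together with `pad_both_ends` and `nltk.util.ngrams` for building padded $n$-grams; the helper names `split_corpus` and `corpus_perplexity`, the seed value and the toy corpus are illustrative assumptions.

```python
import random
from nltk.lm import MLE, Laplace, Lidstone
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

def split_corpus(tokenized_corpus, test_frac=0.2, seed=0):
    """Reproducible shuffle followed by a train/test split."""
    docs = list(tokenized_corpus)
    random.Random(seed).shuffle(docs)
    n_test = int(round(test_frac * len(docs)))
    return docs[n_test:], docs[:n_test]

def corpus_perplexity(model, tokenized_corpus):
    """Perplexity of the model over all padded n-grams (n = model order) of a corpus."""
    n = model.order
    padded_ngrams = [ng for doc in tokenized_corpus
                     for ng in ngrams(pad_both_ends(doc, n=n), n)]
    return model.perplexity(padded_ngrams)

# Toy corpus (hypothetical) standing in for the tokenized tweets.
toy_corpus = [
    ["I", "will", "win"], ["I", "will", "never", "forget"],
    ["America", "will", "win"], ["We", "will", "win"],
    ["I", "love", "America"], ["We", "love", "America"],
    ["I", "will", "never", "lose"], ["America", "first"],
    ["We", "will", "never", "forget"], ["I", "love", "winning"],
]
train_docs, test_docs = split_corpus(toy_corpus)

models = {"MLE": MLE(3), "Laplace": Laplace(3), "Lidstone": Lidstone(0.1, 3)}
scores = {}
for name, model in models.items():
    # The pipeline returns generators, so it is rebuilt for every model.
    train_ngrams, padded_vocab = padded_everygram_pipeline(3, train_docs)
    model.fit(train_ngrams, padded_vocab)
    scores[name] = (corpus_perplexity(model, train_docs),
                    corpus_perplexity(model, test_docs))
    print(name, scores[name])
```

An unsmoothed MLE model assigns probability zero to any unseen test $n$-gram, in which case its test perplexity is reported as infinite.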

##### Answer

### Ex. 4.4: Hyper-parameter tuning
- Perform a grid-search to select the best hyperparameter values for $n$ and $\gamma$, for the Lidstone LM. You want to select the model that generalizes best to new data.
- What do you observe in the obtained perplexity scores? Was it expected? Explain it in statistical terms.

*Hint:* Consider trying a few values of $n$ and $\gamma$ by hand to identify the general hyperparameter region of interest before defining a more thorough hyperparameter value grid.
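
A grid search of this kind might be organized as in the sketch below, which fits one Lidstone model per $(n, \gamma)$ pair and scores it on held-out documents. The function name `grid_search_lidstone`, the grid values and the toy data are assumptions chosen for illustration, not recommended settings.

```python
import itertools
from nltk.lm import Lidstone
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

def grid_search_lidstone(train_docs, heldout_docs, orders, gammas):
    """Return the (n, gamma) pair with the lowest held-out perplexity, plus all scores."""
    results = {}
    for n, gamma in itertools.product(orders, gammas):
        train_ngrams, padded_vocab = padded_everygram_pipeline(n, train_docs)
        model = Lidstone(gamma, n)
        model.fit(train_ngrams, padded_vocab)
        # Evaluate on padded n-grams of the held-out documents only.
        heldout_ngrams = [ng for doc in heldout_docs
                          for ng in ngrams(pad_both_ends(doc, n=n), n)]
        results[(n, gamma)] = model.perplexity(heldout_ngrams)
    best = min(results, key=results.get)
    return best, results

# Toy data (hypothetical) standing in for the train and held-out tweet sets.
train_docs = [
    ["I", "will", "win"], ["I", "will", "never", "forget"],
    ["America", "will", "win"], ["We", "love", "America"],
    ["I", "love", "America"], ["We", "will", "never", "lose"],
]
heldout_docs = [["We", "will", "win"], ["I", "love", "winning"]]

best, results = grid_search_lidstone(train_docs, heldout_docs,
                                     orders=[1, 2, 3], gammas=[0.01, 0.1])
for (n, gamma), ppl in sorted(results.items()):
    print(n, gamma, round(ppl, 3))
print("best:", best)
```

Selecting on held-out rather than training perplexity is what targets generalization; for a final report, the selected model would ideally be scored once more on data not used in the search.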

##### Answer