# NLP Assignment: Generating Trump Tweets with N-Gram Models

In this assignment, you will use n-gram language models (LM) to model tweets (social media statements) from or about the former U.S. president Donald Trump. The goal will then be to generate new tweets, or do autocompletion, in the writing style of Trump's tweets. The tweets have been scraped from the Twitter social media (since then renamed "X").

Before starting this assignment, the appended `NLP_ngram_cheatsheet.ipynb` notebook provides a tutorial on n-grams and LM basics, using the `nltk` package.

Please code the necessary steps in python, and provide answers in Markdown format in this notebook, under the corresponding instructions and questions below.

Please rename your final file `NLP_Assignment_STUDENTID.ipynb` for submission on moodle, and make sure you "run all" with a fresh kernel, so that outputs show correctly and in order in your submission.

**STUDENT ID:** 19-320-563

### Provided Packages

In [331]:
# Import necessary libraries
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

# Import NLTK and its submodules
import nltk
from nltk import tokenize
from nltk.lm import MLE, Laplace, Lidstone
from nltk.lm.preprocessing import padded_everygram_pipeline, flatten
from nltk.util import ngrams

# Import other libraries
import preprocessor as p
import random
import re
import string as str
import unicodedata

# Download NLTK data
nltk.download('popular', quiet=True)

True

## Part 1: Import, inspect and preprocess the text data

- Import the provided dataset, `Trump_tweets.csv`. We are interested in the variable `Tweet_Text`, which gives the content of each tweet. 
- Before tokenizing, start by cleaning the tweets' format. You should at least normalize the different types of apostrophes and quotes (e.g. `` ’, ”, ` ``) to the corresponding ` ' ` or ` " `, remove line breaks `\n` (careful about not "merging" words), and remove multiple spacing. Also make sure urls (e.g. `https://t.co/wPk7QWpK8Z`) are not split into too many meaningless tokens. 
- (Facultative) Feel free to perform additional cleaning steps that you believe will improve the tokenization or the downstream LMs (in which case, briefly explain why).
- Tokenize the `Tweet_Text` corpus into a list of tokenized tweets (documents). The result should be a list of lists containing word-level tokens (e.g. words, punctuation, and other "special words").
- Show the result for the first five tweets of the corpus.

##### Answer

### Import Data

In [332]:
trump_data = pd.read_csv('Trump_tweets.csv')
trump_data

Unnamed: 0,Date,Time,Tweet_Text,Type,Media_Type,Hashtags,Tweet_Id,Tweet_Url,twt_favourites_IS_THIS_LIKE_QUESTION_MARK,Retweets,Unnamed: 10,Unnamed: 11
0,16-11-11,15:26:37,Today we express our deepest gratitude to all ...,text,photo,ThankAVet,7.970000e+17,https://twitter.com/realDonaldTrump/status/797...,127213,41112,,
1,16-11-11,13:33:35,Busy day planned in New York. Will soon be mak...,text,,,7.970000e+17,https://twitter.com/realDonaldTrump/status/797...,141527,28654,,
2,16-11-11,11:14:20,Love the fact that the small groups of protest...,text,,,7.970000e+17,https://twitter.com/realDonaldTrump/status/797...,183729,50039,,
3,16-11-11,2:19:44,Just had a very open and successful presidenti...,text,,,7.970000e+17,https://twitter.com/realDonaldTrump/status/796...,214001,67010,,
4,16-11-11,2:10:46,A fantastic day in D.C. Met with President Oba...,text,,,7.970000e+17,https://twitter.com/realDonaldTrump/status/796...,178499,36688,,
...,...,...,...,...,...,...,...,...,...,...,...,...
7370,15-07-16,13:10:00,I loved firing goofball atheist Penn @pennjill...,text,,,6.220000e+17,https://twitter.com/realDonaldTrump/status/621...,953,431,,
7371,15-07-16,10:18:31,I hear @pennjillette show on Broadway is terri...,text,,,6.220000e+17,https://twitter.com/realDonaldTrump/status/621...,1175,1086,,
7372,15-07-16,10:10:17,Irrelevant clown @KarlRove sweats and shakes n...,text,,,6.220000e+17,https://twitter.com/realDonaldTrump/status/621...,1494,930,,
7373,15-07-16,9:44:07,"""@HoustonWelder: Donald Trump is one of the se...",text,,,6.220000e+17,https://twitter.com/realDonaldTrump/status/621...,1800,1738,,


### Data Info

In [333]:
trump_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7375 entries, 0 to 7374
Data columns (total 12 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Date                                       7375 non-null   object 
 1   Time                                       7375 non-null   object 
 2   Tweet_Text                                 7375 non-null   object 
 3   Type                                       7375 non-null   object 
 4   Media_Type                                 1225 non-null   object 
 5   Hashtags                                   2031 non-null   object 
 6   Tweet_Id                                   7375 non-null   float64
 7   Tweet_Url                                  7375 non-null   object 
 8   twt_favourites_IS_THIS_LIKE_QUESTION_MARK  7375 non-null   int64  
 9   Retweets                                   7375 non-null   int64  
 10  Unnamed: 10             

### Select *Tweet_Text* Only

In [334]:
trump_tweet_text = trump_data[["Tweet_Text"]].copy()
trump_tweet_text

Unnamed: 0,Tweet_Text
0,Today we express our deepest gratitude to all ...
1,Busy day planned in New York. Will soon be mak...
2,Love the fact that the small groups of protest...
3,Just had a very open and successful presidenti...
4,A fantastic day in D.C. Met with President Oba...
...,...
7370,I loved firing goofball atheist Penn @pennjill...
7371,I hear @pennjillette show on Broadway is terri...
7372,Irrelevant clown @KarlRove sweats and shakes n...
7373,"""@HoustonWelder: Donald Trump is one of the se..."


### Drop NA and Duplicates 

In [336]:
trump_tweet_text = trump_tweet_text.dropna()
trump_tweet_text = trump_tweet_text.drop_duplicates()
trump_tweet_text

Unnamed: 0,Tweet_Text
0,Today we express our deepest gratitude to all ...
1,Busy day planned in New York. Will soon be mak...
2,Love the fact that the small groups of protest...
3,Just had a very open and successful presidenti...
4,A fantastic day in D.C. Met with President Oba...
...,...
7370,I loved firing goofball atheist Penn @pennjill...
7371,I hear @pennjillette show on Broadway is terri...
7372,Irrelevant clown @KarlRove sweats and shakes n...
7373,"""@HoustonWelder: Donald Trump is one of the se..."


### Retweet VS Quote Tweets

<img src="IMAGES/Differences between Retweet and Quote Tweet.png" width=40% />

#### RT (Retweet)

In [337]:
sample1 = trump_tweet_text["Tweet_Text"].iloc[8]
sample1

'RT @IvankaTrump: Such a surreal moment to vote for my father for President of the United States! Make your voice heard and vote! #Election2_'

> A Retweet is generally just a copy of another user's tweet, displayed as a post on the retweeting user's profile (here Trump's). Trump's own way of speaking therefore does not appear in this category of tweets.

#### QT (Quote Tweet)

In [338]:
sample2 = trump_tweet_text["Tweet_Text"].iloc[-2]
sample2

'"@HoustonWelder: Donald Trump is one of the sexiest men on this planet. Every woman dreams of a good man who tells it like it is." So true!'

> Quote Tweets may contain some original text by Trump, mainly comments added after a quote from someone else. Here Trump only says "So true!", but we want to keep this dynamic of him quoting someone else and adding his own words.

### Tweets Types in our Data

In [339]:
trump_tweet_text['Type'] = np.where(trump_tweet_text["Tweet_Text"].str.startswith('RT'), 'RETWEET', 
                       np.where(trump_tweet_text["Tweet_Text"].str.startswith('"@'), 'QUOTE RETWEET', 'ORIGINAL'))

trump_tweet_text["Type"].value_counts()

ORIGINAL         4809
QUOTE RETWEET    2125
RETWEET           430
Name: Type, dtype: int64

> Here we can identify a RT by the leading _**RT**_ token, and a QT by the fact that it always starts with _"@_, which mentions the user being quoted.


### Removing Retweet and keeping Quote Retweet

> I believe that keeping the Retweets is not crucial for replicating Trump's tweet style. At best, the model would generate tweets that he would endorse or deem worthy of sharing with his followers. However, it would also risk attributing other people's words to him without capturing his own tone or thought process. In contrast, preserving the Quote Tweets could be more valuable, as they often reflect Trump's tendency to comment on or criticize existing tweets, which could help the model capture this aspect of his behavior.

In [340]:
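# .copy() makes the filtered frame independent, so adding columns later does not raise SettingWithCopyWarning
trump_tweet_text = trump_tweet_text[trump_tweet_text['Type'] != 'RETWEET'].copy()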
trump_tweet_text 

Unnamed: 0,Tweet_Text,Type
0,Today we express our deepest gratitude to all ...,ORIGINAL
1,Busy day planned in New York. Will soon be mak...,ORIGINAL
2,Love the fact that the small groups of protest...,ORIGINAL
3,Just had a very open and successful presidenti...,ORIGINAL
4,A fantastic day in D.C. Met with President Oba...,ORIGINAL
...,...,...
7369,I hope the boycott of @Macys continues forever...,ORIGINAL
7370,I loved firing goofball atheist Penn @pennjill...,ORIGINAL
7371,I hear @pennjillette show on Broadway is terri...,ORIGINAL
7372,Irrelevant clown @KarlRove sweats and shakes n...,ORIGINAL


### Some Cleaning

In [341]:
# Define a function to clean the tweets
def clean_tweet(tweet):
    
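    # Strip non-ASCII characters (NFKD decomposition, then drop what cannot be encoded).
    # Note: curly apostrophes and quotes are removed here, not mapped to ' or "
    tweet = unicodedata.normalize('NFKD', tweet).encode('ascii', 'ignore').decode('utf-8')
    
    # Remove line breaks
    tweet = tweet.replace('\n', ' ')
    
    # Remove multiple spacing
    tweet = re.sub(r'\s+', ' ', tweet)
    
    # URLs are kept: the tweet tokenizer used below returns each URL as one token
    #tweet = re.sub(r'https?://\S+', '', tweet)
    
    # Remove special characters (optional, disabled)
    #tweet = re.sub(r'[^a-zA-Z0-9\s]', '', tweet)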
    
    return tweet
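
The cleaning step above removes typographic quotes instead of converting them, which is why `you've` becomes `youve` in the example below. A minimal sketch of the alternative, mapping them to ASCII first, could look like this. `QUOTE_MAP` and `normalize_tweet` are hypothetical names, the character list is illustrative only, and any dataset-specific mis-encoded characters would have to be added after inspecting the data. The mapping is written as a plain dict of code points because this notebook imports `string as str`, which hides the built-in `str`.

```python
import re

# Typographic quotes mapped to ASCII equivalents (illustrative, not exhaustive).
QUOTE_MAP = {
    0x2018: "'",  # left single quotation mark
    0x2019: "'",  # right single quotation mark / curly apostrophe
    0x0060: "'",  # backtick
    0x00B4: "'",  # acute accent used as apostrophe
    0x201C: '"',  # left double quotation mark
    0x201D: '"',  # right double quotation mark
}

def normalize_tweet(tweet):
    """Map curly quotes to ASCII, then collapse line breaks and repeated spaces."""
    tweet = tweet.translate(QUOTE_MAP)
    # \s+ also matches '\n', so a line break becomes one space and words are not merged
    return re.sub(r"\s+", " ", tweet).strip()

print(normalize_tweet("you\u2019ve\nwritten  off \u201ctens of millions\u201d"))
```

Applying such a mapping before the ASCII filter would keep the apostrophes, at the cost of changing the token counts reported in the rest of the notebook.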

In [355]:
# Apply the cleaning function to the Tweet_Text column
trump_tweet_text['Tweet_Text_Clean'] = trump_tweet_text['Tweet_Text'].apply(clean_tweet)
row = 438
print(" BEFORE CLEAN: \n\n",trump_tweet_text["Tweet_Text"].iloc[row], "\n\n\n","AFTER CLEAN: \n\n",trump_tweet_text["Tweet_Text_Clean"].iloc[row])

 BEFORE CLEAN: 

 @AC360: How can you unite a country if you۪ve written off tens of millions of Americans?۝ #Deplorables #BigLeagueTruth #Debate 


 AFTER CLEAN: 

 @AC360: How can you unite a country if youve written off tens of millions of Americans? #Deplorables #BigLeagueTruth #Debate







### Tokenization

####  On one sentence

In [356]:
text = trump_tweet_text["Tweet_Text_Clean"].iloc[800]
text_tokens = nltk.casual_tokenize(text)
text_tokens

['Thank', 'you', 'Ohio', '!', '#AmericaFirst', 'https://t.co/p68GAJdhwu']

#### On whole Corpus

In [357]:
trump_tweet_text["Tweet_Tokens"] = trump_tweet_text["Tweet_Text_Clean"].apply(nltk.casual_tokenize)

# First 5 Sentences
for i in range(5):
    print("-"*150,"\n\n LINE",i,":",trump_tweet_text["Tweet_Text"].iloc[i], "\n\n", "TOKENS:",trump_tweet_text["Tweet_Tokens"].iloc[i], "\n")

------------------------------------------------------------------------------------------------------------------------------------------------------ 

 LINE 0 : Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk7QWpK8Z 

 TOKENS: ['Today', 'we', 'express', 'our', 'deepest', 'gratitude', 'to', 'all', 'those', 'who', 'have', 'served', 'in', 'our', 'armed', 'forces', '.', '#ThankAVet', 'https://t.co/wPk7QWpK8Z'] 

------------------------------------------------------------------------------------------------------------------------------------------------------ 

 LINE 1 : Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government! 

 TOKENS: ['Busy', 'day', 'planned', 'in', 'New', 'York', '.', 'Will', 'soon', 'be', 'making', 'some', 'very', 'important', 'decisions', 'on', 'the', 'people', 'who', 'will', 'be', 'running', 'our', 'government', '!'] 

---






## Part 2: Fitting and Accessing a Trump Tweet LM

### Ex. 2.1: LM fitting function
Create a function that takes as arguments (at least) the desired order $n$ of the model and a tokenized training corpus, and that returns the "simple" Maximum Likelihood Estimator (MLE) language model, fitted on the given training corpus.  

Then, use your function to fit a MLE language model of order $n=3$ to the Trump Tweets corpus.

##### Answer

### Corpus

In [358]:
corp = trump_tweet_text["Tweet_Tokens"].tolist()

for i in range(5):
    print(corp[i],"\n")

['Today', 'we', 'express', 'our', 'deepest', 'gratitude', 'to', 'all', 'those', 'who', 'have', 'served', 'in', 'our', 'armed', 'forces', '.', '#ThankAVet', 'https://t.co/wPk7QWpK8Z'] 

['Busy', 'day', 'planned', 'in', 'New', 'York', '.', 'Will', 'soon', 'be', 'making', 'some', 'very', 'important', 'decisions', 'on', 'the', 'people', 'who', 'will', 'be', 'running', 'our', 'government', '!'] 

['Love', 'the', 'fact', 'that', 'the', 'small', 'groups', 'of', 'protesters', 'last', 'night', 'have', 'passion', 'for', 'our', 'great', 'country', '.', 'We', 'will', 'all', 'come', 'together', 'and', 'be', 'proud', '!'] 

['Just', 'had', 'a', 'very', 'open', 'and', 'successful', 'presidential', 'election', '.', 'Now', 'professional', 'protesters', ',', 'incited', 'by', 'the', 'media', ',', 'are', 'protesting', '.', 'Very', 'unfair', '!'] 

['A', 'fantastic', 'day', 'in', 'D', '.', 'C', '.', 'Met', 'with', 'President', 'Obama', 'for', 'first', 'time', '.', 'Really', 'good', 'meeting', ',', 'great',

> Checking that our Corpus is correct for further processing...

### N-Grams and Padded Vocabulary

In [359]:
training_neverygrams, padded_vocab_stream = padded_everygram_pipeline(3, corp)

line_count = 0

print('==== n-everygram data (n=3) for each sequence in "Corpus": ====\n')
for ngramlize_sent in training_neverygrams:
    print(list(ngramlize_sent))
    print()
    line_count += 1
    if line_count > 5:
        break

print('==== Vocabulary data: ====\n')
print(list(padded_vocab_stream))

==== n-everygram data (n=3) for each sequence in "Corpus": ====

[('<s>',), ('<s>', '<s>'), ('<s>', '<s>', 'Today'), ('<s>',), ('<s>', 'Today'), ('<s>', 'Today', 'we'), ('Today',), ('Today', 'we'), ('Today', 'we', 'express'), ('we',), ('we', 'express'), ('we', 'express', 'our'), ('express',), ('express', 'our'), ('express', 'our', 'deepest'), ('our',), ('our', 'deepest'), ('our', 'deepest', 'gratitude'), ('deepest',), ('deepest', 'gratitude'), ('deepest', 'gratitude', 'to'), ('gratitude',), ('gratitude', 'to'), ('gratitude', 'to', 'all'), ('to',), ('to', 'all'), ('to', 'all', 'those'), ('all',), ('all', 'those'), ('all', 'those', 'who'), ('those',), ('those', 'who'), ('those', 'who', 'have'), ('who',), ('who', 'have'), ('who', 'have', 'served'), ('have',), ('have', 'served'), ('have', 'served', 'in'), ('served',), ('served', 'in'), ('served', 'in', 'our'), ('in',), ('in', 'our'), ('in', 'our', 'armed'), ('our',), ('our', 'armed'), ('our', 'armed', 'forces'), ('armed',), ('armed', 'forc

### Model Function for Fit - MLE of order 3

In [362]:
def train_language_model(n, corpus, model_type='mle'):
    
    model_classes = {'mle': MLE, 'laplace': Laplace}
    ModelClass = model_classes.get(model_type, MLE)

    # Build padded everygrams (orders 1..n) and the padded vocabulary stream
    train_ngrams, vocab = padded_everygram_pipeline(n, corpus)
    model = ModelClass(n)
    model.fit(train_ngrams, vocab)

    num_tokens_after = len(model.vocab)

    print("\nTokens after fitting:", num_tokens_after)

    print("\n",model)

    # Compare the model vocabulary with the distinct tokens of the corpus passed in
    # (the `corpus` argument, not the global `corp`)
    model_tokens = set(model.vocab)
    corpus_tokens = set(flatten(corpus))
    print("\nDifferences with Corpus: ", len(model_tokens) - len(corpus_tokens), "( Corpus has",len(corpus_tokens),")", "\n\n Tokens differences between Model and Corpus:",model_tokens - corpus_tokens)

    return model

In [363]:
trump_model = train_language_model(n = 3 , corpus = corp, model_type = "mle")


Tokens after fitting: 16338

 <nltk.lm.models.MLE object at 0x283858640>

Differences with Corpus:  3 ( Corpus has 16335 ) 

 Tokens differences between Model and Corpus: {'<UNK>', '</s>', '<s>'}


### Ex. 2.2: Vocabulary
- How many distinct tokens are in the model's vocabulary? Is that the same number of distinct tokens that appear in the tokenized corpus?
- Lookup the tokens of the sentence `"I love UNIGE students!"` in the model's vocabulary. Explain what you observe, and why. 

##### Answer

In [364]:
trump_model = train_language_model(n = 3 , corpus = corp, model_type = "mle")


Tokens after fitting: 16338

 <nltk.lm.models.MLE object at 0x296298040>

Differences with Corpus:  3 ( Corpus has 16335 ) 

 Tokens differences between Model and Corpus: {'<UNK>', '</s>', '<s>'}


> The model's vocabulary has 16338 distinct tokens, 3 more than the tokenized corpus (16335). The extra tokens are **`<s>`** and **`</s>`**, the padding symbols that mark the start and the end of each tweet, and **`<UNK>`**, the label assigned to out-of-vocabulary tokens (tokens unseen in training or below the vocabulary cut-off).

`"I love UNIGE students!"`

In [365]:
print(trump_model.vocab.lookup('I love UNIGE students!'.split()))

('I', 'love', '<UNK>', '<UNK>')


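> `.split()` only splits on whitespace, so the looked-up tokens are `I`, `love`, `UNIGE` and `students!`. Neither `UNIGE` nor `students!` (with the exclamation mark attached) is in the vocabulary of the MLE model, so both are mapped to the **UNK** token. We can check vocabulary membership directly: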

In [366]:
"UNIGE" in trump_model.vocab

False

In [367]:
"Student" in trump_model.vocab

False

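> This confirms that `UNIGE` and `Student` do not occur in the original corpus. Note that this check uses the capitalized singular `Student`, whereas the lookup above received `students!` with the punctuation attached.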
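
To separate the effect of tokenization from the effect of the vocabulary, the sentence can be tokenized before the lookup. The sketch below uses a small regular expression as a stand-in for the tweet tokenizer and a toy vocabulary, so `simple_tokenize`, `lookup` and `toy_vocab` are hypothetical and the result says nothing about the fitted model.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def lookup(tokens, vocab, unk="<UNK>"):
    """Replace out-of-vocabulary tokens by the unknown label."""
    return tuple(tok if tok in vocab else unk for tok in tokens)

toy_vocab = {"I", "love", "students", "!"}  # assumed toy vocabulary
sentence = "I love UNIGE students!"

# Whitespace split: 'students!' stays glued together and is unknown
print(lookup(sentence.split(), toy_vocab))
# Tokenized first: 'students' and '!' are looked up separately
print(lookup(simple_tokenize(sentence), toy_vocab))
```

With the fitted model, the same idea would be to pass the output of the tokenizer used for the corpus to `trump_model.vocab.lookup` instead of the output of `.split()`.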

### Ex. 2.3: Token probabilities
- When it comes to ngram models the training boils down to counting the ngrams from the training corpus. Using your fitted model, how many times do the following appear in the training data: ``'America', 'Trump', 'I will', 'will never forget'``.
- Then, compute the following word occurrence probabilities ('scores') in the Trump Tweets corpus, and briefly explain what the returned numbers mean about the training data:
    - $\mathbb{P}($'America'$)$,
    - $\mathbb{P}($'Trump'$)$,
    - $\mathbb{P}($'will'$\vert $'I'$)$,
    - $\mathbb{P}($'forget'$\vert $'will never'$)$.
- Briefly explain, with a formula, how those probabilities are obtained from the n-gram counts.

##### Answer

### Unigrams

In [686]:
print("'America' count is:",trump_model.counts['America'], "\n'Trump' count is:",trump_model.counts['Trump'], "\n'I will' count is:",trump_model.counts['I will'], "\n'will never forget' count is:",trump_model.counts['will never forget'])

'America' count is: 250 
'Trump' count is: 912 
'I will' count is: 0 
'will never forget' count is: 0


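> The zero counts for `'I will'` and `'will never forget'` are lookup artefacts: a plain string key is treated as one single unigram token. To read the count of an n-gram, the context has to be given as a list of separate tokens, e.g. `counts[['I']]['will']` or `counts[['will', 'never']]['forget']`. For the same reason, a context written as `['will never']` is one (unseen) token and also returns 0.

### Bigrams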

In [687]:
print("\n'I' + 'will' count is:",trump_model.counts[['I']]['will'], "\n'will never' + 'forget' count is:",trump_model.counts[['will never']]['forget'])


'I' + 'will' count is: 344 
'will never' + 'forget' count is: 0


### Trigrams

In [688]:
print("'will' + 'never' + 'forget' count is:",trump_model.counts[['will', 'never']]['forget'])

'will' + 'never' + 'forget' count is: 8


### Probabilities

$\mathbb{P}($'America'$)$

In [692]:
print(trump_model.score('America')*100,"%")

0.14618686189434785 %


> "*America*" makes up only about 0.15% of the tokens in the training data (250 occurrences).

$\mathbb{P}($'america'$)$ if we also want lowercase

In [694]:
print(trump_model.score('america')*100,"%")

0.0011694948951547826 %


> The lowercase form is much rarer still.

$\mathbb{P}($'Trump'$)$

In [696]:
print(trump_model.score('Trump')*100,"%")

0.5332896721905809 %


> "*Trump*" makes up about 0.53% of the tokens in the training data (912 occurrences). Since he writes as himself, these occurrences are likely to come largely from the tweets he quotes (Quote Tweets), which contain his name.

$\mathbb{P}($'trump'$)$ if we also want lowercase

In [375]:
print(trump_model.score('trump')*100,"%")

0.01812717087489913 %


> Here too, the lowercase form is much rarer.

$\mathbb{P}($'will'$\vert $'I'$)$

In [376]:
print(trump_model.score('will', 'I'.split())*100,"%")

20.53731343283582 %


> The probability is much higher here: about one in five occurrences of "*I*" is followed by "*will*". This makes sense, since he is the person writing and "*will*" very commonly follows the pronoun "*I*" in political speech.

$\mathbb{P}($'forget'$\vert $'will never'$)$

In [377]:
print(trump_model.score('forget', 'will never'.split())*100,"%")

28.57142857142857 %


> The negation "*will never*" followed by the verb "*forget*" is also very common (8 of the occurrences of "*will never*"). Political speech again seems to use the sentence "*I will never forget...*" often.
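
The probabilities above follow from the n-gram counts through the standard MLE estimate, sketched here. $C(\cdot)$ denotes an n-gram count in the training corpus and $N$ the total number of unigram tokens. The values of $C(\text{I})$, $C(\text{will never})$ and $N$ are not printed above, they are only implied by the printed scores.

$$
\mathbb{P}(w) = \frac{C(w)}{N}, \qquad
\mathbb{P}(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \frac{C(w_{i-n+1}, \dots, w_{i-1}, w_i)}{C(w_{i-n+1}, \dots, w_{i-1})}
$$

For example, $\mathbb{P}(\text{forget} \mid \text{will never}) = 8 / C(\text{will never}) \approx 0.2857$ implies $C(\text{will never}) = 28$, $\mathbb{P}(\text{will} \mid \text{I}) = 344 / C(\text{I}) \approx 0.2054$ implies $C(\text{I}) = 1675$, and $\mathbb{P}(\text{America}) = 250 / N \approx 0.00146$ implies $N \approx 171{,}014$.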

## Part 3: Generation using N-gram Language Model

### Ex. 3.1: Tweet generator
Create a python function to generate new Trump Tweets. It should:
- take as input arguments: a fitted `nltk.lm.model`, a maximum number of words (integer), a text seed (initial context tokens), and a random "RNG" seed for generation,
- output a newly generated Trump Tweet, according to the input arguments, post-processed as a single text string that is formatted like a tweet.

*Hints:* `nltk.tokenize.treebank.TreebankWordDetokenizer()` and its `.detokenize()` method can help with post-processing. Pay attention to show things like `@user` mentions, urls, punctuation, etc... in a "correct" format.

##### Answer

In [378]:
def generate_tweet(language_model, text_seed = None, max_words = 20, random_seed = 42):
    """
    Generates a tweet from the given language model.

    Args:
        language_model: A trained n-gram language model.
        text_seed: Optional list of initial context tokens.
        max_words: The maximum number of words to generate.
        random_seed: The random seed value for reproducibility.

    Returns:
        A string of generated text, formatted like a tweet.
    """
    generated_text = []
    for word in language_model.generate(max_words,  text_seed, random_seed):
        if word == '<s>':  # Skip start token
            continue
        if word == '</s>':  # Stop generating at end token
            break
        generated_text.append(word)

    # Check if only one word is generated and force restart
    if len(generated_text) <= 1:
        return generate_tweet(language_model, text_seed, max_words, random_seed+1)

    # Detokenize the generated text
    detokenizer = nltk.tokenize.treebank.TreebankWordDetokenizer()
    tweet = detokenizer.detokenize(generated_text)

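    # Capitalize the tweet (note: capitalize() also lowercases every other
    # character, including URLs, @mentions and proper nouns)
    tweet = tweet.capitalize()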

    # Remove spaces before punctuations
    for punctuation in str.punctuation:
        if punctuation not in ['@', '#', '&']:
            tweet = tweet.replace(" " + punctuation, punctuation)

    while tweet and not tweet[0].isalnum():
        tweet = tweet[1:]

    return tweet
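
Because `capitalize()` lowercases everything after the first character, the generated URLs and names in the outputs below come out in lowercase. A minimal sketch of a gentler alternative, assuming only the first character should change, is shown here. `capitalize_first` is a hypothetical helper, not part of the function above.

```python
def capitalize_first(text):
    """Upper-case only the first character and leave the rest untouched."""
    return text[:1].upper() + text[1:]

sample = "thank you @CNN https://t.co/AbC"
print(sample.capitalize())       # lowercases the mention and the URL
print(capitalize_first(sample))  # keeps their original case
```

Swapping this in for `tweet.capitalize()` would change the generated outputs shown in this notebook, since their casing would then follow the corpus.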


### Generated Tweet

In [379]:
generate_tweet(trump_model, max_words = 20, random_seed = 42)

'Cut us military veterans pensions. http://t.co/dfufutibxy"'

### Generated Tweet at Random

In [380]:
seed_random = random.randint(0,1000)

generate_tweet(trump_model, max_words = 20, random_seed = seed_random)

'Smart and very boring speech. do you ever notice that @cnn gives me an award, your so-called"'

### Ex. 3.2: Initial context
To generate a full tweet from a LM of order $n$, explain what should be the text seed (i.e. the initial context tokens). Set the default value for the relevant argument of your function in 3.1 accordingly.

##### Answer

In [381]:
seed_random = random.randint(0,1000)

generate_tweet(trump_model, max_words = 20,  text_seed = None,random_seed = seed_random)

'Also, tune in to watch bad product that only builds up crooked hillary clinton'

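> By default (`text_seed = None`) no context is imposed, because fixing a context by default would restrict the generated tweets too much. Note, however, that with an empty seed the first word is drawn from the unigram distribution, so the tweet can start mid-sentence. To generate a *full* tweet from a LM of order $n$, the seed should be the $n-1$ start-of-sentence padding tokens `<s>`, so that the first word is conditioned on the beginning of a tweet.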
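
A minimal sketch of such a start-of-tweet seed, assuming the `<s>` padding symbol that appears in the everygram output of Part 2. `start_of_tweet_seed` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def start_of_tweet_seed(order, pad_symbol="<s>"):
    """Return the n-1 padding tokens that precede the first word of a padded tweet."""
    return [pad_symbol] * max(order - 1, 0)

for n in range(1, 5):
    print(n, start_of_tweet_seed(n))
```

It could then be passed as `text_seed = start_of_tweet_seed(3)` to `generate_tweet` for the trigram model. For a unigram model the seed is empty, since there is no context to condition on.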

### Ex. 3.3: Generate tweets
Generate a few new tweets using your new function and the LM fitted in Part 2. For reproducibility, use a random RNG seed to show them. 

*Facultative:* show a few examples that you find interesting, representative or funny.

##### Answer

In [382]:
seed_random = 1000

generate_tweet(trump_model, max_words = 20,  text_seed = ["Russia"],random_seed = seed_random)

'Leaked the disastrous dnc e-mails, which are total losers!'

In [383]:
seed_random = 23

generate_tweet(trump_model, max_words = 20,  text_seed = ["America"],random_seed = seed_random)

'Needs you to @foxandfriends for the lies, and it is certainly my intention to be less predictable.'

In [384]:
seed_random = 20

generate_tweet(trump_model, max_words = 20,  text_seed = ["Clinton"],random_seed = seed_random)

'Supporter @alisonforky declare crooked hillary cant even send emails without putting entire nation at risk?'

## Part 4: Smoothing and model comparison

### Ex. 4.1: Smoothed LM alternatives to simple MLE
Modify the function that you defined in 2.1 by adding an argument that allows changing the `nltk.lm` language model that is fitted in the function (e.g. to fit a Laplace or a Lidstone model instead of the simple MLE). 
Also briefly explain what is the difference between Laplace, Lidstone and the simple MLE language models.

*Hint:* Your function might need more than a single additional argument, if some LM have hyperparameters.

##### Answer

The three models differ in how they turn n-gram counts into probabilities.

**MLE** uses the observed relative frequencies only, so any n-gram that does not occur in the training corpus gets probability zero. It has no hyperparameter.

**Laplace** smoothing adds 1 to every count before normalizing, so no n-gram has zero probability, even if unseen. The added probability mass is spread evenly over the vocabulary, which can take a lot of mass away from observed n-grams and overestimate rare or unseen events when the vocabulary is large. The added constant is fixed, so there is nothing to tune.

**Lidstone** smoothing generalizes Laplace by adding a constant $\gamma$ (instead of 1) to every count. $\gamma$ is a hyperparameter that controls the amount of smoothing and can be tuned to the data: $\gamma = 1$ gives Laplace, and $\gamma \to 0$ approaches MLE.
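
The difference between the three estimators can be illustrated with a small additive-smoothing sketch, $(C + \gamma) / (C_{\text{context}} + \gamma \lvert V \rvert)$. `lidstone_score` is a hypothetical helper and not the library implementation. The counts 8 and 16338 are taken from Part 2, and the context count 28 is the value implied by the score printed there.

```python
def lidstone_score(ngram_count, context_count, vocab_size, gamma):
    """Additive smoothing: (count + gamma) / (context_count + gamma * vocab_size)."""
    return (ngram_count + gamma) / (context_count + gamma * vocab_size)

V = 16338  # vocabulary size reported in Part 2

# gamma = 0 is the MLE, gamma = 1 is Laplace, anything in between is Lidstone
for name, gamma in [("MLE", 0.0), ("Lidstone", 0.1), ("Laplace", 1.0)]:
    seen = lidstone_score(8, 28, V, gamma)    # a continuation seen 8 times
    unseen = lidstone_score(0, 28, V, gamma)  # a continuation never seen
    print(name, seen, unseen)
```

With a vocabulary this large, even a small $\gamma$ moves most of the probability mass of a rare context to unseen continuations, which is why the choice of $\gamma$ matters.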

In [385]:
def train_language_models(n, corpus, model_type='mle', gamma = 0.1):
    """
    Train a language model using maximum likelihood estimation (MLE),
    Laplace smoothing, or Lidstone smoothing.

    Args:
        n (int): The order of the language model.
        corpus (list): The training corpus as a list of sentences or tokens.
        model_type (str, optional): The type of model to train.
            Valid options: 'mle' (default), 'laplace', 'lidstone'.
        gamma (float, optional): The additive constant for Lidstone smoothing
            (ignored by the other models). Defaults to 0.1.

    Returns:
        object: The trained language model.

    Raises:
        ValueError: If an invalid model_type is specified.

    """
    
    model_classes = {'mle': MLE, 'laplace': Laplace, 'lidstone': Lidstone}
    
    # Fail early on an unknown model type, as documented above
    if model_type not in model_classes:
        raise ValueError(f"Invalid model_type: {model_type!r}")
    ModelClass = model_classes[model_type]

    train_ngrams, vocab = padded_everygram_pipeline(n, text = corpus)

    if model_type == "lidstone": 
        model = ModelClass(order = n, gamma = gamma)
    else:
        model = ModelClass(order = n)

    model.fit(text = train_ngrams, vocabulary_text = vocab)

    num_tokens_after = len(model.vocab)

    print("\nTokens after fitting:", num_tokens_after)

    return model

### Ex. 4.2: Qualitative model comparison 
With $n=1,2,3,4$, fit and generate new tweets from the simple MLE and from the Laplace LM of orders $n$. 
- Compare the results between the different $n$ values and between the two models. 
- What are the main differences for generation? Which model(s) do you think might be the best options for generating new realistic tweets?
- Do you see hints of those differences in the generated tweets?

##### Answer

### Fitting n = 1,2,3,4 for MLE

In [391]:
randomseed1 = 500

print("MLE MODEL\n--------------")
for i in range(1,5):
   print("Order:" , i, "\nTweet:", generate_tweet(train_language_models(n = i, corpus = corp, model_type='mle'), max_words = 20,  text_seed = None, random_seed = randomseed1),"\n-------------")

MLE MODEL
--------------

Tokens after fitting: 16336
Order: 1 
Tweet: Our was dishonest @realdonaldtrump nationally nafta out i days one! hillary rally crooked to #debate vote enjoy., 
-------------

Tokens after fitting: 16338
Order: 2 
Tweet: Of weeks i dont want our people are not run! 
-------------

Tokens after fitting: 16338
Order: 3 
Tweet: Never will." great! 
-------------

Tokens after fitting: 16338
Order: 4 
Tweet: Me was the highest rated show that they have long dreamed of- and no effective raise in years. 
-------------


### Fitting n = 1,2,3,4 for Laplace

In [390]:
randomseed1 = 500

print("LAPLACE MODEL\n--------------")
for i in range(1,5):
   print("Order:" , i, "\nTweet:", generate_tweet(train_language_models(n = i, corpus = corp,  model_type = "laplace"), max_words = 20,  text_seed = None, random_seed = randomseed1),"\n-------------")

LAPLACE MODEL
--------------

Tokens after fitting: 16336
Order: 1 
Tweet: On voters dishonest @swamp_bug millions-we my open i deal on! hillary presidential cruz to #makeamericagreatagain us ethics.- 
-------------

Tokens after fitting: 16338
Order: 2 
Tweet: Numbers were me. thank you need change his show! 
-------------

Tokens after fitting: 16338
Order: 3 
Tweet: Must talk to will? he was done. these #veterans are living in poverty, education and safety of 
-------------

Tokens after fitting: 16338
Order: 4 
Tweet: Madness must be stopped in congress. stand up republicans! 
-------------


- Compare the results between the different $n$ values and between the two models. 

> As the order of the model increases, the generated tweets look better for both models, with fewer odd artefacts and illogical sentences. In these samples, the **Laplace** model seems to be coherent more often than **MLE**.

- What are the main differences for generation? Which model(s) do you think might be the best options for generating new realistic tweets?

> **MLE** samples the next word strictly according to the frequencies observed in the corpus, which is very visible at low orders, where the word sequences are clearly disconnected. In contrast, at the same order, **Laplace** seems to choose more general words that follow the observed n-grams less closely, so its output looks "smoother" and is less predictable.

- Do you see hints of those differences in the generated tweets?

> The 4th tweet of each model (order = 4) is a good example: the **MLE** tweet chains word sequences that each follow the corpus closely over four tokens, but the transitions between them blend illogically.

### Ex. 4.3: Quantitative evaluation and comparison
- Split the tokenized Trump Tweets corpus into a (reproducible) training set (80%) and a test set (20%). 
- Compute the train and test 3-gram perplexity scores of a simple MLE LM, a Laplace LM, and a Lidstone LM with $\gamma=0.1$. Use model order $n=3$ for each.
- Compare and discuss the obtained train and test perplexity scores of the three models. Argue which model might represent the Trump Tweets data best.

*Hint:* To compute the perplexity correctly, you might need to preprocess the relevant corpus documents to a list of padded $n$-grams.

##### Answer

### Custom Functions

In [392]:
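def random_select_lists(data, percentage, seed=42):
    # Sample indices rather than items, so duplicate tweets are not dropped from the remainder
    random.seed(seed)
    num_lists_to_select = int(len(data) * percentage)
    selected_indices = random.sample(range(len(data)), num_lists_to_select)
    selected_lists = [data[i] for i in selected_indices]
    chosen = set(selected_indices)
    remaining_lists = [lst for i, lst in enumerate(data) if i not in chosen]
    return selected_lists, remaining_lists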

### Splitting the Corpus into Training and Test

In [393]:
Train_Corpus, Test_Corpus = random_select_lists(corp, 0.8)

print("\n Train Corpus:",Train_Corpus, "\n\n Test_Corpus:",Test_Corpus)

print("\n-------------------------------\n\n",len(Train_Corpus),"(80% Train)", "+",len(Test_Corpus),"(20% Test)", "=", len(Train_Corpus)+len(Test_Corpus), "\n\n Total Corpus:", len(corp))


 Train Corpus: [['"', '@PattiPav1', ':', '@realDonaldTrump', 'Great', 'interview', '!', '"', 'Thanks'], ['My', 'thoughts', 'and', 'prayers', 'are', 'with', 'the', 'victims', 'and', 'families', 'of', 'those', 'affected', 'by', 'two', 'powerful', 'earthquakes', 'in', 'Italy', 'and', 'Myanmar', '.'], ['Huma', 'Abedin', 'told', 'Clinton', 'her', 'secret', 'email', 'account', 'caused', 'problems', 'https://t.co/i4zN2QzKnf'], ['By', 'self-funding', 'my', 'campaign', ',', 'I', 'am', 'not', 'controlled', 'by', 'my', 'donors', ',', 'special', 'interests', 'or', 'lobbyists', '.', 'I', 'am', 'only', 'working', 'for', 'the', 'people', 'of', 'the', 'U', '.', 'S', '.', '!'], ['"', '@Mutual408Grace', ':', '@realDonaldTrump', '@gene70', 'California', 'women', 'love', 'Mr', 'Trump', 'too', '.', 'Will', 'make', 'it', 'happen', 'in', 'New', 'York', 'on', 'April', '19', '.', 'Go', 'out', '&', 'vote', '.', '"'], ['Lets', 'properly', 'check', 'goofy', 'Elizabeth', 'Warrens', 'records', 'to', 'see', 'if', '

### Sequence of Padded Ngram Tuples for Perplexity

According to the `nltk` documentation for `entropy`, which `perplexity` calls internally, the Train and Test corpora must be converted into a sequence of padded n-gram tuples before computing the perplexity.

For example, starting from a tokenized corpus such as:

corpus = [["I", "love", "natural", "language", "processing"], ["This", "is", "another", "sentence"]]

https://www.nltk.org/_modules/nltk/lm/api.html#LanguageModel.entropy 

In [504]:
from nltk.lm.preprocessing import pad_both_ends

n = 3  # trigram order, matching the models trained below
corpus_temp = [["I", "love", "natural", "language", "processing"], ["This", "is", "another", "sentence"]]
print("\nOUTPUT:")
[tuple(ngram) for sentence in corpus_temp for ngram in ngrams(pad_both_ends(sentence, n=n), n)]


OUTPUT:


[('<s>', '<s>', 'I'),
 ('<s>', 'I', 'love'),
 ('I', 'love', 'natural'),
 ('love', 'natural', 'language'),
 ('natural', 'language', 'processing'),
 ('language', 'processing', '</s>'),
 ('processing', '</s>', '</s>'),
 ('<s>', '<s>', 'This'),
 ('<s>', 'This', 'is'),
 ('This', 'is', 'another'),
 ('is', 'another', 'sentence'),
 ('another', 'sentence', '</s>'),
 ('sentence', '</s>', '</s>')]

### MLE LM

In [522]:
# Set Order
n = 3

#### Train

In [523]:
Train_MLE = train_language_models(n = n, corpus = Train_Corpus,  model_type = "mle")
print("Model:",Train_MLE)


Tokens after fitting: 14095
Model: <nltk.lm.models.MLE object at 0x287db18d0>


                Perplexity on Train             

In [524]:
corpus_temp = Train_Corpus

ngram_corpus =  [tuple(ngram) for sentence in corpus_temp for ngram in ngrams(pad_both_ends(sentence, n=n), n)]

Train_MLE.perplexity(ngram_corpus)

3.129051115907475

> The perplexity is very low, which may indicate that MLE overfits the training set.

                Perplexity on Test             

In [525]:
corpus_temp = Test_Corpus

ngram_corpus =  [tuple(ngram) for sentence in corpus_temp for ngram in ngrams(pad_both_ends(sentence, n=n), n)]

Train_MLE.perplexity(ngram_corpus)

inf

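> An infinite perplexity is expected here: the test set contains words and trigrams that were never seen in training, MLE assigns them a probability of 0, and $\log(0) = -\infty$ makes the entropy, and therefore the perplexity, infinite.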
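
A minimal sketch of why a single unseen trigram is enough to make the perplexity infinite. It assumes perplexity is computed as $2^{H}$, with $H$ the mean negative log2-probability of the test n-grams. The probabilities below are invented for illustration, and `perplexity` is a hypothetical helper, not the `nltk` method.

```python
import math


def perplexity(probs):
    """Perplexity 2**H, with H the mean negative log2-probability."""
    log_probs = [math.log2(p) if p > 0 else float("-inf") for p in probs]
    entropy = -sum(log_probs) / len(log_probs)
    return 2 ** entropy


# Invented per-trigram probabilities
seen_only = [0.5, 0.25, 0.5, 0.25]
with_unseen = [0.5, 0.25, 0.5, 0.0]  # last trigram never observed in training

print(perplexity(seen_only))    # finite
print(perplexity(with_unseen))  # infinite: one zero probability dominates the mean
```

Any smoothing that keeps every probability strictly positive, such as Laplace or Lidstone, avoids this case.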

### Laplace LM

In [526]:
Train_LAPLACE = train_language_models(n = n, corpus = Train_Corpus,  model_type = "laplace")
print("Model:",Train_LAPLACE)


Tokens after fitting: 14095
Model: <nltk.lm.models.Laplace object at 0x284b255a0>


                Perplexity on Train          

In [527]:
corpus_temp = Train_Corpus

ngram_corpus =  [tuple(ngram) for sentence in corpus_temp for ngram in ngrams(pad_both_ends(sentence, n=n), n)]

Train_LAPLACE.perplexity(ngram_corpus)

3666.7017907563372

> With Laplace smoothing, the train perplexity is much higher than with MLE, so the model fits the training set far less tightly.

                Perplexity on Test             

In [528]:
corpus_temp = Test_Corpus

ngram_corpus =  [tuple(ngram) for sentence in corpus_temp for ngram in ngrams(pad_both_ends(sentence, n=n), n)]

Train_LAPLACE.perplexity(ngram_corpus)

6405.539194358373

> The test perplexity should be higher than the train perplexity, since the test set is unseen by the model: its probabilities and patterns match less well, and the model is more uncertain about which word comes next. This is the case here, and also for the other models.

### Lidstone LM with $\gamma=0.1$

In [529]:
Train_LIDSTONE = train_language_models(n = n, corpus = Train_Corpus,  model_type = "lidstone", gamma = 0.1)
print("Model:",Train_LIDSTONE, "\nGamma:",Train_LIDSTONE.gamma)


Tokens after fitting: 14095
Model: <nltk.lm.models.Lidstone object at 0x28f1182b0> 
Gamma: 0.1


                Perplexity on Train             

In [532]:
corpus_temp = Train_Corpus

ngram_corpus =  [tuple(ngram) for sentence in corpus_temp for ngram in ngrams(pad_both_ends(sentence, n=n), n)]

Train_LIDSTONE.perplexity(ngram_corpus)

640.6129488611198

> The lower train perplexity obtained with a smaller $\gamma$ indicates that Lidstone smoothing gives more weight to the n-grams observed in the training set, which means a stronger reliance on the training data and potentially a higher susceptibility to overfitting.
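
The role of $\gamma$ can be sketched with the add-$\gamma$ estimate $P(w \mid c) = \frac{\text{count}(c, w) + \gamma}{\text{count}(c) + \gamma V}$. The snippet below assumes that the "Tokens after fitting" figure printed above (14095) is the vocabulary size $V$, combines it with made-up context counts, and computes how much probability mass remains on the words actually observed after a context. `observed_mass` is a hypothetical helper written for this illustration.

```python
def observed_mass(context_count, gamma, vocab_size, observed_types):
    """Total add-gamma probability of the word types seen after a context."""
    # Each observed type receives +gamma, and their raw counts sum to context_count
    numerator = context_count + gamma * observed_types
    denominator = context_count + gamma * vocab_size
    return numerator / denominator


V = 14095  # assumed vocabulary size (value printed after fitting on the training corpus)

# Made-up context: seen 20 times, followed by 5 distinct word types
for gamma in (1.0, 0.1, 0.01):
    mass = observed_mass(context_count=20, gamma=gamma, vocab_size=V, observed_types=5)
    print(f"gamma={gamma}: observed mass = {mass:.4f}")
```

A smaller $\gamma$ leaves more mass on the observed continuations, which lowers the train perplexity but also ties the model more closely to the training counts.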

                Perplexity on Test             

In [533]:
corpus_temp = Test_Corpus

ngram_corpus =  [tuple(ngram) for sentence in corpus_temp for ngram in ngrams(pad_both_ends(sentence, n=n), n)]

Train_LIDSTONE.perplexity(ngram_corpus)

3749.5159479239624

> Compared with Laplace, the gap between train and test perplexity is proportionally much larger for Lidstone with $\gamma$ = 0.1 (about 5.9 times versus 1.7 times), which suggests that it fits the training corpus more closely. It is difficult to choose the best model at a single order and without hyper-parameter tuning, but based solely on the perplexity scores at $n$ = 3, Lidstone with $\gamma$ = 0.1 has the lowest test perplexity (3749.5, against 6405.5 for Laplace and infinity for MLE) and should therefore perform best on unseen text.

### Ex. 4.4: Hyper-parameter tuning
- Perform a grid-search to select the best hyperparameter values for $n$ and $\gamma$, for the Lidstone LM. You want to select the model that generalizes best to new data.
- What do you observe in the obtained perplexity scores? Was it expected? Explain it in statistical terms.

*Hint:* Maybe try a few values for $n$ and $\gamma$ by hand to identify the general hyperparameter region of interest before defining a more thorough hyperparameter value grid.

##### Answer

---------------

### Custom Grid Search

#### Model Function

In [586]:
def train_language_models_grid(n, corpus, model_type='mle', gamma = 0.1):
    """
    Train a language model using maximum likelihood estimation (MLE),
    Laplace smoothing, or Lidstone smoothing.

    Args:
        n (int): The order of the language model.
        corpus (list): The training corpus as a list of sentences or tokens.
        model_type (str, optional): The type of model to train.
            Valid options: 'mle' (default), 'laplace', 'lidstone'.
        gamma (float, optional): The Lidstone smoothing parameter; only used
            when model_type is 'lidstone'. Defaults to 0.1.

    Returns:
        object: The trained language model.

    Raises:
        ValueError: If an invalid model_type is specified.

    """
    
    model_classes = {'mle': MLE, 'laplace': Laplace, 'lidstone': Lidstone}
    
    ModelClass = model_classes.get(model_type)
    if ModelClass is None:
        raise ValueError(f"Invalid model_type: {model_type!r}")

    # Named train_ngrams to avoid shadowing nltk.util.ngrams
    train_ngrams, vocab = padded_everygram_pipeline(n, text = corpus)

    if model_type == "lidstone": 
        model = ModelClass(order = n, gamma = gamma)
    else:
        model = ModelClass(order = n)

    model.fit(text = train_ngrams, vocabulary_text = vocab)

    return model

#### Broad Search

> Testing for multiple $n$ and $\gamma$ before restricting our search
>
> We increment $\gamma$ by 0.1, from 0.1 up to (but excluding) 5, and go over 1 to 4 for $n$

In [682]:
# Hyperparameters Init
order_max = 4
gamma_min = 0
gamma_max = 5
gamma_increment = 0.1

# Score Init
best_model_perplexity = 10**10
best_model_order = 0
best_model_gamma = 0

# Store each Values
results_array = np.empty((0, 3))

# Grid Search
for order in range(1,order_max+1):

    for gamma in np.arange(gamma_min+gamma_increment, gamma_max, gamma_increment).round(10):

        model = train_language_models_grid(n = order, corpus = Train_Corpus,  model_type = "lidstone", gamma = gamma)
        
        # Build the test n-grams with the model's own order (not the global n)
        perplexity = model.perplexity([tuple(ngram) for sentence in Test_Corpus for ngram in ngrams(pad_both_ends(sentence, n=order), order)])

        new_row = np.array([[order, gamma, perplexity]])
        results_array = np.append(results_array, new_row, axis=0)

        if perplexity < best_model_perplexity:
            best_model_perplexity = perplexity
            best_model_order = order
            best_model_gamma = gamma
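
As an alternative to growing a NumPy array inside the loop, the same search could be written by collecting rows in a list and taking the minimum at the end. The sketch below is generic: `grid_search` and `toy_score` are hypothetical names, and `toy_score` merely stands in for training a Lidstone model and returning its perplexity on held-out data.

```python
from itertools import product


def grid_search(score_fn, orders, gammas):
    """Evaluate score_fn on every (order, gamma) pair; lower scores are better."""
    results = [
        {"order": order, "gamma": gamma, "perplexity": score_fn(order, gamma)}
        for order, gamma in product(orders, gammas)
    ]
    best = min(results, key=lambda row: row["perplexity"])
    return results, best


def toy_score(order, gamma):
    """Stand-in objective with its minimum at order=1, gamma=1.3 (not real perplexities)."""
    return 1000 + 500 * (order - 1) + 100 * (gamma - 1.3) ** 2


# Integer steps avoid accumulating floating-point error in the gamma grid
gammas = [round(0.1 * step, 10) for step in range(1, 50)]
results, best = grid_search(toy_score, orders=range(1, 5), gammas=gammas)
print(best)
```

The list of dictionaries can be passed directly to `pd.DataFrame(results)` to obtain the same `order`, `gamma`, `perplexity` columns used in the cells below.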


### Best Hyperparameters and Perplexity

In [684]:
def highlight_lowest_perplexity(row):
    if row['perplexity'] == results_df['perplexity'].min():
        return ['background-color: #a87d4c'] * len(row)
    else:
        return [''] * len(row)
    
results_df = pd.DataFrame(results_array, columns=['order', 'gamma', 'perplexity'])
results_df["order"] = results_df["order"].astype(int)
best_index = results_df['perplexity'].idxmin()

# Select the 10 rows around the best row
slice_start = max(0, best_index - 5)
slice_end = min(len(results_df), best_index + 6)
results_df_slice = results_df.iloc[slice_start:slice_end]

# Apply the styling function to the sliced DataFrame
styled_df_slice = results_df_slice.style.apply(highlight_lowest_perplexity, axis=1)

# Display the styled DataFrame
styled_df_slice

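|    | order | gamma | perplexity |
|---:|------:|------:|-----------:|
| 7 | 1 | 0.8 | 1118.88474 |
| 8 | 1 | 0.9 | 1115.380824 |
| 9 | 1 | 1.0 | 1113.026991 |
| 10 | 1 | 1.1 | 1111.59251 |
| 11 | 1 | 1.2 | 1110.90801 |
| 12 | 1 | 1.3 | 1110.845541 |
| 13 | 1 | 1.4 | 1111.306122 |
| 14 | 1 | 1.5 | 1112.211644 |
| 15 | 1 | 1.6 | 1113.49941 |
| 16 | 1 | 1.7 | 1115.118355 |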


#### Plotting Grid Search

In [685]:
fig = px.line(results_df, x='gamma', y='perplexity', color='order', 
              color_discrete_sequence=px.colors.qualitative.Plotly,
              title = "Broad Grid Search with Lidstone")
fig.add_trace(go.Scatter(x=[results_df_slice.loc[best_index,'gamma']], y=[results_df_slice.loc[best_index,'perplexity']], mode='markers', marker=dict(color='#DDA15E', size=12),name='Min Perplexity'))
fig.update_layout(width=1000, height=600)
fig.show()

> With the broader grid search, we can see that the preferred order is $n$ = 1 across the values of Lidstone's $\gamma$ parameter, meaning that the model performs best as a unigram model, which ignores the preceding words entirely. This leads to less variance but more bias, since the context of each word is reduced to nothing but its overall frequency. At the same time, Lidstone smoothing with $\gamma$ above 1 shifts more probability mass towards unseen words than Laplace does, and this is what performs best on the Test set here.
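
One way to look at the variance side of this trade-off is to count how many held-out n-grams were never observed in training as the order grows. The sketch below does this on a tiny invented corpus (not the tweets), without padding, using the hypothetical helpers `ngram_list` and `unseen_rate`. The exact rates on real data will differ.

```python
def ngram_list(tokens, n):
    """All contiguous n-grams of a token list (no padding, for simplicity)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unseen_rate(train, test, n):
    """Share of test n-grams that never occur in the training sentences."""
    seen = {gram for sent in train for gram in ngram_list(sent, n)}
    test_grams = [gram for sent in test for gram in ngram_list(sent, n)]
    return sum(gram not in seen for gram in test_grams) / len(test_grams)


# Invented toy sentences, not taken from the tweet corpus
train = [
    "we will make america great again".split(),
    "we will win".split(),
    "make america safe again".split(),
]
test = ["we will make america win again".split()]

rates = {n: unseen_rate(train, test, n) for n in range(1, 5)}
for n, rate in rates.items():
    print(f"n={n}: {rate:.2f} of test n-grams unseen in train")
```

Every unseen n-gram is scored from the $\gamma$ pseudo-count alone, so in this toy case the higher orders lean more on smoothing and less on observed data.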

#### Focused Search

> This time we focus on $n$ = 1 and search $\gamma$ between 0.5 and 2 in finer steps of 0.01

In [679]:
# Hyperparameters Init
order_max = 1
gamma_min = 0.5
gamma_max = 2
gamma_increment = 0.01

# Score Init
best_model_perplexity = 10**10
best_model_order = 0
best_model_gamma = 0

# Store each Values
results_array = np.empty((0, 3))

# Grid Search
for order in range(1,order_max+1):

    for gamma in np.arange(gamma_min, gamma_max, gamma_increment).round(10):

        model = train_language_models_grid(n = order, corpus = Train_Corpus,  model_type = "lidstone", gamma = gamma)
        
        # Build the test n-grams with the model's own order (not the global n)
        perplexity = model.perplexity([tuple(ngram) for sentence in Test_Corpus for ngram in ngrams(pad_both_ends(sentence, n=order), order)])

        new_row = np.array([[order, gamma, perplexity]])
        results_array = np.append(results_array, new_row, axis=0)

        if perplexity < best_model_perplexity:
            best_model_perplexity = perplexity
            best_model_order = order
            best_model_gamma = gamma


### Best Hyperparameters and Perplexity

In [680]:
def highlight_lowest_perplexity(row):
    if row['perplexity'] == results_df['perplexity'].min():
        return ['background-color: #a87d4c'] * len(row)
    else:
        return [''] * len(row)
    
results_df = pd.DataFrame(results_array, columns=['order', 'gamma', 'perplexity'])
results_df["order"] = results_df["order"].astype(int)
best_index = results_df['perplexity'].idxmin()

# Select the 10 rows around the best row
slice_start = max(0, best_index - 5)
slice_end = min(len(results_df), best_index + 6)
results_df_slice = results_df.iloc[slice_start:slice_end]

# Apply the styling function to the sliced DataFrame
styled_df_slice = results_df_slice.style.apply(highlight_lowest_perplexity, axis=1)

# Display the styled DataFrame
styled_df_slice

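|    | order | gamma | perplexity |
|---:|------:|------:|-----------:|
| 71 | 1 | 1.21 | 1110.8756 |
| 72 | 1 | 1.22 | 1110.849277 |
| 73 | 1 | 1.23 | 1110.828934 |
| 74 | 1 | 1.24 | 1110.814468 |
| 75 | 1 | 1.25 | 1110.805776 |
| 76 | 1 | 1.26 | 1110.802759 |
| 77 | 1 | 1.27 | 1110.80532 |
| 78 | 1 | 1.28 | 1110.813365 |
| 79 | 1 | 1.29 | 1110.826802 |
| 80 | 1 | 1.3 | 1110.845541 |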


#### Plotting Grid Search

In [681]:
fig = px.line(results_df, x='gamma', y='perplexity', color='order', 
              color_discrete_sequence=px.colors.qualitative.Plotly,
              title = "Focused Grid Search with Lidstone")
fig.update_layout(width=1000, height=600)
fig.add_trace(go.Scatter(x=[results_df_slice.loc[best_index,'gamma']], y=[results_df_slice.loc[best_index,'perplexity']], mode='markers', marker=dict(color='#DDA15E', size=12),name='Min Perplexity'))
fig.show()

> As we can see, the best perplexity is about 1110.8, at $\gamma$ = 1.26. This confirms the result of the Broad Search, with a slightly lower $\gamma$ (1.26 instead of 1.3).