# NLP Assignment: Generating Trump Tweets with N-Gram Models

In this assignment, you will use n-gram language models (LM) to model tweets (social media statements) from or about the former U.S. president Donald Trump. The goal will then be to generate new tweets, or do autocompletion, in the writing style of Trump's tweets. The tweets have been scraped from the Twitter social media (since then renamed "X").

Before starting this assignment, the appended `NLP_ngram_cheatsheet.ipynb` notebook provides a tutorial on n-grams and LM basics, using the `nltk` package.

Please code the necessary steps in python, and provide answers in Markdown format in this notebook, under the corresponding instructions and questions below.

Please rename your final file `NLP_Assignment_STUDENTID.ipynb` for submission on moodle, and make sure you "run all" with a fresh kernel, so that outputs show correctly and in order in your submission.

**STUDENT ID:** 19-320-563

## Packages

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

# Import NLTK and its submodules
import nltk
from nltk.lm import MLE, Laplace, Lidstone
from nltk.lm.preprocessing import padded_everygram_pipeline, flatten
from nltk.util import ngrams
from nltk.lm.preprocessing import pad_both_ends

# Import other libraries
import random
import re
import string as str
import unicodedata
from sklearn.model_selection import train_test_split

# Download NLTK data
nltk.download('popular', quiet=True)

True

----------------

## Part 1: Import, inspect and preprocess the text data

- Import the provided dataset, `Trump_tweets.csv`. We are interested in the variable `Tweet_Text`, which gives the content of each tweet. 
- Before tokenizing, start by cleaning the tweets' format. You should at least normalize the different types of apostrophes and quotes (e.g. `` ’, ”, ` ``) to the corresponding ` ' ` or ` " `, remove line breaks `\n` (careful about not "merging" words), and remove multiple spacing. Also make sure urls (e.g. `https://t.co/wPk7QWpK8Z`) are not split into too many meaningless tokens. 
- (Facultative) Feel free to perform additional cleaning steps that you believe will improve the tokenization or the downstream LMs (in which case, briefly explain why).
- Tokenize the `Tweet_Text` corpus into a list of tokenized tweets (documents). The result should be a list of lists containing word-level tokens (e.g. words, punctuation, and other "special words").
- Show the result for the first five tweets of the corpus.

##### Answer

### Import Data

In [2]:
trump_data = pd.read_csv('DATA/Trump_tweets.csv')
trump_data

Unnamed: 0,Date,Time,Tweet_Text,Type,Media_Type,Hashtags,Tweet_Id,Tweet_Url,twt_favourites_IS_THIS_LIKE_QUESTION_MARK,Retweets,Unnamed: 10,Unnamed: 11
0,16-11-11,15:26:37,Today we express our deepest gratitude to all ...,text,photo,ThankAVet,7.970000e+17,https://twitter.com/realDonaldTrump/status/797...,127213,41112,,
1,16-11-11,13:33:35,Busy day planned in New York. Will soon be mak...,text,,,7.970000e+17,https://twitter.com/realDonaldTrump/status/797...,141527,28654,,
2,16-11-11,11:14:20,Love the fact that the small groups of protest...,text,,,7.970000e+17,https://twitter.com/realDonaldTrump/status/797...,183729,50039,,
3,16-11-11,2:19:44,Just had a very open and successful presidenti...,text,,,7.970000e+17,https://twitter.com/realDonaldTrump/status/796...,214001,67010,,
4,16-11-11,2:10:46,A fantastic day in D.C. Met with President Oba...,text,,,7.970000e+17,https://twitter.com/realDonaldTrump/status/796...,178499,36688,,
...,...,...,...,...,...,...,...,...,...,...,...,...
7370,15-07-16,13:10:00,I loved firing goofball atheist Penn @pennjill...,text,,,6.220000e+17,https://twitter.com/realDonaldTrump/status/621...,953,431,,
7371,15-07-16,10:18:31,I hear @pennjillette show on Broadway is terri...,text,,,6.220000e+17,https://twitter.com/realDonaldTrump/status/621...,1175,1086,,
7372,15-07-16,10:10:17,Irrelevant clown @KarlRove sweats and shakes n...,text,,,6.220000e+17,https://twitter.com/realDonaldTrump/status/621...,1494,930,,
7373,15-07-16,9:44:07,"""@HoustonWelder: Donald Trump is one of the se...",text,,,6.220000e+17,https://twitter.com/realDonaldTrump/status/621...,1800,1738,,


### Data Info

In [3]:
trump_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7375 entries, 0 to 7374
Data columns (total 12 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Date                                       7375 non-null   object 
 1   Time                                       7375 non-null   object 
 2   Tweet_Text                                 7375 non-null   object 
 3   Type                                       7375 non-null   object 
 4   Media_Type                                 1225 non-null   object 
 5   Hashtags                                   2031 non-null   object 
 6   Tweet_Id                                   7375 non-null   float64
 7   Tweet_Url                                  7375 non-null   object 
 8   twt_favourites_IS_THIS_LIKE_QUESTION_MARK  7375 non-null   int64  
 9   Retweets                                   7375 non-null   int64  
 10  Unnamed: 10             

### Select *Tweet_Tex* Only

In [4]:
trump_tweet_text = df = pd.DataFrame({'Tweet_Text': trump_data["Tweet_Text"]})
trump_tweet_text_temp = trump_tweet_text.copy()
trump_tweet_text

Unnamed: 0,Tweet_Text
0,Today we express our deepest gratitude to all ...
1,Busy day planned in New York. Will soon be mak...
2,Love the fact that the small groups of protest...
3,Just had a very open and successful presidenti...
4,A fantastic day in D.C. Met with President Oba...
...,...
7370,I loved firing goofball atheist Penn @pennjill...
7371,I hear @pennjillette show on Broadway is terri...
7372,Irrelevant clown @KarlRove sweats and shakes n...
7373,"""@HoustonWelder: Donald Trump is one of the se..."


### Retweet VS Quote Tweets

<img src="IMAGES/Differences between Retweet and Quote Tweet.png" width=40% />

Truly, A. (2020). Retweets Vs. Quote Tweets: Why Twitter’s Sharing Experiment Failed. [online] ScreenRant. Available at: https://screenrant.com/twitter-retweet-vs-quote-tweet-experiment-stopped/ [Accessed 4 May 2024].

‌

#### RT (Retweet)

In [5]:
sample1 = trump_tweet_text["Tweet_Text"].iloc[8]
sample1

'RT @IvankaTrump: Such a surreal moment to vote for my father for President of the United States! Make your voice heard and vote! #Election2_'

> Retweets are generally only a copy of another tweet, displayed as a post under the user profile (here Trump), this means that the way of speech of Trump won't appear in this category of tweets

#### QT (Quote Tweet)

In [6]:
sample2 = trump_tweet_text["Tweet_Text"].iloc[-2]
sample2

'"@HoustonWelder: Donald Trump is one of the sexiest men on this planet. Every woman dreams of a good man who tells it like it is." So true!'

> Quote Tweets may contain some original Trump says, mainly comments added after a quote of someone else. Here Trump only says "So true!", but we want to keep this dynamic of him quoting someone else and adding is own speech. 

#### Tweets Types in our Data

In [7]:
trump_tweet_text['Type'] = np.where(trump_tweet_text["Tweet_Text"].str.startswith('RT'), 'RETWEET', 
                       np.where(trump_tweet_text["Tweet_Text"].str.startswith('"@'), 'QUOTE RETWEET', 'ORIGINAL'))

trump_tweet_text["Type"].value_counts()

ORIGINAL         4819
QUOTE RETWEET    2126
RETWEET           430
Name: Type, dtype: int64

> Here we can identify RT by the _**RT**_ words and QT by the fact that it should always starts with  _"@_  to mention the user we take the quote from.


#### Removing Retweet and keeping Quote Retweet

> I believe that retaining the RETWEET feature may not be crucial in replicating Trump's tweet style. At best, it might generate tweets that he would endorse or deem worthy of sharing with his followers. However, it also risks attributing others' words to his own speech without capturing the essence of his tone or thought process. In contrast preserving Quotation Retweets could be more valuable, as they often reflect Trump's tendency to comment on or criticize existing tweets, which could help the model better understand this aspect of his behavior. 

In [8]:
trump_tweet_text = trump_tweet_text[trump_tweet_text['Type'] != 'RETWEET']
trump_tweet_text.reset_index(drop=True, inplace=True)
trump_tweet_text

Unnamed: 0,Tweet_Text,Type
0,Today we express our deepest gratitude to all ...,ORIGINAL
1,Busy day planned in New York. Will soon be mak...,ORIGINAL
2,Love the fact that the small groups of protest...,ORIGINAL
3,Just had a very open and successful presidenti...,ORIGINAL
4,A fantastic day in D.C. Met with President Oba...,ORIGINAL
...,...,...
6940,I hope the boycott of @Macys continues forever...,ORIGINAL
6941,I loved firing goofball atheist Penn @pennjill...,ORIGINAL
6942,I hear @pennjillette show on Broadway is terri...,ORIGINAL
6943,Irrelevant clown @KarlRove sweats and shakes n...,ORIGINAL


> Therefore we removed 430 Tweets from our Corpus

### Cleaning 

#### Find ’ ,  ”  and ` 

In [9]:
pattern_occurrences = trump_tweet_text['Tweet_Text'].str.contains(r'’', regex=True)
np.array(trump_tweet_text[pattern_occurrences][0:10]["Tweet_Text"])

array([], dtype=object)

In [10]:
pattern_occurrences = trump_tweet_text['Tweet_Text'].str.contains(r'”', regex=True)
np.array(trump_tweet_text[pattern_occurrences][0:10]["Tweet_Text"])

array([], dtype=object)

In [11]:
pattern_occurrences = trump_tweet_text['Tweet_Text'].str.contains(r'`', regex=True)
np.array(trump_tweet_text[pattern_occurrences][0:10]["Tweet_Text"])

array([], dtype=object)

> There is no other apostrophes type in our Tweets, except the standard "

In [12]:
pattern_occurrences = trump_tweet_text['Tweet_Text'].str.contains(r'"', regex=True)
np.array(trump_tweet_text[pattern_occurrences][0:10]["Tweet_Text"])

array(['Just out according to @CNN: "Utah officials report voting machine problems across entire country"',
       '"The Clinton Campaign at Obama Justice" #DrainTheSwamp\nhttps://t.co/LZkvFc071z',
       '"It pays to have friends in high places- like the Justice Department. Clearly the Clintons do."\n#DrainTheSwamp! https://t.co/KZXB4B156M',
       '"@PYNance: Evangelical women live at #trumptower @pdpryor1 @CissieGLynch @SaysGabrielle https://t.co/k5kGXPR2WA"',
       '"@Ravenrantz: #Billygrahams grand daughter #SupportsTrump https://t.co/sKz1SPHzDZ"  So nice, thank you Cissy Graham Lynch!!!!',
       '"@slh: I follow Mr.Trump at all of his rallies by watching them on https://t.co/biseaBESvS. He is a lion-hearted warrior, who inspires hope',
       '"@DeplorableCBTP: "In my mind, #DonaldTrump is the only way out of this mess." - #PhilRobertson of TVs #DuckDynasty"   Thank you Phil!',
       '"@piersmorgan: BOMBSHELL: FBI reopening its investigation into HillaryClintons email server a

> Here we have our standard "

#### Find "\n " - Back Return

In [13]:
pattern_occurrences = trump_tweet_text['Tweet_Text'].str.contains(r'\n', regex=True)
np.array(trump_tweet_text[pattern_occurrences][0:10]["Tweet_Text"])

array(['Watching the returns at 9:45pm.\n#ElectionNight #MAGA__ https://t.co/HfuJeRZbod',
       'Still time to #VoteTrump!\n#iVoted #ElectionNight https://t.co/UZtYAY1Ba6',
       'Unbelievable evening in New Hampshire - THANK YOU! Flying to Grand Rapids, Michigan now.\nWatch NH rally here:_ https://t.co/hP88anrfgk',
       'America must decide between failed policies or fresh perspective, a corrupt system or an outsider\nhttps://t.co/ll8QIW9SqW',
       'What I Like About Trump ... and Why You Need to Vote for Him\nhttps://t.co/6rVuDUehZq',
       'I love you North Carolina- thank you for your amazing support! Get out and https://t.co/HfihPERFgZ tomorrow!\nWatch:_ https://t.co/jZzfqUZNYh',
       'Starting tomorrow its going to be #AmericaFirst! Thank you for a great morning Sarasota, Florida!\nWatch here:_ https://t.co/ig62Kjkkvl',
       'Thank you Minnesota! It is time to #DrainTheSwamp &amp; #MAGA!\n#ICYMI- watch: https://t.co/fVThC7yIL6 https://t.co/e8SaXiJrxj',
       'MONDAY -

#### Find "\s" - 2 or more Spaces

In [14]:
pattern_occurrences = trump_tweet_text['Tweet_Text'].str.contains(r'\s{2,}', regex=True)
np.array(trump_tweet_text[pattern_occurrences][0:10]["Tweet_Text"])

array(['MONDAY - 11/7/2016\n\nScranton, Pennsylvania at 5:30pm.\nhttps://t.co/BcErCtsPdF\n\nGrand Rapids, Michigan at 11pm._ https://t.co/pgFMLp0173',
       'Van Jones: There Is A Crack in the Blue Wall۪  It Has to Do With Trade: https://t.co/BvEF9cC7o7',
       'JOIN ME TOMORROW!\nMINNESOTA ۢ 2pm\nhttps://t.co/WcgLh4prS7\n\nMICHIGAN ۢ 6pm\nhttps://t.co/9BqGVKNNrt\n\nVIRGINIA ۢ 9:30p_ https://t.co/A1oVhCrT6t',
       'Join me in Denver, Colorado tonight at 9:30pm: https://t.co/LJYGIK7Mri\n\nNEW- Scranton, Pennsylvania Monday @ 5:30pm: https://t.co/BcErCtsPdF',
       'Join me today in Wilmington, Ohio at 4pm: https://t.co/eCLECMkYLw\n\nTomorrow- Tampa, Florida at 10am: https://t.co/N9380pVmuM',
       '"@Ravenrantz: #Billygrahams grand daughter #SupportsTrump https://t.co/sKz1SPHzDZ"  So nice, thank you Cissy Graham Lynch!!!!',
       'Join me in Florida tomorrow!\n\nMIAMIۢ12pm\nhttps://t.co/A3X71Q6sG2\n\nORLANDOۢ4pm\nhttps://t.co/6BqTVoty5C\n\nPENSACOLAۢ7p_ https://t.co/kEQuuJeO1B',


> There is indeed 2 or more Spaces in our Corpus

#### Find "&AMP" - "&"for Ampersand

In [15]:
pattern_occurrences = trump_tweet_text['Tweet_Text'].str.contains(r'&\w+;', regex=True)
np.array(trump_tweet_text[pattern_occurrences][0:10]["Tweet_Text"])

array(['LIVE on #Periscope: Join me for a few minutes in Pennsylvania. Get out &amp; VOTE tomorrow. LETS #MAGA!! https://t.co/Ej0LmMK3YU',
       'Hey Missouri lets defeat Crooked Hillary &amp; @koster4missouri! Koster supports Obamacare &amp; amnesty! Vote outsider Navy SEAL @EricGreitens!',
       'Our American comeback story begins 11/8/16. Together, we will MAKE AMERICA SAFE &amp; GREAT again for everyone! Watch:_ https://t.co/ek8Cn3CgTr',
       'Thank you Minnesota! It is time to #DrainTheSwamp &amp; #MAGA!\n#ICYMI- watch: https://t.co/fVThC7yIL6 https://t.co/e8SaXiJrxj',
       'Thank you Iowa - Get out &amp; #VoteTrumpPence16!\nhttps://t.co/HfihPERFgZ https://t.co/QsukELQmKb',
       'Thank you Hershey, Pennsylvania. Get out &amp; VOTE on November 8th &amp; we will #MAGA! #RallyForRiley\n#ICYMI, watch here_ https://t.co/maWukVBTr8',
       'ICE OFFICERS WARN HILLARY IMMIGRATION PLAN WILL UNLEASH GANGS, CARTELS &amp; DRUG VIOLENCE NATIONWIDE_ https://t.co/09aSrBwQrv',
       'Th

#### Find "۝" - Arabic End of Ayah (Unicode is U+06DD)

In [16]:
pattern_occurrences = trump_tweet_text['Tweet_Text'].str.contains(r'۝', regex=True)
np.array(trump_tweet_text[pattern_occurrences][0:10]["Tweet_Text"])

array(['Just out: Neera Tanden, Hillary Clinton adviser said, Israel is depressing.\u06dd I think Israel is inspiring!',
       'Ron Fournier: "Clinton Used Secret Server To Protect #CircleOfEnrichment\u06dd\nhttps://t.co/4OGP3tPxyp',
       'A top Clinton Foundation official said he could name 500 different examples\u06dd of conflicts of interest.\nhttps://t.co/rtWhdYOyq7',
       '#CrookedHillary was at center of negotiating $12M commitment from King Mohammed VI of Morocco\u06dd to Clinton Fdn. https://t.co/HWOQ7jQWY2',
       'Moderator: Respectfully, you won۪t answer the pay-to-play question.\u06dd #Debate #BigLeagueTruth',
       '#CrookedHillary gives Obama an A\u06dd for an economic recovery that۪s the slowest since WWII... #BigLeagueTruth_ https://t.co/wVMFHdyCu2',
       'Hillary has called for 550% more Syrian immigrants, but won۪t even mention radical Islamic terrorists.\u06dd #Debate_ https://t.co/Rf48XkZWbu',
       'Moderator: Hillary paid $225,000 by a Brazilian bank for

> Generally used in Arabic tweets or context in arabic countries (Israel, Egypt etc...)

#### Find "&lt" for < and "&gt"; stands for the >

In [17]:
pattern_occurrences = trump_tweet_text['Tweet_Text'].str.contains(r'&gt', regex=True)
np.array(trump_tweet_text[pattern_occurrences][0:10]["Tweet_Text"])

array(['$25 Million+ raised online in just one week! RECORD WEEK. #DrainTheSwamp Today we set a bigger record. Contribute &gt;https://t.co/CZ1QmzCxwO',
       '"@NeilTurner_: @realDonaldTrump Cruz &amp; Rubio are scared! WATCH -&gt; https://t.co/pWjLW1QBKo https://t.co/W2r6mOzgkb"',
       '"@BreitbartNews: Ratings were HUGE for @realDonaldTrumps appearance on Saturday Night Live -&gt; https://t.co/wdQXRq36yF"',
       '"@SweetFreedom29: Hey @realDonaldTrump --&gt; FLASHBACK: Jeb Bush Admitted Leaky۪ Immigration Led to 9/11 http://t.co/Jmm7wd32UD #tcot"  WOW!',
       '"@HardcoreRepub:  @realDonaldTrump  AMERICA will be working again. BUSINESSMAN &gt; POLITICIAN. Private sector growth above all. #Trump2016"'],
      dtype=object)

> Only stand for the character ">"

In [18]:
pattern_occurrences = trump_tweet_text['Tweet_Text'].str.contains(r'&lt', regex=True)
np.array(trump_tweet_text[pattern_occurrences][0:10]["Tweet_Text"])

array([], dtype=object)

>We don't have the left "<" case

#### Find URLs - https://website

In [19]:
pattern_occurrences = trump_tweet_text['Tweet_Text'].str.contains(r'https?://\S+', regex=True)
np.array(trump_tweet_text[pattern_occurrences][0:10]["Tweet_Text"])

array(['Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk7QWpK8Z',
       'Happy 241st birthday to the U.S. Marine Corps! Thank you for your service!! https://t.co/Lz2dhrXzo4',
       'Watching the returns at 9:45pm.\n#ElectionNight #MAGA__ https://t.co/HfuJeRZbod',
       'Still time to #VoteTrump!\n#iVoted #ElectionNight https://t.co/UZtYAY1Ba6',
       '#ElectionDay https://t.co/MXrAxYnTjY https://t.co/FZhOncih21',
       'We need your vote. Go to the POLLS! Lets continue this MOVEMENT! Find your poll location: https://t.co/VMUdvi1tx1_ https://t.co/zGOx74Ebhw',
       'VOTE TODAY! Go to https://t.co/MXrAxYnTjY to find your polling location. We are going to Make America Great Again!_ https://t.co/KPQ5EY9VwQ',
       'Today we are going to win the great state of MICHIGAN and we are going to WIN back the White House! Thank you MI!_ https://t.co/onRpEvzHrW',
       'Unbelievable evening in New Hampshire - THANK YOU! Flyi

> We find a lot of website urls

#### Find "_" - Underscore

In [20]:
pattern_occurrences = trump_tweet_text['Tweet_Text'].str.contains(r'_', regex=True)
np.array(trump_tweet_text[pattern_occurrences][0:10]["Tweet_Text"])

array(['Watching the returns at 9:45pm.\n#ElectionNight #MAGA__ https://t.co/HfuJeRZbod',
       'We need your vote. Go to the POLLS! Lets continue this MOVEMENT! Find your poll location: https://t.co/VMUdvi1tx1_ https://t.co/zGOx74Ebhw',
       'VOTE TODAY! Go to https://t.co/MXrAxYnTjY to find your polling location. We are going to Make America Great Again!_ https://t.co/KPQ5EY9VwQ',
       'Today we are going to win the great state of MICHIGAN and we are going to WIN back the White House! Thank you MI!_ https://t.co/onRpEvzHrW',
       'Unbelievable evening in New Hampshire - THANK YOU! Flying to Grand Rapids, Michigan now.\nWatch NH rally here:_ https://t.co/hP88anrfgk',
       'Thank you Pennsylvania! Going to New Hampshire now and on to Michigan. Watch PA rally here: https://t.co/d29DLINGst_ https://t.co/zcH9crFIKM',
       'I love you North Carolina- thank you for your amazing support! Get out and https://t.co/HfihPERFgZ tomorrow!\nWatch:_ https://t.co/jZzfqUZNYh',
       'Start

> "-" underscore are often the product of a user_name type of words, or preprocessing formatting to replace "space" character. In this case this is difficult to choose between removing it or not, since it may remove the nature of use of this character on username or hashtag. For better generative tweets, losing the Underscore shouldn't be an issue for understanding the words used.

#### Cleaning All 

> Let's clean what we have seen !

In [21]:
# Define a function to clean the tweets
def clean_tweet(tweet):

    # Normalize apostrophes and quotes
    tweet = unicodedata.normalize('NFKD', tweet).encode('ascii', 'ignore').decode('utf-8')

    # Remove line breaks
    tweet = tweet.replace(r'\n', ' ')

    # Remove URLs 
    tweet = re.sub(r'https?://\S+', 'https://website', tweet)

    # Remove HTML entities
    tweet = tweet.replace(r'&lt', '').replace(r'&gt', '').replace(r'&amp', '')

    # Remove underscore
    tweet = re.sub(r"_", ' ', tweet)

    # Remove multiple spacing
    tweet = re.sub(r'\s{2,}', ' ', tweet)

    return tweet

> Removing most characters previously encountered, except we change any URLs with the equivalent unique URL: https://website/ to keep the goal of this word. 

In [22]:
trump_tweet_text = trump_tweet_text.assign(Tweet_Text_Clean=trump_tweet_text['Tweet_Text'].apply(clean_tweet))
row = 0

print(" \nBEFORE CLEAN: \n\n",trump_tweet_text.iloc[row][0], "\n\n\n","AFTER CLEAN: \n\n",trump_tweet_text.iloc[row][2])

 
BEFORE CLEAN: 

 Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://t.co/wPk7QWpK8Z 


 AFTER CLEAN: 

 Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://website


### Tokenization

####  On one sentence

In [23]:
text = trump_tweet_text.iloc[800][2]
text_tokens = nltk.casual_tokenize(text)
text_tokens 

['Thank', 'you', 'Ohio', '!', '#AmericaFirst', 'https://website']

#### On whole Corpus

In [24]:
trump_tweet_text

Unnamed: 0,Tweet_Text,Type,Tweet_Text_Clean
0,Today we express our deepest gratitude to all ...,ORIGINAL,Today we express our deepest gratitude to all ...
1,Busy day planned in New York. Will soon be mak...,ORIGINAL,Busy day planned in New York. Will soon be mak...
2,Love the fact that the small groups of protest...,ORIGINAL,Love the fact that the small groups of protest...
3,Just had a very open and successful presidenti...,ORIGINAL,Just had a very open and successful presidenti...
4,A fantastic day in D.C. Met with President Oba...,ORIGINAL,A fantastic day in D.C. Met with President Oba...
...,...,...,...
6940,I hope the boycott of @Macys continues forever...,ORIGINAL,I hope the boycott of @Macys continues forever...
6941,I loved firing goofball atheist Penn @pennjill...,ORIGINAL,I loved firing goofball atheist Penn @pennjill...
6942,I hear @pennjillette show on Broadway is terri...,ORIGINAL,I hear @pennjillette show on Broadway is terri...
6943,Irrelevant clown @KarlRove sweats and shakes n...,ORIGINAL,Irrelevant clown @KarlRove sweats and shakes n...


In [25]:
trump_tweet_text = trump_tweet_text.assign(Tweet_Tokens=trump_tweet_text["Tweet_Text_Clean"].apply(nltk.casual_tokenize))

# First 5 Sentences
for i in range(5):
    print("-"*20,"\n\n LINE",i,":",trump_tweet_text["Tweet_Text_Clean"].iloc[i], "\n\n", "TOKENS:",trump_tweet_text["Tweet_Tokens"].iloc[i], "\n")

-------------------- 

 LINE 0 : Today we express our deepest gratitude to all those who have served in our armed forces. #ThankAVet https://website 

 TOKENS: ['Today', 'we', 'express', 'our', 'deepest', 'gratitude', 'to', 'all', 'those', 'who', 'have', 'served', 'in', 'our', 'armed', 'forces', '.', '#ThankAVet', 'https://website'] 

-------------------- 

 LINE 1 : Busy day planned in New York. Will soon be making some very important decisions on the people who will be running our government! 

 TOKENS: ['Busy', 'day', 'planned', 'in', 'New', 'York', '.', 'Will', 'soon', 'be', 'making', 'some', 'very', 'important', 'decisions', 'on', 'the', 'people', 'who', 'will', 'be', 'running', 'our', 'government', '!'] 

-------------------- 

 LINE 2 : Love the fact that the small groups of protesters last night have passion for our great country. We will all come together and be proud! 

 TOKENS: ['Love', 'the', 'fact', 'that', 'the', 'small', 'groups', 'of', 'protesters', 'last', 'night', 'ha

----------------

## Part 2: Fitting and Accessing a Trump Tweet LM

### Ex. 2.1: LM fitting function
Create a function that takes as arguments (at least) the desired order $n$ of the model and a tokenized training corpus, and that returns the "simple" Maximum Likelihood Estimator (MLE) language model, fitted on the given training corpus.  

Then, use your function to fit a MLE language model of order $n=3$ to the Trump Tweets corpus.

##### Answer

#### Corpus

In [26]:
corp = trump_tweet_text["Tweet_Tokens"].tolist()

for i in range(5):
    print(corp[i],"\n")

['Today', 'we', 'express', 'our', 'deepest', 'gratitude', 'to', 'all', 'those', 'who', 'have', 'served', 'in', 'our', 'armed', 'forces', '.', '#ThankAVet', 'https://website'] 

['Busy', 'day', 'planned', 'in', 'New', 'York', '.', 'Will', 'soon', 'be', 'making', 'some', 'very', 'important', 'decisions', 'on', 'the', 'people', 'who', 'will', 'be', 'running', 'our', 'government', '!'] 

['Love', 'the', 'fact', 'that', 'the', 'small', 'groups', 'of', 'protesters', 'last', 'night', 'have', 'passion', 'for', 'our', 'great', 'country', '.', 'We', 'will', 'all', 'come', 'together', 'and', 'be', 'proud', '!'] 

['Just', 'had', 'a', 'very', 'open', 'and', 'successful', 'presidential', 'election', '.', 'Now', 'professional', 'protesters', ',', 'incited', 'by', 'the', 'media', ',', 'are', 'protesting', '.', 'Very', 'unfair', '!'] 

['A', 'fantastic', 'day', 'in', 'D', '.', 'C', '.', 'Met', 'with', 'President', 'Obama', 'for', 'first', 'time', '.', 'Really', 'good', 'meeting', ',', 'great', 'chemis

> Checking that our Corpus is correct for further processing...

#### N-Grams and Padded Vocabulary

In [27]:
training_neverygrams, padded_vocab_stream = padded_everygram_pipeline(3, corp)

line_count = 0

print('==== n-everygram data (n=3) for first sequence in "Corpus": ====\n')
for ngramlize_sent in training_neverygrams:
    print(list(ngramlize_sent))
    print()
    line_count += 1
    if line_count >= 1:
        break

print('==== Vocabulary data: ====\n')
print(list(padded_vocab_stream)[1:100])

==== n-everygram data (n=3) for first sequence in "Corpus": ====

[('<s>',), ('<s>', '<s>'), ('<s>', '<s>', 'Today'), ('<s>',), ('<s>', 'Today'), ('<s>', 'Today', 'we'), ('Today',), ('Today', 'we'), ('Today', 'we', 'express'), ('we',), ('we', 'express'), ('we', 'express', 'our'), ('express',), ('express', 'our'), ('express', 'our', 'deepest'), ('our',), ('our', 'deepest'), ('our', 'deepest', 'gratitude'), ('deepest',), ('deepest', 'gratitude'), ('deepest', 'gratitude', 'to'), ('gratitude',), ('gratitude', 'to'), ('gratitude', 'to', 'all'), ('to',), ('to', 'all'), ('to', 'all', 'those'), ('all',), ('all', 'those'), ('all', 'those', 'who'), ('those',), ('those', 'who'), ('those', 'who', 'have'), ('who',), ('who', 'have'), ('who', 'have', 'served'), ('have',), ('have', 'served'), ('have', 'served', 'in'), ('served',), ('served', 'in'), ('served', 'in', 'our'), ('in',), ('in', 'our'), ('in', 'our', 'armed'), ('our',), ('our', 'armed'), ('our', 'armed', 'forces'), ('armed',), ('armed', 'for

#### Model Function for Fit - MLE of order 3

In [28]:
def train_language_model(n, corpus, model_type='mle'):
    
    model_classes = {'mle': MLE, 'laplace': Laplace}
    ModelClass = model_classes.get(model_type, MLE)

    ngrams, vocab = padded_everygram_pipeline(n, corpus)
    model = ModelClass(n)
    model.fit(ngrams, vocab)

    num_tokens_after = len(model.vocab)

    print("\nTokens after fitting:", num_tokens_after)

    print("\n",model)

    print("\nDifferences with Corpus: ", len(set(list(model.vocab))) - len(set(list(flatten(corp)))), "( Corpus has",len(set(list(flatten(corp)))),")", "\n\n Tokens differences between Model and Corpus:",set(list(model.vocab)) - set(list(flatten(corp))))

    return model

In [29]:
trump_model = train_language_model(n = 3 , corpus = corp, model_type = "mle")


Tokens after fitting: 13717

 <nltk.lm.models.MLE object at 0x15e001840>

Differences with Corpus:  3 ( Corpus has 13714 ) 

 Tokens differences between Model and Corpus: {'<UNK>', '<s>', '</s>'}


### Ex. 2.2: Vocabulary
- How many distinct tokens are in the model's vocabulary? Is that the same number of distinct tokens that appear in the tokenized corpus?
- Lookup the tokens of the sentence `"I love UNIGE students!"` in the model's vocabulary. Explain what you observe, and why. 

##### Answer

In [30]:
trump_model = train_language_model(n = 3 , corpus = corp, model_type = "mle")


Tokens after fitting: 13717

 <nltk.lm.models.MLE object at 0x16938e380>

Differences with Corpus:  3 ( Corpus has 13714 ) 

 Tokens differences between Model and Corpus: {'<UNK>', '<s>', '</s>'}


> We can see that our model has 3 distinct tokens more that our original Corpus, because we actually account for the **UNK**, **s** and **/s** tokens that are used to help the model start and end the sentences and also identify which tokens where remove during the cut-off process in the vocabulary.

`"I love UNIGE students!"`

In [31]:
print(trump_model.vocab.lookup('I love UNIGE students !'.split()))

('I', 'love', '<UNK>', 'students', '!')


> UNIGE is not contained in the vocabulary of model MLE, thus returning the **UNK** word, we can check that by doing this:

In [32]:
"UNIGE" in trump_model.vocab

False

> Which confirms that it wasn't in the original Corpus, which would have been funny to watch, Trump talking about our University or Quote someone else who did. 

### Ex. 2.3: Token probabilities
- When it comes to ngram models the training boils down to counting the ngrams from the training corpus. Using your fitted model, how many times do the following appear in the training data: ``'America', 'Trump', 'I will', 'will never forget'``.
- Then, compute the following word occurrence probabilities ('scores') in the Trump Tweets corpus, and briefly explain what the returned numbers mean about the training data:
    - $\mathbb{P}($'America'$)$,
    - $\mathbb{P}($'Trump'$)$,
    - $\mathbb{P}($'will'$\vert $'I'$)$,
    - $\mathbb{P}($'forget'$\vert $'will never'$)$.
- Briefly explain, with a formula, how those probabilities are obtained from the n-gram counts.

##### Answer

#### Unigrams

In [33]:
print("'America' count is:",trump_model.counts['America'], "and in lowercase:",trump_model.counts['america'],"\n'Trump' count is:",trump_model.counts['Trump'],"and in lowercase:",trump_model.counts['trump'], "\n'I will' count is:",trump_model.counts['I will'], "\n'will never forget' count is:",trump_model.counts['will never forget'])

'America' count is: 250 and in lowercase: 2 
'Trump' count is: 918 and in lowercase: 34 
'I will' count is: 0 
'will never forget' count is: 0


#### Bigrams

In [34]:
print("\n'I' + 'will' count is:",trump_model.counts[['I']]['will'], "\n'will never' + 'forget' count is:",trump_model.counts[['will never']]['forget'])


'I' + 'will' count is: 344 
'will never' + 'forget' count is: 0


#### Trigrams

In [35]:
print("'will' + 'never' + 'forget' count is:",trump_model.counts[['will', 'never']]['forget'])

'will' + 'never' + 'forget' count is: 8


#### Probabilities

$\mathbb{P}($'America'$)$

In [36]:
print(trump_model.score('America')*100,"%")

0.14612277820315742 %


> "*America*" is approx 0.1% of the Corpus, which may counterintuitive, but Trump generally use important keywords in # hashtag, which is why only looking at "*Amercia*" and not "*#makeamericagreatagain*" or "*#America*" seems to be not that frequent. 

$\mathbb{P}($'america'$)$ if we also want lowercase

In [37]:
print(trump_model.score('america')*100,"%")

0.0011689822256252594 %


> lowercase is way less present than uppercase

$\mathbb{P}($'Trump'$)$

In [38]:
print(trump_model.score('Trump')*100,"%")

0.5365628415619941 %


> "*Trump*" word is approx 0.5% in the MLE model, which isn't surprising as we removed most Retweets, and in rare occasion Trump would actually use his own name in a sentence. What may happens more is when the Quote Tweet that Trump use mention himself. 

$\mathbb{P}($'trump'$)$ if we also want lowercase

In [39]:
print(trump_model.score('trump')*100,"%")

0.01987269783562941 %


> lowercase is way less present than uppercase

$\mathbb{P}($'will'$\vert $'I'$)$

In [40]:
print(trump_model.score('will', 'I'.split())*100,"%")

20.52505966587112 %


> we can see that the probabiltiy rise with *I* followed by "*will*", since he is the person writing, and "*will*" is a very common word for political speech to be followed after the pronoun "*I*".

$\mathbb{P}($'forget'$\vert $'will never'$)$

In [41]:
print(trump_model.score('forget', 'will never'.split())*100,"%")

28.57142857142857 %


"*will never*" negation followed by "*forget*" verb seems very common as well, Political speech again seems to often use the sentence "*I will never forget...*"

#### Formulas for N-Grams 

From "Introduction into NLP lecture - Slides 22 & 23"

To derive the probabilities from trigram counts, we employ the following formula:

$P(w_n | w_{1:(n-1)}) ≈ P(w_n | w_{(n-2):(n-1)})$


This equation embodies the **Markov assumption**, which posits that the probability of the current word $w_n$ is solely dependent on the preceding two words $w_{(n-2)}$ and $w_{(n-1)}$. We utilize **maximum likelihood estimation** to approximate this probability, where the transition matrix $P$ of size $V^3$ (with $V$ representing the vocabulary) is calculated from the bigram counts in the corpus.

More specifically, for any two vocabulary words $v_i$ and $v_j$, the transition probability $P_{ij}$ is estimated as:

$P_{ijk} = P(v_k | v_i, v_j) ≈ \frac{C(v_i v_jv_k)}{C(v_iv_j)}$

This estimation yields the probability of observing $v_k$ given that $v_i$ and $v_j$ have been previously observed.


----------------

## Part 3: Generation using N-gram Language Model

### Ex. 3.1: Tweet generator
Create a python function to generate new Trump Tweets. It should:
- take as input arguments: a fitted `nltk.lm.model`, a maximum number of words (integer), a text seed (initial context tokens), and a random "RNG" seed for generation,
- output a newly generated Trump Tweet, according to the input arguments, post-processed as a single text string that is formatted like a tweet.

*Hints:* `nltk.tokenize.treebank.TreebankWordDetokenizer()` and its `.detokenize()` method can help with post-processing. Pay attention to show things like `@user` mentions, urls, punctuation, etc... in a "correct" format.

##### Answer

In [42]:
def generate_tweet(language_model, text_seed = None, max_words = 20, random_seed = 42):
    """
    Generates a sequence of words based on the given language model.

    Args:
        language_model: A trained n-gram language model.
        max_words: The maximum number of words to generate.
        seed: The random seed value for reproducibility.

    Returns:
        A string of generated text.
    """
    generated_text = []
    for word in language_model.generate(max_words,  text_seed, random_seed):
        if word == '<s>':  # Skip start token
            continue
        if word == '</s>':  # Stop generating at end token
            break
        generated_text.append(word)

    # Check if only one word is generated and force restart
    if len(generated_text) <= 1:
        return generate_tweet(language_model, text_seed, max_words, random_seed+1)

    # Detokenize the generated text
    detokenizer = nltk.tokenize.treebank.TreebankWordDetokenizer()
    tweet = detokenizer.detokenize(generated_text)

    # Capitalize the tweet
    tweet = tweet.capitalize()

    # Remove spaces before punctuations
    for punctuation in str.punctuation:
        if punctuation not in ['@', '#', '&']:
            tweet = tweet.replace(" " + punctuation, punctuation)

    while tweet and not tweet[0].isalnum():
        tweet = tweet[1:]

    return tweet


#### Generated Tweet not Random

In [43]:
generate_tweet(trump_model, max_words = 20, random_seed = 73)

'I should have easily been won against obama in 2008, not just running against the wall street'

#### Generated Tweet at Random

In [44]:
seed_random = random.randint(0,1000)

generate_tweet(trump_model, max_words = 20, random_seed = seed_random)

'Very impressed, great to be the grand marshall- in chicago, illinois. #bigleaguetruth #debate'

### Ex. 3.2: Initial context
To generate a full tweet from a LM of order $n$, explain what should be the text seed (i.e. the initial context tokens). Set the default value for the relevant argument of your function in 3.1 accordingly.

##### Answer

In [45]:
random.seed(500)
seed_random = 2000

generate_tweet(trump_model, max_words = 20,  text_seed = None, random_seed = seed_random)

'I win, says will vote for him and his running mate @mike pence and family hanging @monteskitchen in dutchess'

> We don't necessarly want to give a context when generating randomly some tweets by default, this will restrict the tweet generations too much. Therefore we can set text_seed to **None**

### Ex. 3.3: Generate tweets
Generate a few new tweets using your new function and the LM fitted in Part 2. For reproducibility, use a random RNG seed to show them. 

*Facultative:* show a few examples that you find interesting, representative or funny.

##### Answer

In [46]:
random.seed(500)
seed_random = 1000

generate_tweet(trump_model, max_words = 20,  text_seed = ["Russia"],random_seed = seed_random)

'Leaked the disastrous dnc e-mails, which are total losers!'

In [47]:
random.seed(500)
seed_random = 23

generate_tweet(trump_model, max_words = 20,  text_seed = ["Hillary"],random_seed = seed_random)

'Type policy and management has done less in the debate. even paul knew it. if they dont.'

In [48]:
random.seed(500)
seed_random = 20

generate_tweet(trump_model, max_words = 20,  text_seed = ["Clinton"],random_seed = seed_random)

'Supporter @alisonforky declare crooked hillary cant even send emails without putting entire nation at risk?'

## Part 4: Smoothing and model comparison

### Ex. 4.1: Smoothed LM alternatives to simple MLE
Modify the function that you defined in 2.1 by adding an argument that allows changing the `nltk.lm` language model that is fitted in the function (e.g. to fit a Laplace or a Lidstone model instead of the simple MLE). 
Also briefly explain what is the difference between Laplace, Lidstone and the simple MLE language models.

*Hint:* Your function might need more than a single additional argument, if some LM have hyperparameters.

##### Answer

The main differences between **MLE**, **Laplace Smoothing**, and **Lidstone Smoothing** are as follows:

When it comes to handling unseen words, **MLE** assigns zero probability to them, which can lead to issues when encountering new words, whereas **Laplace Smoothing** adds a small constant value to each word's count, ensuring no word has zero probability, even if unseen, and **Lidstone Smoothing** also adds a fraction of the smoothing parameter to each word's count, allowing for customizable smoothing.

In terms of probability distribution, **MLE** estimates probability based on observed frequency, without considering unseen events, whereas **Laplace Smoothing** distributes the added probability mass evenly among all words, including unseen words, and **Lidstone Smoothing** allows for a customizable smoothing parameter, which controls the amount of smoothing applied.

Regarding bias and overestimation, **MLE** has no bias, but may struggle with unseen words, whereas **Laplace Smoothing** introduces a bias towards unseen events and can overestimate rare events, and **Lidstone Smoothing** offers more flexibility in controlling the bias and overestimation, depending on the chosen smoothing parameter.

In terms of flexibility, **MLE** has no flexibility, as it relies solely on observed frequencies, whereas **Laplace Smoothing** has limited flexibility, as it adds a fixed constant value, and **Lidstone Smoothing** offers the most flexibility, as it allows for a customizable smoothing parameter.

Finally, when it comes to tunability, **MLE** has no tunable parameters, whereas **Laplace Smoothing** has no tunable parameters, as the smoothing value is fixed, and **Lidstone Smoothing** allows the smoothing parameter ($\gamma$) to be tuned based on the specific characteristics of the data.

In [49]:
def train_language_models(n, corpus, model_type='mle', gamma = 0.1):
    """
    Train a language model using maximum likelihood estimation (MLE),
    Laplace smoothing, or Lidstone smoothing.

    Args:
        n (int): The order of the language model.
        corpus (list): The training corpus as a list of sentences or tokens.
        model_type (str, optional): The type of model to train.
            Valid options: 'mle' (default), 'laplace', 'lidstone'.
        alpha (float, optional): The gamma Lidstone smoothing.
            Defaults to 0.1

    Returns:
        object: The trained language model.

    Raises:
        ValueError: If an invalid model_type is specified.

    """
    
    model_classes = {'mle': MLE, 'laplace': Laplace, 'lidstone': Lidstone}
    
    ModelClass = model_classes.get(model_type)

    ngrams, vocab = padded_everygram_pipeline(n, text = corpus)

    if model_type == "lidstone": 
        model = ModelClass(order = n, gamma = gamma)
    else:
        model = ModelClass(order = n)

    model.fit(text = ngrams, vocabulary_text = vocab)

    num_tokens_after = len(model.vocab)

    print("\nTokens after fitting:", num_tokens_after)

    return model

### Ex. 4.2: Qualitative model comparison 
With $n=1,2,3,4$, fit and generate new tweets from the simple MLE and from the Laplace LM of orders $n$. 
- Compare the results between the different $n$ values and between the two models. 
- What are the main differences for generation? Which model(s) do you think might be the best options for generating new realistic tweets?
- Do you see hints of those differences in the generated tweets?

##### Answer

#### Fitting n = 1,2,3,4 for MLE

In [73]:
randomseed1 = 500

print("MLE MODEL\n--------------")
for i in range(1,5):
   print("Order:" , i, "\nTweet:", generate_tweet(train_language_models(n = i, corpus = corp, model_type='mle'), max_words = 20,  text_seed = None, random_seed = randomseed1),"\n-------------")

MLE MODEL
--------------

Tokens after fitting: 13715
Order: 1 
Tweet: Our was do @realdonaldtrump need nh out i debate one! house rnc cruz to #iacaucus vp explains.- 
-------------

Tokens after fitting: 13717
Order: 2 
Tweet: Of weeks i dont want our people are not run! 
-------------

Tokens after fitting: 13717
Order: 3 
Tweet: Never will." great! 
-------------

Tokens after fitting: 13717
Order: 4 
Tweet: Me was the highest rated show that they have long dreamed of- and no effective raise in years. 
-------------


#### Fitting n = 1,2,3,4 for LaPlace

In [74]:
randomseed1 = 500

print("LAPLACE MODEL\n--------------")
for i in range(1,5):
   print("Order:" , i, "\nTweet:", generate_tweet(train_language_models(n = i, corpus = corp,  model_type = "laplace"), max_words = 20,  text_seed = None, random_seed = randomseed1),"\n-------------")

LAPLACE MODEL
--------------

Tokens after fitting: 13715
Order: 1 
Tweet: Open w dangerous @rnull65 movement matter our i consistent on! heading party contract to #makeamericagreatagain trump dumb.. 
-------------

Tokens after fitting: 13717
Order: 2 
Tweet: Of ways she ever run against me and lost so! 
-------------

Tokens after fitting: 13717
Order: 3 
Tweet: Nations to pay fair taxes, while trump solidifies his dominating lead https://website 
-------------

Tokens after fitting: 13717
Order: 4 
Tweet: Match for putin or if the truth be told even for hilary. usa needs a winner." 
-------------


- Compare the results between the different $n$ values and between the two models. 

> As we increase the order of the Model, in both case, tweets generated looks nicer, with less weird artefacts and unlogical sentences. **LaPlace** Model seems to be more often coherent than **MLE**. 

- What are the main differences for generation? Which model(s) do you think might be the best options for generating new realistic tweets?

> **MLE** tends to generate very close words together based on observed frequency on the Corpus, which when the order is low, is very visible. On the opposite, even on same order, **LaPlace** seems to chose more general words that don't seems to follow the n-grams order too much and looks "smoother" and is less predictable.

- Do you see hints of those differences in the generated tweets?

> The 4th tweet of both model (order = 4) is a good example of how **MLE** is choosing very close words every 4 grams, but it looks like there is a unlogical blend between each. 

### Ex. 4.3: Quantitative evaluation and comparison
- Split the tokenized Trump Tweets corpus into a (reproducible) training set (80%) and a test set (20%). 
- Compute the train and test 3-gram perplexity scores of a simple MLE LM, a Laplace LM, and a Lidstone LM with $\gamma=0.1$. Use model order $n=3$ for each.
- Compare and discuss the obtained train and test perplexity scores of the three models. Argue which model might represent the Trump Tweets data best.

*Hint:* To compute the perplexity correctly, you might need to preprocess the relevant corpus documents to a list of padded $n$-grams.

##### Answer

#### Splitting the Corpus into Training and Test

In [52]:
Train_Corpus, Test_Corpus = train_test_split(corp, test_size=0.2, random_state=42)

print("\n Train Corpus:\n\n",Train_Corpus[1:10], "\n\n----------------","\n\n Test_Corpus:\n\n",Test_Corpus[1:10])

print("\n-------------------------------\n\n",len(Train_Corpus),"(80% Train)", "+",len(Test_Corpus),"(20% Test)", "=", len(Train_Corpus)+len(Test_Corpus), "\n\n Total Corpus:", len(corp))


 Train Corpus:

 [['Thank', 'you', 'for', 'the', 'kind', 'words', 'tonight', ',', '@OMAROSA', '.', 'You', 'were', 'great', '!', 'See', 'you', 'soon', '!'], ['I', 'will', 'do', 'far', 'more', 'for', 'women', 'than', 'Hillary', ',', 'and', 'I', 'will', 'keep', 'our', 'country', 'safe', ',', 'something', 'which', 'she', 'will', 'not', 'be', 'able', 'to', 'do-no', 'strength', '/', 'stamina', '!'], ['#MakeAmericaSafeAgain', '#ImWithYou', 'https://website'], ['"', '@teed', 'chris', ':', '@Loyal2Trump2016', '@TrumpAlabama', '@FoxNews', 'Look', 'when', 'you', 'try', 'to', 'kill', 'Your', 'mom', ',', 'thats', 'it', 'for', 'me', ',', 'no', 'walking', 'on', 'water', '"'], ['New', 'Iowa', 'poll', '.', 'Thank', 'you', '!', '#MakeAmericaGreatAgain', '#Trump2016', 'https://website'], ['"', '@sparkey03', ':', '@realDonaldTrump', 'Go', '#Trump2016', '"'], ['.', '@megynkelly', 'must', 'have', 'had', 'a', 'terrible', 'vacation', ',', 'she', 'is', 'really', 'off', 'her', 'game', '.', 'Was', 'afraid', 'to

#### Sequence of Padded Ngram Tuples for Perplexity

According to documentation for Entropy, we should prepare our Train and Test corpus into N-grams padded as tuples to compute Perplexity (as it uses Entropy function).

This should look like this:

corpus = [["I", "love", "natural", "language", "processing"], ["This", "is", "another", "sentence"]]

https://www.nltk.org/_modules/nltk/lm/api.html#LanguageModel.entropy 

In [53]:
n = 3
corpus_temp = [["I", "love", "natural", "language", "processing"], ["This", "is", "another", "sentence"]]
print("\nOUTPUT:")
[tuple(ngram) for sentence in corpus_temp for ngram in ngrams(pad_both_ends(sentence, n=n), n)]


OUTPUT:


[('<s>', '<s>', 'I'),
 ('<s>', 'I', 'love'),
 ('I', 'love', 'natural'),
 ('love', 'natural', 'language'),
 ('natural', 'language', 'processing'),
 ('language', 'processing', '</s>'),
 ('processing', '</s>', '</s>'),
 ('<s>', '<s>', 'This'),
 ('<s>', 'This', 'is'),
 ('This', 'is', 'another'),
 ('is', 'another', 'sentence'),
 ('another', 'sentence', '</s>'),
 ('sentence', '</s>', '</s>')]

#### MLE LM

In [54]:
# Set Order
n = 3

##### Train

In [55]:
Train_MLE = train_language_models(n = n, corpus = Train_Corpus,  model_type = "mle")
print("Model:",Train_MLE)


Tokens after fitting: 12136
Model: <nltk.lm.models.MLE object at 0x16e048640>


                Perplexity on Train             

In [56]:
corpus_temp = Train_Corpus

ngram_corpus =  [tuple(ngram) for sentence in corpus_temp for ngram in ngrams(pad_both_ends(sentence, n=n), n=n)]

Train_MLE.perplexity(ngram_corpus)

3.117910105346997

> We have a very low Perplexity, which may indicate an overfit of MLE on Train

                Perplexity on Test             

In [57]:
corpus_temp = Test_Corpus

ngram_corpus =  [tuple(ngram) for sentence in corpus_temp for ngram in ngrams(pad_both_ends(sentence, n=n), n=n)]

Train_MLE.perplexity(ngram_corpus)

inf

> This should be normal to have inf as we are dealing with unkown word and the log(0) when frequency of a word is none, return -inf. 

#### Laplace LM

In [58]:
Train_LAPLACE = train_language_models(n = n, corpus = Train_Corpus,  model_type = "laplace")
print("Model:",Train_LAPLACE)


Tokens after fitting: 12136
Model: <nltk.lm.models.Laplace object at 0x16de84310>


                Perplexity on Train          

In [59]:
corpus_temp = Train_Corpus

ngram_corpus =  [tuple(ngram) for sentence in corpus_temp for ngram in ngrams(pad_both_ends(sentence, n=n), n=n)]

Train_LAPLACE.perplexity(ngram_corpus)

2752.4119200804357

> We are less over fitting on Train with LaPlace this time

                Perplexity on Test             

In [60]:
corpus_temp = Test_Corpus

ngram_corpus =  [tuple(ngram) for sentence in corpus_temp for ngram in ngrams(pad_both_ends(sentence, n=n), n=n)]

Train_LAPLACE.perplexity(ngram_corpus)

4571.644399941378

> Test should be higher with its Perplexity as this set should be unseen by our model and thus perform less good in term of probabilities and patterns and be more confused with what next words he should put. This is the case here, and also on others models. 

#### Lidstone LM with $\gamma=0.1$

In [61]:
Train_LIDSTONE = train_language_models(n = n, corpus = Train_Corpus,  model_type = "lidstone", gamma = 0.1)
print("Model:",Train_LIDSTONE, "\nGamma:",Train_LIDSTONE.gamma)


Tokens after fitting: 12136
Model: <nltk.lm.models.Lidstone object at 0x16cbe16c0> 
Gamma: 0.1


                Perplexity on Train             

In [62]:
corpus_temp = Train_Corpus

ngram_corpus =  [tuple(ngram) for sentence in corpus_temp for ngram in ngrams(pad_both_ends(sentence, n=n), n=n)]

Train_LIDSTONE.perplexity(ngram_corpus)

481.3861204952867

> The smaller perplexity obtained with a lower gamma suggests that the Lidstone smoothing parameter is giving more weight to the observed words in the training set, potentially indicating a stronger reliance on the training data and a higher susceptibility to overfitting.

                Perplexity on Test             

In [63]:
corpus_temp = Test_Corpus

ngram_corpus =  [tuple(ngram) for sentence in corpus_temp for ngram in ngrams(pad_both_ends(sentence, n=n), n=n)]

Train_LIDSTONE.perplexity(ngram_corpus)

2491.5433199218714

> Compared to LaPlace Train VS Test, our Test set on Lidstone with $\gamma$ = 0.1 is proportionally way higher, since we may have overfit our Train Corpus. This is very difficult to choose which Model is best only based on the same Order and without Hyper-Parameter Tuning, but solely based on the Perplexity score we obtain from all models at $n$ = 3, Lidstone with $\gamma$ = 0.1 seems to be less prone to overfitting on the Test set and would performs generally better on unseen Corpus. 

### Ex. 4.4: Hyper-parameter tuning
- Perform a grid-search to select the best hyperparameter values for $n$ and $\gamma$, for the Lidstone LM. You want to select the model that generalizes best to new data.
- What do you observe in the obtained perplexity scores? Was it expected? Explain it in statistical terms.

*Hint:* Maybe try a few values for $n$ and $\gamma$ by hand to identify the general hyperparameter region of interest before defining a more thorough hyperparameter value grid.

##### Answer

---------------

##### Model Function

In [64]:
def train_language_models_grid(n, corpus, model_type='mle', gamma = 0.1):
    """
    Train a language model using maximum likelihood estimation (MLE),
    Laplace smoothing, or Lidstone smoothing.

    Args:
        n (int): The order of the language model.
        corpus (list): The training corpus as a list of sentences or tokens.
        model_type (str, optional): The type of model to train.
            Valid options: 'mle' (default), 'laplace', 'lidstone'.
        alpha (float, optional): The gamma Lidstone smoothing.
            Defaults to 0.1

    Returns:
        object: The trained language model.

    Raises:
        ValueError: If an invalid model_type is specified.

    """
    
    model_classes = {'mle': MLE, 'laplace': Laplace, 'lidstone': Lidstone}
    
    ModelClass = model_classes.get(model_type)

    ngrams, vocab = padded_everygram_pipeline(n, text = corpus)

    if model_type == "lidstone": 
        model = ModelClass(order = n, gamma = gamma)
    else:
        model = ModelClass(order = n)

    model.fit(text = ngrams, vocabulary_text = vocab)

    return model

##### Broad Search

> Testing for multiple $n$ and $\gamma$ before restricting our search.
>
> We try to increment $\gamma$ by 0.1 and go over 1 to 5 for $n$, testing the perplexity score on the Test set

In [65]:
# Hyperparamters Init
order_max = 5
gamma_min = 0
gamma_max = 1.5
gamma_increment = 0.1

# Score Init
best_model_perplexity = 10**10
best_model_order = 0
best_model_gamma = 0

# Store each Values
results_array = np.empty((0, 3))

# Grid Search
for order in range(1,order_max+1):

    for gamma in np.arange(gamma_min+gamma_increment, gamma_max, gamma_increment).round(10):

        model = train_language_models_grid(n = order, corpus = Train_Corpus,  model_type = "lidstone", gamma = gamma)
        
        perplexity = model.perplexity([tuple(ngram) for sentence in Test_Corpus for ngram in ngrams(pad_both_ends(sentence, n=order), n=order)])

        new_row = np.array([[order, gamma, perplexity]])
        results_array = np.append(results_array, new_row, axis=0)

        if perplexity < best_model_perplexity:
           best_model_perplexity = perplexity
           best_model_order = order
           best_model_gamma = gamma


#### Best Hyperparameters and Perplexity

In [66]:
def highlight_lowest_perplexity(row):
    if row['perplexity'] == results_df['perplexity'].min():
        return ['background-color: #a87d4c'] * len(row)
    else:
        return [''] * len(row)
    
results_df = pd.DataFrame(results_array, columns=['order', 'gamma', 'perplexity'])
results_df.style.apply(highlight_lowest_perplexity, axis=1)
results_df["order"] = results_df["order"].astype(int)
best_index = results_df['perplexity'].idxmin()

# Select the 10 rows around the best row
slice_start = max(0, best_index - 5)
slice_end = min(len(results_df), best_index + 6)
results_df_slice = results_df.iloc[slice_start:slice_end]

# Apply the styling function to the sliced DataFrame
styled_df_slice = results_df_slice.style.apply(highlight_lowest_perplexity, axis=1)

# Display the styled DataFrame
styled_df_slice

Unnamed: 0,order,gamma,perplexity
9,1,1.0,937.394128
10,1,1.1,937.047833
11,1,1.2,937.174801
12,1,1.3,937.696749
13,1,1.4,938.552803
14,2,0.1,791.432359
15,2,0.2,1020.758654
16,2,0.3,1206.49082
17,2,0.4,1366.983337
18,2,0.5,1510.04851


> The Highlighted row in this table is our best Perplexity on Test, which shows that Lidstone of $n$ = 2 and $\gamma$ = 0.1 seems the best for our case.

#### Plotting Grid Search

In [67]:
fig = px.line(results_df, x='gamma', y='perplexity', color='order', 
              color_discrete_sequence=px.colors.qualitative.Plotly,
              title = "Broad Grid Search with Lidstone")
fig.add_trace(go.Scatter(x=[results_df_slice.loc[best_index,'gamma']], y=[results_df_slice.loc[best_index,'perplexity']], mode='markers', marker=dict(color='#DDA15E', size=12),name='Min Perplexity'))
fig.update_layout(width=1000, height=600)
fig.show()

> With broader search, we find the trends in the perplexity of each $n$ and $\gamma$, we would pursue the search on the left hand-side, where gamma is closer to 0.00 and with order = 2

#### Focused Search

> This time we want to focus on $n$ = 2, and check the smallest $\gamma$ possible around 0 and 0, every 0.01

In [68]:
# Hyperparamters Init
order_max = 2
gamma_min = 0
gamma_max = 0.2
gamma_increment = 0.01

# Score Init
best_model_perplexity = 10**10
best_model_order = 0
best_model_gamma = 0

# Store each Values
results_array = np.empty((0, 3))

# Grid Search
for order in range(2,order_max+1):

    for gamma in np.arange(gamma_min+gamma_increment, gamma_max, gamma_increment).round(10):

        model = train_language_models_grid(n = order, corpus = Train_Corpus,  model_type = "lidstone", gamma = gamma)
        
        perplexity = model.perplexity([tuple(ngram) for sentence in Test_Corpus for ngram in ngrams(pad_both_ends(sentence, n=order), n=order)])

        new_row = np.array([[order, gamma, perplexity]])
        results_array = np.append(results_array, new_row, axis=0)

        if perplexity < best_model_perplexity:
            best_model_perplexity = perplexity
            best_model_order = order
            best_model_gamma = gamma


#### Best Hyperparameters and Perplexity

In [69]:
def highlight_lowest_perplexity(row):
    if row['perplexity'] == results_df['perplexity'].min():
        return ['background-color: #a87d4c'] * len(row)
    else:
        return [''] * len(row)
    
results_df = pd.DataFrame(results_array, columns=['order', 'gamma', 'perplexity'])
results_df.style.apply(highlight_lowest_perplexity, axis=1)
results_df["order"] = results_df["order"].astype(int)
best_index = results_df['perplexity'].idxmin()

# Select the 10 rows around the best row
slice_start = max(0, best_index - 5)
slice_end = min(len(results_df), best_index + 6)
results_df_slice = results_df.iloc[slice_start:slice_end]

# Apply the styling function to the sliced DataFrame
styled_df_slice = results_df_slice.style.apply(highlight_lowest_perplexity, axis=1)

# Display the styled DataFrame
styled_df_slice

Unnamed: 0,order,gamma,perplexity
0,2,0.01,484.943266
1,2,0.02,529.076891
2,2,0.03,570.386228
3,2,0.04,608.205574
4,2,0.05,643.212079
5,2,0.06,675.974686


> The Highlighted row in this table is our best Perplexity on Test, which shows that Lidstone of $n$ = 2 and $\gamma$ = 0.01 seems the best for our case.

#### Plotting Grid Search

In [70]:
fig = px.line(results_df, x='gamma', y='perplexity', color='order', 
              color_discrete_sequence=px.colors.qualitative.Plotly,
              title = "Focused Grid Search with Lidstone")
fig.update_layout(width=1000, height=600)
fig.add_trace(go.Scatter(x=[results_df_slice.loc[best_index,'gamma']], y=[results_df_slice.loc[best_index,'perplexity']], mode='markers', marker=dict(color='#DDA15E', size=12),name='Min Perplexity'))
fig.show()

> The optimal perplexity appears to hover around 484, achieved with $\gamma$ = 0.01. This confirms the effectiveness of our Broad Search approach, emphasizing lower values of $\gamma$.

#### Chosen Model for better Generalisation on new Words

*Lidstone* with $n$ = 2 and $\gamma$ = 0.01

In [71]:
model_final = train_language_models_grid(n = 2, corpus = Train_Corpus,  model_type = "lidstone", gamma = 0.01)
print(model_final.vocab, "\nwith gamma:",model_final.gamma, "\nof order:", model_final.order)

<Vocabulary with cutoff=1 unk_label='<UNK>' and 12136 items> 
with gamma: 0.01 
of order: 2


> The order of 2 or 3 was expected, given the brevity of most Trump tweets, often relying on punchlines and key words like hashtags or names. It appears that the 2-gram pattern performs better on the Test set, with the benefit of a small $\gamma$ as a smoothing parameter.
> With Lidstone $n$ = 2 and $\gamma$ = 0.01, the achieved Perplexity stands at 484, indicating the model's hesitation among 484 words within a vocabulary set of 12,136 words. This represents roughly 3% of the trained tokens, a promising result. We're constantly navigating the trade-off between Bias and Variance in our model. By tuning the flexible parameter $\gamma$, we control the balance between introducing bias and reducing variance on a Test set. If we aim for fewer hesitations on new words, we decrease variance but at the expense of introducing bias.
> Perplexity, though useful, has limitations in assessing Language Learning Models (LLMs). It tends to prioritize immediate context over broader understanding, overlook ambiguity and creativity, and is sensitive to vocabulary size. Furthermore, achieving low perplexity doesn't guarantee effective generalization to real-world data. 
> 
> (Sourabh (2023). Decoding Perplexity and its significance in LLMs. [online] UpTrain AI. Available at: https://blog.uptrain.ai/decoding-perplexity-and-its-significance-in-llms/)

-----------------
## <center> END </center>