# Exercise 4.1A: Text Generation Using N-Grams
Kevin King (Kevin.M.King.24@dartmouth.edu)<br>
Dartmouth College, LING48, Spring 2023

Please study the links below and attempt to modify the program according to the homework instructions.

Documentation of the `nltk.lm` package:<br>
https://www.nltk.org/api/nltk.lm.html

How to extract n-gram probabilities:<br>
https://stackoverflow.com/questions/54962539/how-to-get-the-probability-of-bigrams-in-a-text-of-sentences

My implementation (`generate_text` function below):
* Download the necessary packages and resources for the analysis
* Preprocess the tokenized text for language modelling
* Train an n-gram maximum likelihood estimation (MLE) model
* Use `model.generate()` to sample a 100-word sequence from the n-gram model
* Get the counts of a selected unigram, bigram, and trigram from the `model.counts` attribute
* Get the probabilities of a selected unigram, bigram, and trigram using the `model.score()` method

In [2]:
# Upgrade from version in the VM
!pip install -U nltk==3.4
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /Users/kevin/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
import os
import requests
import io 
import random
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline
from nltk.lm import MLE, NgramCounter, Vocabulary
from nltk.util import ngrams
from collections import Counter
from nltk import word_tokenize, sent_tokenize, bigrams, trigrams
import gdown

In [4]:
# Download and decompress corpora
url = "https://drive.google.com/uc?id=1pW_rn9oZi3Ax23-ezP8K3AcR5QhtLR-a"
output = 'ling28-corpora.tar.gz'
gdown.download(url, output, quiet=False)
!tar -xf ling28-corpora.tar.gz

Downloading...
From: https://drive.google.com/uc?id=1pW_rn9oZi3Ax23-ezP8K3AcR5QhtLR-a
To: /Users/kevin/Desktop/Dartmouth/2022-23/23S/CS72/HW4/templates-hw4/ling28-corpora.tar.gz
100%|██████████████████████████████████████| 22.8M/22.8M [00:01<00:00, 16.9MB/s]


In [5]:
# Read the Shakespeare corpus; the context manager closes the file afterwards
with io.open('english - shakespeare.txt', encoding='utf8') as file:
    text = file.read()

#### Preprocess the tokenized text for language modelling
https://stackoverflow.com/questions/54959340/nltk-language-modeling-confusion
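
As a rough illustration of what the padding step does, the sketch below re-implements the idea in plain Python on a toy token list. `pad_tokens` and `all_ngrams` are hypothetical helpers written for this note, not part of NLTK. The sketch assumes the default `<s>` and `</s>` boundary symbols and adds n-1 of each, which is how `pad_both_ends` is documented to behave.

```python
def pad_tokens(tokens, n, left="<s>", right="</s>"):
    """Hypothetical stand-in for pad_both_ends: add n-1 pad symbols per side."""
    return [left] * (n - 1) + list(tokens) + [right] * (n - 1)


def all_ngrams(tokens, n):
    """Return every contiguous n-gram of the token list as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


padded = pad_tokens(["to", "be", "or", "not"], n=2)
print(padded)
print(all_ngrams(padded, 2))
```

With the padding in place, the first and last words of the text also appear as the second and first element of a bigram, so the model can learn which words tend to start and end a sequence.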

In [83]:
# Train an n-gram model on the corpus and generate a 100-token sequence
def generate_text(n):
    # Preprocess the tokenized text for language modelling.
    # The whole corpus is treated as a single padded "sentence".
    # Note: padded_everygram_pipeline pads its input as well, so the
    # boundary symbols are added twice here.
    padded_text = [list(pad_both_ends(word_tokenize(text.lower()), n))]
    train, vocab = padded_everygram_pipeline(n, padded_text)

    # Train an n-gram maximum likelihood estimation model.
    model = MLE(n)
    model.fit(train, vocab)

    # Sample 100 tokens from the trained model
    generated_text = model.generate(100)

    return model, generated_text
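
To make the sampling step less of a black box, here is a small sketch of how a bigram MLE model can generate text: record which words follow each word, then repeatedly draw the next word in proportion to those counts. This is a plain-Python illustration on a toy sentence, not NLTK's actual implementation; `train_bigram_counts` and `sample_sequence` are hypothetical names.

```python
import random
from collections import Counter, defaultdict


def train_bigram_counts(tokens):
    """Map each word to a Counter of the words observed right after it."""
    followers = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        followers[prev][nxt] += 1
    return followers


def sample_sequence(followers, start, length, seed=None):
    """Draw up to `length` words, each sampled in proportion to its bigram count."""
    rng = random.Random(seed)
    word, output = start, []
    for _ in range(length):
        options = followers.get(word)
        if not options:  # context never seen with a continuation: stop early
            break
        words, weights = zip(*options.items())
        word = rng.choices(words, weights=weights, k=1)[0]
        output.append(word)
    return output


toy_tokens = "<s> my good lord , my good friend , my lord </s>".split()
followers = train_bigram_counts(toy_tokens)
print(" ".join(sample_sequence(followers, "<s>", 10, seed=0)))
```

Because each draw is random, the sequences printed in the sections below differ from run to run. The NLTK documentation lists a `random_seed` argument on `generate()` that should make a run repeatable.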


#### Unigram

In [96]:
model1, text1 = generate_text(1)
print("Unigram Sequence: \n" + ' '.join(text1))

Unigram Sequence: 
by to are vnder , 'd all for for of montague to i i too precisely so ; my in , i rosencrantz severally one bolingbroke are mrs. april disturbed o , device own off than ! am made it in king silken farewell child comes more ' a , for will it nature of plummet to what that , you and richmond , ? . my ; will nobleman men , [ . rich from and all with that it . the his will better nose not this all do him these my not thee thine . hast be


#### Bigram

In [97]:
model2, text2 = generate_text(2)
print("Bigram Sequence: \n" + ' '.join(text2))

Bigram Sequence: 
slender accident , and not ; which wear hair at 'em be a liberal rewarder of saucy friar lodowick , they shall we see the devil ! we go home to leave you be not ours of that is a wicked , and such reasons . ] saturninus . no such inevitable prosecution of flesh—you have employ a bargain . king of heaven , i pleas 'd against the wit going back , and life-preserving rest ? soothsayer . perge , which rather give me , to be not . he too far as levels with your pardon me so


#### Trigram

In [98]:
model3, text3 = generate_text(3)
print("Trigram Sequence: \n" + ' '.join(text3))

Trigram Sequence: 
not ; they have privilege to live , shall we dine . this business soundly . duke . my masters , for here comes the better at proverbs by how much i am very well , bully doctor ! shallow . i go ; i can not fight ; the duke he shall feel , to smile again ; for whose sake did i ne'er endured . cerimon . madam , the king , unto the worms were hallow 'd that ; and easy it is not thy kindness last longer telling than thy master here i am not gamesome


#### Four-gram

In [93]:
model4, text4 = generate_text(4)
print("Four-gram Sequence: \n" + ' '.join(text4))

Four-gram Sequence: 
doth run his course . if heaven do dwell . exeunt clown , who commands them , for a search , seek , but now it is not . portia . is there , diomed . call him hither . re-enter troilus . what , out of his thoughts , wherein i see on thee , prithee , pretty youth , and courtezan say now , sir , stands in record , and , that all , that comes a-wooing , _priami_ , is done ; and let poor volke pass . [ within ] who 's here ! let


#### Counts

In [99]:
print("== Counts ==")

unigram_count = model1.counts['my']
print("Unigram Count ('my'): " + str(unigram_count))

bigram_count = model2.counts[['my']]['good']
print("Bigram Count ('my good'): " + str(bigram_count))

# The context list is in reading order, so this looks up "good my lord"
trigram_count = model3.counts[['good', 'my']]['lord']
print("Trigram Count ('good my lord'): " + str(trigram_count))

== Counts ==
Unigram Count ('my'): 12618
Bigram Count ('my good'): 228
Trigram Count ('good my lord'): 77


#### Probabilities (using the trigram model)

In [100]:
print("== Probabilities ==")

unigram_score = model3.score('my')
print("Unigram Prob ('my'): " + str(unigram_score))

bigram_score = model3.score('good', 'my'.split())
print("Bigram Prob ('my good'): " + str(bigram_score))

trigram_score = model3.score('lord', 'my good'.split())
print("Trigram Prob ('my good lord'): " + str(trigram_score))

== Probabilities ==
Unigram Prob ('my'): 0.010839898868846502
Bigram Prob ('my good'): 0.01806942463147884
Trigram Prob ('my good lord'): 0.5614035087719298
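
The counts and scores printed above are tied together by the MLE formula: the score of a word given its context is the count of the full n-gram divided by the count of the context. The short check below plugs in the numbers from the notebook output (copied by hand, so it runs without NLTK or the corpus); the variable names are illustrative only.

```python
import math

# Values copied from the notebook output above
count_my = 12618                             # model1.counts['my']
count_my_good = 228                          # model2.counts[['my']]['good']
p_good_given_my = 0.01806942463147884        # model3.score('good', ['my'])
p_lord_given_my_good = 0.5614035087719298    # model3.score('lord', ['my', 'good'])

# MLE estimate: P(good | my) = count(my good) / count(my)
print(count_my_good / count_my)
print(math.isclose(count_my_good / count_my, p_good_given_my))

# Inverting the same formula gives the trigram count implied by the score
implied_count_my_good_lord = round(p_lord_given_my_good * count_my_good)
print(implied_count_my_good_lord)
```

The implied count for "my good lord" comes out at 128. That is a different n-gram from the one counted as 77 in the Counts section, which was looked up with the context `['good', 'my']`, that is, the word order "good my lord".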
