## Ngrams lab
LLM's and ChatGPT | Fall 2023 | McSweeney | CUNY Graduate Center

**Due:** September 17


### Background
The purpose of this lab is to explore ngram models. Ngram models are a good introduction to language models generally. Language models are probabilistic representations of language. Ngrams have the benefit of being easy to interrogate and relatively easy to understand (as compared to neural networks). 

In this lab, you will build an ngram model from the corpus of your choosing. The example is with 'The Great Gatsby' from Project Gutenberg, but there's a code block for any text file on your computer  


#### Notes
This lab is based heavily on the [nltk documentation](https://www.nltk.org/api/nltk.lm.html)

In [1]:
import numpy as np
import re

import nltk
# if you haven't downloaded punkt before, you only need to run the line below once 
# nltk.download('punkt')
from nltk import word_tokenize
from nltk import sent_tokenize

from nltk.util import bigrams
from nltk.lm.preprocessing import padded_everygram_pipeline

import requests

# Part 1
An example of how ngrams are generated

In [2]:
# you will need to leverage the requests package
r = requests.get(r'https://www.gutenberg.org/cache/epub/64317/pg64317.txt')
great_gatsby = r.text

# first, remove unwanted new line and tab characters from the text
for char in ["\n", "\r", "\d", "\t"]:
    great_gatsby = great_gatsby.replace(char, " ")

# check
print(great_gatsby[:100])


The Project Gutenberg eBook of The Great Gatsby        This ebook is for the use of anyone anywhere


In [194]:
# remove the metadata at the beginning - this is slightly different for each book
great_gatsby = great_gatsby[983:]
print(great_gatsby[:100])

the East last autumn I felt that I wanted  the world to be in uniform and at a sort of moral attenti


#### Txt locally
If you'd rather use a file on your computer, here's the code -- you just need to save the text file in your local directory, and change the variables throughout. 

The example is a report from the [Congressional Research Service](https://www.everycrsreport.com/files/2020-11-10_R45178_62d6238caecf6c02ddf495be33b3439f09eed744.pdf) on AI and National Security.

In [195]:
f = open("the-wizard-of-oz.txt", 'r').read()

for char in ['\n', '\r', '\d', '\t']:
    f = f.replace(char, ' ')

wiz_of_oz = f[3110:]
print(wiz_of_oz[:100])

     Chapter I The Cyclone   Dorothy lived in the midst of the great Kansas prairies, with Uncle Hen


In [82]:
# this is simplified for demonstration
def sample_clean_text(text: str):
    # lowercase
    text = text.lower()
    
    # remove punctuation from text
    text = re.sub(r"[^\w\s]", "", text)
    
    # tokenize the text
    tokens = nltk.word_tokenize(text)
    
    # return your tokens
    return tokens

# call the function
sample_tokens = sample_clean_text(text = wiz_of_oz)

# check
print(sample_tokens[:50])

['chapter', 'i', 'the', 'cyclone', 'dorothy', 'lived', 'in', 'the', 'midst', 'of', 'the', 'great', 'kansas', 'prairies', 'with', 'uncle', 'henry', 'who', 'was', 'a', 'farmer', 'and', 'aunt', 'em', 'who', 'was', 'the', 'farmers', 'wife', 'their', 'house', 'was', 'small', 'for', 'the', 'lumber', 'to', 'build', 'it', 'had', 'to', 'be', 'carried', 'by', 'wagon', 'many', 'miles', 'there', 'were', 'four']


In [83]:
# create bigrams from the sample tokens
my_bigrams = bigrams(sample_tokens)

# check
list(my_bigrams)[:10]

[('chapter', 'i'),
 ('i', 'the'),
 ('the', 'cyclone'),
 ('cyclone', 'dorothy'),
 ('dorothy', 'lived'),
 ('lived', 'in'),
 ('in', 'the'),
 ('the', 'midst'),
 ('midst', 'of'),
 ('of', 'the')]

# Part 2 - creating an ngram model


In [84]:
# 2 is for bigrams
n = 2
#specify the text you want to use
text = wiz_of_oz


Now we are going to use an NLTK shortcut for preprocessing. This will:
* pad all of the sentences with `<s>` and `</s>` to train on sentence boundaries, too.
* create both unigrams and bigrams
* create a training set and a full vocab to train on

We need to give it a pre-tokenized text (we'll use nltk's tokenizer)

In [44]:
# step 1: tokenize the text into sentences
sentences = nltk.sent_tokenize(text)

# step 2: tokenize each sentence into words
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# step 3: convert each word to lowercase
tokenized_text = [[word.lower() for word in sent] for sent in tokenized_sentences]

#notice the sentence breaks and what the first 10 items of the tokenized text
print(tokenized_text[0])

['chapter', 'i', 'the', 'cyclone', 'dorothy', 'lived', 'in', 'the', 'midst', 'of', 'the', 'great', 'kansas', 'prairies', ',', 'with', 'uncle', 'henry', ',', 'who', 'was', 'a', 'farmer', ',', 'and', 'aunt', 'em', ',', 'who', 'was', 'the', 'farmer', '’', 's', 'wife', '.']


Why tokenize sentences and words?
We want to be able to retain sentence boundaries to encode that, too.

In [86]:
# notice what the first 10 items are of the vocabulary
print(text[:102])

     Chapter I The Cyclone   Dorothy lived in the midst of the great Kansas prairies, with Uncle Henry


In [46]:
# we imported this function from nltk
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

In [47]:
from nltk.lm import MLE
# we imported this function from nltk linear models (lm) 
# it is for Maximum Likelihood Estimation

# MLE is the model we will use
lm = MLE(n)

In [48]:
# currently the vocab length is 0: it has no prior knowledge
len(lm.vocab)

0

In [49]:
# fit the model 
# training data is the bigrams and unigrams 
# the vocab is all the sentence tokens in the corpus 

lm.fit(train_data, padded_sents)
len(lm.vocab)

3574

In [50]:
# inspect the model's vocabulary. 
# be sure that a sentence you know exists (from tokenized_text) is in the 
print(lm.vocab.lookup(tokenized_text[0]))

('chapter', 'i', 'the', 'cyclone', 'dorothy', 'lived', 'in', 'the', 'midst', 'of', 'the', 'great', 'kansas', 'prairies', ',', 'with', 'uncle', 'henry', ',', 'who', 'was', 'a', 'farmer', ',', 'and', 'aunt', 'em', ',', 'who', 'was', 'the', 'farmer', '’', 's', 'wife', '.')


In [51]:
# see what happens when we include a word that is not in the vocab. 
print(lm.vocab.lookup('then wear the gold hat iphone .'.split()))

('then', 'wear', 'the', 'gold', 'hat', '<UNK>', '.')


What did the model replace 'iphone' with? 

Given that it didn't just return an "out of vocab" error, what does that mean about our model? 

In [52]:
# how many times does dorothy appear in the model?
print(lm.counts['dorothy'])

# what is the probability of dorothy appearing? 
# this is technically the relative frequency of dorothy appearing 
lm.score('dorothy')

367


0.0068670009729810645

In [53]:
# how often does (dorothy, and) occur and what is the relative frequency?
print(lm.counts[['dorothy']]['and'])
lm.score('and', 'dorothy'.split())

20


0.05449591280653951

In [54]:
# what is the score of 'UNK'? 

lm.score("<UNK>")

0.0

Does the relative frequency of 'UNK' change your assumption about how the model behaves? 

How should we change our model to account for the fact the `<UNK>` words are not accounted for by the model?

Note: *Programmatically implementing this solution is beyond the scope of this course.*

## Generate text
We want to start our sentence with a word, and use that to predict all the words that come after that. We'll specify how long it should be. 

There is a certain amount of randomness encoded into n-gram models. This prevents a model from becoming entirely deterministic. Maximum Likelihood Estimation without some degree of randomness will only produce the most likely result every time. Setting Random Seed means we will get the same result every time. 

In [74]:
# generate a 20 word sentence starting with the word, 'dorothy'

print(lm.generate(20, text_seed= 'dorothy'))

['the', 'little', 'black', 'eyes', ',', 'who', 'the', 'trees', 'were', 'everywhere', ',', 'i', 'am', 'stuffed', 'man', 'who', 'had', 'been', 'afraid', 'we']


This next code block is just to clean up the tokenized words and make them easier on human eyes. It is literally a detokenizer, which removes some extraneous text markup and reconciles some words back together. 

In [79]:
from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenize = TreebankWordDetokenizer().detokenize

def generate_sent(lm, num_words, text_seed, random_seed=42):
    """
    :param model: An ngram language model from `nltk.lm.model`.
    :param num_words: Max no. of words to generate.
    :param random_seed: Seed value for random.
    """
    content = []
    for token in lm.generate(num_words, text_seed=text_seed, random_seed=random_seed):
        if token == '<s>':
            continue
        if token == '</s>':
            break
        content.append(token)
    return detokenize(content)

In [96]:
# Now generate sentences that look much nicer. 
generate_sent(lm, 10, text_seed='dorothy', random_seed = 48)

'me, while his tears by the emerald green to'

Try a few more sentences, and try out another text. Once you are satisfied with what ngrams can (and cannot) do - post your code to your Github or another site. 

In [99]:
generate_sent(lm, 20, text_seed='dorothy', random_seed = 53)

'out the means of various formats will oil was a coward to the girl lived.'

In [100]:
generate_sent(lm, 20, text_seed='dorothy', random_seed = 47)

'electronic works even if the lion, however, and get to live in this way off all ready.'

In [101]:
generate_sent(lm, 10, text_seed='dorothy', random_seed = 34)

'little man spoke to the laughter that is fit better'

## Parse Lines from SRT files

modified from [pablo-var/learn-english-words-from-srt](https://github.com/pablo-var/learn-english-words-from-srt) repository

In [152]:
def is_time_stamp(l):
  if l[:2].isnumeric() and l[2] == ':':
    return True
  return False

def has_letters(line):
  if re.search('[a-zA-Z]', line):
    return True
  return False

def remove_non_ascii(text):
    return ''.join(i for i in text if ord(i)<128)

def has_no_text(line):
  l = line.strip()
  if not len(l):
    return True
  if l.isnumeric():
    return True
  if is_time_stamp(l):
    return True
  if l[0] == '(' and l[-1] == ')':
    return True
  if not has_letters(line):
    return True
  return False

def is_lowercase_letter_or_comma(letter):
  if letter.isalpha() and letter.lower() == letter:
    return True
  if letter == ',':
    return True
  return False

def clean_up(lines):
  """
  Get rid of all non-text lines and
  try to combine text broken into multiple lines
  """
  new_lines = []
  for line in lines[1:]:
    line = remove_non_ascii(line)
    for char in ["\n", "\r", "\d", "\t"]:
        line = line.replace(char, "")
    if has_no_text(line):
      continue
    elif len(new_lines) and is_lowercase_letter_or_comma(line[0]):
      #combine with previous line
      new_lines[-1] = new_lines[-1].strip() + ' ' + line
    else:
      #append line
      new_lines.append(line)
  return new_lines

Sample sentence tokenization of 'A Nightmare on Elm Street' srt file.

In [196]:
elm_street = open("1984-srts/A_Nightmare_on_Elm_Street-1984.srt", 'r').readlines()
elm_clean = clean_up(elm_street)

print(elm_clean[:20])

['Tina.', 'Tina.', 'Tina.', 'Tina. Tina.', 'You okay, Tina?', 'Just a dream, Ma.', 'Some dream, judging from that.', '- You coming back to the sack or what?', '- Hold your horses.', 'Tina, honey, you gotta cut your fingernails or you gotta stop that kind of dreaming.', 'One or the other.', 'One, two', "Freddy's coming for you", 'Three, four', 'Better lock your door', 'Five, six', 'Grab your crucifix', 'Seven, eight', 'Gonna stay up late', 'Nine, ten']


sample word tokenization of tokenized sentences

In [155]:
# step 2: tokenize each sentence into words
tokenized_sentences = [nltk.word_tokenize(sent) for sent in elm_clean]

# step 3: convert each word to lowercase
tokenized_text = [[word.lower() for word in sent] for sent in tokenized_sentences]

#notice the sentence breaks and what the first 10 items of the tokenized text
print(tokenized_text[10])

['one', 'or', 'the', 'other', '.']


All SRT files of American films from 1984

In [156]:
# Directory containing the files
directory = '1984-srts'

# Initialize an empty list to store the combined lines from all files
combined_lines = []

# Check if the directory exists
if os.path.exists(directory) and os.path.isdir(directory):
    # List all files in the directory
    file_list = os.listdir(directory)

    # Loop through each file in the directory
    for filename in file_list:
        # Construct the full path to the file
        file_path = os.path.join(directory, filename)

        # Check if the file is a regular file (not a directory)
        if os.path.isfile(file_path):
            try:
                # Open the file and read its lines
                with open(file_path, 'r', encoding='utf-8') as file:
                    lines = file.readlines()

                # Clean up the lines using your clean_up function
                cleaned_lines = clean_up(lines)

                # Extend the combined_lines list with the cleaned lines from the current file
                combined_lines.extend(cleaned_lines)
            except Exception as e:
                print(f"Error reading or processing file {file_path}: {e}")

# Now, combined_lines contains all the cleaned-up lines from all the files


In [158]:
# step 2: tokenize each sentence into words
tokenized_lines = [nltk.word_tokenize(sent) for sent in combined_lines]

# step 3: convert each word to lowercase
tokenized_words = [[word.lower() for word in sent] for sent in tokenized_lines]

#notice the sentence breaks and what the first 10 items of the tokenized text
print(tokenized_words[10])

['17', 'years', 'and', '15', 'albums', 'later', ',']


In [164]:
train_data2, padded_sents2 = padded_everygram_pipeline(n, tokenized_words)

In [165]:
lm2 = MLE(n)
lm2.fit(train_data2, padded_sents2)
len(lm2.vocab)

37029

In [185]:
# how many times does dorothy appear in the model?
print(lm2.counts['money'])

# what is the probability of dorothy appearing? 
# this is technically the relative frequency of dorothy appearing 
lm2.score('money')

771


0.00044021498021316457

In [186]:
print(lm.generate(20, text_seed= 'money'))

[':', '</s>', 'outweighs', 'me', 'what', 'you', 'up', '.', '</s>', 'important', 'to', 'move', 'the', 'problem', '?', '</s>', '3', '.', '</s>', "'ll"]


In [193]:
generate_sent(lm2, 20, text_seed='money', random_seed = 44)

': i tell him.'