## **Ngrams Lab**
LLMs and ChatGPT | Fall 2023 | McSweeney | CUNY Graduate Center

**Public link to this Google Colab Notebook:
https://colab.research.google.com/drive/1HebbqSpe5WXT45j9Oh1y7vOfHk6RO_nw**

**Matthew Stanton** | pingstanton@gmail.com | mstanton@gradcenter.cuny.edu | [Lab List on CUNY Academic Commons](https://pingstanton.commons.gc.cuny.edu/2023/09/21/labs-for-data-78000-large-language-models-and-chat-gpt/)

**Due:** September 17, 2023; corrections submitted September 21, 2023





---


**Large Language Models and Chat GPT**
*(Mondays 6:30p, Room 5417, CUNY Graduate Center, New York, NY)*

Instructor: Michelle McSweeney, [michelleamcsweeney.com](https://michelleamcsweeney.com)

Course Site: https://github.com/michellejm/LLMs-fall-23

Importing assigned **ngrams-lab.ipynb** Jupyter workbook from:
https://github.com/michellejm/LLMs-fall-23/blob/main/week2-ngrams-tokenizers-wordvectors/ngrams/ngrams-lab.ipynb

---

This lab is based heavily on the [nltk documentation](https://www.nltk.org/api/nltk.lm.html)

Code annotations copied from OpenAI. (2023). ChatGPT (August 3 Version) [Large language model]. https://chat.openai.com

### Background
The purpose of this lab is to explore ngram models. Ngram models are a good introduction to language models generally. Language models are probabilistic representations of language. Ngrams have the benefit of being easy to interrogate and relatively easy to understand (as compared to neural networks).

In this lab, you will build an ngram model from the corpus of your choosing. The example uses 'The Great Gatsby' from Project Gutenberg, but there is also a section on using any text file on your computer.


In [None]:
# Start by loading up the Stanton's usual suspects...
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import scipy.stats as scp

In [None]:
import re

import nltk
# if you haven't downloaded punkt before, you only need to run the line below once
nltk.download('punkt')
from nltk import word_tokenize
from nltk import sent_tokenize

from nltk.util import bigrams
from nltk.lm.preprocessing import padded_everygram_pipeline

import requests

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Part 1
An example of how ngrams are generated

In [None]:
# you will need to leverage the requests package
r = requests.get(r'https://www.gutenberg.org/cache/epub/64317/pg64317.txt')
great_gatsby = r.text

# first, replace unwanted newline, carriage-return, and tab characters with spaces
# ("\d" was dropped from this list: it is not a valid string escape and matched nothing)
for char in ["\n", "\r", "\t"]:
    great_gatsby = great_gatsby.replace(char, " ")

# check
print(great_gatsby[:100])


﻿The Project Gutenberg eBook of The Great Gatsby        This ebook is for the use of anyone anywhere


In [None]:
# remove the metadata at the beginning - this is slightly different for each book
great_gatsby = great_gatsby[983:]
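
Because the slice offset differs for each book, an alternative is to cut at marker lines instead of a fixed index. The sketch below is a hypothetical helper, not part of the original lab. It assumes the file contains `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THE PROJECT GUTENBERG EBOOK ... ***` lines, which may not hold for every Project Gutenberg file, and it returns the text unchanged when the markers are missing.

```python
# Marker strings are an assumption about the file format, not a guarantee.
START_MARKER = "*** START OF THE PROJECT GUTENBERG EBOOK"
END_MARKER = "*** END OF THE PROJECT GUTENBERG EBOOK"


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return the text between the start and end marker lines, if both exist."""
    start = raw.find(START_MARKER)
    end = raw.find(END_MARKER)
    if start == -1 or end == -1:
        # Markers not found: return the text unchanged rather than guess.
        return raw
    # Find the "***" that closes the start-marker line.
    close = raw.find("***", start + len(START_MARKER))
    if close == -1 or close >= end:
        body_start = start + len(START_MARKER)
    else:
        body_start = close + 3
    return raw[body_start:end].strip()


sample = (
    "Title: Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "In my younger and more vulnerable years...\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text, phone numbers, and so on.\n"
)
cleaned_sample = strip_gutenberg_boilerplate(sample)
print(cleaned_sample)
```

Cutting at the end marker also keeps the license text at the bottom of the file out of the training corpus.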

#### Using a local text file
If you'd rather use a file on your computer, save the text file in your working directory, read it into a string, and change the variables throughout.

The example is a report from the [Congressional Research Service](https://www.everycrsreport.com/files/2020-11-10_R45178_62d6238caecf6c02ddf495be33b3439f09eed744.pdf) on AI and National Security.
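
A minimal sketch of reading a local file, assuming a plain-text file saved as UTF-8. The helper name `load_local_text` and the file name `my_corpus.txt` are hypothetical; a tiny temporary file is written here only so the example runs as-is.

```python
import tempfile
from pathlib import Path


def load_local_text(path, encoding="utf-8"):
    """Read a text file and replace newlines, carriage returns, and tabs with spaces."""
    text = Path(path).read_text(encoding=encoding)
    for char in ["\n", "\r", "\t"]:
        text = text.replace(char, " ")
    return text


# Write a small stand-in file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "my_corpus.txt"
    sample_path.write_text("First line.\nSecond\tline.\n", encoding="utf-8")
    local_text = load_local_text(sample_path)

print(local_text)
```

With a real file, the result would take the place of `great_gatsby` in the cells that follow.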

In [None]:
# this is simplified for demonstration
def sample_clean_text(text: str):
    # lowercase
    text = text.lower()

    # remove punctuation from text
    text = re.sub(r"[^\w\s]", "", text)

    # tokenize the text
    tokens = nltk.word_tokenize(text)

    # return your tokens
    return tokens

# call the function
sample_tokens = sample_clean_text(text = great_gatsby)

# check
print(sample_tokens[:50])

['in', 'college', 'i', 'was', 'unjustly', 'accused', 'of', 'being', 'a', 'politician', 'because', 'i', 'was', 'privy', 'to', 'the', 'secret', 'griefs', 'of', 'wild', 'unknown', 'men', 'most', 'of', 'the', 'confidences', 'were', 'unsoughtfrequently', 'i', 'have', 'feigned', 'sleep', 'preoccupation', 'or', 'a', 'hostile', 'levity', 'when', 'i', 'realized', 'by', 'some', 'unmistakable', 'sign', 'that', 'an', 'intimate', 'revelation', 'was', 'quivering']


In [None]:
# create bigrams from the sample tokens
my_bigrams = bigrams(sample_tokens)

# check
list(my_bigrams)[:10]

[('in', 'college'),
 ('college', 'i'),
 ('i', 'was'),
 ('was', 'unjustly'),
 ('unjustly', 'accused'),
 ('accused', 'of'),
 ('of', 'being'),
 ('being', 'a'),
 ('a', 'politician'),
 ('politician', 'because')]

# Part 2 - creating an ngram model


**From ChatGPT:**

Bigrams are a type of n-gram in natural language processing (NLP) and computational linguistics. N-grams are contiguous sequences of n items (or words) from a given sample of text or speech. In the case of bigrams, n equals 2, so bigrams are sequences of two adjacent words in a text or speech corpus.

For example, consider the sentence: "I love programming in Python." In this sentence, the bigrams would be:

1. "I love"
2. "love programming"
3. "programming in"
4. "in Python"

Bigrams are often used in various NLP tasks, including:

**Text Analysis:** Bigrams help in understanding the co-occurrence patterns of words. Analyzing bigrams can reveal insights about which words frequently appear together in a given text, which can be useful for tasks like sentiment analysis, text classification, and topic modeling.

**Language Modeling:** Bigrams are used in language models to predict the probability of a word based on the previous word. This can be helpful in tasks like speech recognition, machine translation, and text generation.

**Information Retrieval:** In information retrieval systems, bigrams are used to index and search for phrases and multi-word expressions.

**Text Compression:** Bigrams can be used in text compression algorithms to represent frequently occurring pairs of words more efficiently.

In addition to bigrams, there are also **trigrams** (n=3), **4-grams** (n=4), and **n-grams** for larger values of n. The choice of the "n" value depends on the specific NLP task and the desired level of granularity in text analysis.
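
The sliding-window idea can be sketched in plain Python. `make_ngrams` is a hypothetical helper written for illustration (the lab itself uses `nltk.util.bigrams`), and the sentence is the one from the example above with the final period dropped.

```python
def make_ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


example_tokens = "I love programming in Python".split()

example_bigrams = make_ngrams(example_tokens, 2)
example_trigrams = make_ngrams(example_tokens, 3)

print(example_bigrams)
print(example_trigrams)
```

A sentence of 5 tokens yields 4 bigrams and 3 trigrams: each increase in n shortens the list by one.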

In [None]:
# 2 is for bigrams
n = 2
#specify the text you want to use
text = great_gatsby


Now we are going to use an NLTK shortcut for preprocessing. This will:
* pad all of the sentences with `<s>` and `</s>` to train on sentence boundaries, too.
* create both unigrams and bigrams
* create a training set and a full vocab to train on

We need to give it a pre-tokenized text (we'll use nltk's tokenizer)
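
To see what the shortcut returns, here is a small sketch on a made-up two-sentence corpus (the toy sentences are an assumption for illustration). Both return values are lazy iterators, so each can be consumed only once.

```python
from nltk.lm.preprocessing import padded_everygram_pipeline

# Pre-tokenized toy corpus: a list of sentences, each a list of tokens.
toy_sentences = [["i", "love", "python"], ["python", "is", "fun"]]

toy_train, toy_vocab = padded_everygram_pipeline(2, toy_sentences)

# First item of the training data: all unigrams and bigrams of the padded first sentence.
first_sentence_ngrams = list(next(iter(toy_train)))

# Vocabulary stream: every padded sentence flattened into one sequence of tokens.
toy_vocab_tokens = list(toy_vocab)

print(first_sentence_ngrams)
print(toy_vocab_tokens)
```

The padding symbols `<s>` and `</s>` appear both in the n-grams and in the vocabulary stream, which is how sentence boundaries end up in the model.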

In [None]:
# step 1: tokenize the text into sentences
sentences = nltk.sent_tokenize(text)

# step 2: tokenize each sentence into words
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# step 3: convert each word to lowercase
tokenized_text = [[word.lower() for word in sent] for sent in tokenized_sentences]

# notice the sentence breaks: print the first tokenized sentence
print(tokenized_text[0])

['then', 'wear', 'the', 'gold', 'hat', ',', 'if', 'that', 'will', 'move', 'her', ';', 'if', 'you', 'can', 'bounce', 'high', ',', 'bounce', 'for', 'her', 'too', ',', 'till', 'she', 'cry', '“', 'lover', ',', 'gold-hatted', ',', 'high-bouncing', 'lover', ',', 'i', 'must', 'have', 'you', '!', '”', 'thomas', 'parke', 'd', '’', 'invilliers', 'i', 'in', 'my', 'younger', 'and', 'more', 'vulnerable', 'years', 'my', 'father', 'gave', 'me', 'some', 'advice', 'that', 'i', '’', 've', 'been', 'turning', 'over', 'in', 'my', 'mind', 'ever', 'since', '.']


Why tokenize sentences and words?
Tokenizing by sentence first preserves sentence boundaries, so the `<s>` and `</s>` padding lets the model learn which words tend to start and end a sentence.

In [None]:
# print the first 10 characters of the raw text (this is the string, not the vocabulary)
print(text[:10])

 Then wear


In [None]:
# we imported this function from nltk
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

In [None]:
from nltk.lm import MLE
# MLE comes from nltk's language modeling module (nltk.lm)
# it stands for Maximum Likelihood Estimation

# MLE is the model we will use
lm = MLE(n)

In [None]:
# currently the vocab length is 0: it has no prior knowledge
len(lm.vocab)

0

In [None]:
# fit the model
# training data is the bigrams and unigrams
# the vocab is all the sentence tokens in the corpus

lm.fit(train_data, padded_sents)
len(lm.vocab)

6953

In [None]:
# inspect the model's vocabulary.
# be sure that a sentence you know exists (from tokenized_text) is in the vocabulary
print(lm.vocab.lookup(tokenized_text[0]))

('then', 'wear', 'the', 'gold', 'hat', ',', 'if', 'that', 'will', 'move', 'her', ';', 'if', 'you', 'can', 'bounce', 'high', ',', 'bounce', 'for', 'her', 'too', ',', 'till', 'she', 'cry', '“', 'lover', ',', 'gold-hatted', ',', 'high-bouncing', 'lover', ',', 'i', 'must', 'have', 'you', '!', '”', 'thomas', 'parke', 'd', '’', 'invilliers', 'i', 'in', 'my', 'younger', 'and', 'more', 'vulnerable', 'years', 'my', 'father', 'gave', 'me', 'some', 'advice', 'that', 'i', '’', 've', 'been', 'turning', 'over', 'in', 'my', 'mind', 'ever', 'since', '.')


In [None]:
# see what happens when we include a word that is not in the vocab.
print(lm.vocab.lookup('then wear the gold hat iphone .'.split()))

('then', 'wear', 'the', 'gold', 'hat', '<UNK>', '.')


What did the model replace 'iphone' with?
**`<UNK>`**

Given that it didn't just return an "out of vocab" error, what does that mean about our model?
**Our model handles unknown words by mapping them to the `<UNK>` placeholder token.**

In [None]:
# how many times does daisy appear in the model?
print(lm.counts['daisy'])

# what is the probability of daisy appearing?
# this is technically the relative frequency of daisy appearing
lm.score('daisy')

183


0.0026549057726065954

**Adapted from ChatGPT:** In NLTK (Natural Language Toolkit), **lm.counts** is an attribute of a model from the language modeling module. It stores how many times each n-gram (sequence of n words) occurred in the training data: indexing it with a word returns that word's unigram count, and indexing it with a context list and then a word returns how often that word followed the context.

The **lm.score()** method returns the probability of a word, optionally given a list of preceding context words. For the MLE model this is a relative frequency, not a log-likelihood; the separate **lm.logscore()** method returns the base-2 logarithm of the same value.
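
A small sketch of these calls on a toy corpus, so the numbers can be checked by hand. The corpus is made up for illustration. In the `nltk.lm` API, `score` returns a probability and `logscore` returns its base-2 logarithm.

```python
import math

from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

toy_sentences = [["a", "b", "a", "c"], ["a", "b"]]
toy_train, toy_vocab = padded_everygram_pipeline(2, toy_sentences)

toy_lm = MLE(2)
toy_lm.fit(toy_train, toy_vocab)

count_a = toy_lm.counts["a"]            # "a" occurs 3 times
count_ab = toy_lm.counts[["a"]]["b"]    # "b" follows "a" 2 times
p_b_given_a = toy_lm.score("b", ["a"])  # relative frequency: 2 of the 3 words after "a"
log_p_b_given_a = toy_lm.logscore("b", ["a"])  # base-2 log of that probability

print(count_a, count_ab, p_b_given_a, log_p_b_given_a)
```

Here "a" is followed by "b" twice and by "c" once, so the probability of "b" given "a" is 2/3.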

In [None]:
# how often does (daisy, and) occur and what is the relative frequency?
print(lm.counts[['daisy']]['and'])
lm.score('and', 'daisy'.split())

14


0.07650273224043716

In [None]:
# what is the score of 'UNK'?

lm.score("<UNK>")

0.0

Does the relative frequency of `<UNK>` change your assumption about how the model behaves?
**The score is 0 because the model was trained on the same text it is scoring, with the default vocabulary cutoff, so no training token was ever mapped to `<UNK>`.**

How should we change our model to account for the fact that `<UNK>` words are not accounted for by the model?
**Expand or change the training data corpus.**

Note: *Programmatically implementing this solution is beyond the scope of this course.*
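
For the curious, here is one possible direction, offered as a hedged sketch rather than the course's solution. It assumes two options that `nltk.lm` provides: a `Vocabulary` with a higher `unk_cutoff`, which treats rare training words as `<UNK>`, and the `Laplace` model, which adds one to every count. The toy corpus is made up for illustration.

```python
from nltk.lm import MLE, Laplace, Vocabulary
from nltk.lm.preprocessing import padded_everygram_pipeline

toy_sentences = [["the", "gold", "hat"], ["the", "gold", "ring"], ["the", "hat"]]

# Option 1: words seen fewer than 2 times (here, "ring") are mapped to <UNK> in training.
cutoff_train, cutoff_padded = padded_everygram_pipeline(2, toy_sentences)
cutoff_vocab = Vocabulary(cutoff_padded, unk_cutoff=2)
cutoff_lm = MLE(2, vocabulary=cutoff_vocab)
cutoff_lm.fit(cutoff_train)

# Option 2: add-one (Laplace) smoothing gives every vocabulary item a nonzero score.
laplace_train, laplace_padded = padded_everygram_pipeline(2, toy_sentences)
laplace_lm = Laplace(2)
laplace_lm.fit(laplace_train, laplace_padded)

cutoff_unk_score = cutoff_lm.score("<UNK>")
laplace_unk_score = laplace_lm.score("<UNK>")
print(cutoff_unk_score, laplace_unk_score)
```

In both variants `<UNK>` receives a nonzero score, so an unseen word such as 'iphone' no longer has probability 0 as a unigram.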

## Generate text
We want to start our sentence with a word, and use that to predict all the words that come after that. We'll specify how long it should be.

Generation involves randomness: each next word is sampled from the model's probability distribution for the current context rather than always being the single most likely word. Without sampling, the output would be entirely deterministic and would repeat the most likely continuation every time. Setting a random seed fixes the sequence of random draws, so the same call returns the same result every time.
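
The difference between always taking the most likely word and sampling with a seed can be sketched with the standard library alone. The probabilities below are hypothetical and `sample_next` is an illustrative helper, not NLTK's implementation.

```python
import random

# Hypothetical distribution over the next word after some context.
next_word_probs = {"and": 0.5, "was": 0.3, "said": 0.2}


def sample_next(probs, rng):
    """Draw one word at random, weighted by its probability."""
    words = sorted(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


# Always taking the most likely word gives the same answer every time.
greedy_choice = max(next_word_probs, key=next_word_probs.get)

# Two generators with the same seed produce the same sequence of draws.
rng_a = random.Random(42)
rng_b = random.Random(42)
run_a = [sample_next(next_word_probs, rng_a) for _ in range(5)]
run_b = [sample_next(next_word_probs, rng_b) for _ in range(5)]

print(greedy_choice)
print(run_a)
print(run_b)
```

Changing the seed changes the draws, which is why the later cells get different sentences from `random_seed=40` and `random_seed=50`.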

In [None]:
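# generate 20 tokens seeded with 'daisy'
# note: text_seed expects a list of tokens, e.g. ['daisy']; a bare string is split
# into characters, so only its last character ('y') is used as the bigram context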

print(lm.generate(20, text_seed= 'daisy', random_seed=42))

['other', ',', 'but', 'he', 'turned', 'to', 'their', 'cars', 'blocking', 'the', 'dull', 'light', ',', 'and', 'separated', 'only', 'building', 'in', 'the', 'adventitious']


This next code block cleans up the generated tokens to make them easier on human eyes. It uses a detokenizer, which joins the tokens back into a single string and reattaches punctuation to the neighboring words.

In [None]:
from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenize = TreebankWordDetokenizer().detokenize

def generate_sent(lm, num_words, text_seed, random_seed=42):
    """
    :param lm: An ngram language model from `nltk.lm`.
    :param num_words: Max no. of words to generate.
    :param text_seed: Seed tokens that provide the starting context.
    :param random_seed: Seed value for random.
    """
    content = []
    for token in lm.generate(num_words, text_seed=text_seed, random_seed=random_seed):
        if token == '<s>':
            continue
        if token == '</s>':
            break
        content.append(token)
    return detokenize(content)

**From ChatGPT:** The **TreebankWordDetokenizer** is used for the reverse process of tokenization, which is called detokenization. Detokenization is the process of taking a list of tokens (words or subword units) and reconstructing the original sentence or text from them.



In [None]:
# Now generate sentences that look much nicer.
generate_sent(lm, 20, text_seed='daisy', random_seed = 42)

'other, but he turned to their cars blocking the dull light, and separated only building in the adventitious'

Try a few more sentences, and try out another text. Once you are satisfied with what ngrams can (and cannot) do - post your code to your Github or another site.

In [None]:
generate_sent(lm, 20, text_seed='Carraways', random_seed = 42)
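# the string seed is split into characters, so the context is its last character, 's', not 'Carraways'
# generation stops early at the end-of-sentence token; the phone-number fragment suggests
# the Project Gutenberg license text at the end of the file is still in the corpus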

'names) 596-1887.'

In [None]:
generate_sent(lm, 20, text_seed='dust', random_seed = 40)
# 'll' is its own token: word_tokenize splits a contraction like "it’ll" at the curly apostrophe into 'it', '’', 'll'

'know where? ” “ she did. ” “ if all this agreement by that it ’ ll make a'

In [None]:
generate_sent(lm, 20, text_seed='abortive', random_seed = 50)


'in england who gives large uncertain dancing individually or a cricket bat in half a word or not located in'

In [None]:
generate_sent(lm, 20, text_seed='sorrow', random_seed = 50)
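# same sentence from a different seed word, although "abortive sorrow" is a pair in the source text (end of fourth paragraph in Chapter 1)
# probable cause: a string seed is split into characters, so the contexts are 'e' and 'w';
# if neither is a token in the vocabulary, generation backs off to the unigram distribution
# and, with the same random_seed, samples the same words. Passing ['abortive'] or ['sorrow'] uses the word itself.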

'in england who gives large uncertain dancing individually or a cricket bat in half a word or not located in'
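
A small sketch of how `text_seed` behaves, on a made-up two-sentence corpus. It assumes the behavior of `nltk.lm`'s `generate`, which converts the seed to a list: a list of tokens is used as the context, while a bare string becomes a list of single characters.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

toy_sentences = [["daisy", "smiled"], ["tom", "frowned"]]
toy_train, toy_vocab = padded_everygram_pipeline(2, toy_sentences)

toy_lm = MLE(2)
toy_lm.fit(toy_train, toy_vocab)

# A list seed: the context is the token "daisy", which is only ever followed by "smiled".
from_list = toy_lm.generate(1, text_seed=["daisy"], random_seed=0)

# A string seed becomes single characters. The last character ('y' or 'd') is not a
# token in this corpus, so both calls back off to the unigram distribution.
from_string_a = toy_lm.generate(3, text_seed="daisy", random_seed=0)
from_string_b = toy_lm.generate(3, text_seed="smiled", random_seed=0)

print(from_list)
print(from_string_a)
print(from_string_b)
```

With the same `random_seed`, the two string-seeded calls return identical tokens, which matches the 'abortive' and 'sorrow' results above.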