## **Tokenization Lab**
LLMs and ChatGPT | Fall 2023 | McSweeney | CUNY Graduate Center

**Public link to this Google Colab Notebook:
https://colab.research.google.com/drive/1YXfrKuSNtG1HuWTiQ_-Qh87ru276Bwuu**

**Matthew Stanton** | pingstanton@gmail.com | mstanton@gradcenter.cuny.edu | [Lab List on CUNY Academic Commons](https://pingstanton.commons.gc.cuny.edu/2023/09/21/labs-for-data-78000-large-language-models-and-chat-gpt/) | [Lab List on GitHub](https://github.com/pingstanton/DATA-78000-Large-Language-Models-and-Chat-GPT)

**Due:** October 8, 2023

### Background
The purpose of this lab is to explore different tokenization methods. On their own, tokenization methods don't do much. However, they are the starting place for all natural language processing.

#### Notes
This is a short lab using the same dataset throughout. Feel free to switch it up, but once you are comfortable with how the different algorithms approach the task of breaking up text, move on.

You will be using the `datasets` package. You can [install the package](https://pypi.org/project/datasets/) with `$ pip install datasets`. If you do not have `pip` or `conda` installed on your machine, please install one of them now.

In [1]:
# Adding !pip install nltk datasets because Google Colab
!pip install nltk datasets

import nltk
nltk.download('punkt')
import timeit



Collecting datasets
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Collecting huggingface-hub<1.0.0,>=0.14.0 (from datasets)
  Downloading huggingface_hub-0.17.3-py3-none-a

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
from datasets import load_dataset

The next cell is just downloading the dataset. You need to be connected to the internet for this to work.

This dataset is hosted by [Hugging Face](https://huggingface.co). Hugging Face hosts machine learning models, datasets, and more. We will reference them again. It's a great place to find corpora.


The dataset is called [American Stories](https://huggingface.co/datasets/dell-research-harvard/AmericanStories). Please skim the Dataset Card. All models and datasets on the Hugging Face hub have these associated cards.

In [3]:
# Decide what year you want between 1810 and 1963

my_year = "1960"

# Decide how many articles you want to work with (keep this small - it's slow)
num_articles = 10

#  Download data for your choice of year (1810 to 1963)
dataset = load_dataset("dell-research-harvard/AmericanStories",
    "subset_years",
    year_list=[my_year]
)

# Get the first n articles from that year
# instantiate the counter
i=0
# instantiate the string
my_articles = ''
# loop through each article for that year
for article in dataset[my_year]:
    #the article is a dictionary,
    #we're getting the text of the article by accessing the key, "article"
    my_articles += article.get('article')
    #add one to our counter
    i+=1
    #if the counter is greater than num_articles-1, stop looping
    if i>(num_articles-1): break

#validate that it is what we expect by checking the first 1000 characters
print(my_articles[:1000])


Downloading builder script:   0%|          | 0.00/8.91k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/8.02k [00:00<?, ?B/s]

Only taking a subset of years. Change name to 'all_years' to use all years in the dataset.
{'1960': 'https://huggingface.co/datasets/dell-research-harvard/AmericanStories/resolve/main/faro_1960.tar.gz'}


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/256M [00:00<?, ?B/s]

Generating 1960 split: 0 examples [00:00, ? examples/s]

Loading associated
SAN FRANCISCO. Nov. 10
(AP).-Alvin Dark made his
first decisions yesterday as
manager Of the San Francisco
Giants. He hired two former
teammates as coaches.


Dark was signed last week,
Yesterday he selected Larry
Jansen and Whitley Lockman
and retained Yves Westrum and
Salty Parker for his coaching
staff. Bill Posedel was re-
leased to make way for Jansen
as boss Of the bullpen.


Dark. Jansen and Lockman-
stars when they played for the
Giants have a lot in common
They have regulations al
gentlemen, quiet craftsmen whc
let their feats on the field speal
for them.


Dark hit 1922 in his rookie
season with the Boston Brave.
and was named rookie of thu
year l948 by the major league
baseball writer's.


AS lean. smiling youngstel
of 18, Lockman stepped intC
Mel Otis No. 3 batting spot IL
midsummer Of 1945 anchead last Saturday when Mon
treal lost in q cup playoff with
out throwing a pass In the last
half. Moss said it was because
of Etcheverry's sore arm. The
player sai

This section cleans up formatting. It replaces newlines, carriage returns, and tabs with spaces, which removes almost all of the layout markup in these articles. These are a fairly standard set of whitespace control characters.
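The cell below swaps each control character for a single space, which leaves runs of spaces wherever the source had blank lines. If that matters for your analysis, one option is to collapse every run of whitespace with a regular expression. This is a sketch rather than a required step of the lab; `collapse_whitespace` and `sample` are hypothetical names used only for illustration.

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample shaped like the newspaper text in this lab
sample = "teammates as coaches.\n\n\nDark was signed last week,\nYesterday he selected"
cleaned = collapse_whitespace(sample)
print(cleaned)
```

The trade-off is that paragraph breaks disappear completely, so keep the original string around if you need them later.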

In [5]:
#replace newline, carriage return, and tab characters with spaces
for char in ["\n", "\r", "\t"]:
    my_articles = my_articles.replace(char, " ")
my_articles[:1000]

"SAN FRANCISCO. Nov. 10 (AP).-Alvin Dark made his first decisions yesterday as manager Of the San Francisco Giants. He hired two former teammates as coaches.   Dark was signed last week, Yesterday he selected Larry Jansen and Whitley Lockman and retained Yves Westrum and Salty Parker for his coaching staff. Bill Posedel was re- leased to make way for Jansen as boss Of the bullpen.   Dark. Jansen and Lockman- stars when they played for the Giants have a lot in common They have regulations al gentlemen, quiet craftsmen whc let their feats on the field speal for them.   Dark hit 1922 in his rookie season with the Boston Brave. and was named rookie of thu year l948 by the major league baseball writer's.   AS lean. smiling youngstel of 18, Lockman stepped intC Mel Otis No. 3 batting spot IL midsummer Of 1945 anchead last Saturday when Mon treal lost in q cup playoff with out throwing a pass In the last half. Moss said it was because of Etcheverry's sore arm. The player said his arm was SOUN

# Whitespace tokenization


First we'll just break up the words using whitespace. As noted in class, this is a really common first pass.

In [4]:
%%time
#%%time is a cell magic that reports how long the cell takes to run.
#It MUST be the first thing in a cell

#split the whole string on spaces. This returns a list
whitespace_tokens = my_articles.split(' ')

#check the list
whitespace_tokens[:20]

CPU times: user 155 µs, sys: 4 µs, total: 159 µs
Wall time: 163 µs


['SAN',
 'FRANCISCO.',
 'Nov.',
 '10\n(AP).-Alvin',
 'Dark',
 'made',
 'his\nfirst',
 'decisions',
 'yesterday',
 'as\nmanager',
 'Of',
 'the',
 'San',
 'Francisco\nGiants.',
 'He',
 'hired',
 'two',
 'former\nteammates',
 'as',
 'coaches.\n\n\nDark']

Note: "µs" is microseconds. One microsecond is a millionth of a second (1/1,000,000 s).
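The tokens above still contain `\n` because `split(' ')` breaks only on the space character. As a small illustration (the `sample` string is made up for this example), calling `split()` with no argument breaks on any run of whitespace and never returns empty strings:

```python
# Made-up sample with an embedded newline and a doubled space
sample = "made his\nfirst decisions  yesterday"

# Splitting on a single space keeps the newline inside a token
# and yields an empty string for the doubled space
on_space = sample.split(" ")
print(on_space)

# Splitting with no argument treats any run of whitespace as one separator
on_any_whitespace = sample.split()
print(on_any_whitespace)
```

Which behavior you want depends on whether line breaks carry meaning in your corpus; for these scanned newspaper columns they mostly do not.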

# Morphological Tokenization

Lemmatizing first splits the text into "words" and then uses morphological knowledge of the language (in this case, English) to reduce each word to its base form, or lemma. For example, `decisions` becomes `decision`.

Princeton maintains the [morphy project](https://wordnet.princeton.edu/documentation/morphy7wn#:~:text=Morphology%20in%20WordNet%20uses%20two,word%20that%20is%20in%20WordNet.), which powers `nltk`'s [WordNet Lemmatizer](https://www.nltk.org/api/nltk.stem.wordnet.html). You do NOT need to read this entire documentation, just acknowledge that it requires a significant amount of knowledge about English in order to make it work.
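To give a feel for the kind of language knowledge involved, the sketch below imitates Morphy's general approach for nouns: try a list of suffix rewrites (the documentation calls them "rules of detachment") and accept a candidate only if it is a known word. It is a toy, not NLTK's implementation. The real lemmatizer also consults exception lists and the full WordNet database, and `NOUN_RULES`, `TOY_LEXICON`, and `toy_lemmatize` are hypothetical names with a hand-picked word list.

```python
# A subset of noun suffix rewrites modeled on the Morphy documentation,
# ordered longest-first here so that "ches" is tried before "s"
NOUN_RULES = [
    ("ches", "ch"),
    ("shes", "sh"),
    ("ses", "s"),
    ("xes", "x"),
    ("zes", "z"),
    ("men", "man"),
    ("ies", "y"),
    ("s", ""),
]

# Stand-in for WordNet: a tiny hand-picked set of known base forms (assumption)
TOY_LEXICON = {"decision", "coach", "story", "a"}

def toy_lemmatize(word, lexicon=TOY_LEXICON):
    """Return the first rule-generated candidate found in the lexicon, else the word."""
    for suffix, replacement in NOUN_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in lexicon:
                return candidate
    return word

for w in ["decisions", "coaches", "as", "Giants.", "hired"]:
    print(w, "->", toy_lemmatize(w))
```

Even this toy shows two typical failure modes: a function word like `as` can be mistaken for a plural, and attached punctuation prevents any rule from matching.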

In [5]:
#This lemmatizer is based on the Morphy project above
from nltk.stem import WordNetLemmatizer

#Download the WordNet data the lemmatizer needs (only required the first time you run this)
nltk.download('wordnet')
nltk.download('omw-1.4')
wn_lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [13]:
%%time

#first we have to split the string on spaces to get "words"
whitespace_tokens = my_articles.split(' ')

my_lemmas = []
for word in whitespace_tokens:
    w = wn_lemmatizer.lemmatize(word)
    my_lemmas.append(w)
my_lemmas[:20]

CPU times: user 1.68 s, sys: 73.7 ms, total: 1.76 s
Wall time: 1.81 s


['SAN',
 'FRANCISCO.',
 'Nov.',
 '10',
 '(AP).-Alvin',
 'Dark',
 'made',
 'his',
 'first',
 'decision',
 'yesterday',
 'a',
 'manager',
 'Of',
 'the',
 'San',
 'Francisco',
 'Giants.',
 'He',
 'hired']

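Notice how much longer the morphological rules take than the whitespace split. Also check whether the output is what you expected; sometimes it isn't. For example, `as` came back as `a`, while `hired` and `Giants.` were left unchanged, because the lemmatizer treats every word as a noun unless told otherwise and the attached period keeps `Giants.` from matching any known word.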

Note: "ms" is milliseconds. One millisecond is a thousandth of a second (1/1,000 s).
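To put the two timings side by side, the wall times reported above (163 µs for the whitespace split and 1.81 s for lemmatizing) can be converted to the same unit and divided. These are single runs in one Colab session, so treat the ratio as a rough order of magnitude rather than a benchmark.

```python
# Wall times copied from the two %%time outputs above
whitespace_seconds = 163e-6   # 163 µs expressed in seconds
lemmatize_seconds = 1.81      # already in seconds

ratio = lemmatize_seconds / whitespace_seconds
print(f"Lemmatizing took roughly {ratio:,.0f} times longer")
```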

# Byte Pair Encoding

There are two implementations of BPE here. The first [uses a package (bpe)](https://github.com/soaxelbrooke/python-bpe) that you will have to install using `pip` (see above).

This package implements the algorithm we covered in class, which you can review in this [Hugging Face video](https://youtu.be/HEikzVL-lZU).
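As a refresher on the algorithm itself, here is a minimal sketch of the merge-learning loop: count adjacent symbol pairs across the corpus, merge the most frequent pair into a new symbol, and repeat. It is an illustration only and does not reflect how the `bpe` package is implemented internally. The function name `learn_bpe_merges` and the toy word counts are assumptions made for this example.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges BPE merges from a list of words."""
    # Start with every word split into single characters
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count each adjacent symbol pair, weighted by word frequency
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# Toy corpus: a handful of short words with made-up frequencies
corpus = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
merges, vocab = learn_bpe_merges(corpus, 3)
print(merges)
print(dict(vocab))
```

Each learned merge becomes one vocabulary entry, which is why the vocabulary size you pass to an encoder effectively limits how many merges it can learn.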

In [8]:
# adding !pip install bpe because Google Colab
# adding !pip install subword-nmt because Google Colab

!pip install bpe
# !pip install --upgrade subword-nmt


Collecting bpe
  Downloading bpe-1.0-py3-none-any.whl (6.8 kB)
Collecting hypothesis (from bpe)
  Downloading hypothesis-6.87.3-py3-none-any.whl (420 kB)
Collecting mypy (from bpe)
  Downloading mypy-1.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
Collecting mypy-extensions>=1.0.0 (from mypy->bpe)
  Downloading mypy_extensions-1.0.0-py3-none-any.whl (4.7 kB)
Installing collected packages: mypy-extensions, hypothesis, mypy, bpe
Successfully installed bpe-1.0 hypothesis-6.87.3 mypy-1.5.1 mypy-extensions-1.0.0


In [13]:
from bpe import Encoder




In [None]:
%%time
whitespace_tokens = my_articles.split(' ')

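# create the Encoder
# the first argument caps the vocabulary at 100 tokens; pct_bpe=0.95 reserves
# about 95% of those slots for byte-pair (subword) tokens and the rest for whole words
# anything the vocabulary cannot cover is mapped to the unknown token (__unk)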
encoder = Encoder(100, pct_bpe=0.95)
encoder.fit(whitespace_tokens)

In [12]:
#print(encoder.tokenize(my_articles))

print(next(encoder.inverse_transform(encoder.transform([my_articles]))))

san francisco . nov . 1__unk __unk ap __unk__unk- alvin dark made his first decisions yesterday as manager of the san francisco giants . he hired two former teammates as coaches . dark was signed last week , yesterday he selected larry __unkansen and whitley lockman and retained yves westrum and salty parker for his coaching staff . bill posedel was re - leased to make way for __unkansen as boss of the bullpen . dark . __unkansen and lockman - stars when they played for the giants have a lot in common they have regulations al gentlemen , __unkuiet craftsmen whc let their feats on the field speal for them . dark hit 1__unk__unk__unk in his rookie season with the boston brave . and was named rookie of thu year l__unk__unk__unk by the ma__unkor league baseball writer ' s . as lean . smiling youngstel of 1__unk , lockman stepped intc mel otis no . __unk batting spot il midsummer of 1__unk__unk__unk anchead last saturday when mon treal lost in __unk cup playoff with out throwing a pass in t