# Tokenization Lab 
September 17 - Exploring different tokenization methods

In [12]:
import nltk
import timeit

In [6]:
%pip install datasets

In [7]:
from datasets import load_dataset

In [8]:
# Decide what year you want between 1810 and 1963

my_year = "1960"

# Decide how many articles you want to work with
num_articles = 10

#  Download data for your choice of year (1810 to 1963)
dataset = load_dataset("dell-research-harvard/AmericanStories",
    "subset_years",
    year_list=[my_year]
)

# Get the first n articles from that year
# instantiate the counter
i=0
# instantiate the string
my_articles = ''
# loop through each article for that year
for article in dataset[my_year]:
    #the article is a dictionary, 
    #we're getting the text of the article by accessing the key, "article"
    my_articles += article.get('article')
    #add one to our counter
    i+=1
    #if the counter is greater than num_articles-1, stop looping
    if i>(num_articles-1): break
    
#validate that it is what we expect by checking the first 1000 characters
print(my_articles[:1000])

SAN FRANCISCO. Nov. 10
(AP).-Alvin Dark made his
first decisions yesterday as
manager Of the San Francisco
Giants. He hired two former
teammates as coaches.


Dark was signed last week,
Yesterday he selected Larry
Jansen and Whitley Lockman
and retained Yves Westrum and
Salty Parker for his coaching
staff. Bill Posedel was re-
leased to make way for Jansen
as boss Of the bullpen.


Dark. Jansen and Lockman-
stars when they played for the
Giants have a lot in common
They have regulations al
gentlemen, quiet craftsmen whc
let their feats on the field speal
for them.


Dark hit 1922 in his rookie
season with the Boston Brave.
and was named rookie of thu
year l948 by the major league
baseball writer's.


AS lean. smiling youngstel
of 18, Lockman stepped intC
Mel Otis No. 3 batting spot IL
midsummer Of 1945 anchead last Saturday when Mon
treal lost in q cup playoff with
out throwing a pass In the last
half. Moss said it was because
of Etcheverry's sore arm. The
player sai
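
The counting loop in the cell above can also be written with `itertools.islice`, which stops after a fixed number of records without a manual counter. This is only a sketch: `first_n_texts` is a hypothetical helper name, and joining with a space (instead of concatenating directly, as the original loop does) is an assumption made so that the last word of one article does not fuse with the first word of the next.

```python
from itertools import islice

def first_n_texts(records, n, key="article", sep=" "):
    """Join the `key` field of the first `n` records into one string.

    `records` is any iterable of dict-like rows.
    Rows whose field is missing or empty are skipped instead of raising.
    """
    texts = (row.get(key) or "" for row in islice(records, n))
    return sep.join(t for t in texts if t)

# Tiny stand-in rows so the sketch runs without downloading anything
sample_rows = [
    {"article": "First story."},
    {"article": "Second story."},
    {"article": "Third."},
]
print(first_n_texts(sample_rows, 2))
```

With the dataset loaded as above, the equivalent call would be `first_n_texts(dataset[my_year], num_articles)`.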

In [9]:
#remove new line and other formatting characters
#("\d" is not a Python string escape, so it was dropped from this list)
for char in ["\n", "\r", "\t"]:
    my_articles = my_articles.replace(char, " ")
my_articles[:1000]

"SAN FRANCISCO. Nov. 10 (AP).-Alvin Dark made his first decisions yesterday as manager Of the San Francisco Giants. He hired two former teammates as coaches.   Dark was signed last week, Yesterday he selected Larry Jansen and Whitley Lockman and retained Yves Westrum and Salty Parker for his coaching staff. Bill Posedel was re- leased to make way for Jansen as boss Of the bullpen.   Dark. Jansen and Lockman- stars when they played for the Giants have a lot in common They have regulations al gentlemen, quiet craftsmen whc let their feats on the field speal for them.   Dark hit 1922 in his rookie season with the Boston Brave. and was named rookie of thu year l948 by the major league baseball writer's.   AS lean. smiling youngstel of 18, Lockman stepped intC Mel Otis No. 3 batting spot IL midsummer Of 1945 anchead last Saturday when Mon treal lost in q cup playoff with out throwing a pass In the last half. Moss said it was because of Etcheverry's sore arm. The player said his arm was SOUN
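
Replacing each newline with a space leaves runs of spaces where blank lines were, and words that were hyphenated across a line break stay split ("re- leased"). A possible alternative, sketched below, is to clean the raw text with regular expressions before any newlines are replaced. `normalize_whitespace` and its `dehyphenate` flag are hypothetical names, and the dehyphenation rule is an assumption: it cannot tell a hyphenated word from a dash at the end of a line (as in "Lockman-" followed by "stars"), so it is off by default.

```python
import re

def normalize_whitespace(text, dehyphenate=False):
    """Collapse whitespace runs to single spaces, optionally rejoining split words."""
    if dehyphenate:
        # Join a word split across a line break: "re-\nleased" -> "released"
        text = re.sub(r"(?<=[a-z])-[ \t]*\r?\n[ \t]*(?=[a-z])", "", text)
    # Collapse any run of spaces, tabs, and newlines into one space
    return re.sub(r"\s+", " ", text).strip()

raw = "Bill Posedel was re-\nleased to make way\n\n\nfor Jansen."
print(normalize_whitespace(raw))
print(normalize_whitespace(raw, dehyphenate=True))
```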

# Whitespace Tokenization

In [10]:
%%time
#this is a magic function to determine how long a cell takes to run. 
#It MUST be the first thing in a cell

#split the whole string on spaces. This returns a list
whitespace_tokens = my_articles.split(' ')

#check the list
whitespace_tokens[:20]

CPU times: total: 0 ns
Wall time: 996 µs


['SAN',
 'FRANCISCO.',
 'Nov.',
 '10',
 '(AP).-Alvin',
 'Dark',
 'made',
 'his',
 'first',
 'decisions',
 'yesterday',
 'as',
 'manager',
 'Of',
 'the',
 'San',
 'Francisco',
 'Giants.',
 'He',
 'hired']
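
Two things are worth noticing about `split(' ')`: consecutive spaces produce empty-string tokens, and punctuation stays attached to words ('FRANCISCO.', 'Giants.'). The sketch below compares it with `split()` (no argument) and with a simple regular expression that separates punctuation. The sample string and the regex pattern are illustrative choices, not part of the lab.

```python
import re

sample = "teammates as coaches.   Dark was signed last week, Yesterday"

# Splitting on a single space keeps empty strings between consecutive spaces
print(sample.split(" "))

# Splitting with no argument treats any run of whitespace as one separator
print(sample.split())

# Word characters stay together; each punctuation mark becomes its own token
print(re.findall(r"\w+|[^\w\s]", sample))
```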

# Morphological Tokenization

In [13]:
#This lemmatizer uses WordNet's built-in Morphy function
from nltk.stem import WordNetLemmatizer
 
#Download the WordNet data the lemmatizer needs (comment these two lines out once you have it)
nltk.download('wordnet')
nltk.download('omw-1.4')
wn_lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ismer\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\ismer\AppData\Roaming\nltk_data...


In [14]:
%%time

#first we have to split the string on spaces to get "words"
whitespace_tokens = my_articles.split(' ')

my_lemmas = []
for word in whitespace_tokens:
    w = wn_lemmatizer.lemmatize(word)
    my_lemmas.append(w)
my_lemmas[:20]

CPU times: total: 1.16 s
Wall time: 1.24 s


['SAN',
 'FRANCISCO.',
 'Nov.',
 '10',
 '(AP).-Alvin',
 'Dark',
 'made',
 'his',
 'first',
 'decision',
 'yesterday',
 'a',
 'manager',
 'Of',
 'the',
 'San',
 'Francisco',
 'Giants.',
 'He',
 'hired']

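Interestingly, lemmatizing took about 1.2 s of wall time, compared with under a millisecond (996 µs) for whitespace splitting alone. Part of that is probably one-time overhead, since NLTK loads WordNet lazily on first use. Note also that `lemmatize` treats every word as a noun unless a `pos` argument is given, which is why 'decisions' became 'decision' and 'as' became 'a' while 'made' and 'hired' were left unchanged.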
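
`%%time` measures a single run of a cell, so any one-off cost is included in the number. The `timeit` module, imported at the top of this notebook, repeats a call many times and makes it easier to compare steady-state speed. A minimal sketch follows; `best_time` is a hypothetical helper, and the repeated sample sentence stands in for `my_articles`.

```python
import timeit

def best_time(func, arg, repeat=5, number=10):
    """Return the best per-call time in seconds over `repeat` runs of `number` calls."""
    timer = timeit.Timer(lambda: func(arg))
    return min(timer.repeat(repeat=repeat, number=number)) / number

# Stand-in text; in the notebook this would be my_articles
text = "He hired two former teammates as coaches. " * 1000
print(f"str.split: {best_time(str.split, text):.6f} s per call")
```

Passing a function that lemmatizes every token in place of `str.split` should give a figure that can be compared directly with the whitespace tokenizer.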

# Byte Pair Encoding (BPE)

In [16]:
%pip install bpe
from bpe import Encoder

In [17]:
%%time
whitespace_tokens = my_articles.split(' ')

# create the BPE Encoder
# vocab size is 100; pct_bpe=0.95 reserves about 95% of the vocabulary
# for byte-pair (subword) tokens and the rest for whole-word tokens
# text the learned vocabulary can't cover comes out as __unk
encoder = Encoder(100, pct_bpe=0.95)
encoder.fit(whitespace_tokens)

CPU times: total: 0 ns
Wall time: 22.6 ms


In [18]:
#print(encoder.tokenize(my_articles))

print(next(encoder.inverse_transform(encoder.transform([my_articles]))))

san francisco . nov . 1__unk __unk ap __unk__unk- alvin dark made his first decisions yesterday as manager of the san francisco giants . he hired two former teammates as coaches . dark was signed last week , yesterday he selected larry __unkansen and whitley lockman and retained yves westrum and salty parker for his coaching staff . bill posedel was re - leased to make way for __unkansen as boss of the bullpen . dark . __unkansen and lockman - stars when they played for the giants have a lot in common they have regulations al gentlemen , __unkuiet craftsmen whc let their feats on the field speal for them . dark hit 1__unk__unk__unk in his rookie season with the boston brave . and was named rookie of thu year l__unk__unk__unk by the ma__unkor league baseball writer ' s . as lean . smiling youngstel of 1__unk , lockman stepped intc mel otis no . __unk batting spot il midsummer of 1__unk__unk__unk anchead last saturday when mon treal lost in __unk cup playoff with out throwing a pass in t
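
To make the idea behind BPE more concrete, here is a minimal sketch of how merges are learned: start from single characters and repeatedly fuse the most frequent adjacent pair into a new symbol. This is the textbook procedure written from scratch, not the implementation inside the `bpe` package, and `learn_bpe_merges` and the toy word list are hypothetical. With a vocabulary as small as the 100 tokens used above, few merges are learned, which is consistent with the many `__unk` pieces in the output.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merges by fusing the most frequent adjacent symbol pair."""
    # Each distinct word starts as a tuple of single characters, weighted by its count
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across all words
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

toy_words = ["low", "low", "lower", "lowest", "newest", "newest"]
merges, vocab = learn_bpe_merges(toy_words, 3)
print(merges)
print(vocab)
```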