Download the BookCorpus dataset. Take every 7th sample (indices that are multiples of 7: [0, 7, 14, 21, ...]) from the entire dataset. This yields a dataset of about 10 million samples (exactly 10,572,033). Use these samples to build tokenizers with the BPE tokenization algorithm, varying the vocabulary size.

Normalizer: LowerCase

PreTokenizer: WhiteSpace

Model: BPE

Special tokens: [GO],[UNK],[PAD],[EOS]

PostProcessing: None

Tokenize the input text: “SEBI study finds 93% of individual F&O traders made losses between FY22 and FY24.” using the following configurations.
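
Before any BPE merges are applied, the normalizer and pre-tokenizer already fix how the input is split into words. The sketch below imitates those two steps in plain Python. It assumes the `Whitespace` pre-tokenizer behaves like the regex `\w+|[^\w\s]+`, and `normalize` and `pre_tokenize` are hypothetical helper names, not functions from the tokenizers library.

```python
import re

TEXT = "SEBI study finds 93% of individual F&O traders made losses between FY22 and FY24."

def normalize(text):
    # Stand-in for the Lowercase normalizer
    return text.lower()

def pre_tokenize(text):
    # Assumed equivalent of the Whitespace pre-tokenizer:
    # runs of word characters, or runs of non-word, non-space characters
    return re.findall(r"\w+|[^\w\s]+", text)

pieces = pre_tokenize(normalize(TEXT))
print(pieces)
print(len(pieces))  # -> 18
```

Under these assumptions the input splits into 18 pieces. Since BPE merges stay inside a pre-tokenized piece, 18 is the smallest token count any vocabulary size can reach with this configuration.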

In [1]:
from datasets import load_dataset
ds = load_dataset("bookcorpus", split="all")
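samples = ds[::7]  # slicing a Dataset returns a dict of columns: {'text': [...]}
len(samples)  # counts the columns (1), not the rows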

The repository for bookcorpus contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/bookcorpus.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y


Generating train split:   0%|          | 0/74004228 [00:00<?, ? examples/s]

1

In [2]:
samples

{'text': ['usually , he would be tearing around the living room , playing with his toys .',
  'mason barely acknowledged her .',
  'mason was already registering off the charts in height and weight according to his pediatrician .',
  'she never wanted anything in the world to hurt him , and she knew that being rejected by his father would .',
  "aidan was her mother 's baby brother and only son of the family .",
  "while it had been no question that she wanted him as godfather for mason , she had been extremely honored when he and his wife , emma , had asked her to be their son , noah 's , godmother .",
  "`` beau ? ''",
  "`` oomph , '' she muttered , as they started up the basement stairs .",
  'she asked .',
  'sean acknowledged her with a two finger salute before cranking up and pulling down the driveway .',
  'at that time , she had her heart set on going to medical school and becoming a doctor .',
  "sadly , she could n't say that her first love was davis , mason 's father .",
  

In [3]:
len(samples['text'])

10572033
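
As a sanity check on this count, the number of indices that are multiples of 7 can be computed without touching the data. The sketch assumes the full train split has 74,004,228 rows, the size reported while the split was generated in this run.

```python
n_total = 74_004_228            # assumed size of the full train split
indices = range(0, n_total, 7)  # 0, 7, 14, 21, ...

print(len(indices))             # -> 10572033
print(list(indices[:4]), indices[-1])
```

The last selected index is 74,004,224, so the slice `ds[::7]` and the stated count of 10,572,033 agree.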

In [4]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import Lowercase

In [5]:
model = BPE(unk_token='[UNK]')
tokenizer = Tokenizer(model)
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = Whitespace()

In [7]:
# Keep the vocabulary size at 5000 and tokenize the input text using the learned vocabulary. Choose the number of tokens returned by the tokenizer.
import time

from tokenizers.trainers import BpeTrainer
text = "SEBI study finds 93% of individual F&O traders made losses between FY22 and FY24."

trainer = BpeTrainer(vocab_size=5000, special_tokens=["[GO]", "[UNK]", "[PAD]", "[EOS]"], continuing_subword_prefix="##")


def get_batch(bs=10_000):
    for i in range(0, len(samples['text']), bs):
        yield samples['text'][i: i + bs]

start = time.time()
tokenizer.train_from_iterator(get_batch(), trainer=trainer, length=len(samples['text']))
stop = time.time()
print(stop - start)




13.508091688156128


In [8]:
enc = tokenizer.encode(text)
len(enc.tokens)

32

In [10]:
# Increase the vocabulary size to 10K, 15K and 32K. For each case, tokenize the same input with the newly learned vocabulary. Choose all the correct statements
t10k = BpeTrainer(vocab_size=10_000, special_tokens=["[GO]", "[UNK]", "[PAD]", "[EOS]"], continuing_subword_prefix="##")
t15k = BpeTrainer(vocab_size=15_000, special_tokens=["[GO]", "[UNK]", "[PAD]", "[EOS]"], continuing_subword_prefix="##")
t32k = BpeTrainer(vocab_size=32_000, special_tokens=["[GO]", "[UNK]", "[PAD]", "[EOS]"], continuing_subword_prefix="##")

In [11]:
tokenizer.train_from_iterator(get_batch(), trainer=t10k, length=len(samples['text']))
enc = tokenizer.encode(text)






In [12]:
len(enc.tokens)

32

In [13]:
tokenizer.train_from_iterator(get_batch(), trainer=t15k, length=len(samples['text']))
enc = tokenizer.encode(text)
len(enc.tokens)






32

In [14]:
tokenizer.train_from_iterator(get_batch(), trainer=t32k, length=len(samples['text']))
enc = tokenizer.encode(text)
len(enc.tokens)






32

In [15]:
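# Retraining the same Tokenizer object returned 32 tokens for every vocabulary size above,
# so rebuild the tokenizer from scratch for each vocabulary size and compare.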

def train(vocab_size):
    model = BPE(unk_token='[UNK]')
    tokenizer = Tokenizer(model)
    tokenizer.normalizer = Lowercase()
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[GO]", "[UNK]", "[PAD]", "[EOS]"], continuing_subword_prefix="##")
    tokenizer.train_from_iterator(get_batch(), trainer=trainer, length=len(samples['text']))
    enc = tokenizer.encode(text)
    return len(enc.tokens)

train(10_000), train(15_000), train(32_000)












(28, 28, 25)
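
The non-increasing counts follow from how BPE works: a larger vocabulary leaves room for more merge rules, and each extra merge can only join adjacent symbols, never split them. The toy trainer below is a minimal pure-Python sketch of that idea, not the tokenizers library's implementation. The corpus, the tie-breaking rule, and the helper names `apply_merge`, `learn_merges`, and `encode` are all illustrative assumptions.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_merges(word_freqs, num_merges):
    """Greedily learn up to `num_merges` merge rules from word frequencies."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties broken by the pair itself (an arbitrary choice)
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = {apply_merge(s, best): f for s, f in corpus.items()}
    return merges

def encode(word, merges):
    """Apply the learned merges, in order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)

# Toy corpus: word -> frequency
word_freqs = Counter({"low": 5, "lower": 2, "newest": 6, "widest": 3})
counts = [len(encode("lowest", learn_merges(word_freqs, n))) for n in (0, 2, 4, 8)]
print(counts)  # token count for "lowest" as the merge budget grows
```

With no merges the word is split into its 6 characters. Each larger merge budget extends the same rule list, so the token count can stay flat (as between 10K and 15K above) or fall, but it cannot rise.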

In [19]:
# Download the pre-trained tokenizer file “hopper.json” used in the lecture.
# The tokenizer was trained on all 70 million samples in the BookCorpus dataset. Tokenize the same input text using this “hopper” tokenizer. How many tokens are there?
# [After finding the answer, take a moment to compare the hopper tokenizer with the previous one]

# from_file is a static constructor, so no placeholder Tokenizer(BPE()) is needed
pt_tokenizer = Tokenizer.from_file('hopper.json')
enc = pt_tokenizer.encode(text)
len(enc.tokens)

25

In [21]:
# Suppose we know that the acronym “FY” will likely appear very frequently in most of the input text (assume the text comes from the financial domain).
# Therefore, we hope that adding it manually to the vocabulary might help. Add the token “FY” to the vocabulary and tokenize the input text. Enter the number of tokens produced.
v = pt_tokenizer.get_vocab()

In [25]:
pt_tokenizer.add_tokens(['FY'])

1

In [26]:
enc = pt_tokenizer.encode(text)
len(enc.tokens)

22

In [27]:
# Load the “bert-base-uncased” and “gpt2” tokenizers (use the AutoTokenizer class from transformers). Which of the following special tokens are used in these tokenizers?
from transformers import AutoTokenizer

In [28]:
bbu = AutoTokenizer.from_pretrained('bert-base-uncased')
gpt2 = AutoTokenizer.from_pretrained('gpt2')

In [29]:
bbu.special_tokens_map

{'unk_token': '[UNK]',
 'sep_token': '[SEP]',
 'pad_token': '[PAD]',
 'cls_token': '[CLS]',
 'mask_token': '[MASK]'}

In [30]:
gpt2.special_tokens_map

{'bos_token': '<|endoftext|>',
 'eos_token': '<|endoftext|>',
 'unk_token': '<|endoftext|>'}

In [70]:
# By now, we have four tokenizers.

# 1. Custom tokenizer (vocab size 32K, trained on 10 million samples)
# 2. bert-base-uncased
# 3. gpt2
# 4. hopper

# Use these four tokenizers to count the number of tokens for the entire “imdb” dataset (drop the “unsupervised” part of the dataset).
# Enter the tokenizers in order such that the size of the dataset (measured in tokens) as returned by the tokenizers is in decreasing order.
# For example, if the first tokenizer yields the smallest number of tokens and the fourth tokenizer yields the largest, you would enter 1234 (without any spaces).
def train(vocab_size):
    model = BPE(unk_token='[UNK]')
    tokenizer = Tokenizer(model)
    tokenizer.normalizer = Lowercase()
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[GO]", "[UNK]", "[PAD]", "[EOS]"], continuing_subword_prefix="##")
    tokenizer.train_from_iterator(get_batch(), trainer=trainer, length=len(samples['text']))
    return tokenizer

tokenizer = train(32_000)
toks = [tokenizer, bbu, gpt2, pt_tokenizer]






In [54]:
from tqdm import tqdm

In [71]:
train_ds, test_ds = load_dataset('imdb', split=['train', 'test'])
n_tokens = []
for i, tok in tqdm(enumerate(toks, start=1)):
    c_tokens = 0
    for sample in train_ds['text']:
        c_tokens += len(tok.encode(sample))
    for sample in test_ds['text']:
        c_tokens += len(tok.encode(sample))
    n_tokens.append((i, c_tokens))

4it [01:26, 21.54s/it]


In [72]:
sorted(n_tokens, key=lambda x: x[1])  # ascending order; read right to left for decreasing order

[(4, 13530397), (3, 14812432), (1, 15352840), (2, 15516058)]
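
Turning these totals into the requested answer is a matter of sorting in decreasing order and joining the tokenizer numbers. The sketch below hard-codes the four totals from the output above; the variable names are illustrative.

```python
# (tokenizer number, total tokens on imdb train + test), copied from the run above
n_tokens = [(4, 13530397), (3, 14812432), (1, 15352840), (2, 15516058)]

# Largest token count first
ranked = sorted(n_tokens, key=lambda pair: pair[1], reverse=True)
answer = "".join(str(tok_id) for tok_id, _ in ranked)
print(answer)  # -> 2134
```

Note that `encode` on the transformers tokenizers adds special tokens by default, so the bert-base-uncased total includes a [CLS] and a [SEP] for every review.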

In [83]:
# The statement that the special tokens and their respective token ids are model-specific (model here refers to a language model) is
sorted(tokenizer.get_vocab().items(), key=lambda x: x[1])[:10]

[('[GO]', 0),
 ('[UNK]', 1),
 ('[PAD]', 2),
 ('[EOS]', 3),
 ('\x14', 4),
 ('\x18', 5),
 ('\x19', 6),
 ('\x1c', 7),
 ('\x1d', 8),
 ('\x1f', 9)]

In [80]:
sorted(bbu.get_vocab().items(), key=lambda x: x[1])[:10]

[('[PAD]', 0),
 ('[unused0]', 1),
 ('[unused1]', 2),
 ('[unused2]', 3),
 ('[unused3]', 4),
 ('[unused4]', 5),
 ('[unused5]', 6),
 ('[unused6]', 7),
 ('[unused7]', 8),
 ('[unused8]', 9)]

In [90]:
sorted(gpt2.get_vocab().items(), key=lambda x: x[1])[-20:]

[('Revolution', 50237),
 ('Ġsnipers', 50238),
 ('Ġreverted', 50239),
 ('Ġconglomerate', 50240),
 ('Terry', 50241),
 ('794', 50242),
 ('Ġharsher', 50243),
 ('Ġdesolate', 50244),
 ('ĠHitman', 50245),
 ('Commission', 50246),
 ('Ġ(/', 50247),
 ('âĢ¦."', 50248),
 ('Compar', 50249),
 ('Ġamplification', 50250),
 ('ominated', 50251),
 ('Ġregress', 50252),
 ('ĠCollider', 50253),
 ('Ġinformants', 50254),
 ('Ġgazed', 50255),
 ('<|endoftext|>', 50256)]

In [82]:
sorted(pt_tokenizer.get_vocab().items(), key=lambda x: x[1])[:10]

[('[PAD]', 0),
 ('[UNK]', 1),
 ('\x13', 2),
 ('\x14', 3),
 ('\x18', 4),
 ('\x19', 5),
 ('\x1c', 6),
 ('\x1d', 7),
 ('\x1f', 8),
 ('!', 9)]

In [86]:
# Suppose that the context length of the model is 128. Assume that a mini-batch of 8 samples is passed to a tokenizer that corresponds to a model from the Hub.
# After tokenization, the maximum length of a sample in the batch is 64. The statement that zero is appended to the “input ids” of the remaining samples to make the length 64 is

from transformers import T5TokenizerFast
t5 = T5TokenizerFast.from_pretrained("google-t5/t5-small")

In [91]:
xlnet = AutoTokenizer.from_pretrained("xlnet/xlnet-base-cased")
sorted(xlnet.get_vocab().items(), key=lambda x: x[1])[:10]


[('<unk>', 0),
 ('<s>', 1),
 ('</s>', 2),
 ('<cls>', 3),
 ('<sep>', 4),
 ('<pad>', 5),
 ('<mask>', 6),
 ('<eod>', 7),
 ('<eop>', 8),
 ('.', 9)]

In [None]:
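# The pad token ID is tokenizer-specific and not necessarily zero: in the vocabularies above it is
# 0 for bert-base-uncased and hopper, 2 for the custom tokenizer, and 5 for xlnet-base-cased.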
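
Padding a batch to its longest sample therefore needs the tokenizer's own pad ID. The helper below is a pure-Python sketch of right-padding with an attention mask. `pad_batch` is a hypothetical name, the input IDs are made up, and the pad IDs are the ones visible in the vocabularies printed above (0 for bert-base-uncased, 5 for xlnet-base-cased).

```python
def pad_batch(batch, pad_id):
    """Right-pad each list of input IDs to the longest length in the batch."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks real tokens, 0 marks padding positions
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return input_ids, attention_mask

batch = [[11, 12, 13, 14], [21, 22], [31]]  # made-up token IDs

for name, pad_id in [("bert-base-uncased", 0), ("xlnet-base-cased", 5)]:
    input_ids, attention_mask = pad_batch(batch, pad_id)
    print(name, input_ids, attention_mask)
```

In this sketch the batch is padded to its longest sample (64 in the question), not to the context length of 128, and the appended value is zero only when the pad ID happens to be zero.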