Download the BookCorpus dataset. Take every 7th sample (the indices that are multiples of 7: [0, 7, 14, 21, ...]) from the entire dataset. This yields a dataset of roughly 10 million samples (exactly 10,572,033). Use these samples to build tokenizers with the BPE tokenization algorithm, varying the vocabulary size.

Normalizer: LowerCase

PreTokenizer: WhiteSpace

Model: BPE

Special tokens: [GO],[UNK],[PAD],[EOS]

PostProcessing: None

Tokenize the input text: “SEBI study finds 93% of individual F&O traders made losses between FY22 and FY24.” using the following configurations.



In [None]:
ga_text="SEBI study finds 93% of individual F&O traders made losses between FY22 and FY24."

In [None]:
## Dataset
from pprint import pprint
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import  BpeTrainer
from transformers import PreTrainedTokenizerFast
from copy import deepcopy

In [None]:
# downloading the bookcorpus dataset

ds = load_dataset("bookcorpus", split="all")



In [None]:
# select every 7th sample (10,572,033 samples in total)
ids = range(0, len(ds), 7)
ds_new = ds.select(ids)

In [None]:
len(ds_new)

10572033
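As a quick sanity check on the count above, the number of indices produced by `range(0, n, 7)` equals `ceil(n / 7)`. The sketch below relies only on that identity. The helper name `stride_count` is made up for illustration, and the bounds it derives for the full dataset size follow from the reported 10,572,033 rather than from a separate measurement.

```python
def stride_count(n, step=7):
    """Number of indices in range(0, n, step), i.e. ceil(n / step)."""
    return (n + step - 1) // step

selected = 10_572_033  # subset size reported in the notebook

# Smallest and largest dataset sizes for which taking every 7th
# sample yields exactly `selected` rows.
n_min = (selected - 1) * 7 + 1
n_max = selected * 7

print(n_min, n_max)
print(stride_count(n_min), stride_count(n_max))
```

Any full dataset size between `n_min` and `n_max` is consistent with the reported subset size.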

In [None]:
# build the BPE tokenizer
model = BPE(unk_token="[UNK]")
tokenizer = Tokenizer(model)
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = Whitespace()

In [None]:
def get_batch(batch_size=1000):
    """Yield successive batches of raw text from ds_new for tokenizer training."""
    for i in range(0, len(ds_new), batch_size):
        yield ds_new[i: i+batch_size]["text"]

In [None]:
from multiprocessing import cpu_count
print(cpu_count())

8


# Q1

Keep the vocabulary size at 5000 and tokenize the input text using the learned vocabulary. Choose the number of tokens returned by the tokenizer.

In [None]:
trainer = BpeTrainer(vocab_size=5000,
                     special_tokens=["[UNK]","[GO]","[PAD]","[EOS]"],
                     continuing_subword_prefix="##"
                    )

In [None]:
tokenizer1 = deepcopy(tokenizer)
tokenizer1.train_from_iterator(get_batch(batch_size=10000),
                              trainer=trainer,
                              length=len(ds_new)
                            )






In [None]:
encoded = tokenizer1.encode(ga_text).tokens
len(encoded)

32

# Q2

Increase the vocabulary size to 10K, 15K and 32K. For each case, tokenize the same input with the newly learned vocabulary. Choose all the correct statements
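To build intuition for why a larger vocabulary tends to produce fewer tokens, here is a toy BPE written in plain Python. It is a simplified sketch, not the `tokenizers` implementation: it has no special tokens and no `##` prefix, the number of merges stands in for the vocabulary size, and the corpus and the names `merge_pair`, `train_toy_bpe` and `toy_encode` are invented for illustration.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_toy_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    corpus = Counter(tuple(w) for w in words)  # word (as symbols) -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties are broken deterministically by the pair itself.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def toy_encode(text, merges):
    """Lowercase, split on whitespace, then apply the merges in learned order."""
    tokens = []
    for word in text.lower().split():
        symbols = tuple(word)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens

corpus_words = "low low low lower lowest newer newer wider new new".split()
for k in (0, 3, 6):
    merges = train_toy_bpe(corpus_words, k)
    print(k, len(toy_encode("lower newest", merges)))
```

Because the first `k` merges are the same for every larger budget, each extra merge can only join symbols, never split them, so the token count in this toy setting never goes up as the merge budget grows.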

In [None]:
# vocab_size 10K

tokenizer2 = deepcopy(tokenizer)

trainer = BpeTrainer(vocab_size=10000,
                     special_tokens=["[UNK]","[GO]","[PAD]","[EOS]"],
                     continuing_subword_prefix="##"
                    )

tokenizer2.train_from_iterator(get_batch(batch_size=10000),
                              trainer=trainer,
                              length=len(ds_new))

encoded = tokenizer2.encode(ga_text).tokens
len(encoded)






28

In [None]:
tokenizer2.get_vocab_size()

10000

In [None]:
# vocab_size 15K

tokenizer3 = deepcopy(tokenizer)


trainer = BpeTrainer(vocab_size=15000,
                     special_tokens=["[UNK]","[GO]","[PAD]","[EOS]"],
                     continuing_subword_prefix="##"
                    )

tokenizer3.train_from_iterator(get_batch(batch_size=10000),
                              trainer=trainer,
                              length=len(ds_new))

encoded = tokenizer3.encode(ga_text).tokens
len(encoded)






28

In [None]:
tokenizer3.get_vocab_size()

15000

In [None]:
# vocab_size 32K

tokenizer4 = deepcopy(tokenizer)


trainer = BpeTrainer(vocab_size=32000,
                     special_tokens=["[UNK]","[GO]","[PAD]","[EOS]"],
                     continuing_subword_prefix="##"
                    )

tokenizer4.train_from_iterator(get_batch(batch_size=10000),
                              trainer=trainer,
                              length=len(ds_new))

encoded = tokenizer4.encode(ga_text).tokens
len(encoded)






25

In [None]:
tokenizer4.get_vocab_size()

32000

# Q3


Download the pre-trained tokenizer file “hopper.json” used in the lecture from [here](https://drive.google.com/file/d/1QNnyh8iMN-IqW_h1w8gAMtw09Em7-e1e/view). The tokenizer was trained on all 70 million samples in the BookCorpus dataset. Tokenize the same input text using this “hopper” tokenizer. How many tokens are there?

[After finding the answer, take a moment to compare the hopper tokenizer with the previous one]

In [None]:
pt_tokenizer = PreTrainedTokenizerFast(tokenizer_file="hopper.json",
                                       unk_token="[UNK]",
                                       pad_token="[PAD]",
                                       model_input_names=["input_ids", "token_type_ids", "attention_mask"]
                                    )

In [None]:
tokens = pt_tokenizer.encode(ga_text)
print(len(tokens))

25


# Q4

Suppose we know that the acronym “FY” will likely appear very frequently in most of the input text (assume the text comes from the financial domain). Therefore, we hope that adding it manually to the vocabulary might help. Add the token “FY” to the vocabulary and tokenize the input text. Enter the number of tokens produced.

[Question to ponder: Is reducing the number of tokens helpful?]
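One way to picture what a manually added token does is that the text is first cut around occurrences of the added token, and only the remaining pieces go through the learned BPE merges. The sketch below shows that first step in plain Python. It is a simplification under stated assumptions: the helper `split_on_added_tokens` is hypothetical, it matches against the raw text, and it ignores library options (for example whether the token is normalized or must match a whole word) that affect real behavior.

```python
import re

def split_on_added_tokens(text, added_tokens):
    """Split `text` into (segment, is_added_token) pieces around added tokens."""
    if not added_tokens:
        return [(text, False)]
    # Longest tokens first so that overlapping candidates prefer the longer match.
    ordered = sorted(added_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(t) for t in ordered) + ")"
    pieces = []
    for segment in re.split(pattern, text):
        if segment:  # re.split can emit empty strings at the boundaries
            pieces.append((segment, segment in added_tokens))
    return pieces

for piece in split_on_added_tokens("between FY22 and FY24.", ["FY"]):
    print(piece)
```

Each piece flagged `True` would become a single token. The other pieces would still be tokenized by the model as before.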

In [None]:
pt_tokenizer.add_tokens(new_tokens=["FY"])

1

In [None]:
tokens = pt_tokenizer.encode(ga_text)
print(len(tokens))

22


# Q5

Load the “bert-base-uncased” and “gpt2” tokenizers (use the AutoTokenizer class from transformers). Which of the following special tokens are used in these tokenizers?



In [None]:
from transformers import AutoTokenizer

In [None]:
# note: this loads the cased checkpoint, although the question names "bert-base-uncased"
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [None]:
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [None]:
gpt2_tokenizer.special_tokens_map

{'bos_token': '<|endoftext|>',
 'eos_token': '<|endoftext|>',
 'unk_token': '<|endoftext|>'}

In [None]:
bert_tokenizer.special_tokens_map

{'unk_token': '[UNK]',
 'sep_token': '[SEP]',
 'pad_token': '[PAD]',
 'cls_token': '[CLS]',
 'mask_token': '[MASK]'}

# Q6

By now, we have four tokenizers.

1. Custom tokenizer (vocab size 32K, trained on 10 million samples)
2. bert-base-uncased
3. gpt2
4. hopper

Use these four tokenizers to count the number of tokens for the entire “imdb” dataset (drop the “unsupervised” part of the dataset). Enter the tokenizers in order such that the size of the dataset (measured in tokens) as returned by the tokenizers is in decreasing order. For example, if the first tokenizer yields the smallest number of tokens and the fourth tokenizer yields the largest, you would enter 1234 (without any spaces).


In [None]:
# imdb dataset

imdb_ds = load_dataset("stanfordnlp/imdb", split="train+test")
imdb_ds

Dataset({
    features: ['text', 'label'],
    num_rows: 50000
})

In [None]:
def count_tokens(tokenizer, text):
    num_tokens = len(tokenizer.encode(text))
    return num_tokens

In [None]:
# 32K trained on 10 million

token_count = imdb_ds.map(lambda x: {"token_count": count_tokens(tokenizer4, x["text"])})

In [None]:
sum(token_count["token_count"])

15352840

In [None]:
# bert-base-cased tokens

token_count_bert = imdb_ds.map(lambda x:
                          {"token_count": count_tokens(bert_tokenizer, x["text"])}
                        )

In [None]:
sum(token_count_bert["token_count"])

15959815

In [None]:
# gpt2 tokens


token_count_gpt2 = imdb_ds.map(lambda x:
                          {"token_count": count_tokens(gpt2_tokenizer, x["text"])}
                         )

Token indices sequence length is longer than the specified maximum sequence length for this model (1168 > 1024). Running this sequence through the model will result in indexing errors


In [None]:
sum(token_count_gpt2["token_count"])

14812432

In [None]:
# hopper

token_count_hopper = imdb_ds.map(lambda x:
                          {"token_count": count_tokens(pt_tokenizer, x["text"])}
                         )

In [None]:
sum(token_count_hopper["token_count"])

15347982

In [None]:
# collect the four totals and sort them in ascending order of token count
x = {"tokenizer4": 15352840,
     "bert": 15959815,
     "gpt2": 14812432,
     "hopper": 15347982}

sorted_x = dict(sorted(x.items(), key=lambda item: item[1]))
sorted_x

{'gpt2': 14812432,
 'hopper': 15347982,
 'tokenizer4': 15352840,
 'bert': 15959815}

3 gpt2
4 hopper
1 Custom tokenizer (vocab size 32K, trained on 10 million samples)
2 bert-base-uncased
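The four totals recorded in this notebook can be turned into the answer string for either sort direction with a few lines of Python. This is only a convenience sketch: the `rank` helper is hypothetical, the keys follow the numbering of the tokenizers in the question, and the counts are copied from the outputs above rather than recomputed.

```python
# Token totals on IMDB (train + test), keyed by the tokenizer number from the question.
totals = {
    1: 15_352_840,  # custom tokenizer, vocab size 32K
    2: 15_959_815,  # BERT tokenizer as loaded in this notebook
    3: 14_812_432,  # gpt2
    4: 15_347_982,  # hopper
}

def rank(totals, decreasing=True):
    """Return the tokenizer numbers as a string, ordered by total token count."""
    ordered = sorted(totals, key=totals.get, reverse=decreasing)
    return "".join(str(k) for k in ordered)

print("decreasing:", rank(totals, decreasing=True))
print("increasing:", rank(totals, decreasing=False))
```

Printing both directions makes it easy to pick whichever ordering the question intends.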

# Q7

The statement that the special tokens and their respective token ids are model-specific (model here refers to a language model) is



In [None]:
# YES

# Q8

Suppose that the context length of the model is 128. Assume that a mini-batch of 8 samples is passed to a tokenizer that corresponds to a model from the Hub. After tokenization, the maximum length of a sample in the batch is 64. The statement that zero is appended to the “input ids” of the remaining samples to make the length 64 is
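To reason about this, it helps to see what padding a batch to its longest sample looks like. The sketch below is written in plain Python under a few assumptions: padding was requested, it goes on the right, and the value appended is the tokenizer's pad token id, passed in here as `pad_id`. The function `pad_batch` and the ids in the example are invented for illustration. Whether the appended value is zero depends on the id a given tokenizer assigns to its padding token.

```python
def pad_batch(batch_input_ids, pad_id):
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)              # pad with the pad token id
        attention_mask.append([1] * len(ids) + [0] * n_pad)   # 0 marks padded positions
    return input_ids, attention_mask

batch = [[11, 12, 13, 14], [21, 22], [31]]  # made-up token ids
for pad_id in (0, 3):
    ids, mask = pad_batch(batch, pad_id)
    print(pad_id, ids, mask)
```

The attention mask is always 0 at padded positions, whereas the value written into the input ids changes with `pad_id`.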

