Notes on how to use tokenizers, illustrated by how BERT is fine-tuned on SQuAD

Reference: https://github.com/adensur/blog/blob/main/nlp/02_reading_bert_source_code/bert_sandbox.ipynb

In [2]:
'''
Using the huggingface packages like transformers and datasets
'''

from datasets import load_dataset
import torch
from transformers import AutoModelForQuestionAnswering
import numpy as np
import evaluate
from tqdm.auto import tqdm

raw_datasets = load_dataset("squad")    # dataset in the huggingface hub


In [3]:
'''
The structure of the squad dataset
'''
print(raw_datasets)

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})


In [4]:
'''
About the answers: each answer text is a span copied from the context,
so `answer_start` gives the character index in the context where that span begins
'''
example = raw_datasets["train"][0]
example

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}

In [5]:
example["context"][515:]

'Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'
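
The relationship between `answer_start` and the answer text can be checked with plain string slicing. The sketch below uses a shortened, hand-made record shaped like a SQuAD example (not the real one), so the index it computes is illustrative only.

```python
# Toy record shaped like a SQuAD example; the context is shortened on purpose.
context = (
    "It is a replica of the grotto at Lourdes, France where the Virgin Mary "
    "reputedly appeared to Saint Bernadette Soubirous in 1858."
)
answer_text = "Saint Bernadette Soubirous"
toy_example = {
    "context": context,
    "answers": {"text": [answer_text], "answer_start": [context.find(answer_text)]},
}

# answer_start is a character offset; the end offset is start + len(text), exclusive.
start = toy_example["answers"]["answer_start"][0]
end = start + len(toy_example["answers"]["text"][0])
recovered = toy_example["context"][start:end]
print(start, end, repr(recovered))
```

The same `start`/`end` pair is what the training preprocessing later converts into token positions.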

In [None]:
'''
Load the tokenizer that belongs to the BERT checkpoint:
given the model checkpoint name, AutoTokenizer finds the matching pretrained tokenizer

A tokenizer transforms natural-language text into tokens, each represented by an integer ID
'''
from transformers import AutoTokenizer

model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [7]:
test = "In a shocking finding, scientists a herd of unicorns. Some spaces:     . Some Chinese: 边界. "
token_ids = tokenizer.encode(test)
token_ids

[101,
 1130,
 170,
 19196,
 4006,
 117,
 6479,
 170,
 17804,
 1104,
 8362,
 23941,
 1116,
 119,
 1789,
 6966,
 131,
 119,
 1789,
 1922,
 131,
 100,
 100,
 119,
 102]

In [None]:
'''
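1. Special tokens such as [CLS] and [SEP] are added automatically
2. Punctuation marks are kept as tokens of their own
3. This BERT tokenizer is case-sensitive (the checkpoint is "bert-base-cased"), so "Some" and "some" are different tokens
4. Runs of spaces, no matter how long, are collapsed and produce no tokens
5. The two Chinese characters become [UNK], the token used for input the vocabulary cannot represent
6. A word missing from the vocabulary is split into subword pieces: "unicorns" becomes "un", "##icorn", "##s"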
'''

for token in token_ids:
    print(tokenizer.decode(token))
tokenizer.decode(token_ids)

[CLS]
In
a
shocking
finding
,
scientists
a
herd
of
un
##icorn
##s
.
Some
spaces
:
.
Some
Chinese
:
[UNK]
[UNK]
.
[SEP]


'[CLS] In a shocking finding, scientists a herd of unicorns. Some spaces :. Some Chinese : [UNK] [UNK]. [SEP]'
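
The split of "unicorns" into `un`, `##icorn`, `##s` follows from greedy longest-match-first subword matching, which is how WordPiece tokenization is usually described. Below is a minimal sketch of that idea, assuming a tiny hypothetical vocabulary (`toy_vocab`) rather than BERT's real one; the function name `wordpiece` is made up for this note, and the real tokenizer also handles normalization and punctuation splitting, which are left out here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##icorn", "##s", "herd", "of"}  # hypothetical, for illustration
unicorn_pieces = wordpiece("unicorns", toy_vocab)
chinese_pieces = wordpiece("边", toy_vocab)
print(unicorn_pieces)
print(chinese_pieces)
```

In this toy setup a character with no entry in the vocabulary cannot be matched at all, which mirrors why the two Chinese characters decode to `[UNK]`.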

In [9]:
'''
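The vocabulary holds about 29k tokens (whole words and subword pieces).
Words outside it are split into known pieces; text that cannot be split that way becomes [UNK]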
'''
print(tokenizer.vocab_size)

28996


In [10]:
'''
How to use the tokenizer in practice
1. Two texts can be passed together; they are joined automatically with [SEP] between them
2. Truncation and padding options:
    max_length=40: each tokenized output is at most 40 tokens long (including special tokens).
    truncation="only_second": if the combined input is too long, only the second sequence (the context, test2) is truncated.
    stride=5: when the context is split into multiple chunks, each chunk overlaps the previous one by 5 tokens, which preserves some context across the split.
    return_overflowing_tokens=True: returns every chunk produced by the split, not just the first one.
    return_offsets_mapping=True: also returns, for each token, its character span in the source text.
    padding="max_length": pads every chunk to exactly max_length tokens.
''' 

test2 = "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English."
tokens = tokenizer(
    "Here is a question. ",
    test2,
    max_length=40,
    truncation="only_second",
    stride=5,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    padding="max_length",
)


In [11]:
'''
Explanation of the return value `tokens`: 
-- input_ids:
The list of token IDs (integers) representing the tokenized input text, including special tokens like [CLS] and [SEP].
-- token_type_ids:
Indicates which segment each token belongs to (e.g., 0 for the first sentence/question, 1 for the second/context). Used in tasks like question answering.
-- attention_mask:
A mask with 1s for real tokens and 0s for padding tokens. This tells the model which tokens should be attended to.
-- offset_mapping:
For each token, gives a tuple (start, end) with the character positions of that token in the text it came from (the question or the context, each counted from 0). Useful for mapping tokens back to the original text.
e.g. (0, 0), (0, 4): the first token is [CLS], which covers no characters (0 to 0); the second token is "Here", which covers characters 0 to 4 (end exclusive).
-- overflow_to_sample_mapping:
When the input is too long and split into multiple chunks, this maps each chunk back to the original sample index it came from.
e.g. [0, 0] here means both the 1st and the 2nd chunk come from sample 0, the single question/context pair that was passed in.
'''

print('tokens: ', tokens.keys())
# tokens:  dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])
print('tokens["input_ids"]: ', tokens["input_ids"])     # a list of lists: one list of token IDs per chunk
print('tokens["token_type_ids"]:', tokens["token_type_ids"])
print('tokens["attention_mask"]:', tokens["attention_mask"])
print('tokens["offset_mapping"]:', tokens["offset_mapping"])
print('tokens["overflow_to_sample_mapping"]:', tokens["overflow_to_sample_mapping"])
# len(tokens["input_ids"]), token_ids

tokens:  dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])
tokens["input_ids"]:  [[101, 3446, 1110, 170, 2304, 119, 102, 1130, 170, 19196, 4006, 117, 7482, 2751, 170, 17804, 1104, 8362, 23941, 1116, 1690, 1107, 170, 6456, 117, 2331, 25731, 1775, 1643, 21425, 1181, 4524, 117, 1107, 1103, 19505, 5249, 119, 2431, 102], [101, 3446, 1110, 170, 2304, 119, 102, 1103, 19505, 5249, 119, 2431, 1167, 11567, 1106, 1103, 6962, 1108, 1103, 1864, 1115, 1103, 8362, 23941, 1116, 2910, 3264, 1483, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
tokens["token_type_ids"]: [[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
tokens["attention_mask"]: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
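
How the three parallel lists line up can be reproduced by hand. The sketch below rebuilds `input_ids`, `token_type_ids` and `attention_mask` for the second chunk, assuming the special-token IDs seen in the output above ([CLS]=101, [SEP]=102, [PAD]=0). `build_features` is a hypothetical helper written for this note, not a library function.

```python
def build_features(question_ids, context_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Assemble one padded [CLS] question [SEP] context [SEP] feature."""
    input_ids = [cls_id] + question_ids + [sep_id] + context_ids + [sep_id]
    # Segment 0 covers [CLS], the question and its [SEP]; segment 1 covers the context and final [SEP].
    token_type_ids = [0] * (len(question_ids) + 2) + [1] * (len(context_ids) + 1)
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    if n_pad < 0:
        raise ValueError("sequence is longer than max_length")
    # Padding positions get the pad ID, segment 0 and attention 0.
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


question_ids = [3446, 1110, 170, 2304, 119]  # "Here is a question ."
context_ids = [1103, 19505, 5249, 119, 2431, 1167, 11567, 1106, 1103, 6962, 1108,
               1103, 1864, 1115, 1103, 8362, 23941, 1116, 2910, 3264, 1483, 119]
features = build_features(question_ids, context_ids, max_length=40)
print(sum(features["attention_mask"]), "real tokens,", features["input_ids"].count(0), "pads")
```

Note that padding positions share segment ID 0 with the question, so `attention_mask`, not `token_type_ids`, is what marks them as ignorable.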

In [12]:
'''
When the input is split into chunks, the first sentence (the question) is repeated in full in every chunk;
only the second sentence (the context) is split, and consecutive chunks overlap by 5 tokens (the stride).
'''
print(tokenizer.decode(tokens["input_ids"][0]))
print(tokenizer.decode(tokens["input_ids"][1]))

[CLS] Here is a question. [SEP] In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even [SEP]
[CLS] Here is a question. [SEP] the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
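
The two decoded chunks above can be modeled with a simple sliding window. This is a simplified sketch of the behavior, not the tokenizer's actual code: `split_with_stride` is a hypothetical helper, tokens are plain strings, and the context length of 49 is inferred from the output (32 context tokens in the first chunk, 22 in the second, 5 shared).

```python
def split_with_stride(question, context, max_length, stride):
    """Split `context` into overlapping windows, repeating `question` in each one."""
    # Room left for context tokens after [CLS] question [SEP] ... [SEP].
    window = max_length - len(question) - 3
    if window <= stride:
        raise ValueError("max_length leaves no room to advance past the stride")
    chunks = []
    start = 0
    while True:
        piece = context[start:start + window]
        chunks.append(["[CLS]", *question, "[SEP]", *piece, "[SEP]"])
        if start + window >= len(context):
            break
        start += window - stride  # re-read the last `stride` tokens in the next chunk
    return chunks


question = ["Here", "is", "a", "question", "."]
context = [f"c{i}" for i in range(49)]  # 49 placeholder context tokens
chunks = split_with_stride(question, context, max_length=40, stride=5)
for chunk in chunks:
    print(len(chunk), chunk[7:12], "...", chunk[-6:-1])
```

With these numbers the first chunk fills all 40 positions and the second holds 30 tokens before padding, matching the ten [PAD] tokens in the decoded output.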


In [13]:
inputs = tokenizer(
    raw_datasets["train"][2:6]["question"],
    raw_datasets["train"][2:6]["context"],
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
)

print(f"The 4 examples gave {len(inputs['input_ids'])} features.")
print(f"Here is where each comes from: {inputs['overflow_to_sample_mapping']}.")

The 4 examples gave 19 features.
Here is where each comes from: [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3].
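
The mapping printed above is enough to group features by the example they came from. A minimal sketch using only the standard library and the list copied from that output:

```python
from collections import defaultdict

# Copied from the output above: one entry per feature, giving its source example.
overflow_to_sample_mapping = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]

features_per_example = defaultdict(list)
for feature_idx, sample_idx in enumerate(overflow_to_sample_mapping):
    features_per_example[sample_idx].append(feature_idx)

for sample_idx, feature_indices in features_per_example.items():
    print(f"example {sample_idx}: {len(feature_indices)} features -> {feature_indices}")
```

Examples 0 to 2 each yield 4 features and example 3 yields 7, which adds up to the 19 features reported.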


In [14]:
max_length = 384
stride = 128


def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
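
The core of the function above is turning a character span into a token span by scanning the offsets. The sketch below isolates that step on a toy context, assuming a crude regex "tokenizer" in place of BERT's so that it runs without any model download. `char_span_to_token_span` is a hypothetical helper that mirrors the two `while` loops.

```python
import re


def char_span_to_token_span(offsets, start_char, end_char):
    """Return (start_token, end_token) covering [start_char, end_char), or (0, 0)."""
    # Answer not fully inside this chunk: label it (0, 0), as in the function above.
    if offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return (0, 0)
    # Walk forward past every token that starts at or before the answer start.
    idx = 0
    while idx < len(offsets) and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1
    # Walk backward past every token that ends at or after the answer end.
    idx = len(offsets) - 1
    while idx >= 0 and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return (start_token, end_token)


context = "Mary appeared to Saint Bernadette Soubirous in 1858."
answer = "Saint Bernadette Soubirous"
# Toy tokenization: words and punctuation marks, with their character offsets.
matches = list(re.finditer(r"\w+|[^\w\s]", context))
toy_tokens = [m.group() for m in matches]
toy_offsets = [(m.start(), m.end()) for m in matches]

start_char = context.find(answer)
end_char = start_char + len(answer)
span = char_span_to_token_span(toy_offsets, start_char, end_char)
print(span, toy_tokens[span[0]:span[1] + 1])

# A chunk that ends before the answer does gets the (0, 0) label.
truncated_span = char_span_to_token_span(toy_offsets[:4], start_char, end_char)
print(truncated_span)
```

The returned end index is inclusive, which is why the slice uses `span[1] + 1`.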

In [15]:
train_dataset = raw_datasets["train"].map(
    preprocess_training_examples,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)
len(raw_datasets["train"]), len(train_dataset)

(87599, 88729)

In [16]:
train_dataset[2:4]

{'input_ids': [[101,
   1109,
   19349,
   1104,
   1103,
   11373,
   1762,
   1120,
   10360,
   8022,
   1110,
   3148,
   1106,
   1134,
   2401,
   136,
   102,
   22182,
   1193,
   117,
   1103,
   1278,
   1144,
   170,
   2336,
   1959,
   119,
   1335,
   4184,
   1103,
   4304,
   4334,
   112,
   188,
   2284,
   10945,
   1110,
   170,
   5404,
   5921,
   1104,
   1103,
   6567,
   2090,
   119,
   13301,
   1107,
   1524,
   1104,
   1103,
   4304,
   4334,
   1105,
   4749,
   1122,
   117,
   1110,
   170,
   7335,
   5921,
   1104,
   4028,
   1114,
   1739,
   1146,
   14089,
   5591,
   1114,
   1103,
   7051,
   107,
   159,
   21462,
   1566,
   24930,
   2508,
   152,
   1306,
   3965,
   107,
   119,
   5893,
   1106,
   1103,
   4304,
   4334,
   1110,
   1103,
   19349,
   1104,
   1103,
   11373,
   4641,
   119,
   13301,
   1481,
   1103,
   171,
   17506,
   9538,
   1110,
   1103,
   144,
   10595,
   2430,
   117,
   170,
   14789,
   1282,
   1104,
   8

In [17]:
def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs
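
Setting the offsets of non-context tokens to `None` makes it easy to tell, later on, whether a predicted token position lies in the context, and to recover the answer text when it does. The sketch below shows the idea on hand-written toy data; the `span_to_text` helper and the post-processing step it stands for are assumptions of this note and are not part of the code above.

```python
# Toy feature: [CLS] Who ? [SEP] Saint Bernadette saw Mary . [SEP]
context = "Saint Bernadette saw Mary."
sequence_ids = [None, 0, 0, None, 1, 1, 1, 1, 1, None]
offsets = [(0, 0), (0, 3), (3, 4), (0, 0), (0, 5), (6, 16), (17, 20), (21, 25), (25, 26), (0, 0)]

# Same masking as in preprocess_validation_examples: keep offsets only for context tokens.
masked_offsets = [o if sequence_ids[k] == 1 else None for k, o in enumerate(offsets)]


def span_to_text(context, masked_offsets, start_token, end_token):
    """Map an inclusive token span back to context text, or None if it leaves the context."""
    start_offset = masked_offsets[start_token]
    end_offset = masked_offsets[end_token]
    if start_offset is None or end_offset is None:
        return None
    return context[start_offset[0]:end_offset[1]]


in_context = span_to_text(context, masked_offsets, 4, 5)
in_question = span_to_text(context, masked_offsets, 1, 2)
print(masked_offsets)
print(repr(in_context), in_question)
```

Without the masking, a span over the question tokens would silently index into the context, because question offsets also start at 0.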

In [18]:
validation_dataset = raw_datasets["validation"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=raw_datasets["validation"].column_names,
)
len(raw_datasets["validation"]), len(validation_dataset)


Map: 100%|██████████| 10570/10570 [00:03<00:00, 3025.62 examples/s]


(10570, 10822)

In [19]:
?tokenizer.get_special_tokens_mask

Signature:
tokenizer.get_special_tokens_mask(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    already_has_special_tokens: bool = False,
) -> List[int]
Docstring:
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer `prepare_for_model` or `encode_plus` methods.

Args:
    token_ids_0 (`List[int]`):
        List of ids of the first sequence.
    token_ids_1 (`List[int]`, *optional*):
        List of ids of the second sequence.
    already_has_special_tokens (`bool`, *optional*, defaults to `False`):
        Whether or not the token list is already formatted with special tokens for the model.

Returns:
    A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
File:      ~/anaconda3/envs/technotes/lib/python3.12/site-packages/transformers/tokenization_utils_base.py

In [20]:
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
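
The warning above says the `qa_outputs` layer is newly initialized. In a BERT question-answering head, that layer produces one start logit and one end logit per token, and training averages the cross-entropy of the two against `start_positions` and `end_positions`. Below is a minimal pure-Python sketch of that loss with made-up logits; it illustrates the formula and is not the library's code.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of `logits` at index `target` (log-sum-exp for stability)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def qa_loss(start_logits, end_logits, start_position, end_position):
    """Average of the start-position and end-position cross-entropies."""
    start_loss = cross_entropy(start_logits, start_position)
    end_loss = cross_entropy(end_logits, end_position)
    return (start_loss + end_loss) / 2


# Made-up logits over a 5-token sequence: the model favors start=2 and end=3.
start_logits = [0.1, 0.2, 3.0, 0.3, 0.1]
end_logits = [0.1, 0.1, 0.2, 2.5, 0.4]

loss_when_right = qa_loss(start_logits, end_logits, start_position=2, end_position=3)
loss_when_wrong = qa_loss(start_logits, end_logits, start_position=0, end_position=0)
print(round(loss_when_right, 4), round(loss_when_wrong, 4))
```

Features labeled (0, 0) by the training preprocessing point both targets at the [CLS] position, so the model is trained to pick [CLS] when a chunk does not contain the answer.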


In [21]:
import accelerate  
accelerate.__version__

'1.7.0'

In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
    "bert-finetuned-squad",
    eval_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,
    push_to_hub=False,
)

from transformers import Trainer

# Note: nothing in this cell disables wandb logging;
# passing report_to="none" to TrainingArguments is one way to do so.

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
)
trainer.train()



Step    Training Loss
500     2.6393
1000    1.7364
1500    1.5197
2000    1.4077
2500    1.365
3000    1.3038
3500    1.2219
4000    1.2048
