# Introduction to training an NLP model with HuggingFace #

Welcome, friends. This notebook discusses the fundamentals of model training: how to prepare your data for training, the important hyperparameters that drive model performance, and why we go through all this trouble in the first place. We'll explore all this through the lens of <a href="https://huggingface.co/docs/transformers/tasks/question_answering#question-answering">HuggingFace's question answering task guide</a>. Nearly all the code in this demonstration is taken from that tutorial; this notebook adds color to the NLP terminology used and explains the programming logic in more detail.

In [1]:
from datasets import load_dataset

squad = load_dataset("squad", split="train[:5000]")


Downloading builder script: 5.27kB [00:00, 1.51MB/s]
Downloading metadata: 2.36kB [00:00, 1.40MB/s]
Downloading readme: 7.67kB [00:00, 1.07MB/s]
Found cached dataset squad (/Users/natepruitt/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)


We begin by loading the 'squad' dataset. HuggingFace offers a <a href="https://huggingface.co/docs/datasets/index">wide range of datasets</a> that can be loaded by simply passing their name as the argument to 'load_dataset'. The split="train[:5000]" argument keeps only the first 5000 rows of the training split. Loading a dataset returns a dict-like 'Dataset' object with functions to help manipulate the data.



When we build our model, we want one dataset of inputs with labeled outputs, and a second dataset of inputs whose outputs are held back from the model. In the biz, these two datasets are referred to as 'train' and 'test' respectively. The model uses the training set to build its prediction parameters and the test set to, you guessed it, test its prediction ability.

For example, we will be building a question answering model. A fully trained question answering model receives a question and a context - the model then returns the answer found in the context. This is an important point - we are <em>not</em> building a text generation model like ChatGPT, where we can pass it only a question and expect an answer. Our model must have a context to extract an answer from. Therefore, each row of our training dataset has a question, a context, and an answer - including the answer gives the model a target output as it calculates its prediction parameters. At test time the model is given only the question and context; the answer is kept out of sight and used afterwards to score the prediction. We are taking off the training wheels (pun intended) and letting the model predict the output without our help.




In [2]:
squad = squad.train_test_split(test_size=0.2)

Split the dataset into "training" and "testing" with the 'train_test_split' method on the <a href="https://huggingface.co/docs/datasets/v2.9.0/en/package_reference/main_classes#datasets.Dataset.train_test_split">HuggingFace Dataset object.</a> 

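The argument passed to the 'test_size' parameter determines what proportion of the original data should be dedicated to testing. The initial dataset had 5000 questions and we passed 0.2, so the test set holds 1000 questions (5000 x 0.2). Our model will be trained on the remaining 4000 questions. Rows are shuffled before splitting by default, so the exact rows in each split differ between runs unless a seed is passed.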
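The arithmetic behind the split can be sketched without any library. This is only an illustration of the idea (shuffle the row indices, then reserve a fraction of them for testing), not the code HuggingFace runs; the helper name `split_indices` and its `seed` default are made up for this example.

```python
import random


def split_indices(num_rows, test_size, seed=0):
    """Shuffle row indices, then reserve a `test_size` fraction for testing.

    Hypothetical helper for illustration only.
    """
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    num_test = int(round(num_rows * test_size))
    # The first `num_test` shuffled indices become the test set, the rest train.
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = split_indices(5000, test_size=0.2)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two lists, which is the property that matters: the model is never scored on a row it trained on.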

In [3]:
print(len(squad["train"]["context"]))
print(len(squad["test"]["context"]))

4000
1000





## Converting from human-readable to machine-readable ##
Datasets for training natural language models are made up of human-readable raw text - as an example, here's the first question, context, and answer from our training set: 

In [4]:
# 'question' is a list of questions. 
print(f"Question:'{squad['train']['question'][0]}'")
print(f"Context:'{squad['train']['context'][0]}'")
print(f"Answer: '{squad['train']['answers'][0]['text'][0]}'")

Question:'How much did Yao Ming donate?'
Context:'By May 14, the Ministry of Civil Affairs stated that 10.7 billion yuan (approximately US$1.5 billion) had been donated by the Chinese public. Houston Rockets center Yao Ming, one of the country's most popular sports icons, gave $214,000 and $71,000 to the Red Cross Society of China. The association has also collected a total of $26 million in donations so far. Other multinational firms located in China have also announced large amounts of donations.'
Answer: '$214,000 and $71,000'


Transformer models cannot process raw text. A transformer model (or any natural language model, really) needs numerical inputs - the underpinnings of these AI models are math. Extremely complicated math, but math nonetheless. How do we transform the above text into numbers?

### The Tokenizer ###

A tokenizer is an object that accepts raw text sequences (sentences) as input and outputs a machine-readable version of each sequence. The machine-readable version is a sequence of 'ids' that won't mean a lick to you or me but reads like Shakespearean poetry to our model.

Tokenizing refers to the complete process of segmenting a sentence into words / phrases and assigning a numerical id to each word / phrase. Tokenizing the question and context returns a dictionary-like object; with the settings used later in this notebook, its keys are input_ids, attention_mask, and offset_mapping.

Initialize a tokenizer using the <a href="https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer">HuggingFace 'AutoTokenizer'</a> class. Pass the name of a pre-trained model as an argument to the 'from_pretrained' method - it should be the same model you'll eventually train on your dataset, since each model expects the ids produced by its own tokenizer.

In [5]:
# Import and initialize tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# Tokenize the first question
tokenized_question = tokenizer.tokenize(squad['train']['question'][0])
print(f"Tokenized question: {tokenized_question}")

Tokenized question: ['how', 'much', 'did', 'yao', 'ming', 'donate', '?']


The 'tokenize' method returns a list of <strong>character tokens</strong> (the tokens as strings), but the model won't understand them in this format either. Notice that the tokens are lowercased, because we loaded an 'uncased' model. To transform these tokens into what are called "ids", call the method <a href="https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.convert_tokens_to_ids">'convert_tokens_to_ids'</a>.

In [6]:
# transform character tokens into ids
input_ids = tokenizer.convert_tokens_to_ids(tokenized_question)
print(f"Input ids: {input_ids}")

Input ids: [2129, 2172, 2106, 23711, 11861, 21357, 1029]


To reduce the process into one function call, run <a href="https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.encode"> the encode method</a>.

In [7]:
encoded_question = tokenizer.encode(squad['train']['question'][0])
print(encoded_question)

[101, 2129, 2172, 2106, 23711, 11861, 21357, 1029, 102]


Astute readers will notice the output of the 'encode' call is different from the output of 'convert_tokens_to_ids(tokenize(text))'. Specifically, the sequence output by 'encode' has one additional id at the beginning (101) and one at the end (102). "Decoding" (converting a sequence of ids back to a string) reveals the difference.

In [8]:
decoded_output_convert_tokens = tokenizer.decode(input_ids)
decoded_output_encoded_tokens = tokenizer.decode(encoded_question)

print(f"Decoded from input ids: {decoded_output_convert_tokens}")
print(f"Decoded from 'encode' output: {decoded_output_encoded_tokens}")

Decoded from input ids: how much did yao ming donate?
Decoded from 'encode' output: [CLS] how much did yao ming donate? [SEP]


Aha! The mystery tokens are 'CLS' and 'SEP'; they are referred to as 'special tokens'. Common in BERT-derived models, 'CLS' stands for 'classification' and is placed at the beginning of every input sequence. The model's output vector at the 'CLS' position serves as a single summary of the whole sequence, which is what classification tasks base their prediction on.

'SEP' stands for separator and exists to mark the boundaries between pieces of text within a sequence. While not relevant for the above one-sentence example, it becomes crucial later on when tokenizing the questions and contexts together. Remember, the model needs both a 'question' and a 'context' in order to make a prediction. Ultimately, each input our model receives as training data will be a <em>combined</em> vector of the question and corresponding context. More on that later.
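To make the role of the special tokens concrete, here is a toy encoder. The ids in `VOCAB` are copied from the outputs printed above; the function `toy_encode` is made up for this example and skips everything a real tokenizer does to split text (lowercasing, sub-word splitting, and so on).

```python
# Ids copied from the notebook output above; everything else is illustrative.
VOCAB = {
    "[CLS]": 101, "[SEP]": 102,
    "how": 2129, "much": 2172, "did": 2106,
    "yao": 23711, "ming": 11861, "donate": 21357, "?": 1029,
}


def toy_encode(tokens, pair_tokens=None):
    """Wrap already-split tokens in [CLS] ... [SEP], optionally adding a second text."""
    ids = [VOCAB["[CLS]"]] + [VOCAB[t] for t in tokens] + [VOCAB["[SEP]"]]
    if pair_tokens is not None:
        # A second [SEP] closes the second text, so the boundary between the two stays visible.
        ids += [VOCAB[t] for t in pair_tokens] + [VOCAB["[SEP]"]]
    return ids


question_tokens = ["how", "much", "did", "yao", "ming", "donate", "?"]
single = toy_encode(question_tokens)
pair = toy_encode(question_tokens, ["yao", "ming"])
print(single)
print(pair)
```

The single-text result reproduces the 'encode' output shown earlier, and the paired result shows the layout used for question/context pairs: [CLS], first text, [SEP], second text, [SEP].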

### Tokenizing questions and contexts ###

Let's begin preprocessing the data by tokenizing all the questions and contexts. Calling the <a href="https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.__call__">'tokenizer'</a> object directly accepts a multitude of parameters to customize its output.

The first two parameters are <strong>text</strong> and <strong>text_pair</strong>. Each is a list of strings; the tokenizer joins the i-th entry of each list into one combined sequence.

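<strong>max_length</strong> sets the maximum length for each output sequence. Sequences longer than this are truncated only when the <strong>truncation</strong> parameter is also set (the preprocessing function later in this notebook passes truncation="only_second", which trims the context and never the question).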

<strong>return_offsets_mapping</strong> set to True instructs the tokenizer to include a datatable of token offset mappings in the output. Offset mapping will be explained in more detail later. 

<strong>padding</strong> tells the tokenizer how to pad each sequence of input ids. A model expects every sequence in a batch to be the same length. Of course, it is unrealistic for each question and context string to be the exact same number of tokens. To reconcile this, the tokenizer appends ("pads") however many [PAD] tokens are necessary to bring each sequence to a consistent length. Passing "max_length" pads every sequence up to the max_length value.
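The effect of max_length, padding, and truncation on a single sequence can be sketched in a few lines. This is a simplified illustration, not the tokenizer's code: `pad_to_max_length` is a made-up helper, it assumes the [PAD] id is 0, and its truncation is a naive cut (a real tokenizer keeps the closing [SEP] in place when it truncates).

```python
PAD_ID = 0  # assumption: id used for [PAD] by the tokenizer in this notebook


def pad_to_max_length(ids, max_length, truncate=False):
    """Return a copy of `ids` padded to `max_length` (hypothetical helper)."""
    if len(ids) > max_length:
        # Without truncation, an over-long sequence is left as it is.
        return ids[:max_length] if truncate else list(ids)
    return ids + [PAD_ID] * (max_length - len(ids))


short = [101, 2129, 2172, 102]
long = list(range(1, 11))

print(pad_to_max_length(short, 6))
print(len(pad_to_max_length(long, 6)))
print(len(pad_to_max_length(long, 6, truncate=True)))
```

The second call shows why max_length alone does not guarantee equal lengths: unless truncation is switched on, a long sequence stays long.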

The code below generates the 'inputs' variable; a dict-like object that stores the transformed raw text.

In [9]:
questions = [q.strip() for q in squad["train"]["question"]]
context = [c.strip() for c in squad['train']['context']]
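inputs = tokenizer(
        questions,
        context,
        max_length=384,
        # Trim any context that would push a row past max_length, so every row has exactly 384 ids
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )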

In [10]:
# Length of the first row of input ids - it should equal max_length (384).
len(inputs['input_ids'][0])

384

In [11]:
print(f"Keys of dict-like object 'input': {inputs.keys()}")
print('\n')
print(f"Number of rows in input_ids table: {len(inputs['input_ids'])}")
print('\n')
print(f"Number of columns in input_ids table: {len(inputs['input_ids'][0])}")

Keys of dict-like object 'input': dict_keys(['input_ids', 'attention_mask', 'offset_mapping'])


Number of rows in input_ids table: 4000


Number of columns in input_ids table: 384


#### The 'inputs' variable ####

The tokenizer's objective is to return a table of 'input_ids', which for most transformer models is the only <a href="https://huggingface.co/transformers/v3.1.0/glossary.html#input-ids">required parameter</a>.

The cell above shows important characteristics of the 'inputs' variable. There are three keys in this dict-like object, and each key retrieves an array of arrays, which is best thought of as a datatable.

Looking specifically at the datatable at key 'input_ids', each row represents a sequence of character tokens, and each column holds the character token at that position in the sequence. For example, the value at row 1, column 2 is the input id (numerical representation) of the character token from sequence 2, position 3 of our sample dataset (rows and columns are zero indexed).

The datatable has 4000 rows, one for each example in our training data. Each row is a pairing of a question and its associated context. Since training a question answering model requires both a question and a context, the tokenizer combines them into one sequence. In the tokenizer call, the questions and contexts are passed in separately - the tokenizer handles joining them.

The input ids table has 384 columns because that is the value passed to the <strong>max_length</strong> argument in the tokenizer call. For each combined question/context sequence that does not reach 384 tokens, the tokenizer appends [PAD] tokens to the end of the sequence.

As an example, below is the first row of the input ids table, along with its 'decoded', human-readable form.

In [12]:
print(f"First row in input ids table: {inputs['input_ids'][0]}")
print('\n')
print(f"First row in input ids table (decoded): {tokenizer.decode(inputs['input_ids'][0])}")

First row in input ids table: [101, 2129, 2172, 2106, 23711, 11861, 21357, 1029, 102, 2011, 2089, 2403, 1010, 1996, 3757, 1997, 2942, 3821, 3090, 2008, 2184, 1012, 1021, 4551, 11237, 1006, 3155, 2149, 1002, 1015, 1012, 1019, 4551, 1007, 2018, 2042, 6955, 2011, 1996, 2822, 2270, 1012, 5395, 12496, 2415, 23711, 11861, 1010, 2028, 1997, 1996, 2406, 1005, 1055, 2087, 2759, 2998, 18407, 1010, 2435, 1002, 19936, 1010, 2199, 1998, 1002, 6390, 1010, 2199, 2000, 1996, 2417, 2892, 2554, 1997, 2859, 1012, 1996, 2523, 2038, 2036, 5067, 1037, 2561, 1997, 1002, 2656, 2454, 1999, 11440, 2061, 2521, 1012, 2060, 20584, 9786, 2284, 1999, 2859, 2031, 2036, 2623, 2312, 8310, 1997, 11440, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

The decoded sequence includes [SEP] and [PAD] tokens. We discussed [SEP] tokens earlier, but this example makes their importance more apparent: one marks the split between question and context, and another marks the end of the context, after which the [PAD] tokens begin. The [PAD] tokens are added by the tokenizer to reach the 384 length requirement - they appear as 0 in the input ids table.

The tokenization process is more complicated than our example suggests. For example, how do we tokenize sentences with special characters and symbols? What about capitalization? For information on the inner workings of tokenization, HuggingFace has an <a href="https://www.youtube.com/watch?v=Yffk5aydLzg">introduction video</a> that makes a nice jumping-off point for a trip down the rabbit hole. For the purposes of this demonstration, it is enough to understand that, at a high level, a tokenizer converts text sequences (sentences) into segmented text sequences and finally into sequences of numerical ids.

#### Attention! Attention! Read all about masks and offset mappings ####

The 'attention_mask' key retrieves a datatable of 1s and 0s that marks which tokens the model should pay attention to when making a prediction. The table is 4000 rows by 384 columns - the same dimensions as the input ids table. Each cell in the attention mask table refers to the equivalent cell in the input ids table. If you want to know whether the token at row i, column j matters for prediction, check the attention mask table at row i, column j. Generally, the tokens marked 0 (ignore) are the [PAD] tokens.
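The relationship between the two tables can be sketched as follows. This is an illustration only: the tokenizer builds the mask while it pads rather than by inspecting ids afterwards, and `attention_mask_from_ids` is a made-up helper that assumes the [PAD] id is 0.

```python
def attention_mask_from_ids(input_ids, pad_id=0):
    """1 for real tokens, 0 for padding (hypothetical helper, assumes pad_id=0)."""
    return [0 if token_id == pad_id else 1 for token_id in input_ids]


row = [101, 2129, 2172, 102, 0, 0, 0]
mask = attention_mask_from_ids(row)
print(mask)

# The mask always has the same length as the row it describes.
print(len(mask) == len(row))
```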

In [13]:
## Example of an important token
print(f"Input id at row 0, column 5: {inputs['input_ids'][0][5]}")
print(f"Attention mask at row 0, column 5: {inputs['attention_mask'][0][5]}")
# The '1' we print out translates to 'Yes, this id IS important for making predictions'

## Example of unimportant token
print(f"Input id at row 0, column 380: {inputs['input_ids'][0][380]}")
print(f"Attention mask at row 0, column 380: {inputs['attention_mask'][0][380]}")

# The '0' we print out from the attention mask table translates to 'No, this id is NOT relevant for making predictions'


Input id at row 0, column 5: 11861
Attention mask at row 0, column 5: 1
Input id at row 0, column 380: 0
Attention mask at row 0, column 380: 0


Offset mapping is a matrix of tuples. Each tuple in the matrix maps back to a character token. The first value in the tuple is the starting position of that character token; the second value is the closing position (non-inclusive - if the last character of the token sits at position 4, then the second value of the tuple is 5). In this context "position" refers to the character's numerical order in the original text. The character "r" in the sentence "Here comes Sally" has a position of 2, since the first position (the "H" character) is 0.

If the above sentence were split into ["Here", "comes", "Sally"], then the accompanying offset mapping would be [(0, 4), (5, 10), (11, 16)] - the segment "Here" begins at position 0 and ends at position 4 (remember, <em>non-inclusive</em>), "comes" begins at 5 and ends at 10, and "Sally" begins at 11 and ends at 16. The spaces at positions 4 and 10 belong to no token.

Offset mapping is returned by the tokenizer when the 'return_offsets_mapping' argument is set to True - so why does our particular use case require it? We'll explain it through the context of our next data pre-processing step: extracting the answer tokens.

Each cell in the input_ids and offset_mapping table / matrix is mapped to a character token. The id at input_ids[i][j] and the offset mapping at offset_mapping[i][j] correspond to the <em>same</em> character token. It is important to understand that input_ids, attention_mask, and offset_mapping all have the same dimensions (4000 x 384) because each cell is referring to the same underlying character token.
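A small sketch may help make the (start, end) convention tangible. It splits a sentence on whitespace, which is far cruder than what a real tokenizer does, and records where each piece sits in the original string. The helper name `whitespace_offsets` and the sample sentence are invented for this example.

```python
def whitespace_offsets(text):
    """Return (start, end) character offsets for each whitespace-separated word.

    `end` is non-inclusive, so text[start:end] gives back the word.
    """
    offsets = []
    position = 0
    for word in text.split():
        start = text.index(word, position)  # search from where the last word ended
        end = start + len(word)
        offsets.append((start, end))
        position = end
    return offsets


sentence = "Yao Ming gave money"
offsets = whitespace_offsets(sentence)
print(offsets)

# Slicing the original text with each tuple recovers the token.
print([sentence[start:end] for start, end in offsets])
```

Because the end value is non-inclusive, each tuple can be used directly as a Python slice, and `end - start` is the length of the token.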

#### Final Preparations - Calculating the Answer tokens ####
To help train the model, we'll pass in each answer corresponding to a question-context pair. The training dataset provides the text and 'starting position' of the answers - the character position within the context where the answer word / phrase begins.

Text-based answers are nice for us humans, but the machine demands numbers. Instead of going through a long process of tokenizing the answers, let's extract them from the tokenized context sequence. We'll use a function from <a href="https://huggingface.co/docs/transformers/tasks/question_answering#preprocess">HuggingFace</a> to handle the logic.

We are given the location of the answer within the context, and we need to map the answer to its corresponding ids. How? With the help of the offset mapping matrix. This matrix records the starting and ending character positions (from the original sequence) of each character token. Run the code below to compare the offset_mapping and input_ids entries for the first context with its tokenized form.

In [14]:
example_context = squad['train']['context'][0]
example_question = squad['train']['question'][0]
tokenized_context = tokenizer.tokenize(example_context)

## Length of tokenized example question
example_question_tokenized_length = len(tokenizer.tokenize(example_question))

# Length of tokenized example context
example_context_tokenized_length = len(tokenizer.tokenize(example_context))

# Rows in the input ids matrix are question and context combined, with the question first. 
# Therefore, our context will start at the index equal to the length of the tokenized question plus 2. 
# We add two to account for the [CLS] and [SEP] special tokens. 
context_start_position_index = example_question_tokenized_length + 2
context_end_position_index = context_start_position_index + example_context_tokenized_length

# Offset mapping for example context
offset_mapping_context = inputs['offset_mapping'][0][context_start_position_index:context_end_position_index]
# input ids for context tokens
input_ids_context = inputs['input_ids'][0][context_start_position_index:context_end_position_index]


print('Offset mapping array for first context entry \n')
print(offset_mapping_context)
print('\nTokenized context. Note how each offset mapping tuple spans the length of the corresponding character token. \n')
print(tokenized_context)

print(f"\nThe first tuple in the offset mapping context array is {offset_mapping_context[0]}. This means the first token in the context\n")
print(f"starts at position {offset_mapping_context[0][0]} and spans to position {offset_mapping_context[0][1] - 1} for a length of {offset_mapping_context[0][1] - offset_mapping_context[0][0]} characters, because when indexing an array\n")
print(f"the ending value is non-inclusive. Notice that the first token, {tokenized_context[0]}, is the exact length of characters.\n")
print(f"You can check the other tokens with the same process.")

Offset mapping array for first context entry 

[(0, 2), (3, 6), (7, 9), (9, 10), (11, 14), (15, 23), (24, 26), (27, 32), (33, 40), (41, 47), (48, 52), (53, 55), (55, 56), (56, 57), (58, 65), (66, 70), (71, 72), (72, 85), (86, 88), (88, 89), (89, 90), (90, 91), (91, 92), (93, 100), (100, 101), (102, 105), (106, 110), (111, 118), (119, 121), (122, 125), (126, 133), (134, 140), (140, 141), (142, 149), (150, 157), (158, 164), (165, 168), (169, 173), (173, 174), (175, 178), (179, 181), (182, 185), (186, 193), (193, 194), (194, 195), (196, 200), (201, 208), (209, 215), (216, 221), (221, 222), (223, 227), (228, 229), (229, 232), (232, 233), (233, 236), (237, 240), (241, 242), (242, 244), (244, 245), (245, 248), (249, 251), (252, 255), (256, 259), (260, 265), (266, 273), (274, 276), (277, 282), (282, 283), (284, 287), (288, 299), (300, 303), (304, 308), (309, 318), (319, 320), (321, 326), (327, 329), (330, 331), (331, 333), (334, 341), (342, 344), (345, 354), (355, 357), (358, 361), (361, 362)

#### Tokens and Offset mapping ####
We know offset mapping provides the starting and ending position of tokens. We also know the training set gives us the starting position of the answer AND, through the answer text, its length. With these two pieces of information, we can identify the answer's numerical ids in any row of the input_ids matrix.
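Here is a minimal sketch of that lookup on a toy sentence, assuming the offsets are already known. The function `char_span_to_token_span` is a made-up helper and the sentence is invented; the idea is simply to find the token that contains the answer's first character and the token that contains its last character.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span [start_char, end_char) to inclusive token indices."""
    start_token = end_token = None
    for index, (start, end) in enumerate(offsets):
        if start <= start_char < end:
            start_token = index  # token holding the first answer character
        if start < end_char <= end:
            end_token = index    # token holding the last answer character
    return start_token, end_token


text = "Yao Ming gave money"
tokens = ["Yao", "Ming", "gave", "money"]
offsets = [(0, 3), (4, 8), (9, 13), (14, 19)]

answer = "gave money"
start_char = text.find(answer)
end_char = start_char + len(answer)

start_token, end_token = char_span_to_token_span(offsets, start_char, end_char)
# The end index is inclusive, so add 1 when slicing.
print(tokens[start_token:end_token + 1])
```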

In [15]:
answer_starting_position = squad['train']['answers'][0]['answer_start'][0]
num_char_in_answer = len(squad['train']['answers'][0]['text'][0])
answer_ending_position = answer_starting_position + num_char_in_answer
# Print the question and context
print(f'Question: {example_question}')
print(f'Context: {example_context}')
# Print the answer and starting position (within context). 
print(f"Answer: {squad['train']['answers'][0]['text'][0]}. Starting position: {answer_starting_position}")

# Print the number of characters in the answer
print(f"Number of characters in the answer: {num_char_in_answer}")

# Answer ending position (within context).
print(f"The answer can be found in the context, between character positions {answer_starting_position} and {answer_ending_position}")


Question: How much did Yao Ming donate?
Context: By May 14, the Ministry of Civil Affairs stated that 10.7 billion yuan (approximately US$1.5 billion) had been donated by the Chinese public. Houston Rockets center Yao Ming, one of the country's most popular sports icons, gave $214,000 and $71,000 to the Red Cross Society of China. The association has also collected a total of $26 million in donations so far. Other multinational firms located in China have also announced large amounts of donations.
Answer: $214,000 and $71,000. Starting position: 228
Number of characters in the answer: 20
The answer can be found in the context, between character positions 228 and 248


In [16]:
# Loop through offset mapping and find indexes that match the character start and end positions
start_index = 0
end_index = 0
for index, (start, end) in enumerate(offset_mapping_context):
    if start == answer_starting_position:
        start_index = index
    if end == answer_ending_position:
        # 1 added to account for array slicing being non-inclusive
        end_index = index + 1
        
# Remember, we need to adjust the offset mapping indices because we calculated it on the context alone, but the input id rows include the question AND context.
decoded_answer_from_input_ids = tokenizer.decode(inputs["input_ids"][0][start_index + context_start_position_index: end_index + context_start_position_index])
        
print(f"We can now extract the token answers from the context using the indices returned via looping through the first row of offset_mapping: {decoded_answer_from_input_ids}")


We can now extract the token answers from the context using the indices returned via looping through the first row of offset_mapping: $ 214, 000 and $ 71, 000


### Pre-processing the entire dataset ###

Tokenizing the question and context, converting it to numerical ids, deriving indexes of answers with offset mapping - all this work on a single example. We need a function that applies these processing steps to the entire dataset. We borrowed the following function from <a href="https://huggingface.co/docs/transformers/tasks/question_answering#preprocess">HuggingFace</a>. 

There is quite a bit of logic embedded in this single function. HuggingFace included comments to aid understanding, and I've added a few of my own to provide even more clarity. Before diving in, here are the high level processes the function executes.

 1. Tokenize the questions and contexts of the passed-in 'dataset' argument.
 2. Iterate over the rows of the returned offset_mapping matrix, one row per question-context pair.
 3. For each row, find where the <em>context</em> starts and ends, then use the offset tuples in that span to derive the start and end token positions of the answer.
 4. Append the start position index to the 'start_positions' array. Do the same for the end position index and the 'end_positions' array.
 5. Include 'start_positions' and 'end_positions' in the 'inputs' dict-like object, then return the object.
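Step 3 hinges on knowing which tokens in a row belong to the context. A sketch of that search on a hand-written list may make it easier to follow; the list below imitates the kind of markers the function works with (None for special tokens, 0 for question tokens, 1 for context tokens), and `context_bounds` is a made-up helper name.

```python
def context_bounds(sequence_ids):
    """Return the first and last index (inclusive) whose marker is 1."""
    idx = 0
    while sequence_ids[idx] != 1:
        idx += 1
    context_start = idx
    while idx < len(sequence_ids) and sequence_ids[idx] == 1:
        idx += 1
    return context_start, idx - 1


# [CLS] q q q [SEP] c c c c [SEP] [PAD] [PAD]
toy_sequence_ids = [None, 0, 0, 0, None, 1, 1, 1, 1, None, None, None]
print(context_bounds(toy_sequence_ids))
```

Only the offsets between these two indices are compared against the answer's character positions, which keeps question tokens from being mistaken for part of the answer.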

In [17]:
def preprocess_function(dataset):
    questions = [q.strip() for q in dataset["question"]]
    inputs = tokenizer(
        questions,
        dataset["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = dataset["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        # sequence_ids categorizes tokens as part of the 'question' or 'context'. Each row in the input_ids
        # matrix contains ids for the tokenized 'question' and 'context'. Ids alone can't determine
        # whether a token is part of the question or context. The sequence_ids function call returns
        # an array of 0, 1, and None - 0 marking a token / input id as 'question', 1 for 'context', and
        # None for special tokens like CLS, SEP, and PAD. 
        # sequence_ids variable will resemble [None, 0, 0, ..., 0, None, 1, 1, ..., 1, None, None, ...]
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        # Knowing that '1' marks a token as part of the 'context', we increment
        # the idx variable until we know the start and end position (or index) of the context tokens 
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        # The check below fires when the answer lies entirely before or after the context tokens,
        # which can happen when a long context has been truncated.
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            # Increment the index counter variable 'idx' until we find the
            # index where the answer begins by checking the first value
            # of each offset mapping.
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    # 'start_positions' and 'end_positions' are arrays added to the inputs object
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

Apply the function over the whole dataset with the <a href='https://huggingface.co/docs/datasets/v2.10.0/en/package_reference/main_classes#datasets.Dataset.map'>'map' function</a>.

Returning the dictionary-like 'inputs' variable in the mapping function adds the 'inputs' key-value pairs to the 'squad' dataset object. Since the function is called on a <a href='https://huggingface.co/docs/datasets/v2.10.0/en/package_reference/main_classes#datasets.DatasetDict' > DatasetDict object </a> (squad became a DatasetDict object when we split it into 'test' and 'train' sets), HuggingFace knows to apply the function to *both* 'test' and 'train' keys in the 'squad' DatasetDict object. 

Both 'test' and 'train' will be passed as the argument to the 'dataset' parameter - the processing function will run independently on each. Because we pass batched=True, the function receives the rows in batches rather than one at a time.

The 'remove_columns' parameter tells map to drop columns from our dataset. By passing in the column names of the original dataset (id, title, context, question, answers) we remove fields that the model won't be able to process. That leaves us with a clean table of <em>only</em> model inputs.

In [18]:
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

100%|█████████████████████████████████████████████| 4/4 [00:01<00:00,  3.32ba/s]
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  3.43ba/s]


Check the function's results by decoding the tokens between the start and end positions of the first answer. The end position is the index of the last answer token, so we add 1 when slicing to include it. The decoded value derived from the 'input_ids' data table should match the 'text' from the original squad dataset 'answers' column, apart from the spaces the tokenizer inserts around punctuation when decoding.

In [20]:
### Run function with 'map', then observe first row.
answer_index = 0

answer_start_position = tokenized_squad['train']['start_positions'][answer_index]
answer_end_position = tokenized_squad['train']['end_positions'][answer_index]

# 'end_positions' stores the index of the last answer token, so add 1 to include it in the slice
decoded_answer = tokenizer.decode(tokenized_squad['train']['input_ids'][answer_index][answer_start_position: answer_end_position + 1])
squad_dataset_answer = squad['train']['answers'][answer_index]['text']

print(decoded_answer)
print(squad_dataset_answer)




$ 214, 000 and $ 71, 000
['$214,000 and $71,000']


#### Data Collator ####
Not to be confused with a collector, the data collator handles the final formatting changes before the inputs are passed into the model. Its primary responsibility is to batch inputs together. Similar to the tokenizer, the data collator receives inputs and re-formats the data into a structure more favorable to the model. Some data collators also pad sequences to a common length; the 'DefaultDataCollator' used here does not, because the tokenizer already padded every sequence to 384.

Batching refers to the "bundling" of model inputs together before passing them to the model. With batching, the model can process the inputs of a batch in parallel and adjust its parameters once per batch. Batching also reduces how much memory training needs - instead of holding the entire dataset in memory, the model holds only the current batch. This leaves memory open for the processor to use while it works through the inputs.
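As a rough picture of what "bundling" means, the sketch below walks over a list in fixed-size chunks. It is not how the data collator or the Trainer is implemented; `iter_batches` is a made-up helper, and plain integers stand in for tokenized rows.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of `rows`, each holding at most `batch_size` items."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


rows = list(range(10))
batches = list(iter_batches(rows, batch_size=4))
print(batches)
```

Because the function is a generator, only one batch needs to exist at a time while looping, and the last batch is simply smaller when the row count is not a multiple of the batch size.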

In [21]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

### At Long Last - Training ###

Finally, after enough pre-processing to make our heads spin, we are ready to train our model. Let's import our pre-trained model, instantiate a 'training_args' and a 'trainer' instance, then call the 'train' function.

Quickly, here is the explanation for each argument passed to the <a href="https://huggingface.co/docs/transformers/v4.27.1/en/main_classes/trainer#transformers.TrainingArguments"> TrainingArguments class</a> (once again shamelessly stealing the configuration from <a href="https://huggingface.co/docs/transformers/tasks/question_answering#train"> HuggingFace's tutorial </a>). 

<strong>output_dir</strong>: The only required argument, this is where the HuggingFace Trainer class saves the trained model. 

<strong>evaluation_strategy</strong>: When the model should "evaluate" itself. Specifics of model evaluation are out of scope, but broadly each example comes with a numerical representation of the correct output (the label, supplied by a human), and the parameters are tweaked to get the model's output as close (numerically) to it as possible. By measuring the difference between the model's output and the correct output on data held out from training, we can evaluate the model's ability. Different metrics exist for measuring this; the simplest is the value of the <a href="https://en.wikipedia.org/wiki/Loss_function">loss function</a> on the evaluation set, which the Trainer reports as "Validation Loss".

We set this value equal to 'epoch', letting the model know it should evaluate itself every epoch. An 'epoch' is one full processing of the <em>entire</em> dataset. Theoretically, as the model continues to iterate over the training set, its outputs should improve, tweaking its parameters to get closer and closer to the desired outcome. 
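
As a rough illustration of "measuring the difference between model and correct output", here is a sketch of cross-entropy loss over token positions. To my understanding this is the kind of loss the question answering head uses for the start and end positions, but treat that as an assumption; the scores below are invented.

```python
import math
from typing import List


def cross_entropy(logits: List[float], target_index: int) -> float:
    """Negative log of the softmax probability given to the correct index."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    prob_correct = exps[target_index] / sum(exps)
    return -math.log(prob_correct)


# Hypothetical scores for "the answer starts at token i" over a 4-token input.
true_start = 1
confident_and_right = [0.1, 5.0, 0.2, 0.1]
unsure = [1.0, 1.0, 1.0, 1.0]

loss_right = cross_entropy(confident_and_right, true_start)
loss_unsure = cross_entropy(unsure, true_start)
print(loss_right, loss_unsure)
```

The closer the model's scores are to the labeled position, the smaller the loss - which is why a falling validation loss suggests the model is improving.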

<strong>learning_rate</strong>: Learning rate influences the model's step size. Step size measures how far a model moves its parameters on each update during training. A higher learning rate will result in a larger step size, increasing the magnitude of each parameter change. The learning rate is exactly what it sounds like - the speed at which a model "learns". In model terms, "learning" describes the process of adjusting the parameters in order to get closer to the optimum. A higher learning rate can decrease the time to reach a good solution, but if it is too high the updates overshoot and the loss bounces around or grows instead of shrinking; if it is too low, training is slow. Separately from step size, keep in mind the risk of "overfitting" - the phenomenon where the model parameters are overly optimized for the training data, therefore performing poorly on new data - which the weight_decay argument below is meant to counter.
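
A toy sketch of how the learning rate scales each step, using plain gradient descent on f(w) = w**2. This is a simplification chosen for illustration (the Trainer uses a more elaborate optimizer), and the learning rates are exaggerated so the effect is visible in 20 steps.

```python
def gradient_descent(learning_rate: float, steps: int = 20, start: float = 10.0) -> float:
    """Minimize f(w) = w**2 (whose gradient is 2*w) and return the final w."""
    w = start
    for _ in range(steps):
        grad = 2 * w
        w = w - learning_rate * grad  # step size = learning_rate * gradient
    return w


too_small = gradient_descent(0.01)  # moves toward 0, but slowly
reasonable = gradient_descent(0.3)  # ends very close to the minimum at 0
too_large = gradient_descent(1.1)   # each step overshoots; w grows in magnitude

print(too_small, reasonable, too_large)
```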

<strong>per_device_train_batch_size, per_device_eval_batch_size</strong>: Both values correspond to our "batch" size, or how many example inputs we pass to the model at one time. 

<strong>num_train_epochs</strong>: Number of epoch cycles to perform. Remember, an epoch corresponds to training the model on the entire dataset. 

<strong>weight_decay</strong>: Weight decay is a "penalty" on large parameter values, used to reduce overfitting. A model adjusts its parameters based on the output of a loss function - weight decay adds a small pull toward zero on top of that adjustment, so a parameter only stays large if the data keeps pushing it there. A higher weight_decay value generally lowers the chance of overfitting. Of course, there is an inherent trade off to this approach - too much weight decay and the model risks being undertrained.
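
A minimal sketch of the idea, assuming plain gradient descent with the decay applied directly to the weight. The Trainer's default optimizer is more involved, so this only shows the shrinking effect; 'sgd_step' and the numbers are hypothetical.

```python
def sgd_step(weight: float, grad: float, learning_rate: float, weight_decay: float) -> float:
    """One update: the usual gradient step plus a small pull toward zero."""
    return weight - learning_rate * grad - learning_rate * weight_decay * weight


# With a zero gradient (the data is not pushing the weight anywhere),
# only weight decay changes the weight.
no_decay = 1.0
with_decay = 1.0
for _ in range(100):
    no_decay = sgd_step(no_decay, grad=0.0, learning_rate=0.1, weight_decay=0.0)
    with_decay = sgd_step(with_decay, grad=0.0, learning_rate=0.1, weight_decay=0.01)

print(no_decay, with_decay)
```

Without decay the weight stays where it is; with decay it drifts toward zero unless the gradient keeps pulling it back.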

In [22]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")


training_args = TrainingArguments(
    output_dir="my_awesome_qa_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)



Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this mode

Finally, training begins. Grab a coffee - you might wait awhile for the training to finish.

Why, you may be asking if you've read this far, did we go through all this trouble to train a <em>pretrained</em> model? The pretrained model we are using - <a href="https://huggingface.co/distilbert-base-uncased#training-data">distilbert-base-uncased</a> - was already trained on over <a href="https://huggingface.co/distilbert-base-uncased#training-data">11,000 unpublished books AND Wikipedia</a>. What difference will it make to train on an additional 4000 examples? Here are the major reasons:

 - The pre-trained model was built on raw, unlabeled text data. Labeling examples with the correct output, as we've done with the 'start_position' and 'end_position' arrays, helps the model fine-tune its parameters for the task.
 - NLP models can solve a variety of problems - sentiment analysis, sequence prediction, question answering, etc. Pretrained models are general - they are not designed to solve a specific problem. We want to build a solution for question answering, and the out of the box model cannot do that yet: as the warning above shows, its question answering output layer ('qa_outputs') is newly initialized, so its answers are not useful until it is trained on question / answer inputs.
 - Businesses have unique jargon and context that a general model will not understand. If a business wants to implement an NLP solution into an existing product, they'll want to train the model on industry contextualized examples. For example, an online commerce store developing a chat bot would want to train their model on examples of common questions from real customers.

Pretrained models are valuable since they've already learned the general rules of the language (via raw text), which saves significant time and energy in the training process. Admittedly, 4000 examples processed 3 times is a small training run and will not produce a polished model - but the reasons for training a pre-trained model remain.

In [23]:
# Train your model!
trainer.train()

***** Running training *****
  Num examples = 4000
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 750
  Number of trainable parameters = 66364418


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 2.134887 |
| 2 | 2.536200 | 1.686889 |
| 3 | 2.536200 | 1.649944 |


***** Running Evaluation *****
  Num examples = 1000
  Batch size = 16
Saving model checkpoint to my_awesome_qa_model/checkpoint-500
Configuration saved in my_awesome_qa_model/checkpoint-500/config.json
Model weights saved in my_awesome_qa_model/checkpoint-500/pytorch_model.bin
tokenizer config file saved in my_awesome_qa_model/checkpoint-500/tokenizer_config.json
Special tokens file saved in my_awesome_qa_model/checkpoint-500/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 16
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=750, training_loss=2.1230705159505208, metrics={'train_runtime': 6989.4713, 'train_samples_per_second': 1.717, 'train_steps_per_second': 0.107, 'total_flos': 1175877900288000.0, 'train_loss': 2.1230705159505208, 'epoch': 3.0})
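
The "Total optimization steps = 750" line in the log follows from the dataset size, batch size, and epoch count. The arithmetic below is a simplified sketch that assumes a single device, which matches the "Total train batch size" of 16 reported above.

```python
import math

num_train_examples = 4000          # "Num examples" in the training log
per_device_train_batch_size = 16
num_devices = 1                    # assumption: training on a single device
gradient_accumulation_steps = 1    # "Gradient Accumulation steps" in the log
num_train_epochs = 3

# Examples consumed per parameter update.
total_batch_size = per_device_train_batch_size * num_devices * gradient_accumulation_steps
# One optimization step per batch; a partial final batch still counts as a step.
steps_per_epoch = math.ceil(num_train_examples / total_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, total_steps)  # 250 750
```

This may also explain two details of the output, assuming the default logging and saving interval of 500 steps was in effect: the first epoch ends at step 250, before any training loss has been logged ("No log"), and the only checkpoint is 'checkpoint-500'.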

### Next Steps ###
Congratulations! You've made it through this extensive breakdown of a pre-existing HuggingFace tutorial. We've introduced the tokenizing process, explained subject vocabulary like 'epoch' and 'offset mapping', and touched briefly on why we go through all this trouble to train a pretrained model in the first place. I encourage anyone who enjoyed the article to begin exploring model evaluation. We glossed over it in this article, but an understanding of different evaluation functions and techniques is crucial to building successful (and professional) models. Happy coding.