### Assignment: Fine-Tuning a Question Answering Model


#### Introduction

- In this assignment, you will learn about fine-tuning a pre-trained question answering model 
using the SQuAD (Stanford Question Answering Dataset) dataset. 
- You will use the Hugging Face Transformers library to load the dataset, tokenize the data, train the model, and perform inference tasks.

***Source***: “How to Fine-Tune a Model for Common Downstream Tasks.” Accessed April 12, 2024. https://huggingface.co/docs/transformers/v4.15.0/en/custom_datasets.


***Additional Reference*** 
These references help with understand how fine tune process 
- SQuAD: Stanford Question Answering Dataset Explorer. Accessed April 12, 2024. https://rajpurkar.github.io/SQuAD-explorer/
- Kalyanmaram. "My Awesome QA Model." Hugging Face Model Hub. Accessed April 12, 2024. https://huggingface.co/kalyanmaram/my_awesome_qa_model/.
- “Fine-Tuning a Model with the Trainer API - Hugging Face NLP Course.” Accessed April 12, 2024. https://huggingface.co/learn/nlp-course/en/chapter3/3.

#### Instructions

- Set runtime in google colab to T4 GPU
- Run each cell
- Answer the following questions present after specific cells

#### Install packages

In [1]:
!pip install datasets accelerate transformers

Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m510.5/510.5 kB[0m [31m5.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting accelerate
  Downloading accelerate-0.29.1-py3-none-any.whl (297 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m297.3/297.3 kB[0m [31m9.5 MB/s[0m eta [36m0:00:00[0m
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m9.4 MB/s[0m eta [36m0:00:00[0m
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m6.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
[2K     [90m━

In [20]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

#### Import all libraries that are required



In [21]:
from transformers import AutoTokenizer
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
from transformers import DefaultDataCollator
from transformers import pipeline
from datasets import load_dataset


#### Using load_dataset method we load the dataset

In [22]:
squad = load_dataset("squad", split="train[:5000]")
squad = squad.train_test_split(test_size=0.2)

##### Questions
1. What function do we use to load the dataset?
2. How do we split the dataset into training and testing subsets?
3. Why is it important to split the dataset into training and testing subsets?
4. Explain the significance of the `split` parameter in the `load_dataset` function.
5. Can you name any other datasets that can be loaded using the `load_dataset` function?


#### Initialize model and tokenizer

In [23]:
model = AutoModelForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Tokenization Process

In [24]:
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

##### Questions
1. What is tokenization, and why is it important in NLP tasks?
2. How are the start and end positions of the answers determined during tokenization?
3. What parameters are passed to the tokenizer during tokenization, and why are they necessary?
4. Explain the purpose of the `return_offsets_mapping` parameter in the tokenizer.
5. Can you describe the role of tokenizers in handling out-of-vocabulary (OOV) words during tokenization?


#### Training the model

In [25]:
training_args = TrainingArguments(
    output_dir="my_awesome_qa_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)
data_collator = DefaultDataCollator()
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Epoch,Training Loss,Validation Loss
1,No log,2.646235
2,2.844800,1.940044
3,2.844800,1.830352


TrainOutput(global_step=750, training_loss=2.412160888671875, metrics={'train_runtime': 473.4979, 'train_samples_per_second': 25.343, 'train_steps_per_second': 1.584, 'total_flos': 1175877900288000.0, 'train_loss': 2.412160888671875, 'epoch': 3.0})

##### Questions
1. Why is the output directory specified in the training arguments?
2. What is the purpose of the evaluation strategy parameter?
3. Why is it necessary to define the number of training epochs?
4. Explain the significance of the `per_device_train_batch_size` parameter in training.
5. Can you describe any regularization techniques used during model training, and why are they important?


#### Push model and tokenizer to hugging face

In [30]:
trainer.push_to_hub()
model.push_to_hub("kalyanmaram/my_awesome_qa_model")
tokenizer.push_to_hub("kalyanmaram/my_awesome_qa_model")

README.md:   0%|          | 0.00/1.44k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/1.44k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/kalyanmaram/my_awesome_qa_model/commit/96452fd4dbd9916563c5c1ec67e18273dd9631cf', commit_message='Upload tokenizer', commit_description='', oid='96452fd4dbd9916563c5c1ec67e18273dd9631cf', pr_url=None, pr_revision=None, pr_num=None)

#### Inference 1

In [34]:

question = "How many programming languages does BLOOM support?"
context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."
question_answerer = pipeline("question-answering", model="kalyanmaram/my_awesome_qa_model")
question_answerer(question=question, context=context)


{'score': 0.16517725586891174,
 'start': 10,
 'end': 95,
 'answer': '176 billion parameters and can generate text in 46 languages natural languages and 13'}

##### Questions
1. What question is asked in the first inference example?
2. What is the output of the question answering pipeline for the first inference?
3. How is the score calculated for the predicted answer?
4. Can you explain the process of post-processing the predicted answer?
5. What are some potential challenges in performing question answering tasks, and how can they be addressed?


#### Inference 2

In [35]:
question = "How many programming languages does BLOOM support?"
context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."
question_answerer2 = pipeline("question-answering", model="distilbert/distilbert-base-uncased")
question_answerer2(question=question, context=context)

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'score': 0.004547498654574156,
 'start': 58,
 'end': 118,
 'answer': '46 languages natural languages and 13 programming languages.'}

##### Questions
1. How does the second inference example differ from the first one?
2. Why does the warning message appear during inference, and how can it be addressed?
3. What is the predicted answer for the second inference, and how does it compare to the first inference?
4. Can you discuss any limitations or drawbacks of the question answering model, based on the inference results?
5. How would you approach fine-tuning the model further to improve its performance?
