# **1. Set up the environment and install the necessary packages**
This notebook demonstrates a question-answering (QA) NLP model built with the Hugging Face Transformers library. We take the pre-trained model 'distilbert-base-uncased' and fine-tune it on the SQuAD dataset. Fine-tuning is illustrated with both **PyTorch** and **TensorFlow**.

---
The following cell installs the necessary packages.

In [2]:
!pip install transformers datasets evaluate accelerate
!pip install torch tensorflow

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.2/7.2 MB 48.4 MB/s eta 0:00:00
Collecting datasets
  Downloading datasets-2.13.1-py3-none-any.whl (486 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 486.2/486.2 kB 39.9 MB/s eta 0:00:00
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81.4/81.4 kB 10.7 MB/s eta 0:00:00
Collecting accelerate
  Downloading accelerate-0.20.3-py3-none-any.whl (227 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 227.6/227.6 kB 26.9 MB/s eta 0:00:00
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_h
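
The install log records the package versions this notebook ran with (for example transformers 4.30.2 and datasets 2.13.1). To compare them with your own environment, here is a minimal sketch that uses only the standard library. The helper name `installed_versions` is made up for this illustration and is not part of any of the installed packages.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in this environment
            found[name] = None
    return found

# Packages installed by the cell above
print(installed_versions(["transformers", "datasets", "evaluate", "accelerate"]))
```

A value of `None` means the package still needs to be installed.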

In [3]:
# Connect the notebook to Hugging Face using an access token generated from your Hugging Face profile
from huggingface_hub import notebook_login
notebook_login()


# 2. Load and prepare the dataset
Load the SQuAD dataset. The example below shows one record, which contains a title, a context, a question, and an answer.

In [4]:
from datasets import load_dataset

# Use load_dataset() to load the first 5,000 training examples, then split them 80/20 into train and test sets. Note that the answer_start position is also given.
squad = load_dataset("squad", split="train[:5000]").train_test_split(test_size=0.2)
squad["train"][0]

Downloading and preparing dataset squad/plain_text to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453...


Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

Dataset squad downloaded and prepared to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453. Subsequent calls will reuse this data.


{'id': '56d3711659d6e414001463d2',
 'title': 'Frédéric_Chopin',
 'context': 'In April, during the Revolution of 1848 in Paris, he left for London, where he performed at several concerts and at numerous receptions in great houses. This tour was suggested to him by his Scottish pupil Jane Stirling and her elder sister. Stirling also made all the logistical arrangements and provided much of the necessary funding.',
 'question': "What was Jane Stirling's national heritage?",
 'answers': {'text': ['Scottish'], 'answer_start': [191]}}
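
The `answer_start` value is a character offset into `context`, and the preprocessing step later relies on this. A short check on the record shown above (copied by hand here so the snippet runs without downloading the dataset):

```python
# Record copied from the example output above
example = {
    "context": (
        "In April, during the Revolution of 1848 in Paris, he left for London, "
        "where he performed at several concerts and at numerous receptions in "
        "great houses. This tour was suggested to him by his Scottish pupil "
        "Jane Stirling and her elder sister. Stirling also made all the "
        "logistical arrangements and provided much of the necessary funding."
    ),
    "answers": {"text": ["Scottish"], "answer_start": [191]},
}

answer_text = example["answers"]["text"][0]
start_char = example["answers"]["answer_start"][0]
end_char = start_char + len(answer_text)  # exclusive end of the answer span

# Slicing the context with the character offsets recovers the answer text
extracted = example["context"][start_char:end_char]
print(extracted)
```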

# 3. Preprocess the dataset


In [5]:
# The next step is to load a DistilBERT tokenizer to process the question and context fields:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Before training the model, we need to preprocess the dataset. This means tokenizing each question and context pair, converting the character-level start and end positions of the answer into token positions within the tokenized context, and producing a tokenized_squad dataset that can be used for training. The preprocess_function() below follows the version provided by Hugging Face.

In [6]:
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
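
To see what the two `while` loops over `offset` compute, here is a toy sketch of the same search on a single sentence. It assumes a stand-in tokenizer (a regular expression that splits words and punctuation) instead of the DistilBERT tokenizer, and the helper name `find_token_span` is made up for this illustration.

```python
import re

def find_token_span(offsets, start_char, end_char):
    """Return (start_token, end_token) covering the character span [start_char, end_char)."""
    # Walk forward past every token that starts at or before the answer start
    idx = 0
    while idx < len(offsets) and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward past every token that ends at or after the answer end
    idx = len(offsets) - 1
    while idx >= 0 and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token

context = "My name is Thor and I live in Asgard."
answer = "Asgard"

# Stand-in tokenizer (assumption): words and punctuation marks with character offsets
offsets = [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", context)]
tokens = [context[a:b] for a, b in offsets]

start_char = context.index(answer)
end_char = start_char + len(answer)
start_token, end_token = find_token_span(offsets, start_char, end_char)
print(tokens[start_token:end_token + 1])
```

The real function does the same thing per example, but restricts the search to the tokens whose `sequence_ids` value is 1, that is, the context part of the input.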

In [7]:
# Pass preprocess_function to the dataset's map() method to process the whole dataset in batches.
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

# 4. Fine-tuning the QA model using PyTorch
1. DefaultDataCollator is used to create batches of examples.
2. AutoModelForQuestionAnswering.from_pretrained() loads the pre-trained model, which is stored in the model variable.
3. TrainingArguments() defines the training hyperparameters; 'output_dir' sets the name of the model.
4. Finally, all of these objects are passed to Trainer().
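
The question-answering head scores every token as a possible answer start and as a possible answer end. A rough sketch of the resulting training loss, assuming the usual formulation in which the loss is the mean of a cross-entropy term for the start position and one for the end position. The logits below are made up for illustration, and `qa_loss` is a hypothetical helper, not a Transformers function.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-likelihood of target_index under softmax(logits)."""
    # Subtract the maximum for numerical stability before exponentiating
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]

def qa_loss(start_logits, end_logits, start_position, end_position):
    """Average of the start-position and end-position cross-entropies."""
    start_loss = cross_entropy(start_logits, start_position)
    end_loss = cross_entropy(end_logits, end_position)
    return (start_loss + end_loss) / 2

# Made-up scores for a 5-token input whose answer spans tokens 2..3
start_logits = [0.1, 0.2, 3.0, 0.3, 0.1]
end_logits = [0.1, 0.1, 0.4, 2.5, 0.2]

loss_correct = qa_loss(start_logits, end_logits, 2, 3)  # labels agree with the scores
loss_wrong = qa_loss(start_logits, end_logits, 0, 4)    # labels disagree with the scores
print(loss_correct, loss_wrong)
```

The loss is lower when the labelled start and end tokens are the ones the model scores highest, which is what fine-tuning pushes toward.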

Hugging Face also supports accelerated training to speed up training in notebooks. All we need to do is wrap the training code in a function and pass that function to notebook_launcher.


In [9]:
#pytorch
from transformers import DefaultDataCollator, AutoModelForQuestionAnswering, TrainingArguments, Trainer
import numpy as np
import evaluate

data_collator = DefaultDataCollator()

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

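# Load the metric before defining the function that uses it
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # For QA, logits and labels each hold a (start, end) pair of arrays
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Flatten so that start and end token positions are scored together.
    # This is token-position accuracy, not the SQuAD exact-match/F1 metric.
    return metric.compute(predictions=np.ravel(predictions), references=np.ravel(labels))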

def training_function():
  training_args = TrainingArguments(
      output_dir="my_qa_model_pytorch",
      evaluation_strategy="epoch",
      learning_rate=2e-5,
      per_device_train_batch_size=16,
      per_device_eval_batch_size=16,
      num_train_epochs=8,
      weight_decay=0.01,
      push_to_hub=True
  )

  trainer = Trainer(
      model=model,
      args=training_args,
      train_dataset=tokenized_squad["train"],
      eval_dataset=tokenized_squad["test"],
      tokenizer=tokenizer,
      data_collator=data_collator,
#      compute_metrics=compute_metrics
  )
  trainer.train()
  trainer.push_to_hub()

from accelerate import notebook_launcher
notebook_launcher(training_function)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to

Launching training on one GPU.


/content/my_qa_model_pytorch is already a clone of https://huggingface.co/parasgopani94/my_qa_model_pytorch. Make sure you pull the latest changes with `repo.git_pull()`.


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 2.147987 |
| 2 | 2.638700 | 1.513668 |
| 3 | 2.638700 | 1.428584 |
| 4 | 1.023900 | 1.54206 |
| 5 | 1.023900 | 1.553075 |
| 6 | 0.581600 | 1.63967 |
| 7 | 0.581600 | 1.710622 |
| 8 | 0.402400 | 1.758741 |
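
The training-loss column shows "No log" for epoch 1 and then repeats each value for two epochs. A likely explanation, assuming that Trainer logs the training loss every 500 optimizer steps (its default, which the cell above does not override) and that training ran on a single GPU: with 4,000 training examples and a batch size of 16, one epoch is only 250 steps. The arithmetic can be sketched as:

```python
import math

train_examples = 4000   # size of tokenized_squad["train"], from the Map output
batch_size = 16         # per_device_train_batch_size, single GPU
num_epochs = 8
logging_steps = 500     # assumed Trainer default, not set in the notebook

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Epochs at whose end a fresh training-loss value has just been logged
epochs_with_new_log = [
    step // steps_per_epoch
    for step in range(logging_steps, total_steps + 1, logging_steps)
]
print(steps_per_epoch, total_steps, epochs_with_new_log)
```

Under these assumptions a new training-loss value appears at the end of every second epoch, which matches the pattern in the log.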


Upload file runs/Jun26_21-21-15_1954e8c20e14/events.out.tfevents.1687814478.1954e8c20e14.586.1: 100%|#########…

To https://huggingface.co/parasgopani94/my_qa_model_pytorch
   f6b014c..fd36a6d  main -> main

To https://huggingface.co/parasgopani94/my_qa_model_pytorch
   fd36a6d..005011f  main -> main



# 5. Fine-tuning the QA model using TensorFlow

Most of the steps are the same as for PyTorch, with a few changes:

1. DefaultDataCollator must be created with return_tensors="tf".
2. TFAutoModelForQuestionAnswering.from_pretrained() loads the pre-trained model, which is stored in the model_tf variable.
3. Hyperparameters are defined separately, and the optimizer and learning-rate schedule are created with create_optimizer().
4. Batched tf.data datasets are created by passing the data collator and the tokenized dataset to prepare_tf_dataset().
5. Finally, to fine-tune the model, call fit() with the training and validation datasets, the number of epochs, and a callback. The callback pushes the model to the Hub.
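
With `num_warmup_steps=0`, the schedule returned by create_optimizer() is, to the best of my understanding, a linear decay from `init_lr` down to zero over `num_train_steps`. The pure-Python sketch below only illustrates that shape under this assumption. `linear_decay_lr` is a hypothetical helper, not the library's implementation.

```python
def linear_decay_lr(step, init_lr=2e-5, total_steps=2000):
    """Learning rate after `step` optimizer steps under a linear decay to zero."""
    # Clamp so the rate stays at zero once training is past total_steps
    progress = min(step, total_steps) / total_steps
    return init_lr * (1.0 - progress)

# Same step count as total_train_steps in the notebook: (4000 // 16) * 8
total_train_steps = (4000 // 16) * 8

for step in (0, 500, 1000, 2000):
    print(step, linear_decay_lr(step, total_steps=total_train_steps))
```

This is why `total_train_steps` has to be computed before the optimizer is created: the schedule needs to know where the decay ends.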



In [11]:
#tensorflow
from transformers import create_optimizer, DefaultDataCollator, TFAutoModelForQuestionAnswering
from transformers.keras_callbacks import PushToHubCallback
import tensorflow as tf

data_collator = DefaultDataCollator(return_tensors="tf")

batch_size = 16
num_epochs = 8
total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=0,
    num_train_steps=total_train_steps,
)

model_tf = TFAutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

def training_function():
  tf_train_set = model_tf.prepare_tf_dataset(
      tokenized_squad["train"],
      shuffle=True,
      batch_size=batch_size,
      collate_fn=data_collator,
  )

  tf_validation_set = model_tf.prepare_tf_dataset(
      tokenized_squad["test"],
      shuffle=False,
      batch_size=batch_size,
      collate_fn=data_collator,
  )

  #Configure model for training using compile function
  model_tf.compile(optimizer=optimizer)

  #A callback is specified to allow model to be pushed to huggingface hub
  callback = PushToHubCallback(
      output_dir="my_qa_model_tf",
      tokenizer=tokenizer,
  )

  #Finally, to fine-tune your model call fit with training and validation datasets, the number of epochs, and callback.
  model_tf.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=num_epochs, callbacks=[callback])

from accelerate import notebook_launcher
notebook_launcher(training_function)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForQuestionAnswering: ['vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_transform.weight']
- This IS expected if you are initializing TFDistilBertForQuestionAnswering from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForQuestionAnswering from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForQuestionAnswering were not initialized from the PyTorch model and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it

Launching training on one GPU.


Cloning https://huggingface.co/parasgopani94/my_qa_model_tf into local empty directory.


Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


# 6. Using the fine-tuned model


In [12]:
# Inference
question = "Where do I live??"
context = "My name is Thor and I live in Asgard."

from transformers import pipeline

question_answerer = pipeline("question-answering", model="my_qa_model_tf")
question_answerer(question=question, context=context)

Some layers from the model checkpoint at my_qa_model_tf were not used when initializing TFDistilBertForQuestionAnswering: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForQuestionAnswering were not initialized from the model checkpoint at my_qa_model_tf and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'score': 0.5654642581939697, 'start': 30, 'end': 36, 'answer': 'Asgard'}
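
The `start` and `end` values in the result are character offsets into `context`, with `end` exclusive. A quick check using the output above (copied by hand so the snippet runs without loading the model):

```python
context = "My name is Thor and I live in Asgard."

# Result copied from the pipeline output above
result = {'score': 0.5654642581939697, 'start': 30, 'end': 36, 'answer': 'Asgard'}

# Slicing the context with the returned offsets recovers the answer string
span = context[result["start"]:result["end"]]
print(span)
```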