# Fine-tuning DistilBERT on SQuAD v2

This notebook follows the Hugging Face example at https://huggingface.co/transformers/custom_datasets.html#qa-squad

In [1]:
!pip install transformers



Start by downloading the data:

In [6]:
!mkdir squad
!python -m wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json -o squad/train-v2.0.json
!python -m wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json -o squad/dev-v2.0.json

A subdirectory or file squad already exists.



Saved under squad/train-v2.0.json

Saved under squad/dev-v2.0.json


Each split is a structured JSON file with a number of questions and answers for each passage (context). Let's take this apart into parallel lists of contexts, questions, and answers (contexts are repeated because there are multiple questions per context).

In [1]:
import json
from pathlib import Path

def read_squad(path):
    path = Path(path)
    with open(path, 'rb') as f:
        squad_dict = json.load(f)

    contexts = []
    questions = []
    answers = []
    for group in squad_dict['data']:
        for passage in group['paragraphs']:
            context = passage['context']
            for qa in passage['qas']:
                question = qa['question']
                for answer in qa['answers']:
                    contexts.append(context)
                    questions.append(question)
                    answers.append(answer)

    return contexts, questions, answers

train_contexts, train_questions, train_answers = read_squad('squad/train-v2.0.json')
val_contexts, val_questions, val_answers = read_squad('squad/dev-v2.0.json')

In [2]:
train_contexts[0]

'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'

In [3]:
train_questions[0]

'When did Beyonce start becoming popular?'

In [4]:
train_answers[0]

{'text': 'in the late 1990s', 'answer_start': 269}
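To make the nesting concrete, here is a minimal sketch on a hand-made dict that mimics the SQuAD layout. The `toy` data and the `flatten` helper are made up for illustration; they are not part of the dataset or of the original example. SQuAD v2 also contains unanswerable questions whose `answers` list is empty, so a loop over `qa['answers']` like the one in `read_squad` produces no row for them.

```python
# Hand-made data in the SQuAD v2 layout (hypothetical content, for illustration only).
toy = {
    "data": [{
        "paragraphs": [{
            "context": "Paris is the capital of France.",
            "qas": [
                {"question": "What is the capital of France?",
                 "is_impossible": False,
                 "answers": [{"text": "Paris", "answer_start": 0}]},
                {"question": "What is the capital of Atlantis?",
                 "is_impossible": True,
                 "answers": []},
            ],
        }],
    }],
}

def flatten(squad_dict):
    """Same traversal as read_squad, but on an in-memory dict."""
    contexts, questions, answers = [], [], []
    for group in squad_dict["data"]:
        for passage in group["paragraphs"]:
            for qa in passage["qas"]:
                for answer in qa["answers"]:
                    contexts.append(passage["context"])
                    questions.append(qa["question"])
                    answers.append(answer)
    return contexts, questions, answers

contexts, questions, answers = flatten(toy)
print(len(questions))   # 1: the unanswerable question contributes no row
print(answers[0])
```

If unanswerable questions should be part of training, the flattening step would need to handle the empty `answers` case explicitly; this notebook does not do that.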

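Truncate the training and validation sets (to 10,000 and 2,000 examples) to keep the run small. Note that the model is then trained on only a fraction of SQuAD.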

In [5]:
num_train = 10000
num_val = 2000

train_contexts = train_contexts[:num_train]
train_questions = train_questions[:num_train]
train_answers = train_answers[:num_train]

val_contexts = val_contexts[:num_val]
val_questions = val_questions[:num_val]
val_answers = val_answers[:num_val]

print(f'num train_contexts: {len(train_contexts)}')
print(f'num train_questions: {len(train_questions)}')
print(f'num train_answers: {len(train_answers)}')

print(f'\nnum val_contexts: {len(val_contexts)}')
print(f'num val_questions: {len(val_questions)}')
print(f'num val_answers: {len(val_answers)}')

num train_contexts: 10000
num train_questions: 10000
num train_answers: 10000

num val_contexts: 2000
num val_questions: 2000
num val_answers: 2000


The contexts and questions are just strings. The answers are dicts containing the subsequence of the passage with the correct answer as well as an integer indicating the character at which the answer begins. In order to train a model on this data we need (1) the tokenized context/question pairs, and (2) integers indicating at which __token__ positions the answer begins and ends.

First, let's get the _character_ position at which the answer ends in the passage (we are given the starting position). Sometimes SQuAD answers are off by one or two characters, so we will adjust for that. 

In [6]:
def add_end_idx(answers, contexts):
    for answer, context in zip(answers, contexts):
        gold_text = answer['text']
        start_idx = answer['answer_start']
        end_idx = start_idx + len(gold_text)

        # sometimes squad answers are off by a character or two – fix this
        if context[start_idx:end_idx] == gold_text:
            answer['answer_end'] = end_idx
        elif context[start_idx-1:end_idx-1] == gold_text:
            answer['answer_start'] = start_idx - 1
            answer['answer_end'] = end_idx - 1     # When the gold label is off by one character
        elif context[start_idx-2:end_idx-2] == gold_text:
            answer['answer_start'] = start_idx - 2
            answer['answer_end'] = end_idx - 2     # When the gold label is off by two characters

add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)
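The off-by-one correction can be checked on a toy sentence. The sketch below uses a hypothetical helper, `align_answer`, that tries the same left shifts (0, 1, 2 characters) as `add_end_idx` but returns the span instead of mutating the answer dict, and returns `None` when nothing matches.

```python
def align_answer(context, text, start, max_shift=2):
    """Return (start, end) with context[start:end] == text, trying small left shifts.

    Hypothetical helper: returns None when no shift in 0..max_shift matches.
    """
    for shift in range(max_shift + 1):
        s = start - shift
        e = s + len(text)
        if s >= 0 and context[s:e] == text:
            return s, e
    return None

context = "The cat sat on the mat."
print(align_answer(context, "cat", 4))   # (4, 7): exact annotation
print(align_answer(context, "cat", 5))   # (4, 7): annotation one character too far right
print(align_answer(context, "dog", 4))   # None: no match within two characters
```

The last case matters: `add_end_idx` as written leaves `answer_end` unset when none of the three comparisons match, so a later lookup of `answer['answer_end']` would raise a `KeyError` for such an answer.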

Now `train_answers` and `val_answers` include the character end positions and the corrected start positions. Next, let's tokenize our context/question pairs. HF Tokenizers can accept parallel lists of sequences and encode them together as sequence pairs. 

In [7]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True)
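As a rough picture of what `truncation=True, padding=True` do to a batch, the following sketch pads lists of made-up token ids to the longest sequence and builds the matching attention mask. `pad_batch` is a hypothetical helper, not the tokenizer's implementation; the real tokenizer also handles special tokens and truncates sequence pairs more carefully.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Truncate to max_length (if given), then right-pad to the longest sequence."""
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token ids below are invented for the example.
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]], max_length=512)
print(batch["input_ids"])        # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["attention_mask"])   # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The attention mask is what lets the model ignore the padded positions.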

Next we need to convert our character start/end positions to token start/end positions. When using HF Fast Tokenizers, we can use the built-in `char_to_token()` method.

In [8]:
def add_token_positions(encodings, answers):
    start_positions = []
    end_positions = []
    for i in range(len(answers)):
        start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
        end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))

        # if start position is None, the answer passage has been truncated
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        if end_positions[-1] is None:
            end_positions[-1] = tokenizer.model_max_length

    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

add_token_positions(train_encodings, train_answers)
add_token_positions(val_encodings, val_answers)
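The idea behind the character-to-token conversion can be sketched with plain character offsets. This is a simplified illustration, not the library's implementation: `char_to_token_sketch` and the offsets are hypothetical, and each token is assumed to cover a half-open character span `[start, end)`.

```python
def char_to_token_sketch(offsets, char_idx):
    """Return the index of the token whose [start, end) span covers char_idx, else None."""
    for token_idx, (start, end) in enumerate(offsets):
        if start <= char_idx < end:
            return token_idx
    return None

# Hypothetical offsets for "The cat sat" split into three word tokens.
offsets = [(0, 3), (4, 7), (8, 11)]

answer_start, answer_end = 4, 7            # characters of "cat", end exclusive
start_tok = char_to_token_sketch(offsets, answer_start)
end_tok = char_to_token_sketch(offsets, answer_end - 1)   # last character of the answer
print(start_tok, end_tok)                  # 1 1

# A position past the kept tokens (as after truncation) or on whitespace maps to None.
print(char_to_token_sketch(offsets, 20))   # None
print(char_to_token_sketch(offsets, 3))    # None (the space between words)
```

This also shows why the end lookup uses `answer_end - 1`: the end index is exclusive, so the character at `answer_end` itself may be whitespace or belong to the next token.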

Our data is ready. Let's put it in a PyTorch/TensorFlow dataset so that we can easily use it for training.

Let's do both for good measure.

## PyTorch

In [9]:
import torch

class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

train_dataset = SquadDataset(train_encodings)
val_dataset = SquadDataset(val_encodings)

Now we can use a DistilBERT model with a QA head for training:

In [10]:
from transformers import DistilBertForQuestionAnswering
model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this mode
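The newly initialized `qa_outputs` layer is the QA head: it produces one start logit and one end logit per token. As a rough sketch of the training objective, assuming the loss is the mean of the cross-entropies for the start and end positions, the computation looks like this in plain Python (the logits and positions are made-up numbers, and `cross_entropy` is a hypothetical helper):

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax of the target index, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

# Hypothetical per-token logits for a 4-token sequence.
start_logits = [0.1, 2.0, -1.0, 0.3]
end_logits = [-0.5, 0.2, 1.5, 0.0]
start_position, end_position = 1, 2

# Mean of the start and end losses.
loss = (cross_entropy(start_logits, start_position)
        + cross_entropy(end_logits, end_position)) / 2
print(round(loss, 4))
```

The loss shrinks as the logits at the labelled start and end tokens grow relative to the others, which is what fine-tuning on `start_positions` and `end_positions` pushes the head towards.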

The data and model are both ready to go. Instead of `Trainer`/`TFTrainer`, the cells below train the model with a native PyTorch loop.

In [11]:
torch.cuda.is_available()

True

In [12]:
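from torch.utils.data import DataLoader
# transformers.AdamW is deprecated; torch.optim.AdamW is used in its place
from torch.optim import AdamW

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model.to(device)
model.train()

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# weight_decay=0.0 matches the default of the deprecated transformers.AdamW
optim = AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)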

for epoch in range(3):
    for batch in train_loader:
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
        loss = outputs[0]
        loss.backward()
        optim.step()

model.eval()

RuntimeError: CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

## TensorFlow