# HuggingFace Question Answer System

Question answering tasks return an answer given a question. There are two common forms of question answering:

1. Extractive: extract the answer from the given context.
2. Abstractive: generate an answer from the context that correctly answers the question.

This guide will show you how to fine-tune DistilBERT on the SQuAD dataset for extractive question answering.

### Load SQuAD dataset

Train: 87599 instances

Test: 10570

In [1]:
from datasets import load_dataset

In [2]:
# squad_train = load_dataset("squad", split='train')
# squad_valid = load_dataset("squad", split='validation')
squad_train = load_dataset("squad", split="train[:2%]")
squad_valid = load_dataset("squad", split="validation[:2%]")
squad = load_dataset("squad")

In [3]:
squad['train'] = squad_train
squad['validation'] = squad_valid
squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 1752
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 211
    })
})

In [4]:
# len(squad_train), len(squad_valid), len(squad)

### Load the DistilBERT tokenizer to process the question and context fields:

In [5]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

There are a few preprocessing steps particular to question answering that you should be aware of:

1. Some examples in a dataset may have a very long context that exceeds the maximum input length of the model. Truncate only the context by setting truncation="only_second".
2. Next, map the start and end positions of the answer to the original context by setting return_offset_mapping=True.
3. With the mapping in hand, you can find the start and end tokens of the answer. Use the sequence_ids method to find which part of the offset corresponds to the question and which corresponds to the context.

Here is how you can create a function to truncate and map the start and end tokens of the answer to the context:

In [6]:
def preprocess_function(examples):

    questions = [q.strip() for q in examples["question"]]

    inputs = tokenizer(

        questions,

        examples["context"],

        max_length=384,

        truncation="only_second",

        return_offsets_mapping=True,

        padding="max_length",

    )

    offset_mapping = inputs.pop("offset_mapping")

    answers = examples["answers"]

    start_positions = []

    end_positions = []

    for i, offset in enumerate(offset_mapping):

        answer = answers[i]

        start_char = answer["answer_start"][0]

        end_char = answer["answer_start"][0] + len(answer["text"][0])

        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context

        idx = 0

        while sequence_ids[idx] != 1:

            idx += 1

        context_start = idx

        while sequence_ids[idx] == 1:

            idx += 1

        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)

        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:

            start_positions.append(0)

            end_positions.append(0)

        else:

            # Otherwise it's the start and end token positions

            idx = context_start

            while idx <= context_end and offset[idx][0] <= start_char:

                idx += 1

            start_positions.append(idx - 1)

            idx = context_end

            while idx >= context_start and offset[idx][1] >= end_char:

                idx -= 1

            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions

    inputs["end_positions"] = end_positions

    return inputs

Use 🤗 Datasets map function to apply the preprocessing function over the entire dataset. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once. Remove the columns you don’t need:

In [7]:
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

Map:   0%|          | 0/1752 [00:00<?, ? examples/s]

Map:   0%|          | 0/211 [00:00<?, ? examples/s]

Use DefaultDataCollator to create a batch of examples. Unlike other data collators in 🤗 Transformers, the DefaultDataCollator does not apply additional preprocessing such as padding.

In [8]:
from transformers import DefaultDataCollator
data_collator = DefaultDataCollator()

# Train

Load DistilBERT with AutoModelForQuestionAnswering:

In [9]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


At this point, only three steps remain:

1. Define your training hyperparameters in TrainingArguments.
2. Pass the training arguments to Trainer along with the model, dataset, tokenizer, and data collator.
3. Call train() to fine-tune your model.

In [10]:
training_args = TrainingArguments(

    output_dir="./results",

    evaluation_strategy="epoch",

    learning_rate=2e-5,

    per_device_train_batch_size=16,

    per_device_eval_batch_size=16,

    num_train_epochs=3,

    weight_decay=0.01,

)

trainer = Trainer(

    model=model,

    args=training_args,

    train_dataset=tokenized_squad["train"],

    eval_dataset=tokenized_squad["validation"],

    tokenizer=tokenizer,

    data_collator=data_collator,

)

In [11]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,3.178071
2,No log,3.020923
3,No log,2.874038


TrainOutput(global_step=330, training_loss=3.4778793797348486, metrics={'train_runtime': 4949.642, 'train_samples_per_second': 1.062, 'train_steps_per_second': 0.067, 'total_flos': 515034520326144.0, 'train_loss': 3.4778793797348486, 'epoch': 3.0})

### Build evaluation method

In [48]:
def evaluate(model, data_loader, device):
    model.eval()
    pred_start = []
    pred_end = []
    true_start = []
    true_end = []
    id_lists = []
    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            start_positions = batch['start_positions'].to(device)
            end_positions = batch['end_positions'].to(device)
            
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
            pred_start.extend([int(torch.argmax(loc)) for loc in outputs['start_logits']])
            pred_end.extend([int(torch.argmax(loc)) for loc in outputs['end_logits']])
            true_start.extend(batch['start_positions'].tolist())
            true_end.extend(batch['end_positions'].tolist())
            id_lists.extend(batch['input_ids'].tolist())
            
    return true_start, true_end, pred_start, pred_end, id_lists

In [49]:
valid_dataset = tokenized_squad["validation"]
valid_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'start_positions', 'end_positions'])
valid_loader = DataLoader(valid_dataset, batch_size=8, shuffle=False)

true_start, true_end, pred_start, pred_end, id_lists = evaluate(model, val_dataloader, device)

In [62]:
pred_start

[57,
 57,
 161,
 161,
 167,
 162,
 72,
 160,
 162,
 159,
 73,
 159,
 74,
 163,
 170,
 46,
 62,
 163,
 80,
 163,
 77,
 71,
 42,
 53,
 159,
 46,
 160,
 71,
 162,
 162,
 27,
 133,
 66,
 40,
 29,
 44,
 27,
 25,
 29,
 26,
 35,
 33,
 37,
 29,
 95,
 25,
 43,
 29,
 26,
 29,
 44,
 46,
 24,
 44,
 25,
 67,
 47,
 45,
 59,
 72,
 43,
 36,
 66,
 43,
 43,
 73,
 70,
 75,
 50,
 68,
 70,
 40,
 47,
 44,
 45,
 41,
 38,
 47,
 45,
 74,
 33,
 34,
 55,
 33,
 30,
 15,
 31,
 30,
 30,
 32,
 29,
 36,
 35,
 21,
 26,
 31,
 30,
 34,
 27,
 29,
 30,
 28,
 29,
 28,
 33,
 31,
 17,
 29,
 26,
 18,
 20,
 18,
 16,
 16,
 23,
 27,
 36,
 22,
 22,
 22,
 24,
 47,
 57,
 29,
 54,
 48,
 25,
 6,
 49,
 18,
 49,
 46,
 47,
 23,
 28,
 83,
 84,
 84,
 81,
 27,
 22,
 79,
 38,
 85,
 37,
 84,
 38,
 48,
 83,
 79,
 20,
 34,
 80,
 80,
 46,
 37,
 32,
 30,
 18,
 31,
 15,
 17,
 18,
 37,
 15,
 19,
 47,
 38,
 24,
 14,
 36,
 44,
 31,
 21,
 15,
 44,
 15,
 33,
 146,
 81,
 71,
 81,
 79,
 83,
 91,
 85,
 71,
 72,
 65,
 75,
 81,
 82,
 83,
 84,
 85,
 149,
 1

In [70]:
def get_ans(tokens, start, end):
    answer = tokens[start]
    for i in range(start+1, end+1):
        if tokens[i][0:2] == "##":
            answer += tokens[i][2:]
        else:
            answer += " " + tokens[i]
    return answer

for i in range(10,len(id_lists)):
    tokens = tokenizer.convert_ids_to_tokens(id_lists[i])
    question = ' '.join(tokens[tokens.index('[CLS]')+1: tokens.index('[SEP]')])
    true_ans = get_ans(tokens, true_start[i], true_end[i])
    pred_ans = get_ans(tokens, pred_start[i], pred_end[i])
    print(' '.join(tokens))
    print()
    print(true_ans)
    print()
    print(pred_ans)
    print('---------{},{},{}------------'.format(i, pred_start[i], pred_end[i]))
    break

[CLS] what day was the super bowl played on ? [SEP] super bowl 50 was an american football game to determine the champion of the national football league ( nfl ) for the 2015 season . the american football conference ( afc ) champion denver broncos defeated the national football conference ( nfc ) champion carolina panthers 24 – 10 to earn their third super bowl title . the game was played on february 7 , 2016 , at levi ' s stadium in the san francisco bay area at santa clara , california . as this was the 50th super bowl , the league emphasized the " golden anniversary " with various gold - themed initiatives , as well as temporarily suspend ##ing the tradition of naming each super bowl game with roman nu ##meral ##s ( under which the game would have been known as " super bowl l " ) , so that the logo could prominently feature the arabic nu ##meral ##s 50 . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PA

In [50]:
def exact_match(pred_tokens, true_tokens):
    '''
    A straightforward way to check the equality of the two lists in Python 
    is by using the equality == operator. 
    When the equality == is used on the list type in Python, 
    it returns True if the lists are equal and False if they are not.
    '''
    return int(pred_tokens==true_tokens)

def half_exact_match(pred_tokens, true_tokens):
    '''
    A straightforward way to check the equality of the two lists in Python 
    is by using the equality == operator. 
    When the equality == is used on the list type in Python, 
    it returns True if the lists are equal and False if they are not.
    '''
    if len(pred_tokens)<=1 or len(true_tokens)<=1:
        return int(pred_tokens==true_tokens) 
    else:
        return int(pred_tokens[0]==true_tokens[0] or pred_tokens[-1]==true_tokens[-1])
    
def any_token_match(pred_tokens, true_tokens):
    common_tokens = set(pred_tokens) & set(true_tokens)
    return int(len(common_tokens)>0)
    

def get_prec_rec_f1(pred_tokens, true_tokens):
    # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
    if len(pred_tokens) == 0 or len(true_tokens) == 0:
        prec = rec = f1 = 1
        return prec, rec, f1
    
    common_tokens = set(pred_tokens) & set(true_tokens)
    # if there are no common tokens then f1 = 0
    if len(common_tokens) == 0:
        prec = rec = f1 = 0
        return prec, rec, f1
    
    prec = len(common_tokens) / len(pred_tokens)
    rec = len(common_tokens) / len(true_tokens)
    f1 = 2 * (prec * rec) / (prec + rec)
    
    return prec, rec, f1

In [51]:
match = 0
half_match = 0
any_match = 0
prec = 0
rec = 0
f1 = 0
count = 0
for idx in range(len(pred_start)):

    pred_tokens = [i for i in range(pred_start[idx], pred_end[idx]+1)]
    true_tokens = [i for i in range(true_start[idx], true_end[idx]+1)]
    
    match +=  exact_match(pred_tokens, true_tokens)
    half_match += half_exact_match(pred_tokens, true_tokens)
    any_match += any_token_match(pred_tokens, true_tokens)
    
    score =  get_prec_rec_f1(pred_tokens, true_tokens)
    prec += score[0]
    rec += score[1]
    f1 += score[2]
    count += 1

    
import datetime
now = datetime.datetime.now()
string = ''
string += '========={}========\n'.format(now)
string += '========={}========\n'.format(now)
string += 'epoch: '+str(training_args.num_train_epochs)+'\n'
string += 'exact_match: '+str(match/count)+'\n'
string += 'half_exact_match: '+str(half_match/count)+'\n'
string += 'any_match: '+str(any_match/count)+'\n'
string += 'recall: '+str(rec/count)+'\n'
string += 'precision: '+str(prec/count)+'\n'
string += 'f1: '+str(f1/count)+'\n\n'
print(string)

with open('Report_BERT_QAM_Squad.txt', 'a+') as FO:
    FO.write(string)

epoch: 3
exact_match: 0.13744075829383887
half_exact_match: 0.24170616113744076
any_match: 0.3080568720379147
recall: 0.6562401263823063
precision: 0.5191035950054195
f1: 0.5308140500545643




Reference: https://huggingface.co/docs/transformers/tasks/question_answering#train