# Hugging Face - Question Answering

Extractive question-answering system: The model extracts a span of the given context to answer a question. The context (or reference) is the paragraph or sentence in which you want to find the answer; the question is what you want the model to answer.

Abstractive (or generative) question-answering system: The model generates new words or sentences, based on the context, that answer the question.

We will build an extractive question-answering system, which only needs BERT or a similar encoder-only Transformer architecture. A generative question-answering system, by contrast, needs a decoder to produce text, typically a full Transformer (encoder + decoder) architecture.

In [37]:
from transformers import AutoTokenizer
from datasets import load_dataset
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
from transformers import DefaultDataCollator
import datasets
import pandas as pd

In [2]:
squad = load_dataset("squad") 

Found cached dataset squad (C:/Users/jorda/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 158.82it/s]


In [3]:
squad.get("train")[0]

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}

Id - Unique ID of the question

Title - The topic (article title) the question belongs to

Context - The passage in which the model has to find the answer to the question

Question - The question itself

Answers - The answer text (a span of the context) and the character index in the context where that span starts
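
The relationship between `answer_start` and `context` can be checked directly: slicing the context from `answer_start` for the length of the answer text should reproduce the answer. A minimal sketch, using a shortened copy of the context above (so the offset here is recomputed and differs from the 515 in the real record):

```python
# Shortened stand-in for a SQuAD record; the real context is longer,
# so answer_start is recomputed here rather than copied from the dataset.
context = (
    "It is a replica of the grotto at Lourdes, France where the Virgin Mary "
    "reputedly appeared to Saint Bernadette Soubirous in 1858."
)
answer_text = "Saint Bernadette Soubirous"
example = {
    "context": context,
    "answers": {"text": [answer_text], "answer_start": [context.index(answer_text)]},
}

start_char = example["answers"]["answer_start"][0]
end_char = start_char + len(example["answers"]["text"][0])  # exclusive end

# In extractive QA the answer is literally a substring of the context.
extracted = example["context"][start_char:end_char]
print(extracted)
```

This character-level span is what the preprocessing step later has to translate into token positions.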

In [4]:
tokenizer = AutoTokenizer.from_pretrained("deepset/bert-base-cased-squad2")
tokenizer

BertTokenizerFast(name_or_path='deepset/bert-base-cased-squad2', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

Now let's define a `preprocess_function`, which tokenizes each question and context pair, truncates or pads it to a fixed length, and computes the answer labels, so that the data ends up in a form we can feed to the model.

**NOTE** - There are things happening in this function that I do not fully understand, so I am treating it as a magic function for now. It essentially takes each answer's character-index location in the context and adds the matching start and end positions as token indices.
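
To make the "magic" a little more concrete: the tokenizer's `offset_mapping` gives, for every token, the `(start_char, end_char)` range it covers in the original text, and the function walks those offsets to find which tokens contain the answer. Below is a simplified sketch of that idea with a hypothetical whitespace tokenizer. The names `whitespace_offsets` and `char_span_to_token_span` are illustrative and not part of `transformers`; the real BERT tokenizer splits text into subwords and adds special tokens, so real indices will differ.

```python
def whitespace_offsets(text):
    """Hypothetical stand-in for offset_mapping: (start, end) per whitespace token."""
    offsets, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        offsets.append((start, start + len(word)))
        pos = start + len(word)
    return offsets


def char_span_to_token_span(offsets, start_char, end_char):
    """Map the character span [start_char, end_char) to (first_token, last_token)."""
    # Walk forward past every token that starts at or before the answer start;
    # the last token passed is the first answer token.
    idx = 0
    while idx < len(offsets) and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward past every token that ends at or after the answer end;
    # the last token passed is the final answer token.
    idx = len(offsets) - 1
    while idx >= 0 and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


context = "The Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858."
answer = "Saint Bernadette Soubirous"
start_char = context.index(answer)
end_char = start_char + len(answer)

offsets = whitespace_offsets(context)
start_token, end_token = char_span_to_token_span(offsets, start_char, end_char)
tokens = context.split()
print(tokens[start_token : end_token + 1])
```

The two `while` loops mirror the ones in the function below, minus the bookkeeping that restricts the search to the context part of the input.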

In [5]:
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=512,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

In [6]:
# Apply the preprocess function to the SQuAD dataset in batches.
tokenized_squad = squad.map(preprocess_function, batched=True)

Loading cached processed dataset at C:\Users\jorda\.cache\huggingface\datasets\squad\plain_text\1.0.0\d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453\cache-a7926d51eaae8d08.arrow
Loading cached processed dataset at C:\Users\jorda\.cache\huggingface\datasets\squad\plain_text\1.0.0\d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453\cache-cc816901e4102791.arrow
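
With `batched=True`, `map` passes the function a dictionary of columns (each value is a list covering the whole batch) rather than one record at a time, which is why `preprocess_function` treats `examples["question"]` as a list. A minimal sketch of that shape in plain Python, where `toy_preprocess` is a hypothetical stand-in and the batch contents are made up for illustration:

```python
# A batch as map(..., batched=True) presents it: column name -> list of values.
batch = {
    "question": ["  Who appeared in 1858? ", "Where is the Grotto?"],
    "context": ["The Virgin Mary reputedly appeared ...", "Immediately behind the basilica ..."],
}


def toy_preprocess(examples):
    """Hypothetical stand-in: returns new columns, one entry per example."""
    questions = [q.strip() for q in examples["question"]]
    return {
        "question_clean": questions,
        "context_length": [len(c) for c in examples["context"]],
    }


out = toy_preprocess(batch)
print(out["question_clean"])
```

The returned dictionary follows the same column layout, which is how the real function adds `start_positions` and `end_positions` to the dataset.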


In [7]:
print(tokenized_squad['train']['start_positions'][0])
print(tokenized_squad['train']['end_positions'][0])

136
142


## Train Model

Get a pretrained BERT model that has already been fine-tuned on SQuAD 2.0.

In [8]:
data_collator = DefaultDataCollator()
model = AutoModelForQuestionAnswering.from_pretrained("deepset/bert-base-cased-squad2")

In [9]:
model

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elem

In [38]:
# Use small subsets so that this demo trains quickly
training_set = tokenized_squad['train'].select(range(100))
testing_set = tokenized_squad['validation'].select(range(20))

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=training_set,
    eval_dataset=testing_set,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,0.829199


TrainOutput(global_step=50, training_loss=0.478194465637207, metrics={'train_runtime': 385.7592, 'train_samples_per_second': 0.259, 'train_steps_per_second': 0.13, 'total_flos': 26129675673600.0, 'train_loss': 0.478194465637207, 'epoch': 1.0})
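
The `global_step=50` in the output follows from the settings above: 100 training examples with a per-device batch size of 2, for one epoch. A quick check, assuming a single device and no gradient accumulation (both are assumptions; neither is stated in the run):

```python
import math

num_examples = 100          # size of training_set
per_device_batch_size = 2   # per_device_train_batch_size
num_devices = 1             # assumption: a single CPU or GPU
grad_accum_steps = 1        # assumption: default value, not overridden
num_epochs = 1              # num_train_epochs

batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 50
```

The `No log` entry under Training Loss is most likely because the run finished in fewer steps than the `Trainer`'s default logging interval of 500 steps, so no intermediate training loss was recorded.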

## Inference

In [39]:
import torch

In [40]:
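def find_answer(question, text):
    # Encode the question and context as one pair, on the same device as the model
    inputs = tokenizer(question, text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)

    # Most likely start and end token positions of the answer
    answer_start_index = outputs.start_logits.argmax()
    answer_end_index = outputs.end_logits.argmax()

    # Decode the tokens in the predicted span back into text
    predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
    return tokenizer.decode(predict_answer_tokens)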
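
Taking the two `argmax` values independently can occasionally give an end index that comes before the start index, which decodes to an empty string. One common alternative is to score candidate spans jointly and keep the best valid one. The sketch below shows the idea on plain Python lists; `best_span` and `max_answer_len` are illustrative names, and the logits are made-up numbers rather than model output.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximizing start_logits[s] + end_logits[e] with s <= e."""
    best_score, best = float("-inf"), (0, 0)
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best


# Made-up logits where independent argmax gives start=3, end=1 (an invalid span).
start_logits = [0.1, 0.2, 0.3, 2.0, 0.0]
end_logits = [0.0, 2.5, 0.1, 1.0, 2.0]

print(best_span(start_logits, end_logits))  # (3, 4)
```

Inside `find_answer`, the inputs to such a helper could be obtained with `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`. The nested loop is quadratic in sequence length, which is acceptable for a sketch but is usually narrowed to the top few start and end candidates in practice.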

In [41]:
question =  "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?"
text = "It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary."

print(find_answer(question,text))

Saint Bernadette Soubirous


In [42]:
question =  "What is the first major city west of paris?"
text = "Marseille is too the south, nannes to the east, brittany to the west"

print(find_answer(question,text))

brittany
