# Question answering

Question answering tasks return an answer given a question. If you've ever asked a virtual assistant like Alexa, Siri or Google what the weather is, then you've used a question answering model before. There are two common types of question answering tasks:

- Extractive: extract the answer from the given context.
- Abstractive: generate an answer from the context that correctly answers the question.

This guide will show you how to:

1. Finetune [DistilBERT](https://huggingface.co/distilbert-base-uncased) on a dataset in the [SQuAD](https://huggingface.co/datasets/squad) format for extractive question answering.
2. Use your finetuned model for inference.

<Tip>
The task illustrated in this tutorial is supported by the following model architectures:

<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->

[ALBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/albert), [BART](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bart), [BERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bert), [BigBird](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/big_bird), [BigBird-Pegasus](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bigbird_pegasus), [BLOOM](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bloom), [CamemBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/camembert), [CANINE](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/canine), [ConvBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/convbert), [Data2VecText](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/data2vec-text), [DeBERTa](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/deberta), [DeBERTa-v2](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/deberta-v2), [DistilBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/distilbert), [ELECTRA](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/electra), [ERNIE](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/ernie), [ErnieM](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/ernie_m), [FlauBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/flaubert), [FNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/fnet), [Funnel Transformer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/funnel), [OpenAI GPT-2](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt2), [GPT Neo](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt_neo), [GPT NeoX](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt_neox), 
[GPT-J](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gptj), [I-BERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/ibert), [LayoutLMv2](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/layoutlmv2), [LayoutLMv3](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/layoutlmv3), [LED](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/led), [LiLT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/lilt), [Longformer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/longformer), [LUKE](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/luke), [LXMERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/lxmert), [MarkupLM](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/markuplm), [mBART](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mbart), [MEGA](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mega), [Megatron-BERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/megatron-bert), [MobileBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mobilebert), [MPNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mpnet), [MVP](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mvp), [Nezha](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/nezha), [Nyströmformer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/nystromformer), [OPT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/opt), [QDQBert](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/qdqbert), [Reformer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/reformer), [RemBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/rembert), 
[RoBERTa](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/roberta), [RoBERTa-PreLayerNorm](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/roberta-prelayernorm), [RoCBert](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/roc_bert), [RoFormer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/roformer), [Splinter](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/splinter), [SqueezeBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/squeezebert), [XLM](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlm), [XLM-RoBERTa](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlm-roberta), [XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlm-roberta-xl), [XLNet](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlnet), [X-MOD](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xmod), [YOSO](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/yoso)


<!--End of the generated tip-->

</Tip>

Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install transformers datasets evaluate
```

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community, for example by calling `notebook_login()` from `huggingface_hub` in a notebook. When prompted, enter your token to log in.

## Load dataset

In [39]:
import pandas as pd
import datasets
from datasets import Dataset, DatasetDict
import ast
from collections import defaultdict

In [40]:
f_label_mat = pd.read_excel("/home/tanluuuuuuu/Desktop/luunvt/direct_indirect/notebooks/label_1941_1981_2143.xlsx")
display(f_label_mat)

|  | asin | sentence | locs | words | loc_prediction | word_prediction |
|---|---|---|---|---|---|---|
| 0 | B08K86G2ZB | GRIPPY NON SLIP TEXTURE - Keeps you from slidi... | [[197, 202, 'MAT']] | ['matte'] | [] | [] |
| 1 | B08K86G2ZB | PVC-FREE - Made from a EVA Foam / TPE material... | [[34, 37, 'MAT'], [27, 31, 'MAT']] | ['TPE', 'Foam'] | [[27, 31, 'MAT'], [34, 37, 'MAT']] | ['Foam', 'TPE'] |
| 2 | B08K85C9PB | GRIPPY NON SLIP TEXTURE - Keeps you from slidi... | [[197, 202, 'MAT']] | ['matte'] | [] | [] |
| 3 | B08K85C9PB | PVC-FREE - Made from a EVA Foam / TPE material... | [[34, 37, 'MAT'], [27, 31, 'MAT']] | ['TPE', 'Foam'] | [[27, 31, 'MAT'], [34, 37, 'MAT']] | ['Foam', 'TPE'] |
| 4 | B09N7FJJGC | Durable Construction: Made of durable, high- q... | [[53, 58, 'MAT']] | ['metal'] | [[53, 58, 'MAT']] | ['metal'] |
| ... | ... | ... | ... | ... | ... | ... |
| 777 | B07PFD4B2C | DURABLE LEATHER MATERIAL. To provide extreme d... | [[8, 15, 'MAT'], [107, 114, 'MAT'], [161, 168,... | ['LEATHER', 'Leather', 'leather'] | [[8, 15, 'MAT'], [107, 114, 'MAT'], [161, 168,... | ['LEATHER', 'Leather', 'leather'] |
| 778 | B07PFD4B2C | HIGH-QUALITY CONSTRUCTION. A heavy-duty latex ... | [[40, 45, 'MAT']] | ['latex'] | [[40, 45, 'MAT']] | ['latex'] |
| 779 | B07PFD4B2C | GREAT EXERCISE ACTIVITY. Practicing with this ... | [[46, 53, 'MAT']] | ['Leather'] | [[46, 53, 'MAT']] | ['Leather'] |
| 780 | B0928J3RKC | 🥊QUALITY ASSURANCE - Kids punching bag with gl... | [[92, 99, 'MAT']] | ['leather'] | [[92, 99, 'MAT']] | ['leather'] |


In [41]:
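list_ans = []
for index, row in f_label_mat.iterrows():
    # `locs` and `words` are stored as strings, so parse them back into Python lists
    locs = ast.literal_eval(row['locs'])
    words = ast.literal_eval(row['words'])
    # Keep only the first labeled span of each row as the answer;
    # rows with no labeled words add no answer
    if words:
        ans = {
            'text': [words[0]],
            'answer_start': [locs[0][0]]
        }
        list_ans.append(ans)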

In [42]:
new_df = pd.DataFrame()
new_df['id'] =  f_label_mat['asin']
new_df['context'] = f_label_mat['sentence']
new_df['question'] = ["What materials is this product made of?" for x in range(len(f_label_mat))]
new_df['answers'] = pd.Series(list_ans)
display(new_df)

|  | id | context | question | answers |
|---|---|---|---|---|
| 0 | B08K86G2ZB | GRIPPY NON SLIP TEXTURE - Keeps you from slidi... | What materials is this product made of? | {'text': ['matte'], 'answer_start': [197]} |
| 1 | B08K86G2ZB | PVC-FREE - Made from a EVA Foam / TPE material... | What materials is this product made of? | {'text': ['TPE'], 'answer_start': [34]} |
| 2 | B08K85C9PB | GRIPPY NON SLIP TEXTURE - Keeps you from slidi... | What materials is this product made of? | {'text': ['matte'], 'answer_start': [197]} |
| 3 | B08K85C9PB | PVC-FREE - Made from a EVA Foam / TPE material... | What materials is this product made of? | {'text': ['TPE'], 'answer_start': [34]} |
| 4 | B09N7FJJGC | Durable Construction: Made of durable, high- q... | What materials is this product made of? | {'text': ['metal'], 'answer_start': [53]} |
| ... | ... | ... | ... | ... |
| 777 | B07PFD4B2C | DURABLE LEATHER MATERIAL. To provide extreme d... | What materials is this product made of? | {'text': ['LEATHER'], 'answer_start': [8]} |
| 778 | B07PFD4B2C | HIGH-QUALITY CONSTRUCTION. A heavy-duty latex ... | What materials is this product made of? | {'text': ['latex'], 'answer_start': [40]} |
| 779 | B07PFD4B2C | GREAT EXERCISE ACTIVITY. Practicing with this ... | What materials is this product made of? | {'text': ['Leather'], 'answer_start': [46]} |
| 780 | B0928J3RKC | 🥊QUALITY ASSURANCE - Kids punching bag with gl... | What materials is this product made of? | {'text': ['leather'], 'answer_start': [92]} |


In [43]:
squad = Dataset.from_pandas(new_df)
squad = squad.train_test_split(test_size=0.2)
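
Because extractive training labels are character offsets, it can help to confirm that each `answer_start` really points at its `text` inside `context` before tokenizing. The following is a minimal sketch on hand-written toy rows; the rows and the helper name `is_aligned` are illustrative assumptions, not part of the dataset above:

```python
def is_aligned(example):
    """Return True if every answer text is found at its stated character offset."""
    context = example["context"]
    answers = example["answers"]
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


# Toy rows in the same SQuAD-style layout as `new_df` (illustrative values only)
good = {
    "context": "Made of durable metal and soft foam.",
    "answers": {"text": ["metal"], "answer_start": [16]},
}
bad = {
    "context": "Made of durable metal and soft foam.",
    "answers": {"text": ["metal"], "answer_start": [10]},
}

print(is_aligned(good))  # True
print(is_aligned(bad))   # False
```

On the real data, the same predicate could be passed to `Dataset.filter` on each split to drop misaligned rows before preprocessing.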

## Preprocess

The next step is to load a DistilBERT tokenizer to process the `question` and `context` fields:

In [44]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

There are a few preprocessing steps particular to question answering tasks you should be aware of:

1. Some examples in a dataset may have a very long `context` that exceeds the maximum input length of the model. To deal with longer sequences, truncate only the `context` by setting `truncation="only_second"`.
2. Next, map the start and end positions of the answer to the original `context` by setting
   `return_offsets_mapping=True`.
3. With the mapping in hand, now you can find the start and end tokens of the answer. Use the [sequence_ids](https://huggingface.co/docs/tokenizers/main/en/api/encoding#tokenizers.Encoding.sequence_ids) method to
   find which part of the offset corresponds to the `question` and which corresponds to the `context`.

Here is how you can create a function to truncate and map the start and end tokens of the `answer` to the `context`:

In [45]:
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
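
To see what the span-finding loops do without loading a tokenizer, the same idea can be exercised on hand-written offsets. This is a sketch under stated assumptions: the helper `char_span_to_token_span`, the offsets, and the sequence ids below are made up for illustration (they imitate what a tokenizer would return for the question "what material?" and the context "made of solid acacia wood"), and the helper returns `(0, 0)` whenever the answer is not fully inside the context.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map a character span in the context to (start_token, end_token) indices.

    Returns (0, 0) when the answer is not fully inside the tokenized context.
    """
    # Locate the first and last context tokens (sequence id 1)
    context_start = sequence_ids.index(1)
    context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Walk right until a token starts past start_char, then step back one
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk left until a token ends before end_char, then step forward one
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


# Hypothetical encoding: [CLS] what material ? [SEP] made of solid acacia wood [SEP]
offsets = [(0, 0), (0, 4), (5, 13), (13, 14), (0, 0),
           (0, 4), (5, 7), (8, 13), (14, 20), (21, 25), (0, 0)]
sequence_ids = [None, 0, 0, 0, None, 1, 1, 1, 1, 1, None]

# "acacia wood" occupies characters 14..25 of the context
print(char_span_to_token_span(offsets, sequence_ids, 14, 25))  # (8, 9)
# An answer past the end of the (truncated) context falls back to (0, 0)
print(char_span_to_token_span(offsets, sequence_ids, 30, 35))  # (0, 0)
```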

To apply the preprocessing function over the entire dataset, use 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove any columns you don't need:

In [46]:
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

Map:   0%|          | 0/625 [00:00<?, ? examples/s]

Map:   0%|          | 0/157 [00:00<?, ? examples/s]

Now create a batch of examples using [DefaultDataCollator](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DefaultDataCollator). Unlike other data collators in 🤗 Transformers, the [DefaultDataCollator](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DefaultDataCollator) does not apply any additional preprocessing such as padding.

In [47]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()
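
Since the collator adds no padding, batching only works because `preprocess_function` already padded every example to `max_length`. A rough pure-Python sketch of that idea follows; the helper `simple_collate` is hypothetical and only stands in for the real collator, which additionally converts the values to tensors:

```python
def simple_collate(features):
    """Group a list of per-example dicts into one dict of equal-length lists."""
    keys = features[0].keys()
    batch = {key: [feature[key] for feature in features] for key in keys}
    # Without padding in the collator, every example must already share one length
    lengths = {len(ids) for ids in batch["input_ids"]}
    if len(lengths) != 1:
        raise ValueError(f"examples have different lengths: {sorted(lengths)}")
    return batch


# Toy features, already padded to the same length (illustrative values only)
features = [
    {"input_ids": [101, 7, 8, 0], "start_positions": 1, "end_positions": 2},
    {"input_ids": [101, 9, 0, 0], "start_positions": 1, "end_positions": 1},
]
batch = simple_collate(features)
print(batch["start_positions"])  # [1, 1]
```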

## Train

<Tip>

If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)!

</Tip>

You're ready to start training your model now! Load DistilBERT with [AutoModelForQuestionAnswering](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForQuestionAnswering):

In [48]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


At this point, only three steps remain:

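1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir`, which specifies where to save your model. To push the model to the Hub, set `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model); the arguments used here leave it unset, so checkpoints are only saved locally. Setting `evaluation_strategy="epoch"` and `save_strategy="epoch"` evaluates the model and saves a checkpoint at the end of each epoch.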
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, tokenizer, and data collator.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [54]:
training_args = TrainingArguments(
    output_dir="./my_awesome_qa_model",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

{'eval_loss': 0.45357745885849, 'eval_runtime': 1.1334, 'eval_samples_per_second': 138.52, 'eval_steps_per_second': 8.823, 'epoch': 1.0}


{'eval_loss': 0.3574216365814209, 'eval_runtime': 1.1278, 'eval_samples_per_second': 139.204, 'eval_steps_per_second': 8.867, 'epoch': 2.0}


{'eval_loss': 0.36656296253204346, 'eval_runtime': 1.1307, 'eval_samples_per_second': 138.85, 'eval_steps_per_second': 8.844, 'epoch': 3.0}
{'train_runtime': 46.9633, 'train_samples_per_second': 39.925, 'train_steps_per_second': 2.555, 'train_loss': 0.2218625068664551, 'epoch': 3.0}


TrainOutput(global_step=120, training_loss=0.2218625068664551, metrics={'train_runtime': 46.9633, 'train_samples_per_second': 39.925, 'train_steps_per_second': 2.555, 'train_loss': 0.2218625068664551, 'epoch': 3.0})
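
The step count in this output can be checked by hand. Assuming a single device and no gradient accumulation (neither is stated above), the 625 training and 157 test examples reported when mapping the dataset, together with the batch size and epoch count from `TrainingArguments`, give:

```python
import math

train_examples = 625   # size of the training split reported by `map`
eval_examples = 157    # size of the test split reported by `map`
batch_size = 16        # per_device_train_batch_size / per_device_eval_batch_size
epochs = 3             # num_train_epochs

# The last, partially filled batch still counts as a step
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
eval_steps = math.ceil(eval_examples / batch_size)

print(steps_per_epoch, total_steps, eval_steps)  # 40 120 10
```

The total of 120 matches `global_step=120` in the `TrainOutput` and the `checkpoint-120` directory used for inference, and 10 evaluation steps agrees with `eval_steps_per_second` multiplied by `eval_runtime` in the logged metrics.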

## Evaluate

Evaluation for question answering requires a significant amount of postprocessing. To avoid taking up too much of your time, this guide skips the evaluation step. The [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) still calculates the evaluation loss during training so you're not completely in the dark about your model's performance.

If you have more time and you're interested in how to evaluate your model for question answering, take a look at the [Question answering](https://huggingface.co/course/chapter7/7?fw=pt#postprocessing) chapter from the 🤗 Hugging Face Course!
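
For a rough sense of what that evaluation measures: once predictions have been postprocessed into answer strings, extractive question answering is commonly scored with exact match and token-level F1. The sketch below is a simplified illustration, not the official SQuAD script. It only lowercases and splits on whitespace, whereas the official normalization also strips punctuation and articles, and the function names are hypothetical.

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace (a simplification of SQuAD normalization)."""
    return text.lower().split()


def exact_match(prediction, reference):
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    # Multiset intersection counts tokens shared by both answers
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Leather", "leather"))
print(token_f1("solid acacia wood", "acacia wood"))
```

Here the second prediction shares two of its three tokens with the reference, so precision is 2/3, recall is 1, and the F1 is 0.8 up to floating-point rounding.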

## Inference

Great, now that you've finetuned a model, you can use it for inference!

Come up with a question and some context you'd like the model to predict:

In [74]:
question = "What materials is this product made of?"
context = '''
['Relax and take a load off in your garden or patio on this comfortable wooden sun lounger! You can spend a relaxing afternoon in your own outdoor space! It is perfect for a pool, balcony, garden, etc.', 'This outdoor sun lounger is made of solid acacia wood, making it sturdy and stable. The lounge bed features an ergonomically curved shape, allowing you to lie down and fully relax. This sunbed can be folded away when not in use. The cushion with head pillow adds extra comfort for you while lounging. Each cushion features six sets of ropes so that you can fix it on the sun lounger tightly.', 'This sun lounger offers you a comfortable outdoor relaxation experience. These outdoor loungers are perfect for your outdoor activities, such as soaking up the sun or relaxing after a swim. It is the ideal relaxation and vacation companion for you and your family to relax and enjoy near the pool, on the beach, in the garden, on the deck, etc.', 'Color of cushion: Red Material: Solid acacia wood with an oil finish Material of cushion: Fabric (100% polyester) Dimensions (unfolded): 72.4" x 21.7" x 25.2" (L x W x H) Dimensions (folded): 36.2" x 21.7" x 7.9" (L x W x H) Dimension of cushion: 73.2" x 22.8" x 1.2" (L x W x T) Dimension of head cushion: 18.5" x 11.8" x 1.2" (L x W x T) Assembly required: No Delivery contains: 1 x Sunlounger 1 x Cushion', 'We ship quickly, thirty days no reason after-sales is your best after-sales guarantee, you can rest assured that the purchase, if you have any questions, please contact us in a timely manner, we will see the first time to contact you, to give you the most satisfactory solution!']
'''

The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for question answering with your model, and pass your text to it:

In [75]:
from transformers import pipeline

question_answerer = pipeline("question-answering", model="./my_awesome_qa_model/checkpoint-120")
question_answerer(question=question, context=context)

{'score': 0.1557905226945877, 'start': 1639, 'end': 1640, 'answer': ']'}

You can also manually replicate the results of the `pipeline` if you'd like:

Tokenize the text and return PyTorch tensors:

In [76]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./my_awesome_qa_model/checkpoint-120")
inputs = tokenizer(question, context, return_tensors="pt")

Pass your inputs to the model and return the `logits`:

In [77]:
import torch
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("./my_awesome_qa_model/checkpoint-120")
with torch.no_grad():
    outputs = model(**inputs)

Get the most likely start and end token positions by taking the `argmax` of the start and end logits:

In [78]:
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

Decode the predicted tokens to get the answer:

In [79]:
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)

'wooden'
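
Taking the two argmaxes independently can yield an end index that precedes the start index, which decodes to an empty string. One common remedy is to search for the highest-scoring pair with `start <= end`. Below is a minimal sketch on hand-written logits; the helper `best_span` and the `max_answer_len` value are assumptions for illustration, not a description of what `pipeline` does internally:

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logit + end_logit with start <= end."""
    best = (0, 0, float("-inf"))
    for start, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best[2]:
                best = (start, end, score)
    return best


# Hand-written logits where the independent argmaxes would give end < start
start_logits = [0.1, 0.2, 3.0, 0.5, 0.3]
end_logits = [0.0, 2.5, 0.4, 2.0, 0.1]

start, end, score = best_span(start_logits, end_logits)
print(start, end, score)  # 2 3 5.0
```

With real model outputs, the lists could be obtained from `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`, and the resulting indices used to slice `inputs.input_ids` as in the decoding step above.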