# Contextual question answering

Mateusz Wojtulewicz

# Setup
I'm installing useful libraries and extracting QA datasets.

In [1]:
from IPython.display import clear_output
from google.colab import drive

drive.mount("/content/drive")

Mounted at /content/drive


In [2]:
! tar -xf drive/MyDrive/studia/9semestr/nlp/poquad.tar.gz
! unzip drive/MyDrive/studia/9semestr/nlp/simple-legal-questions-pl.zip

! git clone https://github.com/huggingface/transformers.git

! pip install git+https://github.com/huggingface/transformers
! pip install git+https://github.com/huggingface/datasets
! pip install git+https://github.com/huggingface/evaluate
! pip install sentencepiece

clear_output()
print("Done.")

Done.


# Lab

In [3]:
import random
import json

import pandas as pd

In [4]:
random.seed(77)

### Data preparation

I'm merging the `questions`, `relevant`, `passages` and `answers` dataframes into a single all-in-one dataframe used in the next tasks.

To create a test dataset I'm using questions for which I've provided the answer (i.e. with id in range 1093-1113).
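The `text_x`/`text_y` (and `score_x`/`score_y`) column names that show up after merging come from the default suffixes pandas adds when both sides of a merge carry a column with the same name. A minimal sketch on toy data (the rows and the `toy_*` names are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-ins for the questions, relevant and passages tables.
toy_questions = pd.DataFrame({"question-id": [1, 2], "text": ["Q1?", "Q2?"]})
toy_relevant = pd.DataFrame({"question-id": [1, 2], "passage-id": ["p1", "p2"]})
toy_passages = pd.DataFrame(
    {"passage-id": ["p1", "p2"], "text": ["Context 1", "Context 2"]}
)

toy = (
    toy_questions
    .merge(toy_relevant, on="question-id")   # no clashing columns yet
    .merge(toy_passages, on="passage-id")    # "text" clashes -> text_x, text_y
    .rename(columns={"text_x": "question", "text_y": "context"})
)
print(toy.columns.tolist())
```

The left-hand frame's column receives `_x` and the right-hand one `_y`, which is why the question text ends up in `text_x` and the passage text in `text_y`.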

In [5]:
questions = pd.read_json("simple-legal-questions-pl/questions.jl", lines=True).rename(
    columns={"_id": "question-id"}
)
questions["question-id"] = questions["question-id"].astype(int)

relevant = pd.read_json("simple-legal-questions-pl/relevant.jl", lines=True)
relevant["question-id"] = relevant["question-id"].astype(int)

passages = pd.read_json("simple-legal-questions-pl/passages.jl", lines=True).rename(
    columns={"_id": "passage-id"}
)

answers = pd.read_json("simple-legal-questions-pl/answers.jl", lines=True)

In [6]:
df = (
    questions
    .merge(relevant, on="question-id")
    .merge(passages, on="passage-id")
    .merge(answers, on="question-id")
    .rename(columns={"text_x": "question", "text_y": "context"})
)

In [7]:
df

Unnamed: 0,question-id,question,passage-id,score_x,title,context,score_y,answer
0,22,Co się stanie jeżeli nie zostanie uiszczona op...,1994_195_223,1,Ustawa z dnia 30 czerwca 2000 r. Prawo własnoś...,"Art. 223. 1. Opłaty jednorazowe za zgłoszenia,...",1.0,postępowanie wszczęte w wyniku dokonania zgł...
1,23,Jak uniemożliwienić osobom nieuprawnionym dos...,1999_95_57,1,Ustawa z dnia 22 stycznia 1999 r. o ochronie i...,Art. 57. W celu uniemożliwienia osobom nieupra...,1.0,należy międzyinnymi stosować wyposażenie i u...
2,24,Jaki jest tygodniowy odpoczynek kierowcy?,2001_1354_7,1,Ustawa z dnia 24 sierpnia 2001 r. o czasie pra...,Art. 7. 1. W każdym tygodniu kierowca wykorzys...,1.0,tygodniowy odpoczynek w wymiarze co najmniej 4...
3,25,Do jakich przestępstw nie ma zastosowania częś...,1999_930_20,1,Ustawa z dnia 10 września 1999 r. Kodeks karny...,Art. 20. § 1. Do przestępstw skarbowych nie ma...,1.0,do przestępstw skarbowych
4,26,Co robi sąd po wysłuchaniu głosów stron?,1997_555_408,1,Ustawa z dnia 6 czerwca 1997 r. Kodeks postępo...,Art. 408. Po wysłuchaniu głosów stron sąd niez...,1.0,po wysłuchaniu głosów stron sąd niezwłocznie...
...,...,...,...,...,...,...,...,...
319,1432,Jakim przepisom podlegają przychody kościelnyc...,1995_479_29,1,Ustawa z dnia 30 czerwca 1995 r. o stosunku Pa...,Art. 29. 1. Majątek i przychody kościelnych os...,1.0,"ogólnym przepisom podatkowym, z wyjątkami okre..."
320,1433,Jakim przepisom podlegają przychody kościelnyc...,1995_482_27,1,Ustawa z dnia 30 czerwca 1995 r. o stosunku Pa...,Art. 27. 1. Majątek i przychody kościelnych os...,1.0,"ogólnym przepisom podatkowym, z wyjątkami okre..."
321,1434,Jakim przepisom podlegają przychody kościelnyc...,1997_554_19,1,Ustawa z dnia 13 maja 1994 r. o stosunku Państ...,Art. 19. 1. Majątek i przychody kościelnych os...,1.0,"ogólnym przepisom podatkowym, a w szczególnośc..."
322,1435,Jakim przepisom podlegają przychody kościelnyc...,1995_481_28,1,Ustawa z dnia 30 czerwca 1995 r. o stosunku Pa...,Art. 28. 1. Majątek i przychody Kościoła oraz ...,1.0,"ogólnym przepisom podatkowym, z wyjątkami okre..."


Before splitting into train, val and test datasets I'm dropping duplicates and leaving only questions that have an answer.

In [8]:
df = df[df.score_y == 1]
df = df.drop_duplicates()

#### Splitting dataset

The data subsets are prepared as follows:
1. The `test` dataset consists of examples whose question ids are in the range 1093-1113.
2. The `val` dataset starts as a 20% sample of the remaining examples.
3. The remaining examples form the `train` dataset.
4. Any `train` example whose question also appears in the `val` subset is moved to the `val` dataset.
5. Examples from the `PoQuAD` dataset are added to the `train` dataset so that it has at least 1k examples.
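Steps 2-4 amount to a split that is grouped by question text. A minimal sketch of that idea on toy data, assuming a dataframe with a `question` column (the helper name `split_by_question` and the toy rows are hypothetical):

```python
import pandas as pd


def split_by_question(df: pd.DataFrame, frac: float = 0.2, seed: int = 77):
    """Split rows into (train, val) so that no question text appears in both."""
    # Sample rows first, then move every row that shares a question with the
    # sample into the validation subset.
    sampled = df.sample(frac=frac, random_state=seed)
    val_questions = list(sampled["question"].unique())
    in_val = df["question"].isin(val_questions)
    return df[~in_val], df[in_val]


# Toy data: question "A" is paired with two different contexts.
toy = pd.DataFrame(
    {
        "question": ["A", "A", "B", "C", "D", "E"],
        "context": ["c1", "c2", "c3", "c4", "c5", "c6"],
    }
)
train, val = split_by_question(toy, frac=0.34)
print(len(train), len(val))
```

Because whole question groups are moved, the validation subset can end up larger than the requested fraction; in this notebook it holds 72 of the 259 remaining rows rather than exactly 20%.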

In [9]:
questions_test = set(df.loc[df["question-id"].isin(range(1093, 1114))].question.unique())

In [10]:
questions_test

{'Czy dotychczasowe przepisy wykonawcze zachowują moc po wydaniu nowych przepisów wykonawczych?',
 'Czy prezes Rady Ministrów określa wysokość wynagrodzenia wiceprzewodniczącego Rady Służby Cywilnej?',
 'Czy prowadzący skład celny jest odpowiedzialny za zapewnienie, aby towary złożone w składzie celnym nie zostały usunięte?',
 'Czy spółka partnerska jest spółką handlową?',
 'Do czego jest proporcjonalna wielkość limitu przyznawanego producentowi suszu?',
 'Jakiej szerokości jest pas gruntu stanowiący strefę ochronną?',
 'Kiedy została sporządzona Międzynarodowa Konwencja Przeciwko Braniu Zakładników?',
 'Kto dokonuje przekształcenia funduszu?',
 'Na dołączenie czego do protokołu może zezwolić organ podatkowy?',
 'O czym jest tekst ustawy z dnia 10 kwietnia 1974 r.?',
 'W skład jakiego ministerstwa wchodzi Sztab Generalny Wojska Polskiego?',
 'Z kim należy uzgadniać wysokość corocznych odpisów na fundusze specjalne NBP?',
 'Za co odpowiada osoba przystępująca do spółki w charakterze kom

In [11]:
df = df.loc[:, ["question", "context", "answer"]]

df_test = df[df.question.isin(questions_test)]
df_trainval = df[~df.question.isin(questions_test)]

df_val = df_trainval.sample(frac=0.2, random_state=77)

questions_val = df_val.question.unique()

df_val = df_trainval[df_trainval.question.isin(questions_val)]
df_train = df_trainval[~df_trainval.question.isin(questions_val)]

print(f"Test dataset size : {df_test.shape[0]}")
print(f"Val dataset size  : {df_val.shape[0]}")
print(f"Train dataset size: {df_train.shape[0]}")

Test dataset size : 14
Val dataset size  : 72
Train dataset size: 187


#### Adding data to train subset

The training subset contains only 187 samples, so I'm enlarging it with samples from the `PoQuAD` dataset.

In [12]:
with open("poquad/poquad_train.json") as f:
    poquad = json.load(f)

In [13]:
df_poquad = {"question": [], "context": [], "answer": []}

for topic in poquad["data"]:
    for paragraph in topic["paragraphs"]:
        for qa in paragraph["qas"]:
            if not qa["is_impossible"]:
                df_poquad["question"].append(qa["question"])
                df_poquad["context"].append(paragraph["context"])
                df_poquad["answer"].append(qa["answers"][0]["generative_answer"])

df_poquad = pd.DataFrame(df_poquad)

In [14]:
df_poquad.shape

(30757, 3)

In [15]:
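# random_state makes the sample reproducible (random.seed does not affect pandas);
# ignore_index avoids repeated index values, which are later used as example ids.
df_train = pd.concat(
    [df_train, df_poquad.sample(n=1000, random_state=77)], ignore_index=True
)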

print(f"Train dataset size: {df_train.shape[0]}")

Train dataset size: 1187


#### Save data subsets in SQuAD format

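The subsets are saved as JSON files in the SQuAD dataset format, so they can be used as inputs to the `run_seq2seq_qa.py` script available in the Transformers library. The answers here are free-form rather than spans of the context, so `answer_start` is filled with a placeholder value of 0.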
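Before passing such files to a training script it can be worth running a quick structural check. The sketch below is only an illustration: the helper `check_squad_file` is hypothetical, and the uniqueness check assumes that the downstream evaluation matches predictions to references by `id`.

```python
import json
import tempfile
from pathlib import Path


def check_squad_file(path) -> int:
    """Run basic structural checks and return the number of records."""
    with open(path, encoding="utf8") as f:
        records = json.load(f)["data"]

    ids = [record["id"] for record in records]
    assert len(ids) == len(set(ids)), "duplicate example ids"

    for record in records:
        assert {"id", "context", "question", "answers"} <= record.keys()
        answers = record["answers"]
        assert len(answers["text"]) == len(answers["answer_start"])
    return len(records)


# Hypothetical one-record file in the same layout as train.json.
sample = {
    "data": [
        {
            "id": "0",
            "context": "Art. 1. ...",
            "question": "Q?",
            "answers": {"text": ["A"], "answer_start": [0]},
        }
    ]
}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.json"
    path.write_text(json.dumps(sample, ensure_ascii=False), encoding="utf8")
    n_records = check_squad_file(path)
print(n_records)
```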

In [16]:
def df_to_squad(df: pd.DataFrame, filename: str) -> None:
    data = []
    for index, row in df.iterrows():
        data.append(
            {
                "id": str(index),
                "context": row.context,
                "question": row.question,
                "answers": {
                    "text": [row.answer],
                    "answer_start": [0],
                },
            }
        )

    to_save = {"data": data}

    with open(filename, "w", encoding="utf8") as f:
        json.dump(to_save, f, ensure_ascii=False, indent=2)

In [17]:
df_to_squad(df=df_train, filename="train.json")
df_to_squad(df=df_val, filename="val.json")
df_to_squad(df=df_test, filename="test.json")

### Training two neural models

I'm fine-tuning two pre-trained generative models (`plT5-base` and `mT5-base`) to answer the legal questions in the abstractive QA (AQA) approach.

For that I'm using a script available in the Transformers library. The hyperparameters were mostly left at their default values. The fine-tuning process is run for 10 epochs. The evaluation on the `val` subset is run at the end to compare both models' performance. The metrics used are `F1` and `exact_match`, as those are the default metrics for the QA task.
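For reference, the usual SQuAD-style token-level definition of these metrics, which is also what the `f1_score` helper used later in this notebook implements (that the training script's metric follows the same definition is an assumption here), is computed per example as:

$$
P = \frac{n_{\text{same}}}{|\text{pred}|}, \qquad
R = \frac{n_{\text{same}}}{|\text{gold}|}, \qquad
F_1 = \frac{2 P R}{P + R}
$$

where $n_{\text{same}}$ is the number of tokens shared by the normalized prediction and the normalized reference answer, $|\text{pred}|$ and $|\text{gold}|$ are their token counts, and $F_1 = 0$ when $n_{\text{same}} = 0$. Exact match is 1 when the two normalized strings are identical and 0 otherwise. Both values are averaged over all examples and reported as percentages.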

The fine-tuned models are saved on disk to be used later for evaluation on test questions.

#### Model: `plT5-base`

In [18]:
! python transformers/examples/pytorch/question-answering/run_seq2seq_qa.py \
    --model_name_or_path allegro/plt5-base \
    --do_train \
    --do_eval \
    --do_predict \
    --predict_with_generate \
    --train_file train.json \
    --validation_file val.json \
    --test_file val.json \
    --per_device_train_batch_size 8 \
    --learning_rate 3e-5 \
    --num_train_epochs 10 \
    --max_seq_length 500 \
    --doc_stride 128 \
    --output_dir plt5-base-test

INFO:__main__:Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=True,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_tok

#### Model: `mT5-base`

In [20]:
! python transformers/examples/pytorch/question-answering/run_seq2seq_qa.py \
    --model_name_or_path google/mt5-base \
    --do_train \
    --do_eval \
    --do_predict \
    --predict_with_generate \
    --train_file train.json \
    --validation_file val.json \
    --test_file val.json \
    --per_device_train_batch_size 4 \
    --learning_rate 3e-5 \
    --num_train_epochs 10 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir mt5-base-test

INFO:__main__:Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=True,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_tok

#### Results on `val` dataset

The results on the validation dataset for both models are as follows:

| Model       | f1_score [%] | exact_match [%]|
|-------------|----------|-------------|
| `plT5-base` | 6.50     | 0           |
| `mT5-base`  | 9.94     | 8.33        |

### Evaluation on test questions

For the `mT5-base` model, which performed best on the `val` dataset after fine-tuning, I'm running an evaluation loop on the `test` dataset, which consists of the questions I've provided answers for.

To do that I'm using the Transformers API to generate the answer in an autoregressive manner.
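Autoregressive generation means that the answer is produced one token at a time, with each step conditioned on the tokens generated so far. The toy sketch below illustrates greedy decoding only; it is not what `model.generate` does internally, and `toy_scores` is a made-up lookup table standing in for a real model.

```python
from typing import Callable, Dict, List

EOS = "</s>"  # end-of-sequence marker used by this toy example


def greedy_decode(
    next_scores: Callable[[List[str]], Dict[str, float]],
    max_length: int = 10,
) -> List[str]:
    """Repeatedly append the highest-scoring next token until EOS."""
    tokens: List[str] = []
    for _ in range(max_length):
        scores = next_scores(tokens)
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        tokens.append(best)
    return tokens


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical next-token scores keyed by the tokens generated so far."""
    table = {
        (): {"w": 0.9, EOS: 0.1},
        ("w",): {"dniu": 0.8, EOS: 0.2},
    }
    # Any prefix not listed in the table ends the sequence.
    return table.get(tuple(prefix), {EOS: 1.0})


decoded = greedy_decode(toy_scores)
print(decoded)  # ['w', 'dniu']
```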

In [21]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_path = "./mt5-base-test/"
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

You are using a model of type mt5 to instantiate a model of type t5. This is not supported for all configurations of models and can yield errors.


In [22]:
import torch

@torch.no_grad()
def generate_answer(question: str, context: str) -> str:
    # Build the prompt; `text` is used instead of `input` to avoid shadowing
    # the Python builtin.
    text = f"question: {question} context: {context}"

    encoded_input = tokenizer(
        [text], return_tensors="pt", max_length=500, truncation=True
    )

    model.eval()  # inference mode (disables dropout)
    output = model.generate(
        input_ids=encoded_input.input_ids,
        attention_mask=encoded_input.attention_mask,
        max_length=500,
    )

    # Decode the single generated sequence back to text.
    answer = tokenizer.decode(
        output[0],
        skip_special_tokens=True,
    )

    return answer

In [23]:
from collections import Counter
import string
import re

def normalize_answer(s):
    """Lower text and remove punctuation and extra whitespace."""
    def white_space_fix(text):
        return ' '.join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_punc(lower(s)))


def f1_score(prediction, ground_truth):
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1


def exact_match_score(prediction, ground_truth):
    return (normalize_answer(prediction) == normalize_answer(ground_truth))

In [25]:
predictions = [generate_answer(r["question"], r["context"]) for i, r in df_test.iterrows()]

In [26]:
df = df_test.copy()
df["pred"] = predictions
df["f1"] = df.apply(lambda r: f1_score(r["pred"], r["answer"]), axis=1)
df["exact_match"] = df.apply(lambda r: exact_match_score(r["pred"], r["answer"]), axis=1)

df

Unnamed: 0,question,context,answer,pred,f1,exact_match
47,Jakiej szerokości jest pas gruntu stanowiący s...,Art. 3. 1. Wokół Pomnika Zagłady ustanawia się...,nie większej niż 100 m od granic Pomnika Zagłady,tak,0.0,False
52,Czy prezes Rady Ministrów określa wysokość wyn...,"Art. 19. Prezes Rady Ministrów określa, w drod...",Tak,tak,1.0,True
53,Kto dokonuje przekształcenia funduszu?,Art. 137. 1. Fundusz inwestycyjny zamknięty mo...,towarzystwo funduszy inwestycyjnych,tak,0.0,False
143,W skład jakiego ministerstwa wchodzi Sztab Gen...,Art. 1. 1. Minister Obrony Narodowej jest nacz...,Ministerstwa Obrony Narodowej,tak,0.0,False
144,W skład jakiego ministerstwa wchodzi Sztab Gen...,Art. 44. 1. Organami właściwymi do wyznaczania...,Ministerstwa Obrony Narodowej,tak,0.0,False
148,Z kim należy uzgadniać wysokość corocznych odp...,Art. 65. 1. Zasady tworzenia funduszu premiowe...,z Ministrem Finansów,tak,0.0,False
152,Na dołączenie czego do protokołu może zezwolić...,Art. 175. Organ podatkowy może zezwolić na doł...,zeznania na piśmie podpisanego przez zeznające...,nie,0.0,False
153,Do czego jest proporcjonalna wielkość limitu p...,Art. 60. 1. Wielkość limitu przyznanego produc...,do wielkości produkowanego suszu w poprzednim ...,tak,0.0,False
154,Kiedy została sporządzona Międzynarodowa Konwe...,Art. 1. Wyraża się zgodę na dokonanie przez Pr...,w dniu 18 grudnia 1979 r.,nie,0.0,False
155,Czy spółka partnerska jest spółką handlową?,"Art. 1. §1. Ustawa reguluje tworzenie, organiz...",Tak,tak,1.0,True


In [30]:
f1 = df.f1.mean() * 100
exact_match = df.exact_match.mean() * 100

print("Results on test dataset:")
print(f"F1 score         : {f1:.3f}")
print(f"exact_match score: {exact_match:.3f}")

Results on test dataset:
F1 score         : 26.984
exact_match score: 21.429


### Results on `test` dataset

The following table shows the results of `mT5-base` model on both `val` and `test` datasets:

| dataset | f1_score [%] | exact_match [%] |
|---------|----------|-------------|
| `val`   | 9.94     | 8.33        |
| `test`  | 26.98    | 21.43       |



### Answering questions

#### 1. Which pre-trained model performs better on that task?

The multilingual `mT5-base` model performed better on that task than the Polish-only `plT5-base` model, achieving a higher F1 score (9.94% vs 6.50%) and a higher exact match percentage (8.33% vs 0%).

However, training took longer for the `mT5-base` model, mostly because it is larger than `plT5-base`. To fit the training on the GPU I had to lower the batch size from 8 samples to 4.


#### 2. Does the performance on the validation dataset reflect the performance on your test set?

Surprisingly, it does not: the model performed much better on the `test` dataset than on the `val` dataset. The likely reason is that the `test` dataset contains easier questions with a short `Tak` answer. The fine-tuned model has a strong yes/no bias, so it gets most of those right, and with only 14 test examples each correct answer noticeably raises both the F1 score and the exact match score.
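The reported test scores can be translated back into example counts to see how few answers drive them. This is plain arithmetic on the numbers printed above (14 test examples, exact match 21.429%, F1 26.984%); the variable names are only illustrative.

```python
n_test = 14               # size of the test subset reported above
exact_match_pct = 21.429  # reported exact_match score [%]
f1_pct = 26.984           # reported mean F1 score [%]

# Number of exactly matched answers and the summed per-example F1.
n_exact = round(exact_match_pct / 100 * n_test)
f1_sum = f1_pct / 100 * n_test

# Each exact match contributes F1 = 1, the rest is partial token overlap.
partial_credit = f1_sum - n_exact
points_per_example = 100 / n_test

print(n_exact, round(f1_sum, 2), round(partial_credit, 2))  # 3 3.78 0.78
print(round(points_per_example, 2))                         # 7.14
```

So three exact matches account for about 3.0 of the 3.78 summed F1, and a single additional correct answer would move either metric by roughly 7 percentage points.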


#### 3. What are the outcomes of the model on your own questions? Are they satisfying? If not, what might be the reason for that?

The outcomes on my own questions are not satisfactory. The model appears to be biased towards very short answers: it replies `tak` or `nie` even to questions asking who, what or when. It scores well only on those of my questions that happen to be yes/no questions.



#### 4. Why is extractive question answering not well suited for inflectional languages?

For the question answering task one has to understand the context as well as the answer. In inflectional languages words not only take various forms depending on their usage, but can also appear in an arbitrary order (e.g. _Jan zjadł rybę_ and _Rybę zjadł Jan_ have the same meaning but a different order). This makes it harder to understand the meaning of a sentence and to extract the information needed to answer the question. In addition, an extractive model can only copy a span verbatim from the context, so the answer keeps the inflected form used there, which may not fit the question grammatically.
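A small illustration of the span problem, using the example sentence from above together with a made-up question: the natural answer in the nominative case is not a substring of the context, so an extractive model has no span to point at, while the form that is present carries a different case ending.

```python
context = "Jan zjadł rybę."
question = "Co zostało zjedzone?"  # hypothetical question for this context

natural_answer = "ryba"   # nominative: how one would naturally answer
span_in_context = "rybę"  # accusative: the only form present in the context

# An extractive answer needs a start offset inside the context.
start_natural = context.find(natural_answer)
start_span = context.find(span_in_context)

print(start_natural)     # -1, the natural answer cannot be extracted
print(start_span >= 0)   # True, only the inflected form can be extracted
```

A generative model is free to output `ryba` here, which is one reason the datasets in this notebook store free-form answers.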


#### 5. Why do you have to remove the duplicated questions from the training and the validation subsets?


The validation and test datasets have to contain examples that do not appear in the train dataset, otherwise the evaluation won't be reliable. In the QA task a duplicate is not only an identical (question, context, answer) tuple, but also the same question paired with a different context, because the information being sought is the same.