# Settings

In [None]:
%load_ext autoreload
%autoreload 2


In [None]:
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())


In [None]:
import pathlib

dirpath_data = pathlib.Path("../data")
dirpath_docs = dirpath_data / "docs"
dirpath_splits = dirpath_data / "splits"


# Datasets

The dataset used for this task is the `Question-Answer Dataset`, which originates<br />
from the paper [1] and originally followed the structure described by the<br />
Kaggle dataset page [2]:
> "The `question_answer_pairs.txt` files contain both the questions and answers.<br />
> The columns in this file are as follows:<br />
> * `ArticleTitle` is the name of the Wikipedia article from which questions and answers initially came<br />
> * `Question` is the question<br />
> * `Answer` is the answer<br />
> * `DifficultyFromQuestioner` is the prescribed difficulty rating for the question as given to the question-writer<br />
> * `DifficultyFromAnswerer` is a difficulty rating assigned by the individual who evaluated and answered the question, which may differ from the difficulty in field 4<br />
> * `ArticleFile` is the name of the file with the relevant article"<br />

This structure is, however, not the most convenient for consumption. For that<br />
reason the [ETL](./etl.ipynb) process was designed to transform the data into a<br />
simpler format.

## Data within the `./data/docs` folder

The documents that the Question Answering system must use are stored within<br />
the `./data/docs` folder. Each document is a plain text file holding the content<br />
that must be taken into consideration for the Question Answering task.
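
As a minimal sketch of how these files could be read, assuming each document is a `.txt` file stored directly inside the folder and is mostly UTF-8 encoded (the extension, the encoding and the helper name `read_documents` are assumptions for illustration, not something the dataset guarantees):

```python
import pathlib


def read_documents(dirpath: pathlib.Path) -> dict[str, str]:
    """Map each document's file stem to its raw text (hypothetical helper)."""
    return {
        # errors="replace" keeps going if a file is not valid UTF-8 (assumed encoding).
        path.stem: path.read_text(encoding="utf-8", errors="replace")
        for path in sorted(dirpath.glob("*.txt"))
    }
```

Calling `read_documents(pathlib.Path("../data/docs"))` would then return one entry per document, keyed by file name without the extension.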

## Data within the `./data/splits` folder

Within the `./data/splits` folder there are 2 files, `train.csv` and `test.csv`,<br />
which are CSVs cleaned and transformed from the original `question_answer_pairs.txt`,<br />
with the columns:

* `question`: The question to be answered
* `answer`: The answer to the question, used as ground truth
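
A quick way to confirm that a split follows this layout is to inspect its header. The sketch below uses only the standard library; the helper name `has_expected_columns` and the sample row are illustrative assumptions, not part of the dataset or of the project code:

```python
import csv
import io

EXPECTED_COLUMNS = ["question", "answer"]  # the two columns listed above


def has_expected_columns(csv_text: str) -> bool:
    """Return True if the CSV header contains every expected column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return all(column in header for column in EXPECTED_COLUMNS)


# Illustrative row, not taken from the dataset.
sample = "question,answer\nIs the sky blue?,yes\n"
print(has_expected_columns(sample))
```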

In [None]:
import pandas as pd


In [None]:
def load_dataset(split_name: str) -> pd.DataFrame:
    df = pd.read_csv(dirpath_splits / f"{split_name}.csv")
    return df


In [None]:
load_dataset("train").head()


# Data Preparation

The `RagQA` service does not require any pre-processing (e.g. tokenisation)<br />
of the text prior to inference or training, so the text is otherwise kept in its original form.

Only two normalisation steps are applied. To reduce variance caused by<br />
inconsistent casing, all text is converted to lowercase. Data points<br />
with null values are also removed.

In [None]:
import pandas as pd


In [None]:
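def prepare_datset(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows containing null values before normalising the text.
    df = df.dropna().copy()
    # Cast to str first so non-string values (e.g. numeric answers) do not raise.
    df["question"] = df["question"].astype(str).str.lower()
    df["answer"] = df["answer"].astype(str).str.lower()
    return df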


In [None]:
df_train = prepare_datset(load_dataset("train"))
df_train


# Success Criteria

**Target Metric**: Accuracy Score

Because the answers produced by the RAG system are natural language, they cannot be<br />
evaluated at the token level. Evaluating them requires either a person<br />
or something with near-human judgement over text to<br />
check whether the answers make sense. To solve this problem, we will use<br />
"AVA: an Automatic eValuation Approach for Question Answering Systems" (AVA), proposed by [3].

In terms of the target metric, there are no categories or classes. Each question is<br />
either (a) answered correctly or (b) answered incorrectly. Each answer is evaluated<br />
semantically, not at the token level. For that reason, we will use the Accuracy<br />
score, which is the ratio of correct answers to the total number of questions.

`acc_score = len(corrects) / len(questions)`.
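
As a worked example of the formula above, here is a minimal sketch that takes one boolean judgement per question instead of a list of correct answers. The function name `accuracy_score` and the sample judgements are assumptions made for illustration:

```python
def accuracy_score(corrects: list[bool]) -> float:
    """Ratio of correct answers over the total number of questions."""
    if not corrects:
        raise ValueError("cannot compute accuracy over zero questions")
    # True counts as 1 and False as 0, so the sum is the number of correct answers.
    return sum(corrects) / len(corrects)


# Hypothetical judgements for four questions: three correct, one incorrect.
print(accuracy_score([True, True, False, True]))  # 0.75
```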

In [None]:
from typing import Tuple, Union

from openai import OpenAI

openai_client = OpenAI()


def is_answer_correct(
    y_true: str, y_pred: str, entailment_label: str = "entailment", return_answer: bool = False
) -> Union[bool, Tuple[bool, str]]:
    """Ask the chat model whether `y_pred` means the same thing as `y_true`."""
    instructions = "\n".join(
        [
            "You are a semantic similarity analyst.",
            "",
            "You will read a `premise`.",
            "then you will read a `hypothesis`.",
            "then you will write a `label`:",
            f"- `{entailment_label}`: if `premise` and `hypothesis` could mean the same thing,",
            "- `different`: otherwise",
            "Ensure that your answer is the shortest possible, with no extra format or characters.",
        ]
    )

    job = "\n".join([f"premise=```{y_true}```", f"hypothesis=```{y_pred}```", "label="])

    completion = openai_client.chat.completions.create(
        model="gpt-3.5-turbo", messages=[{"role": "system", "content": instructions}, {"role": "user", "content": job}]
    )

    # `content` may be None, so fall back to an empty label.
    label = completion.choices[0].message.content or ""
    correct = entailment_label.lower() in label.lower()
    return (correct, label) if return_answer else correct


In [None]:
[
    (y_true, y_pred, is_answer_correct(y_true, y_pred, return_answer=True))
    for (y_true, y_pred) in [("yes", "affirmative"), ("yes", "negative"), ("no", "negative")]
]


# Evaluation Protocol

For evaluation, a separate `test` split will be used. This approach is also known<br />
as **Hold-out Evaluation**, and the split is based on the `rtatman/questionanswer-dataset`<br />
dataset [2].
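
The idea behind hold-out evaluation can be sketched as follows. This is not the implementation of `QATrainer`, whose internals are not shown here; `evaluate_holdout`, the lookup "model" and the exact-match judge are stand-ins assumed purely for illustration:

```python
from typing import Callable, Iterable


def evaluate_holdout(
    answer_fn: Callable[[str], str],
    test_pairs: Iterable[tuple[str, str]],
    correctness_fn: Callable[[str, str], bool],
) -> float:
    """Accuracy of `answer_fn` on held-out (question, ground-truth answer) pairs."""
    judgements = [
        correctness_fn(y_true, answer_fn(question))
        for question, y_true in test_pairs
    ]
    if not judgements:
        raise ValueError("the test split is empty")
    return sum(judgements) / len(judgements)


# Stand-ins for illustration only: a lookup "model" and an exact-match judge.
toy_answers = {"is water wet?": "yes", "is fire cold?": "yes"}
toy_test = [("is water wet?", "yes"), ("is fire cold?", "no")]
score = evaluate_holdout(lambda q: toy_answers[q], toy_test, lambda t, p: t == p)
print(score)  # 0.5
```

In the notebook itself, the semantic judge `is_answer_correct` would take the place of the exact-match lambda.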

In [None]:
from hlm12rag.training import QATrainer


In [None]:
df_test = prepare_datset(load_dataset("test"))
df_test


In [None]:
trainer = QATrainer(dataset=df_train, correctness_fn=is_answer_correct)
trainer


# Model Selection

In [None]:
from hlm12rag.modelling import RagQABuilder


## Baseline Model

In [None]:
baseline_qa_model = RagQABuilder(dirpath=dirpath_docs).build()
baseline_qa_out = trainer.train(model=baseline_qa_model)
baseline_qa_out


## Fine Tuning

# Evaluation

# References

```
[1] Smith, N.A., Heilman, M., Hwa, R. 2008. Question generation as a competitive undergraduate course project. In Proceedings of the NSF Workshop on the Question Generation Shared Task and Evaluation Challenge, Online, Source:https://www.cs.cmu.edu/~nasmith/papers/smith+heilman+hwa.nsf08.pdf

[2] Tatman, R. 2018. The Question-Answer Dataset. https://www.kaggle.com/rtatman/questionanswer-dataset

[3] Vu, T., Moschitti, A. 2021. AVA: an Automatic eValuation Approach for Question Answering Systems. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 5223–5233. DOI:https://doi.org/10.18653/v1/2021.naacl-main.412

```