# Quality Estimation: Process NarrativeQA

This notebook processes the NarrativeQA dataset for the experiment _Multi-task QE regression_. It roughly follows the structure below:

1. __Filtering__: remove the instances that overlap between the original dataset and MOCHA. The non-overlapping examples are stored in the __raw/__ folder to enhance reproducibility.
2. __Generate__: generate candidate answers using the specified models. 
3. __Evaluate__: compute the evaluation metrics for the generated answers.

The processed version of the dataset will contain tuples $\{(c, q, a, \hat{a}, m_1, \cdots, m_m, s)\}$ and will be written to __OUTPUT_DIR__/processed. If several models are specified, separate files will be produced.
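
A minimal sketch of what one processed record and its on-disk form could look like. The `ProcessedExample` class, its field names, the JSON Lines layout and the per-model file name are all assumptions for illustration, not the format the experiment necessarily uses. The meaning of $s$ is not specified above, so it is kept as an optional float.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Dict, Optional


@dataclass
class ProcessedExample:
    """One tuple (c, q, a, a_hat, m_1..m_m, s); all field names are hypothetical."""
    context: str    # c
    question: str   # q
    reference: str  # a
    candidate: str  # a_hat, produced by a generator model
    metrics: Dict[str, float] = field(default_factory=dict)  # m_1, ..., m_m
    score: Optional[float] = None  # s (not defined in the text above)


def processed_path(output_dir: str, model_name: str) -> Path:
    """Hypothetical layout: one file per generator model under <output_dir>/processed."""
    safe_name = model_name.replace("/", "__")
    return Path(output_dir) / "processed" / f"{safe_name}.jsonl"


def write_processed(examples, output_dir: str, model_name: str) -> Path:
    """Write one JSON object per line and return the path that was written."""
    path = processed_path(output_dir, model_name)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(asdict(example), ensure_ascii=False) + "\n")
    return path
```

Keeping the metrics in a dictionary rather than in fixed columns means the number of metrics can change without changing the record type.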

Note: An important aspect to take into consideration when evaluating this model is __whether the results are sensitive to the _generator_'s model distribution__. 

## 1. Filtering 

We first load MOCHA's dataset and then use __datasets__ (the Hugging Face library) to load the whole NarrativeQA dataset. The latter is formatted as follows:

```python
Features({
    "document": {
        "id": datasets.Value("string"),
        "kind": datasets.Value("string"),
        "url": datasets.Value("string"),
        "file_size": datasets.Value("int32"),
        "word_count": datasets.Value("int32"),
        "start": datasets.Value("string"),
        "end": datasets.Value("string"),
        "summary": {
            "text": datasets.Value("string"),
            "tokens": datasets.features.Sequence(datasets.Value("string")),
            "url": datasets.Value("string"),
            "title": datasets.Value("string"),
        },
        "text": datasets.Value("string"),
    },
    "question": {
        "text": datasets.Value("string"),
        "tokens": datasets.features.Sequence(datasets.Value("string")),
    },
    "answers": [
        {
            "text": datasets.Value("string"),
            "tokens": datasets.features.Sequence(datasets.Value("string")),
        }
    ],      
})
```

As mentioned in the [MOCHA paper](https://arxiv.org/pdf/2010.03636.pdf), the candidates are generated from the dev and test sets, so only those two splits need to be filtered.
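
The filtering step can be sketched as a set-membership test. This is an illustration under assumptions: it matches instances on a normalised (context, question) pair, and the key names `context` and `question` as well as the helper names are hypothetical. MOCHA's actual file layout and the matching rule used in the experiment may differ.

```python
def normalise(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not hide a match."""
    return " ".join(text.lower().split())


def overlap_key(context, question):
    """Build the (assumed) identity of an instance from its context and question."""
    return (normalise(context), normalise(question))


def filter_overlapping(narrqa_examples, mocha_examples):
    """Return the NarrativeQA examples whose (context, question) pair is absent from MOCHA."""
    mocha_keys = {overlap_key(ex["context"], ex["question"]) for ex in mocha_examples}
    return [
        ex for ex in narrqa_examples
        if overlap_key(ex["context"], ex["question"]) not in mocha_keys
    ]


# Toy data, invented for illustration only.
narrqa_toy = [
    {"context": "A priest moves to a village.", "question": "Who moves?"},
    {"context": "A doctor visits.", "question": "Who visits?"},
]
mocha_toy = [{"context": "a priest  moves to a village.", "question": "who moves?"}]

kept = filter_overlapping(narrqa_toy, mocha_toy)
print(len(kept), kept[0]["question"])
```

Building the set of MOCHA keys once keeps each lookup constant-time, which matters when the same check runs over every example in a split.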


In [None]:
import datasets
print("datasets version:", datasets.__version__)

narrqa = datasets.load_dataset("narrativeqa", split="train")
narrqa

In [16]:
def extract_text(example):
    # Keep only the fields needed downstream.
    document = example["document"]
    return {
        "doc_id": document["id"],
        "context": document["summary"]["text"],
        "question": example["question"]["text"],
        # "answers" is a list of {"text", "tokens"} entries, one per reference answer
        "references": [answer["text"] for answer in example["answers"]],
    }

# We have to extract the context, question and answers text
narrqa = narrqa.map(extract_text)

## 2. Generate

Let's generate the training set for our QE model.

In [10]:
# reference code: 
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "allenai/unifiedqa-t5-small" # you can specify the model size here
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def run_model(input_string, **generator_args):
    input_ids = tokenizer.encode(input_string, return_tensors="pt")
    res = model.generate(input_ids, **generator_args)
    return tokenizer.batch_decode(res, skip_special_tokens=True)
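
`run_model` takes a single input string. Below is a minimal sketch of how such a string could be assembled, assuming the question-then-context layout with a literal `\n` separator (a backslash followed by `n`, not a newline) described in the UnifiedQA repository. `build_input` is a hypothetical helper, and the lowercasing is an assumption worth checking against the model's documentation.

```python
def build_input(question, context):
    """Join question and context with a literal backslash-n separator (two characters)."""
    question = " ".join(question.split())  # collapse whitespace, including real newlines
    context = " ".join(context.split())
    return f"{question} \\n {context}".lower()


# Toy question and context, invented for illustration only.
example_input = build_input(
    "Who places Serge in care?",
    "His distant relative, the doctor Pascal Rougon, places him in care.",
)
print(example_input)
```

The resulting string could then be passed on, for example as `run_model(example_input, max_length=32)`, where the `max_length` value is only illustrative.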



In [9]:
import transformers

model_name = "allenai/unifiedqa-t5-small"
print("Loading model", model_name)
model_tokenizer = transformers.T5TokenizerFast.from_pretrained(model_name)
model = transformers.T5ForConditionalGeneration.from_pretrained(model_name)



{'0005c7718ff653683df879622efb02d1': {'candidate': 'his distant relative pascal rougon',
  'context': "The plot centres on the neurotic young priest Serge Mouret, first seen in La Conquête de Plassans, as he takes his orders and becomes the parish priest for the uninterested village of Artauds. The inbred villagers have no interest in religion and Serge is portrayed giving several wildly enthusiastic Masses to his completely empty, near-derelict church. Serge not only seems unperturbed by this state of affairs but actually appears to have positively sought it out especially, for it gives him time to contemplate religious affairs and to fully experience the fervour of his faith. Eventually he has a complete nervous breakdown and collapses into a near-comatose state, whereupon his distant relative, the unconventional doctor Pascal Rougon (the central character of the last novel in the series, 1893's Le Docteur Pascal), places him in the care of the inhabitants of a nearby derelict state