# ListQA

In this notebook, we will see how to evaluate a model on questions that have lists as answers.

# Dependencies

If not already done, make sure to install PrimeQA with `notebooks` extras before getting started.

In [1]:
# If you want CUDA 11 uncomment and run this (for CUDA 10 or CPU you can ignore this line).
#! pip install 'torch~=1.11.0' --extra-index-url https://download.pytorch.org/whl/cu113

# Uncomment to install PrimeQA from source (pypi package pending).
# The path should be the project root (e.g. '.' below).
#! pip install .[notebooks]

# NQ List Subset

Create train and dev subsets of Natural Questions (NQ) that contain only questions whose answers are lists.

In [2]:
from examples.listqa.list_nq2tydi import ListNQSubset

input_train = "/input/train/nq/files/nq-train-0*.jsonl.gz"
input_dev = "/input/dev/nq/files/nq-dev-*jsonl.gz"
output_data_dir = "output/list_data/dir/here"

# num_lines is the number of lines to read from each file; use -1 to load the full dataset.
# We only load a subset here for illustration purposes.
processor = ListNQSubset()
processor.process(input_train, output_data_dir + "/nq-train-lists.jsonl", num_lines=5000)
processor.process(input_dev, output_data_dir + "/nq-dev-lists.jsonl", num_lines=1000)

INFO:root:Loading NQ data...
INFO:root:/dccstor/srosent1/projects/mlqa/data/NQ/train/nq-train-08.jsonl.gz
100%|██████████| 5000/5000 [01:21<00:00, 61.57it/s]
INFO:root:Done
INFO:root:Converting to tydi
100%|██████████| 5000/5000 [00:47<00:00, 105.50it/s]
INFO:root:data size: 36
INFO:root:/dccstor/srosent1/projects/mlqa/data/NQ/train/nq-train-07.jsonl.gz
100%|██████████| 5000/5000 [01:20<00:00, 62.44it/s]
INFO:root:Done
INFO:root:Converting to tydi
100%|██████████| 5000/5000 [00:49<00:00, 100.11it/s]
INFO:root:data size: 33
INFO:root:/dccstor/srosent1/projects/mlqa/data/NQ/train/nq-train-02.jsonl.gz
100%|██████████| 5000/5000 [01:22<00:00, 60.55it/s]
INFO:root:Done
INFO:root:Converting to tydi
100%|██████████| 5000/5000 [00:47<00:00, 105.50it/s]
INFO:root:data size: 27
INFO:root:/dccstor/srosent1/projects/mlqa/data/NQ/train/nq-train-01.jsonl.gz
100%|██████████| 5000/5000 [01:22<00:00, 60.34it/s]
INFO:root:Done
INFO:root:Converting to tydi
100%|██████████| 5000/5000 [00:47<00:00, 105.02i
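
To make the `num_lines` parameter concrete, the sketch below reads at most `num_lines` records from a gzipped JSON Lines file using only the standard library, treating -1 as "read everything". The helper `read_jsonl_gz` and the temporary demo file are hypothetical and only illustrate the idea; this is not how `ListNQSubset` is implemented.

```python
import gzip
import itertools
import json
import os
import tempfile


def read_jsonl_gz(path, num_lines=-1):
    """Return up to num_lines parsed records from a gzipped JSONL file (-1 reads all)."""
    limit = None if num_lines < 0 else num_lines
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in itertools.islice(f, limit)]


# Build a tiny demo file (hypothetical data) so the sketch is runnable on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    for i in range(5):
        f.write(json.dumps({"example_id": i}) + "\n")

first_two = read_jsonl_gz(demo_path, num_lines=2)
everything = read_jsonl_gz(demo_path, num_lines=-1)
print(len(first_two), len(everything))
```

Reading lazily with `itertools.islice` avoids decompressing the whole file when only a small sample is needed.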

# Configuration

We start by setting some parameters to configure the process. Note that, depending on the GPU being used, you may need to tune the batch size.

In [3]:
# This needs to be filled in.
output_dir = '/result/output/dir/here'        # Save the results here.  Will overwrite if directory already exists.

# Optional parameters (feel free to leave as default).
model_name = 'roberta-base'      # Set this to select the LM.  Use roberta-base or, for better results, a fine-tuned QA model (e.g. one trained on SQuAD or TyDi QA with run_mrc).
cache_dir = None                 # Set this if you have a cache directory for transformers.  Alternatively set the HF_HOME env var.
train_batch_size = 16             # Set this to change the number of features per batch during training.
eval_batch_size = 16              # Set this to change the number of features per batch during evaluation.
gradient_accumulation_steps = 8  # Set this to effectively increase training batch size.
max_train_samples = None          # Set this to use a subset of the training data (or None for all).
max_eval_samples = 10            # Set this to use a subset of the evaluation data (or None for all).
num_train_epochs = 1             # Set this to change the number of training epochs.
fp16 = False                     # Set this to true to enable fp16 (hardware support required).
num_examples_to_show = 5        # Set this to change the number of random train examples (and their features) to show.
max_seq_length = 512             # Use a large sequence length to accommodate lists, which can be long (we don't want to split lists).
max_answer_length = 1000         # Use a large answer length to accommodate lists, which can be long.
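
As a quick check on how the batch settings interact, the effective training batch size is the per-device batch size multiplied by the number of gradient accumulation steps and the number of devices. A minimal sketch, assuming a single device (`num_devices` is a hypothetical variable, not a notebook parameter):

```python
train_batch_size = 16
gradient_accumulation_steps = 8
num_devices = 1  # assumption: a single GPU or CPU

# Gradients are accumulated over several batches before each optimizer step.
effective_batch_size = train_batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)  # 128
```

If the per-device batch size has to be lowered to fit in GPU memory, raising `gradient_accumulation_steps` by the same factor keeps this product unchanged.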

In [4]:
from transformers import TrainingArguments
from transformers.trainer_utils import set_seed

seed = 42
set_seed(seed)

training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    do_train=False,
    do_eval=True,
    per_device_train_batch_size=train_batch_size,
    per_device_eval_batch_size=eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    num_train_epochs=num_train_epochs,
    evaluation_strategy='no',
    learning_rate=5e-05,
    warmup_ratio=0.1,
    weight_decay=0.1,
    save_steps=50000,
    fp16=fp16,
    seed=seed,
)

# Loading the Model

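Here we load the model and tokenizer based on the `model_name` parameter set above. We use a model with an extractive QA task head. Note that this notebook only runs evaluation (`do_train=False`), so with a base checkpoint such as `roberta-base` the task head weights are newly initialized, as the warning in the output below indicates.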

In [5]:
from transformers import AutoConfig, AutoTokenizer
from primeqa.mrc.models.heads.extractive import EXTRACTIVE_HEAD
from primeqa.mrc.models.task_model import ModelForDownstreamTasks

from primeqa.mrc.trainers.mrc import MRCTrainer

task_heads = EXTRACTIVE_HEAD
config = AutoConfig.from_pretrained(
    model_name,
    cache_dir=cache_dir,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    use_fast=True,
    config=config,
)
model = ModelForDownstreamTasks.from_config(
    config,
    model_name,
    task_heads=task_heads,
    cache_dir=cache_dir,
)
model.set_task_head(next(iter(task_heads)))

print(model)  # Examine the model structure

INFO:ExtractiveQAHead:Loading dropout value 0.1 from config attribute 'hidden_dropout_prob'
Some weights of RobertaModelForDownstreamTasks were not initialized from the model checkpoint at roberta-base and are newly initialized: ['task_heads.qa_head.classifier.dense.weight', 'task_heads.qa_head.classifier.out_proj.weight', 'task_heads.qa_head.classifier.dense.bias', 'task_heads.qa_head.qa_outputs.weight', 'task_heads.qa_head.classifier.out_proj.bias', 'task_heads.qa_head.qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:RobertaModelForDownstreamTasks:Setting task head for first time to 'None'


RobertaModelForDownstreamTasks(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNor

# Loading Data

Here we load the NQ list subsets created above from disk. They are stored in the TyDi QA (Google) format.

In [6]:
import datasets

raw_dataset = datasets.load_dataset('json', 
        data_files={'train':[output_data_dir+"/nq-train-lists.jsonl"],'dev':[output_data_dir+"/nq-dev-lists.jsonl"]},
        cache_dir=training_args.output_dir)
train_examples = raw_dataset['train']

if max_train_samples is not None:
    # Keep only the first max_train_samples examples (capped at the dataset size).
    train_examples = train_examples.select(range(min(max_train_samples, len(train_examples))))
eval_examples = raw_dataset['dev']
if max_eval_samples is not None:
    # Keep only the first max_eval_samples examples (capped at the dataset size).
    eval_examples = eval_examples.select(range(min(max_eval_samples, len(eval_examples))))

print(f"Using {eval_examples.num_rows} eval examples.")



Downloading and preparing dataset json/default to /dccstor/srosent2/primeqa/experiments/run_nq/json/default-71f68cebf752c591/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Dataset json downloaded and prepared to /dccstor/srosent2/primeqa/experiments/run_nq/json/default-71f68cebf752c591/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Using 10 eval examples.


# Preprocessing

Here we preprocess the data to create features which can be given to the model.

In [7]:
from primeqa.mrc.processors.preprocessors.tydiqa_google import TyDiQAGooglePreprocessor

preprocessor = TyDiQAGooglePreprocessor(
    stride=256,
    tokenizer=tokenizer,
    max_seq_len=max_seq_length,
)

# Training Feature Creation
with training_args.main_process_first(desc="train dataset map pre-processing"):
    train_examples, train_dataset = preprocessor.process_train(train_examples)

print(f"Preprocessing produced {train_dataset.num_rows} train features from {train_examples.num_rows} examples.")

# Validation Feature Creation
with training_args.main_process_first(desc="validation dataset map pre-processing"):
    eval_examples, eval_dataset = preprocessor.process_eval(eval_examples)

print(f"Preprocessing produced {eval_dataset.num_rows} eval features from {eval_examples.num_rows} examples.")

INFO:TyDiQAGooglePreprocessor:TyDiQAGooglePreprocessor only supports single context multiple passages -- enabling


  0%|          | 0/671 [00:00<?, ?ex/s]

  0%|          | 0/671 [00:00<?, ?ex/s]

Casting the dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

Running tokenizer on train dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

Preprocessing produced 2396 train features from 671 examples.


  0%|          | 0/10 [00:00<?, ?ex/s]

  0%|          | 0/10 [00:00<?, ?ex/s]

Casting the dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

Running tokenizer on eval dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

Preprocessing produced 169 eval features from 10 examples.
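
Each example yields several features because long documents are split into overlapping windows. The sketch below shows the arithmetic, assuming `stride` is the number of tokens shared by consecutive windows (as with the `stride` argument of Hugging Face tokenizers) and ignoring the room taken by the question and special tokens. The function `window_spans` and the token counts are hypothetical and are not part of PrimeQA.

```python
def window_spans(num_tokens, max_len, overlap):
    """Return (start, end) token spans covering num_tokens with the given overlap."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    spans = []
    start = 0
    while True:
        end = min(start + max_len, num_tokens)
        spans.append((start, end))
        if end == num_tokens:
            break
        # Advance so that the next window re-reads the last `overlap` tokens.
        start += max_len - overlap
    return spans


# Hypothetical document of 1200 context tokens with 500 tokens of room per window.
spans = window_spans(num_tokens=1200, max_len=500, overlap=256)
print(spans)  # [(0, 500), (244, 744), (488, 988), (732, 1200)]
```

A larger overlap makes it more likely that a long list fits entirely inside at least one window, at the cost of producing more features per document.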


In [8]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

# Based on https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb
def show_elements(dataset):
    df = pd.DataFrame(dataset)
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [9]:
import random

from datasets.features.features import Value

def trim_document(example, max_len=500):
    # Keep only the first context string and truncate it for display.
    example['context'] = example['context'][0]
    doc_len = len(example['context'])
    if doc_len > max_len:
        example['context'] = f"{example['context'][:max_len - 3]}..."
    return example

train_examples = train_examples.cast_column('example_id', Value(dtype="int64", id=None))
print(train_examples.features)

random_idxs = random.sample(range(len(train_examples)), num_examples_to_show)
random_examples = train_examples.select(random_idxs).remove_columns(['document_plaintext', 'passage_candidates'])
random_examples = random_examples.map(trim_document)

show_elements(random_examples)  # Show random train examples

Casting the dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

{'document_plaintext': Value(dtype='string', id=None), 'document_title': Value(dtype='string', id=None), 'document_url': Value(dtype='string', id=None), 'example_id': Value(dtype='int64', id=None), 'language': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None), 'passage_candidates': {'end_positions': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'start_positions': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}, 'target': {'end_positions': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'passage_indices': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'start_positions': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'yes_no_answer': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}, 'context': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}


  0%|          | 0/5 [00:00<?, ?ex/s]

Unnamed: 0,document_title,document_url,example_id,language,question,target,context
0,The Phantom of the Opera (2004 film),https://en.wikipedia.org//w/index.php?title=The_Phantom_of_the_Opera_(2004_film)&amp;oldid=832775247,8076043001083754456,english,who plays in phantom of the opera movie,"{'end_positions': [11286], 'passage_indices': [45], 'start_positions': [10758], 'yes_no_answer': ['NONE']}","The Phantom of the Opera ( 2004 film ) - wikipedia The Phantom of the Opera ( 2004 film ) Jump to : navigation , search The Phantom of the Opera Theatrical release poster Directed by Joel Schumacher Produced by Andrew Lloyd Webber Screenplay by * Andrew Lloyd Webber * Joel Schumacher Based on The Phantom of the Opera by Andrew Lloyd Webber Charles Hart Richard Stilgoe Gaston Leroux Le Fantôme de l'Opéra by Gaston Leroux Starring * Gerard Butler * Emmy Rossum * Patrick Wilson * Miranda Richard..."
1,Rostow's stages of growth,https://en.wikipedia.org//w/index.php?title=Rostow%27s_stages_of_growth&amp;oldid=808032643,7124827825473523345,english,which are the stages of rostow's five-stage model of economic development,"{'end_positions': [1551], 'passage_indices': [31], 'start_positions': [1435], 'yes_no_answer': ['NONE']}","Rostow 's Stages of growth - wikipedia Rostow 's Stages of growth Jump to : navigation , search Economics A supply and demand diagram , illustrating the effects of an increase in demand . * Index * Outline * * History * Types * Classification * History of economics * Economic history ( academic study ) * Schools of economics * Microeconomics * Macroeconomics * Methodology * Heterodox economics * JEL classification codes * Concepts * Theory * Techniques * Econometrics * Economic growth * Econo..."
2,Battle of the Sexes (film),https://en.wikipedia.org//w/index.php?title=Battle_of_the_Sexes_(film)&amp;oldid=819314756,-8379208631402330602,english,who stars in the battle of the sexes,"{'end_positions': [4736], 'passage_indices': [31], 'start_positions': [3499], 'yes_no_answer': ['NONE']}","Battle of the Sexes ( film ) - wikipedia Battle of the Sexes ( film ) Jump to : navigation , search Battle of the Sexes Theatrical release poster Directed by * Jonathan Dayton * Valerie Faris Produced by * Christian Colson * Danny Boyle * Robert Graf Written by Simon Beaufoy Starring * Emma Stone * Steve Carell * Sarah Silverman * Bill Pullman * Alan Cumming * Elisabeth Shue * Austin Stowell * Eric Christian Olsen Music by Nicholas Britell Cinematography Linus Sandgren Edited by Pamela Martin..."
3,The Correct Way to Kill,https://en.wikipedia.org//w/index.php?title=The_Correct_Way_to_Kill&amp;oldid=811751064,2759805113857343936,english,the avengers the correct way to kill cast,"{'end_positions': [1371], 'passage_indices': [18], 'start_positions': [1050], 'yes_no_answer': ['NONE']}","The Correct Way to Kill - Wikipedia The Correct Way to Kill 9th episode of the fifth season of The Avengers `` The Correct Way to Kill '' The Avengers episode Episode no . Season 5 Episode 9 Directed by Charles Crichton Written by Brian Clemens ( teleplay ) Produced by Albert Fennell , Brian Clemens , Julian Wintle Featured music Laurie Johnson , John Dankworth ( theme ) Production code 5 - 9 Original air date 11 March 1967 ( 1967 - 03 - 11 ) Guest appearance ( s ) Anna Quayle Michael Gough P..."
4,Time in the United States,https://en.wikipedia.org//w/index.php?title=Time_in_the_United_States&amp;oldid=798249631,-2599496890159211441,english,where does it change from eastern to central time,"{'end_positions': [11264], 'passage_indices': [47], 'start_positions': [10245], 'yes_no_answer': ['NONE']}","Time in the United States - wikipedia Time in the United States Jump to : navigation , search This article needs additional citations for verification . Please help improve this article by adding citations to reliable sources . Unsourced material may be challenged and removed . ( March 2013 ) ( Learn how and when to remove this template message ) Time in the United States , by law , is divided into nine standard time zones covering the states and its possessions , with most of the United Stat..."


In [10]:
from primeqa.mrc.data_models.target_type import TargetType

random_train_dataset = train_dataset.filter(lambda feature: feature['example_idx'] in random_idxs[:1]).remove_columns(['attention_mask', 'offset_mapping'])
show_elements(random_train_dataset) # Show random train features

  0%|          | 0/3 [00:00<?, ?ba/s]

Unnamed: 0,example_id,input_ids,example_idx,start_positions,end_positions,target_type
0,8076043001083754456,"[0, 8155, 1974, 11, 43741, 9, 5, 15382, 1569, 2, 2, 10248, 124, 159, 7, 39, 46502, 479, 38367, 272, 12708, 924, 4833, 5156, 147, 5, 30099, 1074, 2156, 8, 37, 1411, 7, 3906, 10248, 479, 20, 30099, 1572, 10248, 7, 218, 5, 3312, 3588, 8, 683, 456, 29200, 293, 39, 657, 2156, 8, 3365, 10248, 7, 12908, 123, 479, 10248, 5741, 7, 8838, 5, 30099, 14, 79, 473, 45, 2490, 39, 1717, 7210, 6601, 2156, 53, 1195, 39, 6378, 8, 10640, 7, 3549, 7, 120, 99, 37, 1072, 479, 1801, 172, 2156, 4833, 5156, 13328, 5, 46502, 2156, 8, ...]",423,345,486,1
1,8076043001083754456,"[0, 8155, 1974, 11, 43741, 9, 5, 15382, 1569, 2, 2, 14938, 5684, 129, 5, 30099, 128, 29, 1104, 11445, 479, 15008, 124, 7, 35284, 2156, 5, 7497, 4833, 5156, 5695, 10248, 128, 29, 21264, 1459, 8, 2127, 5, 930, 2233, 583, 69, 25660, 4670, 479, 91, 3311, 11, 7308, 13, 10, 1151, 8, 172, 4072, 7, 989, 2156, 53, 6897, 2115, 27515, 10, 2310, 1275, 1458, 19, 10, 909, 21041, 3016, 198, 5, 10114, 19, 10248, 128, 29, 4921, 3758, 7391, 7, 24, 25606, 28695, 14, 5, 30099, 16, 4299, 2156, 8, 202, 6138, 10248, 479, 6719, 36, 17668, ...]",423,101,242,1


In [11]:
from operator import attrgetter
from transformers import DataCollatorWithPadding
from primeqa.mrc.data_models.eval_prediction_with_processing import EvalPredictionWithProcessing
from primeqa.mrc.metrics.tydi_f1.tydi_f1 import TyDiF1
from primeqa.mrc.processors.postprocessors.extractive import ExtractivePostProcessor
from primeqa.mrc.processors.postprocessors.scorers import SupportedSpanScorers

# If using mixed precision we pad for efficient hardware acceleration
using_mixed_precision = any(attrgetter('fp16', 'bf16')(training_args))
data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=64 if using_mixed_precision else None)

# noinspection PyProtectedMember
postprocessor = ExtractivePostProcessor(
    k=3,
    n_best_size=20,
    max_answer_length=max_answer_length,
    scorer_type=SupportedSpanScorers.WEIGHTED_SUM_TARGET_TYPE_AND_SCORE_DIFF,
    single_context_multiple_passages=preprocessor._single_context_multiple_passages,
)

def compute_metrics(p: EvalPredictionWithProcessing):
    return TyDiF1().compute(predictions=p.processed_predictions, references=p.label_ids, passage_non_null_threshold=1, span_non_null_threshold=1)

trainer = MRCTrainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=eval_dataset if training_args.do_eval else None,
    eval_examples=eval_examples if training_args.do_eval else None,
    tokenizer=tokenizer,
    data_collator=data_collator,
    post_process_function=postprocessor.process_references_and_predictions,  # see QATrainer in Huggingface
    compute_metrics=compute_metrics,
)
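
An extractive head scores every token as a possible answer start and as a possible answer end, and a postprocessor turns those scores into spans. The sketch below shows the simplest version of that idea: an exhaustive search for the span with the highest summed start and end logit, subject to a length limit. It is not the `ExtractivePostProcessor` implementation, which takes further settings such as `n_best_size` and a scorer type. The function `best_span`, the logits, and the choice to measure length in tokens are all assumptions made for illustration.

```python
def best_span(start_logits, end_logits, max_answer_length):
    """Return (score, start, end) for the highest-scoring span within the length limit."""
    best = None
    for start, s_logit in enumerate(start_logits):
        # Only consider ends that keep the span at most max_answer_length tokens long.
        last = min(start + max_answer_length, len(end_logits))
        for end in range(start, last):
            score = s_logit + end_logits[end]
            if best is None or score > best[0]:
                best = (score, start, end)
    return best


# Hypothetical logits for a six-token passage.
start_logits = [0.1, 2.0, 0.3, 0.2, 0.1, 0.0]
end_logits = [0.0, 0.1, 0.2, 0.4, 3.0, 0.1]

print(best_span(start_logits, end_logits, max_answer_length=6))
print(best_span(start_logits, end_logits, max_answer_length=2))
```

With the generous limit the search selects tokens 1 to 4; with a limit of 2 it has to settle for tokens 3 to 4. This is why a small answer-length limit would cut off list answers and why the notebook sets `max_answer_length` so high.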

# Evaluation

Here we evaluate the model on the validation set.

In [12]:
max_eval_samples = max_eval_samples or len(eval_dataset)

metrics = trainer.evaluate()
metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))

trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)

INFO:primeqa.mrc.trainers.mrc:The following columns in the evaluation set  don't have a corresponding argument in `RobertaModelForDownstreamTasks.forward` and have been ignored: example_id, example_idx, offset_mapping.
***** Running Evaluation *****
  Num examples = 169
  Batch size = 16


100%|██████████| 10/10 [00:01<00:00,  5.89it/s]

Passage & english & \fpr{0.0}{0.0}{0.0}
Minimal Answer & english & \fpr{8.9}{11.8}{7.1}
********************
english
Language: english (10)
********************
PASSAGE ANSWER R@P TABLE:
Optimal threshold: 0.0
 F1     /  P      /  R
  0.00% /   0.00% /   0.00%
R@P=0.5: 0.00% (actual p=0.00%, score threshold=0.0)
R@P=0.75: 0.00% (actual p=0.00%, score threshold=0.0)
R@P=0.9: 0.00% (actual p=0.00%, score threshold=0.0)
********************
MINIMAL ANSWER R@P TABLE:
Optimal threshold: 0.45009
 F1     /  P      /  R
  8.87% /  11.83% /   7.10%
R@P=0.5: 0.00% (actual p=0.00%, score threshold=0.0)
R@P=0.75: 0.00% (actual p=0.00%, score threshold=0.0)
R@P=0.9: 0.00% (actual p=0.00%, score threshold=0.0)
Total # examples in gold: 10, # ex. in pred: 10 (including english)
*** Macro Over 0 Languages, excluding English **
Passage F1:0.000 P:0.000 R:0.000000
\fpr{0.0}{0.0}{0.0}
Minimal F1:0.000 P:0.000 R:0.000000
\fpr{0.0}{0.0}{0.0}
*** / Aggregate Scores ****
{"avg_passage_f1": 0, "avg_passage_re
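
The F1 values in the tables above are the harmonic mean of precision and recall. As a sanity check, the sketch below recomputes the minimal-answer F1 from the reported precision and recall; `f1_score` is a hypothetical helper, not part of the `TyDiF1` metric.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the minimal-answer table, in percent.
print(round(f1_score(11.83, 7.10), 2))  # 8.87
# Precision and recall are both zero in the passage-answer table.
print(f1_score(0.0, 0.0))  # 0.0
```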




# Predictions

Here we examine the model predictions.
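
The predictions file maps each example id to a list of candidate answers, each carrying fields such as `confidence_score` and `span_answer_text`. A minimal sketch for keeping only the highest-confidence candidate per example follows; the helper `top_answers` and the sample data are hypothetical and only reuse the field names seen in the file.

```python
def top_answers(predictions):
    """Map each example id to its highest-confidence candidate answer."""
    return {
        example_id: max(candidates, key=lambda c: c["confidence_score"])
        for example_id, candidates in predictions.items()
        if candidates  # skip examples without any candidate
    }


# Hypothetical predictions using the same field names as eval_predictions.json.
sample_predictions = {
    "1001": [
        {"confidence_score": 0.34, "span_answer_text": "first candidate"},
        {"confidence_score": 0.21, "span_answer_text": "second candidate"},
    ],
    "1002": [
        {"confidence_score": 0.10, "span_answer_text": "only candidate"},
    ],
}

for example_id, answer in top_answers(sample_predictions).items():
    print(example_id, answer["confidence_score"], answer["span_answer_text"])
```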

In [13]:
import json
import os
from pprint import pprint

with open(os.path.join(output_dir, 'eval_predictions.json'), 'r') as f:
    predictions = json.load(f)

pprint(predictions)

{'-1717600980747760095': [{'cls_score': -0.06888147257268429,
                           'confidence_score': 0.33520141358132644,
                           'end_index': 342,
                           'end_logit': 0.7157881259918213,
                           'end_stdev': 0.0,
                           'example_id': '-1717600980747760095',
                           'normalized_span_answer_score': 0.33520141358132644,
                           'passage_index': -1,
                           'query_passage_similarity': 0.0,
                           'span_answer': {'end_position': 10622,
                                           'start_position': 9709},
                           'span_answer_score': 0.49595116917043924,
                           'span_answer_text': 'fying that I felt not uplifted '
                                               "-- but sandbagged ! '' And "
                                               'John Simon -- later notorious '
                          