# ListQA

In this notebook, we will see how to evaluate a model on questions that have lists as answers.

# Dependencies

If not already done, make sure to install PrimeQA with `notebooks` extras before getting started.

In [4]:
# If you want CUDA 11 uncomment and run this (for CUDA 10 or CPU you can ignore this line).
#! pip install 'torch~=1.11.0' --extra-index-url https://download.pytorch.org/whl/cu113

# Uncomment to install PrimeQA from source (pypi package pending).
# The path should be the project root (e.g. '.' below).
#! pip install .[notebooks]

# NQ List Subset

Create train and dev subsets of Natural Questions (NQ) that contain only the questions whose answers are lists.

In [5]:
from examples.listqa.list_nq2tydi import ListNQSubset

input_train = "/dccstor/srosent1/projects/mlqa/data/NQ/train/nq-train-0*.jsonl.gz"
input_dev = "/dccstor/srosent1/projects/mlqa/data/NQ/dev/nq-dev-*jsonl.gz"
output_data_dir = "/dccstor/srosent2/primeqa/data/nq/"

# num_lines is the number of lines to read from each file; use -1 to load the full dataset.
# We only load a subset here for illustration purposes.
processor = ListNQSubset()
processor.process(input_train, output_data_dir + "/nq-train-lists.jsonl", num_lines=5000)
processor.process(input_dev, output_data_dir + "/nq-dev-lists.jsonl", num_lines=1000)

INFO:root:Loading NQ data...
INFO:root:/dccstor/srosent1/projects/mlqa/data/NQ/train/nq-train-08.jsonl.gz
100%|██████████| 2000/2000 [00:41<00:00, 47.92it/s]
INFO:root:Done
INFO:root:Converting to tydi
100%|██████████| 2000/2000 [00:18<00:00, 105.41it/s]
INFO:root:data size: 15
INFO:root:/dccstor/srosent1/projects/mlqa/data/NQ/train/nq-train-07.jsonl.gz
100%|██████████| 2000/2000 [00:32<00:00, 61.66it/s]
INFO:root:Done
INFO:root:Converting to tydi
100%|██████████| 2000/2000 [00:25<00:00, 79.38it/s] 
INFO:root:data size: 18
INFO:root:/dccstor/srosent1/projects/mlqa/data/NQ/train/nq-train-02.jsonl.gz
100%|██████████| 2000/2000 [00:33<00:00, 59.88it/s]
INFO:root:Done
INFO:root:Converting to tydi
100%|██████████| 2000/2000 [00:25<00:00, 79.43it/s] 
INFO:root:data size: 12
INFO:root:/dccstor/srosent1/projects/mlqa/data/NQ/train/nq-train-01.jsonl.gz
100%|██████████| 2000/2000 [00:35<00:00, 56.40it/s]
INFO:root:Done
INFO:root:Converting to tydi
100%|██████████| 2000/2000 [00:22<00:00, 87.97it
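
The `num_lines` cap can be pictured with a small stand-alone reader. This is only a sketch of the idea, not the actual implementation of `ListNQSubset`: `read_jsonl_gz` is a hypothetical helper, and it assumes each input file is gzip-compressed JSON Lines, as the `.jsonl.gz` suffix suggests.

```python
import gzip
import json
from itertools import islice


def read_jsonl_gz(path, num_lines=-1):
    """Yield parsed records from a gzipped JSON Lines file.

    Hypothetical helper: num_lines=-1 reads the whole file, mirroring the
    convention used above; any other value caps the number of lines read.
    """
    limit = None if num_lines < 0 else num_lines
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in islice(f, limit):
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Because the function is a generator, a small `num_lines` keeps memory use low even when the underlying file is large.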

# Configuration

We start by setting some parameters to configure the process.  Note that depending on the GPU being used you may need to tune the batch size.

In [6]:
# This needs to be filled in.
output_dir = '/dccstor/srosent2/primeqa/experiments/run_nq'        # Save the results here.  Will overwrite if directory already exists.

# Optional parameters (feel free to leave as default).
model_name = 'xlm-roberta-base'  # Set this to select the LM.  The logged output below was produced with xlm-roberta-base; point this at a model fine-tuned on list answers for meaningful scores.
cache_dir = None                 # Set this if you have a cache directory for transformers.  Alternatively set the HF_HOME env var.
train_batch_size = 16             # Set this to change the number of features per batch during training.
eval_batch_size = 16              # Set this to change the number of features per batch during evaluation.
gradient_accumulation_steps = 8  # Set this to effectively increase training batch size.
max_train_samples = 500          # Set this to use a subset of the training data (or None for all).
max_eval_samples = 10            # Set this to use a subset of the evaluation data (or None for all).
num_train_epochs = 1             # Set this to change the number of training epochs.
fp16 = False                     # Set this to true to enable fp16 (hardware support required).
num_examples_to_show = 5        # Set this to change the number of random train examples (and their features) to show.
max_seq_length = 512             # Use a large sequence length to accommodate long lists (we don't want to split lists across windows).
max_answer_length = 1000         # Use a large answer length to accommodate long lists.
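
When tuning the batch size for a particular GPU, it helps to keep the effective batch size in mind. The sketch below is plain arithmetic, not a PrimeQA or Transformers API; `effective_batch_size` is a hypothetical helper and `num_devices` is an assumed parameter for multi-GPU runs. It only matters when training is enabled.

```python
def effective_batch_size(per_device_batch_size, accumulation_steps, num_devices=1):
    """Number of features that contribute to each optimizer step."""
    return per_device_batch_size * accumulation_steps * num_devices


# With train_batch_size = 16 and gradient_accumulation_steps = 8 on one device:
print(effective_batch_size(16, 8))  # 128
```

If a batch of 16 does not fit in memory, halving `train_batch_size` and doubling `gradient_accumulation_steps` keeps the effective batch size unchanged.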

In [7]:
from transformers import TrainingArguments
from transformers.trainer_utils import set_seed

seed = 42
set_seed(seed)

training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    do_train=False,
    do_eval=True,
    per_device_train_batch_size=train_batch_size,
    per_device_eval_batch_size=eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    num_train_epochs=num_train_epochs,
    evaluation_strategy='no',
    learning_rate=5e-05,
    warmup_ratio=0.1,
    weight_decay=0.1,
    save_steps=50000,
    fp16=fp16,
    seed=seed,
)

# Loading the Model

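Here we load the model and tokenizer based on the `model_name` parameter set above.  We use a model with an extractive QA task head.  Training is disabled in this notebook (`do_train=False`), so unless `model_name` points to a checkpoint that was already fine-tuned, the task head is newly initialized (see the warning in the output below).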

In [8]:
from transformers import AutoConfig, AutoTokenizer
from primeqa.mrc.models.heads.extractive import EXTRACTIVE_HEAD
from primeqa.mrc.models.task_model import ModelForDownstreamTasks

from primeqa.mrc.trainers.mrc import MRCTrainer

task_heads = EXTRACTIVE_HEAD
config = AutoConfig.from_pretrained(
    model_name,
    cache_dir=cache_dir,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    use_fast=True,
    config=config,
)
model = ModelForDownstreamTasks.from_config(
    config,
    model_name,
    task_heads=task_heads,
    cache_dir=cache_dir,
)
model.set_task_head(next(iter(task_heads)))

print(model)  # Examine the model structure

INFO:ExtractiveQAHead:Loading dropout value 0.1 from config attribute 'hidden_dropout_prob'
Some weights of XLMRobertaModelForDownstreamTasks were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['task_heads.qa_head.qa_outputs.bias', 'task_heads.qa_head.classifier.dense.bias', 'task_heads.qa_head.qa_outputs.weight', 'task_heads.qa_head.classifier.dense.weight', 'task_heads.qa_head.classifier.out_proj.weight', 'task_heads.qa_head.classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:XLMRobertaModelForDownstreamTasks:Setting task head for first time to 'None'


XLMRobertaModelForDownstreamTasks(
  (roberta): XLMRobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(250002, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (L
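
An extractive QA head produces one start logit and one end logit per token, and an answer is read off as the span with the best combined score. The sketch below is a generic illustration of that decoding step, not PrimeQA's head or post-processor; `best_span` is a hypothetical helper, and here `max_answer_length` is counted in tokens.

```python
def best_span(start_logits, end_logits, max_answer_length):
    """Return (start, end, score) of the highest-scoring valid span.

    A span is valid when start <= end and it covers at most
    max_answer_length tokens. The score is start logit + end logit.
    """
    best = None
    for start, start_logit in enumerate(start_logits):
        last = min(len(end_logits), start + max_answer_length)
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


start_logits = [0.1, 2.0, 0.3, 0.0]
end_logits = [0.0, 0.5, 1.5, 3.0]
print(best_span(start_logits, end_logits, max_answer_length=2))  # (1, 2, 3.5)
```

A small length limit forces short spans, which is why list answers need a generous `max_answer_length`.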

# Loading Data

Here we load the NQ list subsets created above from disk.  They are stored in the TyDi Google format.

In [9]:
import datasets

raw_dataset = datasets.load_dataset(
    'json',
    data_files={
        'train': [output_data_dir + "/nq-train-lists.jsonl"],
        'dev': [output_data_dir + "/nq-dev-lists.jsonl"],
    },
    cache_dir=training_args.output_dir,
)

train_examples = raw_dataset['train']
if max_train_samples is not None:
    # Keep only the first max_train_samples examples (capped at the dataset size).
    train_examples = train_examples.select(range(min(max_train_samples, len(train_examples))))

eval_examples = raw_dataset['dev']
if max_eval_samples is not None:
    # Keep only the first max_eval_samples examples (capped at the dataset size).
    eval_examples = eval_examples.select(range(min(max_eval_samples, len(eval_examples))))

print(f"Using {eval_examples.num_rows} eval examples.")



Downloading and preparing dataset json/default to /dccstor/srosent2/primeqa/experiments/run_nq/json/default-17b552305cac9100/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Dataset json downloaded and prepared to /dccstor/srosent2/primeqa/experiments/run_nq/json/default-17b552305cac9100/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Using 5 eval examples.


# Preprocessing

Here we preprocess the data to create features which can be given to the model.

In [10]:
from primeqa.mrc.processors.preprocessors.tydiqa_google import TyDiQAGooglePreprocessor

preprocessor = TyDiQAGooglePreprocessor(
    stride=256,
    tokenizer=tokenizer,
    max_seq_len=max_seq_length,
)

# Training Feature Creation
with training_args.main_process_first(desc="train dataset map pre-processing"):
    train_examples, train_dataset = preprocessor.process_train(train_examples)

print(f"Preprocessing produced {train_dataset.num_rows} train features from {train_examples.num_rows} examples.")

# Validation Feature Creation
with training_args.main_process_first(desc="validation dataset map pre-processing"):
    eval_examples, eval_dataset = preprocessor.process_eval(eval_examples)

print(f"Preprocessing produced {eval_dataset.num_rows} eval features from {eval_examples.num_rows} examples.")

INFO:TyDiQAGooglePreprocessor:TyDiQAGooglePreprocessor only supports single context multiple passages -- enabling


  0%|          | 0/100 [00:00<?, ?ex/s]

  0%|          | 0/100 [00:00<?, ?ex/s]

Casting the dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

Running tokenizer on train dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

Preprocessing produced 382 train features from 100 examples.


  0%|          | 0/5 [00:00<?, ?ex/s]

  0%|          | 0/5 [00:00<?, ?ex/s]

Casting the dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

Running tokenizer on eval dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

Preprocessing produced 65 eval features from 5 examples.
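
The feature counts above exceed the example counts because long documents are split into overlapping windows. The following is a minimal sketch of that windowing, assuming `stride` denotes the number of tokens shared by consecutive windows (the Hugging Face tokenizer convention). `sliding_windows` is a hypothetical helper, and it ignores the question and special tokens that take up part of the 512-token budget in practice.

```python
def sliding_windows(tokens, max_len, overlap):
    """Split tokens into windows of at most max_len that share `overlap` tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    step = max_len - overlap
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window reached the end of the sequence
        start += step
    return windows


windows = sliding_windows(list(range(1000)), max_len=512, overlap=256)
print([(w[0], w[-1]) for w in windows])  # [(0, 511), (256, 767), (512, 999)]
```

A large overlap raises the chance that a long list falls entirely inside at least one window, at the cost of producing more features per document.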


In [11]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

# Based on https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb
def show_elements(dataset):
    df = pd.DataFrame(dataset)
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [12]:
import random

from datasets.features.features import Value

def trim_document(example, max_len=500):
    # Keep only the first context and truncate it for display.
    example['context'] = example['context'][0]
    if len(example['context']) > max_len:
        example['context'] = f"{example['context'][:max_len - 3]}..."
    return example

train_examples = train_examples.cast_column('example_id', Value(dtype="int64", id=None))
print(train_examples.features)

random_idxs = random.sample(range(len(train_examples)), num_examples_to_show)
random_examples = train_examples.select(random_idxs).remove_columns(['document_plaintext', 'passage_candidates'])
random_examples = random_examples.map(trim_document)

show_elements(random_examples)  # Show random train examples

Casting the dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

{'document_plaintext': Value(dtype='string', id=None), 'document_title': Value(dtype='string', id=None), 'document_url': Value(dtype='string', id=None), 'example_id': Value(dtype='int64', id=None), 'language': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None), 'passage_candidates': {'end_positions': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'start_positions': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}, 'target': {'end_positions': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'passage_indices': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'start_positions': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'yes_no_answer': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}, 'context': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}


  0%|          | 0/5 [00:00<?, ?ex/s]

Unnamed: 0,document_title,document_url,example_id,language,question,target,context
0,Mamma Mia! Here We Go Again,https://en.wikipedia.org//w/index.php?title=Mamma_Mia!_Here_We_Go_Again&oldid=846364736,2796511963863747460,english,mama mia 2 here we go again cast,"{'end_positions': [3719], 'passage_indices': [27], 'start_positions': [2233], 'yes_no_answer': ['NONE']}",Mamma Mia ! Here We Go Again - Wikipedia Mamma Mia ! Here We Go Again Mamma Mia ! Here We Go Again Theatrical release poster Directed by Ol Parker Produced by * Judy Craymer * Gary Goetzman Screenplay by Ol Parker Story by * Catherine Johnson * Richard Curtis * Ol Parker Based on Mamma Mia ! by Catherine Johnson Starring * Christine Baranski * Pierce Brosnan * Dominic Cooper * Colin Firth * Andy García * Lily James * Amanda Seyfried * Stellan Skarsgård * Julie Walters * Cher * Meryl Streep Mu...
1,Gunsmoke,https://en.wikipedia.org//w/index.php?title=Gunsmoke&oldid=836953511,-1267223525065420052,english,how many episodes of gunsmoke was burt reynolds on,"{'end_positions': [45852], 'passage_indices': [179], 'start_positions': [44202], 'yes_no_answer': ['NONE']}","Gunsmoke - wikipedia Gunsmoke Jump to : navigation , search This article is about the radio and television series . For other uses , see Gun Smoke . James Arness as Matt Dillon in the television version of Gunsmoke ( 1956 ) Gunsmoke is an American radio and television Western drama series created by director Norman Macdonnell and writer John Meston . The stories take place in and around Dodge City , Kansas , during the settlement of the American West . The central character is lawman Marshal ..."
2,The Fab Four (tribute),https://en.wikipedia.org//w/index.php?title=The_Fab_Four_(tribute)&oldid=854972994,684722901270187660,english,what are the names of the fab four,"{'end_positions': [2880], 'passage_indices': [15], 'start_positions': [2133], 'yes_no_answer': ['NONE']}","The Fab Four ( tribute ) - wikipedia The Fab Four ( tribute ) The Fab Four Background information Origin California Genres Rock and roll Beatles tribute Years active 1997 - Present Labels Delta Ent . ( Laserlight ) , New World Digital Website www.TheFabFour.com Members Ron McNeil Gilbert Bonilla Tyson Kelly Ardy Sarraf Neil Candelora Michael Amador Gavin Pring Rolo Sandoval Joe Bologna Luis Renteria Erik Fidel The Fab Four is a California - based tribute band paying homage to The Beatles . Fo..."
3,Rise of the Guardians,https://en.wikipedia.org//w/index.php?title=Rise_of_the_Guardians&oldid=826399004,-5871963713943154664,english,who plays voices in rise of the guardians,"{'end_positions': [6904], 'passage_indices': [27], 'start_positions': [5187], 'yes_no_answer': ['NONE']}","Rise of the Guardians - Wikipedia Rise of the Guardians Jump to : navigation , search This article is about the film . For the video game , see Rise of the Guardians : The Video Game . Rise of the Guardians Theatrical release poster Directed by Peter Ramsey Produced by Christina Steinberg Nancy Bernstein Screenplay by David Lindsay - Abaire Based on The Guardians of Childhood by William Joyce Starring Chris Pine Alec Baldwin Hugh Jackman Isla Fisher Jude Law Music by Alexandre Desplat Edited ..."
4,Who Stole the Cookie from the Cookie Jar?,https://en.wikipedia.org//w/index.php?title=Who_Stole_the_Cookie_from_the_Cookie_Jar%3F&oldid=834546477,-4735931497873009375,english,who ate the cookies from the cookie jar song lyrics,"{'end_positions': [1858], 'passage_indices': [4], 'start_positions': [1553], 'yes_no_answer': ['NONE']}","Who stole the cookie from the cookie jar ? - Wikipedia Who stole the cookie from the cookie jar ? `` Cookie Jar Song '' redirects here . For the Gym Class Heroes song , see Cookie Jar ( song ) . `` Who Stole the Cookie from the Cookie Jar ? '' or the Cookie Jar Song is a sing along and game of children 's music . The song is an infinite - loop motif , where each verse directly feeds into the next . The game begins with the children sitting or standing , arranged in an inward - facing circle ...."
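
Each example's `target` holds start and end offsets into `document_plaintext`, so the gold list can be recovered by slicing. The sketch below rests on assumptions that are not confirmed by this notebook: that the offsets are UTF-8 byte offsets (as in the original TyDi QA format) and that a negative offset marks a missing answer. `gold_answer_text` is a hypothetical helper; pass `offsets_are_bytes=False` if the offsets turn out to be character positions.

```python
def gold_answer_text(example, offsets_are_bytes=True):
    """Return the first gold span of a TyDi-style example, or None if absent."""
    start = example["target"]["start_positions"][0]
    end = example["target"]["end_positions"][0]
    if start < 0 or end < 0:  # assumed convention for "no answer"
        return None
    text = example["document_plaintext"]
    if offsets_are_bytes:
        return text.encode("utf-8")[start:end].decode("utf-8", errors="replace")
    return text[start:end]


# Toy record using the field names shown in the features printout above.
text = "Starring * Cher * Meryl Streep Music by Anne Dudley"
answer = "* Cher * Meryl Streep"
start = text.index(answer)
example = {
    "document_plaintext": text,
    "target": {"start_positions": [start], "end_positions": [start + len(answer)]},
}
print(gold_answer_text(example))  # * Cher * Meryl Streep
```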


In [13]:
from primeqa.mrc.data_models.target_type import TargetType

random_train_dataset = train_dataset.filter(lambda feature: feature['example_idx'] in random_idxs[:1]).remove_columns(['attention_mask', 'offset_mapping'])
show_elements(random_train_dataset) # Show random train features

  0%|          | 0/1 [00:00<?, ?ba/s]

Unnamed: 0,example_id,input_ids,example_idx,start_positions,end_positions,target_type
0,2796511963863747460,"[0, 8840, 8132, 116, 3688, 642, 738, 13438, 37702, 2, 2, 24668, 111, 70, 5701, 9351, 6, 5, 581, 1346, 6057, 7, 6035, 8613, 24714, 4517, 6, 4, 141499, 14631, 6, 4, 119497, 74583, 84285, 6, 4, 161055, 22411, 32673, 6, 4, 64672, 80490, 7, 6, 4, 47231, 329, 126449, 4458, 6, 4, 166132, 35526, 927, 6, 4, 124172, 19, 104964, 7, 33653, 6, 4, 120725, 59797, 6, 4, 136, 70535, 6, 5, 1650, 83, 80889, 71, 47, 186, 121447, 23, 23924, 21629, 136, 70, 17274, 98, 20414, 387, 6, 4, 267, 390, 53900, 134896, 6, 4, 1492, 5369, 47, ...]",67,0,0,2
1,2796511963863747460,"[0, 8840, 8132, 116, 3688, 642, 738, 13438, 37702, 2, 2, 7, 15, 47231, 329, 126449, 4458, 6, 4, 124172, 19, 104964, 7, 33653, 6, 4, 136, 166132, 35526, 927, 1388, 6, 4, 136, 165249, 10, 76849, 6, 4, 1620, 272, 538, 756, 98, 604, 10002, 6, 4, 15490, 10, 42732, 47, 17997, 604, 4210, 678, 142, 241957, 19922, 1295, 22008, 2412, 1902, 959, 77049, 71, 707, 84751, 47, 1957, 152, 604, 9963, 432, 9319, 6, 4, 158189, 53257, 12667, 6, 5, 32301, 15, 27211, 1388, 661, 119497, 74583, 84285, 237, 145906, 53257, 12667, 6, 4, 140716, 242, 7, 496, ...]",67,86,491,1
2,2796511963863747460,"[0, 8840, 8132, 116, 3688, 642, 738, 13438, 37702, 2, 2, 459, 6496, 42179, 661, 20984, 11, 151466, 237, 44389, 12015, 478, 661, 47231, 329, 126449, 4458, 237, 3362, 3980, 266, 1436, 583, 6, 4, 145906, 242, 7, 29954, 1021, 9319, 136, 7722, 67373, 6, 4, 142, 120552, 20, 15672, 107653, 6, 4, 158189, 242, 7, 775, 20, 23, 20, 27165, 6, 4, 136, 71390, 47, 140716, 661, 169707, 6921, 34271, 237, 44389, 3362, 661, 166132, 35526, 927, 237, 20904, 138276, 6, 4, 145906, 242, 7, 7722, 67373, 136, 10, 56101, 150065, 6, 5, 661, 198952, 68761, 1679, 237, 44389, ...]",67,0,0,2


In [17]:
from operator import attrgetter
from transformers import DataCollatorWithPadding
from primeqa.mrc.data_models.eval_prediction_with_processing import EvalPredictionWithProcessing
from primeqa.mrc.metrics.tydi_f1.tydi_f1 import TyDiF1
from primeqa.mrc.processors.postprocessors.extractive import ExtractivePostProcessor
from primeqa.mrc.processors.postprocessors.scorers import SupportedSpanScorers

# If using mixed precision we pad for efficient hardware acceleration
using_mixed_precision = any(attrgetter('fp16', 'bf16')(training_args))
data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=64 if using_mixed_precision else None)

# noinspection PyProtectedMember
postprocessor = ExtractivePostProcessor(
    k=3,
    n_best_size=20,
    max_answer_length=max_answer_length,
    scorer_type=SupportedSpanScorers.WEIGHTED_SUM_TARGET_TYPE_AND_SCORE_DIFF,
    single_context_multiple_passages=preprocessor._single_context_multiple_passages,
)

def compute_metrics(p: EvalPredictionWithProcessing):
    return TyDiF1().compute(predictions=p.processed_predictions, references=p.label_ids, passage_non_null_threshold=1, span_non_null_threshold=1)

trainer = MRCTrainer(
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=eval_dataset if training_args.do_eval else None,
    eval_examples=eval_examples if training_args.do_eval else None,
    tokenizer=tokenizer,
    data_collator=data_collator,
    post_process_function=postprocessor.process_references_and_predictions,  # see QATrainer in Huggingface
    compute_metrics=compute_metrics,
)
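
The `pad_to_multiple_of=64` setting rounds each batch's padded length up to the next multiple of 64 when mixed precision is enabled. A minimal sketch of that rounding, using a hypothetical helper `padded_length` rather than the collator itself:

```python
def padded_length(longest, pad_to_multiple_of=None):
    """Length a batch is padded to, given its longest sequence."""
    if pad_to_multiple_of is None:
        return longest
    # Ceiling division using integers only.
    return -(-longest // pad_to_multiple_of) * pad_to_multiple_of


print(padded_length(500, 64))  # 512
print(padded_length(500))      # 500
```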

# Evaluation

Here we evaluate the model on the validation set.

In [18]:
max_eval_samples = max_eval_samples or len(eval_dataset)

metrics = trainer.evaluate()
metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))

trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)

INFO:primeqa.mrc.trainers.mrc:The following columns in the evaluation set  don't have a corresponding argument in `XLMRobertaModelForDownstreamTasks.forward` and have been ignored: offset_mapping, example_id, example_idx.
***** Running Evaluation *****
  Num examples = 65
  Batch size = 16


100%|██████████| 5/5 [00:00<00:00, 15.83it/s]


Passage & english & \fpr{0.0}{0.0}{0.0}
Minimal Answer & english & \fpr{0.0}{0.0}{0.0}
********************
english
Language: english (5)
********************
PASSAGE ANSWER R@P TABLE:
Optimal threshold: 0.0
 F1     /  P      /  R
  0.00% /   0.00% /   0.00%
R@P=0.5: 0.00% (actual p=0.00%, score threshold=0.0)
R@P=0.75: 0.00% (actual p=0.00%, score threshold=0.0)
R@P=0.9: 0.00% (actual p=0.00%, score threshold=0.0)
********************
MINIMAL ANSWER R@P TABLE:
Optimal threshold: 0.0
 F1     /  P      /  R
  0.00% /   0.00% /   0.00%
R@P=0.5: 0.00% (actual p=0.00%, score threshold=0.0)
R@P=0.75: 0.00% (actual p=0.00%, score threshold=0.0)
R@P=0.9: 0.00% (actual p=0.00%, score threshold=0.0)
Total # examples in gold: 5, # ex. in pred: 5 (including english)
*** Macro Over 0 Languages, excluding English **
Passage F1:0.000 P:0.000 R:0.000000
\fpr{0.0}{0.0}{0.0}
Minimal F1:0.000 P:0.000 R:0.000000
\fpr{0.0}{0.0}{0.0}
*** / Aggregate Scores ****
{"avg_passage_f1": 0, "avg_passage_recall": 0
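
The tables above report F1, precision and recall for passage and minimal answers. As rough intuition for how a span-level score behaves, the sketch below scores one predicted span against one gold span by their overlap. It is an illustration only, not the `TyDiF1` implementation, which also applies the non-null thresholds passed in `compute_metrics`; `span_overlap_f1` is a hypothetical helper.

```python
def span_overlap_f1(pred, gold):
    """Precision, recall and F1 of two half-open (start, end) spans."""
    pred_start, pred_end = pred
    gold_start, gold_end = gold
    overlap = max(0, min(pred_end, gold_end) - max(pred_start, gold_start))
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / (pred_end - pred_start)
    recall = overlap / (gold_end - gold_start)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


print(span_overlap_f1((10, 20), (15, 25)))  # (0.5, 0.5, 0.5)
print(span_overlap_f1((0, 3), (50, 90)))    # (0.0, 0.0, 0.0)
```

A prediction that misses the gold span entirely scores zero on all three measures, which is what a model with an untrained task head tends to produce.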

# Predictions

Here we examine the model predictions.

In [16]:
import json
import os
from pprint import pprint

with open(os.path.join(output_dir, 'eval_predictions.json'), 'r') as f:
    predictions = json.load(f)

pprint(predictions)

{'-8498955431733322253': [{'cls_score': -0.4986354410648346,
                           'confidence_score': 0.3383668723153968,
                           'end_index': 120,
                           'end_logit': 0.12701758742332458,
                           'end_stdev': 0.0,
                           'example_id': '-8498955431733322253',
                           'normalized_span_answer_score': 0.3383668723153968,
                           'passage_index': -1,
                           'query_passage_similarity': 0.0,
                           'span_answer': {'end_position': 1302,
                                           'start_position': 1299},
                           'span_answer_score': 0.40101583674550056,
                           'span_answer_text': 'tym',
                           'start_index': 120,
                           'start_logit': 0.23672683537006378,
                           'start_stdev': 0.0,
                           'target_type_logits': [0.2221
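
To skim the predictions instead of printing the whole file, one option is to keep only the highest-confidence candidate per example. This sketch assumes the structure visible in the output above: a dict mapping each example id to a list of candidates that carry `confidence_score` and `span_answer_text`. `top_predictions` is a hypothetical helper, and the sample values are made up for illustration.

```python
def top_predictions(predictions):
    """Map each example id to its highest-confidence candidate answer."""
    return {
        example_id: max(candidates, key=lambda c: c["confidence_score"])
        for example_id, candidates in predictions.items()
        if candidates  # skip examples without any candidate
    }


# Made-up sample in the same shape as eval_predictions.json.
sample = {
    "ex-1": [
        {"confidence_score": 0.2, "span_answer_text": "first"},
        {"confidence_score": 0.7, "span_answer_text": "second"},
    ],
    "ex-2": [],
}
for example_id, best in top_predictions(sample).items():
    print(example_id, best["span_answer_text"], best["confidence_score"])
```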