# Confidence Calibration

This notebook shows how to train a confidence calibration model for machine reading comprehension (MRC) using the TyDiQA dataset.


# Dependencies

If not already done, make sure to install PrimeQA with notebooks extras before getting started.


In [1]:
# If you want CUDA 11 uncomment and run this (for CUDA 10 or CPU you can ignore this line).
#! pip install 'torch~=1.11.0' --extra-index-url https://download.pytorch.org/whl/cu113

# Uncomment to install PrimeQA from source (pypi package pending).
# The path should be the project root (e.g. '.' below).
#! pip install .[notebooks]

# Configuration

We start by setting some parameters to configure the process. Note that depending on the GPU being used you may need to tune the batch size.


In [2]:
# These need to be filled in.
output_dir = 'FILL_ME_IN'               # Save the MRC model here.  Will overwrite if directory already exists.
confidence_model_dir = 'FILL_ME_IN'     # Save the confidence model here.  Will overwrite if directory already exists.
confidence_dataset_dir = 'FILL_ME_IN'   # Save the confidence model training data here.  Will overwrite if directory already exists.

# Parameters for MRC model training (feel free to leave as default).
model_name = 'xlm-roberta-base'  # Set this to select the LM.  Since this is a multi-lingual dataset, we use the XLM-Roberta model.
cache_dir = None                 # Set this if you have a cache directory for transformers.  Alternatively set the HF_HOME env var.
train_batch_size = 8             # Set this to change the number of features per batch during training.
eval_batch_size = 8              # Set this to change the number of features per batch during evaluation.
gradient_accumulation_steps = 8  # Set this to effectively increase training batch size.
max_train_samples = 300          # Set this to use a subset of the training data (or None for all).
max_eval_samples = 50            # Set this to use a subset of the evaluation data (or None for all).
num_train_epochs = 1             # Set this to change the number of training epochs.
fp16 = False                     # Set this to true to enable fp16 (hardware support required).

# Parameters for confidence model training (feel free to leave as default).
relative_confidence_train_size = 0.1          # Set this to change the relative size of confidence model train set split from original train set.
max_iter_of_confidence_model_training = 100   # Set this to change the maximum number of iterations for confidence model training.
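
When tuning the batch size for a given GPU, it helps to keep the effective batch size in view. The sketch below is a rough, standalone calculation (the helper names are hypothetical, and the step count only approximates how the Hugging Face trainer counts updates). With the defaults above it gives an effective batch size of 64, and for the 447 train features reported in the training log later in this notebook it gives 7 optimization steps.

```python
import math


def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of features that contribute to a single optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def optimization_steps(num_features, per_device_batch_size, gradient_accumulation_steps,
                       num_epochs=1, num_devices=1):
    """Approximate number of optimizer updates for a full training run."""
    batches_per_epoch = math.ceil(num_features / (per_device_batch_size * num_devices))
    # Assumption: leftover batches that do not fill an accumulation cycle are not counted.
    updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    return updates_per_epoch * num_epochs


print(effective_batch_size(8, 8))
print(optimization_steps(447, 8, 8))
```

If memory is tight, halving `train_batch_size` while doubling `gradient_accumulation_steps` keeps the effective batch size unchanged.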

In [3]:
from transformers import TrainingArguments
from transformers.trainer_utils import set_seed

seed = 42
set_seed(seed)

training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=train_batch_size,
    per_device_eval_batch_size=eval_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    num_train_epochs=num_train_epochs,
    evaluation_strategy='no',
    learning_rate=4e-05,
    warmup_ratio=0.1,
    weight_decay=0.1,
    save_steps=50000,
    fp16=fp16,
    seed=seed,
)

task_args = dict(
    confidence_model_dir=confidence_model_dir,
    confidence_dataset_dir=confidence_dataset_dir,
    relative_confidence_train_size=relative_confidence_train_size,
    max_iter_of_confidence_model_training=max_iter_of_confidence_model_training,    
)

# Loading Data

Here we load the TyDiQA dataset using Huggingface's datasets library.


In [4]:
import datasets

raw_datasets = datasets.load_dataset(
    'tydiqa',
    'primary_task',
    cache_dir=cache_dir,
)


Reusing dataset tydiqa (/dccstor/zhrong-nmt/QA/primeqa/exp/cache/tydiqa/primary_task/1.0.0/b8a6c4c0db10bf5703d7b36645e5dbae821b8c0e902dac9daeecd459a8337148)



# Splitting Data

Here we split the train set of raw_datasets into an mrc_train set and a confidence_train set, using the ratio specified by 'relative_confidence_train_size'. For example, 'relative_confidence_train_size=0.1' means that 10% of the original train data is used for confidence model training and the remaining 90% is used for MRC model training. The new datasets (the two new train sets and the original validation set) are saved to the directory specified by 'confidence_dataset_dir'.


In [5]:
from datasets import DatasetDict
import os

original_train_set = raw_datasets["train"]
split_train_set = original_train_set.train_test_split(test_size=task_args['relative_confidence_train_size'])
mrc_train_set = split_train_set["train"]
confidence_train_set = split_train_set["test"]
validation_set = raw_datasets["validation"]

confidence_datasets = DatasetDict({
    "mrc_train": mrc_train_set,
    "confidence_train": confidence_train_set,
    "validation": validation_set,
})

# save new datasets
os.makedirs(task_args['confidence_dataset_dir'], exist_ok=True)
confidence_datasets.save_to_disk(task_args['confidence_dataset_dir'])


Loading cached split indices for dataset at /dccstor/zhrong-nmt/QA/primeqa/exp/cache/tydiqa/primary_task/1.0.0/b8a6c4c0db10bf5703d7b36645e5dbae821b8c0e902dac9daeecd459a8337148/cache-4d5134431da553f5.arrow and /dccstor/zhrong-nmt/QA/primeqa/exp/cache/tydiqa/primary_task/1.0.0/b8a6c4c0db10bf5703d7b36645e5dbae821b8c0e902dac9daeecd459a8337148/cache-46529ac46b6a5efb.arrow
Loading cached processed dataset at /dccstor/zhrong-nmt/QA/primeqa/exp/cache/tydiqa/primary_task/1.0.0/b8a6c4c0db10bf5703d7b36645e5dbae821b8c0e902dac9daeecd459a8337148/cache-8548cd041ab057a6.arrow
Loading cached processed dataset at /dccstor/zhrong-nmt/QA/primeqa/exp/cache/tydiqa/primary_task/1.0.0/b8a6c4c0db10bf5703d7b36645e5dbae821b8c0e902dac9daeecd459a8337148/cache-dc77c82f5ac6aaa8.arrow
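
To make the split concrete, here is a minimal plain-Python sketch of what a relative split does: shuffle the example indices, then cut off a fraction for the confidence model. The helper `split_indices` is hypothetical and only illustrates the idea; the exact shuffling and rounding used by the datasets library may differ.

```python
import math
import random


def split_indices(num_examples, relative_test_size, seed=42):
    """Shuffle example indices and split them into two disjoint lists."""
    if not 0.0 < relative_test_size < 1.0:
        raise ValueError("relative_test_size must be strictly between 0 and 1")
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    num_test = math.ceil(relative_test_size * num_examples)
    # The first `num_test` shuffled indices go to the held-out part, the rest stay in train.
    return indices[num_test:], indices[:num_test]


mrc_idx, conf_idx = split_indices(1000, 0.1)
print(len(mrc_idx), len(conf_idx))
```

Keeping the two index sets disjoint matters here: the confidence model should be trained on examples the MRC model did not see during fine-tuning.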


# Building MRC Model

Before building a confidence calibration model, we need to train an MRC model, which is used later to generate the features for confidence model training.


## Loading Language Model

The first step of MRC model training is to load the LM and tokenizer based on the model_name parameter set above. We also set the task head to EXTRACTIVE_WITH_CONFIDENCE_HEAD which supports confidence calibration.


In [6]:
from transformers import AutoConfig, AutoTokenizer
from primeqa.mrc.models.heads.extractive import EXTRACTIVE_WITH_CONFIDENCE_HEAD
from primeqa.mrc.models.task_model import ModelForDownstreamTasks

from primeqa.mrc.trainers.mrc import MRCTrainer

task_heads = EXTRACTIVE_WITH_CONFIDENCE_HEAD
config = AutoConfig.from_pretrained(
    model_name,
    cache_dir=cache_dir,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    use_fast=True,
    config=config,
)
config.sep_token_id = tokenizer.convert_tokens_to_ids(tokenizer.sep_token)
config.output_dropout_rate = 0.25
config.decoding_times_with_dropout = 5
model = ModelForDownstreamTasks.from_config(
    config,
    model_name,
    task_heads=task_heads,
    cache_dir=cache_dir,
)
model.set_task_head(next(iter(task_heads)))

print(model)  # Examine the model structure


Some weights of XLMRobertaModelForDownstreamTasks were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['task_heads.qa_head.classifier.dense.bias', 'task_heads.qa_head.classifier.out_proj.bias', 'task_heads.qa_head.classifier.out_proj.weight', 'task_heads.qa_head.qa_outputs.weight', 'task_heads.qa_head.classifier.dense.weight', 'task_heads.qa_head.qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


XLMRobertaModelForDownstreamTasks(
  (roberta): XLMRobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(250002, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (L
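
The notebook sets `output_dropout_rate` and `decoding_times_with_dropout` without explaining them. One plausible reading, and it is only an assumption, is that the head scores the same input several times with dropout enabled and records how much the logits vary (the printed predictions later contain `start_stdev` and `end_stdev` fields). The sketch below illustrates that general idea in plain Python; it is not PrimeQA's implementation, and all names and values in it are hypothetical.

```python
import random
import statistics


def apply_dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


def logit_mean_and_stdev(hidden, weights, rate, times, seed=0):
    """Score the same hidden vector `times` times under dropout."""
    rng = random.Random(seed)
    logits = [
        sum(h * w for h, w in zip(apply_dropout(hidden, rate, rng), weights))
        for _ in range(times)
    ]
    return statistics.fmean(logits), statistics.pstdev(logits)


# Hypothetical hidden state and linear-head weights, for illustration only.
hidden = [0.5, -1.2, 0.8, 0.3]
weights = [0.1, 0.4, -0.2, 0.7]
mean_logit, stdev_logit = logit_mean_and_stdev(hidden, weights, rate=0.25, times=5)
print(mean_logit, stdev_logit)
```

A large spread across passes suggests the model is unsure about that logit, which is the kind of signal a calibration model can learn from.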

## Preprocessing

Next we preprocess the mrc_train set into features that can be given to the model.

In [7]:
from primeqa.mrc.processors.preprocessors.tydiqa import TyDiQAPreprocessor

preprocessor = TyDiQAPreprocessor(
    stride=128,
    tokenizer=tokenizer,
)

train_dataset = confidence_datasets["mrc_train"]
if max_train_samples is not None:
    # Select a subset of the data if max_train_samples is set
    train_dataset = train_dataset.select(range(max_train_samples))
# Train Feature Creation
with training_args.main_process_first(desc="train dataset map pre-processing"):
    train_examples, train_dataset = preprocessor.process_train(train_dataset)
print(f"Preprocessing produced {train_dataset.num_rows} train features from {train_examples.num_rows} examples.")

Loading cached processed dataset at /dccstor/zhrong-nmt/QA/primeqa/exp/cache/tydiqa/primary_task/1.0.0/b8a6c4c0db10bf5703d7b36645e5dbae821b8c0e902dac9daeecd459a8337148/cache-1c80317fa3b1799d.arrow
Loading cached processed dataset at /dccstor/zhrong-nmt/QA/primeqa/exp/cache/tydiqa/primary_task/1.0.0/b8a6c4c0db10bf5703d7b36645e5dbae821b8c0e902dac9daeecd459a8337148/cache-bdd640fb06671ad1.arrow
Loading cached processed dataset at /dccstor/zhrong-nmt/QA/primeqa/exp/cache/tydiqa/primary_task/1.0.0/b8a6c4c0db10bf5703d7b36645e5dbae821b8c0e902dac9daeecd459a8337148/cache-3eb13b9046685257.arrow
Loading cached processed dataset at /dccstor/zhrong-nmt/QA/primeqa/exp/cache/tydiqa/primary_task/1.0.0/b8a6c4c0db10bf5703d7b36645e5dbae821b8c0e902dac9daeecd459a8337148/cache-23b8c1e9392456de.arrow


Preprocessing produced 447 train features from 500 examples.


## Fine-tuning

Here we fine-tune the MRC model on the training set.

In [8]:
from operator import attrgetter
from transformers import DataCollatorWithPadding
from primeqa.mrc.data_models.eval_prediction_with_processing import EvalPredictionWithProcessing
from primeqa.mrc.metrics.tydi_f1.tydi_f1 import TyDiF1
from primeqa.mrc.processors.postprocessors.extractive import ExtractivePostProcessor
from primeqa.mrc.processors.postprocessors.scorers import SupportedSpanScorers

# If using mixed precision we pad for efficient hardware acceleration
using_mixed_precision = any(attrgetter('fp16', 'bf16')(training_args))
data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=64 if using_mixed_precision else None)

# noinspection PyProtectedMember
postprocessor = ExtractivePostProcessor(
    k=3,
    n_best_size=20,
    max_answer_length=30,
    scorer_type=SupportedSpanScorers.WEIGHTED_SUM_TARGET_TYPE_AND_SCORE_DIFF,
    single_context_multiple_passages=preprocessor._single_context_multiple_passages,
    output_confidence_feature=True,
)

def compute_metrics(p: EvalPredictionWithProcessing):
    return TyDiF1().compute(predictions=p.processed_predictions, references=p.label_ids)

trainer = MRCTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset if training_args.do_train else None,
    eval_dataset=None,
    eval_examples=None,
    tokenizer=tokenizer,
    data_collator=data_collator,
    post_process_function=postprocessor.process_references_and_predictions,  # see QATrainer in Huggingface
    compute_metrics=compute_metrics,
)

train_result = trainer.train()
trainer.save_model()  # Saves the tokenizer too for easy upload

metrics = train_result.metrics
max_train_samples = max_train_samples or len(train_dataset)
metrics["train_samples"] = min(max_train_samples, len(train_dataset))

trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()

***** Running training *****
  Num examples = 447
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 8
  Total optimization steps = 7


Training completed. Do not forget to share your model on huggingface.co/models =)


Saving model checkpoint to FILL_ME_IN
Configuration saved in FILL_ME_IN/config.json
Model weights saved in FILL_ME_IN/pytorch_model.bin
tokenizer config file saved in FILL_ME_IN/tokenizer_config.json
Special tokens file saved in FILL_ME_IN/special_tokens_map.json


***** train metrics *****
  epoch                    =        1.0
  total_flos               =   111370GF
  train_loss               =      4.608
  train_runtime            = 0:09:58.81
  train_samples            =        447
  train_samples_per_second =      0.746
  train_steps_per_second   =      0.012


# Building Confidence Calibration Model

Here we start to build the confidence calibration model.


## Preprocessing

We first preprocess the confidence_train set.

In [10]:
conf_examples = confidence_datasets["confidence_train"]
if max_eval_samples is not None:
    # Select a subset of the data if max_eval_samples is set
    conf_examples = conf_examples.select(range(max_eval_samples))
# Feature Creation
with training_args.main_process_first(desc="confidence dataset map pre-processing"):
    conf_examples, conf_dataset = preprocessor.process_eval(conf_examples)
print(f"Preprocessing produced {conf_dataset.num_rows} train features from {conf_examples.num_rows} examples.")

Preprocessing produced 1475 train features from 100 examples.
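
The number of features is much larger than the number of examples because long documents are cut into overlapping windows. The sketch below shows the general sliding-window idea, assuming that `stride` is the number of tokens shared by consecutive windows (as in Hugging Face tokenizers) and using a hypothetical window length of 512 tokens. It ignores the question tokens and is not the preprocessor's actual code.

```python
def sliding_windows(tokens, max_length, stride):
    """Cut a token list into windows of at most `max_length` that overlap by `stride`."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break  # the current window already reaches the end of the document
        start += step
    return windows


document = list(range(1000))  # stand-in for 1000 token ids
windows = sliding_windows(document, max_length=512, stride=128)
print([len(w) for w in windows])
```

Each window becomes one feature, so a single long article can yield many features.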


## Generating Confidence Features

Here we run the MRC model obtained in the previous step on the processed confidence_train set. The resulting prediction file contains the information used to generate the confidence features.

In [11]:
metrics = trainer.evaluate(eval_dataset=conf_dataset, eval_examples=conf_examples)
max_conf_samples = max_eval_samples if max_eval_samples else len(conf_dataset)
metrics["confidence_samples"] = min(max_conf_samples, len(conf_dataset))
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)

for fn in ["eval_predictions.json", "eval_references.json", "eval_predictions_processed.json"]:
    if not os.path.exists(os.path.join(training_args.output_dir, fn)):
        raise ValueError(f"Unable to find eval result file {fn} from {training_args.output_dir}.")

os.replace(os.path.join(training_args.output_dir, 'eval_predictions.json'),
           os.path.join(training_args.output_dir, 'conf_predictions.json'))
os.replace(os.path.join(training_args.output_dir, 'eval_references.json'),
           os.path.join(training_args.output_dir, 'conf_references.json'))
confidence_set_prediction_file = os.path.join(training_args.output_dir, "conf_predictions.json")
confidence_set_reference_file = os.path.join(training_args.output_dir, "conf_references.json")


***** Running Evaluation *****
  Num examples = 1475
  Batch size = 8


100%|█████████████████████████████████████████████████████████████████████████████| 100/100 [00:05<00:00, 18.62it/s]


Passage & english & \fpr{0.0}{0.0}{0.0}
Minimal Answer & english & \fpr{0.0}{0.0}{0.0}
********************
english
Language: english (3)
********************
PASSAGE ANSWER R@P TABLE:
Optimal threshold: 0.0
 F1     /  P      /  R
  0.00% /   0.00% /   0.00%
R@P=0.5: 0.00% (actual p=0.00%, score threshold=0.0)
R@P=0.75: 0.00% (actual p=0.00%, score threshold=0.0)
R@P=0.9: 0.00% (actual p=0.00%, score threshold=0.0)
********************
MINIMAL ANSWER R@P TABLE:
Optimal threshold: 0.0
 F1     /  P      /  R
  0.00% /   0.00% /   0.00%
R@P=0.5: 0.00% (actual p=0.00%, score threshold=0.0)
R@P=0.75: 0.00% (actual p=0.00%, score threshold=0.0)
R@P=0.9: 0.00% (actual p=0.00%, score threshold=0.0)
Passage & arabic & \fpr{0.0}{0.0}{0.0}
Minimal Answer & arabic & \fpr{0.0}{0.0}{0.0}
********************
arabic
Language: arabic (14)
********************
PASSAGE ANSWER R@P TABLE:
Optimal threshold: 0.0
 F1     /  P      /  R
  0.00% /   0.00% /   0.00%
R@P=0.5: 0.00% (actual p=0.00%, score thresh
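
To make "confidence features" concrete, the sketch below flattens one prediction record into a numeric vector. The field names are the ones that appear in the prediction records printed at the end of this notebook, and the values are rounded from the first of those records. Which fields the real confidence model consumes is an assumption made for illustration; `to_feature_vector` is a hypothetical helper, not part of PrimeQA.

```python
# Numeric fields present in each printed prediction record.
# Assumption: this particular selection is illustrative only.
FEATURE_KEYS = (
    "cls_score",
    "start_logit",
    "end_logit",
    "span_answer_score",
    "start_stdev",
    "end_stdev",
    "query_passage_similarity",
    "normalized_span_answer_score",
)


def to_feature_vector(prediction, keys=FEATURE_KEYS):
    """Turn one prediction record (a dict) into a list of floats, in a fixed order."""
    return [float(prediction.get(key, 0.0)) for key in keys]


record = {
    "cls_score": -0.1325,
    "start_logit": -0.1880,
    "end_logit": 0.0555,
    "span_answer_score": 0.1545,
    "start_stdev": 0.0673,
    "end_stdev": 0.1243,
    "query_passage_similarity": 1.9767,
    "normalized_span_answer_score": 0.3443,
    "span_answer_text": "ignored because it is not numeric",
}
features = to_feature_vector(record)
print(features)
```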

## Training Confidence Model

Here we train the confidence calibration model and save it to the directory specified in 'confidence_model_dir'.


In [12]:
from joblib import dump
from sklearn.neural_network import MLPClassifier
from primeqa.calibration.confidence_scorer import ConfidenceScorer

confidence_model = MLPClassifier(
    random_state=1,
    activation='tanh',
    hidden_layer_sizes=(100, 100),
    max_iter=task_args['max_iter_of_confidence_model_training'],
    verbose=1,
)
X, Y = ConfidenceScorer.make_training_data(confidence_set_prediction_file, confidence_set_reference_file)

print("Training confidence model ...")
confidence_model.fit(X, Y)

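# Make sure the target directory exists before saving the model.
os.makedirs(task_args['confidence_model_dir'], exist_ok=True)
confidence_model_file = os.path.join(task_args['confidence_model_dir'], 'confidence_model.bin')
dump(confidence_model, confidence_model_file)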


Training confidence model ...
Iteration 1, loss = 0.51296729
Iteration 2, loss = 0.37961850
Iteration 3, loss = 0.27984038
Iteration 4, loss = 0.20783396
Iteration 5, loss = 0.15726690
Iteration 6, loss = 0.12240877
Iteration 7, loss = 0.09866111
Iteration 8, loss = 0.08260248
Iteration 9, loss = 0.07180332
Iteration 10, loss = 0.06458435
Iteration 11, loss = 0.05980054
Iteration 12, loss = 0.05667476
Iteration 13, loss = 0.05467909
Iteration 14, loss = 0.05345336
Iteration 15, loss = 0.05275034
Iteration 16, loss = 0.05239923
Iteration 17, loss = 0.05228125
Iteration 18, loss = 0.05231340
Iteration 19, loss = 0.05243743
Iteration 20, loss = 0.05261242
Iteration 21, loss = 0.05280964
Iteration 22, loss = 0.05300907
Iteration 23, loss = 0.05319686
Iteration 24, loss = 0.05336367
Iteration 25, loss = 0.05350336
Iteration 26, loss = 0.05361218
Iteration 27, loss = 0.05368806
Iteration 28, loss = 0.05373018
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Sto

['FILL_ME_IN/confidence_model.bin']
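
A low training loss does not by itself show that the scores are well calibrated. One common check, which is not part of this notebook, is the expected calibration error (ECE): group predictions into confidence bins and compare the average confidence in each bin with the fraction of correct answers. The sketch below is a minimal plain-Python version; the function name and the toy inputs are hypothetical.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average gap between confidence and accuracy over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0
    bins = [[] for _ in range(num_bins)]
    for confidence, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(confidence * num_bins), num_bins - 1)
        bins[index].append((confidence, bool(is_correct)))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_confidence - accuracy)
    return ece


# Toy example: four answers scored 0.9, only one of which is correct.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False])
print(overconfident)
```

An ECE near zero means the confidence scores match observed accuracy; a large value means the scores are systematically too high or too low.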

# Running MRC with Confidence Calibration

Here we evaluate the MRC model on the validation set and apply the confidence calibration model obtained in the previous step to the predictions. Each answer is assigned a confidence score that indicates how reliable it is.


## Preprocessing

Here we preprocess the validation set.


In [13]:
eval_examples = confidence_datasets["validation"]
if max_eval_samples is not None:
    # Select a subset of the data if max_eval_samples is set
    eval_examples = eval_examples.select(range(max_eval_samples))
# Validation Feature Creation
with training_args.main_process_first(desc="validation dataset map pre-processing"):
    eval_examples, eval_dataset = preprocessor.process_eval(eval_examples)
print(f"Preprocessing produced {eval_dataset.num_rows} train features from {eval_examples.num_rows} examples.")

Preprocessing produced 1084 train features from 100 examples.


## Predicting Answer and Assigning Confidence Score

Here we run the MRC model to predict an answer for each question and apply the confidence calibration model to obtain its confidence score.

In [14]:
import json

metrics = trainer.evaluate(eval_dataset=eval_dataset, eval_examples=eval_examples)
max_eval_samples = max_eval_samples if max_eval_samples else len(eval_dataset)
metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset))
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)

for fn in ["eval_predictions.json", "eval_references.json", "eval_predictions_processed.json"]:
    if not os.path.exists(os.path.join(training_args.output_dir, fn)):
        raise ValueError(f"Unable to find eval result file {fn} from {training_args.output_dir}.")

confidence_scorer = ConfidenceScorer(task_args['confidence_model_dir'])
validation_set_prediction_file = os.path.join(training_args.output_dir, "eval_predictions.json")
with open(validation_set_prediction_file, 'r') as f:
    validation_predictions = json.load(f)

for example_id, candidates in validation_predictions.items():
    scores = confidence_scorer.predict_scores(candidates)
    # Attach each candidate answer's confidence score to its prediction record.
    for candidate, score in zip(candidates, scores):
        candidate["confidence_score"] = score

rescored_prediction_file = os.path.join(training_args.output_dir, "eval_predictions.rescored.json")
with open(rescored_prediction_file, 'w') as f:
    json.dump(validation_predictions, f, indent=4)
print(f"Saved rescored validation predictions to {rescored_prediction_file}.")


***** Running Evaluation *****
  Num examples = 1084
  Batch size = 8
100%|█████████████████████████████████████████████████████████████████████████████| 100/100 [00:03<00:00, 25.33it/s]


Passage & english & \fpr{0.0}{0.0}{0.0}
Minimal Answer & english & \fpr{0.0}{0.0}{0.0}
********************
english
Language: english (1)
********************
PASSAGE ANSWER R@P TABLE:
Optimal threshold: 0.0
 F1     /  P      /  R
  0.00% /   0.00% /   0.00%
R@P=0.5: 0.00% (actual p=0.00%, score threshold=0.0)
R@P=0.75: 0.00% (actual p=0.00%, score threshold=0.0)
R@P=0.9: 0.00% (actual p=0.00%, score threshold=0.0)
********************
MINIMAL ANSWER R@P TABLE:
Optimal threshold: 0.0
 F1     /  P      /  R
  0.00% /   0.00% /   0.00%
R@P=0.5: 0.00% (actual p=0.00%, score threshold=0.0)
R@P=0.75: 0.00% (actual p=0.00%, score threshold=0.0)
R@P=0.9: 0.00% (actual p=0.00%, score threshold=0.0)
Passage & arabic & \fpr{0.0}{0.0}{0.0}
Minimal Answer & arabic & \fpr{0.0}{0.0}{0.0}
********************
arabic
Language: arabic (8)
********************
PASSAGE ANSWER R@P TABLE:
Optimal threshold: 0.0
 F1     /  P      /  R
  0.00% /   0.00% /   0.00%
R@P=0.5: 0.00% (actual p=0.00%, score thresho

## Printing Output

Here we print the answers with their confidence scores. Note that this notebook only demonstrates how to train a confidence calibration model, so it uses a very small number of examples for both MRC and confidence model training. As a result, most of the confidence scores are low. To obtain more reasonable confidence scores, modify the notebook to use a larger training set.


In [15]:
for example_id in validation_predictions:
    print(validation_predictions[example_id][0])

{'example_id': 'd309e832-e871-41de-8e40-5b78701c71f4', 'cls_score': -0.13252444937825203, 'start_logit': -0.1879718154668808, 'end_logit': 0.05548543855547905, 'span_answer': {'start_position': 8081, 'end_position': 8089}, 'span_answer_score': 0.15447045117616653, 'start_index': 183, 'end_index': 183, 'passage_index': 16, 'target_type_logits': [0.45754021406173706, 0.3089028298854828, 0.02573591284453869, -0.1720322072505951, -0.5589603781700134], 'span_answer_text': 'ในที่สุด', 'yes_no_answer': 0, 'start_stdev': 0.06726226210594177, 'end_stdev': 0.12430277466773987, 'query_passage_similarity': 1.976719856262207, 'normalized_span_answer_score': 0.3443146476218905, 'confidence_score': 0.006224607667931191}
{'example_id': '8711e3e6-b978-45ac-bca8-ffd37b39df72', 'cls_score': 1.1598286777734756, 'start_logit': -0.2032819539308548, 'end_logit': -0.17536720633506775, 'span_answer': {'start_position': 1123, 'end_position': 1126}, 'span_answer_score': -0.46588316559791565, 'start_index': 339, 
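
Once each candidate carries a `confidence_score`, a typical use is to answer only when the score clears a threshold and to abstain otherwise. The sketch below assumes the same structure as `validation_predictions` (a dict mapping each example id to a list of candidate dicts). The helper names, the threshold of 0.5, and the toy data are hypothetical.

```python
def best_confident_answer(candidates, threshold=0.5):
    """Return the highest-confidence candidate, or None if it is below the threshold."""
    if not candidates:
        return None
    best = max(candidates, key=lambda candidate: candidate["confidence_score"])
    return best if best["confidence_score"] >= threshold else None


def filter_predictions(predictions, threshold=0.5):
    """Map each example id to its accepted answer, or to None when abstaining."""
    return {
        example_id: best_confident_answer(candidates, threshold)
        for example_id, candidates in predictions.items()
    }


# Toy predictions for illustration only.
toy_predictions = {
    "q1": [
        {"span_answer_text": "Paris", "confidence_score": 0.91},
        {"span_answer_text": "Lyon", "confidence_score": 0.12},
    ],
    "q2": [{"span_answer_text": "1987", "confidence_score": 0.03}],
}
accepted = filter_predictions(toy_predictions, threshold=0.5)
for example_id, answer in accepted.items():
    print(example_id, answer["span_answer_text"] if answer else "<no confident answer>")
```

With a model trained on as little data as in this notebook, most scores would fall below any reasonable threshold, so the threshold should be chosen on held-out data after training on a larger set.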