# Answer classification for boolean questions

In this notebook, we look at answer (evidence) classification, the component of the TyDiQA pipeline that decides whether a boolean question should be answered `yes` or `no`, based on a passage selected by the machine reading comprehension (MRC) component.

## Preliminaries
We assume that the machine reading comprehension and question type classifier components of the TyDiQA pipeline have already been run, either through the integrated command line or through the step-by-step process, both described [here](../../examples/boolqa/README.md), and that the output directory was `base`.

First, some setup. The classifier obtains its input from the `qtc/eval_predictions.json` file produced by the question type classifier.
Most of this setup is very similar to the setup for [mrc](../mrc/mrc.ipynb).

In [2]:
base="base"  # output directory of the upstream MRC and question type classifier runs
output_dir="out"
input_file=f"{base}/qtc/eval_predictions.json"

from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    HfArgumentParser,
    Trainer,
    TrainingArguments)
from transformers.trainer_utils import set_seed
from primeqa.boolqa.processors.postprocessors.boolqa_classifier import BoolQAClassifierPostProcessor
from primeqa.boolqa.processors.preprocessors.boolqa_classifier import BoolQAClassifierPreProcessor
from primeqa.boolqa.processors.dataset.mrc2dataset import create_dataset_from_run_mrc_output, create_dataset_from_json_str
import pandas as pd

seed = 42
set_seed(seed)

training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    do_train=False,
    do_eval=True,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=1,
    num_train_epochs=1,
    evaluation_strategy='no',
    learning_rate=4e-05,
    warmup_ratio=0.1,
    weight_decay=0.1,
    save_steps=50000,
    seed=seed,
)

## Set up the auxiliary classes

These are the same types of classes that are used in the MRC system. The `sentence1_key` and `sentence2_key` arguments to the preprocessor specify that the evidence classifier will predict `yes` or `no` based on the question and the (long) passage answer produced by the upstream MRC system. In general, the minimal (short) answers are too short to make reasonable predictions from.

In [3]:
config = AutoConfig.from_pretrained('PrimeQA/tydiqa-boolean-answer-classifier', num_labels=3, use_auth_token=True)

tokenizer=AutoTokenizer.from_pretrained('PrimeQA/tydiqa-boolean-answer-classifier', use_fast=True, use_auth_token=True)

model = AutoModelForSequenceClassification.from_pretrained('PrimeQA/tydiqa-boolean-answer-classifier', config=config, use_auth_token=True)

label_list=['no', 'no_answer', 'yes']

postprocessor_class = BoolQAClassifierPostProcessor
postprocessor = postprocessor_class(
    k=10,       
    drop_label="no_answer",
    label_list = label_list,
    id_key='example_id',
    output_label_prefix='boolean_answer'
)

preprocessor_class = BoolQAClassifierPreProcessor
preprocessor = preprocessor_class(
    sentence1_key='question',
    sentence2_key='passage_answer_text',
    tokenizer=tokenizer,
    load_from_cache_file=False,
    max_seq_len=500,
    padding=False
)

## Inputs
Here we create a dataset from the input file, which is the output file of the question type classifier. For illustrative purposes, we filter it to focus on the English questions that have been predicted to be boolean.

In [4]:
examples=create_dataset_from_run_mrc_output(input_file, unpack=False)
examples=examples.filter(lambda x:x['language']=='english' and x['question_type_pred']=='boolean')
eval_examples, eval_dataset = preprocessor.process_eval(examples)
eval_examples




Dataset({
    features: ['example_id', 'cls_score', 'start_logit', 'end_logit', 'span_answer', 'span_answer_score', 'start_index', 'end_index', 'passage_index', 'target_type_logits', 'span_answer_text', 'yes_no_answer', 'start_stdev', 'end_stdev', 'query_passage_similarity', 'normalized_span_answer_score', 'confidence_score', 'question', 'language', 'passage_answer_text', 'order', 'rank', 'question_type_pred', 'question_type_scores', 'question_type_conf'],
    num_rows: 112
})

## Run the predictions
As in MRC, a `Trainer` instance runs the predictions.

In [5]:
trainer = Trainer( 
    model=model,
    args=training_args,
    train_dataset=None,
    eval_dataset=eval_dataset,
    compute_metrics=None, #compute_metrics,
    tokenizer=tokenizer,
    data_collator=None,
)
predictions = trainer.predict(eval_dataset, metric_key_prefix="predict").predictions



The following columns in the test set  don't have a corresponding argument in `XLMRobertaForSequenceClassification.forward` and have been ignored: language, example_id, passage_answer_text, question. If language, example_id, passage_answer_text, question are not expected by `XLMRobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 112
  Batch size = 16


## Predictions

The pretrained model we provide is actually a ternary model: it predicts `no`, `no_answer`, or `yes`. The `no_answer` label is discarded in pipelines that end with an actual TyDi evaluation, since the TyDi evaluation script selects the score threshold that distinguishes answerable from unanswerable questions. However, other applications may want to make use of this category.

In [6]:
pd.DataFrame.from_records(predictions[0:5,:])

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -6.061874 | 5.307685 | 1.121236 |
| 1 | -5.060166 | 5.039388 | 0.649324 |
| 2 | 5.695564 | -1.510027 | -3.645893 |
| 3 | -3.895487 | -1.826522 | 5.120609 |
| 4 | -4.380426 | 6.555937 | -1.952797 |
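The values above are raw logits, one column per label. A minimal sketch of how they could be turned into probabilities and label names follows. It assumes the columns follow the order of `label_list` (`no`, `no_answer`, `yes`), which is consistent with the `boolean_answer_scores` shown later in this notebook. The `softmax` helper is written here for illustration and is not part of PrimeQA.

```python
import math

label_list = ['no', 'no_answer', 'yes']

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Rows 0 and 2 of the logits displayed above
rows = [
    [-6.061874, 5.307685, 1.121236],
    [5.695564, -1.510027, -3.645893],
]

for logits in rows:
    probs = softmax(logits)
    top = label_list[probs.index(max(probs))]  # label with the highest probability
    print(top, {lab: round(p, 3) for lab, p in zip(label_list, probs)})
```

The highest-probability label is `no_answer` for the first row and `no` for the second.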


In [7]:
eval_preds = postprocessor.process_references_and_predictions(eval_examples, eval_dataset, predictions)
eval_preds_ds = create_dataset_from_json_str(eval_preds.predictions, False)
print(eval_preds_ds)

in process_references_and_predictions
Dataset({
    features: ['example_id', 'cls_score', 'start_logit', 'end_logit', 'span_answer', 'span_answer_score', 'start_index', 'end_index', 'passage_index', 'target_type_logits', 'span_answer_text', 'yes_no_answer', 'start_stdev', 'end_stdev', 'query_passage_similarity', 'normalized_span_answer_score', 'confidence_score', 'question', 'language', 'passage_answer_text', 'order', 'rank', 'question_type_pred', 'question_type_scores', 'question_type_conf', 'boolean_answer_pred', 'boolean_answer_scores', 'boolean_answer_conf'],
    num_rows: 112
})
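The postprocessor was configured with `drop_label="no_answer"`, so `boolean_answer_pred` is always `yes` or `no`. The sketch below illustrates one plausible decision rule: take the highest-scoring label among those that remain after dropping `no_answer`. This rule is an assumption that agrees with the predictions displayed in this notebook; it is not taken from the PrimeQA source, and `pick_label` is a hypothetical helper.

```python
def pick_label(scores, drop_label=None):
    """Return the highest-scoring label, ignoring `drop_label` if given."""
    candidates = {lab: s for lab, s in scores.items() if lab != drop_label}
    return max(candidates, key=candidates.get)

# Logits of row 4 of the predictions, keyed by label
scores = {'no': -4.380426, 'no_answer': 6.555937, 'yes': -1.952797}

print(pick_label(scores))                          # no_answer
print(pick_label(scores, drop_label='no_answer'))  # yes
```

Applications that want to keep the `no_answer` category could apply the same rule without a dropped label.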


## Questions and answers

Here we display some questions that have been identified as boolean, along with their predicted answers, based on the output of the MRC system. A weakness of the TyDiQA dataset is that most (85%) of the boolean questions have an answer of `yes`; apparently the question writers wrote questions seeking confirmation of what they already knew or suspected. We display the `passage_answer_text` that was automatically extracted by the upstream MRC system because it is what the classifier used to make the `yes`/`no`/`no_answer` prediction.

In [8]:
from numpy.random import permutation
import pandas as pd
from IPython.display import display, HTML

# Based on https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb
def show_balanced_examples(dataset, perm, groups, nrows, maxchars, cols):
    """Display up to `nrows` shuffled examples for each value of `groups`."""
    df = pd.DataFrame(dataset)
    dfp = df.iloc[perm] # shuffle
    dfg = dfp.groupby(groups)
    # Copy so that the truncation below does not write into a slice of dfp
    df_todisplay = dfg.head(nrows)[cols].copy()
    if 'passage_answer_text' in cols:
        # Truncate long passages for display
        df_todisplay['passage_answer_text'] = df_todisplay['passage_answer_text'].str.slice(0,maxchars) + '...'
    display(HTML(df_todisplay.to_html()))
    
    

english_boolean_eval_examples = eval_preds_ds.filter(lambda x:x['language']=='english' and x['question_type_pred']=='boolean')
random_idxs = permutation(len(english_boolean_eval_examples))
cols=['example_id','question','passage_answer_text', 'boolean_answer_pred', 'boolean_answer_scores']
show_balanced_examples(english_boolean_eval_examples, random_idxs, 'boolean_answer_pred', 5, 300, cols)



| | example_id | question | passage_answer_text | boolean_answer_pred | boolean_answer_scores |
|---|---|---|---|---|---|
| 40 | ec0e4b73-ec45-4ffb-b6d8-3fd23d00dddf | Does an animal with vertebrae have to be a chordate? | Craniates, one of the three subdivisions of chordates, all have distinct skulls. They include the hagfish, which have no vertebrae. Michael J. Benton commented that "craniates are characterized by their heads, just as chordates, or possibly all deuterostomes, are by their tails".[12]... | no | {'no': 6.118734836578369, 'no_answer': -1.7950204610824585, 'yes': -3.937854528427124} |
| 65 | 0168b3e5-647f-485c-b3b0-58a6b47b9516 | Does California get snow? | The high mountains, including the Sierra Nevada, the Cascade Range, and the Klamath Mountains, have a mountain climate with snow in winter and mild to moderate heat in summer. Ski resorts at Lake Tahoe, Mammoth Lakes, and Mount Shasta routinely receive over 10 feet (3.0m) of snow in a season, and so... | yes | {'no': -5.855449676513672, 'no_answer': -0.5776659846305847, 'yes': 5.964153289794922} |
| 4 | 3f08a82a-48dd-4207-8e55-2431125c6c05 | Does the Magellanic Cloud system have a super massive black hole? | The Magellanic Clouds (or Nubeculae Magellani[2]) are two irregular dwarf galaxies visible in the Southern Celestial Hemisphere; they are members of the Local Group and are orbiting the Milky Way galaxy. Because both show signs of a bar structure, they are often reclassified as Magellanic spiral gal... | yes | {'no': -4.380425930023193, 'no_answer': 6.55593729019165, 'yes': -1.9527971744537354} |
| 47 | 31b00909-a884-4eae-8a15-a388befb5eec | Is the great horned owl endangered? | The great horned owl is not considered a globally threatened species by the IUCN.[1] Including the Magellanic species, there are approximately 5.3 million wild horned owls in the Americas.[7] Most mortality in modern times is human-related, caused by owls flying into man-made objects, including buil... | no | {'no': 5.95291805267334, 'no_answer': -0.9690039753913879, 'yes': -4.356748580932617} |
| 42 | e215f384-0afa-4477-b34a-d5f75ac8a467 | Is the Mauser C96 produced today? | The Spanish gunmaker Astra-Unceta y Cia began producing a copy of the Mauser C.96 in 1927 that was externally similar to the C96 (including the presence of a detachable shoulder stock/holster) but with non-interlocking internal parts. It was produced until 1941, with a production hiatus in 1937 and ... | no | {'no': 3.602304458618164, 'no_answer': 0.8610267043113708, 'yes': -4.071805000305176} |
| 69 | 5b95654d-2cad-436e-bd42-18491a65c386 | Can the central nervous system heal itself? | Nervous system injuries affect over 90,000 people every year.[2] It is estimated that spinal cord injuries alone affect 10,000 each year.[3] As a result of this high incidence of neurological injuries, nerve regeneration and repair, a subfield of neural tissue engineering, is becoming a rapidly grow... | no | {'no': 1.9971483945846558, 'no_answer': -1.7380565404891968, 'yes': -0.2094581127166748} |
| 26 | 8eb7e6eb-a03d-41aa-a9d4-dd6d46493b65 | Is Daydream Software still an active company? | The Israeli company Waze Mobile developed the Waze software. Ehud Shabtai, Amir Shinar and Uri Levine founded the company. Two Israeli venture capital firms, Magma and Vertex, and an early-stage American venture capital firm, Bluerun Ventures, provided funding. Google acquired Waze Mobile in 2013.... | yes | {'no': -4.1507720947265625, 'no_answer': 6.917901992797852, 'yes': -2.0596728324890137} |
| 11 | 07d20f1b-460f-48ce-ad38-feb956b25c49 | Is Hungarian a romance language? | Hungarian (magyar nyelv) is a Finno-Ugric language spoken in Hungary and several neighbouring countries. It is the official language of Hungary and one of the 24 official languages of the European Union. Outside Hungary it is also spoken by communities of Hungarians in the countries that today make... | yes | {'no': -3.6559829711914062, 'no_answer': 5.812913417816162, 'yes': -1.4828946590423584} |
| 10 | bbb56373-e84e-4fde-8858-be612b8f5d2c | Is Cantonese written the same as Mandarin? | General estimates of vocabulary differences between Cantonese and Mandarin range from 30 to 50 percent. Donald B. Snow, the author of Cantonese as Written Language: The Growth of a Written Chinese Vernacular, wrote that "It is difficult to quantify precisely how different" the two vocabularies are.[... | no | {'no': 6.098764419555664, 'no_answer': -1.3992488384246826, 'yes': -4.08381462097168} |
| 73 | 2254035a-f43d-48e9-9eae-ef1c1e626f96 | Can salt marsh die-off be fixed? | Research on the salt marsh snail Littoraria irrorata and its effects on marsh plant productivity, have provided strong evidence of consumer control in marshes triggered by overexploitation. This snail is capable of turning strands of cordgrass (Spartina alterniflora) (>2.5m tall) into mudflats withi... | yes | {'no': -5.442584037780762, 'no_answer': 6.890753746032715, 'yes': -1.1657015085220337} |
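Several of the `yes` predictions above have `no_answer` as their highest raw score, which suggests the passage gave the classifier little evidence either way. For applications that want to use the `no_answer` category, the following sketch flags such rows. The score dictionaries are copied from the table above; `top_is_dropped` is a hypothetical helper, not a PrimeQA function.

```python
# (row index -> boolean_answer_scores) for three rows of the table above
rows = {
    65: {'no': -5.855449676513672, 'no_answer': -0.5776659846305847, 'yes': 5.964153289794922},
    4: {'no': -4.380425930023193, 'no_answer': 6.55593729019165, 'yes': -1.9527971744537354},
    11: {'no': -3.6559829711914062, 'no_answer': 5.812913417816162, 'yes': -1.4828946590423584},
}

def top_is_dropped(scores, drop_label='no_answer'):
    """Return True when the dropped label has the highest raw score."""
    return max(scores, key=scores.get) == drop_label

# Rows whose yes/no prediction was made even though no_answer scored highest
flagged = [idx for idx, scores in rows.items() if top_is_dropped(scores)]
print(flagged)  # [4, 11]
```

Row 65 is not flagged because `yes` is its top-scoring label, whereas rows 4 and 11 (the Magellanic Cloud and Hungarian questions) are predictions that a downstream application might prefer to treat as unanswered.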
