# TydiQA - support for boolean questions

Here we assume that you have used `run_mrc.py` with the `--do_boolean` option to decode the TydiQA dataset with full support for boolean questions.  See top-level README.md. There are four stages in the process:

1. MRC (**M**achine **R**eading **C**omprehension) - given a question and a document, find a representative span that may contain a short answer.  This is analyzed in detail in `tydiqa.ipynb`.
2. QTC (**Q**uestion **T**ype **C**lassification) - given the question, decide whether it is `boolean` or `short_answer`.
3. EVC (**Ev**idence **C**lassifier) - given a question and a short answer span, decide whether the span supports `yes` or `no`.  This is analyzed in more detail in `evc.ipynb`.
4. SN (**S**core **N**ormalization) - span scores may have different dynamic ranges depending on whether the question is `boolean` or `short_answer`.  Normalize them uniformly to $[0,1]$.

In this notebook, we show what happens internally at each step of the process by looking at the intermediate files from an experiment.

# Intermediate files

We will load some output and intermediate files from a recent command-line experiment, run with the command:
```
python examples/mrc/run_mrc.py --model_name_or_path ${BOOLEAN_MODEL_NAME} \
       --output_dir ${OUTPUT_DIR} --fp16 --do_eval \
       --per_device_eval_batch_size 128 --overwrite_output_dir \
       --postprocessor oneqa.boolqa.processors.postprocessors.extractive.ExtractivePipelinePostProcessor \
       --do_boolean --boolean_config ${BOOLEAN_CONFIG_FILE}
```

In [6]:
base='/dccstor/mabornea2/oneqa_os/oneqa_bool/EXP_07/mrc/'
mrc_file=f'{base}/eval_predictions.json'
qtc_file=f'{base}/qtc/eval_predictions.json'
evc_file=f'{base}/evc/eval_predictions.json'
out_file=f'{base}/sn/eval_predictions_processed.json'

# Display helper

Our intermediate files have many fields - to display them better we use a helper routine to convert to dataframes.

In [7]:
from oneqa.boolqa.processors.dataset.mrc2dataset import create_dataset_from_run_mrc_output

from datasets import ClassLabel, Sequence
from numpy.random import permutation
import pandas as pd
from IPython.display import display, HTML

# Based on https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb
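def show_balanced_examples(dataset, perm, groups, nrows, maxchars, cols):
    """Display up to `nrows` shuffled examples per group, restricted to `cols`."""
    df = pd.DataFrame(dataset)
    dfp = df.iloc[perm]  # shuffle rows using the given permutation
    dfg = dfp.groupby(groups)
    # Copy so the truncation below does not write into a slice of `dfp`
    df_todisplay = dfg.head(nrows)[cols].copy()
    if 'passage_answer_text' in cols:
        # Truncate long passages to `maxchars` characters for display
        df_todisplay['passage_answer_text'] = df_todisplay['passage_answer_text'].str.slice(0, maxchars) + '...'
    display(HTML(df_todisplay.to_html()))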
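
The balancing in this helper comes from shuffling first and then taking the first few rows of each group. A minimal sketch of that idea on a toy dataframe follows; the data and the name `balanced_sample` are hypothetical and only illustrate the pattern, without the HTML display.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): six questions in two languages.
toy = pd.DataFrame({
    "example_id": ["a", "b", "c", "d", "e", "f"],
    "language": ["english", "english", "english",
                 "finnish", "finnish", "finnish"],
    "question": ["q1", "q2", "q3", "q4", "q5", "q6"],
})

def balanced_sample(df, perm, group_col, nrows):
    """Shuffle rows with `perm`, then keep the first `nrows` rows of each group."""
    shuffled = df.iloc[perm]
    return shuffled.groupby(group_col).head(nrows)

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
perm = rng.permutation(len(toy))
sample = balanced_sample(toy, perm, "language", nrows=1)
print(sample)
```

Because `head` is applied per group, every group contributes at most `nrows` rows regardless of how unbalanced the original data is.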

# Samples of MRC output

Here we show the `question` and the predicted answer `span_answer_text` for random examples, one from each language.  This is the initial stage of question answering - a purely extractive system.  The confidence in the span answer is given by `span_answer_score`, which is a function of various other logits available in the file.

In [12]:
eval_examples=create_dataset_from_run_mrc_output(mrc_file, unpack=False)
random_idxs = permutation(len(eval_examples))

cols=['example_id','question','span_answer_text','language', 'span_answer_score']
show_balanced_examples(eval_examples, random_idxs, 'language', 1, 100, cols)

Unnamed: 0,example_id,question,span_answer_text,language,span_answer_score
13471,5262a9c9-1fb8-4b04-ae06-6941f357c987,చెలిమారివలస గ్రామంలో ప్రధాన పంట ఏంటి?,బంజరు,telugu,-1.688477
3580,fd1d73c0-2e26-47e9-9d00-020a7b8fd82a,متى عاش أبو عبد الله سفيان بن سعيد بن مسروق؟,97 هـ-161 هـ,arabic,6.950684
5481,6a4a85dd-5a68-4126-95b4-c5f65057d8c8,ลี่ หมิง มีชื่อเรียกภาษาอังกฤษว่า ลีออน หลี่ เป็นนักร้อง-นักแสดงชาย สัญชาติใด ?,แคนาดาเชื้อสายฝรั่งเศส,thai,0.180664
923,a953d2ad-1ee1-4da0-acc1-2a28b80ad3fb,kapankah PDIA didirikan?,13 Desember 1937,indonesian,2.708008
13583,48bce28b-b966-41eb-b2e8-dd2b38b42570,Je ubaguzi wa rangi wa kisheria nchini Afrika Kusini uliisha mwak upi?,Ujamaa,swahili,-7.0784
5244,d118c473-c70a-4dd1-9569-ae217c874bc5,電気を人工的に発生させることができるようになったのはいつから,1939年に,japanese,-4.271973
4375,f2a0f11b-ff9f-4feb-875b-4c8141705514,Milloin Sonja Henie syntyi?,(8. huhtikuuta 1912,finnish,9.71875
6952,4dfb2e44-9774-4024-95c0-f80bd8a1ec73,Когда начали увлекаться спиритическими сеансами в России?,8 по 11 сентября 1888,russian,3.710938
8609,9a742799-fc3d-489a-ae80-4bcfaad8c628,드와이트 데이비드 하워드 주니어의 마지막 연봉은 얼마인가?,25만 달러,korean,-3.300293
15440,03c3672d-6ad0-4ccc-86d9-2f4bd4a17dd2,How old is the oldest operating steam locomotive?,Fairy Queen,english,3.339844


# Samples of QTC output

At this stage, two fields have been added.  `question_type_pred` is `boolean` if the question is predicted to be a boolean question, and `short_answer` otherwise - typically a factoid question in this dataset.
The other field, `question_type_scores`, contains the classifier scores (logits) for each class.
The vast majority of questions in TydiQA are `short_answer`, so we present random English examples chosen equally from those predicted `boolean` and those predicted `short_answer`.

In [9]:
eval_examples=create_dataset_from_run_mrc_output(qtc_file, unpack=False)
english_eval_examples = eval_examples.filter(lambda x:x['language']=='english')
random_idxs = permutation(len(english_eval_examples))
cols=['example_id','question','question_type_pred', 'question_type_scores']
show_balanced_examples(english_eval_examples, random_idxs, 'question_type_pred', 5, 100, cols)

Unnamed: 0,example_id,question,question_type_pred,question_type_scores
522,61c7d9aa-ea6b-43f4-add0-d4b2a65c3418,What else can you do with general anaesthetic?,short_answer,"{'boolean': -2.9969449043273926, 'short_answer': 3.780900716781616}"
700,6a01d1f7-e56b-4da3-9b94-7a5844fd903a,Where was Dora Sakayan born?,short_answer,"{'boolean': -2.997539758682251, 'short_answer': 3.781093120574951}"
471,e215f384-0afa-4477-b34a-d5f75ac8a467,Is the Mauser C96 produced today?,boolean,"{'boolean': 3.4154090881347656, 'short_answer': -4.333992004394531}"
243,417f356c-a560-4d51-b453-115e3af0b338,How long does the World Series of Poker last?,short_answer,"{'boolean': -2.997260570526123, 'short_answer': 3.7806217670440674}"
477,c53d58fc-ceea-4418-97f4-742d0909437d,How can you join the Union of Australian Women,short_answer,"{'boolean': -2.996549606323242, 'short_answer': 3.7806396484375}"
146,758d4526-a9b8-4de4-895e-217f3e42868c,Has Romania been an any major war?,boolean,"{'boolean': 3.4168570041656494, 'short_answer': -4.33167839050293}"
675,7a8e9329-9ae7-4700-bbec-91b3c1f934e9,Do the Aborigines have a verbal culture?,boolean,"{'boolean': 3.415651559829712, 'short_answer': -4.333675384521484}"
685,b54d2d39-0c91-425f-865a-006c4091af26,When was the Romantic period in classical music?,short_answer,"{'boolean': -2.9972946643829346, 'short_answer': 3.7816860675811768}"
254,e75977a2-23d1-406d-a349-01c16bec9a38,Did Charlemagne's son found France?,boolean,"{'boolean': 3.4171557426452637, 'short_answer': -4.332158088684082}"
801,d39b8a03-91ad-486b-a442-8f567179cdd7,Is popcorn lung linked to vaping?,boolean,"{'boolean': 3.4153695106506348, 'short_answer': -4.333074569702148}"
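
Since `question_type_scores` holds raw logits, it can help to view them as probabilities. The sketch below applies a softmax to the scores of one row from the table above and assumes that the predicted label is simply the class with the largest logit, which is consistent with the rows shown but is not confirmed by the pipeline code. The helper names are hypothetical.

```python
import math

def softmax(scores):
    """Convert a dict of logits into a dict of probabilities."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def predict_label(scores):
    """Assumed decision rule: pick the class with the largest logit."""
    return max(scores, key=scores.get)

# Scores copied from the row "Is the Mauser C96 produced today?"
qtc_scores = {'boolean': 3.4154090881347656, 'short_answer': -4.333992004394531}

probs = softmax(qtc_scores)
label = predict_label(qtc_scores)
print(label, probs)
```

With a logit gap of almost 8 between the two classes, the softmax puts nearly all of the probability mass on `boolean`.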


# Samples of EVC output 
As above, this classifier adds two new fields.  `boolean_answer_pred` is `yes` if the predicted answer to a boolean question is positive/true/yes, `no` if the answer is negative/false/no, and `no_answer` if there is no support for either answer in the context.  The field `boolean_answer_scores` provides the scores (logits) of each class.
For the TydiQA evaluation, we discard the `no_answer` prediction and always predict `yes` or `no`.  Other applications may choose a different behavior.

For presentation purposes, we select the English questions from the dev set (they are not scored by `tydi_eval.py`), which have a higher fraction of boolean questions.  The boolean questions in the TydiQA dataset are overwhelmingly biased towards `yes` rather than `no` as the answer.  We suspect that the question writers were attempting to confirm existing knowledge.
Note that the answer classifier runs on all questions, even on the short answer questions, for simplicity.  A real deployed system would run the answer classifier only on questions that are predicted to be boolean.

In [10]:
eval_examples=create_dataset_from_run_mrc_output(evc_file, unpack=False)
english_boolean_eval_examples = eval_examples.filter(lambda x:x['language']=='english' and x['question_type_pred']=='boolean')
random_idxs = permutation(len(english_boolean_eval_examples))
cols=['example_id','question','passage_answer_text', 'boolean_answer_pred', 'boolean_answer_scores']
show_balanced_examples(english_boolean_eval_examples, random_idxs, 'boolean_answer_pred', 5, 300, cols)

Unnamed: 0,example_id,question,passage_answer_text,boolean_answer_pred,boolean_answer_scores
24,64caf47e-c348-4b2c-a90a-45aae4930784,Were there ever any WMD in Iraq?,"In January 2003, United Nations weapons inspectors reported that they had found no indication that Iraq possessed nuclear weapons or an active program. Some former UNSCOM inspectors disagree about whether the United States could know for certain whether or not Iraq had renewed production of weapons ...",no,"{'no': -1.4815336465835571, 'no_answer': 4.486413478851318, 'yes': -2.3255317211151123}"
38,5ba960fe-4ef8-4043-a0ef-998733a83992,Was there slavery in Hispaniola?,"Enslaved people challenged their captivity in ways that ranged from introducing non-European elements into Christianity (syncretism) to mounting alternative societies outside the plantation system (Maroons). The first open black rebellion occurred in Spanish plantations in 1521.[6] Resistance, parti...",yes,"{'no': -5.312918663024902, 'no_answer': -0.6684049367904663, 'yes': 5.860434532165527}"
11,07d20f1b-460f-48ce-ad38-feb956b25c49,Is Hungarian a romance language?,Hungarian (magyar nyelv) is a Finno-Ugric language spoken in Hungary and several neighbouring countries. It is the official language of Hungary and one of the 24 official languages of the European Union. Outside Hungary it is also spoken by communities of Hungarians in the countries that today make...,yes,"{'no': -3.671246290206909, 'no_answer': 5.828882217407227, 'yes': -1.4822595119476318}"
110,aa9181b5-4986-4c00-adf9-f9929b210dfb,Is the Renaissance considered classicalism?,"Classicism is a recurrent tendency in the Late Antique period, and had a major revival in Carolingian and Ottonian art. There was another, more durable revival in the Italian renaissance when the fall of Byzantium and rising trade with the Islamic cultures brought a flood of knowledge about, and fro...",yes,"{'no': -4.013012409210205, 'no_answer': -3.4352951049804688, 'yes': 6.554855823516846}"
9,495e52f7-c094-4dfe-a6ae-ff65071a15bd,Can you tell skin color from DNA?,"Human skin color ranges in variety from the darkest brown to the lightest hues. An individual's skin pigmentation is the result of genetics, being the product of both of the individual's biological parents' genetic makeup, and exposure to sun. In evolution, skin pigmentation in human beings evolved ...",yes,"{'no': -5.738819599151611, 'no_answer': 4.450618267059326, 'yes': 1.4050309658050537}"
109,f8b361d5-a0e6-42c1-b2b5-6ab55dbd6dc6,Did Key make anymore adult visual novels?,"Key is a Japanese visual novel studio which formed on July 21, 1998 as a brand under the publisher Visual Arts and is located in Kita, Osaka. Key's debut visual novel Kanon (1999) combined an elaborate storyline, up-to-date anime-style art, and a musical score which helped to set the mood for the ga...",yes,"{'no': -3.663787364959717, 'no_answer': -3.0000221729278564, 'yes': 6.0394463539123535}"
47,31b00909-a884-4eae-8a15-a388befb5eec,Is the great horned owl endangered?,"The great horned owl is not considered a globally threatened species by the IUCN.[1] Including the Magellanic species, there are approximately 5.3 million wild horned owls in the Americas.[7] Most mortality in modern times is human-related, caused by owls flying into man-made objects, including buil...",no,"{'no': 5.9517669677734375, 'no_answer': -0.9666835069656372, 'yes': -4.357577323913574}"
53,8de0f49d-cd6b-401a-b084-25b3eafba62d,Is there a cure for juvenile rheumatoid arthritis?,"The cause of JIA remains unknown. However, the disorder is autoimmune[10] — meaning that the body's own immune system starts to attack and destroy cells and tissues (particularly in the joints) for no apparent reason. The immune system is thought to be provoked by changes in the environment, in com...",no,"{'no': 4.393433094024658, 'no_answer': 0.7118827104568481, 'yes': -4.3216376304626465}"
2,be22ceb6-0881-494b-8a0c-46289a81341d,Is there a cure for legionellosis?,"No vaccine is available.[8] Prevention depends on good maintenance of water systems.[8] Treatment of Legionnaires' disease is with antibiotics.[9] Recommended agents include fluoroquinolones, azithromycin, or doxycycline.[12] Hospitalization is often required.[11] About 10% of those who are infected...",no,"{'no': 5.696393013000488, 'no_answer': -1.5115960836410522, 'yes': -3.6455893516540527}"
14,758d4526-a9b8-4de4-895e-217f3e42868c,Has Romania been an any major war?,The United Principalities of Moldavia and Wallachia did not participate in any wars....,no,"{'no': 4.973013401031494, 'no_answer': -0.7406830191612244, 'yes': -3.809783458709717}"
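
Several rows above have `no_answer` as the highest-scoring class yet are still labeled `yes` or `no`. A minimal sketch of a decision rule that would produce this behavior follows: ignore `no_answer` and take the larger of the `yes` and `no` logits. This is an assumption that matches the rows shown, not a copy of the pipeline's implementation, and the function name is hypothetical.

```python
def yes_no_decision(scores, allowed=("yes", "no")):
    """Pick the best-scoring label among `allowed`, ignoring all other classes."""
    return max(allowed, key=lambda label: scores[label])

# Scores copied from two rows of the table above.
hungarian = {'no': -3.671246290206909, 'no_answer': 5.828882217407227,
             'yes': -1.4822595119476318}
wmd = {'no': -1.4815336465835571, 'no_answer': 4.486413478851318,
       'yes': -2.3255317211151123}

print(yes_no_decision(hungarian))  # 'yes', although no_answer has the top logit
print(yes_no_decision(wmd))        # 'no', although no_answer has the top logit
```

An application that prefers to abstain could instead take the arg-max over all three classes and treat `no_answer` as "unsupported".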


# Final output

The final output file is in a format suitable for the TydiQA evaluation script and contains no textual information.  The `confidence_score` is normalized to `[0,1]` by the score normalizer, based on the confidence score of the original MRC output and the prediction of the question type classifier.

In [11]:
pd.read_json(out_file)

Unnamed: 0,example_id,start_position,end_position,passage_index,yes_no_answer,confidence_score
0,740c33ae-78b9-40e5-8444-b6c8d2776776,986,1020,2,0,0.697498
1,e7541e40-46ec-494c-b0d6-a9c435568f2b,385,388,1,0,0.044546
2,869198a8-fc4d-43b4-bed7-24a61c17d8ab,14805,14807,27,0,0.099508
3,308f64d3-2794-410c-b2d5-10472b7e6661,6703,6715,13,0,0.083534
4,87ade38f-9ae7-4a98-8558-335772c33843,2539,2544,6,0,0.081707
...,...,...,...,...,...,...
18665,6504cb42-77d4-4a7d-a8bd-00d7f2df8994,1,12,0,0,0.213102
18666,704baa32-153b-45fb-9ef2-ca9f2e7265fd,5770,5777,16,0,0.034656
18667,5383daad-3b82-4e34-ad04-0c08a3ff86f7,1483,1541,4,0,0.172497
18668,a737ab02-a0a0-4b09-9ea8-ea348d19e212,61,69,0,0,0.120389
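
This notebook does not show how the score normalizer computes `confidence_score`. Purely as an illustration of mapping scores with different dynamic ranges onto `[0,1]`, the sketch below applies a logistic function with separate center and scale values per question type. The function name, the `CALIBRATION` table, and all of its values are hypothetical and are not taken from the experiment.

```python
import math

# Hypothetical (center, scale) per question type; NOT the experiment's values.
CALIBRATION = {
    "boolean": (0.0, 2.0),
    "short_answer": (1.0, 4.0),
}

def normalize_score(raw_score, question_type, calibration=CALIBRATION):
    """Squash a raw span score into [0, 1] using a per-type logistic function."""
    center, scale = calibration[question_type]
    z = (raw_score - center) / scale
    # Numerically stable logistic: avoid overflow for large negative z
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

print(normalize_score(6.950684, "short_answer"))
print(normalize_score(-7.0784, "short_answer"))
```

Any monotone mapping of this kind preserves the ranking of spans within a question type while making scores comparable across types; the actual normalizer may use a different scheme.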
