# TydiQA - support for boolean questions

Here we assume that you have used `run_mrc.py` with the `--do_boolean` option to decode the TydiQA dataset with full support for boolean questions (see the top-level README.md).  There are four stages in the process:

1. MRC (**M**achine **R**eading **C**omprehension) - given a question and a document, find a representative span that may contain a short answer.  This is analyzed in detail in `tydiqa.ipynb`.
2. QTC (**Q**uestion **T**ype **C**lassification) - given the question, decide whether it is `boolean` or `short_answer`.
3. EVC (**Ev**idence **C**lassifier) - given a question and a short answer span, decide whether the span supports `yes` or `no`.  This is analyzed in more detail in `evc.ipynb`.
4. SN (**S**core **N**ormalization) - span scores may have different dynamic ranges depending on whether the question is `boolean` or `short_answer`.  Normalize them uniformly to $[0,1]$.
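
To make stage 4 more concrete, here is a minimal sketch of one way scores with different dynamic ranges could be mapped onto a common $[0,1]$ scale: a logistic curve with separate parameters per question type. This is an illustration only, not necessarily how the PrimeQA score normalizer works; the `CALIBRATION` values and the `normalize_score` helper are hypothetical.

```python
import math

# Hypothetical (slope, offset) pairs, one per question type.
# These numbers are invented for illustration; in practice they would be
# fitted on held-out data.
CALIBRATION = {
    "boolean": (0.5, 0.0),
    "short_answer": (1.0, -2.0),
}


def normalize_score(raw_score: float, question_type: str) -> float:
    """Squash a raw span score into (0, 1) with a type-specific logistic curve."""
    slope, offset = CALIBRATION[question_type]
    return 1.0 / (1.0 + math.exp(-(slope * raw_score + offset)))


# Raw scores of different magnitude end up on the same scale.
for qtype, raw in [("boolean", -3.40), ("short_answer", 6.05)]:
    print(qtype, round(normalize_score(raw, qtype), 3))
```

The point of the per-type parameters is that the same raw score can correspond to different levels of confidence for boolean and short answer questions.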

In this notebook, we show what happens internally at each stage by looking at the intermediate files from an experiment.

# Intermediate files

We will load some output and intermediate files from a recent command-line experiment, run with the command:
```
python primeqa/mrc/run_mrc.py --model_name_or_path PrimeQA/tydi-reader_bpes-xlmr_large-20221117 \
       --output_dir ${OUTPUT_DIR} --fp16 --do_eval \
       --per_device_eval_batch_size 128 --overwrite_output_dir \
       --postprocessor primeqa.boolqa.processors.postprocessors.extractive.ExtractivePipelinePostProcessor \
       --do_boolean --boolean_config primeqa/boolqa/tydi_boolqa_config.json
```

In [2]:
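# One file per stage. `base` is not defined in this excerpt; it is assumed to be set in an earlier cell.
mrc_file=f'{base}/eval_predictions.json'
qtc_file=f'{base}/qtc/predictions.json'
evc_file=f'{base}/evc/predictions.json'
out_file=f'{base}/sn/eval_predictions_processed.json'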

# Display helper

Our intermediate files have many fields - to display them better we use a helper routine to convert to dataframes.

In [3]:
from primeqa.boolqa.processors.dataset.mrc2dataset import create_dataset_from_run_mrc_output

from datasets import ClassLabel, Sequence
from numpy.random import permutation
import pandas as pd
from IPython.display import display, HTML

# Based on https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb
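def show_balanced_examples(dataset, perm, groups, nrows, maxchars, cols):
    """Display up to `nrows` shuffled examples per group, restricted to the columns `cols`."""
    df = pd.DataFrame(dataset)
    dfp = df.iloc[perm]  # shuffle the rows using the permutation `perm`
    dfg = dfp.groupby(groups)
    # head(nrows) keeps the first `nrows` rows of every group; copy() avoids
    # pandas' SettingWithCopyWarning when a column is modified below
    df_todisplay = dfg.head(nrows)[cols].copy()
    if 'passage_answer_text' in cols:
        # truncate long passages to `maxchars` characters for display
        df_todisplay['passage_answer_text'] = df_todisplay['passage_answer_text'].str.slice(0,maxchars) + '...'
    display(HTML(df_todisplay.to_html()))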
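
The balancing in the helper above relies on `groupby(...).head(n)` applied to a shuffled frame, which keeps at most `n` rows per group. The following sketch shows that behaviour on a toy frame; the data and the fixed seed are made up for illustration.

```python
import pandas as pd
from numpy.random import default_rng

# Toy data with an uneven number of rows per language (invented for illustration).
toy = pd.DataFrame({
    "language": ["english"] * 4 + ["finnish"] * 2 + ["thai"] * 3,
    "question": [f"q{i}" for i in range(9)],
})

# Shuffle with a fixed seed, then keep the first 2 rows of each language.
perm = default_rng(0).permutation(len(toy))
sample = toy.iloc[perm].groupby("language").head(2)

# Every language contributes the same number of rows, regardless of its size.
print(sample["language"].value_counts().to_dict())
```

Because the frame is shuffled before grouping, the rows kept for each group are a random sample rather than the first rows of the file.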

# Samples of MRC output

Here we show the `question` and the predicted answer `span_answer_text` for random examples (one from each language).  This is the initial stage of question answering - a purely extractive system.  The confidence in the span answer is given by `span_answer_score`, which is a function of various other logits available in the file.

In [4]:
eval_examples=create_dataset_from_run_mrc_output(mrc_file, unpack=False)
random_idxs = permutation(len(eval_examples))

cols=['example_id','question','span_answer_text','language', 'span_answer_score']
show_balanced_examples(eval_examples, random_idxs, 'language', 1, 100, cols)

| | example_id | question | span_answer_text | language | span_answer_score |
|---|---|---|---|---|---|
| 2952 | 44d0fc1d-d9da-46c9-ba9a-b685e98db154 | পৃথিবীর প্রথম মানচিত্রের নাম কী ? | এরাতোস্থেনেস | bengali | -0.668091 |
| 14958 | 3c3eaf7e-89e9-40ad-8547-e78c2b567623 | Je,Arusha ina idadi ya watu wangapi? | 23000 | swahili | 4.373718 |
| 145 | 1bff8109-bfa9-4e99-96cc-c72431a6b139 | Saiko Suomi jatkosodassa sotasaaliiksi viisi 130/50 N -tykkiä? | Talvisodan synnyttämä revanssihenki oli osaltaan viemässä Suomea jatkosotaan. | finnish | -3.3974 |
| 10092 | 1b300284-f80b-4aad-a113-333906b38caf | apakah Ariana Grande seorang Heteroseksual? | Contoh | indonesian | -3.620056 |
| 11962 | 2cbccd85-62b7-4661-8386-bb37c10c525d | บีเวอร์ มีชื่อทางวิทยาศาสตร์ว่าอย่างไร? | Castor fiber | thai | 6.416016 |
| 3941 | 4baf5b00-de62-4789-ba25-4affe428a2f4 | 2017 నాటికి ఆదిలక్ష్మాంబ పురం గ్రామంలో వ్యవసాయేతర వినియోగంలో ఉన్న భూమి ఎంత? | 27.11 | telugu | 2.588379 |
| 13676 | eaf55e82-7713-46c2-b913-23ed1130faa0 | 한국에서 가장 오래된 성당은 어디인가? | 성공회 강화성당 | korean | 6.465332 |
| 14561 | 5507963f-9a14-491c-9cdf-7390991eed49 | Как зовут главного персонажа фильма «Рэкет» (1951)? | Ник Скэнлон | russian | 4.588623 |
| 16500 | 23da52e4-119d-4082-9c45-8521b79ae3f7 | Where did the Meiji Restoration take place? | Empire of Japan | english | 6.046127 |
| 15484 | 7ff4cc0c-a8fe-4c79-8e23-29458f0d4926 | ロストラの大きさは？ | 大きな | japanese | -5.183105 |


# Samples of QTC output

At this stage, two fields have been added.  `question_type_pred` is `boolean` if the question is predicted to be a boolean question, and `other` (the `short_answer` class) if it is not - typically a factoid question in this dataset.
The other field, `question_type_scores`, contains the classifier scores (logits) for each class.
The vast majority of questions in TydiQA are `short_answer`, so we present random examples chosen equally from those predicted `boolean` and those predicted `other`.

In [5]:
eval_examples=create_dataset_from_run_mrc_output(qtc_file, unpack=False)
english_eval_examples = eval_examples.filter(lambda x:x['language']=='english')
random_idxs = permutation(len(english_eval_examples))
cols=['example_id','question','question_type_pred', 'question_type_scores']
show_balanced_examples(english_eval_examples, random_idxs, 'question_type_pred', 5, 100, cols)


| | example_id | question | question_type_pred | question_type_scores |
|---|---|---|---|---|
| 510 | 6435f62e-c90f-4222-a0a2-10e673285bf6 | How many metropolitan areas does Route 64 pass through? | other | {'boolean': -3.2897751331329346, 'other': 3.0375118255615234} |
| 980 | 8aef6ed1-209c-49e1-94b8-a00b916b9e69 | What can we do with metal–organic frameworks? | other | {'boolean': -2.8831963539123535, 'other': 2.6908504962921143} |
| 932 | 38f0b151-1346-4072-80fb-d7bc03eedb3c | Is oiliness of the skin considered a disease? | boolean | {'boolean': 3.482534170150757, 'other': -3.5299534797668457} |
| 502 | fe4ddaa6-264c-4786-b3ba-257f19040957 | When did the term gypsy develop to describe Romanis? | other | {'boolean': -3.3366098403930664, 'other': 3.031615734100342} |
| 871 | cb702fcd-e348-4d20-b8fe-dc541ae05378 | What percentage of people experience relapse after recovering from addiction? | other | {'boolean': -2.444016695022583, 'other': 2.1554253101348877} |
| 28 | 2459bd3e-add5-49ea-a6e1-5a360d494bf4 | When did Jean Anna Fau die? | other | {'boolean': -3.435251474380493, 'other': 3.13974666595459} |
| 823 | 68ee1a17-9720-41da-b31f-757a78125001 | Does Mary Hoffman have kids? | boolean | {'boolean': 3.476332187652588, 'other': -3.555983066558838} |
| 611 | ebce63e0-78bd-4685-92f0-db5294d6cecb | Does Google autotranslate all pages? | boolean | {'boolean': 3.4442615509033203, 'other': -3.4900364875793457} |
| 579 | 044b1b49-906e-4b23-b354-8ca7be9881ef | Does Donna Troy have any superpowers? | boolean | {'boolean': 3.4369497299194336, 'other': -3.508683681488037} |
| 734 | f6b04fa0-e30e-4a26-a579-90eb0ea3f3b2 | Can salt marsh die-off be fixed? | boolean | {'boolean': 3.4441893100738525, 'other': -3.5055618286132812} |
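
Since `question_type_scores` holds logits, they can be turned into probabilities with a softmax if a more interpretable number is wanted. The sketch below does this for one row of the output above; the `softmax` helper is written here for illustration and is not part of PrimeQA.

```python
import math


def softmax(scores: dict) -> dict:
    """Convert a dict of class logits into a dict of class probabilities."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - top) for label, value in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Logits for "Is oiliness of the skin considered a disease?" from the table above.
scores = {'boolean': 3.482534170150757, 'other': -3.5299534797668457}
probs = softmax(scores)
print({label: round(p, 4) for label, p in probs.items()})
```

With a logit gap of about 7 between the two classes, almost all of the probability mass falls on `boolean`.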


# Samples of EVC output 
As above, this classifier adds two new fields.  `boolean_answer_pred` is `yes` if the predicted answer to a boolean question is positive/true/yes, and `no` if it is negative/false/no.  The field `boolean_answer_scores` provides the scores (logits) for each class.


For presentation purposes, we select the English questions from the dev set (they are not scored by `tydi_eval.py`), which have a higher fraction of boolean questions.  The boolean questions in the TydiQA dataset are overwhelmingly biased towards `yes` rather than `no` as the answer.  We suspect that the question writers were attempting to confirm existing knowledge.
Note that, for simplicity, the answer classifier (EVC) runs on all questions, including the short answer questions.  A real deployed system would run it only on questions that are predicted to be boolean.

In [6]:
eval_examples=create_dataset_from_run_mrc_output(evc_file, unpack=False)
english_boolean_eval_examples = eval_examples.filter(lambda x:x['language']=='english' and x['question_type_pred']=='boolean')
random_idxs = permutation(len(english_boolean_eval_examples))
cols=['example_id','question','passage_answer_text', 'boolean_answer_pred', 'boolean_answer_scores']
show_balanced_examples(english_boolean_eval_examples, random_idxs, 'boolean_answer_pred', 5, 300, cols)


| | example_id | question | passage_answer_text | boolean_answer_pred | boolean_answer_scores |
|---|---|---|---|---|---|
| 86 | 4aa91b84-9653-4019-8cc1-eb9580b7a654 | Does Frankfurt have a regional dish? | Handkäse (pronounced[ˈhantkɛːzə]; literally: "hand cheese") is a German regional sour milk cheese (similar to Harzer) and is a culinary speciality of Frankfurt am Main, Offenbach am Main, Darmstadt, Langen, and other parts of southern Hesse. It gets its name from the traditional way of producing it:... | yes | {'no': -5.016531944274902, 'yes': 4.2546706199646} |
| 83 | 2f229467-f0c9-4d9f-a919-9076467d0444 | Do fungus spread by spores? | The fungi produce asexual spores which disperse by wind, water or by insect vectors[9] spreading the infection.... | yes | {'no': -4.835036277770996, 'yes': 3.925503969192505} |
| 11 | 36af5968-9d7a-4139-a678-531f205db4d3 | Is Hungarian a romance language? | Additionally, the letter pairs ⟨ny⟩, ⟨ty⟩, and ⟨gy⟩ represent the palatal consonants /ɲ/, /c/, and /ɟ/ (a little like the "d+y" sounds in British "du</i>ke" or American "woul<i data-parsoid='{"dsr":[64312,64319,2,2]}'>d y</i>ou")—a bit like saying "d" with the tongue pointing to the palate.... | no | {'no': 4.103992938995361, 'yes': -3.6571133136749268} |
| 21 | 111824a0-a13a-4c23-b591-cbbf5eff4a9f | Is Thailand apart of China? | China – Thailand relations officially started in November 1975 after years of negotiations.[1][2] For a long time, Thailand, or in its former name, Siam, was a very strong and loyal Sinophilic country, and usually the Chinese issued Siam with a strong respect from China to ensure its alliance with t... | no | {'no': 2.0598034858703613, 'yes': -1.898944616317749} |
| 75 | 2e5f01f0-b22c-4afe-a568-4a9eb1e1c2da | Is communism the same as socialism? | In addition to this, the term communism (as well as socialism) is often used to refer to those political and economic systems and states dominated by a political, bureaucratic class, typically attached to one single Communist party that follow Marxist-Leninist doctrines and often claim to represent ... | yes | {'no': -4.715367794036865, 'yes': 3.6751482486724854} |
| 95 | 094443ee-0bb1-4d8b-98e1-540c69cea311 | Is steam still used in some trains? | From the early 1900s steam locomotives were gradually superseded by electric and diesel locomotives, with railways fully converting to electric and diesel power beginning in the late 1930s. The majority of steam locomotives were retired from regular service by the 1980s, though several continue to r... | no | {'no': 2.094043254852295, 'yes': -1.9507315158843994} |
| 85 | 631a9a40-746a-448e-8c73-31e5ee11510b | Did Emmylou Harris go to college? | Harris is from a career military family. Her father, Walter Harris (1921-1993),[2] was a Marine Corps officer, and her mother, Eugenia (1921-2014),[3] was a wartime military wife. Her father was reported missing in action in Korea in 1952 and spent ten months as a prisoner of war. Born in Birmingham... | yes | {'no': -4.775354862213135, 'yes': 4.165227890014648} |
| 70 | 44091813-f673-47b1-902f-6a96557221a7 | Can the central nervous system heal itself? | Nervous system injuries affect over 90,000 people every year.[2] It is estimated that spinal cord injuries alone affect 10,000 each year.[3] As a result of this high incidence of neurological injuries, nerve regeneration and repair, a subfield of neural tissue engineering, is becoming a rapidly grow... | yes | {'no': -4.795304298400879, 'yes': 3.9687118530273438} |
| 48 | aa6b0971-4aa1-4594-8006-b117b869e9a9 | Is the great horned owl endangered? | The great horned owl is not considered a globally threatened species by the IUCN.[1] Including the Magellanic species, there are approximately 5.3 million wild horned owls in the Americas.[7] Most mortality in modern times is human-related, caused by owls flying into man-made objects, including buil... | no | {'no': 4.286905288696289, 'yes': -3.9321670532226562} |
| 43 | d2424a99-1125-41a5-85bf-62c0309feee6 | Is the Mauser C96 produced today? | A version of the Mauser pistol with a full-sized grip, six-shot internal magazine, and a 120-millimetre (4.7in) barrel. Production was phased out by 1899.... | no | {'no': 4.3284735679626465, 'yes': -4.126214981079102} |
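
The remark that a deployed system would run the evidence classifier only on questions predicted to be boolean can be sketched as a simple gate. The field names match the intermediate files shown in this notebook, but `answer_boolean_if_needed` and `dummy_classifier` are hypothetical stand-ins, not PrimeQA functions.

```python
from typing import Callable, Dict, Optional


def answer_boolean_if_needed(
    example: Dict,
    evidence_classifier: Callable[[str, str], str],
) -> Optional[str]:
    """Run the evidence classifier only when the question was predicted boolean."""
    if example["question_type_pred"] != "boolean":
        return None  # short answer question: skip the classifier entirely
    return evidence_classifier(example["question"], example["passage_answer_text"])


def dummy_classifier(question: str, passage: str) -> str:
    """Hypothetical stand-in for the real model: says 'no' if the passage contains ' not '."""
    return "no" if " not " in passage else "yes"


boolean_example = {
    "question": "Is the great horned owl endangered?",
    "passage_answer_text": "The great horned owl is not considered a globally threatened species by the IUCN.",
    "question_type_pred": "boolean",
}
factoid_example = {
    "question": "When did Jean Anna Fau die?",
    "passage_answer_text": "",
    "question_type_pred": "other",
}

print(answer_boolean_if_needed(boolean_example, dummy_classifier))
print(answer_boolean_if_needed(factoid_example, dummy_classifier))
```

Gating this way saves classifier calls on the large majority of questions, at the cost of depending on the question type prediction being correct.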


# Final output

The final output file is in a format suitable for the tydi evaluation script and contains no textual information.  The `confidence_score` is normalized to `[0,1]` by the score normalizer, based on the confidence score of the original MRC output and the prediction of the question type classifier.

In [7]:
pd.read_json(out_file)

| | example_id | start_position | end_position | passage_index | yes_no_answer | confidence_score |
|---|---|---|---|---|---|---|
| 0 | 353a941a-b57a-4e36-8e55-0d5963694022 | 986 | 1020 | 2 | 0 | 0.916934 |
| 1 | ab4c6470-4fb9-4ff0-b439-0ca8ffdf7e60 | 371 | 388 | 1 | 0 | 0.101929 |
| 2 | 06b1a008-5c3f-4f59-8b6c-bfc1bd3e9b52 | 14805 | 14807 | 27 | 0 | 0.104343 |
| 3 | 8f4612a0-eae9-4f32-a4d2-4a2b6dc848c2 | 5993 | 6010 | 12 | 0 | 0.082423 |
| 4 | 0cf6e728-1925-49a1-80ff-4d19b3efb538 | 1 | 22 | 0 | 0 | 0.171585 |
| ... | ... | ... | ... | ... | ... | ... |
| 18665 | 417a5750-e4d3-4b2f-af60-278498883693 | 176 | 185 | 0 | 0 | 0.581892 |
| 18666 | d264bbbd-9dcf-4bb8-b467-fbba0d0b6fe0 | 183 | 192 | 0 | 0 | 0.064176 |
| 18667 | 2f7b0365-33e2-4a26-ba2c-0a1f087a4939 | 1483 | 1488 | 4 | 0 | 0.061807 |
| 18668 | 3141a8e2-979f-4fca-aca9-1fba6a6f90e4 | 723 | 731 | 3 | 0 | 0.061466 |
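
Before handing the file to the evaluation script, it can be useful to sanity-check it. The sketch below runs two checks on a few rows copied from the output above: that `confidence_score` lies in `[0,1]`, as stated, and that `start_position` does not exceed `end_position`, which is an assumption that holds for the rows shown. The `check_predictions` helper is hypothetical.

```python
import pandas as pd

REQUIRED_COLUMNS = {
    "example_id", "start_position", "end_position",
    "passage_index", "yes_no_answer", "confidence_score",
}


def check_predictions(df: pd.DataFrame) -> None:
    """Raise ValueError if the frame does not look like a normalized prediction file."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    if not df["confidence_score"].between(0.0, 1.0).all():
        raise ValueError("confidence_score outside [0, 1]")
    # Assumption: a span never ends before it starts.
    if not (df["start_position"] <= df["end_position"]).all():
        raise ValueError("start_position greater than end_position")


# The first three rows of the output shown above.
sample = pd.DataFrame([
    {"example_id": "353a941a-b57a-4e36-8e55-0d5963694022", "start_position": 986,
     "end_position": 1020, "passage_index": 2, "yes_no_answer": 0, "confidence_score": 0.916934},
    {"example_id": "ab4c6470-4fb9-4ff0-b439-0ca8ffdf7e60", "start_position": 371,
     "end_position": 388, "passage_index": 1, "yes_no_answer": 0, "confidence_score": 0.101929},
    {"example_id": "06b1a008-5c3f-4f59-8b6c-bfc1bd3e9b52", "start_position": 14805,
     "end_position": 14807, "passage_index": 27, "yes_no_answer": 0, "confidence_score": 0.104343},
])

check_predictions(sample)  # passes silently for these rows
```

On the real file, the same function could be applied to the result of `pd.read_json(out_file)`.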
