# TydiQA - support for boolean questions

Here we assume that you have used `run_mrc.py` with the `--do_boolean` option to decode the TydiQA dataset with full support for boolean questions.  See the top-level `README.md`. The process has four stages:

1. MRC (**M**achine **R**eading **C**omprehension) - given a question and a document, find a representative span that may contain a short answer.  This is analyzed in detail in `tydiqa.ipynb`.
2. QTC (**Q**uestion **T**ype **C**lassification) - given the question, decide if it is `boolean` or `short_answer`
3. EVC (**Ev**idence **C**lassifier) - given a question and a short answer span, decide whether the span supports `yes` or `no`.  This is analyzed in more detail in `evc.ipynb`.
4. SN (**S**core **N**ormalization) - span scores may have different dynamic ranges depending on whether the question is `boolean` or `short_answer`.  Normalize them uniformly to $[0,1]$.

In this notebook, we show what happens internally at each stage by examining intermediate files from an experiment.

# Intermediate files

We load some output and intermediate files from a recent command-line experiment, which was run with the following command:
```
python examples/mrc/run_mrc.py --model_name_or_path PrimeQA/tydiqa-primary-task-xlm-roberta-large \
       --output_dir ${OUTPUT_DIR} --fp16 --do_eval \
       --per_device_eval_batch_size 128 --overwrite_output_dir \
       --postprocessor primeqa.boolqa.processors.postprocessors.extractive.ExtractivePipelinePostProcessor \
       --do_boolean --boolean_config examples/boolqa/tydi_boolqa_config.json
```

In [2]:
mrc_file=f'{base}/eval_predictions.json'
qtc_file=f'{base}/qtc/eval_predictions.json'
evc_file=f'{base}/evc/eval_predictions.json'
out_file=f'{base}/sn/eval_predictions_processed.json'

# Display helper

Our intermediate files have many fields. To display them more readably, we use a helper routine that converts them to dataframes.

In [3]:
from primeqa.boolqa.processors.dataset.mrc2dataset import create_dataset_from_run_mrc_output

from numpy.random import permutation
import pandas as pd
from IPython.display import display, HTML

# Based on https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb
def show_balanced_examples(dataset, perm, groups, nrows, maxchars, cols):
    """Display up to `nrows` shuffled examples per group as an HTML table."""
    df = pd.DataFrame(dataset)
    dfp = df.iloc[perm]  # shuffle rows using the given permutation
    dfg = dfp.groupby(groups)
    # Copy so that the column assignment below does not write into a view
    df_todisplay = dfg.head(nrows)[cols].copy()
    if 'passage_answer_text' in cols:
        # Truncate long passages to `maxchars` characters for readability
        df_todisplay['passage_answer_text'] = df_todisplay['passage_answer_text'].str.slice(0, maxchars) + '...'
    display(HTML(df_todisplay.to_html()))

# Samples of MRC output

Here we show the `question` and the predicted answer `span_answer_text` for random examples (one from each language).  This is the initial stage of question answering - a purely extractive system.  The confidence in the span answer is given by `span_answer_score`, which is a function of various other logits available in the file.

In [4]:
eval_examples=create_dataset_from_run_mrc_output(mrc_file, unpack=False)
random_idxs = permutation(len(eval_examples))

cols=['example_id','question','span_answer_text','language', 'span_answer_score']
show_balanced_examples(eval_examples, random_idxs, 'language', 1, 100, cols)

Unnamed: 0,example_id,question,span_answer_text,language,span_answer_score
2991,40d6b689-2367-400b-8696-7b4699319250,Onko seminoleilla oma kieli?,Kieltä puhuvat creekit ja seminolit Oklahomassa ja Floridassa,finnish,-9.770508
103,51b64c30-b88b-4df2-a031-4aea7767cc2b,2018년 한해동안 가장 많은 강수량을 기록한 나라는 어디인가?,대한민국,korean,0.393555
13250,b8ba1fe6-4151-48c6-b01e-9601745f47cd,mata uang apakah yang digunakan di brazil?,Real Brasil,indonesian,6.929688
1950,63da4d5b-c749-4ea4-983b-66175aa4fffd,পশ্চিমবঙ্গের মুর্শিদাবাদ জেলার সদর শহর কোনটি ?,বহরমপুর,bengali,6.760742
3482,d492809f-ee82-499a-83c5-47978f4f2384,2014 వరకి విజయవాడలో అతిపెద్ద కట్టడం ఏది?,నల్లూరి వెంకటేశ్వర్లు,telugu,-7.500366
3465,ef360d58-694c-49b3-b54a-a8751e4caa8d,ローリング・ストーンズはいつ結成した？,1962年4月の,japanese,9.82666
16352,a407bfdb-4209-472f-9023-1d2bde833319,Где разворачивается действие игры Dreamfall Chapters?,"двух параллельных мирах — Старке, антиутопичном будущем Земли, и Аркадии, его магическом мире-двойнике",russian,7.37207
8929,db464794-b7c0-4659-a898-6c3b26b70caa,อีดี อามิน มีภรรยาชื่อว่าอะไร?,มาดินา อามิน,thai,4.783203
2712,9136366d-4825-42bd-bd49-c4cd3f75d673,في أي عام تولى عبد الفتاح السيسي الحكم في مصر؟,2014,arabic,6.96582
14054,2cc55a29-83a6-4224-bfe6-2cde22c5c2ee,Kisiwa cha Pemba kina mitaa mingapi?,980 km².,swahili,0.041016


# Samples of QTC output

At this stage, two fields have been added. The first, `question_type_pred`, is `boolean` if the question is predicted to be a boolean question and `short_answer` otherwise - typically a factoid question in this dataset.
The second, `question_type_scores`, contains the classifier scores (logits) for each class.
The vast majority of questions in TydiQA are `short_answer`, so we present random examples chosen equally from those predicted `boolean` and those predicted `short_answer`.
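Because `question_type_scores` holds raw logits, it can help to convert them to probabilities when reading the table. The following is a minimal sketch, not PrimeQA code: the `softmax` helper is our own illustrative function, and the logits are copied from the "Is Windex poisonous?" row in this section's output.

```python
import math

def softmax(scores):
    """Convert a dict of class logits into a dict of probabilities."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - m) for label, v in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Logits copied from the "Is Windex poisonous?" example
question_type_scores = {'boolean': 3.414430618286133,
                        'short_answer': -4.332897186279297}

probs = softmax(question_type_scores)
# The predicted type is the class with the highest score
question_type_pred = max(probs, key=probs.get)
print(question_type_pred, probs)
```

With logits this far apart, the classifier is nearly certain of its prediction, which matches the very similar score pairs seen across the sampled rows.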

In [5]:
eval_examples=create_dataset_from_run_mrc_output(qtc_file, unpack=False)
english_eval_examples = eval_examples.filter(lambda x:x['language']=='english')
random_idxs = permutation(len(english_eval_examples))
cols=['example_id','question','question_type_pred', 'question_type_scores']
show_balanced_examples(english_eval_examples, random_idxs, 'question_type_pred', 5, 100, cols)


Unnamed: 0,example_id,question,question_type_pred,question_type_scores
524,4fb782e6-fc06-41f0-a511-2f67ed428ff3,How large is the German military today?,short_answer,"{'boolean': -2.996520519256592, 'short_answer': 3.7801873683929443}"
123,e1b05e1b-de67-46af-8393-f996ebcecf4e,What is the largest rail yard in the New York City Subway system?,short_answer,"{'boolean': -2.9972739219665527, 'short_answer': 3.7812328338623047}"
131,1478b75c-7f5f-411d-9bc5-99c009ae15a4,On what network did Space: 1999 originally air?,short_answer,"{'boolean': -2.997262477874756, 'short_answer': 3.7814853191375732}"
236,548f4820-75a8-4909-93b8-5c88c3f97795,What is the most watched show on Oklahoma?,short_answer,"{'boolean': -2.9975571632385254, 'short_answer': 3.7821381092071533}"
279,0d928e95-3286-47ff-8c2c-9018f84a86ce,How many countries are in Africa?,short_answer,"{'boolean': -2.9956188201904297, 'short_answer': 3.778301477432251}"
466,78ef1304-b481-4de9-966a-5635f3c6dd10,Is Windex poisonous?,boolean,"{'boolean': 3.414430618286133, 'short_answer': -4.332897186279297}"
69,bbb56373-e84e-4fde-8858-be612b8f5d2c,Is Cantonese written the same as Mandarin?,boolean,"{'boolean': 3.4161205291748047, 'short_answer': -4.3320770263671875}"
505,398f7acb-3508-4b19-8440-8a685799eade,Did Dilophosaurus have feathers?,boolean,"{'boolean': 3.4151928424835205, 'short_answer': -4.333571910858154}"
33,24970674-4111-4cef-97b4-9de085794256,Is there a term limit to the Russian presidency?,boolean,"{'boolean': 3.417029619216919, 'short_answer': -4.33302640914917}"
643,2c73eb36-c0e2-4c1c-8ebe-27d61d0ae713,Was Slovenia ever under Roman rule?,boolean,"{'boolean': 3.417308807373047, 'short_answer': -4.332035541534424}"


# Samples of EVC output 

As above, this classifier adds two new fields.  `boolean_answer_pred` is `yes` if the predicted answer to a boolean question is positive/true/yes, `no` if the answer is negative/false/no, and `no_answer` if there is no support for either answer in the context.  The field `boolean_answer_scores` provides the scores (logits) for each class.
For the TydiQA evaluation, we discard the `no_answer` prediction and always predict `yes` or `no`.  Other applications may choose a different behavior.
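One reading of "discard the `no_answer` prediction" that is consistent with the sample output in this section is to take the higher-scoring of `yes` and `no` and ignore the `no_answer` logit. The sketch below illustrates that reading only. `decide_yes_no` is a hypothetical helper, not a PrimeQA function, and the actual implementation may differ.

```python
def decide_yes_no(boolean_answer_scores):
    """Pick 'yes' or 'no' by score, ignoring the 'no_answer' class.

    Hypothetical helper for illustration; not part of PrimeQA.
    """
    candidates = {label: score
                  for label, score in boolean_answer_scores.items()
                  if label != 'no_answer'}
    return max(candidates, key=candidates.get)

# Scores copied from the "Is oiliness of the skin considered a disease?" example,
# where 'no_answer' has the highest logit of the three classes
scores = {'no': -5.162100791931152,
          'no_answer': 6.438989162445068,
          'yes': -0.6098946928977966}

print(decide_yes_no(scores))  # prints: yes
```

An application that prefers to abstain could instead return `no_answer` whenever that class has the highest score.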

For presentation purposes, we select the English questions from the dev set (they are not scored by `tydi_eval.py`), which have a higher fraction of boolean questions.  The boolean questions in the TydiQA dataset are overwhelmingly biased towards `yes` rather than `no` as the answer.  We suspect that the question writers were attempting to confirm existing knowledge.
Note that, for simplicity, the answer classifier runs on all questions, including the short answer questions.  A real deployed system would run the answer classifier only on questions that are predicted to be boolean.

In [6]:
eval_examples=create_dataset_from_run_mrc_output(evc_file, unpack=False)
english_boolean_eval_examples = eval_examples.filter(lambda x:x['language']=='english' and x['question_type_pred']=='boolean')
random_idxs = permutation(len(english_boolean_eval_examples))
cols=['example_id','question','passage_answer_text', 'boolean_answer_pred', 'boolean_answer_scores']
show_balanced_examples(english_boolean_eval_examples, random_idxs, 'boolean_answer_pred', 5, 300, cols)


Unnamed: 0,example_id,question,passage_answer_text,boolean_answer_pred,boolean_answer_scores
96,9cd50fd0-eeb1-4146-935b-438caceccb19,Is oiliness of the skin considered a disease?,"Sebaceous glands are microscopic exocrine glands in the skin that secrete an oily or waxy matter, called sebum, to lubricate and waterproof the skin and hair of mammals. In humans, they occur in the greatest number on the face and scalp, but also on all parts of the skin except the palms of the hand...",yes,"{'no': -5.162100791931152, 'no_answer': 6.438989162445068, 'yes': -0.6098946928977966}"
92,c463b465-d068-4fa1-80b9-819683217594,Does Kenshin take place during the Meiji era?,"Rurouni Kenshin: Meiji Swordsman Romantic Story(Japanese:るろうに剣心 -明治剣客浪漫譚-,Hepburn:Rurōni Kenshin -Meiji Kenkaku Romantan-),[lower-alpha 1] also known as Samurai X, is a Japanese manga series written and illustrated by Nobuhiro Watsuki. The story begins during the 11th year of the Meiji period in Jap...",yes,"{'no': -4.352438449859619, 'no_answer': -2.97670578956604, 'yes': 6.797365665435791}"
15,0ed01f94-b8c7-43ad-bf68-d861da8bbb11,Does Bob Jones University do any research?,"The school offers undergraduate majors in biology (zoo and wildlife, and cell biology[48]), premed/predent, chemistry, engineering, and physics and also offers courses in astronomy. Between 80% and 100% of the pre-med graduates are accepted to medical school every year.[49] The Department of Biology...",yes,"{'no': -6.790474891662598, 'no_answer': 2.862802743911743, 'yes': 3.7170300483703613}"
43,14e03c4b-effd-459f-aea9-8da1cf936cbd,Did American Epic play in theaters?,"The film was previewed as a work in progress at film festivals around the world throughout 2016, including a Special Event at Sundance hosted by Robert Redford,[55] SXSW,[56] International Documentary Film Festival Amsterdam,[57] Denver International Film Festival,[58] Sydney Film Festival,[59] and ...",yes,"{'no': -5.135706901550293, 'no_answer': 1.3839850425720215, 'yes': 3.9040300846099854}"
73,2254035a-f43d-48e9-9eae-ef1c1e626f96,Can salt marsh die-off be fixed?,"Research on the salt marsh snail Littoraria irrorata and its effects on marsh plant productivity, have provided strong evidence of consumer control in marshes triggered by overexploitation. This snail is capable of turning strands of cordgrass (Spartina alterniflora) (>2.5m tall) into mudflats withi...",yes,"{'no': -5.444829940795898, 'no_answer': 6.888958930969238, 'yes': -1.1627817153930664}"
42,e215f384-0afa-4477-b34a-d5f75ac8a467,Is the Mauser C96 produced today?,"The Spanish gunmaker Astra-Unceta y Cia began producing a copy of the Mauser C.96 in 1927 that was externally similar to the C96 (including the presence of a detachable shoulder stock/holster) but with non-interlocking internal parts. It was produced until 1941, with a production hiatus in 1937 and ...",no,"{'no': 3.608994245529175, 'no_answer': 0.8599523901939392, 'yes': -4.0795159339904785}"
10,bbb56373-e84e-4fde-8858-be612b8f5d2c,Is Cantonese written the same as Mandarin?,"General estimates of vocabulary differences between Cantonese and Mandarin range from 30 to 50 percent. Donald B. Snow, the author of Cantonese as Written Language: The Growth of a Written Chinese Vernacular, wrote that ""It is difficult to quantify precisely how different"" the two vocabularies are.[...",no,"{'no': 6.1006059646606445, 'no_answer': -1.4011447429656982, 'yes': -4.083403587341309}"
44,1a47aa43-2afb-4d16-a472-d2a9e71f3195,Does the KGB still exist?,"The KGB (Russian:Комите́т Госуда́рственной Безопа́сности (КГБ), tr.Komitet Gosudarstvennoy Bezopasnosti,IPA:[kəmʲɪˈtʲet ɡəsʊˈdarstvʲɪnːəj bʲɪzɐˈpasnəsʲtʲɪ](listen)), translated in English as Committee for State Security, was the main security agency for the Soviet Union from 1954 until its break-up ...",no,"{'no': 6.468663692474365, 'no_answer': -2.3071203231811523, 'yes': -3.554609775543213}"
24,64caf47e-c348-4b2c-a90a-45aae4930784,Were there ever any WMD in Iraq?,"In January 2003, United Nations weapons inspectors reported that they had found no indication that Iraq possessed nuclear weapons or an active program. Some former UNSCOM inspectors disagree about whether the United States could know for certain whether or not Iraq had renewed production of weapons ...",no,"{'no': -1.4815336465835571, 'no_answer': 4.486413478851318, 'yes': -2.3255317211151123}"
51,8de36edd-c6fe-4d29-90e5-c0b3cd605a30,Are any U.S. battleships still active?,"When the last Iowa-class ship was finally stricken from the Naval Vessel Registry, no battleships remained in service or in reserve with any navy worldwide. A number are preserved as museum ships, either afloat or in drydock. The U.S. has eight battleships on display: Massachusetts, North Carolina, ...",no,"{'no': 6.306628227233887, 'no_answer': -1.8138644695281982, 'yes': -3.84908127784729}"


# Final output

The final output file is in a format suitable for the TydiQA evaluation script and contains no textual information.  The `confidence_score` is normalized to `[0,1]` by the score normalizer, based on the confidence score of the original MRC output and the prediction of the question type classifier.
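This notebook does not show the normalizer's formula. To make the idea concrete, here is a minimal sketch of one way a score could be mapped to `[0,1]` differently for each question type: a logistic function with a separate shift and scale per type. Everything in it is an assumption for illustration: `normalize_score`, `HYPOTHETICAL_PARAMS`, and the parameter values are invented and are not taken from PrimeQA.

```python
import math

# Hypothetical (shift, scale) per question type; NOT taken from PrimeQA.
HYPOTHETICAL_PARAMS = {
    'boolean': (-5.0, 3.0),
    'short_answer': (5.0, 2.0),
}

def normalize_score(span_answer_score, question_type_pred,
                    params=HYPOTHETICAL_PARAMS):
    """Map a raw span score to [0, 1] with a per-type logistic function."""
    shift, scale = params[question_type_pred]
    z = (span_answer_score - shift) / scale
    # Numerically stable logistic: avoids overflow for large |z|
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# The same raw score yields different confidences for different question types
print(normalize_score(-9.770508, 'boolean'))
print(normalize_score(-9.770508, 'short_answer'))
```

The point of the per-type parameters is that a raw score that is low for a `short_answer` question may still be a reasonable score for a `boolean` one, so a single global mapping would not be comparable across types.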

In [7]:
pd.read_json(out_file)

Unnamed: 0,example_id,start_position,end_position,passage_index,yes_no_answer,confidence_score
0,740c33ae-78b9-40e5-8444-b6c8d2776776,986,1020,2,0,0.697498
1,e7541e40-46ec-494c-b0d6-a9c435568f2b,385,388,1,0,0.044546
2,869198a8-fc4d-43b4-bed7-24a61c17d8ab,14805,14807,27,0,0.099508
3,308f64d3-2794-410c-b2d5-10472b7e6661,6703,6715,13,0,0.083534
4,87ade38f-9ae7-4a98-8558-335772c33843,2539,2544,6,0,0.081707
...,...,...,...,...,...,...
18665,6504cb42-77d4-4a7d-a8bd-00d7f2df8994,1,12,0,0,0.213102
18666,704baa32-153b-45fb-9ef2-ca9f2e7265fd,5770,5777,16,0,0.034656
18667,5383daad-3b82-4e34-ad04-0c08a3ff86f7,1483,1541,4,0,0.172497
18668,a737ab02-a0a0-4b09-9ea8-ea348d19e212,61,69,0,0,0.120389
