MRQA 2019 Shared Task on Generalization


The MRQA 2019 Shared Task focuses on generalization in question answering. An effective question answering system should do more than merely interpolate from the training set to answer test examples drawn from the same distribution: it should also be able to extrapolate to out-of-distribution examples — a significantly harder challenge.

The format of the task is extractive question answering. Given a question and context passage, systems must find the word or phrase in the document that best answers the question. While this format is somewhat restrictive, it allows us to leverage many existing datasets, and its simplicity helps us focus on out-of-domain generalization, instead of other important but orthogonal challenges.

We release an official training dataset containing examples from existing extractive QA datasets, and evaluate submitted models on ten hidden test datasets. Both train and test datasets have the same format described above, but may differ in some of the following ways:

  • Passage distribution: Test examples may involve passages from different sources (e.g., science, news, novels, medical abstracts, etc.) with pronounced syntactic and lexical differences.
  • Question distribution: Test examples may emphasize different styles of questions (e.g., entity-centric, relational, other tasks reformulated as QA, etc.), which may come from different sources (e.g., crowdworkers, domain experts, exam writers, etc.).
  • Joint distribution: Test examples may vary according to the relationship of the question to the passage (e.g., questions collected independently of the evidence vs. dependent on it, multi-hop, etc.).

Each participant will submit a single QA system trained on the provided training data. We will then privately evaluate each system on the hidden test data.

This repository contains resources for accessing the official training and development data. If you are interested in participating, please fill out this form! We will e-mail participants who sign up about any important announcements regarding the shared task.


Updated 7/12/2019 to correct for minor exact-match discrepancies (See #11 for details.)

Updated 6/13/2019 to correct for duplicate context in HotpotQA (See #7 for details.)

Updated 5/29/2019 to correct for truncated detected_answers field (See #5 for details.)

We have adapted several existing datasets from their original formats and settings to conform to our unified extractive setting. Most notably:

  • We provide only a single, length-limited context.
  • There are no unanswerable or non-span answer questions.
  • All questions have at least one accepted answer that is found exactly in the context.

A span is judged to be an exact match if it matches the answer string after performing normalization consistent with the SQuAD dataset. Specifically:

  • The text is uncased.
  • All punctuation is stripped.
  • All articles {a, an, the} are removed.
  • All consecutive whitespace markers are compressed to just a single normal space ' '.
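The four normalization steps above can be sketched in a few lines of Python. This is a minimal illustration modelled on the SQuAD convention, not the official evaluation code; the function names `normalize_answer` and `exact_match` are chosen here for illustration, and only ASCII punctuation is stripped.

```python
import re
import string

# ASCII punctuation only; this is an assumption of the sketch.
PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, drop articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """Two strings match exactly if their normalized forms are identical."""
    return normalize_answer(prediction) == normalize_answer(gold)


print(normalize_answer("The  Eiffel Tower!"))  # eiffel tower
print(exact_match("an apple.", "Apple"))       # True
```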

Training Data

| Dataset | Download | MD5SUM | Examples |
| SQuAD | Link | efd6a551d2697c20a694e933210489f8 | 86,588 |
| NewsQA | Link | 182f4e977b849cb1dbfb796030b91444 | 74,160 |
| TriviaQA | Link | e18f586152612a9358c22f5536bfd32a | 61,688 |
| SearchQA | Link | 612245315e6e7c4d8446e5fcc3dc1086 | 117,384 |
| HotpotQA | Link | d212c7b3fc949bd0dc47d124e8c34907 | 72,928 |
| NaturalQuestions | Link | e27d27bf7c49eb5ead43cef3f41de6be | 104,071 |
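The MD5SUM column can be used to check that a download completed without corruption. The following is a minimal sketch using only the Python standard library; the helper name `file_md5` and the temporary demo file are illustrative assumptions, not part of this repository.

```python
import hashlib
import os
import tempfile


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a small temporary file (stand-in for a downloaded dataset).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(demo_path, "wb") as f:
    f.write(b"hello")

demo_digest = file_md5(demo_path)
print(demo_digest)
```

For a real download, compare the returned digest with the value listed in the table for that dataset.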

Development Data


| Dataset | Download | MD5SUM | Examples |
| SQuAD | Link | 05f3f16c5c31ba8e46ff5fa80647ac46 | 10,507 |
| NewsQA | Link | 5c188c92a84ddffe2ab590ac7598bde2 | 4,212 |
| TriviaQA | Link | 5c9fdc633dfe196f1b428c81205fd82f | 7,785 |
| SearchQA | Link | 9217ad3f6925c384702f2a4e6d520c38 | 16,980 |
| HotpotQA | Link | 125a96846c830381a8acff110ff6bd84 | 5,904 |
| NaturalQuestions | Link | c0347eebbca02d10d1b07b9a64efe61d | 12,836 |

Note: This in-domain data may be used to help develop models. The final test set, however, will contain only out-of-domain data.


| Dataset | Download | MD5SUM | Examples |
| BioASQ | Link | 70752a39beb826a022ab21353cb66e54 | 1,504 |
| DROP | Link | 070eb2ac92d2b2fc1b99abeda97ac37a | 1,503 |
| DuoRC | Link | b325c0ad2fa10e699136561ee70c5ddd | 1,501 |
| RACE | Link | ba8063647955bbb3ba63e9b17d82e815 | 674 |
| RelationExtraction | Link | 266be75954fcb31b9dbfa9be7a61f088 | 2,948 |
| TextbookQA | Link | 8b52d21381d841f8985839ec41a6c7f7 | 1,503 |

Note: As previously mentioned, the out-of-domain datasets have been modified from their original settings to fit the unified MRQA Shared Task paradigm (see MRQA Format). Once again, at a high level, the following two major modifications have been made:

  1. All QA-context pairs are extractive. That is, the answer is selected from the context and not via, e.g., multiple-choice.
  2. All contexts are capped at a maximum of 800 tokens. As a result, for longer contexts like Wikipedia articles, we only consider examples where the answer appears in the first 800 tokens.

As a result, some splits are harder than the original datasets (e.g., removal of multiple-choice in RACE), while some are easier (e.g., restricted context length in NaturalQuestions, where we use the short answer selection). One should therefore expect different performance ranges when comparing to previous work on these datasets.

Auxiliary Data

For additional sources of training data, we are whitelisting some non-QA datasets that may be helpful for multi-task learning or pretraining. If you have any other dataset in mind, please raise an issue or send us an email.


  • SNLI
  • MultiNLI

Download Scripts

We have provided a convenience script to download all of the training and development data (that is released).

Please run:

./ path/to/store/downloaded/directory

To download the development data of the training datasets (in-domain), run:

./ path/to/store/downloaded/directory

To download the out-of-domain development data, run:

./ path/to/store/downloaded/directory

MRQA Format

All of the datasets for this task have been adapted to follow a unified format. They are stored as compressed JSONL files (with file extension .jsonl.gz).

The general format is:

  {
    "header": {
      "dataset": <dataset name>,
      "split": <train|dev|test>
    }
  }
  ...
  {
    "context": <context text>,
    "context_tokens": [(token_1, offset_1), ..., (token_l, offset_l)],
    "qas": [
      {
        "qid": <uuid>,
        "question": <question text>,
        "question_tokens": [(token_1, offset_1), ..., (token_q, offset_q)],
        "detected_answers": [
          {
            "text": <answer text>,
            "char_spans": [[<start_1, end_1>], ..., [<start_n, end_n>]],
            "token_spans": [[<start_1, end_1>], ..., [<start_n, end_n>]]
          },
          ...
        ],
        "answers": [<answer_text_1>, ..., <answer_text_m>]
      },
      ...
    ]
  }

Note that it is permissible to download the original datasets and use them as you wish. However, this is the format that the test data will be presented in.
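As an illustration of working with this format, here is a minimal reader sketch using only the Python standard library. It assumes that each line of the .jsonl.gz file holds one JSON object, that the object carrying the "header" key is separate from the per-context objects, and that the helper name `read_mrqa` is free to choose. The demo file is synthetic and invented for this example.

```python
import gzip
import json
import os
import tempfile


def read_mrqa(path):
    """Return (header, examples) from a gzip-compressed JSONL file."""
    header, examples = None, []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            obj = json.loads(line)
            if "header" in obj:
                header = obj["header"]
            else:
                examples.append(obj)
    return header, examples


# Tiny synthetic file, for illustration only.
demo_lines = [
    {"header": {"dataset": "Demo", "split": "dev"}},
    {"context": "The answer is 42.",
     "qas": [{"qid": "q1", "question": "What is the answer?", "answers": ["42"]}]},
]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    for obj in demo_lines:
        f.write(json.dumps(obj) + "\n")

header, examples = read_mrqa(demo_path)
print(header["dataset"], len(examples))
```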


  • context: This is the raw text of the supporting passage. Three special token types have been inserted: [TLE] precedes document titles, [DOC] denotes document breaks, and [PAR] denotes paragraph breaks. The maximum length of the context is 800 tokens.
  • context_tokens: A tokenized version of the supporting passage, using spaCy. Each token is a tuple of the token string and token character offset. The maximum number of tokens is 800.
  • qas: A list of questions for the given context.
  • qid: A unique identifier for the question. The qid is unique across all datasets.
  • question: The raw text of the question.
  • question_tokens: A tokenized version of the question. The tokenizer and token format are the same as for the context.
  • detected_answers: A list of answer spans for the given question that index into the context. For some datasets these spans have been automatically detected using searching heuristics. The same answer may appear multiple times in the text, and each of these occurrences is recorded. For example, if 42 is the answer, the context "The answer is 42. 42 is the answer." has two occurrences marked.
    • text: The raw text of the detected answer.
    • char_spans: Inclusive [start, end] character spans (indexing into the raw context).
    • token_spans: Inclusive [start, end] token spans (indexing into the tokenized context).
  • answers: All accepted answers to the question, whether or not there is an exact match in the given context.
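To make the inclusive span convention concrete, the sketch below locates every occurrence of an answer string and reports inclusive [start, end] character spans. Plain substring search is used here as a simplifying assumption; it stands in for the unspecified searching heuristics, and `find_char_spans` is a hypothetical helper name.

```python
def find_char_spans(context: str, answer: str):
    """Return inclusive [start, end] character spans for every occurrence."""
    spans = []
    start = context.find(answer)
    while start != -1:
        spans.append([start, start + len(answer) - 1])  # end index is inclusive
        start = context.find(answer, start + 1)
    return spans


context = "The answer is 42. 42 is the answer."
spans = find_char_spans(context, "42")
print(spans)  # [[14, 15], [18, 19]]

# Because the end index is inclusive, slicing needs end + 1.
for start, end in spans:
    print(context[start:end + 1])
```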


To view examples in the terminal, please install the requirements (pip install -r requirements.txt) and then run:

python path/or/url

The script argument may be either a URL or a local file path.



Answers are evaluated using exact match and token-level F1 metrics. The script is used to evaluate predictions on a given dataset:

python <url_or_filename> <predictions_file>

The predictions file must be a valid JSON file of qid, answer pairs:

  {
    "qid_1": "answer span text 1",
    ...
    "qid_n": "answer span text N"
  }
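A predictions file of this shape can be produced with the standard `json` module. The sketch below uses the placeholder qids from the example above and writes to a temporary path; both the path and the file name are arbitrary choices for illustration.

```python
import json
import os
import tempfile

# Map each qid to the predicted answer span text.
predictions = {
    "qid_1": "answer span text 1",
    "qid_n": "answer span text N",
}

out_path = os.path.join(tempfile.mkdtemp(), "predictions.json")
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)

# Reload to confirm the file is valid JSON.
with open(out_path, encoding="utf-8") as f:
    reloaded = json.load(f)
print(len(reloaded))
```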

The final score for the MRQA shared task will be the macro-average across all test datasets.
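For intuition, token-level F1 and the macro-average can be sketched as follows. This is an illustrative approximation rather than the official evaluation script: the normalization follows the SQuAD-style steps described earlier, taking the maximum over accepted answers is an assumption borrowed from the SQuAD convention, and all function names are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip ASCII punctuation, drop articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction, answers):
    """Score against every accepted answer and keep the best (assumed)."""
    return max(token_f1(prediction, answer) for answer in answers)


def macro_average(per_dataset_scores):
    """Unweighted mean over datasets, regardless of dataset size."""
    return sum(per_dataset_scores.values()) / len(per_dataset_scores)


print(round(token_f1("the Eiffel Tower in Paris", "Eiffel Tower"), 4))  # 0.6667
print(macro_average({"A": 50.0, "B": 70.0}))  # 60.0
```

Because the macro-average weights every dataset equally, a small test set influences the final score as much as a large one.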

Baseline Model

An implementation of a simple multi-task BERT-based baseline model is available in the baseline directory.

Below are our baseline results (I = in-domain, O = out-of-domain):

| Dataset | Multi-Task BERT-Base (EM / F1) | Multi-Task BERT-Large (EM / F1) |
| (I) SQuAD | 78.5 / 86.7 | 80.3 / 88.4 |
| (I) HotpotQA | 59.8 / 76.6 | 62.4 / 79.0 |
| (I) TriviaQA Web | 65.6 / 71.6 | 68.2 / 74.7 |
| (I) NewsQA | 50.8 / 66.8 | 49.6 / 66.3 |
| (I) SearchQA | 69.5 / 76.7 | 71.8 / 79.0 |
| (I) NaturalQuestions | 65.4 / 77.4 | 67.9 / 79.8 |
| (O) DROP | 25.7 / 34.5 | 34.6 / 43.8 |
| (O) RACE | 30.4 / 41.4 | 31.3 / 42.5 |
| (O) BioASQ | 47.1 / 62.7 | 51.9 / 66.8 |
| (O) TextbookQA | 44.9 / 53.9 | 47.4 / 55.7 |
| (O) RelationExtraction | 72.6 / 83.8 | 72.7 / 85.2 |
| (O) DuoRC | 44.8 / 54.6 | 46.8 / 58.0 |


Submission will be handled through the Codalab platform: see these instructions.

Note that submissions should start a local server that accepts POST requests of single JSON objects in our standard format and returns a JSON prediction object. The official script (in this directory) will query this server to get predictions. The baseline directory includes an example implementation. We have chosen this format so that we can create interactive demos for all submitted models.
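The request and response cycle can be illustrated with a toy server built on the Python standard library. This is a sketch under stated assumptions, not the baseline implementation: the route ("/"), the response shape (a qid-to-answer mapping mirroring the predictions file), and the placeholder `predict` function are all invented here. Consult the submission instructions for the actual requirements.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(example):
    """Placeholder model: answers every question with the first context word."""
    words = example["context"].split()
    first_word = words[0] if words else ""
    return {qa["qid"]: first_word for qa in example["qas"]}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read one JSON object from the request body and reply with predictions.
        length = int(self.headers.get("Content-Length", 0))
        example = json.loads(self.rfile.read(length))
        body = json.dumps(predict(example)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


server = HTTPServer(("127.0.0.1", 0), PredictHandler)  # port 0 picks a free port
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: POST a single example and read the prediction object back.
example = {"context": "Paris is the capital of France.",
           "qas": [{"qid": "q1", "question": "What is the capital of France?"}]}
conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
conn.request("POST", "/", body=json.dumps(example).encode("utf-8"),
             headers={"Content-Type": "application/json"})
result = json.loads(conn.getresponse().read())
conn.close()
server.shutdown()
server.server_close()
print(result)
```

A real submission would replace `predict` with a call into the trained model.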


Codalab results for all models submitted to the shared task are available in the results directory. These files include the dev and test EM and F1 scores for every model and every dataset.


    title={{MRQA} 2019 Shared Task: Evaluating Generalization in Reading Comprehension},
    author={Adam Fisch and Alon Talmor and Robin Jia and Minjoon Seo and Eunsol Choi and Danqi Chen},
    booktitle={Proceedings of the 2nd Machine Reading for Question Answering (MRQA) Workshop at EMNLP},