Overview paper: https://trec.nist.gov/pubs/trec30/papers/Overview-DL.pdf
Note: You are viewing the guidelines for the 2021 edition of the TREC Deep Learning track. Please visit https://microsoft.github.io/msmarco/TREC-Deep-Learning for the upcoming / latest edition of the track.
Our main focus in 2021 is to get started on using a new, larger, cleaner corpus, which unifies the passage and document datasets. The new document dataset has been available since early July 2021, and the passage dataset was released in mid July 2021.
This leaves participants less than a month before the submission deadline of August 9th. We hope the community can come together and submit runs by:

1. Submitting standard approaches and baselines from TREC 2020, to see how these perform on the new datasets.
2. Implementing newer approaches that you are working on now or have developed since TREC 2020.
3. Trying hybrid approaches that are enabled by the new document and passage corpus with its passage-document mapping. For example, a passage task run could start with full ranking on the document corpus, then identify candidate passages from the top documents, leading into passage reranking.
- 2019 website and overview paper
- 2020 website and overview paper
- August 9: Deadline for submitting runs for document and passage ranking tasks
- November 17-19: TREC conference
To participate in TREC please pre-register at the following website: https://ir.nist.gov/trecsubmit.open/application.html
The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).
Certain machine learning based methods, such as methods based on deep learning, are known to require very large datasets for training. The lack of such large-scale datasets has been a limitation in developing these methods for common information retrieval tasks, such as document ranking. The Deep Learning Track, organized in 2019 and 2020, aimed to provide large-scale datasets to TREC and to create a focused research effort with a rigorous blind evaluation of rankers for the passage ranking and document ranking tasks.
In 2021, the track will continue to have the same tasks (document ranking and passage ranking) and goals. Similar to previous years, one of the main goals of the track in 2021 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?
The Deep Learning Track has two tasks: Passage ranking and document ranking; and two subtasks in each case: full ranking and reranking. You can submit up to three runs for each of the subtasks.
Each task uses a large human-generated set of training labels, from the MS MARCO dataset. The two tasks use the same test queries. They also use the same form of training data with usually one positive training document/passage per training query. In the case of passage ranking, there is a direct human label that says the passage can be used to answer the query, whereas for training the document ranking task we infer document-level labels from the passage-level labels.
For both tasks, the participants are encouraged to study the efficacy of transfer learning methods. Our current training labels (from MS MARCO) are generated differently than the test labels (generated by NIST), although some labels from past years (mapped to the new corpus) may also be available. Participants can (and are encouraged to) also use external corpora for large scale language model pretraining, or adapt algorithms built for one task of the track (e.g. passage ranking) to the other task (e.g. document ranking). This allows participants to study a variety of transfer learning strategies.
The two tasks are described in more detail below.
The first task focuses on document ranking. We have two subtasks related to this: Full ranking and top-100 reranking.
In the full ranking (retrieval) subtask, you are expected to rank documents based on their relevance to the question, where documents can be retrieved from the full document collection provided. You can submit up to 100 documents for this task. It models a scenario where you are building an end-to-end retrieval system.
In the reranking subtask, we provide you with an initial ranking of 100 documents from a simple IR system, and you are expected to rerank the documents in terms of their relevance to the question. This is a very common real-world scenario, since many end-to-end systems are implemented as retrieval followed by top-k reranking. The reranking subtask allows participants to focus on reranking only, without needing to implement an end-to-end system. It also makes those reranking runs more comparable, because they all start from the same set of 100 candidates.
Similar to the document ranking task, the passage ranking task also has full ranking and reranking subtasks.
In the full ranking (retrieval) subtask, given a question, you are expected to rank passages from the full collection in terms of their likelihood of containing an answer to the question. You can submit up to 100 passages for this end-to-end retrieval task.
In the top-100 reranking subtask, we provide you with an initial ranking of 100 passages and you are expected to rerank these passages based on their likelihood of containing an answer to the question. In this subtask, we can compare different reranking methods based on the same initial set of 100 candidates, with the same rationale as described for the document reranking subtask.
Since the main asset in MS MARCO is the training data, and we do not have any new training data, the main purpose of this data release is to make the document/passage data larger, cleaner and more realistic. Some notes:
- Documents now have fewer problems with missing whitespace and character encoding. They are laid out in a way that is easier for relevance judges to read. These cleaner documents are also more amenable to document processing such as answer extraction.
- The document dataset is 3.7 times larger than the old document dataset. The passage dataset is 15.6 times larger than the old passage dataset.
- The old dataset had 2.8 passages per document, and the passages were selected in a way that reveals information about our train-dev-eval queries. With the old dataset, we did not release a passage-document mapping and asked participants not to generate such a mapping.
- The new dataset has 11.6 passages per document, selected using an algorithm that identifies the most promising passage candidates in a query-independent fashion.
- The new dataset has a known passage-document mapping, encouraging participants to consider how passage information may be used in document ranking and document information may be used in passage ranking.
- The release of larger, cleaner and more realistic data can form the basis of future tasks and MS MARCO leaderboard refreshes, but their first use is in TREC 2021.
To download large files more quickly and reliably use AzCopy (see instructions).
```
azcopy copy https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco_v2_doc.tar msmarco_v2_doc.tar
```
We also saw a suggestion for speeding up downloads without azcopy:
```
wget --header "X-Ms-Version: 2019-12-12" https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco_v2_doc.tar
```
Type | Filename | File size | Num Records | Format |
---|---|---|---|---|
Corpus | msmarco_v2_doc.tar | 32.3 GB | 11,959,635 | tar of 60 gzipped jsonl files |
Train | docv2_train_queries.tsv | 12.9 MB | 322,196 | tsv: qid, query |
Train | docv2_train_top100.txt.gz | 404.5 MB | 32,218,809 | TREC submission: qid, "Q0", docid, rank, score, runstring |
Train | docv2_train_qrels.tsv | 11.9 MB | 331,956 | TREC qrels format |
Dev 1 | docv2_dev_queries.tsv | 187.5 KB | 4,552 | tsv: qid, query |
Dev 1 | docv2_dev_top100.txt.gz | 5.6 MB | 455,200 | TREC submission: qid, "Q0", docid, rank, score, runstring |
Dev 1 | docv2_dev_qrels.tsv | 173.4 KB | 4,702 | TREC qrels format |
Dev 2 | docv2_dev2_queries.tsv | 205.0 KB | 5,000 | tsv: qid, query |
Dev 2 | docv2_dev2_top100.txt.gz | 6.1 MB | 500,000 | TREC submission: qid, "Q0", docid, rank, score, runstring |
Dev 2 | docv2_dev2_qrels.tsv | 190.9 KB | 5,178 | TREC qrels format |
Validation 1 (TREC test 2019) | msmarco-test2019-queries.tsv.gz | 4.2 KB | 200 | tsv: qid, query |
Validation 1 (TREC test 2019) | | KB | | TREC submission: qid, "Q0", docid, rank, score, runstring |
Validation 1 (TREC test 2019) | docv2_trec2019_qrels.txt.gz | 105 KB | 13,940 | qid, "Q0", docid, rating |
Validation 2 (TREC test 2020) | msmarco-test2020-queries.tsv.gz | 8.2 KB | 200 | tsv: qid, query |
Validation 2 (TREC test 2020) | | KB | | TREC submission: qid, "Q0", docid, rank, score, runstring |
Validation 2 (TREC test 2020) | docv2_trec2020_qrels.txt.gz | 60.9 KB | 7,942 | qid, "Q0", docid, rating |
Test (TREC test 2021) | 2021_queries.tsv | 24.0 KB | 477 | tsv: qid, query |
Test (TREC test 2021) | 2021_document_top100.txt.gz | 603.7 KB | 47,700 | TREC submission: qid, "Q0", docid, rank, score, runstring |
Test | 2021.qrels.docs.final.txt | 468 KB | 13,058 | qid, "Q0", docid, rating |
The document corpus is in jsonl format. Each document has:
- docid: The document identifier encodes the filename and starting position of the document's jsonl line in the corpus. For example, `msmarco_doc_31_726131` is in the file `msmarco_v2_doc/msmarco_doc_31` at position `726131`.
- url: The URL of the document
- title: The title of the document
- headings: A newline-separated list of headings that were extracted from the document, where the first heading may be a generated heading that describes the whole document (an alternate title).
- body: The body text of the document
If you unzip the corpus, you can quickly access a document using:
```python
import json

def get_document(document_id):
    (string1, string2, bundlenum, position) = document_id.split('_')
    assert string1 == 'msmarco' and string2 == 'doc'

    with open(f'./msmarco_v2_doc/msmarco_doc_{bundlenum}', 'rt', encoding='utf8') as in_fh:
        in_fh.seek(int(position))
        json_string = in_fh.readline()
        document = json.loads(json_string)
        assert document['docid'] == document_id
        return document

document = get_document('msmarco_doc_31_726131')
print(document.keys())
```
Producing output:
```
dict_keys(['url', 'title', 'headings', 'body', 'docid'])
```
Type | Filename | File size | Num Records | Format |
---|---|---|---|---|
Corpus | msmarco_v2_passage.tar | 20.3 GB | 138,364,198 | tar of 70 gzipped jsonl files |
Train | passv2_train_queries.tsv | 11.1 MB | 277,144 | tsv: qid, query |
Train | passv2_train_top100.txt.gz | 324.9 MB | 27,713,673 | TREC submission: qid, "Q0", docid, rank, score, runstring |
Train | passv2_train_qrels.tsv | 11.1 MB | 287,889 | TREC qrels format |
Dev 1 | passv2_dev_queries.tsv | 160.7 KB | 3,903 | tsv: qid, query |
Dev 1 | passv2_dev_top100.txt.gz | 4.7 MB | 390,300 | TREC submission: qid, "Q0", docid, rank, score, runstring |
Dev 1 | passv2_dev_qrels.tsv | 161.2 KB | 4,074 | TREC qrels format |
Dev 2 | passv2_dev2_queries.tsv | 175.4 KB | 4,281 | tsv: qid, query |
Dev 2 | passv2_dev2_top100.txt.gz | 5.1 MB | 428,100 | TREC submission: qid, "Q0", docid, rank, score, runstring |
Dev 2 | passv2_dev2_qrels.tsv | 177.4 KB | 4,456 | TREC qrels format |
Test (TREC test 2021) | 2021_queries.tsv | 24.0 KB | 477 | tsv: qid, query |
Test (TREC test 2021) | 2021_passage_top100.txt.gz | 590.4 KB | 47,700 | TREC submission: qid, "Q0", docid, rank, score, runstring |
Test | 2021.qrels.pass.final.txt | 424 KB | 10,828 | qid, "Q0", docid, rating |
The passage corpus is also in jsonl format. Each passage has:
- pid: The passage identifier encodes the filename and starting position of the passage's jsonl line in the corpus. For example, `msmarco_passage_41_45753370` is in the file `msmarco_v2_passage/msmarco_passage_41` at position `45753370`.
- passage: The text of the passage.
- spans: The position of the passage sentence(s) in the originating document, e.g. `(17789,17900),(17901,18096)`.
- docid: The document ID of the passage's originating document, e.g. `msmarco_doc_35_1343131017`.
The passage corpus can be accessed using the passage ID, by adapting the Python code listed for the document ID case above.
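For example, a minimal sketch of such an adaptation (assuming the corpus has been untarred and the bundles unzipped into `./msmarco_v2_passage/`, mirroring the document example above; the passage ID is the illustrative one from the field description) could look like:

```python
import json

def get_passage(passage_id):
    # Analogous to get_document(): the pid encodes the bundle file and byte position.
    (string1, string2, bundlenum, position) = passage_id.split('_')
    assert string1 == 'msmarco' and string2 == 'passage'

    with open(f'./msmarco_v2_passage/msmarco_passage_{bundlenum}', 'rt', encoding='utf8') as in_fh:
        in_fh.seek(int(position))
        passage = json.loads(in_fh.readline())
        assert passage['pid'] == passage_id
        return passage

passage = get_passage('msmarco_passage_41_45753370')
print(passage.keys())
```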
Passage "spans" use byte offsets, but the document text is in UTF-8, so to extract a span the span (x,y)
from body text you need to use:
doc_json['body'].encode()[x:y].decode()
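Putting the pieces together, a hedged end-to-end illustration (using `get_document` and `get_passage` from the snippets above, with the illustrative passage ID from the field description) that recovers a passage's sentences directly from its originating document:

```python
# Fetch a passage and its originating document, then slice the document body
# using the byte-offset spans. Assumes get_document() and get_passage() as above.
passage = get_passage('msmarco_passage_41_45753370')
document = get_document(passage['docid'])
body_bytes = document['body'].encode()  # spans index into the UTF-8 bytes

# spans look like "(17789,17900),(17901,18096)"
for span in passage['spans'].split('),('):
    x, y = map(int, span.strip('()').split(','))
    print(body_bytes[x:y].decode())
```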
You are generally allowed to use external information while developing your runs. When you submit your runs, please fill in a form listing what resources you used. This could include an external corpus such as Wikipedia or a pretrained model (e.g. word embeddings, BERT). This could also include the provided set of document ranking training data, but also optionally other data such as the passage ranking task labels, external labels, or pretrained models. This will allow us to analyze the runs and break them down into types.
IMPORTANT NOTE: We are now dealing with multiple versions of MS MARCO ranking data, and all the other MS MARCO tasks as well. This new data release changes what is available and usable. Participants should be careful about using those datasets and must adhere to the following guidelines:
- You are now PERMITTED to use the passage-document mapping in your runs. For example, a passage ranking could be generated by first ranking the documents, then identifying all the passages from the top-k documents, then applying a passage reranking algorithm (see the sketch after this list). In previous MS MARCO data, no passage-document mapping was available and we discouraged participants from generating such a mapping, so this approach was not possible.
- You are PROHIBITED from using the ORCAS data this year. You are also PROHIBITED from using any other information that tells us which of this year's documents (or passages) were also present in last year's corpus. We will study whether use of such information could cause some bias or leakage of ground truth, but for now it's prohibited. We may release an ORCAS update.
- Other than ORCAS you are PERMITTED to use any data listed above and from the TREC 2020 Deep Learning Track.
- You are PERMITTED to use any data listed below under the Additional resources section.
- You are PROHIBITED from using any other datasets from msmarco.org, such as the original QnA and NLGEN tasks, in your submission. The original MS MARCO datasets reveal some minor details of how they were constructed that would not be available in a real-world search engine, and hence should be avoided.
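To make the first point above concrete, here is a rough sketch (not an official baseline) of a hybrid pipeline: it builds a docid-to-pid mapping by scanning the passage corpus once, then expands the top-ranked documents from a document run into candidate passages for a passage reranker. The run filename, the cut-off of 10 documents, and the variable names are placeholders.

```python
import glob
import json
from collections import defaultdict

# Build a docid -> [pid, ...] mapping by scanning the passage bundles once.
# In practice you would persist this mapping rather than rebuild it each run.
doc_to_pids = defaultdict(list)
for bundle in sorted(glob.glob('./msmarco_v2_passage/msmarco_passage_*')):
    with open(bundle, 'rt', encoding='utf8') as in_fh:
        for line in in_fh:
            passage = json.loads(line)
            doc_to_pids[passage['docid']].append(passage['pid'])

# Keep the top 10 documents per query from a document ranking run (TREC format).
top_docs = defaultdict(list)
with open('document_run.txt', 'rt', encoding='utf8') as in_fh:
    for line in in_fh:
        qid, _, docid, rank, _score, _runid = line.split()
        if int(rank) <= 10:
            top_docs[qid].append(docid)

# Candidate passages per query, ready to be scored by a passage reranker.
candidates = {qid: [pid for docid in docids for pid in doc_to_pids[docid]]
              for qid, docids in top_docs.items()}
```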
We will be following a format similar to the one used by most TREC submissions, which is repeated below. White space is used to separate columns. The width of the columns is not important, but it is important to have exactly six columns per line with at least one space between the columns.
```
1 Q0 pid1 1 2.73 runid1
1 Q0 pid2 2 2.71 runid1
1 Q0 pid3 3 2.61 runid1
1 Q0 pid4 4 2.05 runid1
1 Q0 pid5 5 1.89 runid1
```

where:
- the first column is the topic (query) number.
- the second column is currently unused and should always be "Q0".
- the third column is the official identifier of the retrieved passage in context of passage ranking task, and the identifier of the retrieved document in context of document ranking task.
- the fourth column is the rank at which the passage/document is retrieved.
- the fifth column shows the score (integer or floating point) that generated the ranking. This score must be in descending (non-increasing) order.
- The sixth column is the ID of the run you are submitting.
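For instance, a purely illustrative helper along these lines (the function and variable names are not part of any official tooling) would produce a file in the required six-column format:

```python
# `results` maps each query id to a list of (docid_or_pid, score) pairs,
# already sorted by descending score; `run_id` identifies your run.
def write_run(results, run_id, path):
    with open(path, 'w', encoding='utf8') as out_fh:
        for qid, ranking in results.items():
            for rank, (doc_or_pid, score) in enumerate(ranking, start=1):
                out_fh.write(f'{qid} Q0 {doc_or_pid} {rank} {score} {run_id}\n')

write_run({'1': [('pid1', 2.73), ('pid2', 2.71)]}, 'runid1', 'my_run.txt')
```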
As the official evaluation set, we provide a set of test queries, a subset of which will be judged by NIST assessors. For this purpose, NIST will use depth pooling and construct separate pools for the passage ranking and document ranking tasks. Passages/documents in these pools will then be labelled by NIST assessors using multi-graded judgments, allowing us to measure NDCG. The same test queries are used for passage retrieval and document retrieval.
Besides our main evaluation using the NIST labels and NDCG, we also have sparse labels for the test queries, which already exist as part of the MS MARCO dataset. More information regarding how these sparse labels were obtained can be found at https://arxiv.org/abs/1611.09268. This allows us to calculate a secondary metric, Mean Reciprocal Rank (MRR). For the full ranking setting, we also compute NCG to evaluate the performance of the candidate generation stage.
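As a rough illustration of the secondary metric (the official evaluation uses NIST's judgments and standard TREC evaluation tooling), MRR@100 could be computed from a run file and a sparse qrels file as follows; `my_run.txt` is a placeholder, and `passv2_dev_qrels.tsv` is one of the files listed above:

```python
from collections import defaultdict

# qid -> set of positively labelled pids from the sparse qrels.
relevant = defaultdict(set)
with open('passv2_dev_qrels.tsv', 'rt', encoding='utf8') as in_fh:
    for line in in_fh:
        qid, _, pid, rating = line.split()
        if int(rating) > 0:
            relevant[qid].add(pid)

# qid -> (rank, pid) pairs from the run file.
ranked = defaultdict(list)
with open('my_run.txt', 'rt', encoding='utf8') as in_fh:
    for line in in_fh:
        qid, _, pid, rank, _score, _runid = line.split()
        ranked[qid].append((int(rank), pid))

# Reciprocal rank of the first relevant result in the top 100, averaged over queries.
reciprocal_ranks = []
for qid, entries in ranked.items():
    rr = 0.0
    for rank, pid in sorted(entries)[:100]:
        if pid in relevant[qid]:
            rr = 1.0 / rank
            break
    reciprocal_ranks.append(rr)

print('MRR@100:', sum(reciprocal_ranks) / len(reciprocal_ranks))
```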
The main type of TREC submission is automatic, which means there is no manual intervention in running the test queries. This means you should not adjust your runs, rewrite the query, retrain your model, or make any other sorts of manual adjustments after you see the test queries. The ideal case is that you only look at the test queries to check that they ran properly (i.e. no bugs), then you submit your automatic runs. However, if you want to have a human in the loop for your run, or do anything else that uses the test queries to adjust your model or ranking, you can mark your run as manual. Manual runs are interesting, and we may learn a lot, but these are distinct from our main scenario, which is a system that responds to unseen queries automatically.
We are sharing the following additional resources which we hope will be useful for the community.
Dataset | Filename | File size | Num Records | Format |
---|---|---|---|---|
Segmented document collection | msmarco_v2_doc_segmented.tar | 25.4 GB | 124,131,414 | tar |
Augmented passage collection | msmarco_v2_passage_augmented.tar | 20.0 GB | 138,364,198 | tar |
To check your downloads, compare to our md5sum data:
md5sum | filename |
---|---|
f2eead4b192683ae5fbd66f4d3f08b96 | docv2_dev2_qrels.tsv |
f000319f1893a7acdd60fdcae0703b95 | docv2_dev2_queries.tsv |
e03b5404e9027569c1aa794b1408d8a5 | docv2_dev2_top100.txt.gz |
aad92d731892ccb0cf9c4c2e37e0f0f1 | docv2_dev_qrels.tsv |
b05dc19f1d2b8ad729f189328a685aa1 | docv2_dev_queries.tsv |
4dd27d511748bede545cd7ae3fc92bf4 | docv2_dev_top100.txt.gz |
2f788d031c2ca29c4c482167fa5966de | docv2_train_qrels.tsv |
7821d8bef3971e12780a80a89a3e5cbd | docv2_train_queries.tsv |
b4d5915172d5f54bd23c31e966c114de | docv2_train_top100.txt.gz |
eea90100409a254fdb157b8e4e349deb | msmarco_v2_doc.tar |
05946bac48a8ffee62e160213eab3fda | msmarco_v2_passage.tar |
8ed8577fa459d34b59cf69b4daa2baeb | passv2_dev2_qrels.tsv |
565b84dfa7ccd2f4251fa2debea5947a | passv2_dev2_queries.tsv |
da532bf26169a3a2074fae774471cc9f | passv2_dev2_top100.txt.gz |
10f9263260d206d8fb8f13864aea123a | passv2_dev_qrels.tsv |
0fa4c6d64a653142ade9fc61d7484239 | passv2_dev_queries.tsv |
fee817a3ee273be8623379e5d3108c0b | passv2_dev_top100.txt.gz |
a2e37e9a9c7ca13d6e38be0512a52017 | passv2_train_qrels.tsv |
1835f44e6792c51aa98eed722a8dcc11 | passv2_train_queries.tsv |
7cd731ed984fccb2396f11a284cea800 | passv2_train_top100.txt.gz |
f18c3a75eb3426efeb6040dca3e885dc | msmarco_v2_doc_segmented.tar |
69acf3962608b614dbaaeb10282b2ab8 | msmarco_v2_passage_augmented.tar |
0bc85e3f2a6f798b91e18f0cd4a6bc6b | 2021_document_top100.txt.gz |
e2be2d307da26d1a3f76eb95507672a3 | 2021_passage_top100.txt.gz |
46d863434dda18300f5af33ee29c4b28 | 2021_queries.tsv |
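As a quick sanity check, the md5 of a downloaded file can be computed in Python and compared against the table above, for example:

```python
import hashlib

# Stream the file in chunks so large downloads do not need to fit in memory.
def md5_of(path, chunk_size=1 << 20):
    digest = hashlib.md5()
    with open(path, 'rb') as in_fh:
        for chunk in iter(lambda: in_fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

print(md5_of('msmarco_v2_doc.tar'))  # expect eea90100409a254fdb157b8e4e349deb
```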
- Nick Craswell (Microsoft)
- Bhaskar Mitra (Microsoft)
- Emine Yilmaz (UCL)
- Daniel Campos (University of Illinois at Urbana-Champaign)
- Jimmy Lin (University of Waterloo)
{% include_relative Notice.md %}