CODEC

Complex Document and Entity Collection

Overview

CODEC is a new document and entity ranking benchmark that focuses on complex research topics. We target essay-style information needs of social science researchers across history, economics, and politics. For example, ‘How has the UK’s Open Banking Regulation benefited Challenger Banks?’

CODEC includes 42 topics developed by researchers and a new focused web corpus with semantic annotations including entity links. It includes expert judgments on 6,186 documents (147.3 per topic) and 11,323 entities (269.6 per topic) from diverse automatic and interactive manual runs. The manual runs include 387 query reformulations (9.2 per topic), providing data for query performance prediction and automatic rewriting evaluation.

Paper

This work will be presented at SIGIR 2022: https://arxiv.org/abs/2205.04546

Correct citation:

@inproceedings{mackie2022codec,
 title={CODEC: Complex Document and Entity Collection},
 author={Mackie, Iain and Owoicho, Paul and Gemmell, Carlos and Fischer, Sophie and MacAvaney, Sean and Dalton, Jeffery},
 booktitle={Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
 year={2022}
}

Dataset

CODEC provides 42 topics for document and entity retrieval:

The full CODEC document corpus is available for research purposes: FULL. A sample is also available: SAMPLE.

CODEC entity KB is KILT's snapshot of Wikipedia (~30GB).

We provide entity links over document corpus: link (~13GB).

Colab demo showing indexing, query reformulations, entity links, and evaluation:

Dataset is available via ir-datasets.

Change Log

Major dataset changes that existing users should be aware of:

25th April: CODEC v1 released.

Tasks

CODEC is a test collection that provides two tasks: document ranking and entity ranking. This dataset benchmarks a social science researcher who is attempting to find supporting entities and documents that will form the basis of a long-form essay discussing the topic from various perspectives. The researcher would explore the topic to (1) identify relevant sources and (2) understand key concepts.

Document ranking systems have to return a relevance-ranked list of documents for a given natural language query. Entity ranking systems have to return a relevance-ranked list of entities for a given natural language query. Document ranking uses CODEC’s new document corpus and entity ranking uses KILT as the entity knowledge base. For the experimental setup, we provide four pre-defined ‘standard’ folds for k-fold cross-validation to allow parameter tuning. Initial retrieval or re-ranking of provided baseline runs can both be evaluated using this dataset.
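The cross-validation setup can be sketched as follows. This is a minimal illustration, not the released tooling: the fold layout, the topic ids, the parameter grid, and the `toy_evaluate` function are all hypothetical stand-ins for the provided folds and a real metric such as MAP.

```python
from statistics import mean

def cross_validate(folds, param_grid, evaluate):
    """Tune on the training folds, then score each held-out fold once.

    folds: list of lists of topic ids (hypothetical layout).
    param_grid: candidate parameter settings.
    evaluate: callable(params, topic_ids) -> mean metric over those topics.
    """
    held_out_scores = []
    for i, test_topics in enumerate(folds):
        # All topics outside the held-out fold are used for tuning.
        train_topics = [t for j, fold in enumerate(folds) if j != i for t in fold]
        best = max(param_grid, key=lambda p: evaluate(p, train_topics))
        held_out_scores.append(evaluate(best, test_topics))
    return mean(held_out_scores)


# Toy example: made-up topic ids and a made-up scoring function.
toy_folds = [["t1", "t2"], ["t3", "t4"], ["t5", "t6"], ["t7", "t8"]]
toy_grid = [{"k1": 0.9}, {"k1": 1.2}, {"k1": 2.0}]

def toy_evaluate(params, topic_ids):
    # Pretend effectiveness peaks at k1 = 1.2 regardless of topic.
    return 1.0 - abs(params["k1"] - 1.2)

cv_score = cross_validate(toy_folds, toy_grid, toy_evaluate)
```

Parameters are selected on three folds and reported only on the fourth, so no topic contributes to both tuning and its own evaluation.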

Complex Topics

CODEC provides 42 complex topics which intend to benchmark the role of a researcher. Social science experts from history (history teacher, published history scholar), economics (FX trader, accountant, investment banker) and politics (political scientists, politician) helped to generate interesting and factually-grounded topics. The authors develop the following criteria for complex topics:

Open-ended essay-style
Natural language question
Multiple points of view
Concern multiple key entities
Complex
Requires knowledge

Each topic contains a query and narrative. The query is the question the researcher seeks to understand by exploring documents and entities, i.e., the text input posed to the search system. The narratives provide an overview of the topic (key concepts, arguments, facts, etc.) and allow non-domain-experts to understand the topic.

Document Corpus

We use Common Crawl to curate a 729,824 document corpus with focused content across finance, history, and politics.

The corpus is released in JSON Lines format with the following fields:

id: Unique identifier, computed as the MD5 hash of the URL.
url: Location of the webpage (URL).
title: Title of the webpage, if available.
contents: The text content of the webpage after removing any unnecessary advertising or formatting. New lines provide some structure between the extracted sections of the webpage, while remaining easy for neural systems to process.
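A minimal sketch of reading records with these fields is shown below. It assumes one JSON object per line, that the id is the hexadecimal MD5 digest, and that the URL is hashed as UTF-8 without further normalization; none of these details are confirmed above. The sample record is made up.

```python
import hashlib
import io
import json

def doc_id_from_url(url):
    """MD5 hex digest of the URL (UTF-8 encoding is an assumption)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

def iter_documents(lines):
    """Yield one dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Made-up record for illustration; real records come from the corpus file.
sample_url = "https://example.com/article"
sample = json.dumps({
    "id": doc_id_from_url(sample_url),
    "url": sample_url,
    "title": "Example title",
    "contents": "First section.\nSecond section.",
})

docs = list(iter_documents(io.StringIO(sample + "\n")))
# New lines in `contents` separate the extracted sections of the page.
sections = docs[0]["contents"].split("\n")
```

For the real corpus, `iter_documents` can be passed an open file handle instead of the in-memory sample.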

Document distribution:

Domain	Document Count
reuters.com	172,127
forbes.com	147,399
cnbc.com	100,842
britannica.com	93,484
latimes.com	88,486
usatoday.com	31,803
investopedia.com	21,459
bbc.co.uk	21,414
history.state.gov	9,187
brookings.edu	9,058
ehistory.osu.edu	8,805
history.com	6,749
spartacus-educational.com	3,904
historynet.com	3,811
historyhit.com	3,173
...	...
TOTAL	721,701

Entity KB

CODEC uses KILT’s Wikipedia KB for the entity ranking task, which is based on the 2019/08/01 Wikipedia snapshot. KILT contains 5.9M preprocessed articles, which are freely available to use: link.

Judgments

CODEC uses a two-stage assessment approach that balances adequate coverage of current systems with allowing annotators to explore topics using an iterative search system. The raw judgments are released: link.

This creates 6,186 document judgments (147.3 per topic) and 11,323 entity judgments (269.6 per topic), with the following grade distribution:

Judgment	Document Ranking	Entity Ranking
0	2,353	7,053
1	2,210	2,241
2	1,207	1,252
3	416	777
TOTAL	6,186	11,323
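As a quick consistency check, the per-grade counts in the table can be summed and divided by the 42 topics. The snippet below is only a worked arithmetic sketch using the numbers from the table; the `summarize` helper is illustrative.

```python
NUM_TOPICS = 42

# Judgment counts per grade (0-3), copied from the table above.
document_counts = {0: 2353, 1: 2210, 2: 1207, 3: 416}
entity_counts = {0: 7053, 1: 2241, 2: 1252, 3: 777}

def summarize(counts, num_topics=NUM_TOPICS):
    """Return the total, the per-topic average, and each grade's share."""
    total = sum(counts.values())
    return {
        "total": total,
        "per_topic": round(total / num_topics, 1),
        "grade_share": {grade: n / total for grade, n in counts.items()},
    }

document_summary = summarize(document_counts)
entity_summary = summarize(entity_counts)
```

The totals and per-topic averages computed this way match the figures quoted in the text (6,186 and 147.3 for documents, 11,323 and 269.6 for entities).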

Query Reformulations

During the assessment process, researchers use a live search system to explore the complex topic. We release all 387 queries and their mapped relevance judgments: link

An example of these manual query reformulations:

Entity-Centric Search

CODEC provides aligned document and entity judgments that allow new entity-centric search models to be developed.

We also release REL entity links grounding our document corpus to the entity KB: link

mention: Text span in the document that is linked to an entity.
prediction: Top predicted entity link (Wikipedia title).
prediction_kilt: The predicted entity link mapped to its KILT id, to align with the entity KB and entity judgments.
candidates: Top-k entity link candidates (Wikipedia title).
candidates_kilt: The candidate entity links mapped to their KILT ids, to align with the entity KB and entity judgments.
conf_ed: Score of Flair NER model.
score: Scores of REL candidate selection model.
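One way these fields might be used is to aggregate linked entities per document, for example as features for an entity-centric ranker. The sketch below assumes each document's links arrive as a list of dicts keyed by the field names above and that `conf_ed` is a single float; the container layout, value types, ids, and scores are all illustrative assumptions.

```python
from collections import Counter

def entity_counts(links, min_conf=0.0):
    """Count KILT ids among a document's links, skipping low-confidence ones."""
    counts = Counter()
    for link in links:
        if link["conf_ed"] >= min_conf:
            counts[link["prediction_kilt"]] += 1
    return counts

# Made-up links for one document; ids and scores are illustrative only.
toy_links = [
    {"mention": "Open Banking", "prediction": "Open_banking",
     "prediction_kilt": "kilt-1", "conf_ed": 0.91},
    {"mention": "open banking", "prediction": "Open_banking",
     "prediction_kilt": "kilt-1", "conf_ed": 0.78},
    {"mention": "Monzo", "prediction": "Monzo",
     "prediction_kilt": "kilt-2", "conf_ed": 0.40},
]

top_entities = entity_counts(toy_links, min_conf=0.5).most_common()
```

Because the counts are keyed by KILT id, they can be joined directly against the entity KB and the entity judgments.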

Evaluation

We provide TREC-style query-relevance files for entity ranking (link) and document ranking (link).

The official measures for both tasks include MAP, NDCG@10, and Recall@1000.
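For readers unfamiliar with these measures, the following is a rough per-topic sketch in plain Python. It is not the official evaluation tooling: it assumes grades of 1 or higher count as relevant for MAP and recall, and uses linear gain with a log2 rank discount for NDCG, which may differ from the settings used for the reported numbers. The toy judgments and ranking are made up.

```python
import math

def average_precision(ranking, qrels, min_rel=1):
    """AP for one topic; MAP is the mean of this value over all topics."""
    relevant = {d for d, g in qrels.items() if g >= min_rel}
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def ndcg_at_k(ranking, qrels, k=10):
    """NDCG@k with linear gain and log2 discount (an assumed formulation)."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    gains = [qrels.get(doc, 0) for doc in ranking[:k]]
    ideal_dcg = dcg(sorted(qrels.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

def recall_at_k(ranking, qrels, k=1000, min_rel=1):
    """Fraction of relevant items found in the top k."""
    relevant = {d for d, g in qrels.items() if g >= min_rel}
    if not relevant:
        return 0.0
    return len(relevant & set(ranking[:k])) / len(relevant)

# Toy judgments (id -> grade) and a toy ranking for a single topic.
toy_qrels = {"d1": 3, "d2": 0, "d3": 1}
toy_ranking = ["d2", "d1", "d4", "d3"]

ap = average_precision(toy_ranking, toy_qrels)
ndcg = ndcg_at_k(toy_ranking, toy_qrels, k=10)
recall = recall_at_k(toy_ranking, toy_qrels, k=1000)
```

Unjudged items such as `d4` are treated as non-relevant here, which is the usual convention for TREC-style evaluation.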

System Performance

Systems:

Sparse retrieval BM25 and BM25+RM3 runs use Pyserini with Porter stemming and stopwords removed. We cross-validate and release the tuned parameters here.

ANCE is a dense retrieval model. We use an ANCE model fine-tuned on MS MARCO and Pyserini’s wrapper for easy indexing. ANCE-FirstP takes the first 512 BERT tokens of each document to represent that document, while ANCE-MaxP shards the document into a maximum of four 512-token shards, with the maximum score representing the document.
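The difference between the FirstP and MaxP document representations can be illustrated with a small sketch. It assumes non-overlapping shards and uses a toy term-count scorer in place of the dense similarity; the function names and the toy document are hypothetical, and the actual runs rely on Pyserini rather than this code.

```python
def shard_tokens(tokens, shard_len=512, max_shards=4):
    """Split a token list into consecutive shards, keeping at most max_shards."""
    shards = [tokens[i:i + shard_len] for i in range(0, len(tokens), shard_len)]
    return shards[:max_shards]

def first_p_score(tokens, score_fn, shard_len=512):
    """FirstP: score only the first shard_len tokens of the document."""
    return score_fn(tokens[:shard_len])

def max_p_score(tokens, score_fn, shard_len=512, max_shards=4):
    """MaxP: score each shard and keep the maximum as the document score."""
    shards = shard_tokens(tokens, shard_len, max_shards)
    # Empty documents get 0.0, which is an arbitrary choice for this sketch.
    return max((score_fn(s) for s in shards), default=0.0)

# Toy stand-in for a query-document similarity: counts one query term.
def toy_score(shard):
    return shard.count("banking")

# The matching terms sit beyond the first 512 tokens of this made-up document.
toy_doc = ["filler"] * 600 + ["banking"] * 3 + ["filler"] * 2000
first_p = first_p_score(toy_doc, toy_score)
max_p = max_p_score(toy_doc, toy_score)
num_shards = len(shard_tokens(toy_doc))
```

In this toy case FirstP misses the relevant passage entirely, while MaxP finds it in the second shard; text beyond the fourth shard is ignored by both.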

T5 is a state-of-the-art LM re-ranker that casts text re-ranking as a sequence-to-sequence task. We use Pygaggle’s MonoT5 model, which is fine-tuned on MS MARCO. We employ a max-passage approach similar to Nogueira et al. (2020) to re-rank all initial retrieval runs.

Document ranking:

System	MAP	NDCG@10	Recall@1000
BM25	0.213	0.322	0.762
BM25+RM3	0.233	0.327	0.800
ANCE-MaxP	0.186	0.363	0.689
BM25+T5	0.340	0.468	0.762
BM25+RM3+T5	0.346	0.472	0.800
ANCE-MaxP+T5	0.316	0.481	0.689

Entity ranking:

System	MAP	NDCG@10	Recall@1000
BM25	0.181	0.397	0.615
BM25+RM3	0.209	0.412	0.685
ANCE-FirstP	0.076	0.269	0.340
BM25+T5	0.172	0.361	0.615
BM25+RM3+T5	0.179	0.362	0.685
ANCE-FirstP+T5	0.136	0.407	0.340

Future Work

We envision CODEC as an evolving collection, with additional judgments and tasks added in the future, e.g., knowledge-grounded generation, passage ranking, and entity linking. The topics could also be further enhanced with facet annotations and semantic annotations to support research on tail and non-KG entities.

Please suggest any future extensions or bug fixes on github or email (i.mackie.1@research.gla.ac.uk).
