This repository contains a link to and a description of the ARC-Sim dataset from our paper "Recovering Lexically and Semantically Reused Texts", published at *SEM-21.
This dataset is derived from the ACL Anthology Reference Corpus (ARC). Please see the ARC paper for more details about the original corpus, and our paper for the processing and filtering steps we applied. The data in ARC comes from the ACL Anthology and is licensed under a Creative Commons 3.0 BY-NC-SA license, so this dataset is distributed under the same license. Please see the ACL Anthology copyright page for full details.
The dataset contains three files: train.jsonl.zip, val.jsonl.zip, and test.jsonl.zip. Each is a zip-compressed JSON Lines file, with one JSON record per line.
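As a minimal sketch of how one of these files might be read without unpacking it first, the Python snippet below streams records straight out of the archive. The function name iter_records is our own, and the code assumes that each archive holds exactly one UTF-8 encoded jsonlines member; check the archive contents if that assumption does not hold.

```python
import io
import json
import zipfile
from typing import Any, Dict, Iterator


def iter_records(zip_path: str) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty line of the first member in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the archive contains a single jsonlines file,
        # so the first listed member is the one we want.
        member = archive.namelist()[0]
        with archive.open(member) as raw:
            # archive.open() returns bytes; wrap it to decode line by line.
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                line = line.strip()
                if line:  # skip blank lines, e.g. a trailing newline
                    yield json.loads(line)
```

Iterating lazily avoids holding a whole split in memory; wrap the call in list(...) if random access is needed, for example list(iter_records("val.jsonl.zip")).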
Each record contains the following fields:
- cited_paper_id: string, the ACL Anthology ID of the source paper, e.g. P08-1119.
- cited_paper_abstract: List[str], the abstract of the source paper, broken up into sentences.
- citing_paper_id: string, the ACL Anthology ID of the target paper, e.g. P10-1044. Note that, as described in our paper, we use only a single section from each target paper as the target document for each pair.
- doc_label: int, 0 or 1, indicating whether the target paper section cites the source paper.
- sentence_labels: List[int], one label (0 or 1) per sentence in the target paper section, indicating whether that sentence contains a citation to the source paper.
- citing_paper_section_text: List[str], list of sentences in the target paper section. This list has the same length as sentence_labels. All citation marks have been removed from the text.
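
The field layout above can be illustrated with a short Python sketch that pairs each target sentence with its label and returns the sentences marked as citing the source paper. The helper name citing_sentences and the example record are hypothetical: the two IDs are the examples given above, but the sentence text is invented placeholder content, not real data.

```python
from typing import Any, Dict, List


def citing_sentences(record: Dict[str, Any]) -> List[str]:
    """Return the target-section sentences whose label is 1."""
    sentences = record["citing_paper_section_text"]
    labels = record["sentence_labels"]
    # The two lists are documented as having the same length.
    if len(sentences) != len(labels):
        raise ValueError("sentence_labels and citing_paper_section_text differ in length")
    return [sent for sent, label in zip(sentences, labels) if label == 1]


# Hypothetical record: placeholder text, for illustration only.
example = {
    "cited_paper_id": "P08-1119",
    "cited_paper_abstract": ["Placeholder abstract sentence."],
    "citing_paper_id": "P10-1044",
    "doc_label": 1,
    "sentence_labels": [0, 1],
    "citing_paper_section_text": [
        "Placeholder sentence without a citation.",
        "Placeholder sentence that cited the source paper.",
    ],
}

print(citing_sentences(example))
```

The sketch checks only the length invariant stated above; it makes no assumption about how doc_label relates to the individual sentence labels.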
If you use our dataset, please cite our paper: Ansel MacLaughlin*, Shaobin Xu*, and David A Smith. Recovering Lexically and Semantically Reused Texts. In Proceedings of the Tenth Joint Conference on Lexical and Computational Semantics (*SEM-21), 2021.