maclaughlin/ARC-Sim

ARC-Sim

This repository links to and describes the ARC-Sim dataset, introduced in our paper Recovering Lexically and Semantically Reused Texts, published at *SEM-21.

Dataset: Information & Copyright

This dataset is derived from the ACL Anthology Reference Corpus (ARC). See the ARC paper for details about the original corpus, and our paper for the processing and filtering steps we applied. The data in ARC comes from the ACL Anthology and is licensed under a Creative Commons BY-NC-SA 3.0 license, so this dataset is distributed under the same license as the original. Please see the ACL Anthology copyright page for full details.

Dataset: Files & Format

The dataset contains three files: train.jsonl.zip, val.jsonl.zip, and test.jsonl.zip. Each is a zip-compressed JSON Lines file, with one JSON record per line.
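As a minimal sketch of how these files could be read without unpacking them to disk, the helper below streams records straight out of a zip archive. The function name `iter_records` is hypothetical, and the code assumes that each archive holds a single member whose name ends in `.jsonl` and that the text is UTF-8 encoded; neither detail is stated above, so adjust if the archives differ.

```python
import io
import json
import zipfile
from typing import Iterator


def iter_records(zip_path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of the .jsonl member inside zip_path.

    Assumption: the archive contains one member ending in ".jsonl".
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Skip macOS resource-fork entries, which can also end in ".jsonl".
        members = [
            name
            for name in archive.namelist()
            if name.endswith(".jsonl") and not name.startswith("__MACOSX/")
        ]
        if not members:
            raise ValueError(f"no .jsonl member found in {zip_path}")
        with archive.open(members[0]) as raw:
            # Wrap the binary stream so lines are decoded lazily as text.
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                line = line.strip()
                if line:
                    yield json.loads(line)
```

For example, `records = list(iter_records("val.jsonl.zip"))` would load the validation split into memory, while iterating over `iter_records("train.jsonl.zip")` directly keeps only one record in memory at a time.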

Each record contains the following fields:

  • cited_paper_id: str, the ACL Anthology ID of the source paper, e.g. P08-1119.
  • cited_paper_abstract: List[str], the abstract of the source paper, split into sentences.
  • citing_paper_id: str, the ACL Anthology ID of the target paper, e.g. P10-1044. Note that, as described in our paper, we use only a single section from each target paper as the target document for each pair.
  • doc_label: int, 0 or 1, indicating whether the target paper section cites the source paper.
  • sentence_labels: List[int], one label (0 or 1) per sentence in the target paper section, indicating whether that sentence contains a citation to the source paper.
  • citing_paper_section_text: List[str], the sentences in the target paper section. This list has the same length as sentence_labels. All citation marks have been removed from the text.
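The field descriptions above can be sketched as a small consistency check plus a helper that pairs each sentence with its label. The function names `check_record` and `citing_sentences` are hypothetical, and the example record is made up for illustration (only the two paper IDs are taken from the examples above); it is not real data from the dataset.

```python
from typing import List


def check_record(record: dict) -> None:
    """Raise ValueError if a record violates the format described above."""
    sentences = record["citing_paper_section_text"]
    labels = record["sentence_labels"]
    if len(sentences) != len(labels):
        raise ValueError("sentence_labels and citing_paper_section_text differ in length")
    if record["doc_label"] not in (0, 1):
        raise ValueError("doc_label must be 0 or 1")
    if any(label not in (0, 1) for label in labels):
        raise ValueError("every sentence label must be 0 or 1")


def citing_sentences(record: dict) -> List[str]:
    """Return the target-section sentences labeled 1 (citing the source paper)."""
    check_record(record)
    pairs = zip(record["citing_paper_section_text"], record["sentence_labels"])
    return [sentence for sentence, label in pairs if label == 1]


# Made-up record for illustration only; the text is not from the dataset.
example = {
    "cited_paper_id": "P08-1119",
    "cited_paper_abstract": ["First abstract sentence.", "Second abstract sentence."],
    "citing_paper_id": "P10-1044",
    "doc_label": 1,
    "sentence_labels": [0, 1, 0],
    "citing_paper_section_text": [
        "An introductory sentence.",
        "A sentence that originally carried the citation.",
        "A closing sentence.",
    ],
}

print(citing_sentences(example))
```

The check deliberately does not enforce any relationship between doc_label and sentence_labels, since the description above does not state one.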

Citation

If you use our dataset, please cite our paper: Ansel MacLaughlin*, Shaobin Xu*, and David A. Smith. Recovering Lexically and Semantically Reused Texts. In Proceedings of the Tenth Joint Conference on Lexical and Computational Semantics (*SEM-21), 2021.
