ARC-Sim

This repository links to and describes the ARC-Sim dataset from our paper, Recovering Lexically and Semantically Reused Texts, published at *SEM-21.

Dataset: Information & Copyright

This dataset is derived from the ACL Anthology Reference Corpus (ARC). Please see the ARC paper for more details about the original corpus, and our paper for the processing and filtering steps we took. The data in ARC comes from the ACL Anthology and is licensed under a Creative Commons 3.0 BY-NC-SA license, so this dataset is distributed under the same license. Please see the ACL Anthology copyright page for full details.

Dataset: Files & Format

The dataset contains three files: train.jsonl.zip, val.jsonl.zip, and test.jsonl.zip. Each is a compressed JSON Lines file, with one JSON record per line.
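As a minimal sketch of reading one of these files without unpacking it to disk, the helper below streams records straight out of the archive. The function name iter_records is hypothetical, and the code assumes each archive holds a single JSON Lines member encoded as UTF-8, which this README does not state explicitly.

```python
import io
import json
import zipfile
from typing import Iterator


def iter_records(zip_path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of the JSON Lines file inside zip_path."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the archive contains exactly one .jsonl member,
        # so the first listed name is the data file.
        member = archive.namelist()[0]
        with archive.open(member) as raw:
            # Assumption: the file is UTF-8 encoded text.
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                line = line.strip()
                if line:  # skip blank lines, e.g. a trailing newline
                    yield json.loads(line)
```

For example, list(iter_records("train.jsonl.zip")) would load the whole training split into memory, while iterating over the generator directly keeps only one record in memory at a time.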

Each record contains the following fields:

cited_paper_id: string, the ACL Anthology ID of the source paper, e.g. P08-1119.
cited_paper_abstract: List[str], the abstract of the source paper, broken up into sentences.
citing_paper_id: string, the ACL Anthology ID of the target paper, e.g. P10-1044. Note that, as described in our paper, we use only a single section from each target paper as the target document for each pair.
doc_label: int, 0 or 1, indicating whether the target paper section cites the source paper.
sentence_labels: List[int], one label (0 or 1) per sentence in the target paper section, indicating whether that sentence contains a citation to the source paper.
citing_paper_section_text: List[str], list of sentences in the target paper section. This list has the same length as sentence_labels. All citation marks have been removed from the text.
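The field descriptions above can be turned into a quick sanity check. The sketch below is illustrative only: validate_record and citing_sentences are hypothetical helper names, and the sample record used to exercise them is synthetic (only the two paper IDs are taken from the examples above).

```python
from typing import Dict, List

# Field names and Python types, as described in the field list above.
EXPECTED_FIELDS: Dict[str, type] = {
    "cited_paper_id": str,
    "cited_paper_abstract": list,
    "citing_paper_id": str,
    "doc_label": int,
    "sentence_labels": list,
    "citing_paper_section_text": list,
}


def validate_record(record: dict) -> List[str]:
    """Return a list of problems; an empty list means the record matches the description."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    if problems:
        return problems  # later checks need every field to be present and typed

    if record["doc_label"] not in (0, 1):
        problems.append("doc_label must be 0 or 1")
    if any(label not in (0, 1) for label in record["sentence_labels"]):
        problems.append("sentence_labels must contain only 0 or 1")
    # The two lists are documented as having the same length.
    if len(record["sentence_labels"]) != len(record["citing_paper_section_text"]):
        problems.append("sentence_labels and citing_paper_section_text differ in length")
    return problems


def citing_sentences(record: dict) -> List[str]:
    """Return the target-section sentences labeled 1 (containing a citation to the source)."""
    pairs = zip(record["citing_paper_section_text"], record["sentence_labels"])
    return [sentence for sentence, label in pairs if label == 1]


# Synthetic record for illustration; the sentence text is made up, not taken from the dataset.
sample = {
    "cited_paper_id": "P08-1119",
    "cited_paper_abstract": ["First abstract sentence.", "Second abstract sentence."],
    "citing_paper_id": "P10-1044",
    "doc_label": 1,
    "sentence_labels": [0, 1],
    "citing_paper_section_text": ["An unrelated sentence.", "A sentence that cited the source."],
}
```

Here validate_record(sample) returns an empty list, and citing_sentences(sample) returns the single sentence labeled 1.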

Citation

If you use our dataset, please cite our paper: Ansel MacLaughlin*, Shaobin Xu*, and David A Smith. Recovering Lexically and Semantically Reused Texts. In Proceedings of the Tenth Joint Conference on Lexical and Computational Semantics (*SEM-21), 2021.
