
kylase/annotated-reference-strings


Annotated Reference Strings Dataset

Introduction

The annotated_reference_strings dataset consists of millions of reference strings synthesized in up to 17 CSL styles using a CSL processor (citeproc-js). Each short sequence of tokens (segment) is annotated with the variable it is derived from.

This library provides utilities to parse a raw annotated string into a sequence of (token, label) tuples.
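
To illustrate the kind of output described above, here is a minimal sketch of such a parser. It does not use this library's API. It assumes, purely for illustration, that segments are marked up as `<label>text</label>` and that text outside any segment gets a placeholder label; the function name `parse_annotated`, the `other` label, and the example string are all hypothetical, and the dataset's actual markup may differ.

```python
import re
from typing import List, Tuple

# ASSUMPTION: segments look like <label>text</label>. The real dataset's
# markup may differ; check the documentation before relying on this.
SEGMENT = re.compile(r"<(?P<label>[\w-]+)>(?P<text>.*?)</(?P=label)>", re.DOTALL)


def parse_annotated(raw: str, default_label: str = "other") -> List[Tuple[str, str]]:
    """Split a raw annotated string into (token, label) tuples."""
    pairs: List[Tuple[str, str]] = []
    cursor = 0
    for match in SEGMENT.finditer(raw):
        # Text between segments (punctuation, delimiters) gets the default label.
        for token in raw[cursor:match.start()].split():
            pairs.append((token, default_label))
        # Every whitespace-separated token inside a segment inherits its label.
        for token in match.group("text").split():
            pairs.append((token, match.group("label")))
        cursor = match.end()
    # Trailing text after the last segment.
    for token in raw[cursor:].split():
        pairs.append((token, default_label))
    return pairs


# Hypothetical example string, not taken from the dataset.
example = "<author>Kee, Y. C.</author> (<issued>2021</issued>). <title>Annotated strings</title>."
for token, label in parse_annotated(example):
    print(f"{token}\t{label}")
```

Tokens are split on whitespace only, which keeps the sketch short; a real tokenizer would likely separate punctuation as well.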

For more information on the library and the dataset, refer to the documentation.

Obtaining the dataset

The dataset was prepared at the National University of Singapore (NUS), School of Computing (SoC), Web Information Retrieval / Natural Language Processing Group (WING), as part of a Master's project.

The dataset is bundled in separate files, so you can obtain it in part or in full. There are two ways to do so:

If you are downloading from Google Drive, it is faster to use gdown, because Google zips the files when you download them through the web interface:

pip install gdown
gdown <url of the file>

If you are using Hugging Face's datasets library:

from datasets import load_dataset
dataset = load_dataset('yuanchuan/annotated_reference_strings')

Citing

If you are using the dataset, please cite the following:

@techreport{kee-nus-2021,
    author = {Yuan Chuan Kee},
    title = {Synthesis of a large dataset of annotated reference strings for developing citation parsers},
    institution = {National University of Singapore},
    year = {2021}
}
