Annotated Reference Strings Dataset

Introduction

The annotated_reference_strings dataset consists of millions of reference strings synthesized in up to 17 CSL styles using a CSL processor (citeproc-js). Each short sequence of tokens (segment) is annotated with the CSL variable it is derived from.

This library provides utilities to parse a raw annotated string into a sequence of (token, label) tuples.
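To illustrate the idea of turning an annotated string into (token, label) tuples, here is a minimal sketch. It is not this library's API: the function name `parse_annotated`, the `<label>text</label>` markup, the whitespace tokenization, and the `other` label for unannotated text are all assumptions made for the example. Refer to the documentation for the actual raw format and parsing utilities.

```python
import re
from typing import List, Tuple

# ASSUMPTION: segments are wrapped as <label>text</label>. The real raw
# format of the dataset may differ; this pattern is illustrative only.
SEGMENT = re.compile(
    r"<(?P<label>[\w-]+)>(?P<text>.*?)</(?P=label)>|(?P<other>[^<]+)",
    re.DOTALL,
)


def parse_annotated(raw: str, outside_label: str = "other") -> List[Tuple[str, str]]:
    """Hypothetical helper: split a raw annotated string into (token, label) pairs."""
    pairs: List[Tuple[str, str]] = []
    for match in SEGMENT.finditer(raw):
        if match.group("label") is not None:
            # Text inside a tag takes the tag name as its label.
            label, text = match.group("label"), match.group("text")
        else:
            # Text between tags (punctuation, delimiters) gets a fallback label.
            label, text = outside_label, match.group("other")
        # Whitespace tokenization is a simplification chosen for this sketch.
        pairs.extend((token, label) for token in text.split())
    return pairs


example = "<author>Kee, Y. C.</author> (<year>2021</year>). <title>A title</title>."
for token, label in parse_annotated(example):
    print(f"{token}\t{label}")
```

Every token inherits the label of the segment that contains it, so a multi-word segment such as an author name yields several tuples with the same label.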

For more information on the library and the dataset, refer to the documentation.

Obtaining the dataset

The dataset was prepared at the Web Information Retrieval / Natural Language Processing Group (WING), School of Computing (SoC), National University of Singapore (NUS), as part of a Master's project.

The dataset is bundled in separate files, so you can obtain it in parts or in full, in either of two ways:

If you are downloading from Google Drive, gdown is faster than the web interface, because Google zips the files when you download them through the browser:

pip install gdown
gdown <url of the file>

If you are using Hugging Face's datasets library:

from datasets import load_dataset
dataset = load_dataset('yuanchuan/annotated_reference_strings')

Citing

If you are using the dataset, please cite the following:

@techreport{kee-nus-2021,
    author = {Yuan Chuan Kee},
    title = {Synthesis of a large dataset of annotated reference strings for developing citation parsers},
    institution = {National University of Singapore},
    year = {2021}
}
