Influential references dataset

What is this?

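It is a dataset of research papers in which the authors themselves have labeled their most influential references. It can serve as a gold standard for projects in natural language processing, machine learning, or bibliometrics.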

What problem does it try to solve?

Given a research paper, can you identify the most important references?
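One way to score an automatic answer to this question is to compare a ranked list of a paper's references against the set the authors marked as essential. The sketch below computes precision at k for that comparison. It is only an illustration: the function name `precision_at_k`, the reference identifiers, and the choice of metric are assumptions made here, not part of the dataset or of the published method.

```python
def precision_at_k(ranked_refs, essential_refs, k):
    """Fraction of the top-k ranked references that are in the author-labeled set.

    ranked_refs    -- list of reference identifiers, most influential first
                      (hypothetical output of some ranking method)
    essential_refs -- set of identifiers the authors labeled as essential
    k              -- how many top-ranked references to inspect
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_refs[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for ref in top_k if ref in essential_refs)
    return hits / len(top_k)


# Hypothetical example: identifiers are made up for illustration.
ranked = ["ref_c", "ref_a", "ref_d", "ref_b"]
gold = {"ref_a", "ref_c"}

score_top2 = precision_at_k(ranked, gold, 2)
score_top4 = precision_at_k(ranked, gold, 4)
```

Because authors were asked for only a few essential references per paper, a small k is probably the most informative setting for a measure like this.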

How was the data collected?

Authors of a paper are in the best position to determine whether a given reference had a strong influence on their research. In a blog post on March 20th 2012, we invited authors to help us create a gold-standard dataset of labeled references. The authors were directed to fill in an online form. The instructions on the form were as follows:

"We believe that most papers are based on 1, 2, 3 or 4 essential references. By an essential reference, we mean a reference that was highly influential or inspirational for the core ideas in your paper; that is, a reference that inspired or strongly influenced your new algorithm, your experimental design, or your choice of a research problem. Other references merely support the work. We believe that authors are the best experts to assess which references are essential. We are interested in automatically finding these references. To know how well we are doing, we need your help: please give us the title of a few of your papers and list for each paper the references that you feel are most essential, those without which the work would not have been possible."

Can I download the data?

We make the raw data available as we collected it, up to March 6th, 2013. We removed only the email addresses that were provided to us. The data is made available under the ODC Public Domain Dedication and Licence.

Is there a reference?

Xiaodan Zhu, Peter Turney, Daniel Lemire, Andre Vellino, Measuring academic influence: Not all citations are equal, Journal of the Association for Information Science and Technology 66 (2), 2015.

Who contributed to this project?

The data was gathered at the initiative of Daniel Lemire, Peter Turney and Andre Vellino.

The contributors are (in no particular order):

  • Suresh Venkatasubramanian (University of Utah, USA),
  • Indrė Žliobaitė (Bournemouth University, UK),
  • James Foulds (University of California, Irvine, USA),
  • David Eppstein (University of California, Irvine, USA),
  • Daniel Gayo-Avello (University of Oviedo, Spain),
  • Jim Harrington (Los Alamos National Laboratory, USA),
  • Simon Mitternacht (University of Bergen, Norway),
  • Matthias Gallé (Université Rennes 1, France),
  • Scott Guthery (Bell Telephone Laboratories, USA),
  • Marko Tkalčič (University of Ljubljana, Slovenia),
  • Jan Jensen (University of Copenhagen, Denmark),
  • Daniel Lemire (LICEF, Université du Québec, Canada),
  • Andre Vellino (National Research Council of Canada, Canada),
  • Peter Turney (National Research Council of Canada, Canada),
  • Felipe Pait (Escola de Eng. Maua, Brazil),
  • Kenneth L. Clarkson (IBM Almaden Research Center, USA),
  • Leonel Morgado (UTAD/GECAD, Portugal),
  • Symeon Papadopoulos (Informatics & Telematics Inst., Greece),
  • Kent E. Holsinger (University of Connecticut, USA),
  • David J. Harris (Washington University, USA),
  • Laurent Duval (IFP Energies nouvelles, France),
  • Falk Hüffner (Technische Universität Berlin, Germany),
  • Lynne Bowker (University of Ottawa, Canada),
  • Maarten van Emden (University of Victoria, Canada),
  • Adrian Groves (University of Oxford, UK),
  • Anthony Labarre (K. U. Leuven, Belgium),
  • George Foster (National Research Council of Canada, Canada),
  • Jérôme Darmont (Lyon 2, France),
  • Laurence Tratt (King's College London, UK),
  • Drew Sowersby (Texas State University, USA),
  • Daniel Mietchen (Open Knowledge Foundation, Germany),
  • Chandra Chekuri (University of Illinois, Urbana-Champaign, USA),
  • Michel Desmarais (École Polytechnique, Canada),
  • Mikołaj Morzy (Poznań University of Technology, Poland),
  • Nathanael Schaeffer (CNRS, France),
  • Morgan Price (Lawrence Berkeley National Laboratory, USA),
  • Chris Fournier (University of Ottawa, Canada),
  • Xiaodan Zhu (National Research Council of Canada, Canada),
  • Abhaya Agarwal (Carnegie Mellon University, USA),
  • Clifton Phua (Institute of Infocomm Research, Singapore).

