Dataset of ML and NLP papers

ML and NLP paper data

This repository contains the data crawled and processed for the post series on ML and NLP publications.

The project was created by Marek Rei (@MarekRei). The country annotation was contributed by Jonas Pfeiffer (@PfeiffJo) and Andrew Caines (@cainesap).

Conference proceedings

The papers directory contains JSON files for each of the crawled conferences. Open any of them to see the available metadata.
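Since the metadata fields are not listed here, one way to explore them is to summarize each file's top-level structure. The following is a minimal sketch, assuming the files sit in a local directory named papers and end in .json; the function names are hypothetical and no particular schema is assumed.

```python
import json
from pathlib import Path


def summarize_json_file(path):
    """Describe the top-level shape of one JSON file without assuming a schema."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, dict):
        # For an object, report up to ten of its keys.
        return {"type": "dict", "size": len(data), "keys": sorted(data)[:10]}
    if isinstance(data, list):
        # For an array, report the keys of the first element if it is an object.
        first = data[0] if data else None
        keys = sorted(first) if isinstance(first, dict) else []
        return {"type": "list", "size": len(data), "keys": keys[:10]}
    return {"type": type(data).__name__, "size": 1, "keys": []}


def summarize_directory(directory="papers"):
    """Map each *.json file name in the directory to its summary."""
    # The directory name and the .json extension are assumptions.
    return {
        p.name: summarize_json_file(p)
        for p in sorted(Path(directory).glob("*.json"))
    }
```

Calling summarize_directory() from the repository root and printing the result should show which keys each conference file exposes.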

Country annotation

annotated_orgs.tsv contains the following columns in tab-separated format:

  • id
  • org_name - the name of the organization, as crawled
  • paper_count - the number of papers that matched that name, after initial processing
  • is_org - manually annotated field, indicating whether this is an actual organization or crawling noise
  • canonical_org_name - a canonical name for this organization, to match together different versions
  • country - manually annotated country name for each organization
  • example1 - an example paper from which this organization was crawled
  • example2 - another example
  • example3 - another example
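The columns above can be read with Python's standard csv module. This is a rough sketch, not an official loader: it assumes the columns appear in the order listed, that a header row (if present) starts with id, and that paper_count is an integer. The helper names read_orgs and papers_per_country are hypothetical. The encoding of is_org is not documented here, so the sketch does not filter on it and only skips rows that have no country.

```python
import csv
from collections import Counter

# Column order as listed in this README (assumed to match the file).
COLUMNS = [
    "id", "org_name", "paper_count", "is_org", "canonical_org_name",
    "country", "example1", "example2", "example3",
]


def read_orgs(lines):
    """Yield one dict per data row from an iterable of tab-separated lines."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines and a header row, if the file has one.
        if not row or row[0] == "id":
            continue
        # Pad short rows so every column name has a value.
        row = row + [""] * (len(COLUMNS) - len(row))
        yield dict(zip(COLUMNS, row))


def papers_per_country(rows):
    """Sum paper_count per country, skipping rows without a usable value."""
    counts = Counter()
    for row in rows:
        country = row["country"].strip()
        if not country:
            continue
        try:
            counts[country] += int(row["paper_count"])
        except ValueError:
            continue  # paper_count was not an integer
    return counts
```

To use it on the real file, open annotated_orgs.tsv with encoding="utf-8" and newline="", then pass the file object to read_orgs. The resulting totals are a rough tally per organization name, not a deduplicated paper count.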


This dataset is made available under the CC BY-NC 4.0 license.
