Add CITATION.cff to nltk #2880

Merged 7 commits into nltk:develop on Nov 26, 2021


@tomaarsen tomaarsen commented Nov 8, 2021

Closes #2874


Pull request overview

  • Add CITATION.cff to nltk.

CITATION.cff file

The schema guide for CITATION.cff files can be found here. The CITATION.cff that I've written consists of two parts:

  • The "NLTK" part, which describes NLTK as software.
  • The "NLTK book" part, which describes the 2009 book.

Most notably, the book part is listed under preferred-citation. When the "Cite this repository" button is clicked, GitHub uses this part to generate the actual citation.
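As a rough illustration of that behaviour, the sketch below picks the metadata block a citation would be built from. The function name `select_citation_source` and the plain-dict representation of the file are assumptions made for this example; this is not GitHub's actual implementation.

```python
def select_citation_source(cff: dict) -> dict:
    """Return the metadata block a citation should be built from.

    Assumption: a citation tool prefers `preferred-citation` when it is
    present and otherwise falls back to the top-level (software) metadata.
    """
    return cff.get("preferred-citation", cff)


# A trimmed-down stand-in for the parsed CITATION.cff from this PR.
cff = {
    "title": "Natural Language Toolkit",
    "type": "software",
    "preferred-citation": {
        "title": (
            "Natural language processing with Python: analyzing "
            "text with the natural language toolkit"
        ),
        "type": "book",
        "year": 2009,
    },
}

source = select_citation_source(cff)
print(source["type"])  # the book entry is selected, not the software entry
```

Without a `preferred-citation` key, the same function returns the top-level software metadata unchanged.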

The "Cite this repository" button

[Screenshot: the "Cite this repository" button]

[Screenshot: the citation panel shown after clicking the button]

It can be seen on the feature/citation branch of my fork.

The NLTK software part

I'm open to discussion about this: citations are very important, and I would like us to agree on the CITATION.cff. The next snippet is the "software" section of the file, which describes NLTK as software but is not used in the generated BibTeX or APA citation.

cff-version: 1.2.0
title: >-
  Natural Language Toolkit
message: >-
  Please cite this software using the metadata from
type: software
authors:
  - name: "NLTK Team"
    email: ""
repository-code: ""
url: ""
license: Apache-2.0
keywords:
  - "NLP"
  - "CL"
  - "natural language processing"
  - "computational linguistics"
  - "parsing"
  - "tagging"
  - "tokenizing"
  - "syntax"
  - "linguistics"
  - "language"
  - "natural language"
  - "text analytics"

  • title: I've opted for Natural Language Toolkit. nltk or NLTK are alternatives.
  • message: This is one of the default options, which lets users know to use the citation for the book instead.
  • authors: Here I've used the information found in the file, rather than naming specific users. That said, we may want to list individual users followed by "NLTK Team" or "NLTK Contributors", as suggested in the third code block here.
    For example:
      - given-names: Steven
        family-names: Bird
      - given-names: Liling
        family-names: Tan
      - name: "NLTK Team"
        email: ""
  • license: I've listed Apache-2.0, as the code is licensed under it. That said, other licenses apply to, e.g., the documentation. We can also supply a list of licenses here.

The remainder of this section speaks for itself.
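For readers who want to sanity-check such a file, here is a minimal sketch in Python. It assumes the file has already been parsed into a dict, and the helper names `missing_required_keys` and `author_display_name` are hypothetical. The required keys listed are the four top-level keys that CFF 1.2.0 requires; the author helper handles both entity authors (`name`) and person authors (`given-names` / `family-names`), which is the mix suggested in the authors bullet above.

```python
# Top-level keys that a CFF 1.2.0 file must contain.
REQUIRED_KEYS = ("cff-version", "message", "title", "authors")


def missing_required_keys(cff: dict) -> list:
    """Return the required top-level keys that are absent from `cff`."""
    return [key for key in REQUIRED_KEYS if key not in cff]


def author_display_name(author: dict) -> str:
    """Format one author entry, whether it is an entity or a person."""
    if "name" in author:  # entity, e.g. a team
        return author["name"]
    parts = [author.get("given-names", ""), author.get("family-names", "")]
    return " ".join(part for part in parts if part)


# Stand-in for the parsed software section, using the mixed author list.
software = {
    "cff-version": "1.2.0",
    "title": "Natural Language Toolkit",
    "message": "Please cite this software using the metadata from",
    "authors": [
        {"given-names": "Steven", "family-names": "Bird"},
        {"given-names": "Liling", "family-names": "Tan"},
        {"name": "NLTK Team"},
    ],
}

print(missing_required_keys(software))
print([author_display_name(a) for a in software["authors"]])
```

This only checks for the presence of keys; full validation against the schema would need a dedicated validator.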

The NLTK book part

This section covers the part of the CITATION.cff that is actually used to generate the citation. The snippet is as follows:

preferred-citation:
  title: >-
    Natural language processing with Python: analyzing
    text with the natural language toolkit
  type: book
  authors:
    - given-names: Steven
      family-names: Bird
    - given-names: Ewan
      family-names: Klein
    - given-names: Edward
      family-names: Loper
  year: 2009
  publisher:
    name: "O'Reilly Media, Inc."

  • title: The full title of the book
  • type: A book, as opposed to software.
  • authors: Listed in the same order as on the book itself and in the citation provided by Google Scholar. I've provided the ORCID iDs of Ewan Klein and Steven Bird; these can of course be removed. More information about each author can be added here, but I kept it fairly minimal. One more thing to note: the order of authors in this PR, on Google Scholar, and on the book itself differs from the order of authors in the current README.
    This is definitely something to look into!
  • year: Speaks for itself. I decided to not include the month or date, as they don't seem to be included in the BibTeX that Google Scholar provides.
  • publisher: "O'Reilly Media, Inc." differs slightly from the citation that Google Scholar provides, which is " O'Reilly Media, Inc." (with an extra space before). This is also something to consider, as it affects the generated APA or BibTeX citation.

Generated citations:

APA

Generated by Google Scholar:

Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python: analyzing text with the natural language toolkit. " O'Reilly Media, Inc.".

Generated by GitHub according to this CITATION.cff:

Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc.

These are nearly identical - the difference is that the Google Scholar citation includes an additional space for the publisher, alongside extra quotes.
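To make the effect of the publisher string concrete, here is a small sketch that builds an APA-style book citation from the metadata. The functions `apa_author` and `apa_book` are hypothetical helpers written for this illustration and cover only this simple book case; they are not the code GitHub or Google Scholar use.

```python
def apa_author(author: dict) -> str:
    """Format one person as 'Family, G.' (initials from the given names)."""
    initials = " ".join(f"{name[0]}." for name in author["given-names"].split())
    return f"{author['family-names']}, {initials}"


def apa_book(ref: dict) -> str:
    """Build a simplified APA-style citation for a book entry."""
    names = [apa_author(a) for a in ref["authors"]]
    if len(names) > 1:
        author_part = ", ".join(names[:-1]) + ", & " + names[-1]
    else:
        author_part = names[0]
    text = f"{author_part} ({ref['year']}). {ref['title']}. {ref['publisher']['name']}"
    # Close with a period unless the publisher name already ends with one.
    return text if text.endswith(".") else text + "."


book = {
    "title": (
        "Natural language processing with Python: analyzing "
        "text with the natural language toolkit"
    ),
    "authors": [
        {"given-names": "Steven", "family-names": "Bird"},
        {"given-names": "Ewan", "family-names": "Klein"},
        {"given-names": "Edward", "family-names": "Loper"},
    ],
    "year": 2009,
    "publisher": {"name": "O'Reilly Media, Inc."},
}

# Same entry, but with the publisher string as Google Scholar stores it.
scholar_like = dict(book, publisher={"name": "\" O'Reilly Media, Inc.\""})

print(apa_book(book))
print(apa_book(scholar_like))
```

The two printed lines differ only in the publisher portion, which mirrors the difference between the two citations above.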


BibTeX

Generated by Google Scholar:

  title={Natural language processing with Python: analyzing text with the natural language toolkit},
  author={Bird, Steven and Klein, Ewan and Loper, Edward},
  publisher={" O'Reilly Media, Inc."}

Generated by GitHub according to this CITATION.cff:

author = {Bird, Steven and Klein, Ewan and Loper, Edward},
publisher = {O'Reilly Media, Inc.},
title = {{Natural language processing with Python: analyzing text with the natural language toolkit}},
year = {2009}

Again, these are nearly identical, with the same difference in publisher. Beyond that, the field order differs, which should not matter. The label used to reference the entry also differs, but that's not a big deal either.
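The claim that field order does not matter can be checked by comparing the entries as key/value mappings. The sketch below is a simplified, hypothetical parser (`parse_fields`) that only understands one `key = {value}` field per line, which is enough for the field lines shown above; it is not a general BibTeX parser.

```python
import re

# Matches a single `key = {value}` field line, with an optional trailing comma.
FIELD = re.compile(r"(\w+)\s*=\s*\{(.*)\}\s*,?\s*$")


def parse_fields(lines):
    """Parse simple one-line BibTeX fields into a dict (order-insensitive)."""
    fields = {}
    for line in lines:
        match = FIELD.match(line.strip())
        if not match:
            continue
        key, value = match.groups()
        # Drop one extra layer of braces, e.g. title = {{...}}.
        if value.startswith("{") and value.endswith("}"):
            value = value[1:-1]
        fields[key.lower()] = value
    return fields


title = (
    "Natural language processing with Python: analyzing text "
    "with the natural language toolkit"
)

# Field lines copied from the two snippets as shown above.
scholar = parse_fields([
    "title={" + title + "},",
    "author={Bird, Steven and Klein, Ewan and Loper, Edward},",
    'publisher={" O\'Reilly Media, Inc."}',
])
github = parse_fields([
    "author = {Bird, Steven and Klein, Ewan and Loper, Edward},",
    "publisher = {O'Reilly Media, Inc.},",
    "title = {{" + title + "}},",
    "year = {2009}",
])

# Fields present in both entries whose values differ.
differing = sorted(k for k in scholar.keys() & github.keys() if scholar[k] != github[k])
print(differing)
```

Only fields present in both snippets are compared, so the result reflects the publisher difference and ignores ordering and the extra braces around the title.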

For context, this is the citation that is currently suggested to be used according to the README:

Bird, Steven, Edward Loper and Ewan Klein (2009).
Natural Language Processing with Python.  O'Reilly Media Inc.

(Note the different order of authors and the shortened title.)


This entire PR assumes that we still wish for citations to cite the book, rather than the software.

  • Tom Aarsen

@iliakur iliakur commented Nov 9, 2021

After skimming, an idea for software title: Natural Language ToolKit (NLTK). This is commonly how acronyms are introduced in prose.

@tomaarsen tomaarsen commented Nov 9, 2021

That sounds like a good middle-ground.

@stevenbird stevenbird self-assigned this Nov 18, 2021
@stevenbird stevenbird merged commit e2e94ef into nltk:develop Nov 26, 2021
16 checks passed
@tomaarsen tomaarsen deleted the feature/citation branch Nov 28, 2021