Adding a Corpus
NLTK comes with a collection of corpora. All corpora are freely redistributable. They live in the gh-pages branch of the nltk_data repository.
Each corpus requires a corpus reader, plus an entry in the corpus package that allows the corpus to be imported (this entry associates an importable name with a corpus reader and a data source).
If there is not yet a suitable corpus reader, you will need to create one, and submit that as a pull request to the nltk repository.
We welcome contributions of new corpora to NLTK. However, each new corpus imposes a maintenance burden and carries a possible risk of copyright infringement, so before contributing we ask that you:
- Only contribute corpora that have obtained a basic level of notability (e.g. there is a publication that describes it, and a community of people who are using it)
- Ensure that you have permission to redistribute the data, and can document this (e.g. there is an external website where the data is already posted with a licence)
- Use existing NLTK corpus readers where possible, or else contribute a well-documented corpus reader to NLTK
To add a corpus to NLTK, please follow these steps:
- Test that you can access the corpus using NLTK:
  - put a copy in your local nltk_data directory. The default system location on Windows is C:\nltk_data\corpora; on Mac and Unix it is /usr/share/nltk_data/corpora.
  - modify NLTK to support the corpus (see the corpus reader step below)
  - open a Python interpreter and check that you can access the corpus, e.g.
    from nltk.corpus import corpus_name
- Post a new entry to the nltk_data issue tracker:
  - https://github.com/nltk/nltk_data/issues/new
  - include the corpus name in the title
  - identify the source of the corpus (e.g. a URL)
  - NEW: provide clear rationale: why is this corpus sufficiently useful and notable to be included in NLTK
  - suggest an NLTK name for the corpus (a short string, use underscore to separate multiple words, cf existing names)
  - identify an existing corpus reader for the corpus, or else explain how you plan to create one
  - document the fact that the corpus is freely redistributable (e.g. available under a Creative Commons ShareAlike license; or invite its creator to add a comment via the issue tracker).
  - wait for approval from someone in the NLTK team.
- Prepare a pull request to the nltk_data repository:
  - fork the nltk_data repository: https://github.com/nltk/nltk_data/fork
  - clone the repository and check out the gh-pages branch
  - mkdir packages/corpora/corpus_name (using the corpus name agreed in the issue above)
  - include a README file (text file with no extension) at the top level, with the corpus name and source URL for the corpus at the top
  - make sure there are no extraneous files like .svn, .DS_Store or README~
  - zip -r corpus_name corpus_name (NB you can use the -x option to exclude unwanted files, e.g., zip -r corpus_name corpus_name -x "*.DS_Store")
  - create corpus_name.xml using the metadata template (see below); specify unzip="1" if the corpus reader requires the data to be unzipped after being installed (usually not necessary)
  - add the corpus name to /collections/all.xml
  - push this to your fork and submit a pull request, referencing the issue you opened in the nltk_data issue tracker
(In response to this pull request, an NLTK developer reviews the request, then runs make in the top level of the nltk_data repository (gh-pages branch) and pushes the changes.)
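The zip step above can also be scripted, which makes it harder to ship stray files by accident. The following is a minimal sketch using only the Python standard library. The function name build_corpus_zip and the EXCLUDE patterns are illustrative assumptions and are not part of the nltk_data tooling. It roughly mirrors running zip -r corpus_name corpus_name -x ... from inside packages/corpora, except that it stores files only and no directory entries.

```python
import fnmatch
import zipfile
from pathlib import Path

# Illustrative patterns for the extraneous files mentioned above (an assumption;
# extend them for your own editor or version-control system).
EXCLUDE = (".svn", ".DS_Store", "*~")


def build_corpus_zip(parent_dir, corpus_name):
    """Write <parent_dir>/<corpus_name>.zip from <parent_dir>/<corpus_name>/, skipping unwanted files."""
    parent = Path(parent_dir)
    corpus_dir = parent / corpus_name
    zip_path = parent / f"{corpus_name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(corpus_dir.rglob("*")):
            rel = path.relative_to(parent)
            # Skip anything whose path contains a component matching an exclude pattern.
            if any(fnmatch.fnmatch(part, pat) for part in rel.parts for pat in EXCLUDE):
                continue
            if path.is_file():
                # Archive names keep the corpus_name/ prefix, as zip -r does.
                zf.write(path, rel.as_posix())
    return zip_path
```

Calling build_corpus_zip("packages/corpora", "corpus_name") would then produce packages/corpora/corpus_name.zip; listing the archive contents before committing is still a sensible check.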
- If necessary, prepare a corpus reader:
  - fork the nltk repository: https://github.com/nltk/nltk/fork
  - clone the repository (make sure you are on the develop branch)
  - add the corpus reader in nltk/nltk/corpus/reader
  - add an entry in nltk/corpus/__init__.py and nltk/corpus/reader/__init__.py
  - push this to your fork and submit a pull request, referencing the issue you opened in the nltk_data issue tracker
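For orientation, an entry in nltk/corpus/__init__.py generally binds an importable name to a reader class and a file pattern through LazyCorpusLoader. The sketch below assumes a plain-text corpus whose data files end in .txt, so that the existing PlaintextCorpusReader is sufficient. The name corpus_name and the file pattern are placeholders; a real entry should follow the neighbouring entries in that file.

```python
from nltk.corpus.reader import PlaintextCorpusReader
from nltk.corpus.util import LazyCorpusLoader

# Placeholder name and pattern (assumptions for illustration): the first
# argument must match the directory or zip under nltk_data/corpora, and the
# regular expression selects the data files (here it leaves out the README).
corpus_name = LazyCorpusLoader("corpus_name", PlaintextCorpusReader, r".*\.txt")

# Nothing is read from disk until the corpus is first used, for example:
# corpus_name.fileids()
```

Because loading is deferred, defining the entry succeeds even when the data is not installed; a missing corpus only surfaces as a LookupError on first access, which is why the interpreter check in the first step is worth running.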
Metadata template for corpus_name.xml:
<package id="<corpus_name>" name="<Corpus Name>"
copyright="Copyright (C) <YEAR> <NAME>"
author="<NAME>"
license="<LICENSE NAME OR URL>"
webpage="<URL FOR CORPUS>"
unzip="0"
/>
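If you prefer to generate the metadata file rather than edit the template by hand, the following is a minimal sketch using xml.etree.ElementTree from the Python standard library. The function name package_xml and all of the example values are hypothetical; only the attribute names come from the template above. A generated file has the advantage that special characters in the values are escaped automatically.

```python
import xml.etree.ElementTree as ET


def package_xml(corpus_id, name, copyright_notice, author, license_name, webpage, unzip=False):
    """Return a <package .../> element as a string, using the template's attribute names."""
    package = ET.Element(
        "package",
        {
            "id": corpus_id,
            "name": name,
            "copyright": copyright_notice,
            "author": author,
            "license": license_name,
            "webpage": webpage,
            # "1" only if the corpus reader needs the data unzipped after installation.
            "unzip": "1" if unzip else "0",
        },
    )
    return ET.tostring(package, encoding="unicode")


# Made-up values for illustration only.
print(
    package_xml(
        corpus_id="corpus_name",
        name="Corpus Name",
        copyright_notice="Copyright (C) 2024 Example Author",
        author="Example Author",
        license_name="Creative Commons ShareAlike",
        webpage="https://example.org/corpus",
    )
)
```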