Novels Project Corpus
This repository contains all data relevant to the Novels Project. It is under version control so all changes are recorded.
- Each directory in volumes/ corresponds to a volume. The directory name is unimportant; the file metadata.json in each directory describes the relationship between the volume and a work.
- The canonical identifiers for bibliographic records ("novels project identifiers") of works (not volumes) are available in a separate repository: novels-project/identifiers.
Work metadata, volume metadata, and plaintext are exposed via a simple
read-only REST server. This server can be run with the following command
(Python 3.4 and
The API has two endpoints:
/work (metadata for all works)
For example, the novel Glenarvon has identifier
1235 and metadata concerning it and a list of associated volumes may be
As the plaintext of an edition of this novel is available, a list of associated
texts and their SHA-1 hashes is given in the response. The plain text version
of the third volume has hash
40d2491e07dd2f1c71413b65bc551804cb93b0f3 and may
be retrieved with:
curl -s http://127.0.0.1:8080/text/40d2491e07dd2f1c71413b65bc551804cb93b0f3
That's all there is!
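The same retrieval can be sketched in Python, with a check that the returned text matches the hash used to request it. This is a minimal sketch, assuming the server is running locally on port 8080 as in the curl example and that the /text endpoint returns exactly the bytes that were hashed; the function names are hypothetical and not part of the repository.

```python
import hashlib
import urllib.request


def sha1_hex(data: bytes) -> str:
    """Return the hex-encoded SHA-1 digest of raw bytes."""
    return hashlib.sha1(data).hexdigest()


def fetch_text(sha1: str, base_url: str = "http://127.0.0.1:8080") -> bytes:
    """Fetch a plaintext by its hash from a locally running server.

    Assumption: the response body is the raw text, not a JSON wrapper.
    """
    url = "{}/text/{}".format(base_url, sha1)
    with urllib.request.urlopen(url) as response:
        return response.read()


def matches_hash(data: bytes, expected_sha1: str) -> bool:
    """Compare the digest of the data with the expected hex digest."""
    return sha1_hex(data) == expected_sha1.lower()
```

With the server running, `matches_hash(fetch_text(h), h)` would be expected to hold for any hash `h` listed in a work's response, provided the assumption about the response body is correct.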
The vast majority of records in
works.csv are from the two volumes edited by
Garside, Raven, and Schöwerling. There are a handful of additions as well as
some placeholders where data has only been partially entered.
The following fields shall uniquely identify a record:
Each directory shall contain two files:
- a text file with the extension
The SHA1 of the text file is stored in
Metadata is associated with each volume that links it to a work.
metadata.json contains a JSON-encoded dictionary which shall conform to the
- work_id: the work with which this volume is associated
- internet_archive_id: Internet Archive id (e.g.,
- volume_count: must be greater than or equal to
- %Y-%m-%d of metadata creation
- %Y-%m-%d of last metadata update
- sha1: hex-encoded SHA1 of the text file in the directory
- extra_info: (optionally empty) dictionary of non-essential information
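Since the sha1 field records the hash of the text file in the same directory, a volume directory can be checked for consistency. The following is a rough sketch, not the project's own tooling: the helper names are hypothetical, and the text file name is passed in explicitly because the expected extension is not assumed here.

```python
import hashlib
import json
from pathlib import Path


def file_sha1(path: Path) -> str:
    """Return the hex-encoded SHA-1 of a file, read in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sha1_matches_metadata(volume_dir: Path, text_filename: str) -> bool:
    """Check the text file against the sha1 field of metadata.json."""
    with (volume_dir / "metadata.json").open(encoding="utf-8") as handle:
        metadata = json.load(handle)
    return file_sha1(volume_dir / text_filename) == metadata["sha1"].lower()
```

A mismatch would suggest that the text file was edited without updating metadata.json, or the reverse.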
The following fields uniquely identify a record:
The text used is the best available facsimile of the original edition listed in
works.csv. The text may be derived from a subsequent edition
(e.g., the second edition) if the subsequent edition is the only one available
or if the scan is of considerably higher quality.
A plaintext version of each volume is available. This text is typically the plaintext derived from Optical Character Recognition (OCR), trimmed of OCR'd library stamps and other extraneous material that occurs at the beginning and end of the scan. Because this trimming process is not easily reproducible, a future version will provide patches that specify how to produce the version contained in the repository from the original OCR.
In other cases, particularly when the scan is of very low quality, the
plaintext version may be derived from another source, including manual entry.
All these cases are recorded in
The nonfree directory contains volumes and associated plaintexts that are not
available on the Internet Archive. The
metadata.json conforms to a schema
identical to the one described above but
If the original source of a plaintext is HTML, a plaintext version is generated using html2text version 1.3.2a.
- Validate all metadata.json files using JSON Schema.
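
Until schema validation is in place, a hand-rolled check could catch the most obvious problems. The sketch below is a plain-Python stand-in, not JSON Schema. It assumes that the five named fields are all required, omits the two date fields because their names are not assumed here, and uses hypothetical function names.

```python
import json
from pathlib import Path

# Assumption: every metadata.json must carry all of these keys.
REQUIRED_KEYS = {"work_id", "internet_archive_id", "volume_count", "sha1", "extra_info"}
HEX_DIGITS = set("0123456789abcdefABCDEF")


def metadata_problems(metadata: dict) -> list:
    """Return a list of human-readable problems found in one metadata dict."""
    missing = sorted(REQUIRED_KEYS - set(metadata))
    problems = ["missing key: " + key for key in missing]
    sha1 = metadata.get("sha1")
    if sha1 is not None:
        is_hex = isinstance(sha1, str) and len(sha1) == 40 and set(sha1) <= HEX_DIGITS
        if not is_hex:
            problems.append("sha1 is not a 40-character hex string")
    if "extra_info" in metadata and not isinstance(metadata["extra_info"], dict):
        problems.append("extra_info is not a dictionary")
    return problems


def check_all(volumes_root: Path) -> dict:
    """Map each problematic metadata.json path to its list of problems."""
    report = {}
    for path in sorted(volumes_root.glob("*/metadata.json")):
        with path.open(encoding="utf-8") as handle:
            problems = metadata_problems(json.load(handle))
        if problems:
            report[str(path)] = problems
    return report
```

Calling `check_all(Path("volumes"))` from the repository root would return an empty dictionary when every file passes these checks.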