Note: If you are looking for a full-fledged Gutenberg catalog library in Python, check out gutenberg or gutenbergpy.
Get the pickled native Python Gutenberg catalog here (93MB).
The catalog is a list of dictionaries, with one dictionary for each source in the Gutenberg catalog. Example:
{
'id': '24767',
'title': ["Jack O' Judgment"],
'subject': ['PR', 'Detective and mystery stories, English'],
'type': ['Text'],
'language': ['en'],
'author': ['Wallace, Richard Horatio Edgar', 'Wallace, Edgar'],
'author_birth': ['1875'],
'author_death': ['1932'],
'bookshelf': [],
'format': ['application/epub+zip', ...],
'publisher': ['Project Gutenberg'],
'rights': ['Public domain in the USA.'],
'date_issued': ['2008-03-06'],
'num_downloads': ['59'],
}
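Because the catalog is a plain list of dictionaries, it can also be queried directly with the standard library. The following is a minimal sketch, assuming the file unpickles to the structure shown above. The helper names `load_catalog`, `ids_where`, and `download_count` are hypothetical and not part of this repository, and the small in-memory sample (including its placeholder second entry) stands in for the real 93MB file.

```python
import pickle
import tempfile
from pathlib import Path


def load_catalog(path):
    """Load a pickled catalog (assumed: a list of dicts, one per book)."""
    # Only unpickle files from a source you trust: pickle can run code on load.
    with open(path, "rb") as f:
        return pickle.load(f)


def ids_where(catalog, field, value):
    """Return the ids of entries whose list-valued `field` contains `value`."""
    return [entry["id"] for entry in catalog if value in entry.get(field, [])]


def download_count(entry):
    """Numeric fields are stored as lists of strings, so convert explicitly."""
    values = entry.get("num_downloads", [])
    return int(values[0]) if values else 0


# Tiny stand-in catalog so the sketch runs without downloading the real file.
sample = [
    {"id": "24767", "title": ["Jack O' Judgment"],
     "author": ["Wallace, Richard Horatio Edgar", "Wallace, Edgar"],
     "language": ["en"], "num_downloads": ["59"]},
    # Placeholder entry for illustration only; not a real catalog record.
    {"id": "0", "title": ["Placeholder Title"],
     "author": ["Placeholder, Author"],
     "language": ["en"], "num_downloads": []},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample_catalog.pkl"
    path.write_bytes(pickle.dumps(sample))
    catalog = load_catalog(path)

wallace_ids = ids_where(catalog, "author", "Wallace, Edgar")
print(wallace_ids)
print(download_count(catalog[0]))
```

To try this against the real catalog, replace the temporary sample file with the path to the downloaded pickle. Note that every field except `id` is a list, so membership tests (`value in entry[field]`) are the natural way to match.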
This repository also contains some basic functionality to interface with the Gutenberg catalog:
GutenbergCatalog: a class that allows searching for book ids by author, etc.
get_text and strip_headers: functions to download book texts and clean them. (These are copied from the gutenberg library.)
catalog = GutenbergCatalog('gutenberg.pkl')
# search by author
book_ids = catalog.filter_by('author', 'Russell, Bertrand')
# get info about book
book_id = 2529
book_metadata = catalog.get_metadata(book_id)
# download and clean text for book
text = strip_headers(get_text(book_id))
The catalog file is generated by the parse_gutenberg_catalog.py script. You can modify this script to customize the parsing step and generate your own catalog file.