Preprocessed Gutenberg catalog and simple Gutenberg utilities


gokererdogan/gutenberg

Preprocessed Gutenberg Catalog and Simple Gutenberg Utilities

Note: If you are looking for a full-fledged Gutenberg catalog library in Python, check out gutenberg or gutenbergpy.

Get the pickled Gutenberg catalog, stored as native Python objects, here (93 MB).

The catalog is a list of dictionaries, with one dictionary for each source in the Gutenberg catalog. In the example below, id is a string and every other field is a list of strings. Example:

{
  'id': '24767',
  'title': ["Jack O' Judgment"], 
  'subject': ['PR', 'Detective and mystery stories, English'],
  'type': ['Text'],
  'language': ['en'],
  'author': ['Wallace, Richard Horatio Edgar', 'Wallace, Edgar'],
  'author_birth': ['1875'],
  'author_death': ['1932'],
  'bookshelf': [],
  'format': ['application/epub+zip', ...],
  'publisher': ['Project Gutenberg'],
  'rights': ['Public domain in the USA.'],
  'date_issued': ['2008-03-06'],
  'num_downloads': ['59'],
}
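Since the catalog is a plain pickled list of dictionaries, it can also be read without any helper class. The following is a minimal sketch, not code from this repository: load_catalog and first are hypothetical helper names, and a small stand-in file replaces the real download so the example runs on its own.

```python
import os
import pickle
import tempfile


def load_catalog(path):
    """Load a pickled catalog (a list of dicts) from disk."""
    # Only unpickle files you trust: unpickling can execute arbitrary code.
    with open(path, "rb") as f:
        return pickle.load(f)


def first(entry, field, default=None):
    """Return the first value of a list-valued field, or default if it is empty or missing."""
    values = entry.get(field) or []
    return values[0] if values else default


# Stand-in for the real catalog file, using fields from the example entry above.
sample = [{
    "id": "24767",
    "title": ["Jack O' Judgment"],
    "language": ["en"],
    "author": ["Wallace, Richard Horatio Edgar", "Wallace, Edgar"],
    "bookshelf": [],
    "num_downloads": ["59"],
}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_catalog.pkl")
    with open(path, "wb") as f:
        pickle.dump(sample, f)
    catalog = load_catalog(path)

entry = catalog[0]
# Values are stored as strings, so numeric fields need an explicit conversion.
print(entry["id"], first(entry, "title"), int(first(entry, "num_downloads", "0")))
```

With the real file, the same load_catalog call would take the path of the downloaded pickle in place of the temporary one.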

This repository also contains some basic functionality to interface with the Gutenberg catalog. These are:

  • GutenbergCatalog class that allows searching for book ids by author or other metadata fields.
  • get_text and strip_headers functions to download book texts and clean them. (These are copied from the gutenberg library.)

Example usage:

# Assumes GutenbergCatalog, get_text, and strip_headers are already imported from this repository.
# load the pickled catalog file
catalog = GutenbergCatalog('gutenberg.pkl')

# search by author; the value matches the 'Last, First' form used in the catalog
book_ids = catalog.filter_by('author', 'Russell, Bertrand')

# get info about book
book_id = 2529
book_metadata = catalog.get_metadata(book_id)

# download and clean text for book
text = strip_headers(get_text(book_id))
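To make the search step concrete, here is a rough sketch of how filtering and metadata lookup can work over the list-of-dictionaries format. It is not the GutenbergCatalog implementation: filter_ids and build_index are hypothetical names, exact string matching is an assumption, and the second sample entry is a made-up placeholder.

```python
def filter_ids(catalog, field, value):
    """Return the ids of entries whose list-valued field contains value exactly."""
    return [entry["id"] for entry in catalog if value in entry.get(field, [])]


def build_index(catalog):
    """Map each id (a string in the catalog) to its entry for direct lookup."""
    return {entry["id"]: entry for entry in catalog}


sample = [
    {
        "id": "24767",
        "title": ["Jack O' Judgment"],
        "language": ["en"],
        "author": ["Wallace, Richard Horatio Edgar", "Wallace, Edgar"],
    },
    # Placeholder entry for illustration only; not a real catalog record.
    {
        "id": "1",
        "title": ["Placeholder Title"],
        "language": ["en"],
        "author": ["Doe, Jane"],
    },
]

ids_by_author = filter_ids(sample, "author", "Wallace, Edgar")
index = build_index(sample)

# Ids are stored as strings in the example entry, so convert integer ids first.
metadata = index[str(24767)]
print(ids_by_author, metadata["title"])
```

A dictionary index trades a one-time pass over the catalog for constant-time lookups afterwards, which is worthwhile if many ids are looked up.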

The catalog file is generated by the parse_gutenberg_catalog.py script. You can modify this script to customize the parsing step and generate your own catalog file.
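If only a lighter catalog is needed, an alternative to editing the parsing script is to post-process the loaded catalog and save a new pickle. The sketch below takes that route under stated assumptions: slim_entry, save_catalog, and the chosen field list are hypothetical, and it does not reproduce anything from parse_gutenberg_catalog.py.

```python
import os
import pickle
import tempfile

# Hypothetical selection of list-valued fields to keep.
LIST_FIELDS = ("title", "author", "language")


def slim_entry(entry):
    """Keep a subset of fields and convert num_downloads from a string list to an int."""
    slim = {"id": entry["id"]}
    for field in LIST_FIELDS:
        slim[field] = list(entry.get(field, []))
    downloads = entry.get("num_downloads") or ["0"]
    slim["num_downloads"] = int(downloads[0])
    return slim


def save_catalog(entries, path):
    """Write a list of entries to path as a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(entries, f, protocol=pickle.HIGHEST_PROTOCOL)


original = [{
    "id": "24767",
    "title": ["Jack O' Judgment"],
    "language": ["en"],
    "author": ["Wallace, Richard Horatio Edgar", "Wallace, Edgar"],
    "publisher": ["Project Gutenberg"],
    "num_downloads": ["59"],
}]

custom = [slim_entry(entry) for entry in original]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "custom_catalog.pkl")
    save_catalog(custom, path)
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

print(reloaded[0])
```

Note that a catalog changed this way no longer matches the original format, so code expecting list-valued num_downloads would need the same adjustment.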
