Download texts from Project Gutenberg, then extract and format them.
The code originally came from the Open Shakespeare project codebase: https://github.com/okfn/shakespeare
Download the gutenberg.py script (or clone the entire repo).
Use the script as follows:
./gutenberg.py {url-to-raw-gutenberg-text}
The cleaned version of the text is then printed to standard output.
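To give a rough idea of what "cleaning" a Gutenberg text can involve, here is a minimal sketch in Python. It is not the code in gutenberg.py: the function name strip_boilerplate and the marker patterns are assumptions made for illustration, based on the "*** START OF ..." and "*** END OF ..." lines that many Project Gutenberg files carry. Real files vary, so the actual script may apply different or additional rules.

```python
import re

# Assumed marker format (hypothetical patterns, not taken from gutenberg.py).
# Many Project Gutenberg files wrap the body in lines starting with
# "*** START OF" and "*** END OF", but the exact wording varies.
START_RE = re.compile(r"^\*\*\*\s*START OF.*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\*\s*END OF.*$", re.MULTILINE)


def strip_boilerplate(raw):
    """Return the text between the START and END marker lines.

    If a marker is missing, fall back to the start or end of the input.
    """
    start = START_RE.search(raw)
    body_start = start.end() if start else 0
    end = END_RE.search(raw, body_start)
    body_end = end.start() if end else len(raw)
    return raw[body_start:body_end].strip()


sample = (
    "Project Gutenberg header\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK HAMLET ***\n"
    "To be, or not to be.\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK HAMLET ***\n"
    "License text\n"
)
print(strip_boilerplate(sample))
```

The sketch only removes the header and footer; any further formatting is left out.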
Running the tests:
nosetests tests/test_gutenberg.py
Note that we have test data in tests/data.
Copyright 2005-2012 Open Knowledge Foundation. All material licensed under the MIT license: