Data Acquisition

Fundamentally, all bookworm needs is a .txt file. If you can assemble your own from material that you already own, go ahead!

  • Project Gutenberg is a brilliant source of freely available, out-of-copyright texts, leaving plenty of room to explore historical literature. Click on a book and download the Plain Text UTF-8 copy.
  • The British Library / Microsoft OCR project sought to digitise a significant portion of the Library's historic texts using Optical Character Recognition (OCR) back in 2007. Computer vision was still pretty nascent at that point and the project also stopped short of its intended volume, but there's room for a lot of interesting work to be done there.
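
Plain-text downloads from Project Gutenberg usually wrap the book itself in a licence header and footer, marked by lines beginning `*** START OF THE PROJECT GUTENBERG EBOOK` and `*** END OF THE PROJECT GUTENBERG EBOOK` (older files say `THIS` instead of `THE`). If you would rather not have that text analysed along with the book, something like the following minimal sketch could trim it first. The function name `strip_gutenberg_boilerplate` is hypothetical and not part of bookworm, and whether bookworm needs this step at all is an assumption.

```python
import re

# Marker lines that usually surround the book in a Gutenberg plain-text file.
# Assumption: the file uses the "THE" or the older "THIS" wording.
_START = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
_END = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return the text between the START and END marker lines.

    If either marker is missing, or they appear in the wrong order,
    the input is returned with only surrounding whitespace removed.
    """
    start = _START.search(raw)
    end = _END.search(raw)
    if start is None or end is None or end.start() < start.end():
        return raw.strip()
    return raw[start.end():end.start()].strip()
```

To use it, read the downloaded file with `open(path, encoding="utf-8")`, pass the contents through the function, and write the result to a new .txt file to hand to bookworm. Files without the markers, such as your own material, come back with only leading and trailing whitespace removed.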