docs2csv

Scan a folder of document files of all types and extract the text into a CSV suitable for import into Overview. Currently supports TXT, PDF, JPG, HTML, MHTML, RTF, and Microsoft Word, PowerPoint and Excel.

PDFs are OCRed if -o is set and they contain no text, or always if -f is set. JPGs are OCRed if -o is set.
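The OCR rules can be summarized as a small decision function. This is only a sketch of the behaviour described here, not the code in docs2csv.rb; the helper name `needs_ocr?` and its parameters are hypothetical.

```ruby
# Hypothetical helper (not taken from docs2csv.rb): decide whether a file
# should be OCRed, following the rules stated in this README.
#   ext      - lowercase file extension, e.g. ".pdf" or ".jpg"
#   has_text - true if plain text extraction already found text
#   ocr      - true if -o / --ocr was given
#   force    - true if -f / --force-ocr was given
def needs_ocr?(ext, has_text: false, ocr: false, force: false)
  case ext
  when ".pdf"
    # PDFs: always with -f, otherwise only with -o and no extractable text
    force || (ocr && !has_text)
  when ".jpg", ".jpeg"
    # JPGs: only with -o; -f applies to PDFs only
    ocr
  else
    false
  end
end

needs_ocr?(".pdf", has_text: true, ocr: true)    # => false
needs_ocr?(".pdf", has_text: true, force: true)  # => true
needs_ocr?(".jpg", ocr: true)                    # => true
```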

First, you will need to install:

  • Poppler, for pdfimages (and pdftotext on some systems). On Linux, use aptitude, apt-get or yum:

    aptitude install poppler-utils poppler-data

    On the Mac, you can install from source or use MacPorts or Homebrew:

    sudo port install poppler     # MacPorts
    brew install poppler          # Homebrew

  • Tesseract, for OCR

    aptitude install tesseract-ocr     # Linux (Debian/Ubuntu)
    sudo port install tesseract        # MacPorts
    brew install tesseract             # Homebrew

    Without Tesseract installed, you'll still be able to extract text from documents, but you won't be able to automatically OCR them.
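To confirm that the external tools are reachable before a long run, a quick check along these lines may help. It is a minimal sketch that assumes the tools are found through PATH; the helper name `tool_available?` is hypothetical and not part of docs2csv.

```ruby
# Hypothetical helper: return true if an executable called `name` is found
# in one of the directories listed in `path` (defaults to the PATH variable).
def tool_available?(name, path = ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).reject(&:empty?).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end

# Report on the tools this README mentions.
%w[pdfimages pdftotext tesseract].each do |tool|
  status = tool_available?(tool) ? "found" : "missing"
  puts "#{tool}: #{status}"
end
```

If tesseract is reported as missing, text extraction should still work, but the OCR options will not.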

Typical usage

ruby docs2csv.rb -r -o directory-to-scan [outputfile]

If outputfile is omitted, docs2csv will write the CSV to stdout.

The typical usage command scans the directory recursively and OCRs any PDFs and JPGs that need it. Available options:

-l, --list                       Only list files, do not process
-r, --recurse                    Scan directory recursively
-o, --ocr                        OCR jpgs and pdfs that do not contain text
-f, --force-ocr                  Force OCR on all pdfs
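For readers who want to script around these flags, here is a sketch of how such options could be declared with Ruby's standard OptionParser. It illustrates the flag set listed above and is not copied from docs2csv.rb; the function name `parse_args` and the option keys are assumptions.

```ruby
require "optparse"

# Hypothetical sketch: parse docs2csv-style flags from an argument array.
# Returns [options, positional_args].
def parse_args(argv)
  options = { list: false, recurse: false, ocr: false, force_ocr: false }
  parser = OptionParser.new do |opts|
    opts.banner = "Usage: ruby docs2csv.rb [options] directory-to-scan [outputfile]"
    opts.on("-l", "--list", "Only list files, do not process") { options[:list] = true }
    opts.on("-r", "--recurse", "Scan directory recursively") { options[:recurse] = true }
    opts.on("-o", "--ocr", "OCR jpgs and pdfs that do not contain text") { options[:ocr] = true }
    opts.on("-f", "--force-ocr", "Force OCR on all pdfs") { options[:force_ocr] = true }
  end
  # parse (without the bang) leaves argv untouched and returns the
  # remaining non-option arguments.
  rest = parser.parse(argv)
  [options, rest]
end

options, args = parse_args(%w[-r -o directory-to-scan out.csv])
# options[:recurse] and options[:ocr] are now true;
# args holds the directory and the output file name.
```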

Viewing the original files from Overview

The extracted text will be shown in the Overview document viewer, but not the original document pages. You can view the original files in your browser via Overview's "source file" links if you start a simple web server. With Python 2 the command is shown below; with Python 3, run python3 -m http.server instead. Both serve on port 8000 by default.

python -m SimpleHTTPServer

The "source file" links use the URL column that docs2csv writes, which has addresses of the form http://localhost:8000/[filename]. You need to run this server from the same directory where you originally ran docs2csv, because the file paths in these URLs are relative to that directory.
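To make the relationship between the working directory and the URLs concrete, the sketch below builds an address of that form from a file path. The helper name `source_url` is hypothetical, and the percent-encoding of path segments is an assumption; docs2csv may format its URL column differently.

```ruby
require "erb"
require "pathname"

# Hypothetical helper: build a "source file" URL for `file`, relative to
# `base_dir` (the directory the web server is started from).
# Assumption: each path segment is percent-encoded.
def source_url(file, base_dir, host: "http://localhost:8000")
  base = Pathname.new(base_dir).expand_path
  relative = Pathname.new(file).expand_path.relative_path_from(base)
  encoded = relative.each_filename.map { |part| ERB::Util.url_encode(part) }
  "#{host}/#{encoded.join('/')}"
end

puts source_url("/tmp/work/docs/report.pdf", "/tmp/work")
# prints http://localhost:8000/docs/report.pdf
```

A server started in a different directory would resolve the same URL to a different (or nonexistent) file, which is why the server must run where docs2csv was run.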