PHP
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
models/PDFCreate
LICENSE.txt
PDFCreatePlugin.php
README.md
plugin.ini

README.md

PDF Create

Omeka plugin that creates OCR'd PDFs from TIFFs. If you have multiple TIFFs for a single item, this provides any easy way to aggregate the TIFFs into a single file for easy viewing/downloading.

Generates OCR via Tesseract.

Stores OCR'd text via PdfText plugin's metadata element for site searching.

Aggregates multiple TIFFs for one item into single OCR'd PDF/a-1b PDF via Ghostscript. When the aggregated PDF is created, it can be found at http://example.com/path/to/your/files/directory/pdfs/ITEM_ID.pdf

Install

This plugin requires the PdfText plugin

The server-side software needed to peform the OCR extraction is Ghostscript and Tesseract. This is the exact versions of the required software verified to work with this plugin (running on Red Hat Enterprise Linux 7):

  • GPL Ghostscript 9.07 (2013-02-14)
  • Tesseract 3.04.01
    • leptonica 1.73
      • libjpeg 6b (libjpeg-turbo 1.2.90)
        • libpng 1.5.13
        • libtiff 4.0.3
        • zlib 1.2.7
  • Download the tessdata 3.04.00 tarball
    • mv all eng.* files to /usr/local/share/tessdata/
  • Download the file "pdf.ttf" found here to /usr/local/share/tessdata/
    • Without this updated pdf.ttf when two or more PDFs are aggregated into a single PDF via Ghostscript the resulting OCR will have spaces between every letter, essentially ruining the OCR. Essentially the tesseract and ghostscript fonts don't map perfectly, but this file fixes that.