Skip to content

Perform OCR on images within Nuxeo with Tesseract and hOCR

Notifications You must be signed in to change notification settings

nuxeo-sandbox/nuxeo-platform-hocr

Repository files navigation

Nuxeo Platform - OCR Tools

Build Status

OCR tools that produce a searchable PDF with extracted results.

Dependencies

Tesseract - OCR Engine

hocr-tools - hOCR Processing Tools

Installation

  1. Use your favorite package manager to install the latest 4+ version of tesseract.
  2. Use pip and python 2.7 to install the set of hocr-tools for Nuxeo:
pip install git+https://github.com/nuxeo-sandbox/hocr-tools/

Build

Build with maven (at least 3.3)

mvn clean install

Test with tesseract, hocr-tools installation, add to maven build: -Dnuxeo.test.hocr=true

Package built here: nuxeo-platform-hocr-package/target

Install with nuxeoctl mp-install <package>

Use

This package provides the javascript.AttachOCRPDFF automation script. The script works with JPEG, PNG, and GIF blobs. Extraction and combination of PDF documents can be done with a variety of platform operations.

Individual operations are also provided to generate new automation chains and processes within the platform. In general, processing should follow the example pattern:

  • Document.OCRPrepare - scale, binarize and prepare the image for processing.
  • Blob.OCRExtract - perform OCR on the input blob, produce hOCR.
  • Blob.OCRPostProcess - scale the hOCR back to the original image dimensions.
  • Document.OCRCombine - combine the scaled hOCR output with the input image. Produces a searchable PDF document.

Support

These features are sand-boxed and not yet part of the Nuxeo Production platform.

These solutions are provided for inspiration and we encourage customers to use them as code samples and learning resources.

This is a moving project (no API maintenance, no deprecation process, etc.) If any of these solutions are found to be useful for the Nuxeo Platform in general, they will be integrated directly into platform, not maintained here.

Licensing

Apache License, Version 2.0

About Nuxeo

Nuxeo, developer of the leading Content Services Platform, is reinventing enterprise content management (ECM) and digital asset management (DAM). Nuxeo is fundamentally changing how people work with data and content to realize new value from digital information. Its cloud-native platform has been deployed by large enterprises, mid-sized businesses and government agencies worldwide. Customers like Verizon, Electronic Arts, ABN Amro, and the Department of Defense have used Nuxeo's technology to transform the way they do business. Founded in 2008, the company is based in New York with offices across the United States, Europe, and Asia.

Learn more at www.nuxeo.com.

About

Perform OCR on images within Nuxeo with Tesseract and hOCR

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published