PDF Extraction Resources

A list of tools and resources for working with PDF files. Adapted from https://pdfliberation.wordpress.com/

Open source PDF technologies

  • Apache PDFBox - General purpose PDF library written in Java.

  • Tabula - Open source PDF table extraction tool written in Java and Ruby by Manuel Aristarán. Makes calls to PDFBox. Table extraction powered by http://github.com/jazzido/tabula-extractor.

  • PDF Extraction Toolkit - Java framework built on PDFBox by Tamir Hassan for performing document analysis of PDF files and creating custom conversion methods to HTML and other formats.

  • PDFExtract - Text extraction library that extends both PDFBox and Poppler. Written in Java by Øyvind Berg, the tool is no longer under active development but may contain code that can be reused by hackathon participants. Download Page: http://elacin.github.io/PDFExtract/.

  • PDF2SVG - Java tool developed by Peter Murray-Rust that converts PDFs to Scalable Vector Graphics (SVG) files that can be rendered by most modern browsers. PDF2SVG, which is based on PDFBox, is a component of the larger AMI suite of open source tools created for the purpose of liberating scientific documents. Another component, SVG2XML, converts the SVG files to HTML and is currently under heavy development.

  • Poppler (pdftotext, pdfinfo, pdfimages) - Command line tools to extract text, metadata, and bitmap images from PDF files. Written in C++ and forked from Xpdf.

  • Ashima PDF Table Extractor - Table extraction tool built in Python and based on Poppler.

  • Coolwanglu - PDF to HTML converter based on Poppler.

  • PDF2XML - Open source converter based on the Xpdf library, developed by Hervé Déjean.

  • Xpdf (pdftotext, pdfinfo, pdfimages) - Command line tools to extract text, metadata, and bitmap images from PDF files. Also includes a page rasterizer (pdftoppm).

  • MuPDF - General purpose, open source PDF toolkit written in C by Artifex, the developers of Ghostscript. The mudraw component has a basic text extraction utility.

  • PDFMiner - Open source PDF extraction library written in Python.

  • PDFTables - Table extraction tool based on PDFMiner and also written in Python.

  • Docsplit - A command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)

  • DocHive - Open source tool based on Tesseract and ImageMagick that extracts data from scanned PDFs.

  • Node PDF Extract - JavaScript library that reads PDFs with embedded text as well as scanned PDFs. Built on both Poppler and Tesseract.

  • Ocrad - "GNU Ocrad is an OCR (Optical Character Recognition) program based on a feature extraction method. It reads images in pbm (bitmap), pgm (greyscale) or ppm (color) formats and produces text in byte (8-bit) or UTF-8 formats."

  • GOCR - "GOCR is an OCR (Optical Character Recognition) program, developed under the GNU Public License. It converts scanned images of text back to text files."
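
Several of the tools above are command line programs, so they are easy to drive from a script. The following is a minimal sketch, not an official wrapper, of calling pdftotext from Python. It assumes pdftotext is installed and on the PATH; the helper names `pdftotext_command` and `extract_text` and the file name `report.pdf` are made up for illustration. The `-layout`, `-f` and `-l` options and the `-` output target are the only pdftotext features used.

```python
import shutil
import subprocess
from typing import List, Optional


def pdftotext_command(pdf_path: str,
                      first_page: Optional[int] = None,
                      last_page: Optional[int] = None,
                      layout: bool = True) -> List[str]:
    """Build the argument list for a pdftotext call (hypothetical helper)."""
    cmd = ["pdftotext"]
    if layout:
        # -layout tries to keep the physical layout of the page,
        # which tends to help when the page contains tables.
        cmd.append("-layout")
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    # "-" as the output file sends the extracted text to stdout.
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path: str, **options) -> str:
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    done = subprocess.run(pdftotext_command(pdf_path, **options),
                          check=True, capture_output=True)
    return done.stdout.decode("utf-8", errors="replace")
```

A call such as `extract_text("report.pdf", first_page=1, last_page=3)` would return the text of the first three pages. This only works for PDFs with embedded text; scanned PDFs need one of the OCR tools.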

OCR Technologies:

  • Tesseract - Open source OCR library. This tool does not work directly with PDFs, but a shell script or package can be used to convert a PDF to a TIFF, which can then be analyzed with Tesseract.

  • ABBYY FineReader - Commercial OCR tool which works directly with PDFs. ABBYY also offers a cloud OCR API.

  • Nuance OmniPage - Commercial OCR tool which works directly with PDFs.

  • Captricity - Web based service that uses a mixture of technology and human labor to convert uploaded documents into structured data.
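
The PDF to TIFF to Tesseract workflow mentioned above can be sketched in a few lines of Python. This is one possible arrangement under stated assumptions, not the only way to do it: it assumes the Poppler build of pdftoppm (for its `-tiff` option) and the tesseract command line program are both on the PATH. The function names and the 300 dpi default are illustrative choices.

```python
import glob
import os
import subprocess
import tempfile
from typing import List


def rasterize_command(pdf_path: str, out_prefix: str, dpi: int = 300) -> List[str]:
    """Arguments for Poppler's pdftoppm: one TIFF per page, named <prefix>-<n>.tif."""
    return ["pdftoppm", "-tiff", "-r", str(dpi), pdf_path, out_prefix]


def ocr_command(image_path: str, lang: str = "eng") -> List[str]:
    """Arguments for tesseract; the output base 'stdout' prints the recognized text."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_pdf(pdf_path: str, lang: str = "eng", dpi: int = 300) -> str:
    """Rasterize every page of a scanned PDF and OCR the pages in order."""
    with tempfile.TemporaryDirectory() as workdir:
        prefix = os.path.join(workdir, "page")
        subprocess.run(rasterize_command(pdf_path, prefix, dpi), check=True)
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        pages = sorted(glob.glob(prefix + "-*.tif*"))
        texts = []
        for image in pages:
            done = subprocess.run(ocr_command(image, lang),
                                  check=True, capture_output=True)
            texts.append(done.stdout.decode("utf-8", errors="replace"))
    return "\n".join(texts)
```

Higher resolutions generally give Tesseract more to work with at the cost of slower rasterization, so the `dpi` value is worth tuning per document set.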

Low-cost commercial PDF technologies:

Enterprise-Level ETL Solutions

Enterprise-Level (Cost > $1000) Extract, Transform, Load (ETL) Solutions that Directly Read PDFs

Reviews, Listings and Comparisons: