Extract Annotations from PDF files

Name

Marginalia - extract annotations from PDF files

Description

This is just an experiment with the PDF file format, iText, and Maven.

Background

We are using the computer as a paper simulator, which is like tearing the wings off a 747 and driving it as a bus on the highway. -- Ted Nelson

PDF files are crap, but we somehow have to live with them for a while. This application is an experiment in at least making use of annotations in PDF files. Annotations should be publications of their own, but they are hidden in PDF files or in other proprietary software.

There is free, open source PDF reader software that supports annotations. So far, however, there is no eBook reader or similar device that is as convenient for making annotations as a physical piece of paper and a pen.

To get into the PDF file format, which in fact is more like a database or a file system, you need PDF parser libraries, like iText.

PDF Reference

Adobe PDF Reference Archives.

PDF annotations are defined in chapter 8.4 of the PDF Reference (2006), pages 604-647.

Usage of the developer snapshot

First download, clone or fork the project, e.g.

$ git clone git://github.com/nichtich/marginalia.git
$ cd marginalia

You need Maven 2 to compile and run this application. Try:

$ mvn compile

If you are lucky, Maven will install all required dependencies and compile. You can then run marginalia in the development environment:

$ mvn exec:java -Dexec.args="yourfile.pdf"

To create a single jar that includes all dependencies, call:

$ mvn assembly:single

You can then copy the jar file to a place of your choice and run it via:

$ java -jar target/marginalia-0.0.1dev-jar-with-dependencies.jar yourfile.pdf

Extracting text

To extract annotated text, you can use the pdftotext command line tool from poppler (maybe I should move from iText to poppler). For instance, if you have an annotation on page 1 with

rect="52.559917,437.8619,286.3729,528.27844"

and the page size is 595 x 842 pt (A4), then the crop area can be calculated roughly as x = 52, y = 842 - 528.27 ≈ 313, W = 233, H = 91. The y value is flipped because PDF coordinates start at the bottom left of the page, while pdftotext measures the crop area from the top left.

$ pdftotext -layout -nopgbrk -f 1 -l 1 -x 52 -y 313 -W 233 -H 91 your.pdf && cat your.txt

Of course this should be automated, and it does not cover the details.
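A minimal sketch of how the calculation could be automated in Java. It is not part of Marginalia: the class and method names are made up for illustration, and it assumes an unrotated page whose lower left corner is at (0,0). Unlike the hand-rounded values above, it rounds outward to whole points so that the crop box covers the whole rectangle, which gives a slightly larger W and H.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Marginalia.
public class CropArea {

    /**
     * Converts a PDF annotation rect ("llx,lly,urx,ury", origin at the
     * bottom left) into {x, y, W, H} in whole points, measured from the
     * top left as pdftotext expects. Assumes an unrotated page whose
     * lower left corner is at (0,0).
     */
    public static int[] fromRect(String rect, double pageHeight) {
        String[] p = rect.split(",");
        double x1 = Double.parseDouble(p[0].trim());
        double y1 = Double.parseDouble(p[1].trim());
        double x2 = Double.parseDouble(p[2].trim());
        double y2 = Double.parseDouble(p[3].trim());

        // The two corners may be given in either order, so normalize them.
        double left = Math.min(x1, x2), right = Math.max(x1, x2);
        double bottom = Math.min(y1, y2), top = Math.max(y1, y2);

        // Round outward so the crop box fully contains the rectangle.
        int x = (int) Math.floor(left);
        int y = (int) Math.floor(pageHeight - top);
        int w = (int) Math.ceil(right) - x;
        int h = (int) Math.ceil(pageHeight - bottom) - y;
        return new int[] {x, y, w, h};
    }

    /** Builds the pdftotext call shown above for one page and crop box. */
    public static List<String> pdftotextCommand(String pdf, int page, int[] crop) {
        return Arrays.asList("pdftotext", "-layout", "-nopgbrk",
                "-f", String.valueOf(page), "-l", String.valueOf(page),
                "-x", String.valueOf(crop[0]), "-y", String.valueOf(crop[1]),
                "-W", String.valueOf(crop[2]), "-H", String.valueOf(crop[3]),
                pdf);
    }

    public static void main(String[] args) {
        int[] crop = fromRect("52.559917,437.8619,286.3729,528.27844", 842.0);
        System.out.println(String.join(" ", pdftotextCommand("your.pdf", 1, crop)));
        // prints: pdftotext -layout -nopgbrk -f 1 -l 1 -x 52 -y 313 -W 235 -H 92 your.pdf
    }
}
```

The command list could be passed to a ProcessBuilder to actually run pdftotext. Pages with rotation or a shifted MediaBox would need extra handling.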

Converting annotations to SVG

The XML output contains some custom Marginalia elements. With these and the script marginalia2svg.xsl you can convert extracted annotations to SVG.

$ xsltproc marginalia2svg.xsl youroutput.xml
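If xsltproc is not at hand, the same transformation can probably be done with the XSLT 1.0 processor that ships with the JDK. The following is only a sketch, assuming marginalia2svg.xsl uses no libxslt-specific extensions; the class name ToSvg is made up and not part of Marginalia.

```java
import java.io.File;
import java.io.StringWriter;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of Marginalia.
public class ToSvg {

    /** Applies an XSLT stylesheet to an XML input and returns the result as a string. */
    public static String transform(Source stylesheet, Source input)
            throws TransformerException {
        Transformer transformer =
                TransformerFactory.newInstance().newTransformer(stylesheet);
        StringWriter out = new StringWriter();
        transformer.transform(input, new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        if (args.length != 2) {
            System.err.println("usage: java ToSvg marginalia2svg.xsl youroutput.xml");
            return;
        }
        // Read both files from disk and print the transformed document.
        System.out.println(transform(
                new StreamSource(new File(args[0])),
                new StreamSource(new File(args[1]))));
    }
}
```

It takes the same two arguments as the xsltproc call and writes the result to standard output.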

Author

Jakob Voss jakob.voss@gbv.de