A converter. More than just a converter. Something you won't believe.
Ruby Python HTML Other
Switch branches/tags
Nothing to show
Permalink
Failed to load latest commit information.
bin Remove dependency on external libreoffice and run locally Feb 3, 2015
config Ussing the new infrad deploy procedure Nov 16, 2016
encoders Staging and production both have old versions of libreoffice, so adde… Feb 3, 2015
forward Fixed typo in mail_to Dec 5, 2014
lib Handle nil dubject May 5, 2015
processors actually do ocr when the uploaded image is not tiff… also convert upl… Feb 21, 2013
public Prettied up the response emails Nov 25, 2014
spec Staging and production both have old versions of libreoffice, so adde… Feb 3, 2015
views update api to allow upload of 'already' converted file as the 'conver… May 13, 2013
.gitignore Re-add Gemfile.lock Dec 27, 2013
Capfile capify Feb 21, 2013
Gemfile Ussing the new infrad deploy procedure Nov 16, 2016
Gemfile.lock Ussing the new infrad deploy procedure Nov 16, 2016
LICENSE Add MIT License Feb 21, 2013
README.md Better documentation Dec 4, 2014
Rakefile AutoHeathen: efactored slightly, and added unit tests Nov 24, 2014
VERSION Bumped the version Dec 5, 2014
app.rb Merged autoheathen code into develop branch Dec 3, 2014
config.rb Better wkhtmltopdf configuration Dec 4, 2014
config.ru configure app to work with relative url root Feb 21, 2013
executioner.rb Make the Java executioner handle large stdout streams Dec 28, 2013
heathen.rb Display also the executed command in error messages Dec 27, 2013
inquisitor.rb Cache URLs for 60 seconds at most Dec 28, 2013
utils.rb correctly handle missing parameters Feb 21, 2013

README.md

    )                          )          )
 ( /(        (       *   )  ( /(       ( /(
 )\()) (     )\    ` )  /(  )\()) (    )\())
((_)\  )\ ((((_)(   ( )(_))((_)\  )\  ((_)\
 _((_)((_) )\ _ )\ (_(_())  _((_)((_)  _((_)
| || || __|(_)_\(_)|_   _| | || || __|| \| |
| __ || _|  / _ \    | |   | __ || _| | .` |
|_||_||___|/_/ \_\   |_|   |_||_||___||_|\_|

Convert the heathens.

Heathen is a service for converting pretty much anything to PDF, where "anything" means:

  • Word documents
  • HTML documents
  • Images
  • HTML/Images found at a URL

Additionally, given an image as input, Heathen can parse text found in the image and generate a PDF that is searchable and has selectable text.

There is a client gem here: https://github.com/ifad/heathen-client.

Heathen is built primarily with Sinatra and Dragonfly.

Why

Why you might want to use Heathen:

  • Word is evil and must be purified.
  • You are writing a web application (or many web applications) that need to be able to produce PDFs. With Heathen, you just write your document as a normal HTML document and send it off to Heathen to get a PDF.
  • You are an organization that has many scanned documents stored as images that are therefore not searchable or indexable. Send them off to Heathen and get searchable, selectable PDFs.

Installation & Prerequisites

Heathen requires the following libraries/binaries to be installed, and the versions given are those with which Heathen has been tested:

  • LibreOffice + PyUNO (3.6)
  • Python (2.7)
  • ImageMagick (6.8.2-4)
  • wkhtmltopdf (0.12.1)
  • tesseract (3.02.02)
  • redis (2.4.15)

To install, clone this repository:

git clone git@github.com:ifad/heathen.git

Then bundle:

bundle

If you're using OpenSuSE, we maintain all these binary dependences as RPM packages on the SuSE build server, so they are just a zypper ar and a zypper install away.

Running

You need to specify where Heathen will store files by setting the HEATHEN_STORAGE_ROOT environment variable. Heathen creates three subdirectories at this path: cache, file, and tmp.

  • With rackup:

    HEATHEN_STORAGE_ROOT="/path/to/storage" rackup -E <environment>

  • With pow:

    Add the following to your .powenv file:

    export HEATHEN_STORAGE_ROOT="/path/to/storage"
    export RACK_ENV="<environment>"
    

Additionally, if you plan on running Heathen at a subdirectory, for example, http://my.webapps.example.com/heathen, you will need to set RACK_RELATIVE_URL_ROOT="/heathen"

NOTE: For Word/Excel conversion, heathen needs a LibreOffice/OpenOffice daemon running, listening on localhost on TCP port 8100. This is the command we use to start it (make sure libreoffice-pyuno is installed):

soffice --headless --accept='socket,host=127.0.0.1,port=8100;urp;'

NOTE: If you are running Heathen behind Unicorn or something similar, you will need to increase the allowed response time to be much greater than 30 seconds. There are plans to make Heathen behave asynchronously in the future which should avoid the need for this kind of configuration.

NOTE: If you want to specify the path to the wkhtmltopdf binary you can set the WKHTMLTOPDF environment variable

export WKHTMLTOPDF="/path/to/wkhtmltopdf"

Use It

While Heathen is meant to be used as an API (see https://github.com/ifad/heathen-client), it also has a minimal front-end.

Navigate to the app, select a word document to upload, and select pdf from the Action select box. Submit the form and if everything goes well, you will see some JSON data containing two urls, keyed by original and converted. Navigate to the converted url and wait for the browser to ask where you want to save your pdf.

Now upload an image containing some text, like a scanned document. Follow the same steps, and note that your pdf has selectable text and is searchable.

Re-visit the converted url and note that the response is immediate.

How It Works

Heathen makes extensive use of the absolutely wonderful Dragonfly, and packages everything up as a micro-application with Sinatra.

You give Heathen some content (either a file or URL) and an action by POSTing to /convert. Heathen does a minimal amount of checking to make sure the request is sane and that the supplied content will (probably) be processable, puts the file in permanent storage, and gives you back two links, but does no actual processing.

The original url simply points to the unaltered content that was originally uploaded.

The converted url is actually serialized information that Dragonfly uses to kick-off the processing steps. Visiting this link will convert the original content to pdf, which is why visiting the link the first time can take a long time. However, since Heathen also uses Rack::Cache, the processed content is placed in the cache and never needs to be processed again.

If you want content to be re-processed the next time it is requested, you can run rake heathen:cache:clear.

Additionally, when content is sent to Heathen, a hash is computed which acts as a key to the processed content. That is, if you upload some content that Heathen has already seen (even if the filename is different), you will get back the exact same response as the first time, meaning that if the content is already cached, GETting the converted url will respond immediately.

To remove these mappings, you can run rake heathen:redis:clear.

For convenience, to clear both redis and the cache, plus any residual temp files that may not have been cleaned up by Heathen, you can run rake heathen:clear.

HTML to PDF

In order to convert html to pdf heathen uses wkhtmltopdf. If you want to use any of the options explained in the documentation you can set them using meta tags in the html where the name of the tag sould start with heathen.

<meta name='heathen-foo' content='bars'>

AutoHeathen

AutoHeathen is a utility to allow people to email the documents they want converted to Heathen, which then converts them and either returns the converted documents to sender, or forwards them on to a configured email address. The general idea is that bin/autoheathen will be called by a mail forwarding script, which pipes the email into $stdin.

As an example for configuration, here is how we did it with postfix:

  1. Postfix relay

/etc/postfix/aliases:

heathen.rts: :include:/app/heathen/current/config/heathen.rts

(remember to run sudo postalias /etc/postfix/aliases afterwards)

  1. Forwarding script

/app/heathen/current/config/heathen.rts:

|"(cd /app/heathen/current; /home/heathen/.rbenv/bin/rbenv exec bundle exec bin/autoheathen -r -C config/autoheathen.yml>> log/autoheathen.log 2>&1)"

  1. Config

Edit autoheathen.yml to suit the location of your heathen app (possibly via a reverse proxy like nginx if the unicorns are listening on a domain socket rather than TCP/IP).

CVHeathen

This is a utility to convert a directory full of files. The script walks the entire directory tree and converts the files it finds, writing the result to the equivalent position in a new directory. The script also handles email files (.msg and .eml suffixes). For these, attachments are converted and saved back into the email (which is saved in mbox, or .eml form).

bin/cvheathen [-v] -i {in-dir} -o {out-dir}

-v = verbose

In order to convert .msg files to a readable form, the utility uses a Perl library (Email::Outlook::Message). To install the appropriate Debian packages, run:

sudo apt-get install libemail-outlook-message-perl libemail-sender-perl

License

MIT

Copyright

© IFAD 2014