Scripts to auto-OCR PDFs, translate output, and generate epubs


An in-progress set of tools for creating epubs in multiple languages

Scripts to scan a PDF, auto-translate, process, and create epub and PDF output. Getting the dependencies in order can be tricky; I'm currently working on a public AMI and possibly a Docker image.


  • Python: most scripts have been tested with 2.7 and 3.6; some of the onmt-helper scripts require Python 3.6. Non-built-in modules used include guess_language and pycountry. guess_language won't detect languages properly unless you also install pyenchant (the Fedora or Ubuntu packages are fine) and guess-language-spirit. requirements.txt contains the pip freeze output from my EC2 instance.
  • ImageMagick
  • poppler-utils
  • tesseract
  • For translation: either the Google Translate API and its Python module (pip install gcloud google-cloud-translate, with GOOGLE_APPLICATION_CREDENTIALS set in your environment) or translate-shell.
  • texlive with xetex: installing the entire CTAN distribution is recommended (i.e., following the upstream installation instructions rather than using yum or apt-get). Do this before installing pandoc.
  • pandoc 2.8
  • ebook-viewer (optional; to view output)
  • For LaTeX, the applicable language packs (for example, you'll need to sudo apt install texlive-lang-cyrillic for Russian, texlive-lang-spanish for Spanish, etc.)
  • epubcheck & kindlegen (optional, if you're planning on generating Kindle deliverables)
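
Since getting the dependencies in order can be tricky, a quick check of what is on your PATH may save some time. The following is a minimal sketch, not part of the project; the command names in REQUIRED and OPTIONAL are assumptions based on the tools listed above (for example, `convert` for ImageMagick and `trans` for translate-shell) and may differ on your distribution.

```python
import shutil

# Assumed command names for the tools listed above; adjust for your system.
REQUIRED = ["convert", "pdfinfo", "pdfseparate", "tesseract", "xelatex", "pandoc"]
OPTIONAL = ["trans", "ebook-viewer", "epubcheck", "kindlegen"]


def missing_tools(names, which=shutil.which):
    """Return the names from `names` that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    for label, names in (("required", REQUIRED), ("optional", OPTIONAL)):
        missing = missing_tools(names)
        print(f"missing {label} tools: {', '.join(missing) or 'none'}")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is what you want.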

Quick Start

  1. Check out this project:

    git checkout

  2. Install the tools:

    cd epub-ocr-and-translate && sudo sh

  3. Create a working directory:

    mkdir my_working_directory

  4. Copy the PDF you want to process into your working directory:

    cd my_working_directory && cp /path/to/my/input.pdf .

  5. Run eoat-tool ocr:

    eoat-tool ocr input.pdf eng

    (where eng is the three-letter language code of your source document).

  6. Translate the OCRed output:

    eoat-tool trans -i output-from-step-five.txt -s en -t fr

  7. Split the translated files into chapter files:

    eoat-tool split -i -d "CHAPTER"

  8. Run the make tool to add metadata and create Makefiles to be used for PDF and epub creation:

    eoat-tool make

  9. Edit variables.yaml with your intended metadata.

  10. Build your deliverables:

    eoat-tool build fr

    (where fr is the two-letter language code of the translation you want to build as epub/PDF).
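
The chapter split in step 7 can be pictured with a short Python sketch. This is not the project's own splitter, only an illustration under the assumption that a new chapter begins at every line that starts with the delimiter; the function name `split_chapters` is hypothetical.

```python
def split_chapters(text, delimiter):
    """Split text into chunks, opening a new chunk at each line that
    starts with `delimiter`. Text before the first delimiter (front
    matter) is kept as its own chunk."""
    chapters = [[]]
    for line in text.splitlines():
        # Start a new chunk only if the current one already has content,
        # so a delimiter on the very first line does not leave an empty chunk.
        if line.strip().startswith(delimiter) and chapters[-1]:
            chapters.append([])
        chapters[-1].append(line)
    return ["\n".join(chunk) for chunk in chapters if chunk]


if __name__ == "__main__":
    sample = "Preface\nCHAPTER 1\nfoo\nCHAPTER 2\nbar"
    for number, chapter in enumerate(split_chapters(sample, "CHAPTER")):
        print(f"--- chunk {number} ---")
        print(chapter)
```

Because Python strings are Unicode, the same function accepts a non-English delimiter without changes.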

Standard workflow:

  1. OCR a PDF with

    Given a PDF file, OCR and clean it up a little. Requires ImageMagick, tesseract, pdfinfo/pdfseparate.


    sh filename.pdf eng

    where eng is the three-letter language code of the source document. See list here. Specifying the correct source language is very important! Note that this may take a while, depending on the number of pages in your PDF.

  2. Translate a file with the translation script. It uses either Google's Translate API, which costs money, or translate-shell, which is awesome and lets you specify the engine (google, bing, yandex, etc.). WARNING! translate-shell can be unreliable with many of the engines, even Google, because engines WILL block you after a certain number of characters, so it's not great for large texts. For important work, the Google Cloud API is unfortunately still your best bet, though it's pricey at around $10 per million characters. You can also now use an OpenNMT Simple REST Server as the engine; the script assumes the server is running locally if you set the engine to opennmt.

    Usage for python -i source_text_file -s two-letter-source_lang -t two-letter-target_lang [-e trans|gcloud] [-w wait_seconds]

    -e is optional; translate-shell is used by default. There's a default two-second wait between translation requests, which you can change with -w.

  3. Cut files into individual markdown files for each chapter using


    python -i filename.txt -d CHAPTER

    (where CHAPTER is the chapter-delimiter; accepts UTF-8, so you can use other languages where necessary. For example, if your source text is Russian, you could use "Глава").

  4. Edit markdown output as needed. This is probably the hardest part. Good luck! You may want to skip this and run the other steps to see how much more post-processing work you've got to do.

  5. Build a Makefile that will generate your epub and PDF files from your Markdown source with

    Creates a Makefile that will output epub and PDF, gathering all *.md files in the current directory. Requires xetex and pandoc, plus ebook-viewer if you want to pop open the output. You should only have to run this once; if you run it again at some point, make sure you've first deleted all autogenerated langmd files created in step 6.

    Usage: python

  6. Create PDF and epub files using

    Usage: sh two_letter_lang

    This step requires the language-extraction script, which can also be used in standalone mode. It takes a master file that contains two languages, like:

     Это по русски
     This is in English
     Это по русски
     This is in English

    And exports the language you specify. Usage is python input_file two_letter_lang_code
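
To illustrate the idea of pulling one language out of an interleaved master file, here is a minimal sketch. It is not the project's extraction script: instead of guess_language it swaps in a much cruder heuristic, classifying each line by its dominant Unicode script, so it can only separate languages written in different scripts (Russian from English, but not English from French). The `SCRIPT_FOR_LANG` table and both function names are hypothetical.

```python
import unicodedata

# Assumption: a tiny language-code -> script table, for illustration only.
SCRIPT_FOR_LANG = {"ru": "CYRILLIC", "en": "LATIN"}


def dominant_script(line):
    """Return the most common script among the letters in `line`,
    taken from the first word of each character's Unicode name
    (e.g. 'CYRILLIC CAPITAL LETTER E' -> 'CYRILLIC')."""
    counts = {}
    for ch in line:
        if ch.isalpha():
            script = unicodedata.name(ch, "").split(" ")[0]
            counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else None


def extract_language(lines, lang):
    """Keep only the lines whose dominant script matches `lang`."""
    wanted = SCRIPT_FOR_LANG[lang]
    return [line for line in lines if dominant_script(line) == wanted]


if __name__ == "__main__":
    master = ["Это по русски", "This is in English",
              "Это по русски", "This is in English"]
    print("\n".join(extract_language(master, "en")))
```

A statistical detector such as guess_language is needed as soon as the two languages share a script, which is why this heuristic is only a stand-in.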

Other scripts:

  • The file cleanup components of the OCR script, without the OCR part.

  • Utility to clean lines that match a provided regex from two-language corpus training files. Useful if you're plugging OpenNMT into your system.

  • Utility to feed back a babel-friendly language package name when provided a two-letter language code (used by eoat-build).

  • Installs these scripts to /opt and symlinks to /usr/bin so that you can run from any working directory.

  • Given a filename and language code, searches the file for text in the language that matches the language code and exports it into a new file. Used by eoat-build.

  • A basic translator in bash that just runs translate-shell. The main translation script has more extensibility and features, but sometimes you just need to do a quick run.

  • Wrapper script for the basic core building tools.

  • Uninstalls eoat-tools from /opt/ and unlinks the eoat utilities from /usr/bin.

  • onmt-helpers directory: Assistive scripts that may help if you're using your own translation engine with OpenNMT-py.
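
For the regex-based corpus cleaning utility mentioned in the list above, the key detail with parallel training files is that the two sides must stay line-aligned. The sketch below shows one way to do that; it is an assumption about the behavior, not the project's code: a line pair is dropped from both files if either side matches the pattern. The function name `clean_parallel` is hypothetical.

```python
import re


def clean_parallel(src_lines, tgt_lines, pattern):
    """Drop every aligned (source, target) pair in which either line
    matches `pattern`, so the two corpora stay the same length."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    rx = re.compile(pattern)
    kept = [(s, t) for s, t in zip(src_lines, tgt_lines)
            if not (rx.search(s) or rx.search(t))]
    return [s for s, _ in kept], [t for _, t in kept]


if __name__ == "__main__":
    src = ["Hello", "http://example.com", "Bye"]
    tgt = ["Bonjour", "http://example.com", "Au revoir"]
    clean_src, clean_tgt = clean_parallel(src, tgt, r"https?://")
    print(clean_src)
    print(clean_tgt)
```

Filtering each file independently would shift later lines out of alignment, which is why the pairs are filtered together here.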
