An in-progress set of tools for creating epubs in multiple languages
Scripts to scan a PDF, auto-translate, process, and create epub and PDF output. Getting the dependencies in order can be tricky; a public AMI and possibly a Docker image are in the works.
- Python. Most scripts have been tested with 2.7 and 3.6; some of the onmt-helper scripts require Python 3.6. Non-built-in modules used include google.cloud, guess_language, and pycountry. guess_language won't detect properly unless you also install pyenchant (Fedora or Ubuntu packages are fine) and guess-language-spirit. requirements.txt shows my EC2 instance's pip freeze output.
- For translation, either the Google Translate API and its Python module (pip install gcloud google-cloud-translate, with GOOGLE_APPLICATION_CREDENTIALS set in your environment) or translate-shell.
- texlive with xetex: I recommend installing the entire TeX Live distribution (i.e., following the instructions at https://www.tug.org/texlive/quickinstall.html rather than using yum or apt-get), and doing this before installing pandoc
- pandoc 2.8
- ebook-viewer (optional; to view output)
- For LaTeX, applicable language packs (for example, you'll need to sudo apt install texlive-lang-cyrillic for Russian, texlive-lang-spanish for Spanish, etc.)
- epubcheck & kindlegen (optional, if you're planning on generating Kindle deliverables)
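Since getting the dependencies in order can be tricky, a quick check like the one below may help before running anything. This is only a sketch and is not part of the eoat scripts: the executable and module names are assumptions inferred from this README (for example, `convert` for ImageMagick, `xelatex` for xetex, and `enchant` as the import name for pyenchant), so adjust them for your system.

```python
import importlib.util
import shutil

# Assumed names, inferred from this README; adjust for your system.
EXECUTABLES = ["pandoc", "xelatex", "tesseract", "convert", "pdfinfo", "pdfseparate"]
MODULES = ["google.cloud", "guess_language", "pycountry", "enchant"]


def missing_executables(names):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def missing_modules(names):
    """Return the module names that Python cannot locate."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. "google") is absent
            # or a module has no usable spec.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Missing executables:", missing_executables(EXECUTABLES))
    print("Missing Python modules:", missing_modules(MODULES))
```

An empty list for both lines suggests the basics are in place; it does not check versions or LaTeX language packs.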
Check out this project:
git clone https://github.com/jenh/epub-ocr-and-translate
Install the tools:
cd epub-ocr-and-translate && sudo sh install.sh
Create a working directory:
Copy the PDF you want to process into your working directory:
cd my_working_directory && cp /path/to/my/input.pdf .
Run eoat-tool ocr:
eoat-tool ocr input.pdf eng
(where eng is the three-letter language code of your source doc).
Translate the OCRed output:
eoat-tool trans -i output-from-step-five.txt -s en -t fr
Split the translated files into chapter files:
eoat-tool split -i output-from-step-six.2lang.md -d "CHAPTER"
Run the make tool to add metadata and create Makefiles to be used for PDF and epub creation:
Edit variables.yaml with your intended metadata.
Build your deliverables:
eoat-tool build fr
(where fr is the two-letter language code of the translation you want to build as epub/PDF)
OCR a PDF with eoat-ocr.sh
Given a PDF file, OCRs it and cleans up the output a little. Requires ImageMagick, tesseract, and pdfinfo/pdfseparate.
sh eoat-ocr.sh filename.pdf eng
eng is the three-letter language code of the source document. See list here. Getting the source language right is very important! Note that this may take a while, depending on the number of pages in your PDF.
Translate a file with eoat-trans.py
eoat-trans.py uses either Google's Translate API, which costs money, or translate-shell, which is awesome, but you can and will get blocked by translation engines, so it's not great for large texts (though you can specify google, bing, yandex, etc.). WARNING! translate-shell can be unreliable with many of the engines, even Google, because engines WILL block you after a certain number of characters. For important work, the Google Cloud API is unfortunately still your best bet, though it's pricey at around $10 per million characters. You can also now use the OpenNMT Simple REST Server as an engine; the script assumes it's running locally if you set the engine to opennmt.
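To get a feel for what the Cloud API might cost before committing, you can estimate from the character count. This is a back-of-the-envelope sketch: the $10 per million characters figure is the rough number quoted above, not a current price, and `estimate_cost_usd` is a hypothetical helper. It counts Python characters, which may differ from how the API counts billable characters.

```python
def estimate_cost_usd(text, usd_per_million_chars=10.0):
    """Rough translation cost for `text` at an assumed flat per-character rate."""
    return len(text) * usd_per_million_chars / 1000000.0


if __name__ == "__main__":
    # A 500,000-character book at the assumed rate.
    sample = "a" * 500000
    print("Estimated cost: $%.2f" % estimate_cost_usd(sample))
```

At the assumed rate, a 500,000-character text comes out to about $5.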
Usage for eoat-trans.py:
python eoat-trans.py -i source_text_file -s two-letter-source_lang -t two-letter-target_lang [-e trans|gcloud] [-w wait_seconds]
-e is optional; translate-shell is used by default. There's a default two-second wait between translation requests, which you can change with -w.
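The wait between requests can be pictured as a simple throttled loop. The sketch below is an illustration only, not the code in eoat-trans.py: `translate_chunks` and its `translate` callable are hypothetical, standing in for whichever engine you call.

```python
import time


def translate_chunks(chunks, translate, wait_seconds=2.0, sleep=time.sleep):
    """Translate each chunk in order, pausing between requests.

    `translate` is any callable that takes a string and returns a string
    (hypothetical; plug in your own engine call). `sleep` is injectable
    so the pause can be skipped or recorded in tests.
    """
    results = []
    for index, chunk in enumerate(chunks):
        # No need to wait before the very first request.
        if index > 0 and wait_seconds > 0:
            sleep(wait_seconds)
        results.append(translate(chunk))
    return results


if __name__ == "__main__":
    # Fake "engine" that just upper-cases the text; wait disabled for the demo.
    print(translate_chunks(["one", "two"], lambda s: s.upper(), wait_seconds=0))
```

A longer wait lowers the chance of being blocked by an engine at the cost of a slower run.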
Cut files into individual markdown files for each chapter using eoat-split.py
python eoat-split.py -i filename.txt -d CHAPTER
(where CHAPTER is the chapter-delimiter; accepts UTF-8, so you can use other languages where necessary. For example, if your source text is Russian, you could use "Глава").
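The idea behind delimiter-based splitting can be sketched as follows. This is not the code in eoat-split.py: `split_chapters` is a hypothetical helper, and it assumes the delimiter appears at the start of a line, which may not match what the script actually does. Text before the first delimiter is kept as its own piece.

```python
def split_chapters(text, delimiter):
    """Split `text` into pieces, starting a new piece at every line
    that begins with `delimiter` (assumption: delimiter starts the line)."""
    chapters = []
    current = []
    for line in text.splitlines(True):  # keep line endings
        if line.startswith(delimiter) and current:
            chapters.append("".join(current))
            current = []
        current.append(line)
    if current:
        chapters.append("".join(current))
    return chapters


if __name__ == "__main__":
    sample = "Preface\nCHAPTER 1\nHello\nCHAPTER 2\nBye\n"
    for number, chapter in enumerate(split_chapters(sample, "CHAPTER")):
        print(number, repr(chapter))
```

Because the comparison is a plain string match, a non-English delimiter such as "Глава" works the same way.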
Edit markdown output as needed. This is probably the hardest part. Good luck! You may want to skip this and run the other steps to see how much more post-processing work you've got to do.
Build a Makefile that will generate your epub and PDF files from your Markdown source with
Creates a Makefile that outputs epub and PDF, gathering all *.md files in the current directory. Requires xetex and pandoc, plus ebook-viewer if you want to pop open the output. You should only have to run this once; if you run it again at some point, make sure you've deleted all autogenerated langmd files created using step 6.
Create PDF and epub files using eoat-build.sh
sh eoat-build.sh two_letter_lang
eoat-build.sh uses eoat-printlang.py, which can also be used in standalone mode. It takes a master file that contains two languages, like:
Это по русски This is in English Это по русски This is in English
And exports the language you specify. Usage is
python eoat-printlang.py input_file two_letter_lang_code
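Conceptually, pulling one language out of a two-language file is a per-line filter. The sketch below is an illustration, not the code in eoat-printlang.py: `extract_language` is hypothetical, and `crude_detect` is a deliberately simple stand-in that only tells Cyrillic from everything else. A real detector (for example, the guess_language module listed in the requirements) would be plugged in through the `detect` parameter.

```python
def crude_detect(line):
    """Stand-in detector (assumption): 'ru' if the line contains any
    Cyrillic character, otherwise 'en'. Not a real language detector."""
    for ch in line:
        if u"\u0400" <= ch <= u"\u04ff":
            return "ru"
    return "en"


def extract_language(lines, lang, detect=crude_detect):
    """Keep the non-blank lines whose detected language equals `lang`."""
    return [line for line in lines if line.strip() and detect(line) == lang]


if __name__ == "__main__":
    master = [
        u"Это по русски",
        u"This is in English",
        u"Это по русски",
        u"This is in English",
    ]
    for line in extract_language(master, "en"):
        print(line)
```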
eoat-cleanup.sh: The file cleanup components from eoat-ocr.sh without the ocr part.
eoat-corpusclean.py: Utility to clean lines that match a provided regex from two language corpus training files. Useful if you're plugging OpenNMT into your system.
eoat-expandlang.py: Utility to feed back a babel-friendly language package name when provided a two-letter language code (used by eoat-build).
eoat-install.sh: Installs these scripts to /opt and symlinks to /usr/bin so that you can run from any working directory.
eoat-printlang.py: Given a filename and language code, searches file for language that matches the language code and exports into a new file. Used by eoat-build.
eoat-process.sh: A basic translator in bash, just runs translate-shell. For translations, eoat-trans.py has more extensibility and features, but sometimes you just need to do a quick run.
eoat-tool.py: Wrapper script for the basic core building tools.
eoat-uninstall.sh: Uninstalls eoat-tools from /opt/ and unlinks the eoat utilities from /usr/bin.
onmt-helpers directory: Assistive scripts that may help if you're using your own translation engine with OpenNMT-py.
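For the corpus-cleaning utility, the main thing to get right is keeping the two training files aligned line by line. The sketch below shows one way to do that and is not the code in eoat-corpusclean.py: `clean_parallel` is a hypothetical helper, and dropping a pair when either side matches the regex is an assumption about the desired behavior.

```python
import re


def clean_parallel(src_lines, tgt_lines, pattern):
    """Drop every line pair where either side matches `pattern`,
    so the source and target lists stay aligned."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("corpus files must have the same number of lines")
    regex = re.compile(pattern)
    kept = [
        (src, tgt)
        for src, tgt in zip(src_lines, tgt_lines)
        if not regex.search(src) and not regex.search(tgt)
    ]
    return [src for src, _ in kept], [tgt for _, tgt in kept]


if __name__ == "__main__":
    source = ["Hello", "Page 12", "Goodbye"]
    target = ["Bonjour", "Page 12", "Au revoir"]
    print(clean_parallel(source, target, r"^Page \d+$"))
```

Removing pairs rather than single lines matters because a training corpus with mismatched line counts pairs every later sentence with the wrong translation.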