How to add languages for tesseract-ocr in the image? #33

Closed
xiongyw opened this issue Dec 15, 2015 · 2 comments
xiongyw commented Dec 15, 2015

Sorry, I am new to Docker. I just pulled the latest image and want to use the chi_sim language in tesseract, but support for this language does not seem to be installed by default, as it complains:

~/work/tmp$ docker run -v "$(pwd):/home/docker" ocrmypdf 31.pdf 31-ocr.pdf -l chi_sim
The installed version of tesseract does not have language data for the following requested languages:
chi_sim

It seems the tesseract used by the Docker image is different from the system's tesseract-ocr package, for which I installed the language package with "apt-get install tesseract-ocr-chi-sim".

How do I update the Docker image to include the desired language support? And how do I check which languages are supported (like "tesseract --list-langs" on the system)?

Thanks a lot.

@jbarlow83
Collaborator

I'll add more languages next time I update ocrmypdf.

The Dockerfile specifies how the container was built. It provides its own copy of tesseract and will not use the one on your machine, or anything else about your machine. It's like a lightweight virtual machine.

You can jump inside an ocrmypdf container, modify it, and save the changes as your own private image. (A container is a running instance of an image.)

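In your case it would go something like this (not tested, made up on the spot). CONTAINER_ID stands for the ID that "docker ps -a" shows for the container you just exited, and ocrmypdf-chi-sim is only an example name for the new image:

$ docker run -t -i ocrmypdf /bin/bash
root@container:/# apt-get update
root@container:/# apt-get install tesseract-ocr-chi-sim
root@container:/# exit
$ docker commit -m "Added Chinese simplified" -a "Your Name" CONTAINER_ID ocrmypdf-chi-sim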

See here:
https://docs.docker.com/engine/userguide/dockerimages/
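
A reproducible alternative to committing a hand-edited container is to describe the change in a small Dockerfile that extends the existing image. The following is only a sketch: it assumes the local image is tagged ocrmypdf, that it is Debian/Ubuntu based and runs as root (so apt-get works), and that tesseract-ocr-chi-sim is the right package name. The helper write_lang_dockerfile and the tag ocrmypdf-chi-sim are made-up names for illustration.

```shell
#!/bin/sh
# write_lang_dockerfile DIR BASE_IMAGE PACKAGE  (hypothetical helper)
# Writes DIR/Dockerfile, which extends BASE_IMAGE with one apt package.
# Assumes BASE_IMAGE is Debian/Ubuntu based and runs as root.
write_lang_dockerfile() {
    mkdir -p "$1" || return 1
    cat > "$1/Dockerfile" <<EOF
FROM $2
RUN apt-get update && apt-get install -y $3 && rm -rf /var/lib/apt/lists/*
EOF
}

# Generate the Dockerfile in a scratch directory and show it.
build_dir=$(mktemp -d)
write_lang_dockerfile "$build_dir" ocrmypdf tesseract-ocr-chi-sim
cat "$build_dir/Dockerfile"

# The build itself is left to the reader; the tag is an example name.
echo "Next step: docker build -t ocrmypdf-chi-sim $build_dir"
```

After building, the new tag would be used in place of ocrmypdf in the original "docker run" command. Keeping the change in a Dockerfile means it can be rebuilt whenever the base image is updated.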

@jbarlow83
Collaborator

I decided to produce a second version of the container that provides all of Tesseract's languages.

You can download it with the command below. Chinese (Simplified and Traditional) will then be available.

docker pull jbarlow83/ocrmypdf-polyglot
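
To check which languages an image actually provides (the equivalent of "tesseract --list-langs" on the host), one option is to run tesseract inside the image instead of ocrmypdf. This is an untested sketch: it assumes the tesseract binary is on the PATH inside the image and that overriding the entrypoint is enough to reach it. The function names list_image_langs and has_lang are made up for illustration.

```shell
#!/bin/sh
# list_image_langs IMAGE  (hypothetical helper)
# Runs `tesseract --list-langs` inside IMAGE, bypassing its normal entrypoint.
# 2>&1 is there in case this Tesseract version writes the list to stderr.
list_image_langs() {
    docker run --rm --entrypoint tesseract "$1" --list-langs 2>&1
}

# has_lang LANG  (hypothetical helper)
# Reads a language list on stdin (one code per line) and succeeds
# only if LANG appears as a complete line.
has_lang() {
    grep -qx -- "$1"
}
```

For example, "list_image_langs jbarlow83/ocrmypdf-polyglot" would print the list, and "list_image_langs jbarlow83/ocrmypdf-polyglot | has_lang chi_sim && echo found" would test for a single language.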
