Tesseract base API extension is based on tesserocr to implement OCR (Optical Character Recognition).
From the Projects's tab, click on Import from git and copy and paste the URL of the current page (i.e. https://github.com/loko-ai/tesseract-base-api):
Once the project is downloaded, click and open it.In order to start the project remember to press the play button on the right of the project's name.
You'll find the PyTessBaseAPI extension on the bottom of blocks' list. Choose a file in the File Reader component and click on Run.
In the Console you'll visualize the extracted text.
Let's now see how to custom the extension (See more here Custom extensions).
Click right on the project's name on Open in editor (configure your editor using the Loko's settings first):
Otherwise, you can open your project directly on the Loko's directory (i.e. ~/loko/projects/tesseract-base-api
).
First of all, install the required libraries in the Dockerfile:
sudo apt-get install tesseract-ocr libtesseract-dev libleptonica-dev pkg-config
Then, create you venv, named venv, using python3.7 and install requirements.lock.
In /tesseract-base-api/services/services.py
you'll find the PyTessBaseAPI component:
pyTess = Component('PyTessBaseAPI',
inputs=[Input(id='input', label='extract', service='extract', to='output')],
outputs=[Output(id='output')],
description='A simple custom component to allow an alternative of Tesseract usage (based on PyTessBaseAPI)')
save_extensions([pyTess])
We are defining all the block's information: inputs, outputs, args, description. When you run the script, the component will be
saved as a json into /tesseract-base-api/extensions/components.json
and showed in your block's list.
See more here https://loko-extensions.readthedocs.io/en/latest/.
The input of the component is linked to the service /extract
:
@bp.post("/extract")
@doc.consumes(doc.File(name="file"), location="formData", content_type="multipart/form-data", required=True)
@extract_value_args(file=True)
async def test(file, args):
content = file[0].body
ret = OCR(content)
if isinstance(ret, dict):
return json(ret)
return raw(ret)
Parameter file represents the input of the block, while args represents the configuration of the block (we don't use any configuration in this case).
In /tesseract-base-api/business/ocr.py
you'll find the implementation of the OCR:
from io import BytesIO
from tesserocr import PyTessBaseAPI
import pdf2image as pdf2image
from PIL import Image
import magic
mime = magic.Magic(mime=True)
from business.text import JOINER_FACTORY
class Tesseract:
def __init__(self, join_mode="text", join_str=None):
self.joiner = JOINER_FACTORY(join_mode)(join_str=join_str)
def __call__(self, file, lang="ita"):
images = self.get_images(file)
texts = [self.get_text(img) for img in images]
return self.joiner(texts)
def get_text(self, image, lang="ita"):
with PyTessBaseAPI(lang=lang) as api:
api.SetImage(image)
text = api.GetUTF8Text()
return text
def get_images(self, file):
if isinstance(file, str):
file = open(file, "rb").read()
mt = mime.from_buffer(file)
file = BytesIO(file)
if "image" in mt:
return [Image.open(file)]
if "pdf" in mt:
return self._page_split(file)
raise Exception("not supported extension {}".format(mt))
def _page_split(self, file):
'''@file: Bytes or path'''
if isinstance(file, str):
return pdf2image.convert_from_path(file)
return pdf2image.convert_from_bytes(file.read())
OCR = Tesseract(join_mode="json")
Once you prepared your components and the services they are linked to, you have to configure the Dockerfile of your container:
FROM python:3.7-slim
RUN apt-get update --fix-missing && apt-get install -y gcc tesseract-ocr wget libmagic-dev ffmpeg libsm6 libxext6 g++ libleptonica-dev libtesseract-dev && rm -rf /var/cache/apt
RUN rm /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata
RUN wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata --directory-prefix=/usr/share/tesseract-ocr/4.00/tessdata
RUN wget https://github.com/tesseract-ocr/tessdata/raw/main/ita.traineddata --directory-prefix=/usr/share/tesseract-ocr/4.00/tessdata
RUN wget https://github.com/tesseract-ocr/tessdata/raw/main/spa.traineddata --directory-prefix=/usr/share/tesseract-ocr/4.00/tessdata
ADD ./requirements.lock /
RUN pip install -r /requirements.lock
ARG GATEWAY
ENV GATEWAY=$GATEWAY
ADD . /plugin
ENV PYTHONPATH=$PYTHONPATH:/plugin
ENV LC_ALL=C
WORKDIR /plugin/services
EXPOSE 8080
CMD python -m sanic services.app --host=0.0.0.0 --port=8080
When you stop your project and click again on the play button, Loko builds a new image, and you're ready to use your extension.