Tesseract base API

The Tesseract base API extension uses tesserocr to perform OCR (Optical Character Recognition).

From the Projects tab, click on Import from git and paste the URL of this repository (i.e. https://github.com/loko-ai/tesseract-base-api).

Once the project is downloaded, click on it to open it.

To start the project, press the play button to the right of the project's name.

You'll find the PyTessBaseAPI extension at the bottom of the blocks list. Choose a file in the File Reader component and click on Run.

The extracted text is displayed in the Console.

Let's now see how to customize the extension (see Custom extensions for more details).

Right-click on the project's name and choose Open in editor (configure your editor in Loko's settings first).

Alternatively, you can open the project directly from Loko's directory (i.e. ~/loko/projects/tesseract-base-api).

First, install the required system libraries on your machine (the Dockerfile installs them separately for the container):

sudo apt-get install tesseract-ocr libtesseract-dev libleptonica-dev pkg-config

Then create a virtual environment named venv using Python 3.7 and install the dependencies from requirements.lock.

Services

In /tesseract-base-api/services/services.py you'll find the PyTessBaseAPI component:

pyTess = Component('PyTessBaseAPI',
                   inputs=[Input(id='input', label='extract', service='extract', to='output')],
                   outputs=[Output(id='output')],
                   description='A simple custom component to allow an alternative of Tesseract usage (based on PyTessBaseAPI)')


save_extensions([pyTess])

This defines all of the block's information: inputs, outputs, args and description. When you run the script, the component is saved as JSON in /tesseract-base-api/extensions/components.json and shown in your blocks list. See https://loko-extensions.readthedocs.io/en/latest/ for more details.

The input of the component is linked to the service /extract:

@bp.post("/extract")
@doc.consumes(doc.File(name="file"), location="formData", content_type="multipart/form-data", required=True)
@extract_value_args(file=True)
async def test(file, args):
    content = file[0].body

    ret = OCR(content)

    if isinstance(ret, dict):
        return json(ret)
    return raw(ret)

The file parameter represents the input of the block, while args represents the block's configuration (unused in this case).
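
For readers who want to call the service outside of Loko, here is a minimal sketch of how a client might build the multipart/form-data request that the /extract route declares in its @doc.consumes decorator. The form field name "file" comes from that decorator. The helper names encode_multipart and post_file, and any URL passed to post_file, are hypothetical. Whether extract_value_args expects additional form fields (for example the block's args) is not covered in this README, so treat this as an illustration rather than a verified client.

```python
import uuid
from pathlib import Path
from urllib import request


def encode_multipart(field, filename, content, mime_type="application/octet-stream"):
    """Encode a single file as a multipart/form-data body.

    Returns a (body, content_type) tuple. Hypothetical helper, not part of Loko.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def post_file(url, path):
    """POST the file at `path` to `url` under the form field "file".

    `url` is an assumption: it depends on how the container is exposed.
    """
    content = Path(path).read_bytes()
    body, content_type = encode_multipart("file", Path(path).name, content)
    req = request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with request.urlopen(req) as resp:
        return resp.read()
```

Only encode_multipart is pure. post_file needs a running service, and the exact address (host, port and any route prefix) depends on your setup.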

OCR

In /tesseract-base-api/business/ocr.py you'll find the implementation of the OCR:

from io import BytesIO

import magic
import pdf2image
from PIL import Image
from tesserocr import PyTessBaseAPI

from business.text import JOINER_FACTORY

# Detect MIME types from the file content rather than from the file extension
mime = magic.Magic(mime=True)


class Tesseract:

    def __init__(self, join_mode="text", join_str=None):
        self.joiner = JOINER_FACTORY(join_mode)(join_str=join_str)

    def __call__(self, file, lang="ita"):

        images = self.get_images(file)
        # Forward the requested language, otherwise get_text always falls back to "ita"
        texts = [self.get_text(img, lang=lang) for img in images]

        return self.joiner(texts)

    def get_text(self, image, lang="ita"):
        with PyTessBaseAPI(lang=lang) as api:
            api.SetImage(image)
            text = api.GetUTF8Text()
        return text

    def get_images(self, file):

        if isinstance(file, str):
            # A string is treated as a path: read its content and close the file
            with open(file, "rb") as fh:
                file = fh.read()

        mt = mime.from_buffer(file)
        file = BytesIO(file)

        if "image" in mt:
            return [Image.open(file)]
        if "pdf" in mt:
            return self._page_split(file)

        raise Exception("not supported extension {}".format(mt))

    def _page_split(self, file):
        '''@file: Bytes or path'''
        if isinstance(file, str):
            return pdf2image.convert_from_path(file)
        return pdf2image.convert_from_bytes(file.read())

OCR = Tesseract(join_mode="json")
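
The listing imports JOINER_FACTORY from business/text.py, which this README does not show. The sketch below is one possible shape for such a factory, inferred only from how it is called above (JOINER_FACTORY(join_mode)(join_str=join_str), then joiner(texts)) and from the service checking whether the result is a dict. The names TextJoiner, JsonJoiner and joiner_factory, the default separator, and the layout of the returned dict are all assumptions, not the project's actual code.

```python
class TextJoiner:
    """Join the per-page texts into one string (hypothetical implementation)."""

    def __init__(self, join_str=None):
        # Assumed default: separate pages with a newline
        self.join_str = "\n" if join_str is None else join_str

    def __call__(self, texts):
        return self.join_str.join(texts)


class JsonJoiner:
    """Return the per-page texts as a JSON-serializable dict (hypothetical layout)."""

    def __init__(self, join_str=None):
        # Accepted only so that both joiners share the same constructor signature
        self.join_str = join_str

    def __call__(self, texts):
        return {"pages": [{"page": i, "text": t} for i, t in enumerate(texts, start=1)]}


_JOINERS = {"text": TextJoiner, "json": JsonJoiner}


def joiner_factory(join_mode):
    """Look up the joiner class registered for `join_mode`."""
    try:
        return _JOINERS[join_mode]
    except KeyError:
        raise ValueError("unknown join mode: {}".format(join_mode)) from None
```

With a design like this, join_mode="json" yields a dict, which would match the isinstance(ret, dict) branch in the /extract service, while join_mode="text" yields a plain string.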

Dockerfile

Once you have prepared your components and the services they are linked to, configure the Dockerfile of your container:

FROM python:3.7-slim
RUN apt-get update --fix-missing && apt-get install -y gcc tesseract-ocr wget libmagic-dev ffmpeg libsm6 libxext6 g++ libleptonica-dev libtesseract-dev && rm -rf /var/cache/apt
RUN rm /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata
RUN wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata --directory-prefix=/usr/share/tesseract-ocr/4.00/tessdata
RUN wget https://github.com/tesseract-ocr/tessdata/raw/main/ita.traineddata --directory-prefix=/usr/share/tesseract-ocr/4.00/tessdata
RUN wget https://github.com/tesseract-ocr/tessdata/raw/main/spa.traineddata --directory-prefix=/usr/share/tesseract-ocr/4.00/tessdata
ADD ./requirements.lock /
RUN pip install -r /requirements.lock
ARG GATEWAY
ENV GATEWAY=$GATEWAY
ADD . /plugin
ENV PYTHONPATH=$PYTHONPATH:/plugin
ENV LC_ALL=C
WORKDIR /plugin/services
EXPOSE 8080
CMD python -m sanic services.app --host=0.0.0.0 --port=8080

When you stop your project and click the play button again, Loko builds a new image and your extension is ready to use.
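
The Dockerfile above downloads eng, ita and spa language models into the tessdata directory, and the OCR code defaults to lang="ita". If OCR fails for a language, one quick check is whether the matching .traineddata file is actually present. The following is a small sketch using only the standard library. The function names installed_languages and missing_languages are hypothetical, and the directory to inspect is whatever path your image uses (the Dockerfile above uses /usr/share/tesseract-ocr/4.00/tessdata).

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """List language codes that have a .traineddata file in `tessdata_dir`."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def missing_languages(tessdata_dir, required=("eng", "ita", "spa")):
    """Return the codes in `required` that have no model file in `tessdata_dir`."""
    installed = set(installed_languages(tessdata_dir))
    return [lang for lang in required if lang not in installed]
```

Running missing_languages("/usr/share/tesseract-ocr/4.00/tessdata") inside the container should return an empty list if all three downloads succeeded.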
