Merge branch 'feature/modernhocr'
jbarlow83 committed Dec 3, 2023
2 parents 2affa83 + 445617a commit 5b2f2e6
Showing 122 changed files with 4,451 additions and 2,446 deletions.
10 changes: 4 additions & 6 deletions .github/workflows/build.yml
Expand Up @@ -22,20 +22,18 @@ jobs:
strategy:
matrix:
include:
- os: ubuntu-22.04
python: "3.9"
- os: ubuntu-22.04
python: "3.10"
- os: ubuntu-22.04
python: "3.11"
- os: ubuntu-22.04
python: "3.9"
python: "3.10"
tesseract5: true
- os: ubuntu-latest
python: "3.12"
tesseract5: true
#- os: ubuntu-latest
# python: "pypy3.9"
# - os: ubuntu-latest
# python: "pypy3.10"

env:
OS: ${{ matrix.os }}
Expand Down Expand Up @@ -219,7 +217,7 @@ jobs:
- uses: actions/setup-python@v4
name: Setup Python
with:
python-version: "3.9"
python-version: "3.10"
cache: "pip"

- name: Make wheels and sdist
Expand Down
2 changes: 1 addition & 1 deletion .readthedocs.yaml
Expand Up @@ -19,7 +19,7 @@ formats:
build:
os: ubuntu-22.04
tools:
python: "3.9"
python: "3.10"

python:
install:
Expand Down
8 changes: 8 additions & 0 deletions .reuse/dep5
Expand Up @@ -124,6 +124,14 @@ Copyright: Kai-Uwe Behrmann <www.behrmann.name>
ColorSolutions <www.basICColor.com>
License: Zlib

Files: src/ocrmypdf/data/pdf.ttf
Copyright: (C) 2014 Ray Smith
(C) 2015 Ken Sharp
(C) 2016 James R. Barlow
(C) 2016 Jeff Breidenbach
(C) 2017 Zdenko Podobný
License: Apache-2.0

Files: tests/resources/3small.pdf
Copyright: (C) 2014 Euskaldunaa
(C) 2017 James R. Barlow
Expand Down
62 changes: 31 additions & 31 deletions docs/advanced.rst
Expand Up @@ -239,46 +239,46 @@ rendering
OCRmyPDF has these PDF renderers: ``sandwich`` and ``hocr``. The
renderer may be selected using ``--pdf-renderer``. The default is
``auto`` which lets OCRmyPDF select the renderer to use. Currently,
``auto`` always selects ``sandwich``.
``auto`` always selects ``hocr``.

The ``sandwich`` renderer
-------------------------
The ``hocr`` renderer
---------------------

The ``sandwich`` renderer uses Tesseract's new text-only PDF feature,
which produces a PDF page that lays out the OCR in invisible text. This
page is then "sandwiched" onto the original PDF page, allowing lossless
application of OCR even to PDF pages that contain other vector objects.
.. versionchanged:: 16.0.0

Currently this is the best renderer for most uses, however it is
implemented in Tesseract so OCRmyPDF cannot influence it. Currently some
problematic PDF viewers like Mozilla PDF.js and macOS Preview have
problems with segmenting its text output, and
mightrunseveralwordstogether.
In both renderers, a text-only layer is rendered and sandwiched (overlaid)
on to either the original PDF page, or newly rasterized version of the
original PDF page (when ``--force-ocr`` is used). In this way, loss
of PDF information is generally avoided. (You may need to disable PDF/A
conversion and optimization to eliminate all lossy transformations.)

When image preprocessing features like ``--deskew`` are used, the
original PDF will be rendered as a full page and the OCR layer will be
placed on top.
The current approach used by the new hOCR renderer is a re-implementation
of Tesseract's PDF renderer, using the same Glyphless font and general
ideas, but fixing many technical issues that impeded it. The new hocr
provides better text placement accuracy, avoids issues with word
segmentation, and provides better positioning of skewed text.

The ``hocr`` renderer
---------------------
Using the experimental API, it is also possible to edit the OCR output
from Tesseract, using any tool that is capable of editing hOCR files.

Older versions of this renderer did not support non-Latin languages, but
it is now universal.

The ``hocr`` renderer works with older versions of Tesseract. The image
layer is copied from the original PDF page if possible, avoiding
potentially lossy transcoding or loss of other PDF information. If
preprocessing is specified, then the image layer is a new PDF. (You may
need to disable PDF/A conversion nad optimization to eliminate all
lossy transformations.)
The ``sandwich`` renderer
-------------------------

Unlike ``sandwich`` this renderer is implemented within OCRmyPDF; anyone
looking to customize how OCR is presented should look here. A major
disadvantage of this renderer is it not capable of correctly handling
text outside the Latin alphabet (specifically, it supports the ISO 8859-1
character set). Pull requests to improve the situation are welcome.
The ``sandwich`` renderer uses Tesseract's text-only PDF feature,
which produces a PDF page that lays out the OCR in invisible text.

Currently, this renderer has the best compatibility with Mozilla's
PDF.js viewer.
Currently some problematic PDF viewers like Mozilla PDF.js and macOS
Preview have problems with segmenting its text output, and
mightrunseveralwordstogether. It also does not implement right to left
fonts (Arabic, Hebrew, Persian). The output of this renderer cannot
be edited. The sandwich renderer is retained for testing.

This works in all versions of Tesseract.
When image preprocessing features like ``--deskew`` are used, the
original PDF will be rendered as a full page and the OCR layer will be
placed on top.

Rendering and rasterizing options
=================================
Expand Down
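The documentation change above says ``auto`` now resolves to ``hocr`` instead of ``sandwich``. A minimal sketch of that resolution follows; ``resolve_pdf_renderer`` is a hypothetical helper written for illustration, not part of OCRmyPDF, and it only mirrors the one-line change made in ``tesseract_ocr.check_options`` in this commit.

```python
# Hypothetical helper (not OCRmyPDF API): mirrors the 'auto' handling that
# this commit changes in builtin_plugins/tesseract_ocr.py.
def resolve_pdf_renderer(requested: str) -> str:
    """Return the concrete renderer for a --pdf-renderer value."""
    # 'auto' selected 'sandwich' before this commit; it selects 'hocr' after.
    return 'hocr' if requested == 'auto' else requested


for choice in ('auto', 'hocr', 'sandwich'):
    print(f"--pdf-renderer {choice} -> {resolve_pdf_renderer(choice)}")
```

Under this reading, anyone who depends on the previous output would need to pass ``--pdf-renderer sandwich`` explicitly rather than rely on the default.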
6 changes: 3 additions & 3 deletions docs/installation.rst
Expand Up @@ -454,7 +454,7 @@ Cygwin64

First install the the following prerequisite Cygwin packages using ``setup-x86_64.exe``::

python39 (or later)
python310 (or later)
python3?-devel
python3?-pip
python3?-lxml
Expand Down Expand Up @@ -552,7 +552,7 @@ manager. ``pip`` cannot provide them.

The following versions are required:

- Python 3.9 or newer
- Python 3.10 or newer
- Ghostscript 9.55 or newer
- Tesseract 4.1.1 or newer
- jbig2enc 0.29 or newer
Expand Down Expand Up @@ -588,7 +588,7 @@ unfortunately, the ``pip install`` command cannot satisfy all of them.
Installing HEAD revision from sources
=====================================

If you have ``git`` and Python 3.9 or newer installed, you can install
If you have ``git`` and Python 3.10 or newer installed, you can install
from source. When the ``pip`` installer runs, it will alert you if
dependencies are missing.

Expand Down
7 changes: 3 additions & 4 deletions pyproject.toml
Expand Up @@ -10,18 +10,16 @@ dynamic = ["version"]
description = "OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched"
readme = "README.md"
license = { text = "MPL-2.0" }
requires-python = ">=3.9"
requires-python = ">=3.10"
dependencies = [
"Pillow>=10.0.1",
"deprecation>=2.1.0",
"img2pdf>=0.4.4",
"packaging>=20",
"pdfminer.six>=20220319",
"pikepdf>=8.7.1",
"pikepdf>=8.8.0",
"pluggy>=0.13.0",
"reportlab>=3.6.8",
"rich>=13",
"typing-extensions>=4;python_version<'3.10'",
]
authors = [{ name = "James R. Barlow", email = "james@purplerock.ca" }]
classifiers = [
Expand Down Expand Up @@ -58,6 +56,7 @@ test = [
"pytest-cov>=3.0.0",
"pytest-xdist>=2.5.0",
"python-xmp-toolkit==2.0.1", # also requires apt-get install libexempi3
"reportlab>=3.6.8",
"types-Pillow",
"types-humanfriendly",
]
Expand Down
8 changes: 6 additions & 2 deletions src/ocrmypdf/_concurrent.py
Expand Up @@ -8,7 +8,7 @@
import threading
from abc import ABC, abstractmethod
from collections.abc import Iterable
from typing import Callable, TypeVar
from typing import Any, Callable, TypeVar

from ocrmypdf._progressbar import NullProgressBar, ProgressBar

Expand All @@ -19,6 +19,10 @@ def _task_noop(*_args, **_kwargs):
return


def _task_finished_noop(_result: Any, pbar: ProgressBar):
pbar.update()


class Executor(ABC):
"""Abstract concurrent executor."""

Expand Down Expand Up @@ -66,7 +70,7 @@ def __call__(
if not worker_initializer:
worker_initializer = _task_noop
if not task_finished:
task_finished = _task_noop
task_finished = _task_finished_noop
if not task:
task = _task_noop

Expand Down
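The ``_concurrent.py`` change swaps the default ``task_finished`` callback from a pure no-op to one that advances the progress bar. The sketch below illustrates the difference with simplified stand-ins: ``CountingProgressBar`` and ``run_serially`` are assumptions made for this example and are not the real ``ProgressBar`` or ``Executor`` classes.

```python
from typing import Any, Callable, Iterable, Optional


class CountingProgressBar:
    """Stand-in for ProgressBar (assumption): it only counts update() calls."""

    def __init__(self) -> None:
        self.completed = 0

    def update(self) -> None:
        self.completed += 1


def _task_noop(*_args, **_kwargs):
    return


def _task_finished_noop(_result: Any, pbar: CountingProgressBar):
    # Does nothing with the result, but still records that a task completed.
    pbar.update()


def run_serially(
    items: Iterable[Any],
    task: Callable[[Any], Any],
    task_finished: Optional[Callable[[Any, CountingProgressBar], None]] = None,
    *,
    default_finished: Callable = _task_finished_noop,
) -> int:
    """Hypothetical one-thread executor; returns how far the bar advanced."""
    pbar = CountingProgressBar()
    if not task_finished:
        task_finished = default_finished
    for item in items:
        task_finished(task(item), pbar)
    return pbar.completed


advanced_with_new_default = run_serially([1, 2, 3], str)
advanced_with_old_default = run_serially([1, 2, 3], str, default_finished=_task_noop)
print(advanced_with_new_default, advanced_with_old_default)
```

In this simplified model, a caller that supplies no ``task_finished`` callback sees the bar advance once per task with the new default and not at all with the old one.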
22 changes: 1 addition & 21 deletions src/ocrmypdf/_exec/unpaper.py
Expand Up @@ -14,6 +14,7 @@
from decimal import Decimal
from pathlib import Path
from subprocess import PIPE, STDOUT
from tempfile import TemporaryDirectory
from typing import Union

from packaging.version import Version
Expand All @@ -26,27 +27,6 @@
# https://github.com/Flameeyes/unpaper/blob/main/doc/basic-concepts.md


if sys.version_info >= (3, 10):
from tempfile import TemporaryDirectory
else:
from tempfile import TemporaryDirectory as _TemporaryDirectory

class TemporaryDirectory(_TemporaryDirectory):
"""Shim to consume ignore_cleanup_errors kwarg on Python 3.9 and older.
The argument is consumed without action. If users are getting errors related
to temporary file cleanup, they should upgrade to Python 3.10 which properly
cleans up temporary directories on Windows.
See: https://github.com/python/cpython/pull/24793
"""

def __init__(self, ignore_cleanup_errors=False, **kwargs):
super().__init__(**kwargs)

del _TemporaryDirectory


UNPAPER_IMAGE_PIXEL_LIMIT = 256 * 1024 * 1024

DecFloat = Union[Decimal, float]
Expand Down
2 changes: 1 addition & 1 deletion src/ocrmypdf/_graft.py
Expand Up @@ -313,7 +313,7 @@ def _graft_text_layer(
strip_invisible_text(self.pdf_base, base_page)

base_page.contents_add(
new_text_layer, prepend=self.render_mode == RenderMode.UNDERNEATH
new_text_layer, prepend=self.render_mode == RenderMode.ON_TOP
)

_update_resources(obj=base_page.obj, font=font, font_key=font_key)
10 changes: 5 additions & 5 deletions src/ocrmypdf/_pipeline.py
Expand Up @@ -743,13 +743,13 @@ def render_hocr_page(hocr: Path, page_context: PageContext) -> Path:
dpi = get_page_square_dpi(page_context, calculate_image_dpi(page_context))
debug_mode = options.pdf_renderer == 'hocrdebug'

hocrtransform = HocrTransform(hocr_filename=hocr, dpi=dpi.to_scalar()) # square
hocrtransform.to_pdf(
HocrTransform(
hocr_filename=hocr,
dpi=dpi.to_scalar(), # square
debug=debug_mode,
).to_pdf(
out_filename=output_file,
image_filename=None,
show_bounding_boxes=False if not debug_mode else True,
invisible_text=True if not debug_mode else False,
interword_spaces=True,
)
return output_file

Expand Down
10 changes: 0 additions & 10 deletions src/ocrmypdf/_validation.py
Expand Up @@ -28,7 +28,6 @@
OutputFileAccessError,
)
from ocrmypdf.helpers import is_file_writable, monotonic, safe_symlink
from ocrmypdf.hocrtransform import HOCR_OK_LANGS
from ocrmypdf.subprocess import check_external_program

# -------------
Expand Down Expand Up @@ -83,15 +82,6 @@ def check_options_languages(


def check_options_output(options: Namespace) -> None:
is_latin = set(options.languages).issubset(HOCR_OK_LANGS)

if options.pdf_renderer.startswith('hocr') and not is_latin:
log.warning(
"The 'hocr' PDF renderer is known to cause problems with one "
"or more of the languages in your document. Use "
"`--pdf-renderer auto` (the default) to avoid this issue."
)

if options.output_type == 'none' and options.output_file not in (os.devnull, '-'):
raise BadArgsError(
"Since you specified `--output-type none`, the output file "
Expand Down
2 changes: 0 additions & 2 deletions src/ocrmypdf/builtin_plugins/concurrency.py
Expand Up @@ -30,8 +30,6 @@
UserInit = Callable[[], None]
WorkerInit = Callable[[Queue, UserInit, int], None]

RichTqdmProgressAdapter = RichProgressBar # Deprecated shim; remove in OCRmyPDF 16


def log_listener(q: Queue):
"""Listen to the worker processes and forward the messages to logging.
Expand Down
4 changes: 2 additions & 2 deletions src/ocrmypdf/builtin_plugins/tesseract_ocr.py
Expand Up @@ -146,7 +146,7 @@ def check_options(options):

# Decide on what renderer to use
if options.pdf_renderer == 'auto':
options.pdf_renderer = 'sandwich'
options.pdf_renderer = 'hocr'

if not tesseract.has_thresholding() and options.tesseract_thresholding != 0:
log.warning(
Expand Down Expand Up @@ -216,7 +216,7 @@ def version():

@staticmethod
def creator_tag(options):
tag = '-PDF' if options.pdf_renderer == 'sandwich' else ''
tag = '-PDF' if options.pdf_renderer == 'sandwich' else 'hOCR'
return f"Tesseract OCR{tag} {TesseractOcrEngine.version()}"

def __str__(self):
Expand Down
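To see what the ``creator_tag`` change produces, the expression can be evaluated on its own. This is a standalone sketch: the free function and the version string ``"5.3.0"`` are placeholders chosen for illustration, whereas the real method reads the version from ``TesseractOcrEngine.version()``.

```python
# Standalone copy of the new creator_tag expression; "5.3.0" is a placeholder
# version string, not a value taken from the commit.
def creator_tag(pdf_renderer: str, version: str) -> str:
    tag = '-PDF' if pdf_renderer == 'sandwich' else 'hOCR'
    return f"Tesseract OCR{tag} {version}"


print(creator_tag('sandwich', '5.3.0'))
print(creator_tag('hocr', '5.3.0'))
```

Because the new ``'hOCR'`` value carries no leading hyphen, it is joined directly to ``Tesseract OCR`` in the second case, while the ``sandwich`` case keeps its ``-PDF`` suffix.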
Binary file added src/ocrmypdf/data/pdf.ttf
Binary file not shown.
