Merge branch 'feature/modernhocr'

ocrmypdf · Dec 3, 2023 · 5b2f2e6
2 parents 2affa83 + 445617a
commit 5b2f2e6

Showing 122 changed files with 4,451 additions and 2,446 deletions.
diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
@@ -22,20 +22,18 @@ jobs:
     strategy:
       matrix:
         include:
-          - os: ubuntu-22.04
-            python: "3.9"
           - os: ubuntu-22.04
             python: "3.10"
           - os: ubuntu-22.04
             python: "3.11"
           - os: ubuntu-22.04
-            python: "3.9"
+            python: "3.10"
             tesseract5: true
           - os: ubuntu-latest
             python: "3.12"
             tesseract5: true
-            #- os: ubuntu-latest
-            #  python: "pypy3.9"
+          # - os: ubuntu-latest
+          #   python: "pypy3.10"
 
     env:
       OS: ${{ matrix.os }}
@@ -219,7 +217,7 @@ jobs:
       - uses: actions/setup-python@v4
         name: Setup Python
         with:
-          python-version: "3.9"
+          python-version: "3.10"
           cache: "pip"
 
       - name: Make wheels and sdist

diff --git a/.readthedocs.yaml b/.readthedocs.yaml
@@ -19,7 +19,7 @@ formats:
 build:
   os: ubuntu-22.04
   tools:
-    python: "3.9"
+    python: "3.10"
 
 python:
   install:

diff --git a/.reuse/dep5 b/.reuse/dep5
@@ -124,6 +124,14 @@ Copyright:  Kai-Uwe Behrmann <www.behrmann.name>
             ColorSolutions <www.basICColor.com>
 License: Zlib
 
+Files: src/ocrmypdf/data/pdf.ttf
+Copyright: (C) 2014 Ray Smith
+ (C) 2015 Ken Sharp
+ (C) 2016 James R. Barlow
+ (C) 2016 Jeff Breidenbach
+ (C) 2017 Zdenko Podobný
+License: Apache-2.0
+
 Files: tests/resources/3small.pdf
 Copyright: (C) 2014 Euskaldunaa
  (C) 2017 James R. Barlow

diff --git a/docs/advanced.rst b/docs/advanced.rst
@@ -239,46 +239,46 @@ rendering
 OCRmyPDF has these PDF renderers: ``sandwich`` and ``hocr``. The
 renderer may be selected using ``--pdf-renderer``. The default is
 ``auto`` which lets OCRmyPDF select the renderer to use. Currently,
-``auto`` always selects ``sandwich``.
+``auto`` always selects ``hocr``.
 
-The ``sandwich`` renderer
--------------------------
+The ``hocr`` renderer
+---------------------
 
-The ``sandwich`` renderer uses Tesseract's new text-only PDF feature,
-which produces a PDF page that lays out the OCR in invisible text. This
-page is then "sandwiched" onto the original PDF page, allowing lossless
-application of OCR even to PDF pages that contain other vector objects.
+.. versionchanged:: 16.0.0
 
-Currently this is the best renderer for most uses, however it is
-implemented in Tesseract so OCRmyPDF cannot influence it. Currently some
-problematic PDF viewers like Mozilla PDF.js and macOS Preview have
-problems with segmenting its text output, and
-mightrunseveralwordstogether.
+In both renderers, a text-only layer is rendered and sandwiched (overlaid)
+onto either the original PDF page or a newly rasterized version of the
+original PDF page (when ``--force-ocr`` is used). In this way, loss
+of PDF information is generally avoided. (You may need to disable PDF/A
+conversion and optimization to eliminate all lossy transformations.)
 
-When image preprocessing features like ``--deskew`` are used, the
-original PDF will be rendered as a full page and the OCR layer will be
-placed on top.
+The current approach used by the new hOCR renderer is a re-implementation
+of Tesseract's PDF renderer, using the same Glyphless font and general
+ideas, but fixing many technical issues that impeded it. The new ``hocr``
+renderer provides better text placement accuracy, avoids issues with word
+segmentation, and provides better positioning of skewed text.
 
-The ``hocr`` renderer
----------------------
+Using the experimental API, it is also possible to edit the OCR output
+from Tesseract, using any tool that is capable of editing hOCR files.
+
+Older versions of this renderer did not support non-Latin languages, but
+it is now universal.
 
-The ``hocr`` renderer works with older versions of Tesseract. The image
-layer is copied from the original PDF page if possible, avoiding
-potentially lossy transcoding or loss of other PDF information. If
-preprocessing is specified, then the image layer is a new PDF. (You may
-need to disable PDF/A conversion nad optimization to eliminate all
-lossy transformations.)
+The ``sandwich`` renderer
+-------------------------
 
-Unlike ``sandwich`` this renderer is implemented within OCRmyPDF; anyone
-looking to customize how OCR is presented should look here. A major
-disadvantage of this renderer is it not capable of correctly handling
-text outside the Latin alphabet (specifically, it supports the ISO 8859-1
-character set). Pull requests to improve the situation are welcome.
+The ``sandwich`` renderer uses Tesseract's text-only PDF feature,
+which produces a PDF page that lays out the OCR in invisible text.
 
-Currently, this renderer has the best compatibility with Mozilla's
-PDF.js viewer.
+Currently some problematic PDF viewers like Mozilla PDF.js and macOS
+Preview have problems with segmenting its text output, and
+mightrunseveralwordstogether. It also does not implement right to left
+fonts (Arabic, Hebrew, Persian). The output of this renderer cannot
+be edited. The sandwich renderer is retained for testing.
 
-This works in all versions of Tesseract.
+When image preprocessing features like ``--deskew`` are used, the
+original PDF will be rendered as a full page and the OCR layer will be
+placed on top.
 
 Rendering and rasterizing options
 =================================

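Reviewer note on the docs/advanced.rst hunk: since ``auto`` now resolves to ``hocr``, anyone who still wants the Tesseract-implemented renderer has to ask for it with ``--pdf-renderer sandwich``. The sketch below only assembles a command line and does not run OCRmyPDF. The helper name `build_ocrmypdf_command` and the file names are hypothetical; only the ``--pdf-renderer`` flag and the ``auto``, ``hocr`` and ``sandwich`` values come from the documentation above.

```python
from typing import List

# Renderer names taken from the documentation hunk above.
KNOWN_RENDERERS = ("auto", "hocr", "sandwich")


def build_ocrmypdf_command(
    input_pdf: str, output_pdf: str, renderer: str = "auto"
) -> List[str]:
    """Return an argv list for the ocrmypdf CLI (hypothetical helper)."""
    if renderer not in KNOWN_RENDERERS:
        raise ValueError(f"unknown renderer: {renderer!r}")
    return ["ocrmypdf", "--pdf-renderer", renderer, input_pdf, output_pdf]


if __name__ == "__main__":
    # Explicitly request the older renderer instead of relying on "auto".
    print(build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf", "sandwich"))
```

The returned list could be passed to `subprocess.run` on a machine where OCRmyPDF is installed.
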
diff --git a/docs/installation.rst b/docs/installation.rst
@@ -454,7 +454,7 @@ Cygwin64
 
 First install the the following prerequisite Cygwin packages using ``setup-x86_64.exe``::
 
-    python39 (or later)
+    python310 (or later)
     python3?-devel
     python3?-pip
     python3?-lxml
@@ -552,7 +552,7 @@ manager. ``pip`` cannot provide them.
 
 The following versions are required:
 
--  Python 3.9 or newer
+-  Python 3.10 or newer
 -  Ghostscript 9.55 or newer
 -  Tesseract 4.1.1 or newer
 -  jbig2enc 0.29 or newer
@@ -588,7 +588,7 @@ unfortunately, the ``pip install`` command cannot satisfy all of them.
 Installing HEAD revision from sources
 =====================================
 
-If you have ``git`` and Python 3.9 or newer installed, you can install
+If you have ``git`` and Python 3.10 or newer installed, you can install
 from source. When the ``pip`` installer runs, it will alert you if
 dependencies are missing.
 

diff --git a/pyproject.toml b/pyproject.toml
@@ -10,18 +10,16 @@ dynamic = ["version"]
 description = "OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched"
 readme = "README.md"
 license = { text = "MPL-2.0" }
-requires-python = ">=3.9"
+requires-python = ">=3.10"
 dependencies = [
   "Pillow>=10.0.1",
   "deprecation>=2.1.0",
   "img2pdf>=0.4.4",
   "packaging>=20",
   "pdfminer.six>=20220319",
-  "pikepdf>=8.7.1",
+  "pikepdf>=8.8.0",
   "pluggy>=0.13.0",
-  "reportlab>=3.6.8",
   "rich>=13",
-  "typing-extensions>=4;python_version<'3.10'",
 ]
 authors = [{ name = "James R. Barlow", email = "james@purplerock.ca" }]
 classifiers = [
@@ -58,6 +56,7 @@ test = [
   "pytest-cov>=3.0.0",
   "pytest-xdist>=2.5.0",
   "python-xmp-toolkit==2.0.1", # also requires apt-get install libexempi3
+  "reportlab>=3.6.8",
   "types-Pillow",
   "types-humanfriendly",
 ]

diff --git a/src/ocrmypdf/_concurrent.py b/src/ocrmypdf/_concurrent.py
@@ -8,7 +8,7 @@
 import threading
 from abc import ABC, abstractmethod
 from collections.abc import Iterable
-from typing import Callable, TypeVar
+from typing import Any, Callable, TypeVar
 
 from ocrmypdf._progressbar import NullProgressBar, ProgressBar
 
@@ -19,6 +19,10 @@ def _task_noop(*_args, **_kwargs):
     return
 
 
+def _task_finished_noop(_result: Any, pbar: ProgressBar):
+    pbar.update()
+
+
 class Executor(ABC):
     """Abstract concurrent executor."""
 
@@ -66,7 +70,7 @@ def __call__(
         if not worker_initializer:
             worker_initializer = _task_noop
         if not task_finished:
-            task_finished = _task_noop
+            task_finished = _task_finished_noop
         if not task:
             task = _task_noop
 

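Reviewer note on the _concurrent.py hunk: the new default ``task_finished`` callback advances the progress bar, whereas the old default (``_task_noop``) ignored it. The following is a minimal sketch of that difference, using hypothetical stand-ins (`CountingBar`, `run_tasks`) rather than OCRmyPDF's real `Executor` and `ProgressBar` classes.

```python
from typing import Any, Callable, Iterable, Optional


class CountingBar:
    """Hypothetical stand-in for a progress bar; it only counts update() calls."""

    def __init__(self) -> None:
        self.completed = 0

    def update(self) -> None:
        self.completed += 1


def task_noop(*_args, **_kwargs) -> None:
    """Old-style default: accepts anything, does nothing."""
    return


def task_finished_noop(_result: Any, pbar: CountingBar) -> None:
    """New-style default: discards the result but still advances the bar."""
    pbar.update()


def run_tasks(
    items: Iterable[Any],
    task: Callable[[Any], Any],
    task_finished: Optional[Callable[[Any, CountingBar], None]] = None,
    default: Callable[..., None] = task_finished_noop,
) -> CountingBar:
    """Run tasks serially, calling the completion callback after each one."""
    pbar = CountingBar()
    callback = task_finished or default
    for item in items:
        callback(task(item), pbar)
    return pbar


if __name__ == "__main__":
    print(run_tasks(range(3), str).completed)
    print(run_tasks(range(3), str, default=task_noop).completed)
```

With the new-style default the bar reaches the number of items even when the caller supplies no callback; with the old-style default it never moves.
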
diff --git a/src/ocrmypdf/_exec/unpaper.py b/src/ocrmypdf/_exec/unpaper.py
@@ -14,6 +14,7 @@
 from decimal import Decimal
 from pathlib import Path
 from subprocess import PIPE, STDOUT
+from tempfile import TemporaryDirectory
 from typing import Union
 
 from packaging.version import Version
@@ -26,27 +27,6 @@
 # https://github.com/Flameeyes/unpaper/blob/main/doc/basic-concepts.md
 
 
-if sys.version_info >= (3, 10):
-    from tempfile import TemporaryDirectory
-else:
-    from tempfile import TemporaryDirectory as _TemporaryDirectory
-
-    class TemporaryDirectory(_TemporaryDirectory):
-        """Shim to consume ignore_cleanup_errors kwarg on Python 3.9 and older.
-
-        The argument is consumed without action. If users are getting errors related
-        to temporary file cleanup, they should upgrade to Python 3.10 which properly
-        cleans up temporary directories on Windows.
-
-        See: https://github.com/python/cpython/pull/24793
-        """
-
-        def __init__(self, ignore_cleanup_errors=False, **kwargs):
-            super().__init__(**kwargs)
-
-    del _TemporaryDirectory
-
-
 UNPAPER_IMAGE_PIXEL_LIMIT = 256 * 1024 * 1024
 
 DecFloat = Union[Decimal, float]

diff --git a/src/ocrmypdf/_graft.py b/src/ocrmypdf/_graft.py
@@ -313,7 +313,7 @@ def _graft_text_layer(
                 strip_invisible_text(self.pdf_base, base_page)
 
             base_page.contents_add(
-                new_text_layer, prepend=self.render_mode == RenderMode.UNDERNEATH
+                new_text_layer, prepend=self.render_mode == RenderMode.ON_TOP
             )
 
             _update_resources(obj=base_page.obj, font=font, font_key=font_key)
diff --git a/src/ocrmypdf/_pipeline.py b/src/ocrmypdf/_pipeline.py
@@ -743,13 +743,13 @@ def render_hocr_page(hocr: Path, page_context: PageContext) -> Path:
     dpi = get_page_square_dpi(page_context, calculate_image_dpi(page_context))
     debug_mode = options.pdf_renderer == 'hocrdebug'
 
-    hocrtransform = HocrTransform(hocr_filename=hocr, dpi=dpi.to_scalar())  # square
-    hocrtransform.to_pdf(
+    HocrTransform(
+        hocr_filename=hocr,
+        dpi=dpi.to_scalar(),  # square
+        debug=debug_mode,
+    ).to_pdf(
         out_filename=output_file,
         image_filename=None,
-        show_bounding_boxes=False if not debug_mode else True,
-        invisible_text=True if not debug_mode else False,
-        interword_spaces=True,
     )
     return output_file
 

diff --git a/src/ocrmypdf/_validation.py b/src/ocrmypdf/_validation.py
@@ -28,7 +28,6 @@
     OutputFileAccessError,
 )
 from ocrmypdf.helpers import is_file_writable, monotonic, safe_symlink
-from ocrmypdf.hocrtransform import HOCR_OK_LANGS
 from ocrmypdf.subprocess import check_external_program
 
 # -------------
@@ -83,15 +82,6 @@ def check_options_languages(
 
 
 def check_options_output(options: Namespace) -> None:
-    is_latin = set(options.languages).issubset(HOCR_OK_LANGS)
-
-    if options.pdf_renderer.startswith('hocr') and not is_latin:
-        log.warning(
-            "The 'hocr' PDF renderer is known to cause problems with one "
-            "or more of the languages in your document.  Use "
-            "`--pdf-renderer auto` (the default) to avoid this issue."
-        )
-
     if options.output_type == 'none' and options.output_file not in (os.devnull, '-'):
         raise BadArgsError(
             "Since you specified `--output-type none`, the output file "

diff --git a/src/ocrmypdf/builtin_plugins/concurrency.py b/src/ocrmypdf/builtin_plugins/concurrency.py
@@ -30,8 +30,6 @@
 UserInit = Callable[[], None]
 WorkerInit = Callable[[Queue, UserInit, int], None]
 
-RichTqdmProgressAdapter = RichProgressBar  # Deprecated shim; remove in OCRmyPDF 16
-
 
 def log_listener(q: Queue):
     """Listen to the worker processes and forward the messages to logging.

diff --git a/src/ocrmypdf/builtin_plugins/tesseract_ocr.py b/src/ocrmypdf/builtin_plugins/tesseract_ocr.py
@@ -146,7 +146,7 @@ def check_options(options):
 
     # Decide on what renderer to use
     if options.pdf_renderer == 'auto':
-        options.pdf_renderer = 'sandwich'
+        options.pdf_renderer = 'hocr'
 
     if not tesseract.has_thresholding() and options.tesseract_thresholding != 0:
         log.warning(
@@ -216,7 +216,7 @@ def version():
 
     @staticmethod
     def creator_tag(options):
-        tag = '-PDF' if options.pdf_renderer == 'sandwich' else ''
+        tag = '-PDF' if options.pdf_renderer == 'sandwich' else '-hOCR'
         return f"Tesseract OCR{tag} {TesseractOcrEngine.version()}"
 
     def __str__(self):

diff --git a/src/ocrmypdf/data/pdf.ttf b/src/ocrmypdf/data/pdf.ttf