--output-type auto (the default) again produces PDF/A whenever it can,
matching OCRmyPDF 16's "PDF/A by default" behavior. It first tries the fast
Ghostscript-free conversion (validated by veraPDF when available) and now
falls back to Ghostscript when that cannot produce PDF/A, only emitting a
regular PDF when even Ghostscript cannot safely convert (for example, an
input with non-embedded CID/CJK fonts, per {issue}1561). A consequence is
that the default path may once again invoke Ghostscript, which is slower and
may transcode images; use --output-type pdf to skip PDF/A conversion
entirely.
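The fallback order described above can be sketched as a small function. This is only an illustration under stated assumptions: `convert_auto` and the converter callables are hypothetical names, not OCRmyPDF's internals.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical converter signature (an assumption for this sketch): returns
# the path of the PDF/A it wrote, or None when it cannot safely produce PDF/A.
Converter = Callable[[str], Optional[str]]


def convert_auto(
    input_pdf: str, converters: Sequence[Tuple[str, Converter]]
) -> Tuple[str, Optional[str]]:
    """Try each PDF/A converter in order; fall back to a regular PDF.

    Returns (output_kind, converter_name), where output_kind is "pdfa" or
    "pdf" and converter_name is None when no converter succeeded.
    """
    for name, convert in converters:
        if convert(input_pdf) is not None:
            return ("pdfa", name)
    # No converter could produce PDF/A, so a regular PDF is emitted instead.
    return ("pdf", None)
```

Given a fast converter first and a Ghostscript-based one second, the slower step is reached only when the first returns None.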
Fixed detection of veraPDF 1.30.0 and newer: recent builds print JVM
warnings before their version string, which caused OCRmyPDF to report
veraPDF as unavailable and skip the fast PDF/A path.
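One way to make version detection robust to leading warnings is to search the whole output for the version pattern instead of parsing only the first line. The sketch below assumes the version appears as "veraPDF" followed by a dotted number; the sample warning text and the name `parse_verapdf_version` are hypothetical and not taken from OCRmyPDF or veraPDF.

```python
import re
from typing import Optional

# Hypothetical sample output; the exact warning wording is an assumption.
SAMPLE_OUTPUT = """\
WARNING: A restricted method in java.lang.System has been called
WARNING: Use --enable-native-access=ALL-UNNAMED to avoid a warning
veraPDF 1.30.0
"""

_VERSION_RE = re.compile(r"veraPDF\s+(\d+(?:\.\d+)+)", re.IGNORECASE)


def parse_verapdf_version(output: str) -> Optional[str]:
    """Return the first veraPDF version number found anywhere in output."""
    match = _VERSION_RE.search(output)
    return match.group(1) if match else None
```

Because the search is not anchored to the start of the output, any number of warning lines can precede the version string.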
OCRmyPDF no longer silently corrupts a non-embedded CID (CJK) text layer when
producing PDF/A ({issue}1561). PDF/A requires all fonts to be embedded, so
Ghostscript substitutes and re-embeds non-embedded CID fonts — such as the OCR
text layer Adobe Acrobat adds to scanned CJK documents — which mangles the
text and destroys searchability. OCRmyPDF now detects non-embedded CID fonts
before conversion: with --output-type auto (the default) it produces a
regular PDF and preserves the existing text layer, and with an explicit
--output-type pdfa* it stops with an error rather than emitting corrupted
output. Use --output-type pdf to keep the text layer, or --force-ocr to
rebuild it with embedded fonts.
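The decision described above can be summarized as a small function. This is a sketch under stated assumptions: `resolve_output_type` and `NonEmbeddedCIDFontError` are hypothetical names chosen for illustration, and the font detection itself is reduced to a boolean input.

```python
class NonEmbeddedCIDFontError(Exception):
    """Hypothetical error for an explicit PDF/A request that cannot be honored."""


def resolve_output_type(requested: str, has_nonembedded_cid_font: bool) -> str:
    """Return the output type to use once the font check has run."""
    if not has_nonembedded_cid_font:
        return requested  # nothing to protect; proceed as requested
    if requested == "auto":
        return "pdf"  # keep the existing text layer in a regular PDF
    if requested.startswith("pdfa"):
        # An explicit PDF/A request stops instead of writing corrupted output.
        raise NonEmbeddedCIDFontError(
            "input has non-embedded CID fonts; use --output-type pdf "
            "or --force-ocr"
        )
    return requested  # e.g. "pdf": no PDF/A conversion takes place
```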
Writing the output PDF to standard output (ocrmypdf input.pdf -) is now
protected against corruption at the operating system level. Previously,
OCRmyPDF assumed that no in-process code (third-party libraries, plugins, or
stray print() calls) would ever write to stdout; a single accidental write
would silently corrupt the PDF. The command line program now saves the real
stdout at startup, before plugins are loaded or any worker process/thread is
started, and redirects file descriptor 1 to stderr, so that only OCRmyPDF's
final PDF output can reach stdout. A consequence is that a plugin which
intentionally prints to stdout will have that output redirected to stderr.
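The descriptor-level technique described above can be illustrated with `os.dup` and `os.dup2`. This is a minimal sketch, not OCRmyPDF's implementation: `redirect_fd` and `write_final_output` are hypothetical helper names, and the descriptors are parameters so the idea can be tried on pipes without touching the real stdout.

```python
import os


def redirect_fd(protected_fd: int, sink_fd: int) -> int:
    """Point protected_fd at sink_fd's target and return a saved duplicate.

    The returned descriptor is a private handle on whatever protected_fd
    originally referred to.
    """
    saved_fd = os.dup(protected_fd)   # keep a handle on the original target
    os.dup2(sink_fd, protected_fd)    # stray writes now land on the sink
    return saved_fd


def write_final_output(saved_fd: int, data: bytes) -> None:
    """Write the finished document to the saved descriptor, then close it."""
    with os.fdopen(saved_fd, "wb") as real_out:
        real_out.write(data)
```

For the command-line case this would correspond to calling `redirect_fd(1, 2)` at startup, after flushing `sys.stdout`, and calling `write_final_output` only once the PDF is complete.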
Added the public API function {func}ocrmypdf.configure_stdout_protection,
which installs this same protection. Like {func}ocrmypdf.configure_logging,
it is optional and intended for callers that want command-line-like behavior;
applications that manage their own standard output should not call it.
Fixed an uncaught UnicodeDecodeError when processing a PDF whose /DocumentInfo dictionary contains a /Name key encoded in Latin-1 (or
another non-UTF-8 encoding), such as /Saks#e5r. repair_docinfo_nuls now
treats such a block as malformed, logs a message, and continues instead of
crashing the pipeline ({issue}1540). Current pikepdf releases tolerate these
keys by surrogate-escaping them, but older versions raised while iterating the
dictionary.
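To see why such a key fails to decode, note that in a PDF name `#e5` is a hex escape for the single byte 0xE5, which is not valid UTF-8 when followed by an ASCII letter. The sketch below uses only the Python standard library; the helper names are hypothetical and it does not reproduce repair_docinfo_nuls or pikepdf's code.

```python
import re

_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def unescape_pdf_name(raw: bytes) -> bytes:
    """Expand #xx hex escapes in a PDF name token such as b'/Saks#e5r'."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


def decode_name_tolerantly(raw: bytes) -> str:
    """Decode a PDF name as UTF-8, surrogate-escaping any invalid bytes."""
    return unescape_pdf_name(raw).decode("utf-8", errors="surrogateescape")
```

A strict `decode("utf-8")` on the unescaped bytes raises UnicodeDecodeError, while the surrogate-escaping variant returns a string that can still be encoded back to the original bytes.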