v1.17.0 — Vertical-script CJK reading order and smarter rendering

@jztan jztan released this 19 Jun 23:56

Highlights

Japanese and Chinese PDFs with vertical (tategaki / 直排) layout are now reconstructed into correct reading order — top-to-bottom, right-to-left columns recovered from glyph geometry with no new dependency. Dense multi-article 広報/magazine pages are split into articles, and decorative-font mojibake is filtered. Validated against a Japanese coherence corpus; Traditional Chinese works by the same geometry but isn't corpus-validated yet. Non-CJK PDFs skip this work via a cheap pre-gate — the writing-mode step is ~2.9× faster on a non-CJK document.
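
The column-ordering idea described above can be illustrated with a minimal sketch. This is not pdf-mcp's implementation: the `Glyph` type, the `vertical_reading_order` function, and the `column_gap` threshold are hypothetical names chosen for illustration, and real pages need far more care (mixed writing modes, ruby text, article boundaries).

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical glyph record: a character plus its centre point."""
    char: str
    x: float  # horizontal centre
    y: float  # vertical centre, growing downward


def vertical_reading_order(glyphs, column_gap=6.0):
    """Group glyphs into columns by x, then read columns right-to-left
    and each column top-to-bottom. column_gap is an assumed tolerance."""
    columns = []
    # Visit glyphs from the right edge leftward so columns are created
    # in reading order.
    for g in sorted(glyphs, key=lambda g: -g.x):
        if columns and abs(columns[-1][0].x - g.x) <= column_gap:
            columns[-1].append(g)  # same column as the previous glyph
        else:
            columns.append([g])    # start a new column further left
    return "".join(
        g.char for col in columns for g in sorted(col, key=lambda g: g.y)
    )


sample = [
    Glyph("で", 80.0, 30.0),
    Glyph("縦", 100.0, 10.0),
    Glyph("き", 80.5, 10.0),
    Glyph("書", 100.5, 30.0),
]
print(vertical_reading_order(sample))  # 縦書きで
```

The sketch relies only on glyph positions, which mirrors the "no new dependency" point: nothing beyond coordinates already present in the PDF is needed to recover the column order.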

pdf_render_pages is more robust: dense pages auto-downsample to fit the transport budget, oversized pages fall back to a file reference instead of failing, and you can now render a high-DPI crop of a page region via clip. CJK keyword searches get an advisory steering you to semantic mode, where matching is more reliable.
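
One way to picture the budget behaviour is the sketch below. It is an assumption-laden illustration, not the library's code: `plan_render`, the `min_dpi` floor, and the action labels are hypothetical, and it assumes encoded size scales roughly with pixel count (so with the square of the DPI), which real image encoders only approximate.

```python
import math


def plan_render(encoded_bytes, dpi, budget_bytes, min_dpi=72):
    """Decide how to deliver a rendered page under a byte budget.

    Returns (action, dpi). All names and the min_dpi floor are
    hypothetical; size is assumed to scale with dpi squared.
    """
    if encoded_bytes <= budget_bytes:
        return ("inline", dpi)
    # Shrink the DPI by the square root of the required size reduction.
    scaled = math.floor(dpi * math.sqrt(budget_bytes / encoded_bytes))
    if scaled >= min_dpi:
        return ("downsampled", scaled)
    # Too dense to fit legibly: hand back a file reference at full resolution.
    return ("file_reference", dpi)


print(plan_render(1_000_000, 200, 2_000_000))   # ('inline', 200)
print(plan_render(4_000_000, 200, 1_000_000))   # ('downsampled', 100)
print(plan_render(16_000_000, 200, 1_000_000))  # ('file_reference', 200)
```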

Changes

Added

  • Vertical-script (tategaki / 直排) reading-order reconstruction for Japanese and Chinese PDFs, with article segmentation on dense 広報/magazine pages and mojibake filtering.
  • Non-CJK PDFs short-circuit the per-glyph layout parse: writing-mode detection is ~2.9× faster on a non-CJK document (≈30% off full reading-order extraction on a 216-page synthetic Latin-script document).
  • cjk_keyword_warning on pdf_search for CJK queries, steering callers to mode='semantic' (warn-only; results are unchanged); new pdf-mcp[cjk] install extra.
  • pdf_render_pages transport byte budget: dense pages auto-downsample (render_downsampled), unfittable pages fall back to a full-res file reference (render_oversized_pages), and each image reports its render DPI in _meta.dpi.
  • pdf_render_pages region rendering via optional clip=[x0,y0,x1,y1] (page fractions, 0..1) for high-DPI crops of dense pages.
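
Since clip is given in page fractions, it can help to see how a fractional box maps onto an actual page. The helpers below are a hypothetical sketch (`clip_to_points` and `crop_pixels` are not part of pdf-mcp); they assume page sizes in PDF points (72 per inch) and that fractions are measured from the same origin corner the renderer uses.

```python
def clip_to_points(clip, page_width, page_height):
    """Convert clip=[x0, y0, x1, y1] in page fractions (0..1) to points."""
    x0, y0, x1, y1 = clip
    if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
        raise ValueError("clip must satisfy 0 <= x0 < x1 <= 1 and 0 <= y0 < y1 <= 1")
    return (x0 * page_width, y0 * page_height,
            x1 * page_width, y1 * page_height)


def crop_pixels(rect_points, dpi):
    """Pixel size of a rectangle given in points when rendered at dpi."""
    x0, y0, x1, y1 = rect_points
    return (round((x1 - x0) / 72 * dpi), round((y1 - y0) / 72 * dpi))


# Top-right quadrant of a US Letter page (612 x 792 points).
rect = clip_to_points([0.5, 0.0, 1.0, 0.5], 612, 792)
print(rect)                    # (306.0, 0.0, 612.0, 396.0)
print(crop_pixels(rect, 300))  # (1275, 1650)
```

Because the crop covers only a quarter of the page, it can be rendered at a much higher DPI for the same pixel count, which is the point of using clip on dense pages.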

Fixed

  • Table detection no longer reports false-positive tables on dense prose pages; a detection is dropped only when its bounding box spans at least 80% of the page in both width and height (2 of 96 detected tables dropped, both confirmed false positives).
  • Embedding vectors are now L2-normalized for all models, restoring the dot == cosine contract for semantic search; previously some models (e.g. intfloat/multilingual-e5-large) had inflated scores and low_confidence stuck at False. The default bge-small-en-v1.5 is unchanged.
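
The dot == cosine contract mentioned in the embedding fix follows directly from L2 normalization: once both vectors have unit length, their dot product is their cosine similarity. The helper names below are illustrative only and are not taken from pdf-mcp.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
raw = dot(a, b)                               # 24.0, inflated by vector length
unit = dot(l2_normalize(a), l2_normalize(b))  # approximately 0.96
print(raw, unit, cosine(a, b))
```

Without normalization, the raw dot product grows with vector magnitude, which would explain how scores from a model that emits unnormalized vectors could sit above any fixed confidence threshold.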

Security

Installation

pip install pdf-mcp==1.17.0

Links