Highlights
Japanese and Chinese PDFs with vertical (tategaki / 直排) layout are now reconstructed into correct reading order — top-to-bottom, right-to-left columns recovered from glyph geometry with no new dependency. Dense multi-article 広報/magazine pages are split into articles, and decorative-font mojibake is filtered. Validated against a Japanese coherence corpus; Traditional Chinese works by the same geometry but isn't corpus-validated yet. Non-CJK PDFs skip this work via a cheap pre-gate — the writing-mode step is ~2.9× faster on a non-CJK document.
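As a rough illustration of the idea above, the sketch below shows a cheap non-CJK pre-gate followed by a geometry-only column sort. It is not the pdf-mcp implementation: the Glyph record, the has_cjk and vertical_reading_order helpers, the column_gap parameter, and the chosen Unicode ranges are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical glyph record: a character and the centre of its box."""
    char: str
    x: float  # horizontal centre, in page units
    y: float  # vertical centre, measured downward from the top of the page


def has_cjk(text: str) -> bool:
    """Cheap pre-gate: True if any character falls in a common CJK block."""
    for ch in text:
        cp = ord(ch)
        if (0x3040 <= cp <= 0x30FF          # Hiragana and Katakana
                or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
                or 0x4E00 <= cp <= 0x9FFF): # CJK Unified Ideographs
            return True
    return False


def vertical_reading_order(glyphs: list[Glyph], column_gap: float) -> str:
    """Group glyphs into columns by x, then read the columns right to left."""
    if not glyphs:
        return ""
    # Sweep from the rightmost glyph leftward and start a new column
    # whenever the x position jumps by more than column_gap.
    by_x = sorted(glyphs, key=lambda g: -g.x)
    columns = [[by_x[0]]]
    for g in by_x[1:]:
        if columns[-1][-1].x - g.x > column_gap:
            columns.append([g])
        else:
            columns[-1].append(g)
    # Within each column, read top to bottom (smallest y first).
    return "".join(
        g.char for col in columns for g in sorted(col, key=lambda g: g.y)
    )


page = [Glyph("語", 80, 10), Glyph("本", 100, 20), Glyph("日", 100, 10)]
text = "".join(g.char for g in page)
if has_cjk(text):  # Latin-only pages would skip the column sort entirely
    print(vertical_reading_order(page, column_gap=5))  # prints 日本語
```

A real page would also need article segmentation and mixed horizontal runs handled, which this sketch ignores.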
pdf_render_pages is more robust: dense pages auto-downsample to fit the transport budget, oversized pages fall back to a file reference instead of failing, and you can now render a high-DPI crop of a page region via clip. CJK keyword searches get an advisory steering you to semantic mode, where matching is more reliable.
Changes
Added
- Vertical-script (tategaki / 直排) reading-order reconstruction for Japanese and Chinese PDFs, with article segmentation on dense 広報/magazine pages and mojibake filtering.
- Non-CJK PDFs short-circuit the per-glyph layout parse — writing-mode detection ~2.9× faster on a non-CJK document (≈30% off full reading-order extraction on a 216-page Latin synthetic).
- cjk_keyword_warning on pdf_search for CJK queries, steering callers to mode='semantic'; new pdf-mcp[cjk] install extra (warn-only, results unchanged).
- pdf_render_pages transport byte budget: dense pages auto-downsample (render_downsampled), unfittable pages fall back to a full-res file reference (render_oversized_pages), and each image reports its render DPI in _meta.dpi.
- pdf_render_pages region rendering via optional clip=[x0,y0,x1,y1] (page fractions, 0..1) for high-DPI crops of dense pages.
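To make the fractional clip concrete, here is a minimal sketch of how a clip given as page fractions could map to absolute coordinates and pixel dimensions. The helper names clip_to_points and pixel_size are hypothetical, and the sketch assumes the fractions scale the page width and height directly and that page units are PDF points (72 per inch); the notes do not say how pdf-mcp does this internally.

```python
def clip_to_points(clip, page_width, page_height):
    """Map a fractional clip [x0, y0, x1, y1] to absolute page coordinates."""
    x0, y0, x1, y1 = clip
    if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
        raise ValueError("clip must satisfy 0 <= x0 < x1 <= 1 and 0 <= y0 < y1 <= 1")
    return (x0 * page_width, y0 * page_height, x1 * page_width, y1 * page_height)


def pixel_size(rect, dpi):
    """Pixel dimensions of a rectangle given in PDF points (72 per inch)."""
    x0, y0, x1, y1 = rect
    return (round((x1 - x0) * dpi / 72), round((y1 - y0) * dpi / 72))


# A US Letter page is 612 x 792 points; crop its top-right quarter.
rect = clip_to_points([0.5, 0.0, 1.0, 0.5], 612, 792)
print(rect)                   # (306.0, 0.0, 612.0, 396.0)
print(pixel_size(rect, 300))  # (1275, 1650)
```

At the same DPI the quarter-page crop carries a quarter of the pixels of the full page (2550 x 3300 at 300 DPI), which is why a crop can stay at high resolution where a whole dense page would have to be downsampled.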
Fixed
- Table detection no longer reports false-positive tables on dense prose pages; a detection is dropped only when its bounding box spans at least 80% of the page in both width and height (2 of 96 detected tables dropped, both confirmed false positives).
- Embedding vectors are now L2-normalized for all models, restoring the dot == cosine contract for semantic search; previously some models (e.g. intfloat/multilingual-e5-large) had inflated scores and low_confidence stuck at False. The default bge-small-en-v1.5 is unchanged.
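The two fixes above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: spans_most_of_page, l2_normalize, dot, and cosine are hypothetical helpers, and the 0.8 threshold simply mirrors the 80% figure quoted above.

```python
import math


def spans_most_of_page(bbox, page_width, page_height, threshold=0.8):
    """True when a bbox covers at least `threshold` of the page in BOTH axes."""
    x0, y0, x1, y1 = bbox
    return ((x1 - x0) / page_width >= threshold
            and (y1 - y0) / page_height >= threshold)


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors pass through)."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return list(vec)
    return [v / norm for v in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# A near-full-page "table" is dropped; a half-height one is kept.
print(spans_most_of_page((10, 10, 600, 780), 612, 792))  # True
print(spans_most_of_page((10, 10, 600, 300), 612, 792))  # False

# Raw dot products are inflated by vector length; after normalization
# the dot product equals the cosine similarity.
a, b = [3.0, 4.0], [4.0, 3.0]
print(dot(a, b))  # 24.0
print(math.isclose(dot(l2_normalize(a), l2_normalize(b)), cosine(a, b)))  # True
```

Any score threshold tuned for cosine similarity, such as one behind a low-confidence flag, only behaves as intended once the dot product is bounded to [-1, 1] in this way.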
Security
- Floor pydantic-settings>=2.14.2 to clear GHSA-4xgf-cpjx-pc3j.
Installation
pip install pdf-mcp==1.17.0