v1.12.1

@jztan released this 12 May 13:45
· 278 commits to develop since this release

What's New in v1.12.1

Fixed

  • pdf_search total_matches in keyword mode could disagree with len(matches) after the 1.12.0 tokenisation fix. A multi-word query such as pgvector latency returned 4 matches with total_matches: 0, because the count still looked for the literal phrase, which appeared nowhere even though both tokens did. total_matches now equals len(matches) in every mode, and get_fts_page_counts counts token occurrences (not literal-phrase occurrences), so page_match_counts keeps its per-page intensity signal in keyword mode.
  • Heuristic section detector emitted body paragraphs that started with a heading-shaped prefix (e.g. "Section 2: This paragraph discusses ...") as the section title, because the regex fired on the prefix even when the rest of the line was prose. A stricter _looks_like_clean_heading shape check (≤120 chars, no mid-string . or ; ) now runs after the scored signals; candidates that fail it still produce a section boundary but with title: None.
  • pdf_search section-mode previously inferred title_source from cached PDF metadata at response time, which meant a section search called before pdf_info populated the metadata cache reported title_source: "heading_detected" for every match — even when derive_sections actually took the TOC path. title_source is now set at detection time on the Section dataclass and persisted on the FTS row, so the field is correct regardless of call order.
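
The heading shape check described above can be sketched roughly as follows. This is an illustration, not the project's actual _looks_like_clean_heading: the function names here are hypothetical, and it assumes "mid-string . or ;" means a period or semicolon followed by a space, so that numbered headings such as "2.1 Methods" still pass.

```python
from typing import Optional

MAX_HEADING_CHARS = 120  # limit quoted in the release note


def looks_like_clean_heading(line: str, max_len: int = MAX_HEADING_CHARS) -> bool:
    """Hypothetical shape check: short, and no sentence break inside the line."""
    text = line.strip()
    if not text or len(text) > max_len:
        return False
    # Assumption: ". " or "; " inside the line signals prose, not a heading.
    # A trailing "." has no following space after strip(), so it is allowed.
    return ". " not in text and "; " not in text


def section_title(candidate: str) -> Optional[str]:
    """Keep the section boundary either way, but drop the title if it looks like prose."""
    return candidate.strip() if looks_like_clean_heading(candidate) else None
```

With this sketch, "Section 2: Methods" keeps its title, while "Section 2: This paragraph discusses results. It then continues" yields a boundary with title None.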

Added

  • pdf_search hybrid-mode matches now carry per-match low_confidence (true when there's no keyword hit on the page AND the underlying semantic cosine is below confidence_threshold — pages with literal-term hits stay confident regardless of cosine) plus semantic_score, mirroring the semantic-mode flag added in 1.12.0. Response-level all_results_low_confidence and confidence_threshold are present in both modes. Matches are NOT dropped when low-confidence — agents decide whether to surface "couldn't find it but here's the closest" vs "couldn't find it."
  • pdf_search section-mode matches now carry a title_source field: "toc", "heading_detected", or null. Sections with title_source: null also have title: null so agents can show the page range without rendering a synthesised label.
  • Property test test_total_matches_equals_len_matches_property asserts the invariant len(matches) == total_matches across all modes × queries (including multi-word tokenised queries), so a future regression fails CI.
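
A minimal sketch of the hybrid-mode confidence rule, for readers who want to see the AND condition spelled out. The HybridMatch class, the annotate_hybrid function, and the keyword_hits field are hypothetical names for illustration; only low_confidence, semantic_score, confidence_threshold, all_results_low_confidence, and total_matches come from the notes above. Treating an empty result list as not "all low confidence" is also an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class HybridMatch:
    page: int
    keyword_hits: int      # hypothetical: literal-term hits on the page
    semantic_score: float  # underlying semantic cosine


def annotate_hybrid(matches: List[HybridMatch], confidence_threshold: float) -> Dict[str, Any]:
    annotated = []
    for m in matches:
        # Low confidence only when there is no keyword hit AND the cosine is weak.
        low = m.keyword_hits == 0 and m.semantic_score < confidence_threshold
        annotated.append(
            {"page": m.page, "semantic_score": m.semantic_score, "low_confidence": low}
        )
    return {
        "matches": annotated,  # low-confidence matches are kept, not dropped
        "total_matches": len(annotated),
        "confidence_threshold": confidence_threshold,
        # Assumption: an empty result list is reported as False.
        "all_results_low_confidence": bool(annotated)
        and all(a["low_confidence"] for a in annotated),
    }
```

An agent consuming this shape could check all_results_low_confidence first and fall back to a "closest pages" message rather than presenting the matches as answers.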

Changed

  • Semantic-mode all_low_confidence renamed to all_results_low_confidence for parity with the new hybrid-mode field.
  • New title_source UNINDEXED column on pdf_section_fts. Pre-1.12.1 section indexes are dropped and recreated on first launch (FTS5 does not support ALTER ADD COLUMN); sections re-index lazily on the next section-mode call per PDF.
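
Because an FTS5 table cannot gain a column in place, the migration amounts to a drop and recreate. The sketch below shows one way that could look with Python's sqlite3 module. The ensure_section_fts name and the pdf_path, title, and content columns are assumptions for illustration; only the pdf_section_fts table name and the title_source UNINDEXED column come from the notes above. It also assumes the local SQLite build includes FTS5.

```python
import sqlite3

# Hypothetical schema: only pdf_section_fts and title_source are from the notes.
CREATE_SQL = """
CREATE VIRTUAL TABLE IF NOT EXISTS pdf_section_fts USING fts5(
    pdf_path UNINDEXED,
    title,
    content,
    title_source UNINDEXED
)
"""


def ensure_section_fts(conn: sqlite3.Connection) -> bool:
    """Drop a pre-1.12.1 index if found, then create the table. True if dropped."""
    # PRAGMA table_info returns no rows when the table does not exist.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(pdf_section_fts)")}
    dropped = bool(columns) and "title_source" not in columns
    if dropped:
        # Old rows are discarded; sections would be re-indexed lazily later.
        conn.execute("DROP TABLE pdf_section_fts")
    conn.execute(CREATE_SQL)
    conn.commit()
    return dropped
```

Calling the function a second time is a no-op, so it can run on every launch without clearing an index that is already current.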

Docs

  • Browser demo (pages/index.html, served at pdf-mcp.jztan.com): search mock now tokenises queries (whitespace AND), counts token occurrences for page_match_counts, and sets total_matches = matches.length to mirror the server's 1.12.1 keyword path. Demo footer bumped to v0.4.
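
The keyword behaviour the demo now mirrors can be sketched in a few lines of Python (the demo itself is a browser page, so this is a re-expression, not its code). The keyword_search function and the in-memory pages mapping are hypothetical; the sketch assumes tokens are case-insensitive word matches, that a page matches only when every token appears on it, and that page_match_counts sums token occurrences.

```python
import re
from typing import Any, Dict, List


def keyword_search(pages: Dict[int, str], query: str) -> Dict[str, Any]:
    """Hypothetical tokenised search: whitespace AND, token-occurrence counts."""
    tokens = re.findall(r"\w+", query.lower())
    matches: List[Dict[str, int]] = []
    page_match_counts: Dict[int, int] = {}
    for page_no in sorted(pages):
        words = re.findall(r"\w+", pages[page_no].lower())
        counts = [words.count(tok) for tok in tokens]
        # AND semantics: every token must occur at least once on the page.
        if tokens and all(counts):
            matches.append({"page": page_no})
            page_match_counts[page_no] = sum(counts)
    return {
        "matches": matches,
        "total_matches": len(matches),  # invariant: never counted separately
        "page_match_counts": page_match_counts,
    }
```

Deriving total_matches from the match list, rather than from a second query, is what makes the invariant hold by construction even when the literal phrase never appears.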

Installation

pip install pdf-mcp==1.12.1

Links