Release pydocs-mcp 0.3.0 · msobroza/pydocs-mcp

pip install pydocs-mcp

- PageIndex node enrichment: each LLM-visible tree node now carries its real
  signature (params, type hints, and return annotation), its decorators, and a
  docstring excerpt, in addition to the generated summary. Tunable via
  doc_excerpt (sections | full | off) and doc_excerpt_max_chars. A
  non-destructive schema auto-refresh (v9) re-extracts the metadata on the next
  index run without re-embedding unchanged chunks.
- Token-counted tree budget: the serialized tree handed to the LLM is now
  bounded in real tiktoken tokens. It was previously bounded in
  whitespace-separated words, which under-counted code by roughly 3× and could
  overflow the model's context window, failing with a 400
  context_length_exceeded error. max_tree_words is replaced by max_tree_tokens
  (int | None; None auto-derives the budget from the configured model's context
  window). Over-budget pruning is content-first: per-node doc excerpts are
  dropped before whole nodes. Adds tiktoken as a runtime dependency.
- BM25 → tree two-stage rerank: an opt-in rerank_candidates mode on the
  llm_tree_reasoning step scopes the LLM-visible tree to a prior BM25/dense
  candidate set and writes its ranked picks back as the pipeline's final
  ranking (with a repoqa_bm25_tree_rerank benchmark config).
- chunks.qualified_name is now persisted (schema v7) so that tree-reasoning
  picks resolve to the correct chunks.
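
The content-first pruning described above can be sketched roughly as follows. This is an illustration only, not the project's implementation: Node, prune_to_budget, and the word_count stand-in are hypothetical names, and the choice to prune from the end of the list first is an assumption. The counter is pluggable so that a real token counter can be passed in.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for one LLM-visible tree node."""
    name: str
    summary: str
    doc_excerpt: Optional[str] = None


def word_count(text: str) -> int:
    # Stand-in counter so the sketch has no third-party dependency.
    # The release notes say the real budget is measured in tiktoken tokens.
    return len(text.split())


def node_cost(node: Node, count: Callable[[str], int]) -> int:
    return count(f"{node.name} {node.summary} {node.doc_excerpt or ''}")


def prune_to_budget(
    nodes: List[Node],
    max_tokens: int,
    count: Callable[[str], int] = word_count,
) -> List[Node]:
    kept = list(nodes)

    def total() -> int:
        return sum(node_cost(n, count) for n in kept)

    # Stage 1 (content-first): strip doc excerpts, last node first (assumed order).
    for i in range(len(kept) - 1, -1, -1):
        if total() <= max_tokens:
            return kept
        if kept[i].doc_excerpt is not None:
            kept[i] = replace(kept[i], doc_excerpt=None)

    # Stage 2: only if still over budget, drop whole nodes from the end.
    while kept and total() > max_tokens:
        kept.pop()
    return kept


nodes = [
    Node("a", "sum a", "one two three"),
    Node("b", "sum b", "four five six"),
]
pruned = prune_to_budget(nodes, max_tokens=9)
```

With tiktoken installed, a counter along the lines of `lambda s: len(enc.encode(s))`, where `enc = tiktoken.get_encoding(...)`, could be passed as `count`; which encoding the project uses is not stated here.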

sentence_transformers embedding provider (provider: sentence_transformers)
serving Qwen/Qwen3-Embedding-0.6B and other SentenceTransformer models via
torch — a GPU-reliable on-device dense embedder (torch frees CUDA memory
between sequential index-builds). Opt-in via the [sentence-transformers]
extra. New EmbeddingConfig knobs max_seq_length / normalize /
query_prompt_name (the first two fold into the pipeline hash; the
query-only prompt does not).
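
One way to picture which knobs fold into the pipeline hash is the sketch below. It is hypothetical: EmbeddingConfigSketch, HASH_EXCLUDED, and pipeline_hash are illustrative names, and the use of JSON plus SHA-256 is an assumption rather than the project's actual hashing scheme. The likely reasoning is that a query-only prompt changes how queries are embedded but not the stored document vectors, so it should not invalidate an existing index.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class EmbeddingConfigSketch:
    """Hypothetical mirror of the knobs named in the release notes."""
    provider: str = "sentence_transformers"
    model: str = "Qwen/Qwen3-Embedding-0.6B"
    max_seq_length: Optional[int] = None
    normalize: bool = True
    query_prompt_name: Optional[str] = None


# Fields assumed to affect only query-time behaviour, so they stay out of the hash.
HASH_EXCLUDED = frozenset({"query_prompt_name"})


def pipeline_hash(cfg: EmbeddingConfigSketch) -> str:
    # Serialize the hash-relevant fields deterministically, then digest them.
    payload = {k: v for k, v in asdict(cfg).items() if k not in HASH_EXCLUDED}
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


base = EmbeddingConfigSketch()
same_index = pipeline_hash(replace(base, query_prompt_name="query")) == pipeline_hash(base)
new_index = pipeline_hash(replace(base, normalize=False)) != pipeline_hash(base)
```

Under these assumptions, changing query_prompt_name reuses the existing index, while changing normalize or max_seq_length yields a different hash and therefore a rebuild.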

The onnx embedding provider (OnnxEmbedder and the onnx_file /
query_instruction config fields) is superseded: the torch-backed
sentence_transformers provider replaces it for on-device Qwen3-Embedding,
because onnxruntime leaked the CUDA arena across the benchmark's sequential
index-builds.

- --gpu flag on serve, index, and watch (and the benchmark runner) to run all
  embedder inference (FastEmbed, the sentence_transformers provider, and
  PyLate late-interaction) on CUDA. No YAML change is needed, and the flag
  covers both index-time and query-time embedding. The execution device is
  excluded from the pipeline / index-cache hash, so toggling --gpu shares the
  same .tq / fast-plaid index and never forces a re-index (it is a latency
  knob, not a quality change).
- EmbeddingConfig.device (cpu / cuda) is wired through build_embedder into the
  FastEmbed and sentence_transformers embedders;
  AppConfig.with_device(gpu=...) stamps the device after config load. GPU
  runtimes (onnxruntime-gpu, fastembed-gpu, CUDA torch) are documented in
  INSTALL.md and are not auto-installed.
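
The device stamping described above might look roughly like the following. Everything here is a hypothetical sketch: AppConfigSketch, EmbeddingConfigSketch, index_cache_key, and parse_gpu_flag are illustrative stand-ins, and only the with_device(gpu=...) call shape and the cpu / cuda values come from the notes.

```python
import argparse
from dataclasses import dataclass, field, replace
from typing import List


@dataclass(frozen=True)
class EmbeddingConfigSketch:
    model: str = "Qwen/Qwen3-Embedding-0.6B"
    device: str = "cpu"  # "cpu" or "cuda"


@dataclass(frozen=True)
class AppConfigSketch:
    embedding: EmbeddingConfigSketch = field(default_factory=EmbeddingConfigSketch)

    def with_device(self, gpu: bool) -> "AppConfigSketch":
        # Return a copy with the device stamped; the loaded config is untouched.
        device = "cuda" if gpu else "cpu"
        return replace(self, embedding=replace(self.embedding, device=device))


def index_cache_key(cfg: AppConfigSketch) -> str:
    # The device is deliberately left out, so CPU and GPU runs share one index.
    return f"model={cfg.embedding.model}"


def parse_gpu_flag(argv: List[str]) -> bool:
    # A boolean flag like --gpu is commonly declared with store_true.
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpu", action="store_true")
    return parser.parse_args(argv).gpu


loaded = AppConfigSketch()  # stands in for the config after YAML load
stamped = loaded.with_device(gpu=parse_gpu_flag(["--gpu"]))
```

Because the stamp happens after config load and the cache key ignores it, the same YAML and the same on-disk index serve both CPU and GPU runs in this sketch.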
