pydocs-mcp 0.3.0
Published to PyPI: https://pypi.org/project/pydocs-mcp/0.3.0/
pip install pydocs-mcp
Added (LLM tree-reasoning: enrichment, token budget, two-stage rerank)
- PageIndex node enrichment: each LLM-visible tree node now carries its real
signature (params + type hints + return annotation), its decorators, and a
docstring excerpt, in addition to the generated summary. Tunable via doc_excerpt
(sections|full|off) and doc_excerpt_max_chars. A non-destructive
schema auto-refresh (v9) re-extracts the metadata on next index without
re-embedding unchanged chunks.
- Token-counted tree budget: the serialized tree handed to the LLM is bounded
in real tiktoken tokens (previously whitespace words, which under-counted code
~3× and could overflow the model's context window with a 400
context_length_exceeded). max_tree_words → max_tree_tokens
(int | None; None auto-derives from the configured model's context window).
Over-budget pruning is content-first: drop per-node doc excerpts before whole
nodes. Adds tiktoken as a runtime dependency.
- BM25 → tree two-stage rerank: opt-in
rerank_candidates mode on the
llm_tree_reasoning step scopes the LLM-visible tree to a prior BM25/dense
candidate set and writes its ranked picks back as the pipeline's final ranking
(with a repoqa_bm25_tree_rerank benchmark config).
- Persist
chunks.qualified_name (schema v7) so tree-reasoning picks resolve to
the correct chunks.
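The content-first pruning rule described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: Node, render, prune_content_first, and approx_tokens are hypothetical names, and approx_tokens is a regex stand-in for a real tokenizer. With tiktoken, the counter could instead be len(enc.encode(text)) for an encoding obtained from tiktoken.get_encoding(...); which encoding pydocs-mcp uses is not stated here.

```python
import re
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for one LLM-visible tree node."""
    qualified_name: str
    signature: str
    doc_excerpt: Optional[str] = None


def approx_tokens(text: str) -> int:
    """Stand-in counter: words and punctuation marks count as one token each.

    Assumption: a real setup would count tokenizer tokens instead, e.g.
    len(tiktoken.get_encoding("cl100k_base").encode(text)).
    """
    return len(re.findall(r"\w+|[^\w\s]", text))


def render(node: Node) -> str:
    """Serialize a node as it might be shown to the LLM."""
    parts = [node.qualified_name, node.signature]
    if node.doc_excerpt:
        parts.append(node.doc_excerpt)
    return "\n".join(parts)


def tree_cost(nodes: List[Node], count_tokens: Callable[[str], int]) -> int:
    return sum(count_tokens(render(n)) for n in nodes)


def prune_content_first(
    nodes: List[Node],
    max_tokens: int,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> List[Node]:
    """Drop doc excerpts first; drop whole nodes only if still over budget."""
    kept = list(nodes)
    # Pass 1: strip excerpts, starting from the last (assumed least important) node.
    for i in range(len(kept) - 1, -1, -1):
        if tree_cost(kept, count_tokens) <= max_tokens:
            return kept
        if kept[i].doc_excerpt is not None:
            kept[i] = replace(kept[i], doc_excerpt=None)
    # Pass 2: excerpts are gone; remove whole nodes from the end.
    while kept and tree_cost(kept, count_tokens) > max_tokens:
        kept.pop()
    return kept
```

The ordering of nodes by importance (last node pruned first) is an assumption of this sketch; the release notes only state that excerpts are dropped before whole nodes.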
Added (on-device dense embeddings)
- sentence_transformers embedding provider (provider: sentence_transformers)
serving Qwen/Qwen3-Embedding-0.6B and other SentenceTransformer models via
torch: a GPU-reliable on-device dense embedder (torch frees CUDA memory
between sequential index-builds). Opt-in via the [sentence-transformers]
extra. New EmbeddingConfig knobs max_seq_length / normalize /
query_prompt_name (the first two fold into the pipeline hash; the
query-only prompt does not).
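For orientation, the three knobs plausibly correspond to options of the sentence-transformers library itself. The sketch below calls that library directly and is an assumption about the mapping, not pydocs-mcp's code: embed_queries and its default values are hypothetical, and it needs the sentence-transformers package plus a model download to actually run.

```python
from typing import Optional, Sequence


def embed_queries(
    texts: Sequence[str],
    model_name: str = "Qwen/Qwen3-Embedding-0.6B",
    max_seq_length: Optional[int] = None,
    normalize: bool = True,
    query_prompt_name: Optional[str] = "query",
    device: str = "cpu",
):
    """Embed query strings with a SentenceTransformer model (illustrative only)."""
    # Imported lazily so that defining this function does not require torch.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name, device=device)
    if max_seq_length is not None:
        # Caps how many tokens of each input the model reads.
        model.max_seq_length = max_seq_length
    # prompt_name selects a prompt stored in the model's configuration;
    # it is applied on the query side only, documents are encoded without it.
    return model.encode(
        list(texts),
        prompt_name=query_prompt_name,
        normalize_embeddings=normalize,
    )
```

Because the prompt only changes how queries are encoded, it leaves the stored document vectors untouched, which is consistent with it being left out of the pipeline hash.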
Removed
- The onnx embedding provider (OnnxEmbedder and the onnx_file /
query_instruction config fields). The torch-backed sentence_transformers
provider replaces it for on-device Qwen3-Embedding because onnxruntime leaked the
CUDA arena across the benchmark's sequential index-builds.
Added (GPU inference)
- --gpu flag on serve, index, and watch (and the benchmark runner)
to run all embedder inference (FastEmbed, the sentence_transformers
provider, and PyLate late-interaction) on CUDA. No YAML change; covers both
index-time and query-time embedding. The execution device is excluded from the
pipeline / index-cache hash, so toggling --gpu shares the same .tq /
fast-plaid index and never forces a re-index (it is a latency knob, not a
quality change).
- EmbeddingConfig.device (cpu/cuda) wiring through build_embedder
into the FastEmbed and sentence_transformers embedders;
AppConfig.with_device(gpu=...) stamps the device after config load. GPU
runtimes (onnxruntime-gpu, fastembed-gpu, CUDA torch) are documented in
INSTALL.md, not auto-installed.
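One way to picture why toggling the device never invalidates an index is a cache key computed only from fields that change the stored vectors. The following is a sketch under assumptions: EmbeddingSettings, HASH_EXCLUDED, and index_cache_key are hypothetical stand-ins, and the real EmbeddingConfig and hash scheme may differ.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from typing import Optional

# Fields assumed not to affect what gets written to the index.
HASH_EXCLUDED = frozenset({"device", "query_prompt_name"})


@dataclass(frozen=True)
class EmbeddingSettings:
    """Hypothetical stand-in for EmbeddingConfig."""
    provider: str
    model: str
    max_seq_length: Optional[int] = None
    normalize: bool = True
    query_prompt_name: Optional[str] = None
    device: str = "cpu"

    def with_device(self, gpu: bool) -> "EmbeddingSettings":
        # Stamp the device after the config is loaded; nothing else changes.
        return replace(self, device="cuda" if gpu else "cpu")


def index_cache_key(cfg: EmbeddingSettings) -> str:
    """Hash only the fields that influence the stored embeddings."""
    hashed = {k: v for k, v in asdict(cfg).items() if k not in HASH_EXCLUDED}
    payload = json.dumps(hashed, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

With this shape, a CPU config and its with_device(gpu=True) copy produce the same key, while changing normalize or max_seq_length produces a new one and would trigger a rebuild.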