Theme: RAG foundation + Python streaming (buffered) + FFI work-loop architecture + configurable embedding model + HTTP timeout fix + ReAct loop hardening.
Changed — ReAct loop retry/timeout hardening (7 fixes)
Problem: the engine's ReAct loop had several retry/timeout bugs: retries did not consume the iteration budget (max_iter=1 could make 4+ LLM calls), the Timeout middleware caused infinite retries, there was no wall-clock timeout, the Retry middleware counter reset on every iteration, and <think> tags were not handled.
Fixes (based on competitive analysis of LangChain, OpenAI Agents SDK, CrewAI, AutoGen):
- Wall-clock timeout: agent_config.max_execution_time : float option; the loop checks elapsed time and returns a Timeout error if exceeded
- Retries consume iterations: retry path now passes iterations + 1 (was unchanged); industry consensus from all competitors
- Timeout middleware on_error removed: eliminates infinite-retry causal chain (Timeout mw → retryable=true → Retry mw → repeat)
- Retry budget per-invocation: removed per-iteration reset of retry counter; 3 retries is the total, not per-iteration
- Graceful degradation: agent_config.early_stopping_method (Force | Generate); when iterations are exhausted and the method is Generate, makes one final LLM call for a best-effort answer
- <think>/<reasoning> tag stripping: json_extract.ml now strips reasoning blocks before JSON parsing; prevents spurious repair loops with DeepSeek-R1, QwQ, MiniMax-M3
- Context-length error classification: engine detects context-length-exceeded errors from provider messages, applies context strategy, retries
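The reasoning-tag stripping fix listed above can be illustrated with a small Python sketch. This is not the code in json_extract.ml (which is OCaml); the function names and the regular expression are assumptions chosen for illustration only.

```python
import json
import re

# Matches <think>...</think> or <reasoning>...</reasoning>, including newlines.
# Non-greedy so two separate blocks are removed independently.
_REASONING_BLOCK = re.compile(r"<(think|reasoning)>.*?</\1>", re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Remove reasoning blocks so only the model's final answer remains."""
    return _REASONING_BLOCK.sub("", text).strip()


def parse_model_json(text: str) -> dict:
    """Strip reasoning blocks first, then parse what is left as JSON."""
    return json.loads(strip_reasoning(text))


# The braces inside the reasoning block would confuse a naive JSON extractor.
raw = '<think>Maybe {"a": 1}? No, the sum.</think>\n{"answer": 42}'
result = parse_model_json(raw)
```

Stripping before parsing means any JSON-like fragments inside the reasoning text never reach the parser, which is what avoids the repair loop.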
New types: Types.early_stopping_method = Force | Generate
New agent_config fields: max_execution_time : float option, early_stopping_method : early_stopping_method
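A rough Python model of the hardened loop budget, for readers who want to see how the fixes interact. The engine itself is OCaml; AgentConfig, RetryableError, run_loop, and the max_retries field are hypothetical names, and raising an error under Force is an assumption (these notes do not specify what Force returns).

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


class RetryableError(Exception):
    """Stand-in for a provider error that the engine treats as retryable."""


@dataclass
class AgentConfig:
    max_iter: int = 5
    max_retries: int = 3                        # total per invocation, never reset
    max_execution_time: Optional[float] = None  # seconds; None disables the check
    early_stopping_method: str = "Force"        # "Force" or "Generate"


def run_loop(cfg: AgentConfig,
             step: Callable[[], Optional[str]],
             final_answer: Callable[[], str]) -> str:
    """step() returns an answer, returns None to continue, or raises RetryableError."""
    start = time.monotonic()
    iterations = 0
    retries_left = cfg.max_retries
    while iterations < cfg.max_iter:
        elapsed = time.monotonic() - start
        if cfg.max_execution_time is not None and elapsed > cfg.max_execution_time:
            raise TimeoutError("max_execution_time exceeded")
        try:
            answer = step()
        except RetryableError:
            if retries_left == 0:
                raise
            retries_left -= 1
            iterations += 1  # a retry spends one iteration of the budget
            continue
        iterations += 1
        if answer is not None:
            return answer
    if cfg.early_stopping_method == "Generate":
        return final_answer()  # one last best-effort LLM call
    raise RuntimeError("iteration budget exhausted")  # assumed Force behaviour
```

With max_iter=1 and a step that always fails, this sketch calls step exactly once, which is the behaviour the "retries consume iterations" fix is meant to guarantee.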
Changed — HTTP request timeout (fixes engine hang on long prompts)
Root cause: cohttp-eio Client.call and Buf_read.take_all had no timeout. When LLM response was slow (correlated with 800-1500 char prompts), the HTTP read blocked indefinitely. Combined with the single-threaded work loop, one stuck request wedged the entire Runtime.
Fix: Added Http_client.with_timeout — each do_request/do_request_streaming forks a daemon fiber that sleeps 60s then fails the switch. Timeout errors are mapped to Types.Timeout (not Invalid_input), enabling Retry middleware to retry automatically.
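The timeout-and-classify pattern described in the fix can be sketched in Python with asyncio. This is an analogy only: the real Http_client.with_timeout is OCaml on Eio, and EngineTimeout is a hypothetical stand-in for Types.Timeout.

```python
import asyncio


class EngineTimeout(Exception):
    """Stand-in for Types.Timeout; flagged retryable so a retry layer can act on it."""
    retryable = True


async def with_timeout(seconds: float, request):
    """Await `request`, but give up after `seconds` and raise EngineTimeout."""
    try:
        return await asyncio.wait_for(request, timeout=seconds)
    except asyncio.TimeoutError:
        # Map the low-level timeout to the engine's own retryable error type.
        raise EngineTimeout(f"HTTP request exceeded {seconds}s") from None


async def slow_response() -> str:
    await asyncio.sleep(1.0)  # simulates a provider that answers too slowly
    return "body"


async def main():
    try:
        await with_timeout(0.05, slow_response())
    except EngineTimeout as exc:
        return exc.retryable
    return None


outcome = asyncio.run(main())
```

The key design point is the mapping step: a timeout surfaced as a generic input error would never be retried, whereas a dedicated timeout type lets the retry layer decide.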
Known limitation: MCP HTTP/SSE transport (mcp_transport_http.ml) and fetch_url builtin tool do not yet have timeouts. A stuck MCP server or URL fetch can still wedge the Runtime. Deferred to v0.5.2.
Changed — Streaming architecture (buffered, no daemon thread)
Root cause fixed: Python _StreamReader previously ran par_invoke_stream on a daemon threading.Thread that had no OCaml domain lock, causing Fatal: no domain lock held on every streaming call.
Fix: removed the daemon thread entirely. _StreamReader now calls par_invoke_stream on the main thread. The OCaml work loop buffers chunks internally and returns them all with the final result as JSON. Python parses the chunks array and yields Events.
Trade-off: chunks arrive all at once after the LLM completes (buffered, not incremental). True incremental streaming is planned for v0.5.2.
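A minimal sketch of the buffered flow on the Python side, assuming the final result is a JSON object with a chunks array. The exact payload shape, the output field, and the function name iter_buffered_chunks are assumptions; the real _StreamReader yields Event objects rather than plain strings.

```python
import json
from typing import Iterator


def iter_buffered_chunks(result_json: str) -> Iterator[str]:
    """Yield each buffered chunk from an already-completed invocation result."""
    result = json.loads(result_json)
    for chunk in result.get("chunks", []):
        yield chunk


# Hypothetical payload: every chunk is already present when Python sees it.
payload = json.dumps({
    "status": "ok",
    "chunks": ["Hel", "lo", ", ", "world"],
    "output": "Hello, world",
})
text = "".join(iter_buffered_chunks(payload))
```

Callers still iterate as if streaming, so switching to incremental delivery later should not change the consumer-side code.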
Changed — Configurable embedding model
Added embedding_model : string option to the Openai provider config variant. When set, it overrides the default "text-embedding-3-small". Example:
["Openai", {"api_key": "...", "embedding_model": "Qwen/Qwen3-Embedding-8B"}]
The Ollama variant does not yet have this field; Ollama embeddings use the OpenAI default (tracked as a known limitation).
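The override rule can be summarised in a few lines of Python. This is only a model of the behaviour described above; resolve_embedding_model is a hypothetical helper, not part of the SDK.

```python
from typing import Any, List

DEFAULT_EMBEDDING_MODEL = "text-embedding-3-small"


def resolve_embedding_model(provider: List[Any]) -> str:
    """Return the embedding model for a ["Variant", {fields}] provider config."""
    variant, fields = provider
    # Only the Openai variant carries the optional embedding_model field.
    if variant == "Openai" and fields.get("embedding_model"):
        return fields["embedding_model"]
    return DEFAULT_EMBEDDING_MODEL


custom = resolve_embedding_model(
    ["Openai", {"api_key": "...", "embedding_model": "Qwen/Qwen3-Embedding-8B"}]
)
fallback = resolve_embedding_model(["Ollama", {}])
```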
Changed — Dead code cleanup
Removed import queue, import threading, _DONE sentinel from runtime.py (no longer needed after streaming refactor).
Changed — Error handling
_StreamReader._fetch now raises PARInvokeError on status != "ok" instead of silently returning an empty iterator.
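The behavioural change amounts to the following, shown as a standalone sketch. The PARInvokeError class here is a local stand-in for the SDK exception, and the error field and fetch_chunks name are assumptions.

```python
import json


class PARInvokeError(Exception):
    """Local stand-in for the SDK's exception of the same name."""


def fetch_chunks(result_json: str) -> list:
    """Return buffered chunks, raising if the engine reported a failure."""
    result = json.loads(result_json)
    if result.get("status") != "ok":
        # Previously a failure looked the same as an empty stream.
        raise PARInvokeError(result.get("error", "invocation failed"))
    return result.get("chunks", [])
```

Raising makes a failed invocation distinguishable from a legitimately empty stream.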
Changed — Documentation
Updated docs/sdk/streaming.md implementation notes to describe the buffered architecture. Updated invoke_stream docstring in runtime.py.
Real API Verification (SiliconFlow)
All 5 endpoints verified against real API:
- embed (Qwen3-Embedding-8B, 4096 dims): PASS
- add_documents: PASS
- invoke (Qwen2.5-7B-Instruct): PASS
- invoke_with_rag: PASS
- invoke_stream (4 chunks, no crash): PASS
Test Count
- 998 OCaml tests
- 57 Python tests (1 skipped)
Install
curl -fsSL https://raw.githubusercontent.com/jcz2020/par/main/install.sh | bash
Or upgrade: par update
macOS: binary is unsigned. Run xattr -cr "" once after install.
Full changelog: https://github.com/jcz2020/par/blob/main/CHANGES.md