v0.5.1

@github-actions github-actions released this 23 Jun 18:45
· 22 commits to main since this release

Theme: RAG foundation + Python streaming (buffered) + FFI work-loop architecture + configurable embedding model + HTTP timeout fix + ReAct loop hardening.

Changed — ReAct loop retry/timeout hardening (7 fixes)

Problem: The engine's ReAct loop had several retry and timeout bugs: retries did not consume iteration budget (max_iter=1 could make 4+ LLM calls), the Timeout middleware caused infinite retries, there was no wall-clock timeout, the Retry middleware reset its counter on every iteration, and <think> tags were not handled.

Fixes (based on competitive analysis of LangChain, OpenAI Agents SDK, CrewAI, AutoGen):

  1. Wall-clock timeout: agent_config.max_execution_time : float option — loop checks elapsed time, returns Timeout error if exceeded
  2. Retries consume iterations: retry path now passes iterations + 1 (was unchanged) — industry consensus from all competitors
  3. Timeout middleware on_error removed: eliminates infinite-retry causal chain (Timeout mw → retryable=true → Retry mw → repeat)
  4. Retry budget per-invocation: removed per-iteration reset of retry counter — 3 retries is the total, not per-iteration
  5. Graceful degradation: agent_config.early_stopping_method (Force | Generate) — when iterations exhausted and Generate, makes one final LLM call for best-effort answer
  6. <think>/<reasoning> tag stripping: json_extract.ml now strips reasoning blocks before JSON parsing — prevents spurious repair loops with DeepSeek-R1, QwQ, MiniMax-M3
  7. Context-length error classification: engine detects context-length-exceeded errors from provider messages, applies context strategy, retries
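
Fix 6 can be illustrated with a minimal Python sketch. The real change lives in json_extract.ml (OCaml); the function names and the regular expression below are assumptions made for illustration, not the engine's code.

```python
import json
import re

# Matches <think>...</think> or <reasoning>...</reasoning>, including newlines.
# The pattern is an assumption; the engine's actual matching rules may differ.
_REASONING_BLOCK = re.compile(
    r"<(think|reasoning)>.*?</\1>", re.DOTALL | re.IGNORECASE
)


def strip_reasoning(text: str) -> str:
    """Remove reasoning blocks so only the model's answer remains."""
    return _REASONING_BLOCK.sub("", text).strip()


def extract_json(text: str):
    """Strip reasoning blocks first, then parse the remainder as JSON."""
    return json.loads(strip_reasoning(text))


# The reasoning block contains braces that would confuse a naive JSON scan.
raw = (
    '<think>Maybe {"a": 1}? No, call the tool.</think>\n'
    '{"tool": "search", "args": {"q": "ocaml"}}'
)
parsed = extract_json(raw)
```

Stripping before parsing matters because reasoning text often contains braces or partial JSON, which would otherwise send the parser into a repair attempt.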

New types: Types.early_stopping_method = Force | Generate
New agent_config fields: max_execution_time : float option, early_stopping_method : early_stopping_method
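
The budget rules in fixes 1, 2, 4 and 5 can be sketched as a small loop. This is a Python illustration under stated assumptions, not the OCaml engine: AgentConfig, RetryableError, run_loop, the max_retries field, and the exact error strings are hypothetical, and what Force returns when iterations run out is assumed here to be an error.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EarlyStopping(Enum):
    FORCE = "Force"
    GENERATE = "Generate"


class RetryableError(Exception):
    """Hypothetical stand-in for a retryable provider error."""


@dataclass
class AgentConfig:
    max_iter: int = 5
    max_retries: int = 3  # total for the whole invocation (assumed field)
    max_execution_time: Optional[float] = None  # seconds; None disables it
    early_stopping_method: EarlyStopping = EarlyStopping.FORCE


def run_loop(config, step, final_answer, clock=time.monotonic):
    """step() returns an answer, or None to keep going; may raise RetryableError."""
    start = clock()
    iterations = 0
    retries_left = config.max_retries  # never reset inside the loop
    while iterations < config.max_iter:
        limit = config.max_execution_time
        if limit is not None and clock() - start > limit:
            return ("error", "Timeout")
        try:
            answer = step()
        except RetryableError:
            if retries_left == 0:
                return ("error", "retries exhausted")
            retries_left -= 1
            iterations += 1  # a retry consumes an iteration
            continue
        iterations += 1
        if answer is not None:
            return ("ok", answer)
    if config.early_stopping_method is EarlyStopping.GENERATE:
        return ("ok", final_answer())  # one last best-effort LLM call
    return ("error", "max iterations reached")
```

With max_iter=1 and a step that always fails, this loop makes exactly one call to step, which is the behavior the release describes as the fix for the "4+ LLM calls" bug.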

Changed — HTTP request timeout (fixes engine hang on long prompts)

Root cause: cohttp-eio Client.call and Buf_read.take_all had no timeout. When the LLM response was slow (correlated with prompts of 800-1500 characters), the HTTP read blocked indefinitely. Combined with the single-threaded work loop, one stuck request wedged the entire Runtime.

Fix: Added Http_client.with_timeout — each do_request/do_request_streaming forks a daemon fiber that sleeps 60s then fails the switch. Timeout errors are mapped to Types.Timeout (not Invalid_input), enabling Retry middleware to retry automatically.
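
The same idea, a deadline around each request plus a mapping to a retryable timeout error, can be sketched in Python with asyncio. This is an analogy only: the actual fix uses an Eio daemon fiber in OCaml, and PARTimeout, with_timeout, and the two request coroutines below are hypothetical names.

```python
import asyncio


class PARTimeout(Exception):
    """Hypothetical stand-in for Types.Timeout; marked retryable."""

    retryable = True


async def with_timeout(seconds, make_request):
    """Await make_request(), failing with PARTimeout after `seconds`."""
    try:
        return await asyncio.wait_for(make_request(), timeout=seconds)
    except asyncio.TimeoutError as exc:
        # Map the low-level timeout to the retryable error class so that
        # retry logic can tell it apart from invalid-input failures.
        raise PARTimeout(f"HTTP request exceeded {seconds}s") from exc


async def slow_request():
    await asyncio.sleep(1.0)  # simulates a provider that answers too late
    return "response"


async def fast_request():
    return "response"
```

The point of the mapping is classification: a timeout surfaced as an invalid-input error would not be retried, while a dedicated timeout error can be.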

Known limitation: MCP HTTP/SSE transport (mcp_transport_http.ml) and fetch_url builtin tool do not yet have timeouts. A stuck MCP server or URL fetch can still wedge the Runtime. Deferred to v0.5.2.

Changed — Streaming architecture (buffered, no daemon thread)

Root cause fixed: Python _StreamReader previously ran par_invoke_stream on a daemon threading.Thread that did not hold the OCaml domain lock, causing "Fatal: no domain lock held" on every streaming call.

Fix: removed the daemon thread entirely. _StreamReader now calls par_invoke_stream on the main thread. The OCaml work loop buffers chunks internally and returns them all with the final result as JSON. Python parses the chunks array and yields Events.

Trade-off: chunks arrive all at once after the LLM completes (buffered, not incremental). True incremental streaming is planned for v0.5.2.
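
A minimal sketch of the Python side of the buffered design, assuming a payload shaped like {"status": ..., "chunks": [...]}. Only the "status" check and the chunks array are described in these notes; the "error" key, the Event fields, and the fetch_events name are assumptions, and PARInvokeError here is a local stand-in for the SDK's exception.

```python
import json
from dataclasses import dataclass
from typing import Iterator


class PARInvokeError(Exception):
    """Local stand-in for the SDK's PARInvokeError."""


@dataclass
class Event:
    index: int
    text: str


def fetch_events(payload: str) -> Iterator[Event]:
    """Parse the buffered result and return an iterator over its chunks."""
    doc = json.loads(payload)
    if doc.get("status") != "ok":
        # Fail loudly instead of returning an empty iterator.
        raise PARInvokeError(doc.get("error", "invoke failed"))
    events = [Event(index=i, text=c) for i, c in enumerate(doc.get("chunks", []))]
    return iter(events)


payload = json.dumps({"status": "ok", "chunks": ["Hel", "lo", "!"]})
texts = [event.text for event in fetch_events(payload)]
```

Because the whole payload is parsed before the first Event is produced, the caller sees every chunk only after the LLM call has finished, which is the trade-off noted above.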

Changed — Configurable embedding model

Added embedding_model : string option to the Openai provider config variant. When set, overrides the default "text-embedding-3-small". Example:

["Openai", {"api_key": "...", "embedding_model": "Qwen/Qwen3-Embedding-8B"}]

The Ollama variant does not yet have this field, so Ollama embeddings still use the OpenAI default (tracked as a known limitation).
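
The override rule can be sketched as follows, assuming the provider config is the two-element [variant, fields] array shown in the example. The resolve_embedding_model helper is hypothetical and is not part of the SDK.

```python
DEFAULT_EMBEDDING_MODEL = "text-embedding-3-small"


def resolve_embedding_model(provider_config):
    """Return the embedding model for a [variant, fields] provider config."""
    variant, fields = provider_config
    # Only the Openai variant carries embedding_model in this release.
    if variant == "Openai" and fields.get("embedding_model"):
        return fields["embedding_model"]
    return DEFAULT_EMBEDDING_MODEL


custom = resolve_embedding_model(
    ["Openai", {"api_key": "...", "embedding_model": "Qwen/Qwen3-Embedding-8B"}]
)
fallback = resolve_embedding_model(["Openai", {"api_key": "..."}])
```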

Changed — Dead code cleanup

Removed import queue, import threading, _DONE sentinel from runtime.py (no longer needed after streaming refactor).

Changed — Error handling

_StreamReader._fetch now raises PARInvokeError on status != "ok" instead of silently returning an empty iterator.

Changed — Documentation

Updated docs/sdk/streaming.md implementation notes to describe the buffered architecture. Updated invoke_stream docstring in runtime.py.

Real API Verification (SiliconFlow)

All 5 endpoints were verified against the real API:

  • embed (Qwen3-Embedding-8B, 4096 dims): PASS
  • add_documents: PASS
  • invoke (Qwen2.5-7B-Instruct): PASS
  • invoke_with_rag: PASS
  • invoke_stream (4 chunks, no crash): PASS

Test Count

  • 998 OCaml tests
  • 57 Python tests (1 skipped)


Install

curl -fsSL https://raw.githubusercontent.com/jcz2020/par/main/install.sh | bash

Or upgrade: par update

macOS: binary is unsigned. Run xattr -cr "" once after install.

Full changelog: https://github.com/jcz2020/par/blob/main/CHANGES.md