Skip to content

Releases: golikovichev/phoenix2pytest

v1.0.0

22 Jun 18:52

Choose a tag to compare

First stable release.

Highlights since 0.1.0:

  • Batch trace-to-test generation (/batch): synthesise tests for many annotated traces in one request, grouped by failure mode and folded into one parametrised test per mode.
  • /health probe for the web service.
  • Generated code is validated as parseable Python before it is returned; the web endpoints return HTTP 502 on a non-Python model reply instead of writing a broken test file.
  • Demo page redesign (card layout, Single/Batch nav, prefilled runnable example).
  • Actions pinned to commit SHAs, OpenSSF Scorecard analysis, least-privilege workflow permissions.

The pipeline (Phoenix trace to Gemini extractor to synthesiser to pytest file) is stable within the documented scope; see the Limits section in the README.

Full changelog in CHANGELOG.md.

v0.1.0 - first hackathon-submission cut

22 May 21:44

Choose a tag to compare

First tagged release. Built for the Google Cloud Rapid Agent Hackathon, Arize track. Devpost submission target: June 2026.

What ships

  • Pydantic v2 schema for the trace scenario pipeline: TraceScenario, ExtractorResponse, plus FailureMode and AssertionStrategy literals. Closed vocabularies, blank-string normalisation, list-item filtering, extra="forbid" on both models. Source: src/phoenix2pytest/schema.py.
  • 51-trace demo dataset covering six failure modes (hallucination, format break, off-topic drift, stale real-time data, wrong reasoning, refusal bug). Mix: 15 real (elicited from live Gemini calls), 35 synthetic (hand-curated), 1 real-harvested from Reddit via Bright Data with source attribution preserved. Source: tests/data/demo_dataset.json.
  • Phoenix ingestion script that emits the demo dataset as OpenInference spans, either via live Gemini calls (for real entries) or stored failure outputs (for synthetic and real-harvested). Source: scripts/ingest_demo_dataset.py.
  • FastAPI web UI for browsing failures and generating regression tests. Quickstart documented in the README.
  • Bright Data harvest CLI for collecting real-world LLM failure narratives at scale. Two discovery modes: keyword and subreddit URL. Source: scripts/bd_harvest.py.
  • README: problem framing, architecture diagram, comparison vs DeepEval / Opik / pytest-evals / Langfuse, web UI quickstart, Cloud Run deploy notes.

Status

Alpha. The web UI is the supported entry point. The vertical-slice end-to-end pipeline (Phoenix span -> Gemini extractor -> synthesiser -> generated pytest file) is scaffolded as a script but not yet wired into the package. End-to-end wiring ships in 0.2.0.

Test surface

135 unit tests pass across Python 3.11 / 3.12 / 3.13. Integration tests (Phoenix MCP server) pass under CI. Lint (ruff format + ruff check) clean.

License

MIT.