v0.1.0 - first hackathon-submission cut
First tagged release. Built for the Google Cloud Rapid Agent Hackathon, Arize track. Devpost submission target: June 2026.
What ships
- Pydantic v2 schema for the trace scenario pipeline: TraceScenario, ExtractorResponse, plus FailureMode and AssertionStrategy literals. Closed vocabularies, blank-string normalisation, list-item filtering, extra="forbid" on both models. Source: src/phoenix2pytest/schema.py.
- 51-trace demo dataset covering six failure modes (hallucination, format break, off-topic drift, stale real-time data, wrong reasoning, refusal bug). Mix: 15 real (elicited from live Gemini calls), 35 synthetic (hand-curated), 1 real-harvested from Reddit via Bright Data with source attribution preserved. Source: tests/data/demo_dataset.json.
- Phoenix ingestion script that emits the demo dataset as OpenInference spans, either via live Gemini calls (for real entries) or stored failure outputs (for synthetic and real-harvested). Source: scripts/ingest_demo_dataset.py.
- FastAPI web UI for browsing failures and generating regression tests. Quickstart documented in the README.
- Bright Data harvest CLI for collecting real-world LLM failure narratives at scale. Two discovery modes: keyword and subreddit URL. Source: scripts/bd_harvest.py.
- README: problem framing, architecture diagram, comparison vs DeepEval / Opik / pytest-evals / Langfuse, web UI quickstart, Cloud Run deploy notes.
Status
Alpha. The web UI is the supported entry point. The vertical-slice end-to-end pipeline (Phoenix span -> Gemini extractor -> synthesiser -> generated pytest file) is scaffolded as a script but not yet wired into the package. End-to-end wiring ships in 0.2.0.
Test surface
135 unit tests pass across Python 3.11 / 3.12 / 3.13. Integration tests (Phoenix MCP server) pass under CI. Lint (ruff format + ruff check) clean.
License
MIT.