Release v0.1.0 - first hackathon-submission cut · golikovichev/phoenix2pytest

First tagged release. Built for the Google Cloud Rapid Agent Hackathon, Arize track. Devpost submission target: June 2026.

What ships

Pydantic v2 schema for the trace scenario pipeline: TraceScenario, ExtractorResponse, plus FailureMode and AssertionStrategy literals. Closed vocabularies, blank-string normalisation, list-item filtering, extra="forbid" on both models. Source: src/phoenix2pytest/schema.py.
51-trace demo dataset covering six failure modes (hallucination, format break, off-topic drift, stale real-time data, wrong reasoning, refusal bug). Mix: 15 real (elicited from live Gemini calls), 35 synthetic (hand-curated), 1 real-harvested from Reddit via Bright Data with source attribution preserved. Source: tests/data/demo_dataset.json.
Phoenix ingestion script that emits the demo dataset as OpenInference spans, either via live Gemini calls (for real entries) or stored failure outputs (for synthetic and real-harvested). Source: scripts/ingest_demo_dataset.py.
FastAPI web UI for browsing failures and generating regression tests. Quickstart documented in the README.
Bright Data harvest CLI for collecting real-world LLM failure narratives at scale. Two discovery modes: keyword and subreddit URL. Source: scripts/bd_harvest.py.
README: problem framing, architecture diagram, comparison vs DeepEval / Opik / pytest-evals / Langfuse, web UI quickstart, Cloud Run deploy notes.

Status

Alpha. The web UI is the supported entry point. The vertical-slice end-to-end pipeline (Phoenix span -> Gemini extractor -> synthesiser -> generated pytest file) is scaffolded as a script but not yet wired into the package. End-to-end wiring ships in 0.2.0.

Test surface

135 unit tests pass across Python 3.11 / 3.12 / 3.13. Integration tests (Phoenix MCP server) pass under CI. Lint (ruff format + ruff check) clean.

License

MIT.
