Releases · golikovichev/phoenix2pytest

First stable release.

Highlights since 0.1.0:

Batch trace-to-test generation (/batch): synthesise tests for many annotated traces in one request, grouped by failure mode and folded into one parametrised test per mode.
/health probe for the web service.
Generated code is validated as parseable Python before it is returned; the web endpoints return HTTP 502 on a non-Python model reply instead of writing a broken test file.
Demo page redesign (card layout, Single/Batch nav, prefilled runnable example).
Actions pinned to commit SHAs, OpenSSF Scorecard analysis, least-privilege workflow permissions.

The pipeline (Phoenix trace to Gemini extractor to synthesiser to pytest file) is stable within the documented scope; see the Limits section in the README.

Full changelog in CHANGELOG.md.

First tagged release. Built for the Google Cloud Rapid Agent Hackathon, Arize track. Devpost submission target: June 2026.

What ships

Pydantic v2 schema for the trace scenario pipeline: TraceScenario, ExtractorResponse, plus FailureMode and AssertionStrategy literals. Closed vocabularies, blank-string normalisation, list-item filtering, extra="forbid" on both models. Source: src/phoenix2pytest/schema.py.
51-trace demo dataset covering six failure modes (hallucination, format break, off-topic drift, stale real-time data, wrong reasoning, refusal bug). Mix: 15 real (elicited from live Gemini calls), 35 synthetic (hand-curated), 1 real-harvested from Reddit via Bright Data with source attribution preserved. Source: tests/data/demo_dataset.json.
Phoenix ingestion script that emits the demo dataset as OpenInference spans, either via live Gemini calls (for real entries) or stored failure outputs (for synthetic and real-harvested). Source: scripts/ingest_demo_dataset.py.
FastAPI web UI for browsing failures and generating regression tests. Quickstart documented in the README.
Bright Data harvest CLI for collecting real-world LLM failure narratives at scale. Two discovery modes: keyword and subreddit URL. Source: scripts/bd_harvest.py.
README: problem framing, architecture diagram, comparison vs DeepEval / Opik / pytest-evals / Langfuse, web UI quickstart, Cloud Run deploy notes.

Status

Alpha. The web UI is the supported entry point. The vertical-slice end-to-end pipeline (Phoenix span -> Gemini extractor -> synthesiser -> generated pytest file) is scaffolded as a script but not yet wired into the package. End-to-end wiring ships in 0.2.0.

Test surface

135 unit tests pass across Python 3.11 / 3.12 / 3.13. Integration tests (Phoenix MCP server) pass under CI. Lint (ruff format + ruff check) clean.

License

MIT.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Uh oh!

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

What ships

Status

Test surface

License

Uh oh!

Releases: golikovichev/phoenix2pytest

v1.0.0

Uh oh!

v0.1.0 - first hackathon-submission cut

What ships

Status

Test surface

License

Uh oh!