DriftGate

Catch visual drift before it ships. DriftGate turns design-system conformance into a CI gate: when front-end code changes, it renders the page in a real browser, scores it against a DESIGN_SYSTEM.md, and runs a bounded fix loop until the screen conforms — or an iteration cap is hit.

Live demo: https://pragatig25.github.io/driftgate/ — a pre-recorded playback of the real loop converging (38 → 64 → 88 → 93) on a sample page. Zero backend, zero cost.

The problem & the business value

Off-brand colours, broken spacing and drifted typography slip past code review and reach production, where they erode brand trust and turn into costly hotfixes. Visual QA is usually manual, slow, and inconsistent.

DriftGate makes it automatic and cheap:

	Without DriftGate	With DriftGate
Visual QA / PR	~15–30 min manual review	< 60 s automated gate
Where drift is caught	In production (expensive)	Pre-merge, in CI
Cost / run	Reviewer time	~$0.01 (Haiku + prompt caching)
Consistency	Varies by reviewer	Deterministic hard gate

Why this is not "an LLM that fiddles with CSS"

The gate is hybrid, and the split is deliberate:

Layer	Role	Deterministic?
Pixel diff vs baseline	Hard gate — fails the build	Yes
Design-token assertions (computed styles)	Hard gate — fails the build	Yes
Claude-vision conformance score	Advisory — informs, never blocks alone	No

A non-deterministic model can therefore never wrongly block a PR. The vision critic explains why a screen drifts from the design language and proposes a fix; the deterministic checks decide pass/fail.

Two surfaces, one engine

CI gate (visual-qe gate) — runs in GitHub Actions on PRs, posts a conformance report. Never auto-edits; it only reports a suggested diff.
Hosted demo (visual_qe_loop.api.app) — sandboxed, rate-limited, access-code gated. Screenshots a submitted URL or a built-in sample and applies CSS-only suggestions. Never executes untrusted code.

The capture layer has two drivers behind one interface: the Playwright MCP driver for the local interactive agent (Claude edits code, then drives the browser as tools), and the Playwright library driver for CI and the demo (no Claude-Code runtime).

Quickstart

pip install -e ".[playwright,api,dev]"
python -m playwright install --with-deps chromium
cp .env.example .env          # add ANTHROPIC_API_KEY (and VQE_ACCESS_CODE for the live demo)

Usage

# Score a single URL against the design system
visual-qe score --url https://example.com

# Run the bounded fix loop on a local HTML file
visual-qe loop --file ./samples/saas-landing.html

# CI gate (writes a markdown report, fails on hard-gate)
visual-qe gate --report-path conformance-report.md

# Hosted demo backend (the static demo/ talks to this)
uvicorn visual_qe_loop.api.app:app --reload

Architecture

Front-end change ─ URL or .html
       │
       ▼
 CAPTURE  ── Playwright ──►  screenshot.png  +  computed styles
  (MCP driver = local agent · library driver = CI & demo)
       │
       ├───────────────►  DETERMINISTIC HARD GATE  ── blocks the build
       │                   • pixel diff vs baseline (Pillow/numpy)
       │                   • design-token assertions (computed styles)
       ▼
 CLAUDE VISION CRITIC ──►  ConformanceReport {score, violations}   ◄ ADVISORY only
  cached DESIGN_SYSTEM.md · forced tool-use · effort(Opus)/temp(Haiku)
       │
       ▼
 BOUNDED FIX LOOP   guardrails: max_iters · threshold · no-improvement
  score → propose CSS patch → apply → re-render → re-score
       │
       ├──►  CI GATE   → markdown report on the PR, fails on hard-gate
       └──►  DEMO      → iteration cards: before → after, score climbing

Reproducibility & cost

The DESIGN_SYSTEM.md + rubric are sent as a cached prompt block — they are large, static, and re-sent every loop iteration, so prompt caching is the main cost lever.
The critic is model-aware: Opus uses output_config.effort; Haiku/Sonnet use temperature. (The Anthropic API has no seed param, and Opus rejects temperature.)
Token / iteration / cost are logged per run via structlog.

Self-regression

tests/fixtures/golden/ holds screenshots with known expected scores. The critic is regression-tested against them, so a prompt or model change that shifts scoring is caught in CI.

Project layout

visual_qe_loop/
  capture/        base interface + Playwright library driver + MCP driver
  diff/           pixel diff + design-token extractor
  critic/         Claude-vision critic + prompts (cached design system)
  loop/           bounded fixer with guardrails
  models/         Pydantic contracts (ConformanceReport, DesignSystem, LoopResult)
  observability/  structlog config + cost accounting
  api/            FastAPI demo backend
demo/             static, self-contained demo page (GitHub Pages)

Security

Secrets live only in .env (git-ignored). Never commit API keys.
The demo backend blocks SSRF (rejects non-public hosts), is rate-limited, and is access-code gated so only people you share the code with can spend your tokens.
URL runs are report-only and rendered read-only — untrusted code is never executed.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.github/workflows		.github/workflows
demo		demo
samples		samples
scripts		scripts
tests		tests
visual_qe_loop		visual_qe_loop
.env.example		.env.example
.gitignore		.gitignore
DESIGN_SYSTEM.md		DESIGN_SYSTEM.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DriftGate

The problem & the business value

Why this is not "an LLM that fiddles with CSS"

Two surfaces, one engine

Quickstart

Usage

Architecture

Reproducibility & cost

Self-regression

Project layout

Security

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

DriftGate

The problem & the business value

Why this is not "an LLM that fiddles with CSS"

Two surfaces, one engine

Quickstart

Usage

Architecture

Reproducibility & cost

Self-regression

Project layout

Security

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages