Skip to content

pragatig25/driftgate

Repository files navigation

DriftGate

Catch visual drift before it ships. DriftGate turns design-system conformance into a CI gate: when front-end code changes, it renders the page in a real browser, scores it against a DESIGN_SYSTEM.md, and runs a bounded fix loop until the screen conforms — or an iteration cap is hit.

Python CI License: MIT

Live demo: https://pragatig25.github.io/driftgate/ — a pre-recorded playback of the real loop converging (38 → 64 → 88 → 93) on a sample page. Zero backend, zero cost.


The problem & the business value

Off-brand colours, broken spacing and drifted typography slip past code review and reach production, where they erode brand trust and turn into costly hotfixes. Visual QA is usually manual, slow, and inconsistent.

DriftGate makes it automatic and cheap:

Without DriftGate With DriftGate
Visual QA / PR ~15–30 min manual review < 60 s automated gate
Where drift is caught In production (expensive) Pre-merge, in CI
Cost / run Reviewer time ~$0.01 (Haiku + prompt caching)
Consistency Varies by reviewer Deterministic hard gate

Why this is not "an LLM that fiddles with CSS"

The gate is hybrid, and the split is deliberate:

Layer Role Deterministic?
Pixel diff vs baseline Hard gate — fails the build Yes
Design-token assertions (computed styles) Hard gate — fails the build Yes
Claude-vision conformance score Advisory — informs, never blocks alone No

A non-deterministic model can therefore never wrongly block a PR. The vision critic explains why a screen drifts from the design language and proposes a fix; the deterministic checks decide pass/fail.

Two surfaces, one engine

  • CI gate (visual-qe gate) — runs in GitHub Actions on PRs, posts a conformance report. Never auto-edits; it only reports a suggested diff.
  • Hosted demo (visual_qe_loop.api.app) — sandboxed, rate-limited, access-code gated. Screenshots a submitted URL or a built-in sample and applies CSS-only suggestions. Never executes untrusted code.

The capture layer has two drivers behind one interface: the Playwright MCP driver for the local interactive agent (Claude edits code, then drives the browser as tools), and the Playwright library driver for CI and the demo (no Claude-Code runtime).

Quickstart

pip install -e ".[playwright,api,dev]"
python -m playwright install --with-deps chromium
cp .env.example .env          # add ANTHROPIC_API_KEY (and VQE_ACCESS_CODE for the live demo)

Usage

# Score a single URL against the design system
visual-qe score --url https://example.com

# Run the bounded fix loop on a local HTML file
visual-qe loop --file ./samples/saas-landing.html

# CI gate (writes a markdown report, fails on hard-gate)
visual-qe gate --report-path conformance-report.md

# Hosted demo backend (the static demo/ talks to this)
uvicorn visual_qe_loop.api.app:app --reload

Architecture

Front-end change ─ URL or .html
       │
       ▼
 CAPTURE  ── Playwright ──►  screenshot.png  +  computed styles
  (MCP driver = local agent · library driver = CI & demo)
       │
       ├───────────────►  DETERMINISTIC HARD GATE  ── blocks the build
       │                   • pixel diff vs baseline (Pillow/numpy)
       │                   • design-token assertions (computed styles)
       ▼
 CLAUDE VISION CRITIC ──►  ConformanceReport {score, violations}   ◄ ADVISORY only
  cached DESIGN_SYSTEM.md · forced tool-use · effort(Opus)/temp(Haiku)
       │
       ▼
 BOUNDED FIX LOOP   guardrails: max_iters · threshold · no-improvement
  score → propose CSS patch → apply → re-render → re-score
       │
       ├──►  CI GATE   → markdown report on the PR, fails on hard-gate
       └──►  DEMO      → iteration cards: before → after, score climbing

Reproducibility & cost

  • The DESIGN_SYSTEM.md + rubric are sent as a cached prompt block — they are large, static, and re-sent every loop iteration, so prompt caching is the main cost lever.
  • The critic is model-aware: Opus uses output_config.effort; Haiku/Sonnet use temperature. (The Anthropic API has no seed param, and Opus rejects temperature.)
  • Token / iteration / cost are logged per run via structlog.

Self-regression

tests/fixtures/golden/ holds screenshots with known expected scores. The critic is regression-tested against them, so a prompt or model change that shifts scoring is caught in CI.

Project layout

visual_qe_loop/
  capture/        base interface + Playwright library driver + MCP driver
  diff/           pixel diff + design-token extractor
  critic/         Claude-vision critic + prompts (cached design system)
  loop/           bounded fixer with guardrails
  models/         Pydantic contracts (ConformanceReport, DesignSystem, LoopResult)
  observability/  structlog config + cost accounting
  api/            FastAPI demo backend
demo/             static, self-contained demo page (GitHub Pages)

Security

  • Secrets live only in .env (git-ignored). Never commit API keys.
  • The demo backend blocks SSRF (rejects non-public hosts), is rate-limited, and is access-code gated so only people you share the code with can spend your tokens.
  • URL runs are report-only and rendered read-only — untrusted code is never executed.

License

MIT © Pragati Gupta

About

DriftGate — catch visual drift before it ships. Design-system conformance as a CI gate: pixel diff + token assertions (hard) plus a Claude-vision critic (advisory).

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors