feat(evals): eval system with LLM judge, JSONL push, and React UI by phucbm · Pull Request #8 · openwalletvn/web

phucbm · 2026-05-22T16:17:22Z

Summary

Migrate 8 eval test cases from inline code to evals/cases/*.json
Add LLM judge (Groq response_format: json_object) scoring 1-100 per run
Write JSONL per run and push to openwalletvn/evals repo via GitHub API
New evals/ top-level directory — completely separate from admin/ (blog)
evals/server.ts — Express on port 3006 (trigger endpoint only)
evals/ui/ — Vite+React UI: RunList, RunDetail, PromptCompare, TriggerButton
evals/wrangler.toml — Cloudflare Pages config for evals.openwallet.vn
.github/workflows/eval-run.yml — runs evals on PR + workflow_dispatch
.github/workflows/evals-site.yml — deploys UI to Cloudflare Pages on push to main
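
A minimal sketch of the judge-and-JSONL step described above, not the code in this PR. It assumes the judge replies with a JSON object holding `score` and `reasoning`. The `JudgeVerdict` and `EvalRecord` shapes and both function names are hypothetical.

```typescript
// Hypothetical shapes: the PR does not show the real result type.
interface JudgeVerdict {
  score: number; // 1-100, per the summary above
  reasoning: string;
}

interface EvalRecord {
  caseId: string;
  passed: boolean;
  judge: JudgeVerdict;
}

// Parse the judge's JSON reply and reject scores outside the 1-100 range.
function parseJudgeVerdict(raw: string): JudgeVerdict {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("judge reply is not a JSON object");
  }
  const { score, reasoning } = data as Record<string, unknown>;
  if (typeof score !== "number" || !Number.isFinite(score) || score < 1 || score > 100) {
    throw new Error(`judge score out of range: ${String(score)}`);
  }
  return { score, reasoning: typeof reasoning === "string" ? reasoning : "" };
}

// JSONL: one JSON document per line, each line newline-terminated.
function toJsonl(records: EvalRecord[]): string {
  return records.map((r) => JSON.stringify(r) + "\n").join("");
}
```

Validating the score before writing keeps a malformed judge reply from silently skewing the run average.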

Usage

pnpm evals          # start UI (localhost:3005) + backend (port 3006, internal)
npx tsx scripts/eval-chat.ts  # run evals manually

Pending secrets (GitHub repo settings)

EVALS_GITHUB_TOKEN — PAT with contents:write on openwalletvn/evals
CLOUDFLARE_API_TOKEN + CLOUDFLARE_ACCOUNT_ID — for evals-site.yml deploy
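
For context on why `EVALS_GITHUB_TOKEN` needs contents:write, here is a rough sketch of a "create file" request against the GitHub REST Contents API. The helper name, the `runs/...` path layout, and the commit message are assumptions for illustration. The request is only built, not sent.

```typescript
// Hypothetical helper: builds (but does not send) a create-file request
// for the GitHub REST Contents API.
interface ContentsPutRequest {
  url: string;
  method: "PUT";
  headers: Record<string, string>;
  body: string;
}

function buildPushRequest(
  token: string, // e.g. the value of EVALS_GITHUB_TOKEN
  path: string, // assumed layout, e.g. "runs/example.jsonl"
  jsonl: string,
  message: string,
): ContentsPutRequest {
  return {
    url: `https://api.github.com/repos/openwalletvn/evals/contents/${path}`,
    method: "PUT",
    headers: {
      Authorization: `Bearer ${token}`,
      Accept: "application/vnd.github+json",
    },
    // The Contents API expects the file body base64-encoded.
    body: JSON.stringify({
      message,
      content: Buffer.from(jsonl, "utf8").toString("base64"),
    }),
  };
}
```

The result can be passed to `fetch(req.url, req)`. Updating a file that already exists additionally requires the current blob `sha` in the body, which this sketch omits.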

Test plan

npx tsx scripts/eval-chat.ts → 8/8 pass, avg score 81, JSONL pushed to evals repo
pnpm build → Next.js build clean
pnpm evals:build → Vite UI build clean
git diff origin/preview -- admin/server.ts → empty (blog admin untouched)

🤖 Generated with Claude Code

- Migrate 8 test cases from inline array to evals/cases/*.json
- Add LLM judge (Groq json_object mode) with score + reasoning
- Write JSONL per run, push to openwalletvn/evals via GitHub API
- Add evals/server.ts (port 3006, trigger endpoint)
- Add evals/ui/: Vite+React UI with RunList, RunDetail, PromptCompare
- Add evals/wrangler.toml for Cloudflare Pages deploy
- Add .github/workflows/eval-run.yml and evals-site.yml
- pnpm evals → single port 3005, /server proxy to backend

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

vercel · 2026-05-22T16:17:28Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
openwallet	Ready	Preview, Comment	May 23, 2026 1:29am

- Fix CWD bug in server.ts (parent of web/ → web/)
- Add SSE streaming: POST trigger returns runId, GET stream/:runId pipes stdout in real time via EventEmitter fan-out
- Add triggered_by (ui/cli/ci) and system_prompt fields to EvalResult
- UI: two-column layout, dark terminal progress panel, card-based run list with tag summary and trigger badge, run detail with full system prompt + expandable AI responses + judge/rule disagreement flag
- Rewrite 13 eval cases with realistic Vietnamese queries: shopee+supermarket cashback, no annual fee, travel abroad, installment 0%, hallucination guards for fabricated rates, out-of-scope: gold/stocks/real estate
- Add .claude/docs/evals.md maintenance documentation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
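
The EventEmitter fan-out mentioned in this commit could look roughly like the sketch below. It is an illustration under assumptions, not the code in evals/server.ts: the `runs` registry, the `"line"` event name, and the function names are hypothetical, and `write` stands in for Express's `res.write`.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical registry: one emitter per run, so several clients
// can follow the same run at once.
const runs = new Map<string, EventEmitter>();

function startRun(runId: string): EventEmitter {
  const emitter = new EventEmitter();
  runs.set(runId, emitter);
  return emitter;
}

// Format one chunk of stdout as a Server-Sent Events frame.
// Multi-line chunks become several "data:" lines; a blank line ends the event.
function sseFrame(chunk: string): string {
  return chunk.split("\n").map((line) => `data: ${line}`).join("\n") + "\n\n";
}

// Subscribe a client to a run. Returns an unsubscribe function to call
// when the client disconnects.
function subscribe(runId: string, write: (frame: string) => void): () => void {
  const emitter = runs.get(runId);
  if (!emitter) throw new Error(`unknown run: ${runId}`);
  const onLine = (chunk: string) => write(sseFrame(chunk));
  emitter.on("line", onLine);
  return () => {
    emitter.off("line", onLine);
  };
}
```

In this arrangement the process running the evals emits each stdout chunk once with `emitter.emit("line", chunk)`, and every subscribed client receives its own copy.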

- Single case selector: dropdown grouped by tag, runs just one case
- Push to GitHub checkbox: lets WIP runs skip the push so they do not pollute results
- Delete run: button per card proxies GitHub Contents API delete
- Re-run failures: button appears when selected run has failing cases
- Live progress: SSE streaming with current case name + N/M counter
- System prompt per run: expandable with GitHub blob link at that SHA
- Trigger source tracking: ui/cli/ci badge on each run card
- Score trend arrow vs previous run, disagreement flag (rule vs judge)
- Copy button on AI response, inline re-run per test case card
- Relative timestamps with absolute on hover, descriptive button labels

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
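
As an illustration of the score trend arrow and the rule-vs-judge disagreement flag, here is a small sketch. The `CaseOutcome` shape, the function names, and the pass threshold of 70 are assumptions. The PR does not state how the UI computes either value.

```typescript
// Hypothetical per-case shape.
interface CaseOutcome {
  rulePassed: boolean; // result of the rule-based check
  judgeScore: number; // LLM judge score, 1-100
}

// Assumed cut-off at which a judge score counts as a pass.
const JUDGE_PASS_THRESHOLD = 70;

// Flag cases where the rule check and the judge reach different conclusions.
function disagrees(c: CaseOutcome, threshold: number = JUDGE_PASS_THRESHOLD): boolean {
  return c.rulePassed !== (c.judgeScore >= threshold);
}

type Trend = "up" | "down" | "flat";

// Compare a run's average score with the previous run's.
// Returns null when there is no previous run to compare against.
function scoreTrend(current: number, previous: number | undefined): Trend | null {
  if (previous === undefined) return null;
  if (current > previous) return "up";
  if (current < previous) return "down";
  return "flat";
}
```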

phucbm and others added 2 commits May 22, 2026 23:16

chore: gitignore evals/ui/node_modules and dist

4892446

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

vercel Bot had a problem deploying to Preview May 22, 2026 16:17 Failure

vercel Bot deployed to Preview May 22, 2026 16:52 View deployment

phucbm and others added 2 commits May 23, 2026 08:21

docs(evals): update evals.md with case filter, push toggle, delete, U…

a71b106

…X features Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

vercel Bot deployed to Preview May 23, 2026 01:29 View deployment

phucbm merged commit 6ab56f0 into preview May 24, 2026
2 of 3 checks passed

phucbm deleted the feature/evals-system branch May 24, 2026 03:39

phucbm merged 5 commits into preview from feature/evals-system
