feat(evals): eval system with LLM judge, JSONL push, and React UI #8

Merged: phucbm merged 5 commits into preview from feature/evals-system on May 24, 2026

phucbm commented on May 22, 2026

Summary

  • Migrate 8 eval test cases from inline code to evals/cases/*.json
  • Add LLM judge (Groq response_format: json_object) scoring 1-100 per run
  • Write JSONL per run and push to openwalletvn/evals repo via GitHub API
  • New evals/ top-level directory — completely separate from admin/ (blog)
  • evals/server.ts — Express on port 3006 (trigger endpoint only)
  • evals/ui/ — Vite+React UI: RunList, RunDetail, PromptCompare, TriggerButton
  • evals/wrangler.toml — Cloudflare Pages config for evals.openwallet.vn
  • .github/workflows/eval-run.yml — runs evals on PR + workflow_dispatch
  • .github/workflows/evals-site.yml — deploys UI to Cloudflare Pages on push to main
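
As a rough illustration of the judge and JSONL steps listed above, the sketch below validates a judge verdict returned in JSON mode and serializes results one per line. The `JudgeVerdict` shape and both function names are assumptions made for illustration; the PR only states that the judge returns a score from 1 to 100 plus reasoning.

```typescript
// Hypothetical verdict shape; the PR only mentions "score + reasoning".
interface JudgeVerdict {
  score: number; // integer in [1, 100]
  reasoning: string;
}

// Parse the raw JSON string produced by a json_object-mode completion.
// Returns null instead of throwing so one malformed verdict cannot crash a run.
function parseJudgeVerdict(raw: string): JudgeVerdict | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const { score, reasoning } = data as Record<string, unknown>;
  if (typeof score !== "number" || !Number.isInteger(score)) return null;
  if (score < 1 || score > 100) return null;
  if (typeof reasoning !== "string") return null;
  return { score, reasoning };
}

// Serialize records as JSONL: one JSON document per line, each newline-terminated.
function toJsonl(records: object[]): string {
  return records.map((r) => JSON.stringify(r) + "\n").join("");
}
```

JSON mode guarantees syntactically valid JSON, not a particular schema, so range and type checks like these are still worth doing before a score is written to a run file.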

Usage

pnpm evals          # start UI (localhost:3005) + backend (port 3006, internal)
npx tsx scripts/eval-chat.ts  # run evals manually

Pending secrets (GitHub repo settings)

  • EVALS_GITHUB_TOKEN — PAT with contents:write on openwalletvn/evals
  • CLOUDFLARE_API_TOKEN + CLOUDFLARE_ACCOUNT_ID — for evals-site.yml deploy
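
For context on why `EVALS_GITHUB_TOKEN` needs contents:write, here is a minimal sketch of a request that creates a run file through the GitHub Contents API (`PUT /repos/{owner}/{repo}/contents/{path}`). The function name, the option names, and the file path used in the usage note are hypothetical; the actual push code in this PR may differ.

```typescript
interface GitHubPutRequest {
  url: string;
  method: "PUT";
  headers: Record<string, string>;
  body: string;
}

// Build (but do not send) a Contents API request that creates a file.
// The API expects the file content base64-encoded inside a JSON body.
function buildPushRequest(opts: {
  repo: string; // "owner/name", e.g. "openwalletvn/evals"
  path: string; // file path inside the repo (assumed layout)
  jsonl: string; // file content
  token: string; // PAT with contents:write
  message: string; // commit message
}): GitHubPutRequest {
  return {
    url: `https://api.github.com/repos/${opts.repo}/contents/${opts.path}`,
    method: "PUT",
    headers: {
      Authorization: `Bearer ${opts.token}`,
      Accept: "application/vnd.github+json",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      message: opts.message,
      content: Buffer.from(opts.jsonl, "utf8").toString("base64"),
    }),
  };
}
```

A caller would pass the result to `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`, for example with a hypothetical path such as `runs/run-1.jsonl`. Updating or deleting an existing file additionally requires that file's current blob `sha`.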

Test plan

  • npx tsx scripts/eval-chat.ts → 8/8 pass, avg score 81, JSONL pushed to evals repo
  • pnpm build → Next.js build clean
  • pnpm evals:build → Vite UI build clean
  • git diff origin/preview -- admin/server.ts → empty (blog admin untouched)

🤖 Generated with Claude Code

phucbm and others added 2 commits May 22, 2026 23:16
- Migrate 8 test cases from inline array to evals/cases/*.json
- Add LLM judge (Groq json_object mode) with score + reasoning
- Write JSONL per run, push to openwalletvn/evals via GitHub API
- Add evals/server.ts (port 3006, trigger endpoint)
- Add evals/ui/ — Vite+React UI with RunList, RunDetail, PromptCompare
- Add evals/wrangler.toml for Cloudflare Pages deploy
- Add .github/workflows/eval-run.yml and evals-site.yml
- pnpm evals → single port 3005, /server proxy to backend

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
vercel Bot commented May 22, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| openwallet | Ready | Preview, Comment | May 23, 2026 1:29am |

- Fix CWD bug in server.ts (parent of web/ → web/)
- Add SSE streaming: POST trigger returns runId, GET stream/:runId
  pipes stdout in real-time via EventEmitter fan-out
- Add triggered_by (ui/cli/ci) and system_prompt fields to EvalResult
- UI: two-column layout, dark terminal progress panel, card-based
  run list with tag summary and trigger badge, run detail with full
  system prompt + expandable AI responses + judge/rule disagreement flag
- Rewrite 13 eval cases with realistic Vietnamese queries:
  shopee+supermarket cashback, no annual fee, travel abroad,
  installment 0%, hallucination guards for fabricated rates,
  out-of-scope: gold/stocks/real estate
- Add .claude/docs/evals.md maintenance documentation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
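
The SSE fan-out described in the commit above can be sketched roughly as follows: each run publishes stdout lines to an `EventEmitter`, and every connected client receives them as Server-Sent Events frames. The class name, the `run:` event prefix, and the replay buffer for late subscribers are assumptions added for illustration, not details confirmed by the PR.

```typescript
import { EventEmitter } from "node:events";

// Format a payload as one SSE frame. Multi-line payloads need one "data:"
// field per line, and a blank line terminates the frame.
function sseFrame(payload: string): string {
  return payload.split("\n").map((l) => `data: ${l}`).join("\n") + "\n\n";
}

class RunBroadcaster {
  private emitter = new EventEmitter();
  // Assumed extra: keep past lines so a client that connects late sees them too.
  private history = new Map<string, string[]>();

  // Called for each stdout line of the eval process for a given run.
  publish(runId: string, line: string): void {
    const lines = this.history.get(runId) ?? [];
    lines.push(line);
    this.history.set(runId, lines);
    this.emitter.emit(`run:${runId}`, line);
  }

  // Replays buffered lines, then forwards new ones. Returns an unsubscribe
  // function to call when the HTTP connection closes.
  subscribe(runId: string, onFrame: (frame: string) => void): () => void {
    for (const line of this.history.get(runId) ?? []) onFrame(sseFrame(line));
    const listener = (line: string) => onFrame(sseFrame(line));
    this.emitter.on(`run:${runId}`, listener);
    return () => {
      this.emitter.off(`run:${runId}`, listener);
    };
  }
}
```

In an Express handler for `GET stream/:runId`, `onFrame` would typically be `(frame) => res.write(frame)` after setting `Content-Type: text/event-stream`, with the unsubscribe function wired to the request's `close` event.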
phucbm and others added 2 commits May 23, 2026 08:21
- Single case selector: dropdown grouped by tag, runs just one case
- Push to GitHub checkbox: skip polluting results with WIP runs
- Delete run: button per card proxies GitHub Contents API delete
- Re-run failures: button appears when selected run has failing cases
- Live progress: SSE streaming with current case name + N/M counter
- System prompt per run: expandable with GitHub blob link at that SHA
- Trigger source tracking: ui/cli/ci badge on each run card
- Score trend arrow vs previous run, disagreement flag (rule vs judge)
- Copy button on AI response, inline re-run per test case card
- Relative timestamps with absolute on hover, descriptive button labels

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…X features

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
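
To make the score trend arrow and the rule-vs-judge disagreement flag from the commits above concrete, a minimal sketch follows. The function names and the pass threshold of 50 are assumptions; the PR does not state how a judge score maps to pass or fail.

```typescript
type Trend = "up" | "down" | "flat";

// Compare a run's average score with the previous run's.
// Returns null when there is no previous run to compare against.
function scoreTrend(current: number, previous: number | undefined): Trend | null {
  if (previous === undefined) return null;
  if (current > previous) return "up";
  if (current < previous) return "down";
  return "flat";
}

// Flag cases where the rule-based check and the LLM judge disagree.
// passThreshold is an assumed cutoff, not a value taken from the PR.
function judgeDisagrees(rulePassed: boolean, judgeScore: number, passThreshold = 50): boolean {
  const judgePassed = judgeScore >= passThreshold;
  return rulePassed !== judgePassed;
}
```

Surfacing disagreements is useful in both directions: a rule pass with a low judge score may point to a rule that is too lenient, while a rule failure with a high judge score may point to a brittle rule.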
phucbm merged commit 6ab56f0 into preview on May 24, 2026
2 of 3 checks passed
phucbm deleted the feature/evals-system branch on May 24, 2026 03:39