A GUI-first evaluation workbench for local LLMs running on Ollama. Build test suites, run evaluations across models, execute code in Docker containers, score with cloud judges and peer voting, and visualize everything through interactive dashboards.
Screenshots: Dashboard (model status, evaluation stats), Test Suites (create, import, manage), Suite Editor (scenarios with test cases), Model Browser (search and install models).
ModelSweep connects to your local Ollama instance, detects installed models, and lets you run structured evaluations across them. You create test suites (or use built-in ones), select which models to test, and watch results stream in real time. For coding problems, the models' code is executed in isolated Docker containers against real test cases.
Three scoring layers work together:
- Auto-scoring — gate checks catch broken responses (empty, refused, gibberish, looping)
- Cloud judge (GPT-4o, Claude, etc.) — scores each response on accuracy, helpfulness, clarity, and instruction following, with detailed code reviews for coding suites
- Peer judging — models judge each other's responses in round-robin comparisons with written reasoning
- 7 evaluation modes: Standard, Tool Calling, Multi-turn Conversation, Adversarial/Red Team, Coding Sandbox, Vision, and RAG
- Docker code execution — models write code, it runs in isolated containers (Python, JS, Go, Rust) against test cases with pass/fail results
- LLM-as-Judge with cloud providers (OpenAI, Anthropic, custom endpoints) — 4-axis scoring with strengths, weaknesses, and code review analysis
- Round-robin peer judging — models judge each other with written reasoning for why they picked a winner
- AI-powered test generation — describe a problem in plain English, a cloud model generates the scenario with function signature and test cases
- Custom judge instructions — tell the judge what to focus on (e.g., "penalize solutions without docstrings")
- Live streaming execution with real-time progress, Docker execution indicators, and test result badges
- Elo rating system — persistent cross-run ratings from judge and peer comparisons
- Head-to-head matrix with per-scenario breakdown showing who won each problem and why
- Radar charts for coding (Correctness, Code Quality, Speed, Reliability, Edge Cases)
- 8 built-in starter suites including OWASP LLM Top 10 (25 adversarial scenarios) and Coding Sandbox Basics
- Suite import/export as `.modelsweep.json` files (a rough sketch follows this list)
- Model browser — search, browse, and pull Ollama models from within the app
- Export results as PDF, JSON, or CSV
- Fully local — all data stays on your machine, cloud APIs used only for judge/generation when explicitly configured
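Suite files are plain JSON. As a rough illustration only (the field names below are hypothetical and do not reflect ModelSweep's actual `.modelsweep.json` schema), a coding suite might carry something like:

```ts
// Hypothetical shape of an exported suite file. Field names are illustrative
// only and are not ModelSweep's actual .modelsweep.json schema.
interface SuiteFile {
  name: string;
  mode: "standard" | "tool" | "conversation" | "adversarial" | "coding" | "vision" | "rag";
  scenarios: Array<{
    prompt: string;                     // what the model is asked to do
    functionSignature?: string;         // coding suites: required signature
    testCases?: Array<{ input: unknown[]; expected: unknown }>; // graded in Docker
  }>;
}
```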
```bash
git clone https://github.com/leonickson1/ModelSweep.git
cd ModelSweep/app
npm install

# Start Ollama (in a separate terminal)
ollama serve

# Start the dev server
npm run dev
```

Open http://localhost:3000. ModelSweep auto-detects your Ollama instance and lists installed models.
- Go to Suites and pick a built-in starter suite (General Intelligence, Coding Sandbox Basics, OWASP LLM Top 10)
- Click Run Suite, select your models, optionally enable cloud judge + peer judging
- Watch streaming execution with Docker test results, then explore the results dashboard
The coding sandbox runs model-generated code in isolated Docker containers:
- Languages: Python 3.11, JavaScript (Node.js 20), Go 1.21, Rust 1.74
- Isolation: No network access, 512MB memory limit, 1 CPU, per-scenario timeout
- Auto-pull: Docker images are pulled automatically on first use
- Test results: Each test case shows expected vs actual output, pass/fail, and execution time
- Code extraction: Handles code fences, thinking model tags, multi-function solutions
- Judge integration: Test results are sent to the cloud judge so it knows which code actually works — broken code scores low regardless of how clean it looks
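A minimal sketch of that isolation with dockerode, mirroring the limits listed above (the image choice, timeout handling, and log collection are simplified assumptions, not ModelSweep's actual execution engine):

```ts
import Docker from "dockerode";

const docker = new Docker();

// Run a snippet in a locked-down container: no network, 512 MB RAM, 1 CPU.
async function runSandboxed(code: string, timeoutMs: number): Promise<string> {
  const container = await docker.createContainer({
    Image: "python:3.11-slim",
    Cmd: ["python", "-c", code],
    NetworkDisabled: true,
    HostConfig: {
      Memory: 512 * 1024 * 1024, // 512 MB memory limit
      NanoCpus: 1_000_000_000,   // 1 CPU
      NetworkMode: "none",       // no network access
    },
  });

  await container.start();

  // Enforce the per-scenario timeout by killing the container if it runs long.
  const timer = setTimeout(() => container.kill().catch(() => {}), timeoutMs);
  await container.wait();
  clearTimeout(timer);

  // Collect whatever the program printed, then clean up.
  const logs = await container.logs({ stdout: true, stderr: true });
  await container.remove();
  return logs.toString("utf8");
}
```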
The prompt tells models the exact function signature to implement. The runner automatically:
- Detects multi-parameter functions and spreads array inputs
- Strips `console.log`/`print` statements and example usage from model code
- Handles models that name functions differently than requested
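For illustration, the usage-stripping step might look roughly like this (a simplified sketch; real handling of thinking tags, code fences, and renamed functions is more involved):

```ts
// Simplified sketch: remove thinking-model tags and drop top-level example
// usage (bare console.log / print calls) from model-generated code.
function cleanModelCode(raw: string): string {
  // Strip <think>...</think> blocks emitted by thinking models.
  const withoutThinking = raw.replace(/<think>[\s\S]*?<\/think>/g, "");

  // Drop unindented console.log / print lines (top-level example usage)
  // while keeping prints that live inside function bodies (indented lines).
  return withoutThinking
    .split("\n")
    .filter((line) => !/^(console\.log|print)\s*\(/.test(line))
    .join("\n");
}

// e.g. cleanModelCode("def add(a, b):\n    return a + b\nprint(add(1, 2))")
//      keeps the function and drops the trailing example call.
```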
- Standard — static prompts with single-turn responses. Gate checks plus optional judge scoring.
- Tool Calling — define mock tools (JSON Schema) and test function calling. Deterministic scoring on tool selection, parameter accuracy, restraint, and ordering (a mock tool sketch follows this list).
- Multi-turn Conversation — a simulator model plays a user persona for multi-turn dialogue. Tests context retention, persona consistency, and quality maintenance. Supports scripted, local, or cloud simulators.
- Adversarial/Red Team — an attacker model tries to breach system prompt defenses. Strategies: prompt extraction, jailbreak, persona break, data exfiltration. Includes the OWASP LLM Top 10 suite.
- Coding Sandbox — models write code that is executed in Docker against real test cases. Score = passed tests / total tests. The cloud judge provides code review analysis explaining why tests passed or failed.
- Vision — tests vision models on image understanding: object identification, OCR, counting, spatial reasoning, description, and visual reasoning.
- RAG — upload documents and test retrieval faithfulness. Measures sentence grounding, abstention accuracy, and answer correctness.
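As a rough idea of what a mock tool for the Tool Calling mode can look like, here is a generic function definition in the JSON Schema style Ollama's tool calling API accepts (the `get_weather` name and parameters are invented for illustration, not part of a built-in suite):

```ts
// Hypothetical mock tool definition for a tool-calling scenario.
const mockTool = {
  type: "function",
  function: {
    name: "get_weather",
    description: "Look up the current weather for a city.",
    parameters: {
      type: "object",
      properties: {
        city: { type: "string", description: "City name, e.g. 'Oslo'" },
        unit: { type: "string", enum: ["celsius", "fahrenheit"] },
      },
      required: ["city"],
    },
  },
};
```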
| Gate | Trigger |
|---|---|
| EMPTY | Response has fewer than 4 words |
| REFUSED | Matches refusal patterns |
| REPETITION_LOOP | 4-gram repetition > 50% in last 300 words |
| GIBBERISH | More than 40% non-ASCII characters |
| TRUNCATED | Hit token limit mid-sentence |
| ERROR | Timeout or API error |
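A rough sketch of how gates like these can be implemented, using the thresholds from the table above (illustrative only, not the project's actual scoring.ts):

```ts
// Illustrative gate checks using the thresholds from the table above.
function checkGates(response: string): string | null {
  const words = response.trim().split(/\s+/).filter(Boolean);

  // EMPTY: fewer than 4 words.
  if (words.length < 4) return "EMPTY";

  // GIBBERISH: more than 40% non-ASCII characters.
  const nonAscii = [...response].filter((ch) => ch.charCodeAt(0) > 127).length;
  if (nonAscii / response.length > 0.4) return "GIBBERISH";

  // REPETITION_LOOP: a single 4-gram covering more than 50% of the last 300 words.
  const tail = words.slice(-300).map((w) => w.toLowerCase());
  const counts = new Map<string, number>();
  let maxCount = 0;
  for (let i = 0; i + 4 <= tail.length; i++) {
    const gram = tail.slice(i, i + 4).join(" ");
    const n = (counts.get(gram) ?? 0) + 1;
    counts.set(gram, n);
    maxCount = Math.max(maxCount, n);
  }
  if (tail.length >= 8 && (maxCount * 4) / tail.length > 0.5) return "REPETITION_LOOP";

  // TRUNCATED and ERROR would come from the generation call itself, not the text.
  return null;
}
```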
For coding suites, the judge acts as a senior software engineer:
- Accuracy: Does the code work? Test results are ground truth.
- Helpfulness: Edge case handling, robustness
- Clarity: Readability, variable naming, structure
- Instruction Following: Matches required signature and constraints
Plus a Code Review field explaining what the code does right or wrong.
For standard suites: accuracy, helpfulness, clarity, instruction following.
When 3+ models are tested, they judge each other round-robin. Each judge picks a winner and explains why. Results feed into Elo ratings.
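The Elo math is the standard update rule; a minimal sketch (the K-factor of 32 here is an assumption, not necessarily what ModelSweep uses):

```ts
// Standard Elo update for one head-to-head comparison.
// `score` is 1 if modelA won, 0 if it lost, 0.5 for a draw.
function updateElo(ratingA: number, ratingB: number, score: number, k = 32): [number, number] {
  const expectedA = 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
  const newA = ratingA + k * (score - expectedA);
  const newB = ratingB + k * ((1 - score) - (1 - expectedA));
  return [newA, newB];
}

// Example: a 1500-rated model beats a 1600-rated one and gains about 20.5 points.
const [winner, loser] = updateElo(1500, 1600, 1);
```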
The coding score is `(passed test cases / total test cases) * 100`. When all models fail, the judge comparison is skipped. Models that fail tests cannot win judge comparisons.
| Layer | Tech |
|---|---|
| Framework | Next.js 14 (App Router) |
| Styling | Tailwind CSS (dark-only) |
| Animation | Framer Motion |
| Charts | Recharts |
| Flow Viz | React Flow (@xyflow/react) |
| State | Zustand |
| Database | SQLite (better-sqlite3) |
| Icons | Lucide React |
| Code Sandbox | Docker (dockerode) |
| Doc Parsing | pdf-parse, mammoth |
| MCP | @modelcontextprotocol/sdk |
```
app/src/
  app/                            Pages + 40+ API routes
  components/
    ui/                           GlowCard, Button, ScoreBadge, ModelBadge, Markdown
    layout/                       Sidebar, ConnectionProvider, CommandPalette
    charts/                       Radar, Bar, Distribution, Elo, Quality, Heatmap
    results/                      Mode-specific result views + matchup history
    suite/                        7 suite editors (tool, conversation, adversarial, coding, vision, RAG)
    run/                          React Flow visualizations for live runs
  lib/
    db.ts                         SQLite (20+ tables)
    ollama.ts                     Ollama client with streaming chat
    scoring.ts                    Gate checks + composite scoring
    code-execution-engine.ts      Docker sandbox (4 languages)
    adversarial-engine.ts         Red team attack/defense
    conversation-engine.ts        Multi-turn conversation runner
    peer-judge-engine.ts          Round-robin peer judging
    rag-engine.ts                 Document parsing + faithfulness
    vision-engine.ts              Vision model evaluation
    providers/cloud-inference.ts  OpenAI/Anthropic/custom clouds
  store/                          5 Zustand stores
  types/                          TypeScript interfaces
```
Configure OpenAI, Anthropic, or custom OpenAI-compatible endpoints in Settings > Cloud Providers. Used for judge scoring, peer judging extras, conversation simulation, and AI test generation.
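Custom endpoints only need to speak the standard OpenAI chat-completions API. A minimal judge call might look like the sketch below; the base URL handling, model name, and JSON response shape are assumptions about a generic OpenAI-compatible server, not ModelSweep's internal client:

```ts
// Minimal call to an OpenAI-compatible endpoint asking for 1-5 scores as JSON.
async function judgeResponse(baseUrl: string, apiKey: string, prompt: string, answer: string) {
  const res = await fetch(`${baseUrl}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify({
      model: "gpt-4o",
      messages: [
        {
          role: "system",
          content:
            "Score the answer 1-5 on accuracy, helpfulness, clarity, and instruction following. Reply as JSON.",
        },
        { role: "user", content: `Prompt:\n${prompt}\n\nAnswer:\n${answer}` },
      ],
    }),
  });
  const data = await res.json();
  return JSON.parse(data.choices[0].message.content); // e.g. { accuracy: 4, ... }
}
```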
Install Docker for code execution. Containers use python:3.11-slim, node:20-slim, golang:1.21-slim, rust:1.74-slim. Images auto-pull on first use.
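Pulling an image programmatically with dockerode looks roughly like this (a sketch with progress reporting and error handling kept minimal; not necessarily how ModelSweep wires it up):

```ts
import Docker from "dockerode";

const docker = new Docker();

// Pull an image and resolve once the pull stream finishes.
function pullImage(image: string): Promise<void> {
  return new Promise((resolve, reject) => {
    docker.pull(image, (err: Error | null, stream: NodeJS.ReadableStream) => {
      if (err) return reject(err);
      docker.modem.followProgress(stream, (doneErr) =>
        doneErr ? reject(doneErr) : resolve()
      );
    });
  });
}

// e.g. await pullImage("python:3.11-slim");
```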
When enabling the judge on a run, you can type custom instructions like:
- "Focus on code efficiency and use of docstrings"
- "Penalize brute force solutions"
- "Evaluate like a senior engineer doing a code review"
- CI/CD pipeline for automated testing and deployment
- MCP server integration for live tool calling evaluation
- Quantization impact metrics (compare Q4 vs Q8 of same model)
- Improved RAG evaluation (chunk-level grounding visualization, multi-document support)
- Vision evaluation enhancements (multi-image comparison, video frame analysis)
- Batch comparison mode (run same suite across model versions)
- Community leaderboard (opt-in anonymous score sharing)
All commands run from app/:
```bash
npm run dev        # Dev server at http://localhost:3000
npm run build      # Production build
npm run lint       # ESLint
npx tsc --noEmit   # Type check
```

- Conversation/adversarial scoring needs a judge model for meaningful scores (without one, most dimensions default to 3/5)
- Tool calling requires models that support Ollama's tool calling API
- Code sandbox requires Docker installed and running
- Vision testing requires vision-capable models (llava, llama3.2-vision, etc.)
- Peer judging needs 3+ models; small models are unreliable judges for code quality
MIT License. See LICENSE for details.
Built for COMP 590: HCI in the Age of AI — Professor Leonard McMillan
Built with Claude Code



