
Duel


Duel benchmarks LLMs on RTVS Duel, a Slovak quiz game with 10 multiple-choice questions and a 60-second time limit.

The project targets a hard case for LLMs: obscure Slovak words, anecdotes, and cultural references in an underrepresented-language setting. The repo now supports live browser runs, replay datasets, JSON artifacts, markdown reports, and CI-backed regression checks.

Why this project matters

  • niche benchmark with real Slovak-language and culture-heavy questions
  • live browser automation, not only synthetic prompt files
  • replay mode for reproducible offline comparisons
  • report artifacts that make model behavior easy to review on GitHub

Benchmark Tracker

Status legend:

  • done
  • in progress
  • todo

Historical live scores from the original project README are preserved below. Each average is the attained score over 20 runs; the maximum score per run is 10.

| Model | Status | Avg. score / 10 | Notes |
| --- | --- | --- | --- |
| gpt-oss-120b | done | 4.3 | Historical live runs |
| glm-4.7 | done | 3.3 | Historical live runs |
| qwen3-30b-a3b-instruct-2507 | done | 2.9 | Historical live runs |
| qwen3-30b-a3b-thinking-2507 | done | 3.8 | Historical live runs |
| mistral-7b-instruct | done | 1.2 | Historical live runs |
| gemini-2.5-flash | done | 10.0 | Historical live runs |
| gpt-4.1-mini | in progress | pending | OpenAI provider wired, benchmark run pending |
| gpt-4.1 | in progress | pending | OpenAI-compatible path available |
| gemini-2.5-pro | in progress | pending | Gemini provider path ready |
| claude-3.7-sonnet | todo | pending | Add Anthropic provider integration |
| llama-3.3-70b-instruct | todo | pending | Add hosted endpoint and benchmark run |
| deepseek-r1 | todo | pending | Add provider path and cost tracking |

Current Sample Report

The checked-in replay artifacts currently compare the offline oracle and baseline providers.

| Provider | Model | Source | Runs | Avg Score | Completion |
| --- | --- | --- | --- | --- | --- |
| baseline | always-a | replay | 2 | 0.0 | 0.0% |
| oracle | oracle | replay | 2 | 5.0 | 100.0% |

Full generated output:

Features

  • live benchmark mode against duelonline.sk via Selenium
  • replay benchmark mode from local JSON datasets
  • provider abstraction for OpenAI, Gemini, and offline baselines
  • per-run JSON artifacts with prompt, answer, latency, and result data
  • markdown leaderboard generation for GitHub-friendly presentation
  • pytest + Ruff + GitHub Actions CI
  • OpenCode GitHub automations for comments, PR review, issue triage, schedules, and manual runs

OpenCode GitHub

Configured workflows:

| Workflow | Triggers | Purpose |
| --- | --- | --- |
| OpenCode Comment Tasks | issue_comment, pull_request_review_comment, workflow_dispatch | react to /oc and /opencode comments |
| OpenCode PR Review | pull_request, workflow_dispatch | review benchmark PRs automatically |
| OpenCode Issue Triage | issues, workflow_dispatch | triage new issues with spam guard |
| OpenCode Scheduled Sweep | schedule, workflow_dispatch | weekly repository sweep |
| OpenCode Manual Task | workflow_dispatch | ad hoc OpenCode run from Actions tab |

Secrets required:

  • GEMINI_API_KEY
  • GITHUB_TOKEN is passed by GitHub Actions so OpenCode can comment, review, react, and open issues/PRs

Current model:

  • google/gemini-2.5-flash

Live leaderboard snapshot

The repository includes a small generated leaderboard snapshot shown on the project Pages site. We auto-generate a simple SVG from reports/summary.json using tools/generate_leaderboard_svg.py. To refresh locally:

python tools/generate_leaderboard_svg.py

The resulting docs/leaderboard.svg and docs/site-data.json feed the GitHub Pages site.

Why earlier GitHub Action failed

The earlier 403 Resource not accessible by integration error came from the old default-branch workflow config. That version had read-only GitHub permissions and did not pass GITHUB_TOKEN to OpenCode, so it could not add reactions or comments.

Current workflows now include (sketched in the excerpt after this list):

  • GITHUB_TOKEN
  • use_github_token: true
  • write permissions where OpenCode needs to comment, react, open issues, or create PRs
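
A rough illustration of how those pieces fit together in one workflow is shown below. This is a sketch, not the actual workflow file: the action reference and the exact placement of use_github_token and model are assumptions, and the authoritative definitions live in .github/workflows/.

permissions:
  contents: read
  issues: write
  pull-requests: write

jobs:
  opencode:
    runs-on: ubuntu-latest
    steps:
      - uses: ./path/to/opencode-action   # placeholder; see the repo's workflow files for the real action
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}     # lets OpenCode comment, react, and open issues/PRs
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        with:
          use_github_token: true           # listed above; exact input placement assumed
          model: google/gemini-2.5-flash   # current model from the section below; placement assumed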

Quick trigger tests

  1. Issue comment:
     /opencode explain this issue
  2. PR review comment on code:
     /oc add error handling here
  3. Manual run from Actions tab:
     • open OpenCode Manual Task
     • set prompt to something like Summarize current benchmark architecture and suggest one missing test.
  4. PR auto-review:
     • open or update a PR and check OpenCode PR Review
  5. Issue triage:
     • open a new issue from an account older than 30 days
  6. Scheduled sweep:
     • wait for cron or trigger OpenCode Scheduled Sweep manually

Architecture

live site / replay dataset
          |
          v
      runner.py
          |
          +--> browser.py        # live Selenium client
          +--> replay.py         # offline datasets
          +--> providers/*       # OpenAI, Gemini, baselines
          +--> storage.py        # JSON artifacts
          +--> reporting.py      # leaderboard + summaries

Quickstart

Install

uv sync --group dev

Configure providers

export DUEL_API_KEY=replace-me
export GEMINI_API_KEY=replace-me

Run replay benchmark

uv run duel benchmark \
  --source replay \
  --dataset examples/replay_sample.json \
  --provider oracle \
  --runs 2

Build report

uv run duel report

Run quality checks

make lint
make test

Example Commands

Replay benchmark

uv run duel benchmark --source replay --dataset examples/replay_sample.json --provider baseline

Larger replay dataset for regression checks:

uv run duel benchmark --source replay --dataset examples/replay_extended.json --provider oracle

Live benchmark

uv run duel benchmark --source live --provider openai --runs 1

Report only

uv run duel report

Config

Main config lives in config.yaml; a sketch of its overall shape follows the list below.

  • player: profile data sent to live game form
  • benchmark: default provider and artifact directory
  • providers: model defaults and environment variable names
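
A minimal sketch of the shape this implies is shown below. The key names are illustrative guesses based on the bullets above and the environment variables from the Quickstart; config.yaml in the repo is the authoritative schema.

player:
  nickname: duel-bot              # profile data sent to the live game form (field names illustrative)
benchmark:
  provider: oracle                # default provider
  artifact_dir: artifacts         # where per-run JSON artifacts land (key name assumed)
providers:
  openai:
    model: gpt-4.1-mini
    api_key_env: DUEL_API_KEY     # env var name from the Quickstart; key name assumed
  gemini:
    model: gemini-2.5-flash
    api_key_env: GEMINI_API_KEY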

Siemens / Custom Gemini endpoint example

If you need to route Gemini requests through an internal Siemens proxy or a custom base URL, add base_url under the providers.gemini section of the config. See config/siemens.example.yaml for a sample.
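
For example, the override might look like the snippet below; the URL is a placeholder, and config/siemens.example.yaml contains a complete sample.

providers:
  gemini:
    base_url: https://gemini-proxy.internal.example.com/v1   # placeholder proxy URL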

Cost rate overrides

Override cost estimation in config.yaml:

benchmark:
  cost_rates:
    gemini-2.5-flash:
      input_per_million: 0.4
      output_per_million: 3.0

Limitations

  • the replay dataset is still small and intended for smoke tests and demo flow
  • live mode depends on the external site's markup and timing behavior
  • token and cost estimates depend on a placeholder rate table and should be tuned against real billing

Next Steps

  1. Add larger labeled replay datasets captured from live sessions.
  2. Tune provider cost tables against real billing.
  3. Add richer charts/dashboard from reports/summary.json.
  4. Add live screenshots or short demo capture for portfolio use.

