Duel benchmarks LLMs on RTVS Duel, a Slovak quiz game with 10 multiple-choice questions and a 60-second time limit.
The project targets a hard case for LLMs: obscure Slovak words, anecdotes, and cultural references in an underrepresented language setting. The repo now supports live browser runs, replay datasets, JSON artifacts, markdown reports, and CI-backed regression checks.
- niche benchmark with real Slovak-language and culture-heavy questions
- live browser automation, not only synthetic prompt files
- replay mode for reproducible offline comparisons
- report artifacts that make model behavior easy to review on GitHub
Historical live scores from the original project README are preserved below: the average attained score over 20 runs, out of a maximum score of 10.
Checked-in replay artifacts currently compare offline oracle and baseline providers.
| Provider | Model | Source | Runs | Avg Score | Completion |
|---|---|---|---|---|---|
| baseline | always-a | replay | 2 | 0.0 | 0.0% |
| oracle | oracle | replay | 2 | 5.0 | 100.0% |
Features:
- live benchmark mode against duelonline.sk via Selenium
- replay benchmark mode from local JSON datasets
- provider abstraction for OpenAI, Gemini, and offline baselines (see the sketch after this list)
- per-run JSON artifacts with prompt, answer, latency, and result data
- markdown leaderboard generation for GitHub-friendly presentation
- pytest + Ruff + GitHub Actions CI
- OpenCode GitHub automations for comments, PR review, issue triage, schedules, and manual runs
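To make the provider abstraction concrete, here is a minimal sketch of what such an interface could look like. This is an illustrative assumption, not the actual `providers/*` API; only the `always-a` baseline name comes from the results table above.

```python
# Hypothetical provider interface -- a sketch, not the repo's real providers/* API.
from typing import Protocol


class Provider(Protocol):
    name: str

    def answer(self, question: str, choices: list[str]) -> str:
        """Return the chosen answer for one multiple-choice question."""
        ...


class AlwaysA:
    """Offline baseline in the spirit of the always-a row above: always pick choice A."""

    name = "always-a"

    def answer(self, question: str, choices: list[str]) -> str:
        return choices[0]
```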
Configured workflows:
| Workflow | Triggers | Purpose |
|---|---|---|
| OpenCode Comment Tasks | issue_comment, pull_request_review_comment, workflow_dispatch | react to /oc and /opencode comments |
| OpenCode PR Review | pull_request, workflow_dispatch | review benchmark PRs automatically |
| OpenCode Issue Triage | issues, workflow_dispatch | triage new issues with spam guard |
| OpenCode Scheduled Sweep | schedule, workflow_dispatch | weekly repository sweep |
| OpenCode Manual Task | workflow_dispatch | ad hoc OpenCode run from Actions tab |
Secrets required:
- `GEMINI_API_KEY`
- `GITHUB_TOKEN` is passed by GitHub Actions so OpenCode can comment, review, react, and open issues/PRs
Current model:
`google/gemini-2.5-flash`
The repository includes a small generated leaderboard snapshot shown on the project Pages site. We auto-generate a simple SVG from `reports/summary.json` using `tools/generate_leaderboard_svg.py`. To refresh locally:

```bash
python tools/generate_leaderboard_svg.py
```

The resulting `docs/leaderboard.svg` and `docs/site-data.json` feed the GitHub Pages site.
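For orientation, a minimal sketch of the kind of transformation such a generator performs. The file paths come from above, but the `summary.json` structure (a `rows` list with `provider` and `avg_score` fields) is an assumption:

```python
# Sketch of a summary -> SVG generator. Only the file paths are from the docs;
# the summary structure (rows/provider/avg_score) is an illustrative assumption.
import json
from pathlib import Path

summary = json.loads(Path("reports/summary.json").read_text())

lines = "".join(
    f'<text x="10" y="{20 * (i + 1)}">{row["provider"]}: {row["avg_score"]}</text>'
    for i, row in enumerate(summary.get("rows", []))
)
svg = f'<svg xmlns="http://www.w3.org/2000/svg" width="320" height="160">{lines}</svg>'

Path("docs/leaderboard.svg").write_text(svg)
Path("docs/site-data.json").write_text(json.dumps(summary))
```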
An earlier `403 Resource not accessible by integration` error came from the old default-branch workflow config: that version had read-only GitHub permissions and did not pass `GITHUB_TOKEN` to OpenCode, so it could not add reactions or comments.
Current workflows now include:
- `GITHUB_TOKEN` passed via `use_github_token: true`
- write permissions where OpenCode needs to comment, react, open issues, or create PRs
Example usage:

- Issue comment: `/opencode explain this issue`
- PR review comment on code: `/oc add error handling here`
- Manual run from the Actions tab:
  - open `OpenCode Manual Task`
  - set `prompt` to something like `Summarize current benchmark architecture and suggest one missing test.`
- PR auto-review: open or update a PR and check `OpenCode PR Review`
- Issue triage: open a new issue from an account older than 30 days
- Scheduled sweep: wait for the cron or trigger `OpenCode Scheduled Sweep` manually
Architecture:

```
live site / replay dataset
        |
        v
    runner.py
        |
        +--> browser.py    # live Selenium client
        +--> replay.py     # offline datasets
        +--> providers/*   # OpenAI, Gemini, baselines
        +--> storage.py    # JSON artifacts
        +--> reporting.py  # leaderboard + summaries
```
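As a rough illustration of that flow, the sketch below wires a question source to a provider the way `runner.py` might. The function names and signatures are assumptions that only mirror the diagram, not the repo's actual code:

```python
# Toy version of the diagrammed flow; names mirror the diagram, but the
# signatures are assumptions. storage.py/reporting.py become the final print.
from typing import Callable


def run(fetch_questions: Callable[[], list[str]],
        answer: Callable[[str], str]) -> list[tuple[str, str]]:
    """runner.py in miniature: pull questions from a source
    (browser.py live or replay.py offline) and route them to a provider."""
    return [(q, answer(q)) for q in fetch_questions()]


# Offline demo wiring, standing in for replay.py plus a trivial baseline:
results = run(lambda: ["Q1?", "Q2?"], lambda q: "A")
print(results)  # storage.py would persist this; reporting.py would summarize it
```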
Setup:

```bash
uv sync --group dev
export DUEL_API_KEY=replace-me
export GEMINI_API_KEY=replace-me
```

Run a replay benchmark and generate the report:

```bash
uv run duel benchmark \
  --source replay \
  --dataset examples/replay_sample.json \
  --provider oracle \
  --runs 2
uv run duel report
```

Lint and test:

```bash
make lint
make test
```

Baseline smoke run:

```bash
uv run duel benchmark --source replay --dataset examples/replay_sample.json --provider baseline
```

Larger replay dataset for regression checks:

```bash
uv run duel benchmark --source replay --dataset examples/replay_extended.json --provider oracle
```

Live mode:

```bash
uv run duel benchmark --source live --provider openai --runs 1
uv run duel report
```
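The replay commands above read local JSON datasets such as `examples/replay_sample.json`. As a hedged sketch of the shape one record might take (expressed as a Python literal; the field names are assumptions, so inspect the sample file for the real schema):

```python
# Illustrative shape of one replay record -- field names are assumptions,
# not the project's actual schema; see examples/replay_sample.json for truth.
sample_record = {
    "question": "Which word is the Slovak term for ...?",
    "choices": ["option A", "option B", "option C", "option D"],
    "correct_choice": 1,  # index into choices
}
```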
Main config lives in `config.yaml`:

- `player`: profile data sent to the live game form
- `benchmark`: default provider and artifact directory
- `providers`: model defaults and environment variable names
If you need to route Gemini requests through an internal Siemens proxy or a custom base URL, add `base_url` under the `providers.gemini` section of the config. See `config/siemens.example.yaml` for a sample.
Override cost estimation in `config.yaml`:

```yaml
benchmark:
  cost_rates:
    gemini-2.5-flash:
      input_per_million: 0.4
      output_per_million: 3.0
```

Known limitations:

- the replay dataset is still small and intended for smoke tests/demo flow
- live mode depends on external site markup and timing behavior
- token and cost estimates depend on a placeholder rate table and should be tuned for real billing
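To ground the cost bullet above, here is a minimal sketch of how a `cost_rates` entry like the one shown earlier might be applied. `estimate_cost` and the token counts are illustrative assumptions, not the project's real API:

```python
# Sketch of applying a cost_rates table; the rates mirror the YAML above,
# but estimate_cost and the token counts are illustrative assumptions.
RATES = {"gemini-2.5-flash": {"input_per_million": 0.4, "output_per_million": 3.0}}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    r = RATES[model]
    return (input_tokens / 1_000_000) * r["input_per_million"] + (
        output_tokens / 1_000_000
    ) * r["output_per_million"]


# Example: a 10-question run with ~2,000 input and ~200 output tokens.
print(estimate_cost("gemini-2.5-flash", 2_000, 200))  # 0.0014
```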
Next steps:

- Add larger labeled replay datasets captured from live sessions.
- Tune provider cost tables against real billing.
- Add richer charts/dashboards from `reports/summary.json`.
- Add live screenshots or a short demo capture for portfolio use.