"All men by nature desire to know." β Aristotle, Metaphysics
This repository provides the official evaluation code for FIKA-Bench. The benchmark spans 4 broad domains, 17 subcategories, and 228 fine-grained targets across Product, Nature, Transport, and Culture. FIKA-Bench focuses on evidence-grounded knowledge that is not meant to be solved by memorized benchmark labels alone: retained items are manually verified, screened for visual leakage, and paired with evidence supporting the gold answer.
- [2026-05-18] Code repository is released.
- [2026-05-17] Dataset is released.
- [2026-05-13] Paper is public on arXiv.
- Benchmark goal. FIKA-Bench evaluates whether multimodal models can acquire fine-grained visual knowledge to recognize unseen fine-grained categories.
- Leakage-aware construction. Public-source samples are filtered through model checks, reverse-image-search leakage checks, and human verification.
- Evidence-grounded labels. Retained samples include evidence that supports the gold answer.
fika-bench/
├── configs/ # OpenClaw/OpenCode configs
├── data/ # dataset placement
├── examples/ # one-command entry points
├── fika_bench/ # dataset, prompt, API, judge, and metric utils
├── scripts/ # evaluation, judging, and summarization scripts
├── workspaces/openclaw/ # OpenClaw workspace and skills
└── README.md
FIKA-Bench uses Apptainer
as the recommended reproduction method. The .sif image is treated as a clean
base environment similar to a Docker image, while Apptainer is better suited to
shared servers: it usually runs without sudo, preserves the calling user's
identity, and uses host bind mounts for data, caches, and working directories.
Required on the server:
apptainer
git
curl
Recommended directory layout:
/path/to/workdir/
├── containers/
│   └── vllm-openai.sif
├── release/
│   └── fika-bench-testset.zip
└── fika-bench/
With this layout, the scripts find the SIF and dataset zip automatically.
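As a rough illustration of what this automatic discovery could look like, the sketch below checks the sibling directories from the recommended layout. The function names and the search order are assumptions made for this example; the repository's scripts may search additional locations or honor the SIF and DATA_ZIP variables first.

```python
from pathlib import Path
from typing import Dict, Iterable, Optional


def find_first_existing(candidates: Iterable[Path]) -> Optional[Path]:
    """Return the first candidate that is an existing file, or None."""
    for path in candidates:
        if path.is_file():
            return path
    return None


def locate_inputs(repo_dir: Path) -> Dict[str, Optional[Path]]:
    """Look for the SIF and dataset zip next to the repository directory.

    Hypothetical helper: it only checks the recommended layout shown above.
    """
    workdir = repo_dir.parent
    return {
        "sif": find_first_existing([workdir / "containers" / "vllm-openai.sif"]),
        "data_zip": find_first_existing([workdir / "release" / "fika-bench-testset.zip"]),
    }
```

Calling `locate_inputs(Path.cwd())` from inside the repository directory returns `None` for any file that is missing, which is a convenient point to print a hint about setting SIF or DATA_ZIP explicitly.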
cd /path/to/workdir
git clone https://github.com/ligeng0197/FIKA-Bench.git
cd fika-bench
If you already have the repository:
cd /path/to/workdir/fika-bench
git pull --ff-only origin master
Request access to the gated dataset on Hugging Face:
https://huggingface.co/datasets/oking0197/FIKA-Bench/tree/main
After access is approved, download the dataset zip into the recommended location:
mkdir -p ../release
hf auth login
hf download oking0197/FIKA-Bench --repo-type dataset --local-dir ../release
After download, the expected archive path is:
../release/fika-bench-testset.zip
Build the vLLM Apptainer image from the official Docker image:
mkdir -p ../containers
bash examples/build_vllm_sif.sh
This downloads vllm/vllm-openai:latest from Docker Hub and converts it to an
Apptainer SIF. Set VLLM_DOCKER_IMAGE before running the script if you want to
pin a specific vLLM Docker tag or digest.
This creates:
../containers/vllm-openai.sif
../containers/vllm-openai.sif.sha256
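The checksum file makes it possible to confirm that a copied or cached image is intact. The sketch below is one way to do that in Python. It assumes the .sha256 file starts with the hex digest (the format written by `sha256sum`), which this README does not specify, and the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a multi-gigabyte SIF is never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_checksum(sif_path: Path, checksum_path: Path) -> bool:
    """Compare the file's digest with the first field of the checksum file.

    Assumed format: "<hex digest>  <file name>", as sha256sum writes it.
    """
    fields = checksum_path.read_text(encoding="utf-8").split()
    if not fields:
        return False
    return sha256_of_file(sif_path) == fields[0].lower()
```

For example, `matches_recorded_checksum(Path("../containers/vllm-openai.sif"), Path("../containers/vllm-openai.sif.sha256"))` returns `False` when the image was truncated during a copy.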
The next vLLM and OpenClaw steps reproduce the agent setting. If you only want to evaluate the plain Qwen3-VL model without OpenClaw, skip Sections 4-7 and use the local model commands in Additional Evaluation Modes.
OpenClaw calls Qwen3-VL through an OpenAI-compatible vLLM endpoint. Start this
endpoint in one host shell. tmux is recommended for long runs:
cd /path/to/workdir/fika-bench
GPU=0 PORT=8010 bash examples/apptainer_vllm_qwen3vl8b_server.sh
The script serves Qwen/Qwen3-VL-8B-Instruct-FP8 at:
http://127.0.0.1:8010/v1
Check the endpoint from another host shell:
curl http://127.0.0.1:8010/v1/models
Optional multi-GPU data-parallel serving:
GPU=0,1,2 DATA_PARALLEL_SIZE=3 PORT=8010 \
bash examples/apptainer_vllm_qwen3vl8b_server.sh
Run the OpenClaw evaluation wrapper from the host shell, not from inside an interactive container.
The wrapper calls apptainer exec internally and runs OpenClaw inside the SIF.
Smoke test:
cd /path/to/workdir/fika-bench
LIMIT=1 bash examples/apptainer_openclaw_eval.sh
Full all-sample run:
LIMIT=0 OUT_DIR=results/openclaw_qwen3vl8b_8010_apptainer_full \
bash examples/apptainer_openclaw_eval.sh
The wrapper installs portable Node.js and openclaw@2026.3.13 into:
runtime_cache/openclaw_apptainer/
OpenClaw runs are resumable. Re-running the same command with the same
OUT_DIR skips sample IDs already written to results.jsonl.
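The resume behavior can be pictured as a simple filter over sample IDs. The sketch below is not the repository's implementation. It assumes each line of results.jsonl is a JSON object with an ID field, and the key name `sample_id` is a placeholder chosen for this example.

```python
import json
from pathlib import Path
from typing import Iterable, List, Set


def load_done_ids(results_path: Path, id_key: str = "sample_id") -> Set[str]:
    """Collect the IDs already present in results.jsonl.

    `id_key` is a hypothetical field name; check the actual records.
    """
    done: Set[str] = set()
    if not results_path.exists():
        return done
    with results_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # Tolerate a partially written last line from an interrupted run.
                continue
            if isinstance(record, dict) and id_key in record:
                done.add(str(record[id_key]))
    return done


def pending_ids(all_ids: Iterable[str], done: Set[str]) -> List[str]:
    """Keep the original order and drop IDs that already have a result."""
    return [sample_id for sample_id in all_ids if str(sample_id) not in done]
```

Skipping unparsable lines means a sample whose record was cut off mid-write is simply evaluated again on the next run.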
Set your judge endpoint:
export JUDGE_BASE_URL=https://your-judge-endpoint/v1
export JUDGE_API_KEY=your_api_key
export JUDGE_MODEL=gpt-5-mini
Run the strict judge:
python3 scripts/judge_results.py \
--results results/openclaw_qwen3vl8b_8010_apptainer_full/results.jsonl \
--base-url "$JUDGE_BASE_URL" \
--api-key "$JUDGE_API_KEY" \
--judge-model "$JUDGE_MODEL" \
--resume
The script prints SUMMARY_JSON and writes judged outputs next to the result
directory. Our local reproduction of OpenClaw + Qwen3-VL-8B-Instruct-FP8
was approximately:
39 / 311 = 12.54% strict accuracy
Because agent behavior can involve external network access and other environment-dependent factors, measured performance may vary slightly across systems and runs.
| Variable | Meaning |
|---|---|
| SIF | Path to vllm-openai.sif if it is not in a searched location. |
| VLLM_DOCKER_IMAGE | Docker image used by examples/build_vllm_sif.sh. Defaults to vllm/vllm-openai:latest. |
| HF_CACHE_DIR | Host HuggingFace cache root. Defaults to ~/.cache/huggingface. |
| MODEL_PATH | Container-side model path. Usually not needed. |
| DATA_ZIP | Path to the dataset zip, for example fika-bench-testset.zip. |
| PICS_DEPS_DIR | Optional offline dependency directory containing PicImageSearch. |
| GPU | Visible GPU IDs for vLLM, for example 0 or 0,1,2. |
| PORT | vLLM server port. Defaults to 8010. |
| LIMIT | Number of samples for OpenClaw. 1 for smoke, 0 for full set. |
| OUT_DIR | Evaluation output directory. Reuse it to resume. |
| OPENCLAW_RUNTIME_DIR | Bind-mounted directory where Node.js/OpenClaw are installed. |
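For custom wrappers written in Python, the documented defaults can be mirrored when reading these variables. The sketch below covers only the three defaults stated in the table. The `RunConfig` class and `load_config` function are hypothetical names, and LIMIT is left unset when absent because this README does not state its default.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass
class RunConfig:
    port: int
    vllm_docker_image: str
    hf_cache_dir: Path
    limit: Optional[int]  # None when LIMIT is unset; no default is documented


def load_config(env: Mapping[str, str] = os.environ) -> RunConfig:
    """Read a subset of the variables from the table, applying documented defaults."""
    raw_limit = env.get("LIMIT")
    return RunConfig(
        port=int(env.get("PORT", "8010")),
        vllm_docker_image=env.get("VLLM_DOCKER_IMAGE", "vllm/vllm-openai:latest"),
        hf_cache_dir=Path(env.get("HF_CACHE_DIR", "~/.cache/huggingface")).expanduser(),
        limit=int(raw_limit) if raw_limit not in (None, "") else None,
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test without touching the real environment.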
Apptainer + OpenClaw is the main reproduction path. The repository also keeps ordinary model evaluation paths for convenience.
export OPENAI_BASE_URL=https://your-endpoint/v1
export OPENAI_API_KEY=...
export JUDGE_BASE_URL=https://your-judge-endpoint/v1
export JUDGE_API_KEY=...
export JUDGE_MODEL=gpt-5-mini
python3 scripts/run_api_eval.py \
--model gpt-5-mini \
--judge \
--concurrency 4 \
--api-image-max-long-side 1600 \
--resume
This path evaluates the plain Qwen3-VL model without OpenClaw and does not require starting the vLLM server in Section 4.
python3 scripts/run_local_transformers_eval.py \
--model-path /path/to/Qwen3-VL-8B-Instruct \
--model Qwen3-VL-8B-Instruct \
--max-pixels 802816 \
--max-new-tokens 512 \
--stop-after-first-json \
--judge \
--resume
python3 scripts/run_opencode_eval.py \
--opencode-bin opencode \
--config configs/opencode.example.json \
--model vllm-local/Qwen/Qwen3-VL-8B-Instruct-FP8 \
--timeout-sec 300 \
--resume
Each run writes normalized and raw records for auditability. Important files:
results/<run>/results.jsonl
results/<run>/traces/
results/<run>/sessions/
results/<run>/run_meta.json
results/<run>_judged/summary.json
results/<run>_judged/raw_judge_inputs/
results/<run>_judged/raw_judge_outputs/
The strict accuracy reported in summary.json is based on:
judge_verdict == "correct"
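To make the metric concrete, the sketch below recomputes strict accuracy from judged records. The `judge_verdict` field comes from the rule above. Treating every judged record as part of the denominator, including errored samples, is an assumption about how summary.json counts them.

```python
from typing import Dict, Iterable, Mapping, Union


def strict_accuracy(records: Iterable[Mapping[str, object]]) -> Dict[str, Union[int, float]]:
    """Count a record as correct only when judge_verdict is exactly "correct"."""
    total = 0
    correct = 0
    for record in records:
        total += 1
        if record.get("judge_verdict") == "correct":
            correct += 1
    accuracy = correct / total if total else 0.0
    return {"correct": correct, "total": total, "accuracy": accuracy}


# Same ratio as the reproduction reported in this README: 39 correct out of 311.
example = [{"judge_verdict": "correct"}] * 39 + [{"judge_verdict": "incorrect"}] * 272
summary = strict_accuracy(example)
print(f"{summary['correct']} / {summary['total']} = {summary['accuracy']:.2%}")
```

Any verdict other than the exact string "correct", including a missing field, counts against accuracy under this rule.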
- The dataset zip is distributed through the gated Hugging Face dataset and is not committed to this code repository.
- Long-running scripts are fail-safe at the sample level: sample errors are recorded, and the runner continues.
- Before long API or agent runs, scripts perform lightweight preflight checks to catch invalid API keys, missing vLLM endpoints, or wrong model names.
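A preflight check of the kind described in the last note can be sketched as follows. This is an illustration, not the repository's code. It relies on the OpenAI-compatible `/v1/models` response shape (a `data` list of objects with an `id`), and the function names are hypothetical.

```python
import json
import urllib.request
from typing import List


def model_ids_from_payload(payload: dict) -> List[str]:
    """Extract model IDs from an OpenAI-compatible /v1/models response body."""
    return [item["id"] for item in payload.get("data", []) if "id" in item]


def fetch_model_ids(base_url: str, api_key: str = "", timeout: float = 10.0) -> List[str]:
    """Query <base_url>/models; raises on network errors or HTTP error status."""
    request = urllib.request.Request(base_url.rstrip("/") + "/models")
    if api_key:
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return model_ids_from_payload(json.load(response))


def check_model_available(model: str, available: List[str]) -> None:
    """Fail fast with a readable message if the requested model is not served."""
    if model not in available:
        raise ValueError(f"model {model!r} not served; available: {available}")
```

For the local vLLM server described earlier, `check_model_available("Qwen/Qwen3-VL-8B-Instruct-FP8", fetch_model_ids("http://127.0.0.1:8010/v1"))` would raise before a long run starts if the endpoint is down or serves a different model.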
If you find this benchmark useful, please cite:
@misc{li2026fikabenchfinegrainedrecognitionfinegrained,
title={FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition},
author={Geng Li and Yuxin Peng},
year={2026},
eprint={2605.13193},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.13193},
}