FIKA-Bench

"All men by nature desire to know." — Aristotle, Metaphysics

This repository provides the official evaluation code for FIKA-Bench. The benchmark spans 4 broad domains, 17 subcategories, and 228 fine-grained targets across Product, Nature, Transport, and Culture. FIKA-Bench focuses on evidence-grounded knowledge and is designed so that memorized benchmark labels alone are not enough to solve it: retained items are manually verified, screened for visual leakage, and paired with evidence supporting the gold answer.

🔥 Updates

[2026-05-18] Code repository is released.
[2026-05-17] Dataset is released.
[2026-05-13] Paper is public on arXiv.

🎯 Overview

Benchmark goal. FIKA-Bench evaluates whether multimodal models can acquire fine-grained visual knowledge to recognize unseen fine-grained categories.
Leakage-aware construction. Public-source samples are filtered through model checks, reverse-image-search leakage checks, and human verification.
Evidence-grounded labels. Retained samples include evidence that supports the gold answer.

📁 Repository Structure

fika-bench/
├── configs/                    # OpenClaw/OpenCode configs
├── data/                       # dataset placement
├── examples/                   # one-command entry points
├── fika_bench/                 # dataset, prompt, API, judge, and metric utils
├── scripts/                    # evaluation, judging, and summarization scripts
├── workspaces/openclaw/        # OpenClaw workspace and skills
└── README.md

🕹️ Usage

1. Recommended Apptainer Setup

FIKA-Bench uses Apptainer as the recommended reproduction method. The .sif image is treated as a clean base environment similar to a Docker image, while Apptainer is better suited to shared servers: it usually runs without sudo, preserves the calling user's identity, and uses host bind mounts for data, caches, and working directories.

Required on the server:

apptainer
git
curl

Recommended directory layout:

/path/to/workdir/
├── containers/
│   └── vllm-openai.sif
├── release/
│   └── fika-bench-testset.zip
└── fika-bench/

With this layout, the scripts find the SIF and dataset zip automatically.
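
As a rough illustration of what that automatic discovery depends on, the sketch below checks the recommended layout before a run. It is not the repository's lookup code: the helper name `check_layout` is hypothetical, and only the three relative paths come from the layout above.

```python
from pathlib import Path

# Relative paths taken from the recommended directory layout.
EXPECTED = {
    "sif": Path("containers/vllm-openai.sif"),
    "data_zip": Path("release/fika-bench-testset.zip"),
    "repo": Path("fika-bench"),
}


def check_layout(workdir):
    """Return {name: exists} for each expected entry under workdir.

    Hypothetical helper for illustration; the real scripts may search
    additional locations.
    """
    root = Path(workdir)
    return {name: (root / rel).exists() for name, rel in EXPECTED.items()}
```

Calling `check_layout("/path/to/workdir")` returns a dictionary whose `False` entries point at whatever is still missing.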

2. Clone the Repository

cd /path/to/workdir
git clone https://github.com/ligeng0197/FIKA-Bench.git fika-bench
cd fika-bench

If you already have the repository:

cd /path/to/workdir/fika-bench
git pull --ff-only origin master

3. Prepare Data and Model

Request access to the gated dataset on Hugging Face:

https://huggingface.co/datasets/oking0197/FIKA-Bench/tree/main

After access is approved, download the dataset zip into the recommended location:

mkdir -p ../release
hf auth login
hf download oking0197/FIKA-Bench --repo-type dataset --local-dir ../release

After download, the expected archive path is:

../release/fika-bench-testset.zip

Build the vLLM Apptainer image from the official Docker image:

mkdir -p ../containers
bash examples/build_vllm_sif.sh

This downloads vllm/vllm-openai:latest from Docker Hub and converts it to an Apptainer SIF. Set VLLM_DOCKER_IMAGE before running the script if you want to pin a specific vLLM Docker tag or digest.

This creates:

../containers/vllm-openai.sif
../containers/vllm-openai.sif.sha256
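
The `.sha256` file can be used to confirm that a copied or cached image is intact. The sketch below is one way to do that in Python. It assumes the sidecar begins with the hex digest, as `sha256sum` writes it; the format produced by the build script is not documented here, so treat that as an assumption. The function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in chunks so a multi-gigabyte SIF never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sidecar(path, sidecar=None):
    """Compare the file's digest with the first token of its .sha256 sidecar.

    Assumption: the sidecar starts with the hex digest (sha256sum style).
    """
    sidecar = Path(sidecar) if sidecar else Path(str(path) + ".sha256")
    expected = sidecar.read_text().split()[0].lower()
    return sha256_of(path) == expected
```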

The following vLLM and OpenClaw steps reproduce the agent setting. If you only want to evaluate the plain Qwen3-VL model without OpenClaw, skip the remaining numbered steps (Sections 4-6) and use the local model commands in Additional Evaluation Modes.

4. Start the vLLM Server for OpenClaw

OpenClaw calls Qwen3-VL through an OpenAI-compatible vLLM endpoint. Start this endpoint in one host shell. tmux is recommended for long runs:

cd /path/to/workdir/fika-bench
GPU=0 PORT=8010 bash examples/apptainer_vllm_qwen3vl8b_server.sh

The script serves Qwen/Qwen3-VL-8B-Instruct-FP8 at:

http://127.0.0.1:8010/v1

Check the endpoint from another host shell:

curl http://127.0.0.1:8010/v1/models
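
The same check can be scripted, for example to wait for the server before launching a long run. The sketch below uses only the standard library and relies on the OpenAI-style response shape (a `data` list whose entries carry an `id`). The function names are illustrative and not part of this repository.

```python
import json
import urllib.request


def parse_model_ids(payload):
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url="http://127.0.0.1:8010/v1", timeout=5.0):
    """Fetch the model list; raises urllib.error.URLError if the server is down."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

If the served model name is missing from the list returned by `list_models()`, the server is most likely still loading weights or was started with a different model.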

Optional multi-GPU data-parallel serving:

GPU=0,1,2 DATA_PARALLEL_SIZE=3 PORT=8010 \
  bash examples/apptainer_vllm_qwen3vl8b_server.sh

5. Run OpenClaw in Apptainer

Run this command from the host shell, not from inside an interactive container. The wrapper calls apptainer exec internally and runs OpenClaw inside the SIF.

Smoke test:

cd /path/to/workdir/fika-bench
LIMIT=1 bash examples/apptainer_openclaw_eval.sh

Full all-sample run:

LIMIT=0 OUT_DIR=results/openclaw_qwen3vl8b_8010_apptainer_full \
  bash examples/apptainer_openclaw_eval.sh

The wrapper installs portable Node.js and openclaw@2026.3.13 into:

runtime_cache/openclaw_apptainer/

OpenClaw runs are resumable. Re-running the same command with the same OUT_DIR skips sample IDs already written to results.jsonl.
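
The resume behavior described above can be pictured with a minimal sketch. This is not the wrapper's actual code: the field name `sample_id` and both function names are assumptions made for illustration, and the real record schema may differ.

```python
import json
from pathlib import Path


def completed_ids(results_path, id_key="sample_id"):
    """Collect the IDs already present in a results.jsonl file.

    `id_key` is an assumed field name; adjust it to the real schema.
    """
    path = Path(results_path)
    if not path.exists():
        return set()
    done = set()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a half-written last line from an interrupted run
            if isinstance(record, dict) and id_key in record:
                done.add(record[id_key])
    return done


def pending(sample_ids, results_path):
    """Return the samples that still need to run, preserving input order."""
    done = completed_ids(results_path)
    return [sid for sid in sample_ids if sid not in done]
```

Skipping unparseable lines means a sample whose record was cut off mid-write is simply run again on the next invocation.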

6. Strict LLM-as-Judge

Set your judge endpoint:

export JUDGE_BASE_URL=https://your-judge-endpoint/v1
export JUDGE_API_KEY=your_api_key
export JUDGE_MODEL=gpt-5-mini

Run the strict judge:

python3 scripts/judge_results.py \
  --results results/openclaw_qwen3vl8b_8010_apptainer_full/results.jsonl \
  --base-url "$JUDGE_BASE_URL" \
  --api-key "$JUDGE_API_KEY" \
  --judge-model "$JUDGE_MODEL" \
  --resume

The script prints SUMMARY_JSON and writes judged outputs next to the result directory. Our local reproduction of OpenClaw + Qwen3-VL-8B-Instruct-FP8 was approximately:

39 / 311 = 12.54% strict accuracy

Because agent behavior can involve external network access and other environment-dependent factors, measured performance may vary slightly across systems and runs.

⚙️ Common Overrides

Variable	Meaning
`SIF`	Path to `vllm-openai.sif` if it is not in a searched location.
`VLLM_DOCKER_IMAGE`	Docker image used by `examples/build_vllm_sif.sh`. Defaults to `vllm/vllm-openai:latest`.
`HF_CACHE_DIR`	Host HuggingFace cache root. Defaults to `~/.cache/huggingface`.
`MODEL_PATH`	Container-side model path. Usually not needed.
`DATA_ZIP`	Path to the dataset zip, for example `fika-bench-testset.zip`.
`PICS_DEPS_DIR`	Optional offline dependency directory containing `PicImageSearch`.
`GPU`	Visible GPU IDs for vLLM, for example `0` or `0,1,2`.
`PORT`	vLLM server port. Defaults to `8010`.
`LIMIT`	Number of samples for OpenClaw: `1` for a smoke test, `0` for the full set.
`OUT_DIR`	Evaluation output directory. Reuse it to resume.
`OPENCLAW_RUNTIME_DIR`	Bind-mounted directory where Node.js/OpenClaw are installed.
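
The wrappers themselves are shell scripts, but the override pattern is the usual "environment variable wins, otherwise use the default". The Python sketch below mirrors it for the variables whose defaults are listed in the table. The function name is hypothetical, and treating an unset `GPU` as an empty list is an assumption, since the table gives no default for it.

```python
import os
from pathlib import Path


def resolve_overrides(env=None):
    """Resolve a few documented overrides, falling back to the listed defaults."""
    env = os.environ if env is None else env
    return {
        "VLLM_DOCKER_IMAGE": env.get("VLLM_DOCKER_IMAGE", "vllm/vllm-openai:latest"),
        "HF_CACHE_DIR": env.get(
            "HF_CACHE_DIR", str(Path.home() / ".cache" / "huggingface")
        ),
        "PORT": int(env.get("PORT", "8010")),
        # Assumption: no default is documented for GPU, so unset means "none given".
        "GPU": [g for g in env.get("GPU", "").split(",") if g],
    }
```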

🧪 Additional Evaluation Modes

Apptainer + OpenClaw is the main reproduction path. The repository also keeps ordinary model evaluation paths for convenience.

OpenAI-Compatible API

export OPENAI_BASE_URL=https://your-endpoint/v1
export OPENAI_API_KEY=...
export JUDGE_BASE_URL=https://your-judge-endpoint/v1
export JUDGE_API_KEY=...
export JUDGE_MODEL=gpt-5-mini

python3 scripts/run_api_eval.py \
  --model gpt-5-mini \
  --judge \
  --concurrency 4 \
  --api-image-max-long-side 1600 \
  --resume
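
Judging by its name, `--api-image-max-long-side 1600` caps the longer image edge at 1600 pixels before upload. The exact resizing logic in `scripts/run_api_eval.py` is not shown here, so the sketch below is only an assumed interpretation: it computes aspect-preserving target dimensions and never upscales. The function name is hypothetical.

```python
def fit_long_side(width, height, max_long_side=1600):
    """Return (width, height) scaled so the longer edge is at most max_long_side.

    Assumed behavior, for illustration only: images that already fit
    are returned unchanged.
    """
    long_side = max(width, height)
    if long_side <= max_long_side:
        return width, height
    scale = max_long_side / long_side
    return max(1, round(width * scale)), max(1, round(height * scale))
```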

Local Transformers

This path evaluates the plain Qwen3-VL model without OpenClaw and does not require starting the vLLM server in Section 4.

python3 scripts/run_local_transformers_eval.py \
  --model-path /path/to/Qwen3-VL-8B-Instruct \
  --model Qwen3-VL-8B-Instruct \
  --max-pixels 802816 \
  --max-new-tokens 512 \
  --stop-after-first-json \
  --judge \
  --resume

OpenCode

python3 scripts/run_opencode_eval.py \
  --opencode-bin opencode \
  --config configs/opencode.example.json \
  --model vllm-local/Qwen/Qwen3-VL-8B-Instruct-FP8 \
  --timeout-sec 300 \
  --resume

📤 Output Files

Each run writes normalized and raw records for auditability. Important files:

results/<run>/results.jsonl
results/<run>/traces/
results/<run>/sessions/
results/<run>/run_meta.json
results/<run>_judged/summary.json
results/<run>_judged/raw_judge_inputs/
results/<run>_judged/raw_judge_outputs/

The strict accuracy reported in summary.json is based on:

judge_verdict == "correct"
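
Under that rule, strict accuracy is the share of judged records whose verdict is exactly `correct`. The sketch below shows the computation on in-memory records. The function name is illustrative, and where the judged records are stored on disk is not specified here.

```python
def strict_accuracy(records):
    """Fraction of records whose judge_verdict is exactly "correct".

    Any other verdict, or a missing verdict, counts as not correct.
    """
    records = list(records)
    if not records:
        return 0.0
    correct = sum(1 for r in records if r.get("judge_verdict") == "correct")
    return correct / len(records)
```

With 39 correct records out of 311, this yields 39 / 311, which is 12.54% after rounding to two decimals.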

📝 Notes

The dataset zip is distributed through the gated Hugging Face dataset and is not committed to this code repository.
Long-running scripts are fail-safe at the sample level: sample errors are recorded, and the runner continues.
Before long API or agent runs, scripts perform lightweight preflight checks to catch invalid API keys, missing vLLM endpoints, or wrong model names.
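
The sample-level fail-safe behavior noted above follows a common pattern: catch the error, record it with the sample, and move on. The sketch below illustrates that pattern in general terms. It is not the repository's runner, and the record fields (`sample_id`, `output`, `error`) are assumed names.

```python
def run_all(samples, run_one):
    """Run every (sample_id, payload) pair; record failures instead of aborting."""
    records = []
    for sample_id, payload in samples:
        try:
            output = run_one(payload)
            records.append({"sample_id": sample_id, "output": output, "error": None})
        except Exception as exc:  # deliberate catch-all: one bad sample must not stop the run
            records.append({"sample_id": sample_id, "output": None, "error": repr(exc)})
    return records
```

Keeping the error text in the record makes failed samples easy to filter and retry later.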

📑 Citation

If you find this benchmark useful, please cite:

@misc{li2026fikabenchfinegrainedrecognitionfinegrained,
      title={FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition},
      author={Geng Li and Yuxin Peng},
      year={2026},
      eprint={2605.13193},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.13193},
}

🧩 Related Projects

vLLM: OpenAI-compatible model serving.
Qwen3-VL: the reproduced VLM backbone.
Apptainer: container runtime for HPC environments.
OpenClaw and OpenCode: agentic evaluation frameworks used in this benchmark.
