ligeng0197/FIKA-Bench

FIKA-Bench

Paper Dataset v1.0 Apptainer OpenClaw

"All men by nature desire to know." - Aristotle, Metaphysics

FIKA-Bench representative instances

This repository provides the official evaluation code for FIKA-Bench. The benchmark spans 4 broad domains, 17 subcategories, and 228 fine-grained targets across Product, Nature, Transport, and Culture. FIKA-Bench focuses on evidence-grounded knowledge that is not meant to be solved by memorized benchmark labels alone: retained items are manually verified, screened for visual leakage, and paired with evidence supporting the gold answer.

🔥 Updates

  • [2026-05-18] Code repository is released.
  • [2026-05-17] Dataset is released.
  • [2026-05-13] Paper is public on arXiv.

🎯 Overview

  • Benchmark goal. FIKA-Bench evaluates whether multimodal models can acquire fine-grained visual knowledge to recognize unseen fine-grained categories.
  • Leakage-aware construction. Public-source samples are filtered through model checks, reverse-image-search leakage checks, and human verification.
  • Evidence-grounded labels. Retained samples include evidence that supports the gold answer.

πŸ“ Repository Structure

fika-bench/
β”œβ”€β”€ configs/                    # OpenClaw/OpenCode configs
β”œβ”€β”€ data/                       # dataset placement
β”œβ”€β”€ examples/                   # one-command entry points
β”œβ”€β”€ fika_bench/                 # dataset, prompt, API, judge, and metric utils
β”œβ”€β”€ scripts/                    # evaluation, judging, and summarization scripts
β”œβ”€β”€ workspaces/openclaw/        # OpenClaw workspace and skills
└── README.md

πŸ•ΉοΈ Usage

1. Recommended Apptainer Setup

FIKA-Bench uses Apptainer as the recommended reproduction method. The .sif image is treated as a clean base environment similar to a Docker image, while Apptainer is better suited to shared servers: it usually runs without sudo, preserves the calling user's identity, and uses host bind mounts for data, caches, and working directories.

Required on the server:

apptainer
git
curl

Recommended directory layout:

/path/to/workdir/
├── containers/
│   └── vllm-openai.sif
├── release/
│   └── fika-bench-testset.zip
└── fika-bench/

With this layout, the scripts find the SIF and dataset zip automatically.
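Before a long run it can help to confirm that the layout is in place. The following is a minimal sketch, not part of the repository: `missing_layout_entries` and `EXPECTED` are hypothetical names, and the paths are copied from the recommended layout above.

```python
from pathlib import Path

# Entries from the recommended layout, relative to the work directory.
EXPECTED = (
    "containers/vllm-openai.sif",
    "release/fika-bench-testset.zip",
    "fika-bench",
)


def missing_layout_entries(workdir):
    """Return the expected entries that do not exist under workdir."""
    root = Path(workdir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

An empty list means all three entries were found; anything else names what still has to be downloaded, built, or cloned.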

2. Clone the Repository

cd /path/to/workdir
git clone https://github.com/ligeng0197/FIKA-Bench.git fika-bench
cd fika-bench

If you already have the repository:

cd /path/to/workdir/fika-bench
git pull --ff-only origin master

3. Prepare Data and Model

Request access to the gated dataset on Hugging Face:

https://huggingface.co/datasets/oking0197/FIKA-Bench/tree/main

After access is approved, download the dataset zip into the recommended location:

mkdir -p ../release
hf auth login
hf download oking0197/FIKA-Bench --repo-type dataset --local-dir ../release

After download, the expected archive path is:

../release/fika-bench-testset.zip

Build the vLLM Apptainer image from the official Docker image:

mkdir -p ../containers
bash examples/build_vllm_sif.sh

This downloads vllm/vllm-openai:latest from Docker Hub and converts it to an Apptainer SIF. Set VLLM_DOCKER_IMAGE before running the script if you want to pin a specific vLLM Docker tag or digest.

This creates:

../containers/vllm-openai.sif
../containers/vllm-openai.sif.sha256
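If the image is copied between servers, the recorded digest can be used to check the copy. This sketch assumes that the `.sha256` file begins with the hex digest, as in `sha256sum` output; the README does not specify the file format, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so a large SIF need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_digest(image_path, digest_path):
    """Compare the image digest with the first token of the sidecar file.

    Assumption: the sidecar starts with the hex digest ("<hex>  <filename>").
    """
    tokens = Path(digest_path).read_text().split()
    recorded = tokens[0].lower() if tokens else ""
    return sha256_of(image_path) == recorded
```

For example, `matches_recorded_digest("../containers/vllm-openai.sif", "../containers/vllm-openai.sif.sha256")` should return `True` for an intact image under that assumption.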

The next vLLM and OpenClaw steps reproduce the agent setting. If you only want to evaluate the plain Qwen3-VL model without OpenClaw, skip Sections 4-6 and use the local model commands in Additional Evaluation Modes.

4. Start the vLLM Server for OpenClaw

OpenClaw calls Qwen3-VL through an OpenAI-compatible vLLM endpoint. Start this endpoint in one host shell. tmux is recommended for long runs:

cd /path/to/workdir/fika-bench
GPU=0 PORT=8010 bash examples/apptainer_vllm_qwen3vl8b_server.sh

The script serves Qwen/Qwen3-VL-8B-Instruct-FP8 at:

http://127.0.0.1:8010/v1

Check the endpoint from another host shell:

curl http://127.0.0.1:8010/v1/models
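The same check can be scripted. The sketch below assumes the endpoint returns the usual OpenAI-style model list (a JSON object whose `data` array holds entries with an `id` field); the function names are hypothetical, and the ID actually reported depends on how the server script launches vLLM.

```python
import json
import urllib.request


def model_ids(payload):
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url="http://127.0.0.1:8010/v1", timeout=10):
    """Query the endpoint and return the model IDs it reports."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

Calling `fetch_model_ids()` while the server is still loading weights will typically raise a connection error, so a non-empty result is a reasonable readiness signal.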

Optional multi-GPU data-parallel serving:

GPU=0,1,2 DATA_PARALLEL_SIZE=3 PORT=8010 \
  bash examples/apptainer_vllm_qwen3vl8b_server.sh

5. Run OpenClaw in Apptainer

Run this command from the host shell, not from inside an interactive container. The wrapper calls apptainer exec internally and runs OpenClaw inside the SIF.

Smoke test:

cd /path/to/workdir/fika-bench
LIMIT=1 bash examples/apptainer_openclaw_eval.sh

Full all-sample run:

LIMIT=0 OUT_DIR=results/openclaw_qwen3vl8b_8010_apptainer_full \
  bash examples/apptainer_openclaw_eval.sh

The wrapper installs portable Node.js and openclaw@2026.3.13 into:

runtime_cache/openclaw_apptainer/

OpenClaw runs are resumable. Re-running the same command with the same OUT_DIR skips sample IDs already written to results.jsonl.
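The resume behavior can be pictured as follows. This is an illustrative sketch rather than the repository's implementation: the `sample_id` field name and both function names are assumptions, and skipping a truncated trailing line is a design choice of the sketch.

```python
import json


def completed_ids(jsonl_lines, id_key="sample_id"):
    """Collect IDs already present in results.jsonl content.

    Blank or unparsable lines (for example a line cut off by an
    interrupted write) are skipped so that sample is simply retried.
    """
    done = set()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and id_key in record:
            done.add(record[id_key])
    return done


def pending(samples, jsonl_lines, id_key="sample_id"):
    """Return the samples whose IDs have not been written yet."""
    done = completed_ids(jsonl_lines, id_key)
    return [s for s in samples if s[id_key] not in done]
```

Because the skip list is derived from the output file itself, changing `OUT_DIR` starts a fresh run while reusing it continues the old one.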

6. Strict LLM-as-Judge

Set your judge endpoint:

export JUDGE_BASE_URL=https://your-judge-endpoint/v1
export JUDGE_API_KEY=your_api_key
export JUDGE_MODEL=gpt-5-mini

Run the strict judge:

python3 scripts/judge_results.py \
  --results results/openclaw_qwen3vl8b_8010_apptainer_full/results.jsonl \
  --base-url "$JUDGE_BASE_URL" \
  --api-key "$JUDGE_API_KEY" \
  --judge-model "$JUDGE_MODEL" \
  --resume

The script prints SUMMARY_JSON and writes judged outputs next to the result directory. Our local reproduction of OpenClaw + Qwen3-VL-8B-Instruct-FP8 was approximately:

39 / 311 = 12.54% strict accuracy

Because agent behavior can involve external network access and other environment-dependent factors, measured performance may vary slightly across systems and runs.

βš™οΈ Common Overrides

Variable Meaning
SIF Path to vllm-openai.sif if it is not in a searched location.
VLLM_DOCKER_IMAGE Docker image used by examples/build_vllm_sif.sh. Defaults to vllm/vllm-openai:latest.
HF_CACHE_DIR Host HuggingFace cache root. Defaults to ~/.cache/huggingface.
MODEL_PATH Container-side model path. Usually not needed.
DATA_ZIP Path to the dataset zip, for example fika-bench-testset.zip.
PICS_DEPS_DIR Optional offline dependency directory containing PicImageSearch.
GPU Visible GPU IDs for vLLM, for example 0 or 0,1,2.
PORT vLLM server port. Defaults to 8010.
LIMIT Number of samples for OpenClaw. 1 for smoke, 0 for full set.
OUT_DIR Evaluation output directory. Reuse it to resume.
OPENCLAW_RUNTIME_DIR Bind-mounted directory where Node.js/OpenClaw are installed.
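The override pattern amounts to "use the environment variable if set, otherwise the documented default". The sketch below covers only the three defaults stated above; `resolve` and `parse_limit` are hypothetical helpers, and the actual shell wrappers may resolve values differently.

```python
import os

# Defaults documented in the overrides table above.
DEFAULTS = {
    "VLLM_DOCKER_IMAGE": "vllm/vllm-openai:latest",
    "HF_CACHE_DIR": "~/.cache/huggingface",
    "PORT": "8010",
}
PATH_VARS = {"HF_CACHE_DIR"}  # values in which "~" should be expanded


def resolve(name, environ=None):
    """Return the override from the environment, else the documented default."""
    env = os.environ if environ is None else environ
    value = env.get(name) or DEFAULTS[name]
    return os.path.expanduser(value) if name in PATH_VARS else value


def parse_limit(value):
    """Map LIMIT to a sample count; 0 means the full set (returned as None)."""
    count = int(value)
    return None if count == 0 else count
```

Treating an empty string like an unset variable is a choice made in this sketch so that `PORT= bash ...` falls back to 8010 instead of failing.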

🧪 Additional Evaluation Modes

Apptainer + OpenClaw is the main reproduction path. The repository also keeps ordinary model evaluation paths for convenience.

OpenAI-Compatible API

export OPENAI_BASE_URL=https://your-endpoint/v1
export OPENAI_API_KEY=...
export JUDGE_BASE_URL=https://your-judge-endpoint/v1
export JUDGE_API_KEY=...
export JUDGE_MODEL=gpt-5-mini

python3 scripts/run_api_eval.py \
  --model gpt-5-mini \
  --judge \
  --concurrency 4 \
  --api-image-max-long-side 1600 \
  --resume
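Judging by its name, `--api-image-max-long-side 1600` presumably caps the longer image edge at 1600 pixels before upload; the README does not describe the exact behavior. Under that assumption, an aspect-preserving resize computes its target size roughly like this (`fit_long_side` is a hypothetical helper, not a function from the repository):

```python
def fit_long_side(width, height, max_long_side=1600):
    """Scale (width, height) so the longer edge is at most max_long_side.

    Images that already fit are returned unchanged; the aspect ratio is
    preserved up to rounding.
    """
    long_side = max(width, height)
    if long_side <= max_long_side:
        return width, height
    scale = max_long_side / long_side
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Capping the long side bounds the request payload size regardless of the source image orientation.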

Local Transformers

This path evaluates the plain Qwen3-VL model without OpenClaw and does not require starting the vLLM server in Section 4.

python3 scripts/run_local_transformers_eval.py \
  --model-path /path/to/Qwen3-VL-8B-Instruct \
  --model Qwen3-VL-8B-Instruct \
  --max-pixels 802816 \
  --max-new-tokens 512 \
  --stop-after-first-json \
  --judge \
  --resume

OpenCode

python3 scripts/run_opencode_eval.py \
  --opencode-bin opencode \
  --config configs/opencode.example.json \
  --model vllm-local/Qwen/Qwen3-VL-8B-Instruct-FP8 \
  --timeout-sec 300 \
  --resume

📤 Output Files

Each run writes normalized and raw records for auditability. Important files:

results/<run>/results.jsonl
results/<run>/traces/
results/<run>/sessions/
results/<run>/run_meta.json
results/<run>_judged/summary.json
results/<run>_judged/raw_judge_inputs/
results/<run>_judged/raw_judge_outputs/

The strict accuracy reported in summary.json is based on:

judge_verdict == "correct"
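As a worked illustration of this criterion, the sketch below counts only records whose `judge_verdict` equals `"correct"`. It assumes per-sample judged records are available as dictionaries; where the repository stores them and which other verdict values exist are not stated here, so `strict_accuracy` is a hypothetical helper.

```python
def strict_accuracy(records):
    """Return (correct, total, percent); only verdict "correct" counts."""
    total = len(records)
    correct = sum(1 for r in records if r.get("judge_verdict") == "correct")
    percent = 100.0 * correct / total if total else 0.0
    return correct, total, percent
```

With 39 correct records out of 311, the percentage rounds to 12.54, matching the figure reported for the OpenClaw reproduction.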

πŸ“ Notes

  • The dataset zip is distributed through the gated Hugging Face dataset and is not committed to this code repository.
  • Long-running scripts are fail-safe at the sample level: sample errors are recorded, and the runner continues.
  • Before long API or agent runs, scripts perform lightweight preflight checks to catch invalid API keys, missing vLLM endpoints, or wrong model names.
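Sample-level fail-safety, as described in the notes above, can be sketched as a loop that records each failure and moves on. This is an illustration under assumptions, not the repository's runner: `run_all`, the record fields, and the `sample_id` key are all hypothetical.

```python
def run_all(samples, evaluate, id_key="sample_id"):
    """Evaluate every sample, recording errors instead of aborting the run."""
    records = []
    for sample in samples:
        try:
            output = evaluate(sample)
            records.append({"id": sample[id_key], "status": "ok", "output": output})
        except Exception as exc:  # broad on purpose: one bad sample must not stop the run
            records.append({"id": sample[id_key], "status": "error", "error": repr(exc)})
    return records
```

Recording the error alongside the sample ID keeps failed samples visible in the output, so they can be inspected or retried later.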

📑 Citation

If you find this benchmark useful, please cite:

@misc{li2026fikabenchfinegrainedrecognitionfinegrained,
      title={FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition},
      author={Geng Li and Yuxin Peng},
      year={2026},
      eprint={2605.13193},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.13193},
}

🧩 Related Projects

  • vLLM: OpenAI-compatible model serving.
  • Qwen3-VL: the reproduced VLM backbone.
  • Apptainer: container runtime for HPC environments.
  • OpenClaw and OpenCode: agentic evaluation frameworks used in this benchmark.
