EmbodiedEvalKit

A unified evaluation framework that simplifies embodied AI benchmarking with clean interfaces, supporting 25+ benchmarks and diverse model backends.

Features

🚀 Multiple Inference Backends: vLLM (tensor parallel & multi-GPU), HuggingFace Transformers, and API — switch backends in one config.
🤖 20+ Models Supported: GPT, Gemini, Qwen, InternVL, Molmo, Magma and more — generalist and embodied-specialist families out of the box.
📊 25 Embodied Benchmarks: Embodied QA, Spatial Reasoning, Embodied Pointing, Affordance and Location, and Embodied Planning, all in one place.
🗂️ Standardized Dataset Format: All benchmarks reorganized into HuggingFace Parquet — unified data pipeline for reproducible evaluation.
🎯 Unified Pointing Evaluation: One interface for diverse pointing formats, coordinate systems, and model-specific conventions.
🧩 Modular & Decoupled Design: Models, benchmarks, and metrics cleanly separated — easy to extend, swap, or customize independently.

Supported Benchmarks

Benchmark	HuggingFace Link	Paper Title
ERQA	FlagEval/ERQA	ERQA: A Benchmark for Embodied Referential Question Answering
CV-Bench	nyu-visionx/CV-Bench	Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
EmbSpatial	FlagEval/EmbSpatial-Bench	EmbSpatial: A Benchmark for Spatial Understanding in Embodied AI
SAT	FlagEval/SAT	SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models
RoboSpatial	chanhee-luke/RoboSpatial-Home	RoboSpatial: Teaching Spatial Understanding to 2D/3D VLMs
RoboVQA	IffYuan/RoboVQA	RoboVQA: Multimodal Long-Horizon Reasoning for Robotics
VABench-Point	IffYuan/VABench-P	From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation
Where2Place	FlagEval/Where2Place	RoboPoint: A VLM for Spatial Affordance Prediction
RefSpatial-Bench	FlagEval/RefSpatial-Bench	RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics
Part-Afford-2K	IffYuan/Part-Affordance-2K	Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
RoboRefit	IffYuan/RoboRefit	VL-Grasp: a 6-Dof Interactive Grasp Policy for Language-Oriented Objects in Cluttered Indoor Scenes
RoboAfford-Eval	Zray26/roboafford-eval	RoboAfford++: A Generative AI-Enhanced Dataset for Multimodal Affordance Learning in Robotic Manipulation and Navigation
VSI-Bench	IffYuan/vsi-bench	Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
OpenEQA	IffYuan/open-eqa	OpenEQA: Embodied Question Answering in the Real World
EgoPlan2	IffYuan/ego-plan	EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
PIOBench	IffYuan/PIO-Bench	Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding
PointBench	IffYuan/PointBench	PointArena: Probing Multimodal Grounding Through Language-Guided Pointing
Pixmo Points	IffYuan/pixmo-points-eval	Molmo and PixMo: Open Weights and Open Data for State-of-the-Art VLMs
BLINK	BLINK-Benchmark/BLINK	BLINK: Multimodal Large Language Models Can See but Not Perceive
COSMOS	IffYuan/COSMOS	Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
ShareRobot Trajectory	IffYuan/sharerobot_trajectory	RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete
VABench-Visual-Trace	IffYuan/vabench-v	From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation
RoboFAC	MINT-SJTU/RoboFAC-dataset	RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction
VLABench	VLABench/vlm_evaluation_v1.0	VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks
PIO-S3-Verified	IffYuan/PIO-S3	Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding

Note: Several benchmarks have been reformatted into HuggingFace Parquet for unified processing. All dataset copyrights and licenses belong to the original authors and publications. Please cite the original papers when using these datasets.

Supported Models

Model List

Model	Inference Backend	Backbone
GPT series	API	gpt
Gemini series	API	gemini-2.5
Gemini Robotics	API	gemini_robotics
Qwen2.5VL series	vLLM	qwen2_5
Qwen3VL series	vLLM	qwen3
InternVL series	HuggingFace	internvl
Molmo series	HuggingFace	molmo
RoboBrain2.0-7B	vLLM	qwen2_5
VeBrain	vLLM	qwen2_5
Mimo-Embodied	HuggingFace	mimo
Pelican-VL 1.0	vLLM	pelican
Magma-8B	HuggingFace	magma
Embodied-R1	vLLM	qwen2_5
Embodied-R1.5	vLLM	qwen3

Note: The framework supports various model sizes (7B, 8B, 72B, etc.) within each model family via the backbone parameter.

Backend Description

vLLM: High-performance inference engine with tensor parallelism and batch inference, suitable for locally deployed large models
API: Cloud-based model services via API calls (OpenAI, Google, etc.)
HuggingFace: Transformers-based inference supporting a wide range of open-source models
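
The Model List table pairs each backbone with one backend. As a rough illustration only, that pairing can be written as a lookup table. The dictionary and function names below are hypothetical and are not taken from the framework's code; only the backbone and backend names come from the table.

```python
# Backbone -> backend pairs copied from the Model List table.
# BACKBONE_TO_BACKEND and select_backend are hypothetical names for illustration.
BACKBONE_TO_BACKEND = {
    "gpt": "API",
    "gemini-2.5": "API",
    "gemini_robotics": "API",
    "qwen2_5": "vLLM",
    "qwen3": "vLLM",
    "pelican": "vLLM",
    "internvl": "HuggingFace",
    "molmo": "HuggingFace",
    "mimo": "HuggingFace",
    "magma": "HuggingFace",
}


def select_backend(backbone: str) -> str:
    """Return the backend listed for a backbone, or raise on unknown names."""
    try:
        return BACKBONE_TO_BACKEND[backbone]
    except KeyError:
        raise ValueError(f"Unknown backbone: {backbone!r}") from None


print(select_backend("qwen3"))     # vLLM
print(select_backend("internvl"))  # HuggingFace
```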

What is Backbone?

Backbone is a key architectural abstraction in EmbodiedEvalKit that identifies the model family and handles model-specific conventions. It is the interface layer that lets diverse models run through the same evaluation framework.

Key Roles of Backbone:

Inference Backend Selection: Automatically routes to the appropriate inference engine (vLLM, API, or HuggingFace) based on model type
Prompt Format Adaptation: Generates model-specific instruction formats and prompt templates for each benchmark
Coordinate System Handling: Manages different coordinate representations across models:
- qwen3, gemini-2.5: 0-1000 normalized coordinates
- qwen2_5, mimo: Absolute pixel coordinates
- molmo: 0-100 coordinates in XML format
- gpt, pelican, internvl, magma: 0-1 normalized coordinates

Backbone	Format	Example
qwen3, gemini-2.5	0-1000 normalized	`[500, 500]`
qwen2_5, mimo	Absolute pixels	`[256, 256]`
molmo	0-100 in XML	`<point x="50.0" y="50.0">`
gpt, pelican, internvl, magma	0-1 normalized	`(0.5, 0.5)`

Output Parsing: Converts model-specific output formats into standardized evaluation results

This abstraction lets EmbodiedEvalKit support models with very different architectures, APIs, and output conventions through a single evaluation pipeline.
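
To make the coordinate conventions concrete, here is a minimal sketch that parses the first point in a model response and rescales it to the 0-1 range. It assumes the output formats shown in the table above. The function name, the regular expressions, and the choice of 0-1 as the common target are illustrative assumptions, not the framework's actual parser.

```python
import re

# Scale conventions taken from the coordinate table; everything else is illustrative.
NORM_1000 = {"qwen3", "gemini-2.5"}
ABS_PIXELS = {"qwen2_5", "mimo"}
PERCENT_XML = {"molmo"}
UNIT = {"gpt", "pelican", "internvl", "magma"}

_NUMBER = r"-?\d+(?:\.\d+)?"
# Matches x="50.0" y="50.0" inside a Molmo-style <point> tag.
_MOLMO_POINT = re.compile(rf'x="({_NUMBER})"\s+y="({_NUMBER})"')
# Matches [500, 500] or (0.5, 0.5).
_PAIR = re.compile(rf"[\[(]\s*({_NUMBER})\s*,\s*({_NUMBER})\s*[\])]")


def to_unit_point(text, backbone, width, height):
    """Parse the first point in `text` and rescale it to 0-1 coordinates.

    Returns None when no point is found. `width` and `height` are the image
    size in pixels and are only used for absolute-pixel backbones.
    """
    if backbone in NORM_1000:
        scale_x, scale_y = 1000.0, 1000.0
    elif backbone in ABS_PIXELS:
        scale_x, scale_y = float(width), float(height)
    elif backbone in PERCENT_XML:
        scale_x, scale_y = 100.0, 100.0
    elif backbone in UNIT:
        scale_x, scale_y = 1.0, 1.0
    else:
        raise ValueError(f"Unknown backbone: {backbone!r}")

    pattern = _MOLMO_POINT if backbone in PERCENT_XML else _PAIR
    match = pattern.search(text)
    if match is None:
        return None
    return float(match.group(1)) / scale_x, float(match.group(2)) / scale_y


# The four table examples all describe the center of a 512x512 image.
print(to_unit_point("[500, 500]", "qwen3", 512, 512))                      # (0.5, 0.5)
print(to_unit_point("[256, 256]", "qwen2_5", 512, 512))                    # (0.5, 0.5)
print(to_unit_point('<point x="50.0" y="50.0">', "molmo", 512, 512))       # (0.5, 0.5)
print(to_unit_point("(0.5, 0.5)", "gpt", 512, 512))                        # (0.5, 0.5)
```

Normalizing to one range before scoring means a pointing metric only has to be written once, whatever convention the model uses.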

Quick Start

1. Installation (vLLM + Qwen3-VL as an Example)

# Install all dependencies
pip install -r requirements.txt

2. Set Up API Keys

Create key files in the project root (one-time setup):

echo "your_hf_token" > .hf_token
echo "your_api_key" > .api_key

The evaluation scripts read these files automatically, so you do not need to pass the keys as arguments.
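
Because `echo` appends a trailing newline, any code that reads these files has to strip whitespace before using the key. The helper below is a hypothetical sketch of such a reader. The function name and the None-on-missing behavior are assumptions and do not describe how the evaluation scripts actually load the keys.

```python
from pathlib import Path
from typing import Optional


def read_key_file(path: str) -> Optional[str]:
    """Return the stripped contents of a key file, or None if missing or empty."""
    key_path = Path(path)
    if not key_path.is_file():
        return None
    # strip() removes the trailing newline that `echo` writes.
    key = key_path.read_text(encoding="utf-8").strip()
    return key or None


# Report only whether a key exists; never print the key itself.
print("API key found:", read_key_file(".api_key") is not None)
```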

Evaluation

Eval Embodied-R1.5 via vLLM

If you have 2 GPUs, run:

bash scripts/eval_embodied_r1.5.sh

If you have 8 GPUs, run:

bash scripts/eval_er1.5_parallel.sh

Eval a single benchmark using vLLM

# Embodied-R1.5
python eval_erqa.py \
  --model_name Embodied-R1.5 \
  --model_path /path/to/Embodied-R1.5 \
  --backbone qwen3 \
  --max_model_len 20000 \
  --gpu_memory_utilization 0.8 \
  --tensor_parallel_size 2

# Qwen3-VL-8B-Instruct
python eval_erqa.py \
  --model_name Qwen3-VL-8B-Instruct \
  --model_path Qwen/Qwen3-VL-8B-Instruct \
  --backbone qwen3 \
  --max_model_len 20000 \
  --gpu_memory_utilization 0.8 \
  --tensor_parallel_size 2

Using the HF Backend

# InternVL3_5-8B
python eval_erqa.py \
  --model_name InternVL3_5-8B \
  --model_path OpenGVLab/InternVL3_5-8B \
  --backbone internvl \
  --max_model_len 20000

Using the API Backend

# gpt-5-mini
# Read the key from .api_key, dropping newlines and spaces, and export it
export API_KEY="$(tr -d '\n\r ' < ./.api_key)"
BASE_URL=https://api.gpt.ge/v1

python eval_erqa.py \
  --model_name gpt-5-mini \
  --model_path gpt-5-mini \
  --backbone gpt \
  --max_model_len 20000 \
  --base_url "${BASE_URL}"

See scripts/ for more examples.

Key Parameters

General Parameters

--model_name: Model name for logging and result tracking (e.g., "Qwen3-VL-8B-Instruct")
--model_path: Path to local model checkpoint or API model identifier (e.g., "/path/to/model" or "gpt-4o")
--backbone: Model backbone type that determines coordinate format and prompt style. Options: qwen3, qwen2_5, gpt, gemini-2.5, gemini_robotics, molmo, mimo, magma, internvl, pelican. Auto-detected from model_name if not specified.
--instruct_following: Custom instruction prompt to override default format instructions. Use this to test different prompting strategies.
--thinking_model: Enable thinking model mode for models that output reasoning in <think> tags and answers in <answer> tags (e.g., o1, DeepSeek-R1)
--debug: Debug mode that processes only the first 20 samples for quick testing
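
For the --thinking_model option, the sketch below shows one way an answer could be separated from the reasoning in a response that uses <think> and <answer> tags. The function name and the fallback behavior when no <answer> tag is present are assumptions for illustration, not the framework's parsing code.

```python
import re

_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
_THINK = re.compile(r"<think>.*?</think>", re.DOTALL)


def extract_answer(raw_output: str) -> str:
    """Return the text inside <answer> tags, ignoring the <think> block."""
    match = _ANSWER.search(raw_output)
    if match:
        return match.group(1).strip()
    # Assumed fallback: drop the reasoning block and keep whatever remains.
    return _THINK.sub("", raw_output).strip()


print(extract_answer("<think>The cup is on the left.</think><answer>A</answer>"))  # A
print(extract_answer("<think>Not sure.</think> B"))                                # B
```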

vLLM Backend Parameters

--tensor_parallel_size: Number of GPUs for tensor parallelism (default: 2). Use this to distribute model across multiple GPUs for larger models.
--max_model_len: Maximum context length in tokens (default: 10240). Adjust based on your GPU memory and benchmark requirements.
--gpu_memory_utilization: GPU memory utilization ratio from 0.0 to 1.0 (default: 0.8). Lower values leave more memory for other processes.
--max_images_per_prompt: Maximum number of images per prompt (default: 16). Increase for multi-image benchmarks.
--max_videos_per_prompt: Maximum number of videos per prompt (default: 1). Increase for video-based benchmarks.
--seed: Random seed for reproducibility (default: 3407)

API Backend Parameters

--max_concurrent_requests: Maximum number of concurrent API requests (default: 100). Adjust based on API rate limits.
--base_url: Custom API base URL for self-hosted or alternative API endpoints. Leave empty to use default endpoints.
--max_tokens: Maximum tokens to generate (default: 4096)
--temperature: Sampling temperature from 0.0 to 2.0 (default: 0.7). Lower values make output more deterministic.
--top_p: Nucleus sampling parameter from 0.0 to 1.0 (default: 0.8)
--top_k: Top-k sampling parameter (default: 20)
--repetition_penalty: Penalty for repeating tokens (default: 1.05). Values > 1.0 discourage repetition.
--presence_penalty: Penalty for token presence (default: 0.0). Used by some API providers like OpenAI.

HuggingFace Backend Parameters

--device: Device type for inference. Options: cuda (GPU), cpu. Default is cuda if available.
--dtype: Data type for model weights and computation. Options: bfloat16 (recommended for modern GPUs), float16, float32. Default: bfloat16.
--batch_size: Batch size for inference (default: 1). Increase for faster processing if GPU memory allows.
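
The documented flags and defaults map naturally onto an argparse parser. The sketch below covers a subset of them as an illustration. Which flags are required, and the use of store_true for --debug, are assumptions. See eval_partafford.py for the real argument parsing.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a subset of the flags listed above (illustrative only)."""
    parser = argparse.ArgumentParser(description="Sketch of evaluation arguments")
    # General parameters
    parser.add_argument("--model_name", required=True)
    parser.add_argument("--model_path", required=True)
    parser.add_argument("--backbone", default=None)
    parser.add_argument("--debug", action="store_true")
    # vLLM backend parameters, with the documented defaults
    parser.add_argument("--tensor_parallel_size", type=int, default=2)
    parser.add_argument("--max_model_len", type=int, default=10240)
    parser.add_argument("--gpu_memory_utilization", type=float, default=0.8)
    parser.add_argument("--seed", type=int, default=3407)
    # API backend parameters
    parser.add_argument("--max_tokens", type=int, default=4096)
    parser.add_argument("--temperature", type=float, default=0.7)
    parser.add_argument("--top_p", type=float, default=0.8)
    # HuggingFace backend parameters
    parser.add_argument("--dtype", choices=["bfloat16", "float16", "float32"],
                        default="bfloat16")
    parser.add_argument("--batch_size", type=int, default=1)
    return parser


# Mirrors the Qwen3-VL-8B-Instruct command shown earlier.
args = build_parser().parse_args([
    "--model_name", "Qwen3-VL-8B-Instruct",
    "--model_path", "Qwen/Qwen3-VL-8B-Instruct",
    "--backbone", "qwen3",
    "--max_model_len", "20000",
])
print(args.max_model_len, args.tensor_parallel_size, args.debug)  # 20000 2 False
```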

Adding New Benchmarks

Adding a new benchmark involves three main steps: creating a dataset class, implementing an evaluation script, and registering the benchmark. The framework provides a unified interface that handles different model backends automatically.

1. Create Dataset Class

Create a new file in benchmark/ (e.g., benchmark/my_benchmark.py) with a class that inherits from BaseDataset. You can start by copying any existing benchmark in the benchmark/ folder.

Key Methods to Implement:

Refer to benchmark/base.py for the base class interface and benchmark/vabench_point.py for a complete example.

Required methods:

get_default_instruct(): Return backbone-specific instruction format
load_dataset(): Load dataset from HuggingFace
prepare_dataset(): Convert to standard format with 'question', 'image', 'metadata'
process_raw_output(): Parse model output and compute metrics
evaluate_results(): Evaluate all results
compute_statistics(): Aggregate metrics
save_results(): Save to logs/
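
To show how the seven methods fit together, here is a self-contained toy class. It deliberately does not inherit from BaseDataset, so it runs without the repository. The method signatures, the in-memory data, and the single-letter scoring rule are all assumptions. Check benchmark/base.py for the real interface before implementing a benchmark.

```python
import json
from pathlib import Path


class ToyBenchmarkDataset:
    """Stand-in with the seven methods listed above; NOT a BaseDataset subclass."""

    def __init__(self, backbone="qwen3", log_dir="logs"):
        self.backbone = backbone
        self.log_dir = Path(log_dir)

    def get_default_instruct(self):
        # A real benchmark would vary this by self.backbone.
        return "Answer with a single letter."

    def load_dataset(self):
        # A real benchmark would load Parquet data from HuggingFace here.
        return [
            {"q": "Which object is closer? A or B?", "img": None, "gt": "A"},
            {"q": "Which way does the arm move? A or B?", "img": None, "gt": "B"},
        ]

    def prepare_dataset(self, raw_dataset):
        instruct = self.get_default_instruct()
        return [
            {
                "question": f"{row['q']} {instruct}",
                "image": row["img"],
                "metadata": {"answer": row["gt"]},
            }
            for row in raw_dataset
        ]

    def process_raw_output(self, raw_output, sample):
        # Toy rule: the first character of the output is the predicted letter.
        prediction = raw_output.strip().upper()[:1]
        return {
            "prediction": prediction,
            "correct": prediction == sample["metadata"]["answer"],
        }

    def evaluate_results(self, prepared_dataset, raw_outputs):
        return [
            self.process_raw_output(output, sample)
            for sample, output in zip(prepared_dataset, raw_outputs)
        ]

    def compute_statistics(self, results):
        total = len(results)
        correct = sum(r["correct"] for r in results)
        return {"total": total, "accuracy": correct / total if total else 0.0}

    def save_results(self, results, statistics):
        self.log_dir.mkdir(parents=True, exist_ok=True)
        out_path = self.log_dir / "toy_benchmark.json"
        payload = {"statistics": statistics, "results": results}
        out_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
        return out_path


dataset = ToyBenchmarkDataset()
prepared = dataset.prepare_dataset(dataset.load_dataset())
results = dataset.evaluate_results(prepared, ["A", "c"])  # fake model outputs
print(dataset.compute_statistics(results))  # {'total': 2, 'accuracy': 0.5}
```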

2. Create Evaluation Script

Copy an existing evaluation script (e.g., eval_partafford.py) and adapt it to your dataset class:

from benchmark.my_benchmark import MyBenchmarkDataset
from core.inference import create_inference_engine
from core.logger import setup_logging

def main(args, task_name, model_name, model_path, instruct_following, dataset_name, subset, split):
    # Initialize dataset
    dataset = MyBenchmarkDataset(
        dataset_name=dataset_name,
        subset=subset,
        split=split,
        instruct_following=instruct_following,
        task_name=task_name,
        model_name=model_name,
        backbone=args.backbone,
        debug=args.debug
    )

    # Load and prepare
    raw_dataset = dataset.load_dataset()
    prepared_dataset = dataset.prepare_dataset(raw_dataset)

    # Run inference
    inference_engine, engine = create_inference_engine(args, model_path, model_name)
    raw_outputs = inference_engine.batch_inference(prepared_dataset, engine)

    # Evaluate and save
    results = dataset.evaluate_results(prepared_dataset, raw_outputs)
    statistics = dataset.compute_statistics(results)
    dataset.save_results(results, statistics)

Refer to eval_partafford.py for the complete template with argument parsing.

Contributing

We welcome contributions to EmbodiedEvalKit! Here's how you can help:

See CONTRIBUTING.md for commit conventions.

Pull Requests

Fork the repository and create your branch from main
Add your benchmark implementation following the structure in Adding New Benchmarks
Ensure your code follows the existing style and passes all tests
Update documentation if you're adding new features
Submit a pull request with a clear description of your changes

For bug fixes, feature requests, or questions, please open an issue on our GitHub repository.

Citation

If you find this work useful, please consider citing:

@article{yuan2026embodiedr15,
  title={Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models},
  author={Yuan, Yifu and Huang, Yaoting and Zhang, Shuoheng and Yao, Xianze and Li, Pengyi and Han, Linqi and Liu, Yuhao and Li, Yutong and Liao, Ruihao and Wu, Qiyu and Li, Yuxiao and Zhang, Zhao and Sun, Jiangeng and Jia, Wenting and Li, Chen and Dong, Zibin and Ni, Fei and Zheng, Yan and Gu, Shuyang and Ma, Yi and Tang, Hongyao and Hu, Han and Hao, Jianye},
  journal={arXiv preprint},
  year={2026}
}

@article{yuan2025embodied,
  title={Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation},
  author={Yuan, Yifu and Cui, Haiqin and Huang, Yaoting and Chen, Yibin and Ni, Fei and Dong, Zibin and Li, Pengyi and Zheng, Yan and Hao, Jianye},
  journal={ICLR 2026},
  year={2025}
}

@article{yuan2025seeing,
  title={From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation},
  author={Yuan, Yifu and Cui, Haiqin and Chen, Yibin and Dong, Zibin and Ni, Fei and Kou, Longxin and Liu, Jinyi and Li, Pengyi and Zheng, Yan and Hao, Jianye},
  journal={ICLR 2026},
  year={2025}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
