A unified evaluation framework that simplifies embodied AI benchmarking with clean interfaces, supporting 25+ benchmarks and diverse model backends.
- 🚀 Multiple Inference Backends: vLLM (tensor parallel & multi-GPU), HuggingFace Transformers, and API — switch backends in one config.
- 🤖 20+ Models Supported: GPT, Gemini, Qwen, InternVL, Molmo, Magma and more — generalist and embodied-specialist families out of the box.
- 📊 20+ Embodied Benchmarks: Embodied QA, Spatial Reasoning, Embodied Pointing, Affordance and Location, and Embodied Planning — all in one place.
- 🗂️ Standardized Dataset Format: All benchmarks reorganized into HuggingFace Parquet — unified data pipeline for reproducible evaluation.
- 🎯 Unified Pointing Evaluation: One interface for diverse pointing formats, coordinate systems, and model-specific conventions.
- 🧩 Modular & Decoupled Design: Models, benchmarks, and metrics cleanly separated — easy to extend, swap, or customize independently.
Note: Several benchmarks have been reformatted into HuggingFace Parquet for unified processing. All dataset copyrights and licenses belong to the original authors and publications. Please cite the original papers when using these datasets.
| Model | Inference Backend | Backbone |
|---|---|---|
| GPT series | API | gpt |
| Gemini series | API | gemini-2.5 |
| Gemini Robotics | API | gemini_robotics |
| Qwen2.5VL series | vLLM | qwen2_5 |
| Qwen3VL series | vLLM | qwen3 |
| InternVL series | HuggingFace | internvl |
| Molmo series | HuggingFace | molmo |
| RoboBrain2.0-7B | vLLM | qwen2_5 |
| VeBrain | vLLM | qwen2_5 |
| Mimo-Embodied | HuggingFace | mimo |
| Pelican-VL 1.0 | vLLM | pelican |
| Magma-8B | HuggingFace | magma |
| Embodied-R1 | vLLM | qwen2_5 |
| Embodied-R1.5 | vLLM | qwen3 |
Note: The framework supports various model sizes (7B, 8B, 72B, etc.) within each model family via the backbone parameter.
- vLLM: High-performance inference engine with tensor parallelism and batch inference, suitable for locally deployed large models
- API: Cloud-based model services via API calls (OpenAI, Google, etc.)
- HuggingFace: Transformers-based inference supporting a wide range of open-source models
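The pairing of backbones and backends in the model table above can be read as a simple lookup. The following is a minimal sketch of that idea only: `BACKEND_BY_BACKBONE` and `select_backend` are hypothetical names, and the framework's actual routing logic may differ.

```python
# Backbone -> backend pairs copied from the model table above.
# The dictionary and function names are illustrative, not the framework's API.
BACKEND_BY_BACKBONE = {
    "gpt": "api",
    "gemini-2.5": "api",
    "gemini_robotics": "api",
    "qwen2_5": "vllm",
    "qwen3": "vllm",
    "pelican": "vllm",
    "internvl": "huggingface",
    "molmo": "huggingface",
    "mimo": "huggingface",
    "magma": "huggingface",
}


def select_backend(backbone: str) -> str:
    """Return the backend name listed for a backbone, or raise on unknown input."""
    try:
        return BACKEND_BY_BACKBONE[backbone]
    except KeyError:
        known = ", ".join(sorted(BACKEND_BY_BACKBONE))
        raise ValueError(f"Unknown backbone {backbone!r}; expected one of: {known}") from None
```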
Backbone is a key architectural abstraction in OmniScope that identifies the model family and handles model-specific conventions. It serves as a unified interface layer that enables diverse models to work seamlessly within the evaluation framework.
Key Roles of Backbone:
- Inference Backend Selection: Automatically routes to the appropriate inference engine (vLLM, API, or HuggingFace) based on model type
- Prompt Format Adaptation: Generates model-specific instruction formats and prompt templates for each benchmark
- Coordinate System Handling: Manages different coordinate representations across models:
  - qwen3, gemini-2.5: 0-1000 normalized coordinates
  - qwen2_5, mimo: Absolute pixel coordinates
  - molmo: 0-100 coordinates in XML format
  - gpt, pelican, internvl, magma: 0-1 normalized coordinates
| Backbone | Format | Example |
|---|---|---|
| qwen3, gemini-2.5 | 0-1000 normalized | [500, 500] |
| qwen2_5, mimo | Absolute pixels | [256, 256] |
| molmo | 0-100 in XML | <point x="50.0" y="50.0"> |
| gpt, pelican, internvl, magma | 0-1 normalized | (0.5, 0.5) |
- Output Parsing: Converts model-specific output formats into standardized evaluation results
This abstraction allows OmniScope to support models with vastly different architectures, APIs, and conventions through a single unified evaluation pipeline.
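To make the coordinate conventions in the table above concrete, here is a minimal sketch that converts one predicted point from each backbone's format into absolute pixels. The names `COORD_SCALE` and `to_pixels` are hypothetical and not taken from the codebase. The sketch also assumes every model emits points in (x, y) order, which may not hold for all models.

```python
import re

# Value that corresponds to the full image width/height for each backbone,
# copied from the table above. None means the model already emits pixels.
COORD_SCALE = {
    "qwen3": 1000.0,
    "gemini-2.5": 1000.0,
    "qwen2_5": None,
    "mimo": None,
    "molmo": 100.0,
    "gpt": 1.0,
    "pelican": 1.0,
    "internvl": 1.0,
    "magma": 1.0,
}

_NUMBER = r"-?\d+(?:\.\d+)?"
# Matches x="50.0" y="50.0" inside a <point ...> tag.
_XML_POINT = re.compile(rf'x="({_NUMBER})"\s+y="({_NUMBER})"')
# Matches [500, 500] or (0.5, 0.5).
_PAIR = re.compile(rf"[\[(]\s*({_NUMBER})\s*,\s*({_NUMBER})\s*[\])]")


def to_pixels(raw_output: str, backbone: str, width: int, height: int):
    """Parse the first point in raw_output and return it as (x, y) in pixels.

    Returns None when no point can be found. Assumes (x, y) ordering.
    """
    pattern = _XML_POINT if backbone == "molmo" else _PAIR
    match = pattern.search(raw_output)
    if match is None:
        return None
    x, y = float(match.group(1)), float(match.group(2))
    scale = COORD_SCALE[backbone]
    if scale is None:
        return (x, y)
    return (x / scale * width, y / scale * height)
```

With a 640x480 image, the three normalized examples in the table all map to the image center, while the `qwen2_5` example is returned unchanged as pixels.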
# Install all dependencies
pip install -r requirements.txt
Create key files in the project root (one-time setup):
echo "your_hf_token" > .hf_token
echo "your_api_key" > .api_key
The evaluation scripts read these files automatically, so there is no need to pass the keys as arguments.
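For reference, reading such a key file from Python might look like the sketch below. `read_key` is a hypothetical helper, not the function the evaluation scripts use. It strips surrounding whitespace because `echo` appends a trailing newline to the file.

```python
from pathlib import Path
from typing import Optional


def read_key(path: str) -> Optional[str]:
    """Return the stripped contents of a key file, or None if missing or empty."""
    key_file = Path(path)
    if not key_file.is_file():
        return None
    key = key_file.read_text(encoding="utf-8").strip()
    return key or None


if __name__ == "__main__":
    # File names match the setup commands above.
    api_key = read_key(".api_key")
    hf_token = read_key(".hf_token")
    print("API key found:", api_key is not None)
    print("HF token found:", hf_token is not None)
```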
If you have 2 GPUs, run:
bash scripts/eval_embodied_r1.5.sh
If you have 8 GPUs, run:
bash scripts/eval_er1.5_parallel.sh
# Embodied-R1.5
python eval_erqa.py \
--model_name Embodied-R1.5 \
--model_path /path/to/Embodied-R1.5 \
--backbone qwen3 \
--max_model_len 20000 \
--gpu_memory_utilization 0.8 \
--tensor_parallel_size 2
# Qwen3-VL-8B-Instruct
python eval_erqa.py \
--model_name Qwen3-VL-8B-Instruct \
--model_path Qwen/Qwen3-VL-8B-Instruct \
--backbone qwen3 \
--max_model_len 20000 \
--gpu_memory_utilization 0.8 \
--tensor_parallel_size 2
# InternVL3_5-8B
python eval_erqa.py \
--model_name InternVL3_5-8B \
--model_path OpenGVLab/InternVL3_5-8B \
--backbone internvl \
--max_model_len 20000
# gpt-5-mini
API_KEY=$(cat ./.api_key | tr -d '\n\r ')
export API_KEY=$API_KEY
BASE_URL=https://api.gpt.ge/v1
python eval_erqa.py \
--model_name gpt-5-mini \
--model_path gpt-5-mini \
--backbone gpt \
--max_model_len 20000 \
--base_url ${BASE_URL}
See scripts/ for more examples.
- --model_name: Model name for logging and result tracking (e.g., "Qwen3-VL-8B-Instruct")
- --model_path: Path to local model checkpoint or API model identifier (e.g., "/path/to/model" or "gpt-4o")
- --backbone: Model backbone type that determines coordinate format and prompt style. Options: qwen3, qwen2_5, gpt, gemini-2.5, gemini_robotics, molmo, mimo, magma, internvl, pelican. Auto-detected from model_name if not specified.
- --instruct_following: Custom instruction prompt to override default format instructions. Use this to test different prompting strategies.
- --thinking_model: Enable thinking model mode for models that output reasoning in <think> tags and answers in <answer> tags (e.g., o1, DeepSeek-R1)
- --debug: Debug mode that processes only the first 20 samples for quick testing
- --tensor_parallel_size: Number of GPUs for tensor parallelism (default: 2). Use this to distribute a larger model across multiple GPUs.
- --max_model_len: Maximum context length in tokens (default: 10240). Adjust based on your GPU memory and benchmark requirements.
- --gpu_memory_utilization: GPU memory utilization ratio from 0.0 to 1.0 (default: 0.8). Lower values leave more memory for other processes.
- --max_images_per_prompt: Maximum number of images per prompt (default: 16). Increase for multi-image benchmarks.
- --max_videos_per_prompt: Maximum number of videos per prompt (default: 1). Increase for video-based benchmarks.
- --seed: Random seed for reproducibility (default: 3407)
- --max_concurrent_requests: Maximum number of concurrent API requests (default: 100). Adjust based on API rate limits.
- --base_url: Custom API base URL for self-hosted or alternative API endpoints. Leave empty to use default endpoints.
- --max_tokens: Maximum tokens to generate (default: 4096)
- --temperature: Sampling temperature from 0.0 to 2.0 (default: 0.7). Lower values make output more deterministic.
- --top_p: Nucleus sampling parameter from 0.0 to 1.0 (default: 0.8)
- --top_k: Top-k sampling parameter (default: 20)
- --repetition_penalty: Penalty for repeating tokens (default: 1.05). Values > 1.0 discourage repetition.
- --presence_penalty: Penalty for token presence (default: 0.0). Used by some API providers like OpenAI.
- --device: Device type for inference. Options: cuda (GPU), cpu. Default is cuda if available.
- --dtype: Data type for model weights and computation. Options: bfloat16 (recommended for modern GPUs), float16, float32. Default: bfloat16.
- --batch_size: Batch size for inference (default: 1). Increase for faster processing if GPU memory allows.
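As an illustration of what the --thinking_model option implies for output parsing, the sketch below pulls the final answer out of a response that wraps reasoning in <think> tags and the answer in <answer> tags. `extract_answer` is a hypothetical helper. The fallback for responses without an <answer> tag is an assumption, not documented framework behavior.

```python
import re

_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
_THINK = re.compile(r"<think>.*?</think>", re.DOTALL)


def extract_answer(raw_output: str) -> str:
    """Return the text inside the first <answer> tag.

    If no <answer> tag is present, fall back to the output with any
    <think>...</think> block removed (an assumed, illustrative policy).
    """
    match = _ANSWER.search(raw_output)
    if match:
        return match.group(1).strip()
    return _THINK.sub("", raw_output).strip()
```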
Adding New Benchmarks
Adding a new benchmark involves three main steps: creating a dataset class, implementing an evaluation script, and registering the benchmark. The framework provides a unified interface that handles different model backends automatically.
Create a new file in benchmark/ (e.g., benchmark/my_benchmark.py) that defines a class inheriting from BaseDataset. You can start by copying any existing benchmark in the benchmark/ folder.
Key Methods to Implement:
Refer to benchmark/base.py for the base class interface and benchmark/vabench_point.py for a complete example.
Required methods:
- get_default_instruct(): Return backbone-specific instruction format
- load_dataset(): Load dataset from HuggingFace
- prepare_dataset(): Convert to standard format with 'question', 'image', 'metadata'
- process_raw_output(): Parse model output and compute metrics
- evaluate_results(): Evaluate all results
- compute_statistics(): Aggregate metrics
- save_results(): Save to logs/
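The shape of these methods can be sketched with a toy multiple-choice benchmark. This is a self-contained illustration, not a drop-in implementation: `ToyChoiceBenchmark` deliberately does not inherit from BaseDataset. Its constructor, row fields, and the `process_raw_output` signature are all assumptions, so check benchmark/base.py for the real interface. `save_results()` is omitted because its log format is not described here.

```python
from typing import Any, Dict, List


class ToyChoiceBenchmark:
    """Stand-in for a BaseDataset subclass (hypothetical, for illustration only)."""

    def __init__(self, backbone: str = "qwen3"):
        self.backbone = backbone

    def get_default_instruct(self) -> str:
        # Real benchmarks would vary this string per backbone.
        return "Answer with a single letter (A, B, C, or D)."

    def load_dataset(self) -> List[Dict[str, Any]]:
        # In-memory rows instead of a HuggingFace Parquet download.
        return [
            {"q": "Which object is left of the cup?", "img": None, "gt": "A"},
            {"q": "Which drawer is open?", "img": None, "gt": "C"},
        ]

    def prepare_dataset(self, raw_dataset: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        # Convert rows to the standard 'question' / 'image' / 'metadata' format.
        instruct = self.get_default_instruct()
        return [
            {
                "question": f"{row['q']}\n{instruct}",
                "image": row["img"],
                "metadata": {"answer": row["gt"]},
            }
            for row in raw_dataset
        ]

    def process_raw_output(self, raw_output: str, metadata: Dict[str, Any]) -> Dict[str, Any]:
        # Take the first character of the response as the predicted option.
        prediction = raw_output.strip().upper()[:1]
        return {"prediction": prediction, "correct": prediction == metadata["answer"]}

    def evaluate_results(self, prepared_dataset, raw_outputs) -> List[Dict[str, Any]]:
        return [
            self.process_raw_output(output, sample["metadata"])
            for sample, output in zip(prepared_dataset, raw_outputs)
        ]

    def compute_statistics(self, results: List[Dict[str, Any]]) -> Dict[str, Any]:
        total = len(results)
        correct = sum(r["correct"] for r in results)
        return {"total": total, "accuracy": correct / total if total else 0.0}
```

Feeding it two mock model outputs, for example `["A", " b"]`, exercises the full load, prepare, evaluate, and aggregate path without any model or GPU.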
Copy an existing evaluation script (e.g., eval_partafford.py) and modify:
from benchmark.my_benchmark import MyBenchmarkDataset
from core.inference import create_inference_engine
from core.logger import setup_logging


def main(args, task_name, model_name, model_path, instruct_following, dataset_name, subset, split):
    # Initialize dataset
    dataset = MyBenchmarkDataset(
        dataset_name=dataset_name,
        subset=subset,
        split=split,
        instruct_following=instruct_following,
        task_name=task_name,
        model_name=model_name,
        backbone=args.backbone,
        debug=args.debug
    )

    # Load and prepare
    raw_dataset = dataset.load_dataset()
    prepared_dataset = dataset.prepare_dataset(raw_dataset)

    # Run inference
    inference_engine, engine = create_inference_engine(args, model_path, model_name)
    raw_outputs = inference_engine.batch_inference(prepared_dataset, engine)

    # Evaluate and save
    results = dataset.evaluate_results(prepared_dataset, raw_outputs)
    statistics = dataset.compute_statistics(results)
    dataset.save_results(results, statistics)

Refer to eval_partafford.py for the complete template with argument parsing.
We welcome contributions to EmbodiedEvalKit! Here's how you can help:
See CONTRIBUTING.md for commit conventions.
- Fork the repository and create your branch from main
- Add your benchmark implementation following the structure in Adding New Benchmarks
- Ensure your code follows the existing style and passes all tests
- Update documentation if you're adding new features
- Submit a pull request with a clear description of your changes
For bug fixes, feature requests, or questions, please open an issue on our GitHub repository.
If you find this work useful, please consider citing:
@article{yuan2026embodiedr15,
title={Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models},
author={Yuan, Yifu and Huang, Yaoting and Zhang, Shuoheng and Yao, Xianze and Li, Pengyi and Han, Linqi and Liu, Yuhao and Li, Yutong and Liao, Ruihao and Wu, Qiyu and Li, Yuxiao and Zhang, Zhao and Sun, Jiangeng and Jia, Wenting and Li, Chen and Dong, Zibin and Ni, Fei and Zheng, Yan and Gu, Shuyang and Ma, Yi and Tang, Hongyao and Hu, Han and Hao, Jianye},
journal={arXiv preprint},
year={2026}
}
@article{yuan2025embodied,
title={Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation},
author={Yuan, Yifu and Cui, Haiqin and Huang, Yaoting and Chen, Yibin and Ni, Fei and Dong, Zibin and Li, Pengyi and Zheng, Yan and Hao, Jianye},
journal={ICLR 2026},
year={2025}
}
@article{yuan2025seeing,
title={From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation},
author={Yuan, Yifu and Cui, Haiqin and Chen, Yibin and Dong, Zibin and Ni, Fei and Kou, Longxin and Liu, Jinyi and Li, Pengyi and Zheng, Yan and Hao, Jianye},
journal={ICLR 2026},
year={2025}
}
This project is licensed under the MIT License - see the LICENSE file for details.
