pickxiguapi/EmbodiedEvalKit

EmbodiedEvalKit

A unified evaluation framework that simplifies embodied AI benchmarking with clean interfaces, supporting 25+ benchmarks and diverse model backends.

Features

  • 🚀 Multiple Inference Backends: vLLM (tensor parallel & multi-GPU), HuggingFace Transformers, and API — switch backends in one config.
  • 🤖 20+ Models Supported: GPT, Gemini, Qwen, InternVL, Molmo, Magma and more — generalist and embodied-specialist families out of the box.
  • 📊 20+ Embodied Benchmarks: Embodied QA, Spatial Reasoning, Embodied Pointing, Affordance and Location, and Embodied Planning — all in one place.
  • 🗂️ Standardized Dataset Format: All benchmarks reorganized into HuggingFace Parquet — unified data pipeline for reproducible evaluation.
  • 🎯 Unified Pointing Evaluation: One interface for diverse pointing formats, coordinate systems, and model-specific conventions.
  • 🧩 Modular & Decoupled Design: Models, benchmarks, and metrics cleanly separated — easy to extend, swap, or customize independently.

Supported Benchmarks

| Benchmark | HuggingFace Link | Paper Title |
| --- | --- | --- |
| ERQA | FlagEval/ERQA | ERQA: A Benchmark for Embodied Referential Question Answering |
| CV-Bench | nyu-visionx/CV-Bench | Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs |
| EmbSpatial | FlagEval/EmbSpatial-Bench | EmbSpatial: A Benchmark for Spatial Understanding in Embodied AI |
| SAT | FlagEval/SAT | SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models |
| RoboSpatial | chanhee-luke/RoboSpatial-Home | RoboSpatial: Teaching Spatial Understanding to 2D/3D VLMs |
| RoboVQA | IffYuan/RoboVQA | RoboVQA: Multimodal Long-Horizon Reasoning for Robotics |
| VABench-Point | IffYuan/VABench-P | From Seeing to Doing: Bridging Reasoning and Decision for Embodied AI |
| Where2Place | FlagEval/Where2Place | RoboPoint: A VLM for Spatial Affordance Prediction |
| RefSpatial-Bench | FlagEval/RefSpatial-Bench | RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics |
| Part-Afford-2K | IffYuan/Part-Affordance-2K | Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation |
| RoboRefit | IffYuan/RoboRefit | VL-Grasp: a 6-Dof Interactive Grasp Policy for Language-Oriented Objects in Cluttered Indoor Scenes |
| RoboAfford-Eval | Zray26/roboafford-eval | RoboAfford++: A Generative AI-Enhanced Dataset for Multimodal Affordance Learning in Robotic Manipulation and Navigation |
| VSI-Bench | IffYuan/vsi-bench | Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces |
| OpenEQA | IffYuan/open-eqa | OpenEQA: Embodied Question Answering in the Real World |
| EgoPlan2 | IffYuan/ego-plan | EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios |
| PIOBench | IffYuan/PIO-Bench | Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding |
| PointBench | IffYuan/PointBench | PointArena: Probing Multimodal Grounding Through Language-Guided Pointing |
| Pixmo Points | IffYuan/pixmo-points-eval | Molmo and PixMo: Open Weights and Open Data for State-of-the-Art VLMs |
| BLINK | BLINK-Benchmark/BLINK | BLINK: Multimodal Large Language Models Can See but Not Perceive |
| COSMOS | IffYuan/COSMOS | Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning |
| ShareRobot Trajectory | IffYuan/sharerobot_trajectory | RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete |
| VABench-Visual-Trace | IffYuan/vabench-v | From Seeing to Doing: Bridging Reasoning and Decision for Embodied AI |
| RoboFAC | MINT-SJTU/RoboFAC-dataset | RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction |
| VLABench | VLABench/vlm_evaluation_v1.0 | VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks |
| PIO-S3-Verified | IffYuan/PIO-S3 | Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding |

Note: Several benchmarks have been reformatted into HuggingFace Parquet for unified processing. All dataset copyrights and licenses belong to the original authors and publications. Please cite the original papers when using these datasets.

Supported Models

Model List

| Model | Inference Backend | Backbone |
| --- | --- | --- |
| GPT series | API | gpt |
| Gemini series | API | gemini-2.5 |
| Gemini Robotics | API | gemini_robotics |
| Qwen2.5VL series | vLLM | qwen2_5 |
| Qwen3VL series | vLLM | qwen3 |
| InternVL series | HuggingFace | internvl |
| Molmo series | HuggingFace | molmo |
| RoboBrain2.0-7B | vLLM | qwen2_5 |
| VeBrain | vLLM | qwen2_5 |
| Mimo-Embodied | HuggingFace | mimo |
| Pelican-VL 1.0 | vLLM | pelican |
| Magma-8B | HuggingFace | magma |
| Embodied-R1 | vLLM | qwen2_5 |
| Embodied-R1.5 | vLLM | qwen3 |

Note: The framework supports various model sizes (7B, 8B, 72B, etc.) within each model family via the backbone parameter.

Backend Description

  • vLLM: High-performance inference engine with tensor parallelism and batch inference, suitable for locally deployed large models
  • API: Cloud-based model services via API calls (OpenAI, Google, etc.)
  • HuggingFace: Transformers-based inference supporting a wide range of open-source models

What is Backbone?

Backbone is a key architectural abstraction in EmbodiedEvalKit that identifies the model family and handles model-specific conventions. It serves as a unified interface layer that lets diverse models run within the same evaluation framework.

Key Roles of Backbone:

  1. Inference Backend Selection: Automatically routes to the appropriate inference engine (vLLM, API, or HuggingFace) based on model type
  2. Prompt Format Adaptation: Generates model-specific instruction formats and prompt templates for each benchmark
  3. Coordinate System Handling: Manages different coordinate representations across models:
    • qwen3, gemini-2.5: 0-1000 normalized coordinates
    • qwen2_5, mimo: Absolute pixel coordinates
    • molmo: 0-100 coordinates in XML format
    • gpt, pelican, internvl, magma: 0-1 normalized coordinates

| Backbone | Format | Example |
| --- | --- | --- |
| qwen3, gemini-2.5 | 0-1000 normalized | [500, 500] |
| qwen2_5, mimo | Absolute pixels | [256, 256] |
| molmo | 0-100 in XML | <point x="50.0" y="50.0"> |
| gpt, pelican, internvl, magma | 0-1 normalized | (0.5, 0.5) |

  4. Output Parsing: Converts model-specific output formats into standardized evaluation results

This abstraction allows EmbodiedEvalKit to support models with very different architectures, APIs, and conventions through a single unified evaluation pipeline.
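To make the coordinate conventions listed above concrete, here is a minimal sketch of how a raw model answer could be parsed and mapped onto a common 0-1 range. It is not the framework's own parser: the names `parse_point` and `to_unit_coords`, the regular expressions, and the 512x512 image size used in the demo are all assumptions for illustration. Only the per-backbone scales come from the list above.

```python
import re

# Scales taken from the backbone list above. Everything else in this
# sketch (function names, regexes) is hypothetical, not the repo's API.
SCALE_BY_BACKBONE = {
    "qwen3": 1000.0,
    "gemini-2.5": 1000.0,
    "molmo": 100.0,
    "gpt": 1.0,
    "pelican": 1.0,
    "internvl": 1.0,
    "magma": 1.0,
}
PIXEL_BACKBONES = {"qwen2_5", "mimo"}

# Matches <point x="50.0" y="50.0"> style output.
_MOLMO_POINT = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"')
# Matches [500, 500] or (0.5, 0.5) style output.
_NUMBER_PAIR = re.compile(
    r"[\[(]\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*[\])]"
)


def parse_point(text, backbone):
    """Return the first (x, y) pair found in text, or None if there is none."""
    pattern = _MOLMO_POINT if backbone == "molmo" else _NUMBER_PAIR
    match = pattern.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def to_unit_coords(point, backbone, image_size):
    """Map a backbone-specific point to 0-1 coordinates.

    image_size is (width, height) in pixels and is only needed for
    backbones that answer in absolute pixels.
    """
    x, y = point
    if backbone in PIXEL_BACKBONES:
        width, height = image_size
        return x / width, y / height
    scale = SCALE_BY_BACKBONE[backbone]
    return x / scale, y / scale


if __name__ == "__main__":
    size = (512, 512)  # assumed image size for the pixel-coordinate example
    samples = [
        ("qwen3", "[500, 500]"),
        ("qwen2_5", "[256, 256]"),
        ("molmo", '<point x="50.0" y="50.0">'),
        ("gpt", "(0.5, 0.5)"),
    ]
    for backbone, raw in samples:
        print(backbone, to_unit_coords(parse_point(raw, backbone), backbone, size))
```

Under the assumed 512x512 image, all four example answers describe the image center. This is why a shared representation makes pointing metrics comparable across model families.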

Quick Start

1. Installation (vLLM + Qwen3-VL as an Example)

# Install all dependencies
pip install -r requirements.txt

2. Setup API Keys

Create key files in the project root (one-time setup):

echo "your_hf_token" > .hf_token
echo "your_api_key" > .api_key

The evaluation scripts read these files automatically, so you do not need to pass the keys as arguments.
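For reference, a key file written with `echo` ends in a newline, so whatever reads it has to strip whitespace. The sketch below shows one way to do that. The helper name `read_key_file` is hypothetical and not taken from the repository; only the file names `.hf_token` and `.api_key` come from the step above.

```python
from pathlib import Path
from typing import Optional


def read_key_file(path: Path) -> Optional[str]:
    """Return the stripped contents of a key file, or None if missing or empty."""
    if not path.is_file():
        return None
    key = path.read_text(encoding="utf-8").strip()
    return key or None


if __name__ == "__main__":
    root = Path(".")  # project root, where the key files were created
    hf_token = read_key_file(root / ".hf_token")
    api_key = read_key_file(root / ".api_key")
    # Report presence only; never print the secrets themselves.
    print("HF token found:", hf_token is not None)
    print("API key found:", api_key is not None)
```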

Evaluation

Evaluate Embodied-R1.5 via vLLM

If you have 2 GPUs, run:

bash scripts/eval_embodied_r1.5.sh

If you have 8 GPUs, run:

bash scripts/eval_er1.5_parallel.sh

Evaluate a single benchmark using vLLM

# Embodied-R1.5
python eval_erqa.py \
  --model_name Embodied-R1.5 \
  --model_path /path/to/Embodied-R1.5 \
  --backbone qwen3 \
  --max_model_len 20000 \
  --gpu_memory_utilization 0.8 \
  --tensor_parallel_size 2

# Qwen3-VL-8B-Instruct
python eval_erqa.py \
  --model_name Qwen3-VL-8B-Instruct \
  --model_path Qwen/Qwen3-VL-8B-Instruct \
  --backbone qwen3 \
  --max_model_len 20000 \
  --gpu_memory_utilization 0.8 \
  --tensor_parallel_size 2

Using the HuggingFace Backend

# InternVL3_5-8B
python eval_erqa.py \
  --model_name InternVL3_5-8B \
  --model_path OpenGVLab/InternVL3_5-8B \
  --backbone internvl \
  --max_model_len 20000

Using the API Backend

# gpt-5-mini
# Read the key created during setup, stripping newlines and spaces
export API_KEY="$(tr -d '\n\r ' < ./.api_key)"
BASE_URL=https://api.gpt.ge/v1

python eval_erqa.py \
  --model_name gpt-5-mini \
  --model_path gpt-5-mini \
  --backbone gpt \
  --max_model_len 20000 \
  --base_url "${BASE_URL}"

See the scripts/ folder for more examples.

Key Parameters

General Parameters

  • --model_name: Model name for logging and result tracking (e.g., "Qwen3-VL-8B-Instruct")
  • --model_path: Path to local model checkpoint or API model identifier (e.g., "/path/to/model" or "gpt-4o")
  • --backbone: Model backbone type that determines coordinate format and prompt style. Options: qwen3, qwen2_5, gpt, gemini-2.5, gemini_robotics, molmo, mimo, magma, internvl, pelican. If not specified, it is auto-detected from model_name.
  • --instruct_following: Custom instruction prompt to override default format instructions. Use this to test different prompting strategies.
  • --thinking_model: Enable thinking model mode for models that output reasoning in <think> tags and answers in <answer> tags (e.g., o1, DeepSeek-R1)
  • --debug: Debug mode that processes only the first 20 samples for quick testing
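As an illustration of what `--thinking_model` implies for output handling, the sketch below separates the reasoning from the final answer. The function name `extract_answer` and the fallback behavior when no `<answer>` tag is present are assumptions; the framework's own parsing may differ.

```python
import re

_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
_THINK = re.compile(r"<think>.*?</think>", re.DOTALL)


def extract_answer(raw_output: str) -> str:
    """Return the text inside <answer> tags.

    If the model omitted the tags, fall back to the output with any
    <think> block removed (an assumed policy, chosen for this sketch).
    """
    match = _ANSWER.search(raw_output)
    if match:
        return match.group(1).strip()
    return _THINK.sub("", raw_output).strip()


if __name__ == "__main__":
    sample = "<think>The cup is left of the plate.</think><answer>B</answer>"
    print(extract_answer(sample))
```

Scoring only the extracted answer keeps option letters or coordinates mentioned inside the reasoning from being mistaken for the prediction.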

vLLM Backend Parameters

  • --tensor_parallel_size: Number of GPUs for tensor parallelism (default: 2). Use this to distribute the model across multiple GPUs for larger models.
  • --max_model_len: Maximum context length in tokens (default: 10240). Adjust based on your GPU memory and benchmark requirements.
  • --gpu_memory_utilization: GPU memory utilization ratio from 0.0 to 1.0 (default: 0.8). Lower values leave more memory for other processes.
  • --max_images_per_prompt: Maximum number of images per prompt (default: 16). Increase for multi-image benchmarks.
  • --max_videos_per_prompt: Maximum number of videos per prompt (default: 1). Increase for video-based benchmarks.
  • --seed: Random seed for reproducibility (default: 3407)
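The flags above could be declared and turned into engine settings as in the following sketch. The defaults are the ones documented in this list. The function names `build_parser` and `to_vllm_kwargs` are hypothetical, and whether the framework constructs its engine this way is an assumption. The dictionary keys follow the keyword arguments that `vllm.LLM` accepts to the best of my knowledge, so check them against your installed vLLM version.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the vLLM flags with the defaults documented above."""
    parser = argparse.ArgumentParser(description="vLLM backend options (sketch)")
    parser.add_argument("--tensor_parallel_size", type=int, default=2)
    parser.add_argument("--max_model_len", type=int, default=10240)
    parser.add_argument("--gpu_memory_utilization", type=float, default=0.8)
    parser.add_argument("--max_images_per_prompt", type=int, default=16)
    parser.add_argument("--max_videos_per_prompt", type=int, default=1)
    parser.add_argument("--seed", type=int, default=3407)
    return parser


def to_vllm_kwargs(args: argparse.Namespace, model_path: str) -> dict:
    """Collect engine settings in one dict (assumed mapping, not the repo's)."""
    if not 0.0 < args.gpu_memory_utilization <= 1.0:
        raise ValueError("--gpu_memory_utilization must be in (0.0, 1.0]")
    return {
        "model": model_path,
        "tensor_parallel_size": args.tensor_parallel_size,
        "max_model_len": args.max_model_len,
        "gpu_memory_utilization": args.gpu_memory_utilization,
        "seed": args.seed,
        # Per-prompt caps on multimodal inputs.
        "limit_mm_per_prompt": {
            "image": args.max_images_per_prompt,
            "video": args.max_videos_per_prompt,
        },
    }


if __name__ == "__main__":
    cli_args = build_parser().parse_args(["--tensor_parallel_size", "4"])
    print(to_vllm_kwargs(cli_args, "Qwen/Qwen3-VL-8B-Instruct"))
```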

API Backend Parameters

  • --max_concurrent_requests: Maximum number of concurrent API requests (default: 100). Adjust based on API rate limits.
  • --base_url: Custom API base URL for self-hosted or alternative API endpoints. Leave empty to use default endpoints.
  • --max_tokens: Maximum tokens to generate (default: 4096)
  • --temperature: Sampling temperature from 0.0 to 2.0 (default: 0.7). Lower values make output more deterministic.
  • --top_p: Nucleus sampling parameter from 0.0 to 1.0 (default: 0.8)
  • --top_k: Top-k sampling parameter (default: 20)
  • --repetition_penalty: Penalty for repeating tokens (default: 1.05). Values > 1.0 discourage repetition.
  • --presence_penalty: Penalty for token presence (default: 0.0). Used by some API providers like OpenAI.
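To show what a cap like `--max_concurrent_requests` does in practice, here is a minimal sketch that bounds in-flight requests with `asyncio.Semaphore`. The names `run_bounded` and `fake_api_call` are hypothetical, and the fake call stands in for a real API client. How the framework itself enforces the limit is not shown in this README, so treat this as one possible approach.

```python
import asyncio


async def fake_api_call(prompt: str) -> str:
    """Stand-in for a real API request (assumption for this sketch)."""
    await asyncio.sleep(0.01)
    return f"echo: {prompt}"


async def run_bounded(prompts, max_concurrent_requests=100, call=fake_api_call):
    """Run call() on every prompt with at most N requests in flight.

    Results come back in the same order as prompts.
    """
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def guarded(prompt):
        # Wait for a free slot before sending the request.
        async with semaphore:
            return await call(prompt)

    return await asyncio.gather(*(guarded(p) for p in prompts))


if __name__ == "__main__":
    questions = [f"q{i}" for i in range(10)]
    outputs = asyncio.run(run_bounded(questions, max_concurrent_requests=3))
    print(outputs[:2])
```

If a provider returns rate-limit errors, lowering the cap is usually the first thing to try.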

HuggingFace Backend Parameters

  • --device: Device type for inference. Options: cuda (GPU), cpu. Default is cuda if available.
  • --dtype: Data type for model weights and computation. Options: bfloat16 (recommended for modern GPUs), float16, float32. Default: bfloat16.
  • --batch_size: Batch size for inference (default: 1). Increase for faster processing if GPU memory allows.

Adding New Benchmarks

Adding a new benchmark involves three main steps: creating a dataset class, implementing an evaluation script, and registering the benchmark. The framework provides a unified interface that handles different model backends automatically.

1. Create Dataset Class

Create a new file in benchmark/ (e.g., benchmark/my_benchmark.py) with a class that inherits from BaseDataset. You can start by copying any existing benchmark in the benchmark/ folder.

Key Methods to Implement:

Refer to benchmark/base.py for the base class interface and benchmark/vabench_point.py for a complete example.

Required methods:

  • get_default_instruct(): Return backbone-specific instruction format
  • load_dataset(): Load dataset from HuggingFace
  • prepare_dataset(): Convert to standard format with 'question', 'image', 'metadata'
  • process_raw_output(): Parse model output and compute metrics
  • evaluate_results(): Evaluate all results
  • compute_statistics(): Aggregate metrics
  • save_results(): Save to logs/
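The required methods can be sketched as a toy multiple-choice benchmark, shown below. This is only an outline under stated assumptions: `BaseDatasetStub` stands in for the real `BaseDataset` so the snippet runs by itself, the `process_raw_output` signature and the result fields are guesses, and `load_dataset` returns two placeholder rows instead of downloading from HuggingFace. Consult benchmark/base.py for the actual interface.

```python
import json
from pathlib import Path


class BaseDatasetStub:
    """Stand-in for benchmark.base.BaseDataset (assumption for this sketch)."""

    def __init__(self, backbone="qwen3", debug=False):
        self.backbone = backbone
        self.debug = debug


class MyBenchmarkDataset(BaseDatasetStub):
    def get_default_instruct(self):
        # A real benchmark would vary this per backbone.
        return "Answer with the option letter only."

    def load_dataset(self):
        # Placeholder rows; a real implementation loads from HuggingFace.
        return [
            {"q": "Which object is closer? A. cup B. plate", "img": "0.png", "gt": "A"},
            {"q": "Which object is taller? A. box B. can C. jar", "img": "1.png", "gt": "C"},
        ]

    def prepare_dataset(self, raw_dataset):
        rows = raw_dataset[:20] if self.debug else raw_dataset
        return [
            {
                "question": f"{row['q']}\n{self.get_default_instruct()}",
                "image": row["img"],
                "metadata": {"answer": row["gt"]},
            }
            for row in rows
        ]

    def process_raw_output(self, sample, raw_output):
        # Take the first character of the reply as the predicted option.
        prediction = raw_output.strip().upper()[:1]
        return {
            "prediction": prediction,
            "correct": prediction == sample["metadata"]["answer"],
        }

    def evaluate_results(self, prepared_dataset, raw_outputs):
        return [
            self.process_raw_output(sample, output)
            for sample, output in zip(prepared_dataset, raw_outputs)
        ]

    def compute_statistics(self, results):
        total = len(results)
        correct = sum(r["correct"] for r in results)
        return {"total": total, "accuracy": correct / total if total else 0.0}

    def save_results(self, results, statistics, log_dir="logs"):
        out_dir = Path(log_dir)
        out_dir.mkdir(parents=True, exist_ok=True)
        path = out_dir / "my_benchmark.json"
        payload = {"statistics": statistics, "results": results}
        path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
        return path


if __name__ == "__main__":
    dataset = MyBenchmarkDataset()
    prepared = dataset.prepare_dataset(dataset.load_dataset())
    results = dataset.evaluate_results(prepared, ["A", " b "])
    print(dataset.compute_statistics(results))
```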

2. Create Evaluation Script

Copy an existing evaluation script (e.g., eval_partafford.py) and modify:

from benchmark.my_benchmark import MyBenchmarkDataset
from core.inference import create_inference_engine
from core.logger import setup_logging

def main(args, task_name, model_name, model_path, instruct_following, dataset_name, subset, split):
    # Initialize dataset
    dataset = MyBenchmarkDataset(
        dataset_name=dataset_name,
        subset=subset,
        split=split,
        instruct_following=instruct_following,
        task_name=task_name,
        model_name=model_name,
        backbone=args.backbone,
        debug=args.debug
    )

    # Load and prepare
    raw_dataset = dataset.load_dataset()
    prepared_dataset = dataset.prepare_dataset(raw_dataset)

    # Run inference
    inference_engine, engine = create_inference_engine(args, model_path, model_name)
    raw_outputs = inference_engine.batch_inference(prepared_dataset, engine)

    # Evaluate and save
    results = dataset.evaluate_results(prepared_dataset, raw_outputs)
    statistics = dataset.compute_statistics(results)
    dataset.save_results(results, statistics)

Refer to eval_partafford.py for the complete template with argument parsing.

Contributing

We welcome contributions to EmbodiedEvalKit! See CONTRIBUTING.md for commit conventions, and follow the steps below when opening a pull request.

Pull Requests

  1. Fork the repository and create your branch from main
  2. Add your benchmark implementation following the structure in Adding New Benchmarks
  3. Ensure your code follows the existing style and passes all tests
  4. Update documentation if you're adding new features
  5. Submit a pull request with a clear description of your changes

For bug fixes, feature requests, or questions, please open an issue on our GitHub repository.

Citation

If you find this work useful, please consider citing:

@article{yuan2026embodiedr15,
  title={Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models},
  author={Yuan, Yifu and Huang, Yaoting and Zhang, Shuoheng and Yao, Xianze and Li, Pengyi and Han, Linqi and Liu, Yuhao and Li, Yutong and Liao, Ruihao and Wu, Qiyu and Li, Yuxiao and Zhang, Zhao and Sun, Jiangeng and Jia, Wenting and Li, Chen and Dong, Zibin and Ni, Fei and Zheng, Yan and Gu, Shuyang and Ma, Yi and Tang, Hongyao and Hu, Han and Hao, Jianye},
  journal={arXiv preprint},
  year={2026}
}

@article{yuan2025embodied,
  title={Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation},
  author={Yuan, Yifu and Cui, Haiqin and Huang, Yaoting and Chen, Yibin and Ni, Fei and Dong, Zibin and Li, Pengyi and Zheng, Yan and Hao, Jianye},
  journal={ICLR 2026},
  year={2025}
}

@article{yuan2025seeing,
  title={From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation},
  author={Yuan, Yifu and Cui, Haiqin and Chen, Yibin and Dong, Zibin and Ni, Fei and Kou, Longxin and Liu, Jinyi and Li, Pengyi and Zheng, Yan and Hao, Jianye},
  journal={ICLR 2026},
  year={2025}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
