
Vision–Language–Action (VLA) Models in Robotics

This repository was developed alongside the paper Vision Language Action Models in Robotic Manipulation: A Systematic Review and provides a living catalog of:

  • Dataset Benchmarking Code
    Code to benchmark the datasets by task complexity and modality richness.

  • VLA Models
    Key vision–language–action models covered in the review, with links to the original papers.

  • Datasets
    Major benchmarks and large‑scale collections used to train and evaluate VLA systems, including QA/navigation datasets, manipulation demonstrations, and multimodal embodiment data.

  • Simulators
    Widely adopted simulation platforms for generating VLA data—spanning photorealistic navigation, dexterous manipulation, multi‑robot coordination, and more—each linked to its official website.

We aim to keep this list up to date as new VLA models, datasets, and simulation tools emerge. Contributions and pull requests adding recently published work or tooling are most welcome!


Reference

@article{UDDIN2026104062,
  title   = {Multimodal fusion with vision-language-action models for robotic manipulation: A systematic review},
  author  = {Muhayy {Ud Din} and Waseem Akram and Lyes {Saad Saoud} and Jan Rosell and Irfan Hussain},
  journal = {Information Fusion},
  volume  = {129},
  pages   = {104062},
  year    = {2026},
  issn    = {1566-2535},
  doi     = {10.1016/j.inffus.2025.104062},
}

Table of Contents

  • Dataset Benchmarking Code
  • VLA Models
  • Datasets
  • Simulators

Dataset Benchmarking Code

Benchmarking VLA datasets by task complexity and modality richness. Each bubble represents a VLA dataset, positioned according to its normalized task-complexity score (x-axis) and its modality-richness score (y-axis). The bubble area is proportional to dataset scale, i.e., the number of annotated episodes or interactions.

Code
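The scoring scheme described above can be sketched in a few lines: min–max normalize the complexity and richness scores, and scale bubble radius with the square root of episode count so that bubble *area* is proportional to dataset scale. This is a minimal illustration with hypothetical dataset names and scores, not the repository's actual benchmarking code:

```python
import math

# Hypothetical datasets: (name, task-complexity score, modality-richness score, episodes)
datasets = [
    ("DatasetA", 3.0, 2.0, 10_000),
    ("DatasetB", 7.5, 5.0, 130_000),
    ("DatasetC", 5.0, 8.0, 1_000_000),
]

def normalize(values):
    """Min-max normalize a list of raw scores to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

x = normalize([d[1] for d in datasets])   # task complexity -> x-axis
y = normalize([d[2] for d in datasets])   # modality richness -> y-axis

# Area proportional to episode count means radius grows as sqrt(episodes).
radii = [math.sqrt(d[3]) for d in datasets]

for (name, *_), xi, yi, r in zip(datasets, x, y, radii):
    print(f"{name}: x={xi:.2f}, y={yi:.2f}, radius={r:.1f}")
```

Feeding `x`, `y`, and the scaled radii to any bubble-chart plotter (e.g. a matplotlib scatter with `s` set to the areas) reproduces the layout described above.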

VLA Models

VLA Models Trend. The top row presents major VLA models introduced each year, alongside their associated institutions. The bottom row displays key datasets used to train and evaluate VLA models, grouped by release year. The figure highlights the increasing scale and diversity of datasets and institutional involvement, with contributions from academic labs (e.g., CMU, CNRS, UC, Peking University) and industrial labs (e.g., Google, NVIDIA, Microsoft), and illustrates the rapid pace of VLA research.

Below is the list of the VLA models reviewed in the paper:

[2022]CLIPort: What and where pathways for robotic manipulation
[2022]RT-1: Robotics transformer for real‑world control at scale
[2022]A Generalist Agent
[2022]VIMA: General Robot Manipulation with Multimodal Prompts
[2022]Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation
[2022]Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
[2023]RoboAgent: Generalist Robot Agent with Semantic and Temporal Understanding
[2023]Robotic Task Generalization via Hindsight Trajectory Sketches
[2023]Learning fine‑grained bimanual manipulation with low‑cost hardware
[2023]RT-2: Vision‑language‑action models transfer web knowledge to robotic control
[2023]VoxPoser: Composable 3D value maps for robotic manipulation with language models
[2024]CLIP‑RT: Learning Language‑Conditioned Robotic Policies with Natural Language Supervision
[2023]Diffusion Policy: Visuomotor policy learning via action diffusion
[2024]Octo: An open‑source generalist robot policy
[2024]Towards testing and evaluating vision‑language manipulation: An empirical study
[2024]NaVILA: Legged robot vision‑language‑action model for navigation
[2024]RoboNurse‑VLA: Real‑time voice‑to‑action pipeline for surgical instrument handover
[2024]Mobility VLA: Multimodal instruction navigation with topological mapping
[2024]ReVLA: Domain adaptation adapters for robotic foundation models
[2024]Uni‑NaVid: Video‑based VLA unifying embodied navigation tasks
[2024]RDT‑1B: 1.2B‑parameter diffusion foundation model for manipulation
[2024]RoboMamba: Mamba‑based unified VLA with linear‑time inference
[2024]Chain‑of‑Affordance: Sequential affordance reasoning for spatial planning
[2024]Edge VLA: Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities
[2024]OpenVLA: LoRA‑fine‑tuned open‑source VLA with high‑success transfer
[2024]CogACT: Componentized diffusion action transformer for VLA
[2024]ShowUI‑2B: GUI/web navigation via screenshot grounding and token selection
[2024]HiRT: Hierarchical planning/control separation for VLA
[2024]Pi‑0: General robot control flow model for open‑world tasks
[2024]A3VLM: Articulation‑aware affordance grounding from RGB video
[2024]SVLR: Modular “segment‑to‑action” pipeline using visual prompt retrieval
[2024]Bi‑VLA: Dual‑arm instruction‑to‑action planner for recipe demonstrations
[2024]QUAR‑VLA: Quadruped‑specific VLA with adaptive gait mapping
[2024]3D‑VLA: Integrating 3D generative diffusion heads for world reconstruction
[2024]RoboMM: MIM‑based multimodal decoder unifying 3D perception and language
[2025]FAST: Frequency‑space action tokenization for faster inference
[2025]OpenVLA‑OFT: Optimized fine‑tuning of OpenVLA with parallel decoding
[2025]CoVLA: Autonomous driving VLA trained on annotated scene data
[2025]ORION: Holistic end‑to‑end driving VLA with semantic trajectory control
[2025]UAV‑VLA: Zero‑shot aerial mission VLA combining satellite/UAV imagery
[2025]Combat VLA: Ultra‑fast tactical reasoning in 3D environments
[2025]HybridVLA: Ensemble decoding combining diffusion and autoregressive policies
[2025]NORA: Low‑overhead VLA with integrated visual reasoning and FAST decoding
[2025]SpatialVLA: 3D spatial encoding and adaptive action discretization
[2025]MoLe‑VLA: Selective layer activation for faster inference
[2025]JARVIS‑VLA: Open‑world instruction following in 3D games with keyboard/mouse
[2025]UP‑VLA: Unified understanding and prediction model for embodied agents
[2025]Shake‑VLA: Modular bimanual VLA for cocktail‑mixing tasks
[2025]MORE: Scalable mixture‑of‑experts RL for VLA models
[2025]DexGraspVLA: Diffusion‑based dexterous grasping framework
[2025]DexVLA: Cross‑embodiment diffusion expert for rapid adaptation
[2025]Humanoid‑VLA: Hierarchical full‑body humanoid control VLA
[2025]ObjectVLA: End‑to‑end open‑world object manipulation
[2025]Gemini Robotics: Bringing AI into the Physical World
[2025]ECoT: Robotic Control via Embodied Chain‑of‑Thought Reasoning
[2025]OTTER: A Vision‑Language‑Action Model with Text‑Aware Visual Feature Extraction
[2025]π‑0.5: A VLA Model with Open‑World Generalization
[2025]OneTwoVLA: A Unified Model with Adaptive Reasoning
[2025]Helix: A Vision-Language-Action Model for Generalist Humanoid Control
[2025]SmolVLA: A Vision‑Language‑Action Model for Affordable and Efficient Robotics
[2025]EF‑VLA: Vision‑Language‑Action Early Fusion with Causal Transformers
[2025]PD‑VLA: Accelerating vision‑language‑action inference via parallel decoding
[2025]LeVERB: Humanoid Whole‑Body Control via Latent Verb Generation
[2025]TLA: Tactile‑Language‑Action Model for High‑Precision Contact Tasks
[2025]Interleave‑VLA: Enhancing VLM‑LLM interleaved instruction processing
[2025]iRe‑VLA: Iterative reinforcement and supervised fine‑tuning for robust VLA
[2025]TraceVLA: Visual trace prompting for spatio‑temporal manipulation cues
[2025]OpenDrive VLA: End‑to‑End Driving with Semantic Scene Alignment
[2025]V‑JEPA 2: Dual‑Stream Video JEPA for Predictive Robotic Planning
[2025]Knowledge Insulating VLA: Insulation Layers for Modular VLA Training
[2025]GR00T N1: Diffusion Foundation Model for Humanoid Control
[2025]AgiBot World Colosseo: Unified Embodied Dataset Platform
[2025]Hi Robot: Hierarchical Planning and Control for Complex Environments
[2025]EnerVerse: World‑Model LLM for Long‑Horizon Manipulation
[2024]FLaRe: Large-Scale RL Fine-Tuning for Adaptive Robotic Policies
[2025]Beyond Sight: Sensor Fusion via Language-Grounded Attention
[2025]GeoManip: Geometric Constraint Encoding for Robust Manipulation
[2025]Universal Actions: Standardizing Action Dictionaries for Transfer
[2025]RoboHorizon: Multi-View Environment Modeling with LLM Planning
[2025]SAM2Act: Segmentation‑Augmented Memory for Object‑Centric Manipulation
[2025]VLA‑Cache: Token Caching for Efficient VLA Inference
[2025]Forethought VLA: Latent Alignment for Foresight‑Driven Policies
[2024]GRAPE: Preference‑Guided Policy Adaptation via Feedback
[2025]HAMSTER: Hierarchical Skill Decomposition for Multi‑Step Manipulation
[2025]TempoRep VLA: Successor Representation for Temporal Planning
[2025]ConRFT: Consistency Regularized Fine‑Tuning with Reinforcement
[2025]RoboBERT: Unified Multimodal Transformer for Manipulation
[2024]Diffusion Transformer Policy: Robust Multimodal Action Sampling
[2025]GEVRM: Generative Video Modeling for Goal‑Oriented Planning
[2025]SoFar: Successor‑Feature Orientation Representations
[2025]ARM4R: Auto‑Regressive 4D Transition Modeling for Trajectories
[2025]Magma: Foundation Multimodal Agent Model for Control
[2025]An Atomic Skill Library: Modular Skill Composition for Robotics
[2025]RoboBrain: Knowledge‑Grounded Policy Brain for Multimodal Tasks
[2025]SafeVLA: Safety‑Aware Vision‑Language‑Action Policies
[2025]CognitiveDrone: Embodied Reasoning VLA for UAV Planning
[2025]VLAS: Voice‑Driven Vision‑Language‑Action Control
[2025]ChatVLA: Conversational VLA for Interactive Control
[2024]Diffusion‑VLA: Diffusion‑Based Policy for Generalizable Manipulation
[2025]RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics
[2025]Cross-Platform Scaling of Vision-Language-Action Models from Edge to Cloud GPUs
[2025]VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting

Datasets

[2018]EmbodiedQA: Embodied Question Answering
[2018]R2R: Vision‑and‑Language Navigation: Interpreting Visually‑Grounded Navigation Instructions in Real Environments
[2020]ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
[2020]RLBench: The Robot Learning Benchmark & Learning Environment
[2019]Vision‑and‑Dialog Navigation
[2021]TEACh: Task‑driven Embodied Agents that Chat
[2022]DialFRED: Dialogue‑Enabled Agents for Embodied Instruction Following
[2022]Ego4D: Around the World in 3,000 Hours of Egocentric Video
[2022]CALVIN: A Benchmark for Language‑Conditioned Long‑Horizon Robot Manipulation Tasks
[2024]DROID: A Large‑Scale In‑The‑Wild Robot Manipulation Dataset
[2025]Open X-Embodiment: Robotic Learning Datasets and RT‑X Models
[2025]RoboSpatial: Teaching Spatial Understanding via Vision‑Language Models for Robotics
[2024]CoVLA: Comprehensive Vision‑Language‑Action Dataset for Autonomous Driving
[2025]TLA: Tactile‑Language‑Action Model for Contact‑Rich Manipulation
[2023]BridgeData V2: A Dataset for Robot Learning at Scale
[2023]LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
[2025]Kaiwu: A Multimodal Manipulation Dataset and Framework for Robotic Perception and Interaction
[2025]PLAICraft: Large‑Scale Time‑Aligned Vision‑Speech‑Action Dataset for Embodied AI
[2025]AgiBot World Colosseo: A Large‑Scale Manipulation Dataset for Intelligent Embodied Systems
[2023]Robo360: A 3D Omnispective Multi‑Modal Robotic Manipulation Dataset
[2025]REASSEMBLE: A Multimodal Dataset for Contact‑Rich Robotic Assembly and Disassembly
[2025]RoboCerebra: A Large‑Scale Benchmark for Long‑Horizon Robotic Manipulation Evaluation
[2025]IRef‑VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes
[2025]Interleave‑VLA: Enhancing Robot Manipulation with Interleaved Image‑Text Instructions
[2024]RoboMM: All‑in‑One Multimodal Large Model for Robotic Manipulation
[2024]All Robots in One: A New Standard and Unified Dataset for Versatile, General‑Purpose Embodied Agents
[2025]RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

Simulators

[2017]AI2-THOR
[2019]Habitat
[2020]NVIDIA Isaac Sim
[2004]Gazebo
[2016]PyBullet
[2013]CoppeliaSim
[2004]Webots
[2018]Unity ML‑Agents
[2012]MuJoCo
[2020]iGibson
[2023]UniSim
[2020]SAPIEN
