Contact-rich manipulation requires reliable estimation of extrinsic contacts—the interactions between a grasped object and its environment—which provide essential contextual information for planning, control, and policy learning. However, existing approaches often rely on restrictive assumptions, such as predefined contact types, fixed grasp configurations, or camera calibration, that hinder generalization to novel objects and deployment in unstructured environments.
UNIC is a unified multimodal framework for extrinsic contact estimation that operates without any prior knowledge or camera calibration. UNIC directly encodes visual observations in the camera frame and integrates them with proprioceptive and tactile modalities in a fully data-driven manner. It introduces a unified contact representation based on scene affordance maps that captures diverse contact formations and employs a multimodal fusion mechanism with random masking, enabling robust multimodal representation learning.
UNIC handles multiple contact scenarios including single-object contact, multi-object interactions, and no-contact states.
UNIC performs reliable estimation during robot motion and in-hand object slip.
UNIC adapts to dynamic camera viewpoints without requiring recalibration.
UNIC generalizes to diverse object configurations and contact locations.
UNIC demonstrates strong generalization to objects not seen during training.
UNIC integrates four sensing modalities:
- 🔵 Point clouds - 3D information from RGB-D camera
- 🟠 Tactile signals - Marker displacement maps from GelSight sensors
- 🟢 Force-torque - 6D wrench from wrist-mounted sensor
- 🟡 Proprioception - End-effector rotation
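
As a rough illustration of how one time step of these four modalities might be bundled, the sketch below defines a hypothetical `Observation` container in Python. The class name, field names, and data layouts are assumptions for illustration and are not taken from the UNIC codebase; the rotation representation in particular is left unspecified.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Sequence


@dataclass
class Observation:
    """One time step of sensor data. All field layouts here are assumptions."""

    # N points, each (x, y, z), expressed in the camera frame
    point_cloud: Optional[List[Sequence[float]]] = None
    # Marker displacements from the tactile sensor, each (dx, dy)
    tactile: Optional[List[Sequence[float]]] = None
    # 6D wrench from the wrist sensor: (fx, fy, fz, tx, ty, tz)
    wrench: Optional[Sequence[float]] = None
    # End-effector rotation; the parameterization is not specified here
    ee_rotation: Optional[Sequence[float]] = None

    def available(self) -> List[str]:
        """Names of the modalities present in this sample, in field order."""
        return [f.name for f in fields(self) if getattr(self, f.name) is not None]

    def validate(self) -> None:
        """Check the few shape constraints that follow from the descriptions."""
        if self.wrench is not None and len(self.wrench) != 6:
            raise ValueError("wrench must have 6 components (3 forces, 3 torques)")
        if self.point_cloud is not None and any(len(p) != 3 for p in self.point_cloud):
            raise ValueError("each point must be (x, y, z)")
```

Leaving every field optional mirrors the idea that a modality may be absent at deployment; `available()` reports which ones a given sample actually carries.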
- Prior-free Contact Affordance Representation
  - Unified representation based on scene affordance maps
  - Captures diverse contact types: point, line, patch
  - Models complex contact chains (gripper–object–object–environment)
  - No camera calibration or object geometry required
- Masked Multimodal Fusion
  - Random masking during training
  - Learns robust cross-modal representations
  - Enables reliable estimation even with missing modalities at deployment
  - Flexible sensor configuration without retraining
- Efficient Sampling Strategy
  - Decouples global multimodal fusion from point-wise affordance generation
  - Lightweight point-wise computation
  - Supports real-time inference (>600 Hz)
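
The masking and sampling ideas above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the UNIC implementation: the function names, the choice to zero out whole modality embeddings, the rule that at least one modality is always kept, the concatenation used as a stand-in for the fusion network, and the linear point-wise head are all hypothetical.

```python
import math
import random


def mask_modalities(features, mask_prob, rng):
    """Randomly zero out whole modality embeddings (training-time augmentation).

    features: dict of modality name -> list of floats (hypothetical embeddings).
    Assumption: at least one modality is always kept.
    """
    dropped = [name for name in features if rng.random() < mask_prob]
    if len(dropped) == len(features):
        dropped.remove(rng.choice(dropped))  # keep one modality at random
    masked = {
        name: [0.0] * len(vec) if name in dropped else list(vec)
        for name, vec in features.items()
    }
    return masked, set(dropped)


def fuse_global(features):
    """Stand-in for a fusion network: concatenate embeddings in sorted-name order."""
    return [x for name in sorted(features) for x in features[name]]


def point_affordance(point, global_feat, w_point, w_global, bias=0.0):
    """Lightweight per-point head: linear score on (point, global) plus a sigmoid."""
    score = bias
    score += sum(w * x for w, x in zip(w_point, point))
    score += sum(w * x for w, x in zip(w_global, global_feat))
    return 1.0 / (1.0 + math.exp(-score))


def affordance_map(points, features, w_point, w_global):
    """Fuse the modalities once, then score every query point against the result."""
    global_feat = fuse_global(features)  # computed once per frame
    return [point_affordance(p, global_feat, w_point, w_global) for p in points]
```

The global vector is built once per frame and only the small per-point function runs for each query point, which is the kind of decoupling that keeps the per-point cost low. At training time, `mask_modalities` would be applied to the embeddings before fusion so that the model sees inputs with modalities missing.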
- Install Miniforge (recommended). Miniforge is the conda-forge–recommended installer and includes `mamba` out of the box.
- Create the conda environment:

      mamba env create -f conda_env.yaml
      conda activate unic

- Install third-party dependencies:

      bash third_party.sh

For all dataset merge and usage instructions, see dataset_readme.md.
The training dataset is a Zarr archive (~98 GB unzipped). For distribution it is split into two balanced zip parts, each ~48.9 GB, hosted on Zenodo:
- Part 1: split_training_part1.zip (DOI 10.5281/zenodo.20127326)
- Part 2: split_training_part2.zip (DOI 10.5281/zenodo.20287722)
The dataset is released under CC-BY-SA-4.0.
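
Because the download is large, it can help to check free disk space before fetching and extracting the parts. The sketch below does that arithmetic with the Python standard library. The helper names, the safety margin, and the assumption that the sizes are decimal gigabytes are illustrative; the actual merge procedure is the one described in dataset_readme.md and is not reproduced here.

```python
import shutil

ZIP_PART_GB = 48.9   # approximate size of each zip part, as stated above
UNZIPPED_GB = 98.0   # approximate size of the extracted Zarr archive


def required_gb(keep_zips=True):
    """Rough space needed: the extracted archive, plus both zips if they are kept."""
    return UNZIPPED_GB + (2 * ZIP_PART_GB if keep_zips else 0.0)


def has_enough_space(path=".", keep_zips=True, margin_gb=5.0):
    """Return True if the filesystem holding `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9  # bytes -> decimal GB (assumed unit)
    return free_gb >= required_gb(keep_zips) + margin_gb


if __name__ == "__main__":
    print(f"Need roughly {required_gb():.1f} GB if both zip parts are kept")
    print("Enough space here:", has_enough_space("."))
```

With both zip parts kept alongside the extracted archive, the estimate comes to roughly 196 GB; deleting each zip after extraction lowers the peak requirement.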
Train the UNIC model with:

    python train.py --config-dir=./unic/config --config-name=train_unic

Training configurations are located in unic/config.
Training logs and metrics are automatically tracked with Weights & Biases (wandb). Checkpoints are saved periodically in the output directory specified in the config.
If you find this work useful, please consider citing:

    @inproceedings{xu2026unic,
      author    = {Xu, Zhengtong and Shirai, Yuki},
      title     = {UNIC: Learning Unified Multimodal Extrinsic Contact Estimation},
      booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
      year      = {2026}
    }

Released under the AGPL-3.0-or-later license, as found in the LICENSE.md file.
All files:

    Copyright (C) 2025 Mitsubishi Electric Research Laboratories (MERL)

    SPDX-License-Identifier: AGPL-3.0-or-later
For questions or issues, please contact:
- Zhengtong Xu (Purdue University): xu1703@purdue.edu
- Yuki Shirai (MERL): yukishirai1926@gmail.com
- Diego Romeres (MERL): romeres@merl.com









