kmn01/DepthAware

Multimodal Late Fusion for Depth-Aware Spatial Scene Understanding

Kirthana Natarajan, Trisha Maturi

Abstract

Depth information provides explicit geometric cues that are often missing from RGB images, making it a potentially valuable modality for indoor scene understanding. While recent vision–language models such as CLIP learn powerful semantic representations from RGB–text pairs, their effectiveness for encoding depth images and integrating geometric information remains underexplored. In this work, we investigate the role of depth embeddings in multimodal indoor scene classification using the SUNRGBD dataset. We generate spatially focused captions for RGB images using a pretrained vision–language model and extract image, text, and depth embeddings using frozen encoders. These embeddings are combined using multiple fusion architectures, including shallow and deep late fusion, gated fusion, and transformer-based fusion, and evaluated under controlled ablation settings.
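
To make the fusion step concrete, here is a minimal NumPy sketch of two of the late-fusion variants named above: concatenation followed by a linear classifier, and a gated fusion that learns per-modality weights. It is an illustration under stated assumptions, not the repository's implementation. The embedding size (512), the class count (10), the random stand-in embeddings, the untrained random weights, and the function names `concat_fusion` and `gated_fusion` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length, as is common for CLIP-style embeddings."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def concat_fusion(img, txt, dep, W, b):
    """Shallow late fusion: concatenate the three embeddings, then classify."""
    fused = np.concatenate([img, txt, dep], axis=-1)  # (batch, 3 * d)
    return softmax(fused @ W + b)


def gated_fusion(img, txt, dep, Wg, bg, W, b):
    """Gated fusion: a softmax gate weights each modality before classifying."""
    stacked = np.stack([img, txt, dep], axis=1)  # (batch, 3, d)
    gate_logits = np.concatenate([img, txt, dep], axis=-1) @ Wg + bg
    gates = softmax(gate_logits)  # (batch, 3), rows sum to 1
    fused = (gates[:, :, None] * stacked).sum(axis=1)  # (batch, d)
    return softmax(fused @ W + b), gates


# Hypothetical sizes; random vectors stand in for frozen-encoder outputs.
batch, d, n_classes = 4, 512, 10
img, txt, dep = (l2_normalize(rng.standard_normal((batch, d))) for _ in range(3))

# Untrained placeholder weights (in practice these would be learned).
W_concat = rng.standard_normal((3 * d, n_classes)) * 0.01
b_concat = np.zeros(n_classes)
W_gate = rng.standard_normal((3 * d, 3)) * 0.01
b_gate = np.zeros(3)
W_cls = rng.standard_normal((d, n_classes)) * 0.01
b_cls = np.zeros(n_classes)

probs_concat = concat_fusion(img, txt, dep, W_concat, b_concat)
probs_gated, gates = gated_fusion(img, txt, dep, W_gate, b_gate, W_cls, b_cls)
```

In this sketch, an ablation amounts to dropping one modality from the concatenation (or from the stack) and resizing the weights accordingly. Since the encoders are frozen, only the fusion weights would be trained.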

In addition, we conduct a systematic analysis of CLIP as a depth encoder, examining embedding–geometry correlations, neighborhood consistency, cross-modal alignment, and retrieval performance. Our results show that CLIP depth embeddings capture coarse geometric structure and local depth similarity, even though CLIP was not trained on depth images. For scene classification, multimodal fusion improves performance over unimodal baselines, with image–text combinations providing the largest gains and depth contributing modest improvements in specific settings. However, increased architectural complexity does not consistently yield better performance, and transformer-based fusion underperforms when training data is limited. Overall, our findings highlight both the potential and the limitations of using CLIP-based depth embeddings for multimodal scene understanding.
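
Two of the analyses mentioned above, retrieval performance and neighborhood consistency, can be sketched as follows. The abstract does not define these metrics, so the definitions here are assumptions: retrieval is measured as recall@k over cosine similarity with index-aligned pairs, and neighborhood consistency as the fraction of each sample's k nearest neighbors that share its scene label. The function names and toy data are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def recall_at_k(queries, gallery, k):
    """Fraction of queries whose paired gallery item (same row index)
    appears among the k most similar gallery items."""
    sims = l2_normalize(queries) @ l2_normalize(gallery).T
    topk = np.argsort(-sims, axis=1)[:, :k]
    targets = np.arange(len(queries))[:, None]
    return float((topk == targets).any(axis=1).mean())


def knn_label_consistency(emb, labels, k):
    """Average fraction of each sample's k nearest neighbors (excluding
    itself) that carry the same label as the sample."""
    sims = l2_normalize(emb) @ l2_normalize(emb).T
    np.fill_diagonal(sims, -np.inf)  # never count a sample as its own neighbor
    topk = np.argsort(-sims, axis=1)[:, :k]
    return float((labels[topk] == labels[:, None]).mean())


# Toy data: two clearly separated clusters with two samples each.
toy_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
toy_labels = np.array([0, 0, 1, 1])

self_recall = recall_at_k(toy_emb, toy_emb, k=1)
consistency = knn_label_consistency(toy_emb, toy_labels, k=1)
```

For a cross-modal check, `queries` would hold depth embeddings and `gallery` the RGB embeddings of the same scenes, so that row `i` in both arrays refers to the same sample.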
