Multimodal Late Fusion for Depth-Aware Spatial Scene Understanding

Kirthana Natarajan, Trisha Maturi

Abstract

Depth information provides explicit geometric cues that are often missing from RGB images, making it a potentially valuable modality for indoor scene understanding. While recent vision–language models such as CLIP learn powerful semantic representations from RGB–text pairs, their effectiveness for encoding depth images and integrating geometric information remains underexplored. In this work, we investigate the role of depth embeddings in multimodal indoor scene classification using the SUNRGBD dataset. We generate spatially focused captions for RGB images using a pretrained vision–language model and extract image, text, and depth embeddings using frozen encoders. These embeddings are combined using multiple fusion architectures, including shallow and deep late fusion, gated fusion, and transformer-based fusion, and evaluated under controlled ablation settings.
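As a rough illustration of two of the fusion styles named above, the following minimal sketch shows concatenation-based late fusion followed by a linear classifier, and a scalar-gated blend of two embeddings. It is not the authors' implementation: the embedding dimension, the random weights, and every function name (`concat_fusion`, `gated_fusion`, `linear_logits`) are assumptions made for the example, and the toy vectors stand in for embeddings that would come from frozen encoders.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def concat_fusion(image_emb, text_emb, depth_emb):
    """Late fusion by concatenating the normalized per-modality embeddings."""
    return l2_normalize(image_emb) + l2_normalize(text_emb) + l2_normalize(depth_emb)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(primary_emb, depth_emb, gate_weights, gate_bias=0.0):
    """Blend two embeddings with one scalar gate g: g * primary + (1 - g) * depth.

    The gate is computed from the concatenation of both normalized embeddings.
    A single scalar gate is an assumption; per-dimension gates are also common.
    """
    p = l2_normalize(primary_emb)
    d = l2_normalize(depth_emb)
    g = sigmoid(sum(w * x for w, x in zip(gate_weights, p + d)) + gate_bias)
    return [g * pi + (1.0 - g) * di for pi, di in zip(p, d)], g


def linear_logits(features, weights, biases):
    """Apply a linear classification head: one weight row and bias per class."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]


# Toy stand-ins for frozen-encoder outputs (hypothetical 4-dimensional embeddings).
rng = random.Random(0)
dim = 4
image_emb = [rng.gauss(0, 1) for _ in range(dim)]
text_emb = [rng.gauss(0, 1) for _ in range(dim)]
depth_emb = [rng.gauss(0, 1) for _ in range(dim)]

# Concatenation fusion feeding an untrained 3-class linear head.
fused = concat_fusion(image_emb, text_emb, depth_emb)
num_classes = 3
head_weights = [[rng.gauss(0, 0.1) for _ in range(len(fused))]
                for _ in range(num_classes)]
logits = linear_logits(fused, head_weights, [0.0] * num_classes)

# Gated fusion with zero gate weights, so the gate starts at exactly 0.5.
gated, gate = gated_fusion(image_emb, depth_emb, [0.0] * (2 * dim))
```

In a trained model the head weights and gate weights would be learned while the encoders stay frozen; here they are random or zero only so the sketch runs on its own.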

In addition, we conduct a systematic analysis of CLIP as a depth encoder, examining embedding–geometry correlations, neighborhood consistency, cross-modal alignment, and retrieval performance. Our results show that CLIP depth embeddings capture coarse geometric structure and local depth similarity, even though CLIP was not trained on depth images. For scene classification, multimodal fusion improves performance over unimodal baselines, with image–text combinations providing the largest gains and depth contributing modest improvements in specific settings. However, increased architectural complexity does not consistently yield better performance, and transformer-based fusion underperforms when training data is limited. Overall, our findings highlight both the potential and the limitations of using CLIP-based depth embeddings for multimodal scene understanding.
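One common way to quantify cross-modal retrieval performance is recall@k over paired embeddings. The sketch below assumes that depth embedding `i` is paired with RGB embedding `i` and ranks the gallery by cosine similarity; the metric choice, the function names, and the toy vectors are all assumptions for illustration and are not taken from the report.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def recall_at_k(query_embs, gallery_embs, k):
    """Fraction of queries whose paired gallery item (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(query_embs):
        sims = [cosine(query, item) for item in gallery_embs]
        # Gallery indices sorted from most to least similar to this query.
        ranked = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(query_embs)


# Hypothetical 3-dimensional embeddings: depth queries against an RGB gallery.
depth_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rgb_embs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.0, 0.3, 0.8]]

recall_1 = recall_at_k(depth_embs, rgb_embs, k=1)
recall_2 = recall_at_k(depth_embs, rgb_embs, k=2)
```

In this toy data the second depth query is closest to the wrong RGB item, so it counts as a miss at k=1 and a hit at k=2. Neighborhood consistency could be measured in a similar spirit by comparing nearest-neighbor sets within each modality.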

Repository contents

notebooks
sceneclassification
.gitignore
DepthAware_Report.pdf
Multimodal Late Fusion for Depth-Aware Spatial Scene Understanding.pdf
README.md
requirements.txt
