Camera Motion Grounded Evaluation and Training for Vision-Language Models

This repository contains the code for the paper "Towards Genuine Spatial Intelligence: Camera Motion Grounded Evaluation and Training for Vision-Language Models".

🚀 Quick Links

Model Checkpoint: CaMo-3B
Datasets: SpatialLadder-26k, CameraBench

📦 Installation

Create Conda Environment

conda create -n camo python=3.10 -y
conda activate camo

Install LLaMA-Factory

Follow the standard LLaMA-Factory installation instructions:

# Clone the repository and enter it
git clone https://github.com/hsiangwei0903/CaMo.git
cd CaMo

# Install LLaMA-Factory and its dependencies in editable mode
pip install -e .

Other dependencies may be required depending on your training configuration. For training, we also use deepspeed==0.15.4 and flash-attn==2.7.4.post1.
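To confirm that an environment matches the versions quoted above, a quick check along the following lines may help. This is a minimal sketch built on Python's standard importlib.metadata module; the helper name installed_versions and the PINNED mapping are illustrative and not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions quoted in this README; adjust if your training config differs.
PINNED = {"deepspeed": "0.15.4", "flash-attn": "2.7.4.post1"}


def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Print one line per package so mismatches are easy to spot.
    for name, have in installed_versions(PINNED).items():
        status = "ok" if have == PINNED[name] else "differs"
        print(f"{name}: installed={have} expected={PINNED[name]} ({status})")
```

A package reported as None is not installed in the active environment; a "differs" status is not necessarily an error, since other versions may also work.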

For detailed setup instructions and other installation issues, refer to the LLaMA-Factory documentation.

📚 Dataset Preparation

Requirements

SpatialLadder-26k: Download from Hugging Face
CameraBench: Request access using this form

Setup

After downloading, organize the datasets in the data/ directory and update the configuration files accordingly.

🏋️ Training

Quick Start

To train the model, run the provided SLURM script:

sbatch train.slurm

Alternative: Direct Command

Alternatively, you can run the LLaMA-Factory training command directly:

llamafactory-cli train examples/train_full/qwen2_5vl_3b_full_sft_camo.yaml

For more training configurations, refer to train.slurm or explore the examples/train_full/ directory.

📊 Evaluation

Supported Datasets

The evaluation pipeline supports the following spatial understanding benchmarks:

Setup

Before running evaluations, download the required benchmark datasets and update their paths in eval_spld/evaluator.py.

Quick Start

To evaluate your trained model:

cd eval_spld
bash run_eval.sh

This runs the full evaluation pipeline with the default configuration.

Spatial Narrative Score (SNS) Evaluation

Setup

First, export your Gemini API key as an environment variable:

export gemini_api_key=<your_gemini_api_key>
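Environment variable names are case-sensitive on Linux and macOS, so the lowercase name above must be used exactly. The sketch below shows one way to check that the key is visible from Python before starting a long run. It assumes the evaluation script reads the variable named gemini_api_key; the helper get_gemini_key is hypothetical and not part of this repository.

```python
import os


def get_gemini_key(env=os.environ):
    """Return the API key stored under `gemini_api_key`, or raise a clear error."""
    # Assumption: the key lives under the lowercase name shown in this README.
    key = env.get("gemini_api_key", "").strip()
    if not key:
        raise RuntimeError(
            "gemini_api_key is not set; export it before running the SNS evaluation."
        )
    return key
```

Calling get_gemini_key() in the same shell session where the variable was exported should return the key; a RuntimeError means the variable is missing or empty in that session.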

Running Evaluation

To evaluate video captions using the Spatial Narrative Score metric:

python caption_eval/eval_sns.py --results_path <caption_results_path>

The results JSON file should follow this format:

{
    "47334107.mp4": [
        "Caption segment 1",
        "Caption segment 2",
        "Caption segment 3"
    ],
    "another_video.mp4": [
        "Caption segment 1",
        "Caption segment 2"
    ]
}

Each video file is mapped to a list of caption strings representing temporal segments of the video.
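Before spending API calls on a malformed file, it can be useful to check that a results file has the shape described above. The following is a minimal sketch, not the validation used by eval_sns.py; the function names are illustrative, and treating an empty caption list as a problem is an assumption.

```python
import json


def validate_caption_results(data):
    """Return a list of problems found; an empty list means the shape looks right."""
    if not isinstance(data, dict):
        return ["top level must be a JSON object mapping video names to lists"]
    problems = []
    for video, segments in data.items():
        # Assumption: every video should have at least one caption segment.
        if not isinstance(segments, list) or not segments:
            problems.append(f"{video}: expected a non-empty list of captions")
        elif not all(isinstance(s, str) for s in segments):
            problems.append(f"{video}: every caption segment must be a string")
    return problems


def load_and_validate(path):
    """Load a results JSON file and return the list of problems found in it."""
    with open(path, encoding="utf-8") as f:
        return validate_caption_results(json.load(f))
```

For example, load_and_validate("<caption_results_path>") returns an empty list for a file shaped like the sample above, and one message per offending video otherwise.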

🔗 Related Resources

LLaMA-Factory: The underlying training framework
SpatialLadder: Spatial understanding dataset and evaluation
