Camera Motion Grounded Evaluation and Training for Vision-Language Models

This repository contains the code for the paper "Towards Genuine Spatial Intelligence: Camera Motion Grounded Evaluation and Training for Vision-Language Models".

📦 Installation

Create Conda Environment

conda create -n camo python=3.10 -y
conda activate camo

Install LLaMA-Factory

Follow the standard LLaMA-Factory installation instructions:

# Clone the repository and enter it
git clone https://github.com/hsiangwei0903/CaMo.git
cd CaMo

# Install dependencies
pip install -e .

Additional dependencies may be required depending on your training configuration. For training we also use deepspeed==0.15.4 and flash-attn==2.7.4.post1.
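If you want to confirm that your environment matches the versions quoted above, a minimal sketch along these lines may help. It is not part of this repository: the helper name find_mismatches is hypothetical, and it only compares installed versions against the two pins mentioned here using the standard library.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins quoted from the text above; adjust them if your training config differs.
PINNED = {"deepspeed": "0.15.4", "flash-attn": "2.7.4.post1"}


def find_mismatches(pinned, get_version=version):
    """Return {package: (expected, installed_or_None)} for every pin that is not met."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    # Print one line per unmet pin; print nothing when everything matches.
    for name, (expected, installed) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

A mismatch is not necessarily an error; these are simply the versions reported for training here.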

For detailed setup instructions and other installation issues, refer to the LLaMA-Factory documentation.

📚 Dataset Preparation

Requirements

  1. SpatialLadder-26k: Download from Hugging Face
  2. CameraBench: Request access using this form

Setup

After downloading, organize the datasets in the data/ directory and update the configuration files accordingly.

🏋️ Training

Quick Start

To train the model, run the provided SLURM script:

sbatch train.slurm

Alternative: Direct Command

Alternatively, you can run the LLaMA-Factory training command directly:

llamafactory-cli train examples/train_full/qwen2_5vl_3b_full_sft_camo.yaml

For more training configurations, refer to train.slurm or explore the examples/train_full/ directory.

📊 Evaluation

Supported Datasets

The evaluation pipeline supports the following spatial understanding benchmarks:

Setup

Before running evaluations, download the required benchmark datasets and update their paths in eval_spld/evaluator.py.

Quick Start

To evaluate your trained model:

cd eval_spld
bash run_eval.sh

This command will execute the full evaluation pipeline using the default configuration.

Spatial Narrative Score (SNS) Evaluation

Setup

First, export your Gemini API key as an environment variable:

export gemini_api_key=<your_gemini_api_key>
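To catch a missing key before starting a long evaluation, you could run a quick check like the sketch below. How caption_eval/eval_sns.py reads the key is not shown here, so this assumes it looks up the exact lowercase name gemini_api_key in the process environment. The helper read_gemini_key is hypothetical.

```python
import os


def read_gemini_key(env=os.environ):
    """Return the key stored under `gemini_api_key`, or raise a clear error."""
    # Environment variable names are case-sensitive on Linux and macOS,
    # so GEMINI_API_KEY would not satisfy this lookup.
    key = env.get("gemini_api_key", "").strip()
    if not key:
        raise RuntimeError("gemini_api_key is not set; export it before running the evaluation")
    return key
```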

Running Evaluation

To evaluate video captions using the Spatial Narrative Score metric:

python caption_eval/eval_sns.py --results_path <caption_results_path>

The results JSON file should follow this format:

{
    "47334107.mp4": [
        "Caption segment 1",
        "Caption segment 2",
        "Caption segment 3"
    ],
    "another_video.mp4": [
        "Caption segment 1",
        "Caption segment 2"
    ]
}

Each video file is mapped to a list of caption strings representing temporal segments of the video.
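As an illustration of this format, the sketch below validates a mapping and writes it to disk. It is not code from this repository: validate_results and save_results are hypothetical helpers, and the checks cover only the structure described above (a JSON object whose values are lists of strings).

```python
import json


def validate_results(results):
    """Raise ValueError unless `results` maps video names to lists of caption strings."""
    if not isinstance(results, dict):
        raise ValueError("top level must be a JSON object")
    for video, segments in results.items():
        if not isinstance(segments, list):
            raise ValueError(f"{video}: expected a list of captions")
        for i, caption in enumerate(segments):
            if not isinstance(caption, str):
                raise ValueError(f"{video}: segment {i} is not a string")
    return results


def save_results(results, path):
    """Validate the mapping, then write it as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(validate_results(results), f, indent=4, ensure_ascii=False)
```

For example, save_results({"47334107.mp4": ["Caption segment 1", "Caption segment 2"]}, "caption_results.json") would produce a file that can be passed to --results_path. The file name here is only a placeholder.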
