This repository contains the code for the paper "Towards Genuine Spatial Intelligence: Camera Motion Grounded Evaluation and Training for Vision-Language Models".
- Model Checkpoint: CaMo-3B
- Datasets: SpatialLadder-26k, CameraBench
conda create -n camo python=3.10 -y
conda activate camo
Follow the standard LLaMA-Factory installation instructions:
# Clone the repository
git clone https://github.com/hsiangwei0903/CaMo.git
# Install dependencies
pip install -e .
Other dependencies may be required depending on your training configuration. We also use deepspeed==0.15.4 and flash-attn==2.7.4.post1 for training.
For detailed setup instructions and other installation issues, refer to the LLaMA-Factory documentation.
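As a quick sanity check after installation, the following sketch compares the installed package versions against the ones reported above. It is not part of this repository: the helper name `check_versions` and the `REPORTED_VERSIONS` table are hypothetical, and a differing version is not necessarily a problem.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions this README reports using for training; other versions may also work.
REPORTED_VERSIONS = {
    "deepspeed": "0.15.4",
    "flash-attn": "2.7.4.post1",
}


def check_versions(expected=REPORTED_VERSIONS):
    """Return {package: (installed_version_or_None, reported_version)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            # Package is not installed in the active environment.
            installed = None
        report[name] = (installed, wanted)
    return report


if __name__ == "__main__":
    for name, (installed, wanted) in check_versions().items():
        status = "matches" if installed == wanted else "differs"
        print(f"{name}: installed={installed}, reported={wanted} ({status})")
```

Run it inside the activated `camo` environment so that it inspects the same packages the training job will use.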
- SpatialLadder-26k: Download from Hugging Face
- CameraBench: Request access using this form
After downloading, organize the datasets in the data/ directory and update the configuration files accordingly.
To train the model, run the provided SLURM script:
sbatch train.slurm
Alternatively, you can run the LLaMA-Factory training command directly:
llamafactory-cli train examples/train_full/qwen2_5vl_3b_full_sft_camo.yaml
For more training configurations, refer to train.slurm or explore the examples/train_full/ directory.
The evaluation pipeline supports the following spatial understanding benchmarks:
Before running evaluations, download the required benchmark datasets and update their paths in eval_spld/evaluator.py.
To evaluate your trained model:
cd eval_spld
bash run_eval.sh
This command runs the full evaluation pipeline with the default configuration.
First, export your Gemini API key as an environment variable:
export gemini_api_key=<your_gemini_api_key>
To evaluate video captions using the Spatial Narrative Score metric:
python caption_eval/eval_sns.py --results_path <caption_results_path>
The results JSON file should follow this format:
{
  "47334107.mp4": [
    "Caption segment 1",
    "Caption segment 2",
    "Caption segment 3"
  ],
  "another_video.mp4": [
    "Caption segment 1",
    "Caption segment 2"
  ]
}
Each video file is mapped to a list of caption strings representing temporal segments of the video.
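Before spending API calls, it can help to confirm that a results file matches the format described above and that the API key is visible to the process. The sketch below is a hypothetical pre-flight check, not code from this repository: the function names are made up for illustration, and it assumes only what is stated here (a JSON object mapping video file names to lists of caption strings, and an environment variable named `gemini_api_key`).

```python
import json
import os


def validate_caption_results(results):
    """Return a list of problems; an empty list means the format looks right."""
    if not isinstance(results, dict):
        return ["top level must be an object mapping video names to caption lists"]
    problems = []
    for video, segments in results.items():
        if not isinstance(segments, list):
            problems.append(f"{video}: expected a list of caption strings")
            continue
        for i, segment in enumerate(segments):
            if not isinstance(segment, str):
                problems.append(f"{video}[{i}]: expected a string")
    return problems


def load_and_validate(path):
    """Load a results JSON file and validate its structure."""
    with open(path, encoding="utf-8") as f:
        return validate_caption_results(json.load(f))


def api_key_is_set(name="gemini_api_key"):
    """Check that the environment variable is set and non-empty."""
    # Environment variable names are case-sensitive on Linux and macOS.
    return bool(os.environ.get(name))
```

For example, `load_and_validate("results.json")` (a placeholder path) returns `[]` when every entry is a list of strings, and otherwise one message per offending entry.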
- LLaMA-Factory: The underlying training framework
- SpatialLadder: Spatial understanding dataset and evaluation