Freeze-Align is a lightweight framework for flexible multimodal alignment.
It connects frozen vision and language encoders using small trainable projectors, allowing efficient cross-modal learning without the need for full-scale retraining.
This repository contains the code and datasets associated with our CVPR 2025 paper:
Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment
This framework is built on the premise that semantically similar vision and language embedding spaces can be aligned through simple projection transformations. For example, by aligning DINOv2 with the Sentence-Transformer model all-roberta-large-v1, we achieve a remarkable 76% zero-shot ImageNet accuracy, surpassing comparable CLIP models while reducing alignment compute by 65x and paired data requirements by 20x.
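To make the premise concrete, here is a minimal inference-time sketch of alignment by projection, assuming a single linear projector on the vision side. The checkpoint path, pooling choice, and projector shape are illustrative placeholders, not the repository's actual configuration.

```python
import torch
from transformers import AutoImageProcessor, AutoModel
from sentence_transformers import SentenceTransformer
from PIL import Image

# Frozen unimodal encoders (neither is updated during alignment).
vision_encoder = AutoModel.from_pretrained("facebook/dinov2-large").eval()
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-large")
text_encoder = SentenceTransformer("sentence-transformers/all-roberta-large-v1")

# Small projector mapping vision embeddings into the language embedding space.
# A single linear layer is the simplest choice; "projector.pt" is a placeholder
# for weights produced by the training stage described below.
projector = torch.nn.Linear(1024, 1024)
# projector.load_state_dict(torch.load("projector.pt"))

image = Image.open("cat.jpg")
with torch.no_grad():
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    img_emb = vision_encoder(pixel_values).last_hidden_state[:, 0]   # CLS token
    img_emb = torch.nn.functional.normalize(projector(img_emb), dim=-1)

    prompts = ["a photo of a cat", "a photo of a dog"]
    txt_emb = text_encoder.encode(prompts, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity between the projected image embedding and each prompt.
print((img_emb @ txt_emb.T).softmax(dim=-1))
```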
We believe this approach holds immense potential for further advancements. We invite the open-source community to explore aligning newer, more powerful vision and language encoders to develop high-performing CLIP-like models with minimal effort. Notably, recent improvements in language models on the MTEB benchmark and advancements in SSL vision models present exciting opportunities for experimentation and innovation.
- Paper
- Concept Coverage Dataset 6M
- [Video] (Coming soon)
- [Slides] (Coming soon)
- Optimal vision/language encoder pair discovery via Centered Kernel Alignment (CKA)
- Lightweight training using frozen unimodal backbones
- Curate high-quality datasets from LAION or other pools to enable efficient alignment
- Supports flexible vision encoders (Hugging Face Transformers) and language encoders (Sentence Transformers)
- Drastically reduced compute and data requirements; for instance, we outperform OpenAI and LAION CLIP models with 20x less paired data and 65x less compute.
```bash
# Clone the repository
git clone https://github.com/mayug/freeze-align.git
cd freeze-align

# Create environment
conda env create -f environment.yaml
conda activate freeze-align
```
Find the most semantically similar vision-language encoder pairs.

```bash
python get_semantic_sim.py
```

Workflow:
- Download the COCO dataset.
- Generate embeddings for selected models.
- Compute linear CKA scores.
- Save a plot (`cka_results.png`) and output the best encoder pair.
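For a quick sanity check outside the script, linear CKA between two row-aligned feature matrices takes only a few lines. The sketch below uses random stand-ins; in practice the rows would be per-image vision embeddings and per-caption text embeddings computed on the same COCO pairs.

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between two feature matrices with matched rows (n_samples x dim)."""
    X = X - X.mean(dim=0, keepdim=True)   # center each feature dimension
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (X.T @ Y).norm(p="fro") ** 2   # ||X^T Y||_F^2
    norm_x = (X.T @ X).norm(p="fro")      # ||X^T X||_F
    norm_y = (Y.T @ Y).norm(p="fro")      # ||Y^T Y||_F
    return hsic / (norm_x * norm_y)

# Example with random stand-ins for vision/text embeddings of the same 1,000 samples.
vision_feats = torch.randn(1000, 1024)
text_feats = torch.randn(1000, 1024)
print(linear_cka(vision_feats, text_feats))
```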
Train lightweight projectors between frozen encoders.
Setup:
- Download datasets using img2dataset (a minimal download sketch follows this setup list):
  - CC3M
  - CC12M
  - SBU
  - LAION class-collected 6M (Google Drive, Hugging Face)
  - ImageNet validation set
- Configure dataset paths in `dinov2-arl-wds-combined.yaml`.
- Set up environment:

```bash
conda env create -f environment.yaml
conda activate freeze-align
bash extra_install.sh
```
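As a concrete example of the download bullet above, the sketch below pulls a (caption, url) metadata file into WebDataset shards with img2dataset's Python API. The metadata filename, column names, and output folder are placeholders; adjust them per dataset (CC3M, CC12M, SBU).

```python
# Hypothetical download call; paths and column names are assumptions.
from img2dataset import download

download(
    url_list="cc3m_metadata.tsv",      # assumed metadata file with caption/url columns
    input_format="tsv",
    url_col="url",
    caption_col="caption",
    output_format="webdataset",        # shards consumable by the *-wds-* training configs
    output_folder="./data/cc3m-wds",
    image_size=256,
    processes_count=16,
    thread_count=64,
)
```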
Training Command:

```bash
python -m torch.distributed.launch --master_port=43770 --nproc_per_node=8 --use_env PretrainHydra.py \
    --config dinov2-arl-wds-combined \
    --output_dir ./storage/output/ \
    --overrides +save_last_only=False fp16=True disable_wandb=False text_pooling=mean \
    local_vision_projection=patch local_text_projection=patch text_projection=mlp
```
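For intuition about what this command optimizes, the sketch below shows one alignment step in isolation: only the small projectors are trainable, and they are fit with a CLIP-style symmetric InfoNCE loss on frozen-encoder embeddings of paired data. The projector shapes, optimizer settings, and temperature are illustrative; the real architectures are selected by the config overrides (e.g. `text_projection=mlp`).

```python
import torch
import torch.nn.functional as F

# Small trainable projectors over frozen-encoder outputs (dims are examples).
vision_proj = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.GELU(), torch.nn.Linear(1024, 1024))
text_proj = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.GELU(), torch.nn.Linear(1024, 1024))
optimizer = torch.optim.AdamW(list(vision_proj.parameters()) + list(text_proj.parameters()), lr=1e-4)
temperature = 0.07

def alignment_step(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> float:
    """One CLIP-style step on precomputed frozen embeddings of paired (image, caption) data."""
    z_img = F.normalize(vision_proj(img_emb), dim=-1)
    z_txt = F.normalize(text_proj(txt_emb), dim=-1)
    logits = z_img @ z_txt.T / temperature            # batch x batch similarity matrix
    targets = torch.arange(logits.size(0))            # matching pairs sit on the diagonal
    loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with random stand-ins for a batch of frozen DINOv2 / roberta embeddings.
print(alignment_step(torch.randn(256, 1024), torch.randn(256, 1024)))
```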
Curate a concept-rich dataset from LAION.

Steps:
- Download LAION metadata into `/laion400m-meta/`.
- Compute embeddings:

  ```bash
  python getting_laion_embeds.py --gpu <GPU_ID> --b <BATCH_SIZE> --m <MODEL> --p <PART>
  ```

- Calculate similarity scores:

  ```bash
  python scores_new.py --gpu <GPU_ID> --b <BATCH_SIZE> --p <PART>
  ```

- Sort top samples:

  ```bash
  python sort_samples.py --p <PART> --max <MAX_SAMPLES> --b <BATCH_SIZE> --sort_b <SORT_BATCH_SIZE> --gpu <GPU_ID>
  ```

- Deduplicate and collect final samples:

  ```bash
  python collect_fast.py --parts <NUM_PARTS> --max <MAX_SAMPLES>
  ```

🔥 Tip: Complete all parts for each step before proceeding to the next.
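The scripts above are the actual pipeline; as a rough illustration of the underlying idea, i.e. scoring captions against a concept bank and keeping the closest samples per concept, a toy version could look like this. The concept list, encoder, and top-k choice are placeholders, not the settings used for the released 6M dataset.

```python
import torch
from sentence_transformers import SentenceTransformer

# Toy illustration: rank candidate captions by similarity to a small concept bank.
model = SentenceTransformer("sentence-transformers/all-roberta-large-v1")
concepts = ["goldfish", "acoustic guitar", "fire engine"]   # e.g. ImageNet class names
captions = [
    "a goldfish swimming in a round glass bowl",
    "man playing an acoustic guitar on stage",
    "red fire engine parked outside the station",
    "random blurry photo with no clear subject",
]

concept_emb = model.encode(concepts, convert_to_tensor=True, normalize_embeddings=True)
caption_emb = model.encode(captions, convert_to_tensor=True, normalize_embeddings=True)
scores = caption_emb @ concept_emb.T          # cosine similarity, captions x concepts

# Keep the best-matching caption for each concept (here: top-1 per concept).
top_scores, top_idx = scores.topk(k=1, dim=0)
for concept, score, idx in zip(concepts, top_scores.squeeze(0), top_idx.squeeze(0)):
    print(f"{concept}: '{captions[int(idx)]}' (similarity {score.item():.2f})")
```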
- ✅ Publish class-collected datasets to Hugging Face Datasets
- ⬜ Add `push_to_hub` utility for uploading trained models
- ⬜ Release Colab demos for alignment and training
- ⬜ Add support for additional vision backbones (e.g., SAM, EVA-CLIP)
If you use our work, please cite:
```bibtex
@inproceedings{maniparambil2025harnessing,
  title={Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment},
  author={Maniparambil, Mayug and Akshulakov, Raiymbek and Djilali, Yasser Abdelaziz Dahou and Narayan, Sanath and Singh, Ankit and O'Connor, Noel E},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}
```

- Code adapted in part from the LiLT project.
- Dataset downloads powered by img2dataset.