CLIP4Cir

CLIP for Composed image retrieval

Table of Contents

  • About The Project
  • Getting Started
  • Usage
  • Authors
  • License
  • Citation
  • Contacts

About The Project

[Figure: Composed image retrieval task]

[Figure: CLIP task-oriented fine-tuning]

[Figure: Combiner training]

[Figure: Combiner architecture]

Abstract

Recent works have shown that large-scale vision and language pre-trained (VLP) models can be used to address many different tasks, such as zero-shot learning or text-to-image retrieval. In this paper, we explore the use of features obtained from the OpenAI CLIP model to address the task of composed image retrieval. This is a new multimodal retrieval task in which the query consists of a reference image and an associated text describing conditions or changes that the user wants with respect to the reference image, i.e. the query is provided as an image-language pair.

To address this task, we initially perform a task-oriented fine-tuning of both CLIP encoders using a simple combination of visual and textual features. Then, in the second stage, we learn a Combiner network that merges the fine-tuned features, integrating the multimodal information and providing combined features used to perform the retrieval task. Contrastive learning is used in the training of both stages.

Starting from the bare CLIP features as a simple baseline, we show that both the task-oriented fine-tuning and the carefully crafted Combiner network are highly effective and outperform more complex state-of-the-art approaches on FashionIQ and CIRR, two popular and challenging datasets for composed image retrieval.
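
As an illustration of the bare-CLIP baseline mentioned above, the sketch below (not code from this repository; image paths and the caption are placeholders) encodes the reference image and the relative caption with CLIP, sums the normalized features, and ranks candidate images by cosine similarity. It assumes the openai/CLIP package installed as described in the Installation section.

# Minimal sketch of the bare-CLIP "sum" baseline for composed image retrieval.
# Not code from this repository; image paths and the caption are placeholders.
import clip
import torch
import torch.nn.functional as F
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50x4", device=device)

# Encode the query: a reference image plus a relative caption
reference = preprocess(Image.open("reference.jpg")).unsqueeze(0).to(device)
caption = clip.tokenize(["is red and has long sleeves"]).to(device)

with torch.no_grad():
    image_feat = F.normalize(model.encode_image(reference), dim=-1)
    text_feat = F.normalize(model.encode_text(caption), dim=-1)
    query_feat = F.normalize(image_feat + text_feat, dim=-1)  # simple sum combination

    # Encode the candidate (index) images and rank them by cosine similarity
    candidates = torch.stack([preprocess(Image.open(p)) for p in ["a.jpg", "b.jpg"]]).to(device)
    index_feats = F.normalize(model.encode_image(candidates), dim=-1)
    ranking = (query_feat @ index_feats.T).squeeze(0).argsort(descending=True)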

Built With

  • CLIP
  • PyTorch
  • Comet

Getting Started

To get a local copy up and running, follow these simple steps.

Prerequisites

We strongly recommend using the Anaconda package manager to avoid dependency/reproducibility problems. A conda installation guide for Linux systems can be found here.

Installation

  1. Clone the repo
git clone https://github.com/ABaldrati/CLIP4Cir
  2. Install Python dependencies
conda create -n clip4cir -y python=3.8
conda activate clip4cir
conda install -y -c pytorch pytorch=1.11.0 torchvision=0.12.0
conda install -y -c anaconda pandas=1.4.2
pip install comet-ml==3.21.0
pip install git+https://github.com/openai/CLIP.git

Usage

For running the following scripts in a reasonable amount of time, it is strongly recommended to use a CUDA-capable GPU. It is also recommended to have a properly initialized Comet.ml account for better logging of the metrics (all metrics are also logged to a CSV file).

Here's a brief description of each file under the src/ directory:

  • utils.py: utility functions
  • combiner.py: Combiner model definition (see the illustrative sketch after this list)
  • data_utils.py: dataset loading and preprocessing utilities
  • clip_fine_tune.py: CLIP task-oriented fine-tuning
  • combiner_train.py: Combiner training
  • validate.py: compute metrics on the validation sets
  • cirr_test_submission.py: generate test predictions for the CIRR test set
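
To give a rough idea of what such a combiner does, here is a generic, illustrative sketch of a network that merges CLIP image and text features into a single retrieval feature; the actual architecture used in this project is defined in src/combiner.py and may differ.

# Illustrative combiner-style module: merges CLIP image and text features into
# a single feature used for retrieval. This is a generic sketch, not the exact
# architecture implemented in src/combiner.py.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCombiner(nn.Module):
    def __init__(self, clip_feature_dim: int, projection_dim: int, hidden_dim: int):
        super().__init__()
        self.image_proj = nn.Linear(clip_feature_dim, projection_dim)
        self.text_proj = nn.Linear(clip_feature_dim, projection_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * projection_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden_dim, clip_feature_dim),
        )

    def forward(self, image_features: torch.Tensor, text_features: torch.Tensor) -> torch.Tensor:
        # Project each modality, concatenate, and map back to the CLIP feature space
        merged = torch.cat([self.image_proj(image_features), self.text_proj(text_features)], dim=-1)
        combined = self.mlp(merged)
        return F.normalize(combined, dim=-1)  # L2-normalized for cosine-similarity retrieval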

N.B. The purpose of the code in this repo is to be as clear as possible. For this reason, it does not include some optimizations such as gradient checkpointing (when fine-tuning CLIP) and feature pre-computation (when training the Combiner network).
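
For reference, feature pre-computation simply means encoding every index image once with the frozen CLIP visual encoder and caching the results, so images are not re-encoded at every Combiner training step. A minimal sketch of the idea (the image list below is a placeholder):

# Sketch of the feature pre-computation optimization mentioned above (not
# implemented in this repo for clarity): encode all index images once with the
# frozen CLIP visual encoder and cache the features on disk.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50x4", device=device)
model.eval()

image_paths = ["fashionIQ_dataset/images/B00006M009.jpg"]  # placeholder list
features = {}
with torch.no_grad():
    for path in image_paths:
        image = preprocess(Image.open(path)).unsqueeze(0).to(device)
        features[path] = model.encode_image(image).squeeze(0).cpu()

torch.save(features, "index_features.pt")  # reload with torch.load during training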

Data Preparation

To work properly with the codebase, the FashionIQ and CIRR datasets should have the following structure:

project_base_path
└───  fashionIQ_dataset
      └─── captions
            | cap.dress.test.json
            | cap.dress.train.json
            | cap.dress.val.json
            | ...
            
      └───  images
            | B00006M009.jpg
            | B00006M00B.jpg
            | B00006M6IH.jpg
            | ...
            
      └─── image_splits
            | split.dress.test.json
            | split.dress.train.json
            | split.dress.val.json
            | ...

└───  cirr_dataset  
       └─── train
            └─── 0
                | train-10108-0-img0.png
                | train-10108-0-img1.png
                | train-10108-1-img0.png
                | ...
                
            └─── 1
                | train-10056-0-img0.png
                | train-10056-0-img1.png
                | train-10056-1-img0.png
                | ...
                
            ...
            
       └─── dev
            | dev-0-0-img0.png
            | dev-0-0-img1.png
            | dev-0-1-img0.png
            | ...
       
       └─── test1
            | test1-0-0-img0.png
            | test1-0-0-img1.png
            | test1-0-1-img0.png 
            | ...
       
       └─── cirr
            └─── captions
                | cap.rc2.test1.json
                | cap.rc2.train.json
                | cap.rc2.val.json
                
            └─── image_splits
                | split.rc2.test1.json
                | split.rc2.train.json
                | split.rc2.val.json
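
As an illustration of how this layout can be consumed (this is not the loader implemented in src/data_utils.py), the sketch below reads the FashionIQ training triplets for one garment category, assuming the standard FashionIQ annotation format in which each entry holds a "candidate" (reference) image name, a "target" image name, and a list of relative "captions":

# Illustrative FashionIQ triplet loader built on the directory layout above.
# Not the repository's data_utils.py; it assumes the standard FashionIQ
# annotation format ("candidate", "target", "captions" fields per entry).
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset

class FashionIQTriplets(Dataset):
    def __init__(self, base_path: str, dress_type: str, split: str, preprocess):
        base = Path(base_path) / "fashionIQ_dataset"
        with open(base / "captions" / f"cap.{dress_type}.{split}.json") as f:
            self.triplets = json.load(f)
        self.images_dir = base / "images"
        self.preprocess = preprocess

    def __len__(self):
        return len(self.triplets)

    def __getitem__(self, index):
        entry = self.triplets[index]
        reference = self.preprocess(Image.open(self.images_dir / f"{entry['candidate']}.jpg"))
        target = self.preprocess(Image.open(self.images_dir / f"{entry['target']}.jpg"))
        caption = " and ".join(entry["captions"])  # concatenate the relative captions
        return reference, target, caption

# Example usage (preprocess comes from clip.load; base path is the project root):
# dataset = FashionIQTriplets("project_base_path", "dress", "train", preprocess)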

Pre-trained models

We provide the pre-trained checkpoints (both the fine-tuned CLIP model and the Combiner network) via Google Drive in case you don't have enough GPU resources.

CLIP fine-tuning

To fine-tune the CLIP model on the FashionIQ or CIRR dataset, run the following command with the desired hyper-parameters:

python src/clip_fine_tune.py --dataset {'CIRR' or 'FashionIQ'} --api-key {Comet-api-key} --workspace {Comet-workspace} --experiment-name {Comet-experiment-name} --num-epochs 100 --clip-model-name RN50x4 --encoder both --learning-rate 2e-6 --batch-size 128 --transform targetpad --target-ratio 1.25  --save-training --save-best --validation-frequency 1
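
Conceptually, this stage fine-tunes both CLIP encoders so that the element-wise sum of reference-image and caption features lines up with the target-image features under a contrastive objective. The sketch below shows one common way such a batch-wise contrastive step can be written; it is a simplification, not the actual training loop in src/clip_fine_tune.py (mixed-precision handling, schedulers, and validation are omitted).

# Simplified sketch of one fine-tuning step: sum-combine reference-image and
# caption features, then apply a batch-wise contrastive (cross-entropy) loss
# against the target-image features. Not the actual loop in src/clip_fine_tune.py.
import clip
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("RN50x4", device=device, jit=False)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-6)

def training_step(reference_images, target_images, captions):
    ref_feats = F.normalize(model.encode_image(reference_images.to(device)), dim=-1)
    tgt_feats = F.normalize(model.encode_image(target_images.to(device)), dim=-1)
    txt_feats = F.normalize(model.encode_text(clip.tokenize(captions).to(device)), dim=-1)

    query_feats = F.normalize(ref_feats + txt_feats, dim=-1)  # simple sum combination
    logits = 100 * query_feats @ tgt_feats.T                  # scaled cosine similarities
    labels = torch.arange(logits.size(0), device=device)      # i-th query matches i-th target
    loss = F.cross_entropy(logits.float(), labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()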

Combiner training

To train the Combiner model on the FashionIQ or CIRR dataset, run the following command with the desired hyper-parameters:

python src/combiner_train.py --dataset {'CIRR' or 'FashionIQ'} --api-key {Comet-api-key} --workspace {Comet-workspace} --experiment-name {Comet-experiment-name} --projection-dim 2560 --hidden-dim 5120 --num-epochs 300 --clip-model-name RN50x4 --clip-model-path {path-to-fine-tuned-CLIP} --combiner-lr 2e-5 --batch-size 4096 --clip-bs 32 --transform targetpad --target-ratio 1.25 --save-training --save-best --validation-frequency 1
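
In this second stage only the Combiner weights are updated, while the fine-tuned CLIP model acts as a frozen feature extractor (which is also why the feature pre-computation mentioned in the Usage section is possible here). A short sketch of the setup, reusing the illustrative SimpleCombiner class from the Usage section; the contrastive step itself is the same as in the fine-tuning sketch above, with query_feats = combiner(ref_feats, txt_feats):

# Sketch of the second-stage setup: CLIP is frozen and only the combiner
# network is optimized. SimpleCombiner is the illustrative class sketched in
# the Usage section; the real model is defined in src/combiner.py.
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("RN50x4", device=device)
# ... load the fine-tuned CLIP weights from the first stage here ...
clip_model.eval()
for param in clip_model.parameters():
    param.requires_grad_(False)  # CLIP only provides features in this stage

combiner = SimpleCombiner(clip_feature_dim=640,  # 640 = CLIP RN50x4 feature dimension
                          projection_dim=2560, hidden_dim=5120).to(device)
optimizer = torch.optim.AdamW(combiner.parameters(), lr=2e-5)  # only combiner weights are updated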

Validation

To compute the metrics on the validation set, run the following command:

python src/validate.py --dataset {'CIRR' or 'FashionIQ'} --combining-function {'combiner' or 'sum'} --combiner-path {path to trained Combiner} --projection-dim 2560 --hidden-dim 5120 --clip-model-name RN50x4 --clip-model-path {path-to-fine-tuned-CLIP} --target-ratio 1.25 --transform targetpad
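
Both datasets are evaluated with recall-based metrics (Recall@K). As a point of reference, here is an illustrative way to compute Recall@K from L2-normalized query and index features; the actual metric code lives in src/validate.py.

# Illustrative Recall@K: for each query, check whether its ground-truth target
# appears among the K most similar index images. Not the code in src/validate.py.
import torch

def recall_at_k(query_features: torch.Tensor,   # (num_queries, dim), L2-normalized
                index_features: torch.Tensor,   # (num_index_images, dim), L2-normalized
                target_indices: torch.Tensor,   # (num_queries,) ground-truth target index
                k: int) -> float:
    similarities = query_features @ index_features.T            # cosine similarities
    top_k = similarities.topk(k, dim=-1).indices                # (num_queries, k)
    hits = (top_k == target_indices.unsqueeze(-1)).any(dim=-1)  # target within top-k?
    return hits.float().mean().item() * 100                     # percentage of queries

# Example with random placeholder features:
queries = torch.nn.functional.normalize(torch.randn(8, 640), dim=-1)
index = torch.nn.functional.normalize(torch.randn(100, 640), dim=-1)
targets = torch.randint(0, 100, (8,))
print(recall_at_k(queries, index, targets, k=10))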

Test

To generate the prediction files to be submitted to the CIRR evaluation server, run the following command:

python src/cirr_test_submission.py --submission-name {file name of the submission} --combining-function {'combiner' or 'sum'} --combiner-path {path to trained Combiner} --projection-dim 4096 --hidden-dim 8192 --clip-model-name RN50x4 --clip-model-path {path-to-fine-tuned-CLIP} --target-ratio 1.25 --transform targetpad

Authors

License

Citation

Contacts
