# Contrastive Language-Image Pretraining with SogCLR

### Introduction

In this tutorial, you will learn how to conduct contrastive language-image pretraining by optimizing the [Global Contrastive Loss](https://arxiv.org/abs/2202.12387) (GCL) on a subset of the [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/) dataset. You will also learn how to evaluate the model on a retrieval task using the [MSCOCO](https://cocodataset.org/#home) dataset and on a zero-shot classification task using the [ImageNet](https://www.image-net.org/challenges/LSVRC/index.php) dataset. The code is based on the [iSogCLR](https://github.com/zhqiu/contrastive-learning-iSogCLR) codebase, which includes implementations of CLIP, SogCLR, and iSogCLR.

### Preparation

First, we:

1. Download the source code and data
2. Install required packages

In [3]:
! pip install -r ../requirements.txt --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
autovizwidget 0.21.0 requires pandas<2.0.0,>=0.20.1, but you have pandas 2.1.1 which is incompatible.
hdijupyterutils 0.21.0 requires pandas<2.0.0,>=0.17.1, but you have pandas 2.1.1 which is incompatible.
sparkmagic 0.21.0 requires pandas<2.0.0,>=0.17.1, but you have pandas 2.1.1 which is incompatible.

### Training

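The following command runs the training script to train a ResNet50 (pretrained on ImageNet) and a DistilBERT (pretrained on BookCorpus and English Wikipedia) on a 100k-sample subset of the cc3m dataset using the SogCLR loss with temperature 0.01. Note that the command sets `--epochs 1`, whereas the benchmark results at the end of this tutorial are for 30-epoch pretraining.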

In [26]:
!python ../main.py \
    --data_path ../../datasets  \
    --ann_path ../../datasets/clip_train \
    --train_file cc3m_train_subset.json \
    --train_image_root cc3m_subset_100k \
    --output_dir ../../output/sogclr_cc3m_g0.8_e30 \
    --init_model \
    --use_amp \
    --ita_type sogclr \
    --tau_init 0.01 \
    --sogclr_gamma 0.8 \
    --eta_init 0.03 --sched cosine \
    --no-distributed \
    --device cuda \
    --val_frequency 5 \
    --step_size_per_epoch 200 \
    --print_freq_per_epoch 100 \
    --epochs 1

***
Creating retrieval dataset
***
len of train_dataset: 100000
len of validation dataset: 5000
***
Creating model
***
Cosine annealing scheduler will have no effect on the learning rate since t_initial = t_mul = eta_mul = 1.
Start training
Train Epoch: [0]: 100%|█| 781/781 [08:49<00:00,  1.48it/s, loss_ita=0.0104, lr=0
Averaged stats: lr: 0.0001  lr_temp_net: 0.0000  loss_ita: 0.0228  avg_image_tau: 0.0000  avg_text_tau: 0.0000  cur_eta: 0.0000  grad_tau_image: 0.0000  grad_tau_text: 0.0000  b_I: 0.0000  b_T: 0.0000  v: 0.0000  lamda: 0.0000  weights_image_pos: 0.0000  weights_text_pos: 0.0000
Training time: 0:08:53
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/improved-clip/notebooks/../main.py", line 192, in <module>
    train_stats = run_pipeline(args)
  File "/home/ec2-user/SageMaker/improved-clip/notebooks/../main.py", line 96, in run_pipeline
    return train_stats
NameError: name 'train_stats' is not defined
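
To make the roles of `--tau_init` and `--sogclr_gamma` more concrete, here is a minimal pure-Python sketch of the image-to-text half of a SogCLR-style loss. It is an illustration under stated assumptions, not the implementation in `main.py`: the function name `sogclr_image_side`, the dictionary `u`, and the update rule `u = (1 - gamma) * u + gamma * g` are choices made for this sketch, and the real code operates on GPU tensors with automatic differentiation.

```python
import math

def sogclr_image_side(sim, u, ids, tau=0.01, gamma=0.8):
    """Forward pass of a SogCLR-style loss for the image-to-text direction.

    sim : B x B nested list, sim[i][j] = similarity(image_i, text_j)
    u   : dict mapping dataset index -> moving-average estimate (hypothetical
          container for this sketch; it is updated in place)
    ids : dataset indices of the B samples in the batch
    Returns the surrogate loss value averaged over the batch.
    """
    batch_size = len(sim)
    total = 0.0
    for i, sample_id in enumerate(ids):
        # Gap between each negative pair and the positive pair (i, i).
        diffs = [sim[i][j] - sim[i][i] for j in range(batch_size) if j != i]
        exps = [math.exp(d / tau) for d in diffs]
        g = sum(exps) / (batch_size - 1)  # mini-batch estimate
        # Moving average across iterations (assumed convention for gamma).
        if sample_id not in u:
            u[sample_id] = g
        else:
            u[sample_id] = (1.0 - gamma) * u[sample_id] + gamma * g
        # Weights are treated as constants; gradients would flow through diffs.
        weights = [e / u[sample_id] for e in exps]
        total += sum(w * d for w, d in zip(weights, diffs)) / (batch_size - 1)
    return total / batch_size

# Toy batch of two image-text pairs.
toy_sim = [[0.9, 0.1], [0.2, 0.8]]
state = {}
toy_loss = sogclr_image_side(toy_sim, state, ids=[0, 1])
```

The per-sample state `u` is what distinguishes this from a plain mini-batch contrastive loss: the normalizer for each sample is averaged over iterations rather than recomputed from the current batch alone. A full version would add the symmetric text-to-image term.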


### Evaluation

The following command evaluates the retrieval performance of the trained model on the MSCOCO validation set and its zero-shot classification performance on the ImageNet validation set. The evaluation command is obtained by appending `--evaluate --checkpoint /path/to/your/checkpoint --zs_dataset imagenet --zs_datafolder /path/to/imagenet/val` to the training command.

In [None]:
!python ../main.py \
    --data_path ../../datasets  \
    --ann_path ../../datasets/clip_train \
    --train_file cc3m_train_subset.json \
    --train_image_root cc3m_subset_100k \
    --output_dir output/isogclr_cc3m_g0.8_e30 \
    --init_model \
    --use_amp \
    --ita_type sogclr \
    --tau_init 0.01 \
    --sogclr_gamma 0.8 \
    --eta_init 0.03 --sched cosine \
    --no-distributed \
    --epochs 1 \
    --evaluate --checkpoint '../../output/sogclr_cc3m_g0.8_e30/checkpoint_1.pth' \
    --zs_dataset imagenet --zs_datafolder ../../datasets/imagenet/val

***
Creating retrieval dataset
***
len of train_dataset: 100000
len of validation dataset: 5000
***
Creating model
***
load checkpoint from ../../output/sogclr_cc3m_g0.8_e30/checkpoint_1.pth
Cosine annealing scheduler will have no effect on the learning rate since t_initial = t_mul = eta_mul = 1.
Start training
Training time: 0:00:00
***
Starting evaluation
***
Computing features for evaluation...
Model Retrieval Evaluation time 0:00:49
Validation results: {'txt_r1': 7.06, 'txt_r5': 20.62, 'txt_r10': 30.62, 'txt_r_mean': 19.433333333333334, 'img_r1': 4.234475588788036, 'img_r5': 13.779039545763526, 'img_r10': 21.520252709024753, 'img_r_mean': 13.177922614525437, 'r_mean': 16.305627973929386}
starting zeroshot transfer...
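
The `txt_r1`, `txt_r5`, and `txt_r10` entries in the log above are recall-at-K values. As a rough sketch of how such a metric can be computed from a query-by-candidate similarity matrix (the helper `recall_at_k` and the toy data are hypothetical and not taken from the evaluation script):

```python
def recall_at_k(sim, ground_truth, k):
    """Percentage of queries with at least one correct candidate in the top k.

    sim          : nested list, sim[q][c] = similarity of query q and candidate c
    ground_truth : list of sets, ground_truth[q] = correct candidate indices
    """
    hits = 0
    for q, scores in enumerate(sim):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if any(c in ground_truth[q] for c in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sim)

# Toy example: 3 queries, 3 candidates, one correct candidate per query.
toy_sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]
toy_gt = [{0}, {1}, {2}]
r1 = recall_at_k(toy_sim, toy_gt, k=1)
r2 = recall_at_k(toy_sim, toy_gt, k=2)

# Values copied from the validation log above.
txt_recalls = (7.06, 20.62, 30.62)
txt_r_mean = sum(txt_recalls) / len(txt_recalls)
```

The last two lines suggest that `txt_r_mean` in the log is the plain average of the three text-retrieval recalls; the same arithmetic appears to hold for `img_r_mean`, with `r_mean` being the average of the two.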


### Benchmarks

The table below reports results on the provided MSCOCO and ImageNet datasets. The first row is from the model trained with the CLIP loss, and the second row is from the model trained with the SogCLR loss. All results use a batch size of 128 and 30 epochs of pretraining. TR@1 denotes recall at 1 for text retrieval on MSCOCO, IR@1 denotes recall at 1 for image retrieval on MSCOCO, and ACC@1 denotes top-1 accuracy on ImageNet. Average denotes the mean of the three metrics.

| Method | MSCOCO TR@1 | MSCOCO IR@1 | ImageNet ACC@1 | Average |
|:----------:|:--------:|:--------:|:--------:|:--------:|
| CLIP | 12.0 | 9.32 | 21.35 | 14.22 |
| SogCLR | 14.38 | 10.73 | 24.54 | 16.55 |
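
As a quick sanity check, the Average column can be reproduced from the other three columns. The snippet below is only a small illustration that copies the numbers from the table; the variable names are arbitrary.

```python
# (TR@1, IR@1, ACC@1) copied from the benchmark table.
results = {
    "CLIP": (12.0, 9.32, 21.35),
    "SogCLR": (14.38, 10.73, 24.54),
}

# Mean of the three metrics, rounded to two decimals as in the table.
averages = {
    method: round(sum(values) / len(values), 2)
    for method, values in results.items()
}

for method, avg in averages.items():
    print(f"{method}: {avg:.2f}")
```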