# Contrastive Language-Image Pretraining with SogCLR

### Introduction

In this tutorial, you will learn how to conduct contrastive language-image pretraining by optimizing the [Global Contrastive Loss](https://arxiv.org/abs/2202.12387) (GCL) on a subset of the [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions/) dataset. You will also learn how to evaluate the model on a retrieval task using the [MSCOCO](https://cocodataset.org/#home) dataset and on a zero-shot classification task using the [ImageNet](https://www.image-net.org/challenges/LSVRC/index.php) dataset. The code is based on the [iSogCLR](https://github.com/zhqiu/contrastive-learning-iSogCLR) codebase, which includes implementations of CLIP, SogCLR and iSogCLR.

### Preparation

First, we:

1. Download the source code and data
2. Install required packages

In [None]:
!pip install gdown

Collecting gdown
  Downloading gdown-5.2.0-py3-none-any.whl.metadata (5.8 kB)
Collecting tqdm (from gdown)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Downloading gdown-5.2.0-py3-none-any.whl (18 kB)
Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)
Installing collected packages: tqdm, gdown
Successfully installed gdown-5.2.0 tqdm-4.67.1


In [None]:
!git clone -b project https://github.com/hgarg97/EfficientCLIPTraining.git iSogCLR

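import os

# `!export` only affects a throwaway subshell, so set the variables through os.environ
os.environ["PYTHONPATH"] = os.environ.get("PYTHONPATH", "") + ":./iSogCLR/bimodal_exps"
os.environ["HUGGINGFACE_HUB_CACHE"] = "./checkpoints/huggingface"
!mkdir -p checkpoints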

Cloning into 'iSogCLR'...
remote: Enumerating objects: 288, done.
remote: Counting objects: 100% (81/81), done.
remote: Compressing objects: 100% (38/38), done.
remote: Total 288 (delta 51), reused 62 (delta 42), pack-reused 207 (from 1)
Receiving objects: 100% (288/288), 153.37 KiB | 13.94 MiB/s, done.
Resolving deltas: 100% (132/132), done.


In [None]:
# Creating datasets folder

!mkdir datasets
print(1)
!mkdir -p datasets/imagenet
print(2)

1
2


In [None]:
# Downloading and unzipping the clip_train.tar.gz file

!gdown 1riKYZDPW2QQLTKX4OWDZK7CpCfM5MLg6    # clip_train.tar.gz

!tar xf clip_train.tar.gz
print(3)

Downloading...
From: https://drive.google.com/uc?id=1riKYZDPW2QQLTKX4OWDZK7CpCfM5MLg6
To: /content/clip_train.tar.gz
100% 4.06M/4.06M [00:00<00:00, 181MB/s]
3


In [None]:
# Downloading and Unzipping cc3m_subset_100k.tar.gz file

!gdown 17lYK5zF0GpSZVXlMcPOHD_nucA2qtdrz    # cc3m_subset_100k.tar.gz

!tar xf cc3m_subset_100k.tar.gz -C datasets
print(4)

Downloading...
From (original): https://drive.google.com/uc?id=17lYK5zF0GpSZVXlMcPOHD_nucA2qtdrz
From (redirected): https://drive.google.com/uc?id=17lYK5zF0GpSZVXlMcPOHD_nucA2qtdrz&confirm=t&uuid=371d0c4e-0e58-441a-8457-70a5ae746883
To: /content/cc3m_subset_100k.tar.gz
100% 3.07G/3.07G [00:59<00:00, 51.3MB/s]
4


In [None]:
# Downloading and unzipping the mscoco_val.tar.gz file

!gdown 1XK6L_jV1ImBzLi4_7tOG7gYCJBjzHWzv    # mscoco_val.tar.gz

!tar xf mscoco_val.tar.gz -C datasets
print(5)

Downloading...
From (original): https://drive.google.com/uc?id=1XK6L_jV1ImBzLi4_7tOG7gYCJBjzHWzv
From (redirected): https://drive.google.com/uc?id=1XK6L_jV1ImBzLi4_7tOG7gYCJBjzHWzv&confirm=t&uuid=4d13e242-0dbe-4b6e-9455-c5b328320f22
To: /content/mscoco_val.tar.gz
100% 819M/819M [00:16<00:00, 49.5MB/s]
5


In [None]:
# Downloading and Unzipping val.tar file

!gdown 1SUK9F3ZBxdorGpsTS0QjO0gl9Bpbg0d-    # val.tar

!tar xf val.tar -C datasets/imagenet
print(6)

Downloading...
From (original): https://drive.google.com/uc?id=1SUK9F3ZBxdorGpsTS0QjO0gl9Bpbg0d-
From (redirected): https://drive.google.com/uc?id=1SUK9F3ZBxdorGpsTS0QjO0gl9Bpbg0d-&confirm=t&uuid=71374832-b745-4d7b-9d1c-330125d1c251
To: /content/val.tar
100% 6.75G/6.75G [01:38<00:00, 68.8MB/s]
6
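
Before installing the requirements, it can help to confirm that the archives unpacked where the later commands expect them. The sketch below is a hypothetical helper, not part of the iSogCLR codebase. The paths are read off the flags used later in this notebook (`--ann_path`, `--train_file`, `--data_path`, `--train_image_root`, `--zs_datafolder`), and the way `clip.py` joins `--data_path` with `--train_image_root` is an assumption.

```python
from pathlib import Path

# Paths inferred from the command-line flags used in the training and
# evaluation cells; the join of --data_path and --train_image_root is assumed.
EXPECTED = [
    "clip_train/cc3m_train_subset.json",
    "datasets/cc3m_subset_100k",
    "datasets/imagenet/val",
]


def missing_paths(root=".", expected=EXPECTED):
    """Return the expected paths that do not exist under `root`."""
    base = Path(root)
    return [p for p in expected if not (base / p).exists()]


missing = missing_paths()
print("all expected paths found" if not missing else f"missing: {missing}")
```

An empty list means every expected file or folder is present; anything listed points to a download or `tar` step that should be rerun.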


In [None]:
# Installing libraries

!pip install -r ./iSogCLR/requirements_colab.txt    # pip may print warnings or errors here; they should be fine to ignore

Collecting braceexpand==0.1.7 (from -r ./iSogCLR/requirements_colab.txt (line 1))
  Downloading braceexpand-0.1.7-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting colorama==0.4.6 (from -r ./iSogCLR/requirements_colab.txt (line 2))
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Collecting ftfy==6.1.1 (from -r ./iSogCLR/requirements_colab.txt (line 3))
  Downloading ftfy-6.1.1-py3-none-any.whl.metadata (6.1 kB)
Collecting huggingface-hub==0.16.4 (from -r ./iSogCLR/requirements_colab.txt (line 4))
  Downloading huggingface_hub-0.16.4-py3-none-any.whl.metadata (12 kB)
Collecting safetensors==0.3.3 (from -r ./iSogCLR/requirements_colab.txt (line 5))
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.7 kB)
Collecting timm==0.9.7 (from -r ./iSogCLR/requirements_colab.txt (line 6))
  Downloading timm-0.9.7-py3-none-any.whl.metadata (58 kB)

# Training

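The commands in this section run the training script to train a ResNet50 (pretrained on ImageNet) and a DistilBERT (pretrained on BookCorpus and English Wikipedia) on the 100k cc3m subset for 30 epochs. Each subsection pairs a loss (`--ita_type`) with an optimizer (`--opt`); the first run uses `isogclr_new_v2` with an initial temperature of 0.01 (`--tau_init 0.01`).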
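
To see what the temperature does, here is a minimal pure-Python sketch of a two-way, mini-batch softmax contrastive loss (the CLIP-style form). It is an illustration only: it is not the SogCLR or iSogCLR estimator implemented in the codebase, and the function name and the toy similarity matrix are made up for this example.

```python
import math


def contrastive_loss(sim, tau):
    """Two-way softmax contrastive loss for a square similarity matrix.

    sim[i][j] is the similarity between image i and text j; the diagonal
    holds the matched pairs. tau is the temperature.
    """
    n = len(sim)

    def logsumexp(values):
        m = max(values)  # subtract the max for numerical stability
        return m + math.log(sum(math.exp(v - m) for v in values))

    total = 0.0
    for i in range(n):
        row = [sim[i][j] / tau for j in range(n)]  # image i against all texts
        col = [sim[j][i] / tau for j in range(n)]  # text i against all images
        total += (logsumexp(row) - row[i]) + (logsumexp(col) - col[i])
    return total / (2 * n)


toy_sim = [[0.9, 0.1], [0.2, 0.8]]  # matched pairs already score highest
for tau in (0.01, 0.07, 0.5):
    print(tau, contrastive_loss(toy_sim, tau))
```

With well-separated similarities, a small temperature sharpens the softmax and drives the loss toward zero, while a large temperature keeps the loss close to `log(n)`.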

## isogclr_new_v2 + adamp

### Training

In [None]:
!CUDA_VISIBLE_DEVICES=0 python ./iSogCLR/bimodal_exps/clip.py \
    --data_path ./datasets \
    --ann_path ./clip_train \
    --train_file cc3m_train_subset.json \
    --train_image_root cc3m_subset_100k \
    --output_dir output/isogclr_new_v2_and_adamp \
    --init_model \
    --use_amp \
    --ita_type isogclr_new_v2 \
    --tau_init 0.01 \
    --sogclr_gamma 0.8 \
    --eta_init 0.03 --sched cosine \
    --no-distributed \
    --epochs 30 \
    --opt adamp

Creating retrieval dataset
len of train_dataset: 100000
len of coco val: 5000
Downloading tokenizer_config.json: 100% 48.0/48.0 [00:00<00:00, 321kB/s]
Downloading config.json: 100% 483/483 [00:00<00:00, 3.25MB/s]
Downloading vocab.txt: 100% 232k/232k [00:00<00:00, 1.10MB/s]
Downloading tokenizer.json: 100% 466k/466k [00:00<00:00, 2.18MB/s]
Creating model
Downloading model.safetensors: 100% 102M/102M [00:00<00:00, 238MB/s] 
Downloading model.safetensors: 100% 268M/268M [00:01<00:00, 241MB/s]
Start training
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha = 1) (Triggered internally at ../torch/csrc/utils/python_arg_parser.cpp:1642.)
  exp_avg.mul_(beta1).add_(1 - beta1, grad)
Train Epoch: [0]  [  0/781]  eta: 2:18:06  lr: 0.000010  lr_temp_net: 0.00000100  loss_ita: 0.2269  avg_image_tau: 0.0050  avg_text_tau: 0.0050  cur_eta: 0.0300  grad_tau_image: 2.9889  grad_tau_text: 2.1905  b_I: 0.0000  b_T: 0.0000  v: 0.

### Evaluation

The following command runs the evaluation script to evaluate the retrieval performance of the trained model on the MSCOCO validation dataset and the zero-shot classification performance on the ImageNet validation dataset. The evaluation command is obtained by appending `--evaluate --checkpoint /path/to/your/checkpoint --zs_dataset imagenet --zs_datafolder /path/to/imagenet/val` to the training command.
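
The retrieval metrics reported below are recall-at-K values. As a rough sketch of the usual definition (the exact computation inside `clip.py` is not shown here, so treat the function and the toy inputs as assumptions), a query counts as a hit when at least one correct candidate appears among its K highest-scoring candidates:

```python
def recall_at_k(sim, gt, k):
    """Percentage of queries with a correct match among their top-k candidates.

    sim[q][c]: similarity between query q and candidate c.
    gt[q]: set of candidate indices that are correct for query q.
    """
    hits = 0
    for q, scores in enumerate(sim):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if gt[q] & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sim)


# Toy example: 3 queries, 3 candidates, one correct candidate per query.
toy_sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
toy_gt = [{0}, {1}, {2}]
print(recall_at_k(toy_sim, toy_gt, 1), recall_at_k(toy_sim, toy_gt, 2))
```

Using a set for `gt[q]` covers the case where one image has several valid captions, which is common in caption datasets.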

In [None]:
!CUDA_VISIBLE_DEVICES=0 python ./iSogCLR/bimodal_exps/clip.py \
    --data_path ./datasets \
    --ann_path ./clip_train \
    --train_file cc3m_train_subset.json \
    --train_image_root cc3m_subset_100k \
    --output_dir output/isogclr_new_v2_and_adamp \
    --init_model \
    --use_amp \
    --ita_type isogclr_new_v2 \
    --tau_init 0.01 \
    --sogclr_gamma 0.8 \
    --eta_init 0.03 --sched cosine \
    --no-distributed \
    --epochs 30 \
    --evaluate --checkpoint './output/isogclr_new_v2_and_adamp/checkpoint_30.pth' \
    --zs_dataset imagenet --zs_datafolder ./datasets/imagenet/val

Creating retrieval dataset
len of train_dataset: 100000
len of coco val: 5000
Creating model
load checkpoint from ./output/isogclr_new_v2_and_adamp/checkpoint_30.pth
Start training
Computing features for evaluation...
Evaluation time 0:00:49
coco val: {'txt_r1': 9.5, 'txt_r5': 24.88, 'txt_r10': 35.88, 'txt_r_mean': 23.419999999999998, 'img_r1': 7.157423327602063, 'img_r5': 20.012795393658283, 'img_r10': 29.405414050941662, 'img_r_mean': 18.85854425740067, 'r_mean': 21.139272128700334}
zeroshot: {'zeroshot_top1': 18.862, 'zeroshot_top3': 31.802, 'zeroshot_top5': 37.752, 'zeroshot_top10': 46.152}
Training time 0:06:17
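
The `coco val:` and `zeroshot:` lines are printed as Python dict literals, so they can be parsed directly instead of being copied by hand. The helper below is a hypothetical sketch, assuming the log format stays exactly as shown above; the sample log is abbreviated from that output.

```python
import ast


def parse_metrics(log_text):
    """Collect the dict literals printed after 'coco val:' and 'zeroshot:'."""
    metrics = {}
    for line in log_text.splitlines():
        for prefix in ("coco val:", "zeroshot:"):
            if line.startswith(prefix):
                # literal_eval only accepts Python literals, so it is safe on log text.
                metrics.update(ast.literal_eval(line[len(prefix):].strip()))
    return metrics


# Abbreviated copy of the evaluation output above.
sample_log = """Computing features for evaluation...
coco val: {'txt_r1': 9.5, 'img_r1': 7.157423327602063}
zeroshot: {'zeroshot_top1': 18.862}
"""
metrics = parse_metrics(sample_log)
print(metrics["txt_r1"], metrics["img_r1"], metrics["zeroshot_top1"])
```

Collecting the numbers this way keeps a summary table in step with the logs it is built from.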


## cyclip + radam

### Training

In [None]:
!CUDA_VISIBLE_DEVICES=0 python ./iSogCLR/bimodal_exps/clip.py \
    --data_path ./datasets \
    --ann_path ./clip_train \
    --train_file cc3m_train_subset.json \
    --train_image_root cc3m_subset_100k \
    --output_dir output/cyclip_and_radam \
    --init_model \
    --use_amp \
    --ita_type cyclip \
    --tau_init 0.01 \
    --sogclr_gamma 0.8 \
    --eta_init 0.03 --sched cosine \
    --no-distributed \
    --epochs 30 \
    --opt radam

Creating retrieval dataset
len of train_dataset: 100000
len of coco val: 5000
Creating model
Start training
Train Epoch: [0]  [  0/781]  eta: 1:56:01  lr: 0.000010  lr_temp_net: 0.00000100  loss_ita: 20.9827  avg_image_tau: 0.0100  avg_text_tau: 0.0100  cur_eta: 0.0000  grad_tau_image: 0.0000  grad_tau_text: 0.0000  b_I: 0.0000  b_T: 0.0000  v: 0.0000  lamda: 0.0000  weights_image_pos: 0.0000  weights_text_pos: 0.0000  time: 8.9137  data: 1.2218  max mem: 11691
	addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
	addcmul_(Tensor tensor1, Tensor tensor2, *, Number value = 1) (Triggered internally at ../torch/csrc/utils/python_arg_parser.cpp:1642.)
  exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
Train Epoch: [0]  [ 50/781]  eta: 0:10:24  lr: 0.000010  lr_temp_net: 0.00000100  loss_ita: 16.0240  avg_image_tau: 0.0100  avg_text_tau: 0.0100  cur_eta: 0.0000  grad_tau_image: 0.0000  grad_tau_text: 0.0000  b_I: 0.0000  b_T: 

### Evaluation

In [None]:
!CUDA_VISIBLE_DEVICES=0 python ./iSogCLR/bimodal_exps/clip.py \
    --data_path ./datasets \
    --ann_path ./clip_train \
    --train_file cc3m_train_subset.json \
    --train_image_root cc3m_subset_100k \
    --output_dir output/cyclip_and_radam \
    --init_model \
    --use_amp \
    --ita_type cyclip \
    --tau_init 0.01 \
    --sogclr_gamma 0.8 \
    --eta_init 0.03 --sched cosine \
    --no-distributed \
    --epochs 30 \
    --evaluate --checkpoint './output/cyclip_and_radam/checkpoint_30.pth' \
    --zs_dataset imagenet --zs_datafolder ./datasets/imagenet/val

Creating retrieval dataset
len of train_dataset: 100000
len of coco val: 5000
Creating model
load checkpoint from ./output/cyclip_and_radam/checkpoint_30.pth
Start training
Computing features for evaluation...
Evaluation time 0:00:51
coco val: {'txt_r1': 13.7, 'txt_r5': 34.14, 'txt_r10': 45.44, 'txt_r_mean': 31.093333333333334, 'img_r1': 10.4402415130553, 'img_r5': 27.654044543964172, 'img_r10': 38.75004998200648, 'img_r_mean': 25.61477867967532, 'r_mean': 28.35405600650433}
zeroshot: {'zeroshot_top1': 26.32, 'zeroshot_top3': 40.052, 'zeroshot_top5': 46.17, 'zeroshot_top10': 53.932}
Training time 0:06:30


## cyclip + nadam

### Training

In [None]:
!CUDA_VISIBLE_DEVICES=0 python ./iSogCLR/bimodal_exps/clip.py \
    --data_path ./datasets \
    --ann_path ./clip_train \
    --train_file cc3m_train_subset.json \
    --train_image_root cc3m_subset_100k \
    --output_dir output/cyclip_and_nadam \
    --init_model \
    --use_amp \
    --ita_type cyclip \
    --tau_init 0.01 \
    --sogclr_gamma 0.8 \
    --eta_init 0.03 --sched cosine \
    --no-distributed \
    --epochs 30 \
    --opt nadam

Creating retrieval dataset
len of train_dataset: 100000
len of coco val: 5000
Creating model
Start training
Train Epoch: [0]  [  0/781]  eta: 1:55:18  lr: 0.000010  lr_temp_net: 0.00000100  loss_ita: 20.9821  avg_image_tau: 0.0100  avg_text_tau: 0.0100  cur_eta: 0.0000  grad_tau_image: 0.0000  grad_tau_text: 0.0000  b_I: 0.0000  b_T: 0.0000  v: 0.0000  lamda: 0.0000  weights_image_pos: 0.0000  weights_text_pos: 0.0000  time: 8.8584  data: 1.1208  max mem: 11693
	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha = 1) (Triggered internally at ../torch/csrc/utils/python_arg_parser.cpp:1642.)
  exp_avg.mul_(beta1).add_(1. - beta1, grad)
Train Epoch: [0]  [ 50/781]  eta: 0:10:44  lr: 0.000010  lr_temp_net: 0.00000100  loss_ita: 9.3725  avg_image_tau: 0.0100  avg_text_tau: 0.0100  cur_eta: 0.0000  grad_tau_image: 0.0000  grad_tau_text: 0.0000  b_I: 0.0000  b_T: 0.0000  v: 0.0000  lamda: 0.0000  weights_image_pos: 0.00

### Evaluation

In [None]:
!CUDA_VISIBLE_DEVICES=0 python ./iSogCLR/bimodal_exps/clip.py \
    --data_path ./datasets \
    --ann_path ./clip_train \
    --train_file cc3m_train_subset.json \
    --train_image_root cc3m_subset_100k \
    --output_dir output/cyclip_and_nadam \
    --init_model \
    --use_amp \
    --ita_type cyclip \
    --tau_init 0.01 \
    --sogclr_gamma 0.8 \
    --eta_init 0.03 --sched cosine \
    --no-distributed \
    --epochs 30 \
    --evaluate --checkpoint './output/cyclip_and_nadam/checkpoint_30.pth' \
    --zs_dataset imagenet --zs_datafolder ./datasets/imagenet/val

Creating retrieval dataset
len of train_dataset: 100000
len of coco val: 5000
Creating model
load checkpoint from ./output/cyclip_and_nadam/checkpoint_30.pth
Start training
Computing features for evaluation...
Evaluation time 0:00:58
coco val: {'txt_r1': 3.74, 'txt_r5': 12.22, 'txt_r10': 20.06, 'txt_r_mean': 12.006666666666666, 'img_r1': 3.374785077372146, 'img_r5': 10.820104762285577, 'img_r10': 16.749970010796112, 'img_r_mean': 10.314953283484611, 'r_mean': 11.160809975075638}
zeroshot: {'zeroshot_top1': 4.076, 'zeroshot_top3': 8.886, 'zeroshot_top5': 12.388, 'zeroshot_top10': 18.314}
Training time 0:06:35


## ntxent + radam

### Training

In [3]:
!CUDA_VISIBLE_DEVICES=0 python ./iSogCLR/bimodal_exps/clip.py \
    --data_path ./datasets \
    --ann_path ./clip_train \
    --train_file cc3m_train_subset.json \
    --train_image_root cc3m_subset_100k \
    --output_dir output/ntxent_and_radam \
    --init_model \
    --use_amp \
    --ita_type ntxent \
    --temp 0.5 \
    --no-distributed \
    --epochs 30 \
    --opt radam

Creating retrieval dataset
len of train_dataset: 100000
len of coco val: 5000
Creating model
Start training
	addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
	addcmul_(Tensor tensor1, Tensor tensor2, *, Number value = 1) (Triggered internally at /opt/conda/conda-bld/pytorch_1729647329220/work/torch/csrc/utils/python_arg_parser.cpp:1642.)
  exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
Train Epoch: [0]  [  0/781]  eta: 20:29:16  lr: 0.000010  lr_temp_net: 0.00000100  loss_ita: 4.8574  avg_image_tau: 0.0000  avg_text_tau: 0.0000  cur_eta: 0.0000  grad_tau_image: 0.0000  grad_tau_text: 0.0000  b_I: 0.0000  b_T: 0.0000  v: 0.0000  lamda: 0.0000  weights_image_pos: 0.0000  weights_text_pos: 0.0000  time: 94.4386  data: 14.4249  max mem: 13338
Train Epoch: [0]  [ 50/781]  eta: 1:34:10  lr: 0.000010  lr_temp_net: 0.00000100  loss_ita: 4.8469  avg_image_tau: 0.0000  avg_text_tau: 0.0000  cur_eta: 0.0000  grad_tau_image: 0.0

### Evaluation

In [4]:
!CUDA_VISIBLE_DEVICES=0 python ./iSogCLR/bimodal_exps/clip.py \
    --data_path ./datasets \
    --ann_path ./clip_train \
    --train_file cc3m_train_subset.json \
    --train_image_root cc3m_subset_100k \
    --output_dir output/ntxent_and_radam \
    --init_model \
    --use_amp \
    --ita_type ntxent \
    --temp 0.07 \
    --no-distributed \
    --epochs 30 \
    --evaluate --checkpoint './output/ntxent_and_radam/checkpoint_30.pth' \
    --zs_dataset imagenet --zs_datafolder ./datasets/imagenet/val

Creating retrieval dataset
len of train_dataset: 100000
len of coco val: 5000
Creating model
load checkpoint from ./output/ntxent_and_radam/checkpoint_30.pth
Start training
Computing features for evaluation...
Evaluation time 0:01:21
coco val: {'txt_r1': 3.56, 'txt_r5': 11.76, 'txt_r10': 18.52, 'txt_r_mean': 11.280000000000001, 'img_r1': 2.8949578151865327, 'img_r5': 9.300651765364469, 'img_r10': 15.054580351073614, 'img_r_mean': 9.083396643874872, 'r_mean': 10.181698321937436}
zeroshot: {'zeroshot_top1': 7.116, 'zeroshot_top3': 14.826, 'zeroshot_top5': 19.886, 'zeroshot_top10': 28.806}
Training time 0:45:31


## infonce + radam

### Training

In [2]:
!CUDA_VISIBLE_DEVICES=0 python ./iSogCLR/bimodal_exps/clip.py \
    --data_path ./datasets \
    --ann_path ./clip_train \
    --train_file cc3m_train_subset.json \
    --train_image_root cc3m_subset_100k \
    --output_dir output/infonce_and_radam \
    --init_model \
    --use_amp \
    --ita_type infonce \
    --temp 0.07 \
    --no-distributed \
    --epochs 30 \
    --opt radam

Creating retrieval dataset
len of train_dataset: 100000
len of coco val: 5000
Creating model
Start training
	addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
	addcmul_(Tensor tensor1, Tensor tensor2, *, Number value = 1) (Triggered internally at /opt/conda/conda-bld/pytorch_1729647329220/work/torch/csrc/utils/python_arg_parser.cpp:1642.)
  exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
Train Epoch: [0]  [  0/781]  eta: 19:26:15  lr: 0.000010  lr_temp_net: 0.00000100  loss_ita: 4.9577  avg_image_tau: 0.0000  avg_text_tau: 0.0000  cur_eta: 0.0000  grad_tau_image: 0.0000  grad_tau_text: 0.0000  b_I: 0.0000  b_T: 0.0000  v: 0.0000  lamda: 0.0000  weights_image_pos: 0.0000  weights_text_pos: 0.0000  time: 89.5976  data: 1.9247  max mem: 13338
Train Epoch: [0]  [ 50/781]  eta: 0:26:47  lr: 0.000010  lr_temp_net: 0.00000100  loss_ita: 4.8832  avg_image_tau: 0.0000  avg_text_tau: 0.0000  cur_eta: 0.0000  grad_tau_image: 0.00

### Evaluation

In [3]:
!CUDA_VISIBLE_DEVICES=0 python ./iSogCLR/bimodal_exps/clip.py \
    --data_path ./datasets \
    --ann_path ./clip_train \
    --train_file cc3m_train_subset.json \
    --train_image_root cc3m_subset_100k \
    --output_dir output/infonce_and_radam \
    --init_model \
    --use_amp \
    --ita_type infonce \
    --temp 0.07 \
    --no-distributed \
    --epochs 30 \
    --evaluate --checkpoint './output/infonce_and_radam/checkpoint_30.pth' \
    --zs_dataset imagenet --zs_datafolder ./datasets/imagenet/val

Creating retrieval dataset
len of train_dataset: 100000
len of coco val: 5000
Creating model
load checkpoint from ./output/infonce_and_radam/checkpoint_30.pth
Start training
Computing features for evaluation...
Evaluation time 0:01:07
coco val: {'txt_r1': 11.82, 'txt_r5': 30.36, 'txt_r10': 41.9, 'txt_r_mean': 28.026666666666667, 'img_r1': 9.20868487344556, 'img_r5': 24.36322923747451, 'img_r10': 34.94342036866728, 'img_r_mean': 22.838444826529116, 'r_mean': 25.43255574659789}
zeroshot: {'zeroshot_top1': 23.632, 'zeroshot_top3': 37.518, 'zeroshot_top5': 43.6, 'zeroshot_top10': 51.644}
Training time 0:33:00


### Benchmarks

The following table reports results on the provided MSCOCO and ImageNet datasets. The first row is from the model trained with the CLIP loss, and the second row is from the model trained with the SogCLR loss. All results use a batch size of 128 and 30 epochs of pretraining. TR@1 denotes recall at 1 for text retrieval on MSCOCO, IR@1 denotes recall at 1 for image retrieval on MSCOCO, and ACC@1 denotes top-1 accuracy on ImageNet. Average is the mean of the three metrics.

| Method | MSCOCO TR@1 | MSCOCO IR@1 | ImageNet ACC@1 | Average |
|:----------:|:--------:|:--------:|:--------:|:--------:|
| CLIP | 12.0 | 9.32 | 21.35 | 14.22 |
| SogCLR |  14.38  |  10.73  | 24.54 | 16.55 |
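
As a quick check of the Average column, the mean of the three metrics can be recomputed from the table values. This is only a worked-arithmetic sketch; `average_score` is a name introduced here for illustration.

```python
def average_score(tr1, ir1, acc1):
    """Mean of text-retrieval R@1, image-retrieval R@1 and ImageNet top-1."""
    return (tr1 + ir1 + acc1) / 3


# (TR@1, IR@1, ACC@1) copied from the benchmark table above.
benchmarks = {
    "CLIP": (12.0, 9.32, 21.35),
    "SogCLR": (14.38, 10.73, 24.54),
}
for name, scores in benchmarks.items():
    print(name, round(average_score(*scores), 2))
```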

## Results

In [6]:
import pandas as pd
# One row per (loss, optimizer) run; metrics copied from the evaluation outputs
Results = pd.DataFrame()
Results['Method'] = ["CLIP + AdamW", "CLIP + Novograd", "isogclr_new + AdamP", "isogclr_new_v2 + AdamP", "cyclip + radam", "cyclip + nadam", "InfoNCE + RAdam", "NT-Xent + RAdam"]
Results['MSCOCO TR@1'] = [12.22, 9.84, 13.96, 9.4, 13.7, 3.74, 11.82, 3.56]
Results['MSCOCO IR@1'] = [9.07, 6.89, 10.58, 7.15, 10.44, 3.374, 9.20, 2.89]
Results['ImageNet ACC@1'] = [21.36, 14.17, 26.08, 18.82, 26.32, 4.076, 23.63, 7.11]
# Average of the three metrics for each row
Results['Average'] = Results[['MSCOCO TR@1', 'MSCOCO IR@1', 'ImageNet ACC@1']].mean(axis=1)
Results

| | Method | MSCOCO TR@1 | MSCOCO IR@1 | ImageNet ACC@1 | Average |
|:--:|:----------:|:--------:|:--------:|:--------:|:--------:|
| 0 | CLIP + AdamW | 12.22 | 9.07 | 21.36 | 14.216667 |
| 1 | CLIP + Novograd | 9.84 | 6.89 | 14.17 | 10.3 |
| 2 | isogclr_new + AdamP | 13.96 | 10.58 | 26.08 | 16.873333 |
| 3 | isogclr_new_v2 + AdamP | 9.4 | 7.15 | 18.82 | 11.79 |
| 4 | cyclip + radam | 13.7 | 10.44 | 26.32 | 16.82 |
| 5 | cyclip + nadam | 3.74 | 3.374 | 4.076 | 3.73 |
| 6 | InfoNCE + RAdam | 11.82 | 9.2 | 23.63 | 14.883333 |
| 7 | NT-Xent + RAdam | 3.56 | 2.89 | 7.11 | 4.52 |
