Many of the [winning solutions](https://www.kaggle.com/competitions/google-universal-image-embedding/discussion/359316) in the recent `Google Universal Image Embedding Challenge` used some variant of the CLIP model. The challenge is a zero-shot learning task: we need to create a 64-D image embedding such that similar images have similar embeddings.


## Open source implementations
- https://github.com/mlfoundations/open_clip
- https://openai.com/blog/clip/
- https://arxiv.org/pdf/2103.00020.pdf

> Contrastive Language-Image Pre-training, aka CLIP, matches the accuracy of the original supervised ResNet-50 on ImageNet without using a single labelled training image from it, aka `zero-shot learning`.

> In NLP, models such as GPT-3 are pre-trained on web-scale text without supervision (autoregressive or masked language modelling) and are competitive on many tasks with counterparts trained on high-quality crowd-labelled datasets.

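> CLIP addresses the following major problems
- Costly labelled datasets are not required.
- Models are no longer `narrow`: zero-shot CLIP is often competitive with fully supervised baselines across more than 30 datasets, including OCR and geo-localization tasks.
- Poor real-world performance is mitigated, because zero-shot evaluation does not let the model overfit to a benchmark's training set.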

## CLIP Algorithm: How is CLIP built?
- What dataset is used?
- Architecture and loss function
- Accuracy

### Dataset - WIT - WebImageText
- CLIP uses 400 million (image, text) pairs crawled from the internet.
- The query list starts from all words that occur at least 100 times in English Wikipedia and is augmented with bi-grams. Roughly 500,000 queries were used, and up to 20,000 (image, text) pairs were kept per query so that the dataset is approximately class balanced.
- The resulting dataset has a similar total word count to the WebText dataset used to train GPT-2.
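
The query-based selection described above can be illustrated with a small sketch. It is a simplification under stated assumptions, not the authors' pipeline: text is split on whitespace, bi-gram augmentation is omitted, and the names `build_queries` and `select_pairs` are hypothetical.

```python
from collections import Counter, defaultdict

MIN_WORD_COUNT = 100          # a word must occur at least this often to become a query
MAX_PAIRS_PER_QUERY = 20_000  # cap that keeps the result roughly class balanced


def build_queries(corpus_tokens, min_count=MIN_WORD_COUNT):
    """Return the set of words that occur at least `min_count` times."""
    counts = Counter(corpus_tokens)
    return {word for word, n in counts.items() if n >= min_count}


def select_pairs(pairs, queries, max_per_query=MAX_PAIRS_PER_QUERY):
    """Keep an (image, text) pair if its text contains a query that is not yet full."""
    kept_per_query = defaultdict(int)
    selected = []
    for image, text in pairs:
        for word in text.lower().split():
            if word in queries and kept_per_query[word] < max_per_query:
                kept_per_query[word] += 1
                selected.append((image, text))
                break  # count each pair against one query only
    return selected


# Toy data with tiny thresholds so the effect of both limits is visible.
tokens = ["dog"] * 3 + ["cat"] * 2 + ["rare"]
queries = build_queries(tokens, min_count=2)
pairs = [("a.jpg", "A dog running"), ("b.jpg", "my dog"),
         ("c.jpg", "a cat"), ("d.jpg", "rare bird")]
selected = select_pairs(pairs, queries, max_per_query=1)
```

With these toy thresholds, the second `dog` caption is dropped by the per-query cap and the `rare` caption is dropped because its word is too infrequent to be a query.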


<img src="images/clip_wit_example.png" alt="alt text" width="500" align="left"/>

### Architecture

<img src="images/clip_pseudo_code.png" alt="alt text" width="400" align="left"/>

CLIP is trained on a simple proxy task: given an image, predict which of a set of N (32,768) randomly sampled text snippets was actually paired with it in the dataset.

#### Text encoder
- Text is tokenized with lower-cased byte pair encoding (BPE) with a vocabulary size of 49,152.
- A Transformer encodes the token sequence into a fixed-size text embedding.
- For computational efficiency, the maximum sequence length was capped at 76.

#### Image encoder
- ResNet and ViT backbones were tested: 5 ResNets (scaled up EfficientNet-style) and 3 Vision Transformers.
- ViT-L/14@336px was found to perform best.


#### Training
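- The contrastive loss is a symmetric cross-entropy over the $N \times N$ similarity matrix of a batch, with $N$ positive pairs and $N^2 - N$ negative pairs, as shown in the pseudocode.
- Initial hyper-parameters were tuned on the baseline ResNet-50 trained for one epoch.
- The learning rate was decayed with a cosine schedule, and a very large minibatch size of 32,768 was used.
- Mixed precision was used to save memory and accelerate training.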
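
As a rough illustration of the contrastive objective, here is a minimal NumPy sketch. It assumes the two encoders have already produced one embedding per image and per text, and it uses a fixed `temperature=0.07` as an assumption for simplicity (in CLIP the temperature is a learned parameter). The function names are illustrative.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def cross_entropy(logits, labels):
    """Mean cross-entropy of the row-wise softmax of `logits` against integer labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N matching (image, text) pairs."""
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    logits = image_emb @ text_emb.T / temperature  # [N, N] scaled cosine similarities
    labels = np.arange(len(logits))                # the N positives sit on the diagonal
    loss_images = cross_entropy(logits, labels)    # each image picks its text
    loss_texts = cross_entropy(logits.T, labels)   # each text picks its image
    return (loss_images + loss_texts) / 2
```

Each row of `logits` has one positive entry and N - 1 negatives, which gives the N positive and N² - N negative pairs per batch.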

<img src="images/clip_stage1.png" alt="alt text" width="300" align="left"/>

#### Inference
- For each label in the dataset, fill in a prompt such as `a photo of a {label}` and compute its text embedding.
- Each image is passed through the `image encoder` to obtain an image embedding. The cosine similarity between the image embedding and each text embedding is scaled by a temperature parameter $\tau$ and normalized into a probability distribution via a softmax.
- This is a multinomial logistic regression classifier with L2-normalized inputs, L2-normalized weights, no bias, and temperature scaling.
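
The inference steps can be sketched in a few lines of NumPy. The embeddings below are toy 3-D vectors standing in for real encoder outputs, and `temperature=0.01` is an assumed value chosen for illustration.

```python
import numpy as np


def zero_shot_probs(image_emb, label_embs, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities to each label embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    label_embs = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    logits = label_embs @ image_emb / temperature  # one logit per label, no bias term
    logits = logits - logits.max()                 # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


# Toy embeddings (assumption): one row per label prompt, one vector for the image.
labels = ["cat", "dog"]
label_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
image_emb = np.array([0.9, 0.1, 0.0])

probs = zero_shot_probs(image_emb, label_embs)
print(labels[int(np.argmax(probs))])  # prints "cat"
```

Here the normalized label embeddings play the role of the classifier weights, which is why the procedure looks like logistic regression without a bias.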

<img src="images/clip_inference.png" alt="alt text" width="300" align="left"/>


## Prompt Engineering
At inference time, using just the class label might not be sufficient, because the text in the WebImageText dataset is usually a phrase describing the image rather than a single word. The authors therefore used `A photo of a {label}.` as the default prompt for classification. This prompt alone improved ImageNet accuracy by 1.3%.

| Dataset | Prompt |
|--------- | ---- |
| Oxford-IIIT Pets | “A photo of a {label}, a type of pet.” |
| Food101 | prompt specifies “a type of food” |
| FGVC Aircraft | prompt specifies “a type of aircraft” |
| Satellite image classification | “a satellite photo of a {label}.” |
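
A minimal sketch of how such per-dataset prompts might be built. The dictionary keys and the helper `build_prompts` are hypothetical names; only the two templates quoted verbatim in the table and the default prompt are included.

```python
# Hypothetical template table; the dataset keys are illustrative names only.
DEFAULT_TEMPLATE = "A photo of a {label}."
TEMPLATES = {
    "oxford_pets": "A photo of a {label}, a type of pet.",
    "satellite": "a satellite photo of a {label}.",
}


def build_prompts(labels, dataset=None):
    """Fill the dataset-specific template (or the default one) with each label."""
    template = TEMPLATES.get(dataset, DEFAULT_TEMPLATE)
    return [template.format(label=label) for label in labels]


pet_prompts = build_prompts(["beagle", "persian cat"], dataset="oxford_pets")
default_prompts = build_prompts(["beagle"])
print(pet_prompts)
```

Each resulting string is what would be passed to the text encoder in place of the bare label.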

Ensembling 80 different context prompts per label improved ImageNet accuracy by a further 3.5% over the single default prompt. Together, prompt engineering and ensembling improved ImageNet accuracy by almost 5%.
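
One way to ensemble prompts is in embedding space: average the normalized text embeddings of all prompts for a label and renormalize, so each label still ends up with a single classifier vector. The sketch below assumes this approach. `toy_encode_text` is a made-up stand-in for a real text encoder, and the three templates are illustrative rather than the full set of 80.

```python
import zlib

import numpy as np


def toy_encode_text(text, dim=16):
    """Stand-in for a real text encoder: a deterministic pseudo-random unit vector."""
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    vec = rng.normal(size=dim)
    return vec / np.linalg.norm(vec)


def ensemble_label_embedding(label, templates, encode=toy_encode_text):
    """Average the embeddings of all prompts for one label, then renormalize."""
    embs = np.stack([encode(t.format(label=label)) for t in templates])
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)


# Illustrative templates (assumption); a real ensemble would use many more.
templates = [
    "a photo of a {label}.",
    "a blurry photo of a {label}.",
    "a drawing of a {label}.",
]
classifier = np.stack([ensemble_label_embedding(l, templates) for l in ["cat", "dog"]])
print(classifier.shape)  # (2, 16)
```

Because the averaged embeddings can be cached, the ensemble costs the same as a single prompt once the classifier matrix has been built.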

## Processing Cats and dogs dataset

#TODO