[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lorenzobasile/DeepLearningMHPC/blob/main/4_transformers.ipynb)

# Lab 4: Transformers

### Recap from previous Lab

* We saw a simple way to adversarially fool a neural network: the FGSM attack;
* We implemented a simple interpretability tool for CNN classifiers: GradCAM;
* We created our custom implementation of a simple self-attention layer with just one attention head.

### Today

We will introduce two useful libraries for large-scale Deep Learning projects, `transformers` and `datasets`, by HuggingFace. The former contains *millions* of freely accessible pre-trained networks, while the latter provides easy access to hundreds of thousands of datasets. As an example, we will play with CLIP, a vision-language model, and show its capabilities in image classification. Then, we will see that it is possible to successfully transfer knowledge between different pre-trained encoders, by aligning the representations of CLIP and of a supervised Vision Transformer.

In [None]:
import torch
from PIL import Image
import requests

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Working with pre-trained transformers

## The `transformers` library


The Hugging Face Transformers library is a high-level interface for working with pre-trained large-scale models (mainly, but not only, transformers) in PyTorch and TensorFlow. It provides free and easy access to a large collection of models for tasks like text classification, translation, summarization, image classification, and multimodal learning. The library handles data preprocessing, model loading, and inference with minimal code, while still allowing customization and fine-tuning. Models are versioned and hosted on the Hugging Face Hub, making it easy to experiment with different architectures. It’s widely used in both research and production for working with state-of-the-art transformer-based models.

## CLIP

Specifically, in the first part of this lab we are interested in CLIP, one of the best-known vision-language models (VLMs), trained with a contrastive loss to match images and text. CLIP contains two encoder-only transformers, one working on images and one on text. At the output layer, these two transformers project their representations into a shared multimodal space, where they can be compared using simple cosine similarity.

<img src="https://miro.medium.com/v2/resize:fit:1200/1*9xH55TenmdcNsRhDaJgdLg.png" width="800"/>
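
As a toy illustration of the comparison in the shared space, the sketch below uses random vectors in place of real CLIP embeddings. The dimension (512) and the scaling factor (100) are assumptions chosen for illustration; CLIP learns its own scaling factor during training. The code normalizes both sets of embeddings, so that a matrix product yields cosine similarities, and turns them into probabilities over captions with a softmax.

```python
import torch

torch.manual_seed(0)

# Stand-ins for real embeddings: 2 images and 3 captions in a shared 512-d space.
image_emb = torch.randn(2, 512)
text_emb = torch.randn(3, 512)

# After L2 normalization, the dot product equals the cosine similarity.
image_emb = torch.nn.functional.normalize(image_emb, dim=-1)
text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
similarity = image_emb @ text_emb.T  # shape (2, 3), values in [-1, 1]

# A scaling factor (assumed value) sharpens the distribution before the softmax.
scale = 100.0
probs = (scale * similarity).softmax(dim=-1)  # each row sums to 1
print(similarity.shape, probs.sum(dim=-1))
```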



We can import this model and its corresponding data preprocessor from HuggingFace.

In [None]:
from transformers import CLIPProcessor, CLIPModel

Then, a CLIP model and preprocessor can be simply downloaded from the library. The HuggingFace [hub](https://huggingface.co/models) currently hosts more than a million pre-trained models.

In [None]:
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").to(device)
clip.eval() # Today, there is no training
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

By loading the CLIP model, we automatically load both the image and text transformers, which are contained in the model.

In [None]:
print(type(clip))
print(type(clip.text_model))
print(type(clip.vision_model))

Analogously, the preprocessor contains the tokenizer and image processor required for this CLIP version.

In [None]:
print(clip_processor.tokenizer)
print(clip_processor.image_processor)

Evaluating CLIP is quite straightforward: let's load an image first.

In [None]:
image_url = "https://staranzanoslow.it/wp-content/uploads/2020/12/trieste-castello-miramare-Gianpiero-Decorti00009.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

Then, we define some candidate captions. We want to decide which of these descriptions best matches our image.

In [None]:
candidate_captions = [
    "A photo of the Colosseum",
    "A close-up view of the Tour Eiffel in Paris",
    "A painting of the Taj Mahal",
    "A shot of the Miramare Castle in Trieste",
    "An image of the San Giusto Castle in Trieste",
    "A photo of the Miramare Castle, France",
]

The first thing we need to do is to adapt our data (image and text) to the model, by running the preprocessing function. Note that we need to pad the sentences, as they have different lengths.

In [None]:
inputs = clip_processor(text=candidate_captions, images=image, return_tensors="pt", padding=True)
print(inputs.keys())
print(inputs['input_ids'])
print(inputs['pixel_values'].shape)
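
To see what padding does, here is a minimal sketch with made-up token ids (they are not produced by the CLIP tokenizer). Shorter sequences are filled with a padding id up to the length of the longest one, and an attention mask records which positions hold real tokens.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical token ids for three captions of different lengths.
token_ids = [
    torch.tensor([101, 7, 8, 102]),
    torch.tensor([101, 7, 102]),
    torch.tensor([101, 7, 8, 9, 10, 102]),
]
pad_id = 0  # assumed padding id; real tokenizers define their own

# Stack the sequences into one (batch, max_len) tensor, filling gaps with pad_id.
input_ids = pad_sequence(token_ids, batch_first=True, padding_value=pad_id)

# Build the mask from the true lengths: 1 for real tokens, 0 for padding.
lengths = torch.tensor([len(t) for t in token_ids])
attention_mask = (torch.arange(input_ids.shape[1])[None, :] < lengths[:, None]).long()
print(input_ids)
print(attention_mask)
```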

Now, we can pass our processed inputs and see that CLIP excels at classifying this image.

In [None]:
inputs = clip_processor(text=candidate_captions, images=image, return_tensors="pt", padding=True)

outputs = clip(**inputs.to(device)) #inputs is a dictionary
#outputs = clip(input_ids=inputs['input_ids'].to(device), pixel_values=inputs['pixel_values'].to(device)) #alternative way
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
print(probs)

However, this is not the only kind of classification that CLIP can do. It was not trained to discriminate between landmarks, but with a much richer supervision signal, text, which conveys far more information to the model. This means that we can, for example, use it to guess the time of day at which the picture was likely taken.

In [None]:
candidate_captions = [
    "A photo taken at noon",
    "A photo taken at midnight",
    "A photo taken at sunset",
]

inputs = clip_processor(text=candidate_captions, images=image, return_tensors="pt", padding=True)

outputs = clip(**inputs.to(device))
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
probs
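
For a more readable output, one can pair each caption with its probability. The helper below, `rank_captions`, is a hypothetical convenience function (it is not part of `transformers`); it is demonstrated on made-up probabilities so that it runs on its own.

```python
import torch

def rank_captions(captions, probs):
    """Return (caption, probability) pairs sorted from most to least likely.

    `probs` is a tensor of shape (1, n_captions), as produced for a single image.
    """
    scores = probs.detach().squeeze(0).tolist()
    return sorted(zip(captions, scores), key=lambda pair: pair[1], reverse=True)

# Made-up probabilities, only to show the output format.
example_captions = ["A photo taken at noon", "A photo taken at midnight", "A photo taken at sunset"]
example_probs = torch.tensor([[0.2, 0.1, 0.7]])
ranking = rank_captions(example_captions, example_probs)
for caption, p in ranking:
    print(f"{p:.2f}  {caption}")
```

With the real model output, the call would be `rank_captions(candidate_captions, probs)`.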

## The `datasets` library

The HuggingFace `datasets` library is a lightweight and efficient tool for accessing and working with large-scale datasets, especially for Computer Vision and NLP tasks. It provides a standardized interface to the datasets hosted on the HuggingFace Hub, including built-in support for streaming, filtering, and pre-processing. Data is stored in memory-efficient formats and can be easily loaded as PyTorch or TensorFlow tensors. The library also integrates with tokenizers and pre-trained models, making it easy to prepare data for training or evaluation. It’s a practical solution for handling datasets without writing custom loading and preprocessing code.

`datasets` is not installed in Colab by default, so we have to install it via `pip` (uncomment the first line of the cell below if needed).

Then, we can load our dataset. We will play with the [Food101](https://huggingface.co/datasets/ethz/food101) dataset. This is a challenging and large-scale dataset, containing images of food belonging to 101 classes. Images have different shapes, but they are usually a few hundred pixels wide and high. 

On the [hub](https://huggingface.co/datasets) you can find hundreds of thousands of other datasets.

In [None]:
#!pip install datasets
from datasets import load_dataset

ds = load_dataset("ethz/food101")

Let's have a look at our dataset.

In [None]:
print(ds)

We will mainly use the test (validation) set, which is very large (more than 25k samples). Due to time and computational constraints, it is better to shrink it a bit. We do so by applying a new train-test split that preserves the class proportions.

In [None]:
N_test_samples = 5000
validation_ds = ds['validation'].train_test_split(test_size=N_test_samples, stratify_by_column="label")
ds['validation'] = validation_ds['train']
ds['test'] = validation_ds['test']
print(ds)
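
Stratification means that each class keeps (approximately) the same share of samples in both parts of the split. The sketch below illustrates the idea in plain Python on a toy label list. The helper `stratified_split` is hypothetical and only meant to show the logic, not how `datasets` implements it.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction, seed=0):
    """Return (train_indices, test_indices) preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels: 60 samples of class 0, 30 of class 1, 10 of class 2.
toy_labels = [0] * 60 + [1] * 30 + [2] * 10
train_idx, test_idx = stratified_split(toy_labels, test_fraction=0.2)
test_counts = Counter(toy_labels[i] for i in test_idx)
print(test_counts)  # class counts in the test part
```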

We can simply obtain our class labels in a human-readable format.

In [None]:
classnames = ds['train'].features['label'].names
print(classnames)

We can have a quick look at a sample data point.

In [None]:
sample = ds['train'][20000]
print(classnames[sample['label']])
sample['image']

## Zero-shot classification with CLIP

Now, we want to leverage the image recognition capability of CLIP on a larger-scale experiment: we want to classify Food101, **without any additional training**.

The first thing we do is to create a custom collate function. The collate function is the bridge between the data and the model, and it gets called every time our DataLoader has to create a batch. In the previous examples, we just used the default collate function, but in this case it is convenient to customize it to facilitate data preprocessing.

Pay attention: unfortunately, there is no standard name for columns. For instance, in the case of CIFAR-10, you have `img` instead of `image`.

In [None]:
def collate_fn(samples, preprocess):
    images=preprocess(images=[sample['image'] for sample in samples], return_tensors="pt")['pixel_values']
    labels=torch.as_tensor([sample['label'] for sample in samples])
    return images, labels

In [None]:
from functools import partial
clip_testloader=torch.utils.data.DataLoader(ds['test'], collate_fn=partial(collate_fn, preprocess=clip_processor), batch_size=16, shuffle=False)
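
To make the role of the collate function concrete, the toy example below runs a `DataLoader` over a plain Python list of dictionaries. Everything here is made up for illustration: `toy_samples` stands in for the dataset and `toy_preprocess` for the image processor.

```python
from functools import partial
import torch

# Toy "dataset": each sample is a dict, mimicking the structure of Food101 rows.
toy_samples = [{"image": [float(i), float(i) + 1.0], "label": i % 3} for i in range(10)]

def toy_preprocess(images):
    # Stand-in for an image processor: here it only rescales the values.
    return torch.tensor(images) / 10.0

def toy_collate_fn(samples, preprocess):
    # `samples` is the list of items that the DataLoader picked for this batch.
    images = preprocess([sample["image"] for sample in samples])
    labels = torch.as_tensor([sample["label"] for sample in samples])
    return images, labels

toy_loader = torch.utils.data.DataLoader(
    toy_samples,
    batch_size=4,
    shuffle=False,
    collate_fn=partial(toy_collate_fn, preprocess=toy_preprocess),
)
batches = list(toy_loader)
for images, labels in batches:
    print(images.shape, labels.tolist())
```

With 10 samples and a batch size of 4, the collate function is called three times, and the last batch contains only 2 samples.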

Now, we need our candidate captions. We can simply create a list of sentences formatted as "An image of a \<CLASS NAME\>".

In [None]:
candidate_captions = ['an image of a '+ classnames[i] for i in range(len(classnames))]
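
Note that Food101 class names use underscores instead of spaces (for example `apple_pie`). As a possible variation, the sketch below builds the prompts with a hypothetical helper, `make_caption`, that replaces underscores with spaces. Whether this changes the accuracy is something to check empirically, not a claim made here.

```python
def make_caption(classname, template="an image of a {}"):
    """Turn a raw class name such as 'apple_pie' into a natural-language prompt."""
    return template.format(classname.replace("_", " "))

# A few Food101-style names, used here only as examples.
example_classnames = ["apple_pie", "baby_back_ribs", "sushi"]
example_captions = [make_caption(name) for name in example_classnames]
print(example_captions)
```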

Now, we can obtain and store the encoding for each of the candidate captions. We normalize them because we don't want a high-norm encoding to 'obscure' the others.

In [None]:
processed_captions = clip_processor(text=candidate_captions, return_tensors='pt', padding=True)['input_ids']
with torch.no_grad(): # the encodings are only used for inference, so no gradients are needed
    class_encoding = clip.get_text_features(processed_captions.to(device))
class_encoding = torch.nn.functional.normalize(class_encoding, dim=-1)
print(class_encoding.shape)
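
The toy example below shows why normalization matters, using two made-up 2-d class encodings. Without normalization, the class with the larger norm wins the dot product even though the other one points in a direction closer to the query.

```python
import torch

# Two hypothetical class encodings: the second has a much larger norm.
toy_classes = torch.tensor([[1.0, 0.0],
                            [5.0, 5.0]])
# A query that points almost exactly in the direction of class 0.
query = torch.tensor([[1.0, 0.1]])

# Raw dot products: 1.0 for class 0 and 5.5 for class 1, so class 1 wins.
raw_prediction = (query @ toy_classes.T).argmax(-1)

# After normalizing the class encodings, the scores are 1.0 and about 0.78.
normalized_classes = torch.nn.functional.normalize(toy_classes, dim=-1)
normalized_prediction = (query @ normalized_classes.T).argmax(-1)

print(raw_prediction.item(), normalized_prediction.item())  # 1 0
```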

We are basically done. We can loop through the test images, obtain their encoding from the CLIP image encoder, and compare each of them with the caption encodings. The class with the highest similarity is the label predicted by our model.

This approach to classification is usually referred to as **zero-shot classification**. We are classifying a dataset not seen during training, without providing the model with any additional knowledge.

In [None]:
from tqdm import tqdm

with torch.no_grad():
    correct=0
    for x,y in tqdm(clip_testloader):
        x=x.to(device)
        y=y.to(device)
        out = clip.get_image_features(x)
        prediction = (out@class_encoding.T).argmax(-1)
        correct += (prediction==y).sum()
    print(correct/len(clip_testloader.dataset))


## Latent translation

Now, we want to verify how transferable the knowledge is from one model to another. A recent [paper](https://arxiv.org/pdf/2311.00664) showed that latent representations produced by different pre-trained encoders can be mapped onto each other with simple transformations (i.e., linear, affine or orthogonal). The transformation can be estimated on a small subset of training data points, called *anchors*.

If everything works correctly, we expect that it is possible to:

1. Encode a few training data points (anchors) using two different image encoders (in our example, CLIP and a Vision Transformer);
2. Find a mapping between the two sets of anchor representations with a simple method (least squares);
3. Encode the test data points with the Vision Transformer;
4. Apply the transformation found in step 2;
5. Evaluate the zero-shot accuracy using the text encodings found by CLIP.
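
The steps above can be sketched on synthetic data before touching real models. In this toy setup (entirely an assumption for illustration) the two "encoders" are random linear maps of the same hidden factors, so a linear translation between them exists exactly. Real encoders are only approximately related in this way.

```python
import torch

torch.manual_seed(0)
dtype = torch.float64  # double precision keeps the toy check numerically tight

# Hidden factors shared by both views of the data.
z_anchors = torch.randn(200, 8, dtype=dtype)
z_test = torch.randn(50, 8, dtype=dtype)

# Two hypothetical encoders: different random linear maps of the same factors.
W_a = torch.randn(8, 8, dtype=dtype)  # "encoder A" (plays the role of the ViT)
W_b = torch.randn(8, 6, dtype=dtype)  # "encoder B" (plays the role of CLIP)

# Step 1: encode the anchors with both encoders.
anchors_a, anchors_b = z_anchors @ W_a, z_anchors @ W_b

# Step 2: least-squares mapping from space A to space B.
X_toy = torch.linalg.lstsq(anchors_a, anchors_b).solution  # shape (8, 6)

# Steps 3 and 4: encode test points with encoder A and translate them.
translated = (z_test @ W_a) @ X_toy

# Step 5 (simplified): compare with what encoder B would have produced.
max_error = (translated - z_test @ W_b).abs().max().item()
print(X_toy.shape, max_error)
```

In the real experiment the last step is replaced by zero-shot classification against the CLIP text encodings, since the quality of the translation is measured by the accuracy it yields.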

As a first step, we have to create our Vision Transformer model.

In [None]:
from transformers import ViTImageProcessor, ViTModel

vit_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224").to(device)

Then, we define our set of anchors as a small (2%) subset of the training set.

In [None]:
anchors = ds['train'].train_test_split(test_size=0.02, seed=42)['test']
print("Number of anchor points: ", len(anchors))

Next, we create a dataloader for the anchors. Note that this dataloader is tailored to CLIP, since we pass it the CLIP preprocessor.

In [None]:
clip_anchorloader = torch.utils.data.DataLoader(anchors, collate_fn=partial(collate_fn, preprocess=clip_processor), batch_size=16, shuffle=False)

Then, we can encode our anchors with CLIP.

In [None]:
clip_anchors=[]

for x,y in tqdm(clip_anchorloader):
    x=x.to(device)
    y=y.to(device)
    out = clip.get_image_features(x)
    clip_anchors.append(out.detach().cpu())
clip_anchors=torch.cat(clip_anchors)

The same process can be repeated for the ViT. We also need a new test dataloader that uses the ViT preprocessor.

In [None]:
vit_anchorloader = torch.utils.data.DataLoader(anchors, collate_fn=partial(collate_fn, preprocess=vit_processor), batch_size=16, shuffle=False)
vit_testloader = torch.utils.data.DataLoader(ds['test'], collate_fn=partial(collate_fn, preprocess=vit_processor), batch_size=16, shuffle=False)

In [None]:
vit_anchors=[]

for x,y in tqdm(vit_anchorloader):
    x=x.to(device)
    y=y.to(device)
    out = vit(x)['last_hidden_state'][:,0,:]
    vit_anchors.append(out.detach().cpu())
vit_anchors=torch.cat(vit_anchors)

Now, we want to find the optimal projection matrix between ViT and CLIP anchor representations, using least squares:

$$
\operatorname*{arg\,min}_{X} \|A_{ViT}X - A_{CLIP}\|_F^2
$$
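
Here $A_{ViT}$ and $A_{CLIP}$ are the matrices of anchor encodings, with one anchor per row. For reference, when $A_{ViT}$ has full column rank, this problem has the standard closed-form solution given by the normal equations (in practice, `torch.linalg.lstsq` relies on more stable matrix factorizations rather than forming the inverse explicitly):

$$
X^\star = (A_{ViT}^\top A_{ViT})^{-1} A_{ViT}^\top A_{CLIP}
$$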

In [None]:
solution = torch.linalg.lstsq(vit_anchors, clip_anchors)
X = solution.solution

Finally, we can evaluate the zero-shot accuracy on the test data, encoded by ViT and transformed using the $X$ that we have just found:

In [None]:
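correct=0

with torch.no_grad(): # as in the CLIP evaluation, no gradients are needed
    for x,y in tqdm(vit_testloader):
        x=x.to(device)
        y=y.to(device)
        out = vit(x)['last_hidden_state'][:,0,:]@X.to(device) # ViT [CLS] encoding, translated to the CLIP space
        prediction = (out@class_encoding.T).argmax(-1)
        correct += (prediction==y).sum()
print(correct/len(vit_testloader.dataset))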


# Homework

- 1. Read the paper [Relative representations enable zero-shot latent space communication](https://arxiv.org/pdf/2209.15430). Specifically, pay attention to the introduction and the formulation of relative representations (sections 1 to 3).
- 2. Perform a zero-shot model stitching experiment (section 5). Specifically,
    - 1. Download the CIFAR-10 dataset from `datasets`  
    - 2. Encode the entire training set using the image encoder of CLIP and store the representation matrix. Note: for this experiment, since there is no training, it is safer to set `shuffle=False` in **all** the dataloaders.
    - 3. Pick a subset of anchors from the training set, and use them to transform the representation into a relative representation
    - 4. Train a classifier (linear layer trained with gradient descent, or any `sklearn` model you like, for example an SVM or a logistic regression) on the relative representation
    - 5. Encode the test set of CIFAR-10 using the Vision Transformer and store the representation matrix
    - 6. Use the **same** set of anchors to transform the representation into the relative space
    - 7. Classify the new relative representation using the classifier trained before.
- 3. (Optional) Play with the number of anchors you choose: how does the performance you obtain in step 2.7 depend on the number of anchors you pick?