[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/image/image-retrieval-ebook/vision-transformers/vit.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/image/image-retrieval-ebook/vision-transformers/vit.ipynb)

# Vision Transformers (ViT) Walkthrough

In [1]:
!pip install datasets transformers torch



In [None]:
!rm -r 

In [None]:
!pip install roboflow
from roboflow import Roboflow
rf = Roboflow(api_key="YOUR_ROBOFLOW_API_KEY")  # use your own key; never commit a real one
project = rf.workspace("mike-caulfild").project("chtozalevetottigr")
dataset = project.version(4).download("folder")

In [2]:
import os
import random
import pandas as pd
import numpy as np
from PIL import Image, ImageOps
from tqdm.auto import tqdm
import albumentations as A

from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import torchvision
from torchvision import datasets, models, transforms




In [None]:
from datasets import load_dataset

# each folder is loaded on its own, so its only split is always named 'train'
dataset_train = load_dataset('/kaggle/working/ChtoZaLevEtotTigr-4/train',
    split='train',  # training images
    ignore_verifications=False  # set to True if seeing splits Error
)
dataset_val = load_dataset('/kaggle/working/ChtoZaLevEtotTigr-4/valid',
    split='train',  # validation images (still the folder's 'train' split)
    ignore_verifications=False  # set to True if seeing splits Error
)
# dataset_test = load_dataset('/kaggle/working/ChtoZaLevEtotTigr-4/test',
#     split='train', # training dataset
#     ignore_verifications=False  # set to True if seeing splits Error)
# )

In [None]:
dataset_train[643]

In [None]:
# check how many labels/number of classes
num_classes = len(set(dataset_train['label']))
labels = dataset_train.features['label']
num_classes, labels

Each example holds a PIL image and an integer class label. Let's have a look at the first picture in the dataset; the cell after it prints the image's size, format, and mode.

In [None]:
dataset_train[0]['image']

In [None]:
print('width, height',dataset_train[0]['image'].size, 'FORMAT', dataset_train[0]['image'].format, 'MODE', dataset_train[0]['image'].mode)


In [None]:
dataset_train[0]['label'], labels.names[dataset_train[0]['label']]

### Loading ViT Feature Extractor

We use the `google/vit-base-patch16-224-in21k` model from the Hugging Face Hub.

The name describes the model: a base-sized architecture with a patch resolution of 16x16 and an input resolution of 224x224, pre-trained on ImageNet-21k.
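As a quick sanity check on what the name implies, the arithmetic below works out how many patches a 224x224 image yields with 16x16 patches. The extra `[CLS]` token and the variable names are assumptions for illustration, based on the standard ViT design.

```python
image_size = 224   # input resolution implied by the model name
patch_size = 16    # patch resolution implied by the model name
channels = 3       # RGB

patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2
seq_len = num_patches + 1   # assumption: one [CLS] token is prepended
values_per_patch = patch_size * patch_size * channels

print(patches_per_side, num_patches, seq_len, values_per_patch)
```

Each flattened patch holds 768 values, which happens to equal the hidden size of the base model.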

In [3]:
from transformers import ViTImageProcessor

# import model
model_id = 'google/vit-base-patch16-224-in21k'
feature_extractor = ViTImageProcessor.from_pretrained(
    model_id
)

You can see the feature extractor configuration by printing it.

Passing the first image shown above through the feature extractor returns a tensor of pixel values that the model can consume.
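To make the shape and value range of that tensor concrete, here is a rough NumPy stand-in for the preprocessing steps (resize, rescale, normalize). It is a sketch, not the library's implementation: the target size of 224 and the mean and standard deviation of 0.5 are assumptions about this checkpoint's configuration (print `feature_extractor` to check them), and `sketch_preprocess` is a hypothetical helper that uses nearest-neighbour sampling where the real processor interpolates.

```python
import numpy as np

def sketch_preprocess(image_hwc, size=224, mean=0.5, std=0.5):
    """Rough stand-in for the image processor: resize, rescale, normalize."""
    h, w, _ = image_hwc.shape
    # nearest-neighbour resize via index sampling (the real processor interpolates)
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = image_hwc[rows][:, cols]
    scaled = resized.astype(np.float32) / 255.0   # [0, 255] -> [0, 1]
    normalized = (scaled - mean) / std            # [0, 1] -> [-1, 1] (assumed mean/std)
    return normalized.transpose(2, 0, 1)          # HWC -> CHW

rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8)  # made-up image
pixel_values = sketch_preprocess(fake_image)
print(pixel_values.shape, pixel_values.min(), pixel_values.max())
```

Whatever the input size, the result is a channels-first array of shape `(3, 224, 224)` with values between -1 and 1.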

In [4]:
# load in relevant libraries, and alias where appropriate
import torch

# device will determine whether to run the training on GPU or CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [None]:
def preprocess(batch):
    # take a list of PIL images and turn them to pixel values
    inputs = feature_extractor(
        batch['image'],
        return_tensors='pt'
    )
    # include the labels
    inputs['label'] = batch['label']
    return inputs

We can apply this to both the training and validation datasets.

In [None]:
# transform the training dataset
prepared_train = dataset_train.with_transform(preprocess)
prepared_val = dataset_val.with_transform(preprocess)
# prepared_test = dataset_test.with_transform(preprocess)

In [None]:
prepared_train

Now, whenever you get an example from the dataset, the transform will be applied in real time (on both samples and slices).

### Model Fine-Tuning

In this section, we are going to build the Trainer, which is a feature-complete training and eval loop for PyTorch, optimized for HuggingFace 🤗 Transformers.

We need to define all of the arguments that it will include:
* training and validation datasets
* feature extractor
* model
* collate function
* evaluation metric
* ... other training arguments.

The collate function is useful when dealing with lots of data. Batches are lists of dictionaries, so collate will help us create batch tensors.

In [None]:
def collate_fn(batch):
    return {
        'pixel_values': torch.stack([x['pixel_values'] for x in batch]),
        'labels': torch.tensor([x['label'] for x in batch])
    }
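A small usage sketch shows what the collate function produces. The two fake examples below are made up and only mimic the shape of preprocessed samples; the function itself is repeated so the snippet runs on its own.

```python
import torch

def collate_fn(batch):
    return {
        'pixel_values': torch.stack([x['pixel_values'] for x in batch]),
        'labels': torch.tensor([x['label'] for x in batch])
    }

# two made-up examples shaped like preprocessed samples
fake_batch = [
    {'pixel_values': torch.zeros(3, 224, 224), 'label': 0},
    {'pixel_values': torch.ones(3, 224, 224), 'label': 3},
]
out = collate_fn(fake_batch)
print(out['pixel_values'].shape, out['labels'])
```

The list of per-example dictionaries becomes a single dictionary of batched tensors, with the batch dimension first.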

Let's now define the evaluation metric used to compare predictions with the actual labels. A natural starting point is *accuracy*.

Accuracy is defined as the proportion of correct predictions (True Positive ($TP$) and True Negative ($TN$)) among the total number of cases processed ($TP$, $TN$, False Positive ($FP$), and False Negative ($FN$)).

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

The ```compute_metrics``` function below reports the *weighted F1 score* rather than accuracy. F1 is the harmonic mean of precision and recall, $F_1 = \frac{2TP}{2TP + FP + FN}$, computed per class; ```average='weighted'``` averages the per-class scores, weighting each class by its number of true examples.

In [None]:
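import numpy as np
from datasets import load_metric  # newer releases moved this to `evaluate.load`

# weighted F1 metric
metric = load_metric("f1")

def compute_metrics(p):
    # p.predictions holds the logits; argmax picks the predicted class per example
    return metric.compute(
        predictions=np.argmax(p.predictions, axis=1),
        references=p.label_ids,
        average='weighted'
    )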
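To see what `average='weighted'` computes, here is a minimal pure-Python sketch of a support-weighted F1 score next to plain accuracy. The `weighted_f1` helper and the label lists are made up for illustration; this is not the library's implementation.

```python
from collections import Counter

def weighted_f1(references, predictions):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(references)
    pairs = list(zip(references, predictions))
    total = 0.0
    for cls, n in support.items():
        tp = sum(r == cls and p == cls for r, p in pairs)
        fp = sum(r != cls and p == cls for r, p in pairs)
        fn = sum(r == cls and p != cls for r, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1          # weight each class by its number of true examples
    return total / len(references)

references  = [0, 0, 1, 1, 2, 2]   # made-up labels
predictions = [0, 1, 1, 1, 2, 0]   # made-up predictions
accuracy = sum(r == p for r, p in zip(references, predictions)) / len(references)
print(round(accuracy, 4), round(weighted_f1(references, predictions), 4))
```

Here the per-class F1 scores are 0.5, 0.8, and about 0.667, so the weighted F1 (about 0.656) differs slightly from the accuracy (about 0.667).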

The last step is to define ```TrainingArguments```.

Most of these are self-explanatory, but one that matters here is ```remove_unused_columns=False```. When this option is True (the default), the ```Trainer``` drops any features not used by the model's call function, which usually makes it easier to unpack inputs into the model. In our case we need to keep the unused features ('image' in particular) in order to create 'pixel_values', so we set it to False.

We have chosen a batch size of 32, up to 30 training epochs with evaluation, logging, and checkpointing once per epoch, and a learning rate of $2 \times 10^{-4}$.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
  output_dir="/kaggle/working/",
  per_device_train_batch_size=32,
  evaluation_strategy="epoch",
  logging_strategy = "epoch",
  save_strategy =  "epoch",
  num_train_epochs=30,
#   save_steps=100,
#   eval_steps=100,
#   logging_steps=10,
  learning_rate=2e-4,
  save_total_limit=2,
  remove_unused_columns=False,
  push_to_hub=False,
  load_best_model_at_end=True,
)

We can now load the pre-trained model. We'll add ```num_labels``` on init so the model creates a classification head with the right number of units.

In [5]:
from transformers import ViTForImageClassification

labels = {0:'front', 1:'left', 2:'other', 3:'right'}

model = ViTForImageClassification.from_pretrained(
    model_id,
    num_labels=len(labels)  # size of the new classification head
)

Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
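This warning is expected: the checkpoint has no head for our four classes, so `classifier.weight` and `classifier.bias` start from a fresh initialization. As a rough sketch, assuming the head is a single linear layer from the 768-dimensional hidden state to the label logits, its size works out as follows.

```python
hidden_size = 768   # width of the base model's Linear layers
num_labels = 4      # front, left, other, right

weight_params = hidden_size * num_labels   # classifier.weight
bias_params = num_labels                   # classifier.bias
print(weight_params + bias_params)
```

The new head is tiny compared with the pre-trained backbone, which is why fine-tuning is cheap relative to training from scratch.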


In [None]:
len(labels)

In [6]:
model.to(device)

ViTForImageClassification(
  (vit): ViTModel(
    (embeddings): ViTEmbeddings(
      (patch_embeddings): ViTPatchEmbeddings(
        (projection): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
      )
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): ViTEncoder(
      (layer): ModuleList(
        (0-11): 12 x ViTLayer(
          (attention): ViTAttention(
            (attention): ViTSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (output): ViTSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (intermediate): ViTIntermediate(
            (dense): Linear(in_features=7

We can see the characteristics of our model.

Now, all instances can be passed to ```Trainer```.

In [None]:
!rm -r /kaggle/working/runs /kaggle/working/checkpoint-144 /kaggle/working/checkpoint-288 /kaggle/working/trainer_state.json /kaggle/working/training_args.bin /kaggle/working/train_results.json /kaggle/working/all_results.json /kaggle/working/preprocessor_config.json /kaggle/working/eval_results.json /kaggle/working/config.json /kaggle/working/model.safetensors /kaggle/working/state.db /kaggle/working/ChtoZaLevEtotTigr-10 /kaggle/working/wandb /kaggle/working/ChtoZaLevEtotTigr-6

In [None]:
from transformers import EarlyStoppingCallback

In [None]:
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
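The idea behind a patience of 3 can be sketched in a few lines of plain Python. This is an illustration of the logic only, not the callback's code: `stopping_epoch` is a hypothetical helper, the losses are made up, and it assumes the monitored metric is the validation loss (lower is better).

```python
def stopping_epoch(eval_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('inf')
    bad_epochs = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best:
            best = loss        # new best: reset the counter
            bad_epochs = 0
        else:
            bad_epochs += 1    # no improvement this epoch
            if bad_epochs >= patience:
                return epoch
    return None

# hypothetical validation losses: best at epoch 1, then three worse epochs
print(stopping_epoch([0.27, 0.31, 0.32, 0.33]))
```

With the best loss at epoch 1 followed by three epochs without improvement, training would stop after epoch 4 instead of running all 30 epochs.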

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    compute_metrics=compute_metrics,
    train_dataset=prepared_train,
    eval_dataset=prepared_val,
    callbacks = [early_stopping],
    tokenizer=feature_extractor
)

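We can now train the model and save it. The next cells first install and log in to Weights & Biases (```wandb```), which the ```Trainer``` can log to.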

In [None]:
!pip install wandb

In [None]:
import wandb
wandb.login()

In [None]:
!wandb login --relogin

In [None]:
wandb.init()

In [None]:
train_results = trainer.train()
# save the model (and the image processor passed as `tokenizer`)
trainer.save_model()
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
# save the trainer state
trainer.save_state()

In [None]:
!zip -r /content/model.zip /content/model/model_vit

#### Model Evaluation

We can now evaluate our model using the F1 metric defined above.

In [None]:
# `prepared_test` only exists if the test-split cells above are uncommented
metrics = trainer.evaluate(prepared_test)
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)

The model scores well: the training log below shows a weighted F1 of about 0.90 on the validation set. Let's have a look at an example. We can pick the first image in our testing dataset and see if the predicted label is correct.

In [5]:
import json  # import the library

with open('/kaggle/working/trainer_state.json', 'r', encoding='utf-8') as f:  # open the file
    text = json.load(f)  # load the whole file into a variable
    print(text)  # print the result

{'best_metric': 0.27228328585624695, 'best_model_checkpoint': '/kaggle/working/checkpoint-48', 'epoch': 4.0, 'eval_steps': 500, 'global_step': 192, 'is_hyper_param_search': False, 'is_local_process_zero': True, 'is_world_process_zero': True, 'log_history': [{'epoch': 1.0, 'learning_rate': 0.00019333333333333333, 'loss': 0.2642, 'step': 48}, {'epoch': 1.0, 'eval_f1': 0.9012113001529236, 'eval_loss': 0.27228328585624695, 'eval_runtime': 23.1095, 'eval_samples_per_second': 51.71, 'eval_steps_per_second': 3.245, 'step': 48}, {'epoch': 2.0, 'learning_rate': 0.0001866666666666667, 'loss': 0.1556, 'step': 96}, {'epoch': 2.0, 'eval_f1': 0.8919466169890291, 'eval_loss': 0.305475652217865, 'eval_runtime': 22.8647, 'eval_samples_per_second': 52.264, 'eval_steps_per_second': 3.28, 'step': 96}, {'epoch': 3.0, 'learning_rate': 0.00018, 'loss': 0.093, 'step': 144}, {'epoch': 3.0, 'eval_f1': 0.8952650762052572, 'eval_loss': 0.3232346177101135, 'eval_runtime': 23.2968, 'eval_samples_per_second': 51.295
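The learning rates in this log are consistent with a linear decay from the initial value to zero over the full planned run. The sketch below reproduces them under that assumption (linear schedule, no warmup); `linear_lr` is a hypothetical helper, and the step counts are read from the log and the training arguments.

```python
base_lr = 2e-4
steps_per_epoch = 48          # from the log: step 48 at epoch 1.0
num_epochs = 30               # num_train_epochs in TrainingArguments
total_steps = steps_per_epoch * num_epochs

def linear_lr(step):
    """Linear decay from base_lr to 0 with no warmup (assumed schedule)."""
    return base_lr * (1 - step / total_steps)

for step in (48, 96, 144):
    print(step, linear_lr(step))
```

The computed values match the logged `learning_rate` entries at steps 48, 96, and 144 up to floating-point rounding.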

In [None]:
# show the first image of the testing dataset
image = dataset_test["image"][0].resize((200,200))
image

The image is not very clear, even when resized. Let's extract the actual label.

In [None]:
# extract the actual label of the first image of the testing dataset
actual_label = dataset_test["label"][0]

labels = dataset_test.features['label']
actual_label, labels.names[actual_label]


The cell above gives the actual label of the image. Let's now see what our model predicts. Since training saved the model to a local checkpoint directory (```push_to_hub=False```), we load it from there with `ViTForImageClassification` and `ViTImageProcessor`. We ask the processor for PyTorch tensors by passing `return_tensors="pt"`.

In [None]:
import urllib.request
from PIL import Image

In [None]:
image_url = 'https://img5.goodfon.com/original/2304x1536/f/46/tigr-progulka-les-tuman.jpg'

urllib.request.urlretrieve(image_url,"image.jpg")

In [None]:
img = Image.open("image.jpg")

img.show();

In [None]:
img

In [None]:
from transformers import ViTForImageClassification, ViTImageProcessor

# load our fine-tuned model from a local checkpoint
model_name_or_path = '/kaggle/working/checkpoint-2400'
model_finetuned = ViTForImageClassification.from_pretrained(model_name_or_path)
# load the image processor from the same checkpoint
feature_extractor_finetuned = ViTImageProcessor.from_pretrained(model_name_or_path)

In [None]:
inputs = feature_extractor_finetuned(image, return_tensors="pt")

with torch.no_grad():
    logits = model_finetuned(**inputs).logits

In [None]:
predicted_label = logits.argmax(-1).item()
print(predicted_label)
labels = dataset_test.features['label']
labels.names[predicted_label]

The predicted label is the index of the largest logit, which we extract with the argmax function.

If it matches the actual label extracted earlier, the model has classified this image correctly.
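To illustrate how logits relate to a prediction, the sketch below turns a made-up logit vector into probabilities with a softmax and picks the argmax. The logits are invented for illustration; the label names follow the mapping used when the model was created.

```python
import math

id2label = {0: 'front', 1: 'left', 2: 'other', 3: 'right'}
logits = [0.3, 2.1, -0.5, 0.9]   # made-up logits for one image

# numerically stable softmax: subtract the max before exponentiating
exps = [math.exp(x - max(logits)) for x in logits]
probs = [e / sum(exps) for e in exps]

predicted = max(range(len(logits)), key=lambda i: logits[i])   # argmax
print(id2label[predicted], [round(p, 3) for p in probs])
```

Softmax preserves the ordering of the logits, so taking the argmax of the logits or of the probabilities gives the same label.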

## References

[Article](https://pinecone.io/learn/vision-transformers/)
