<a href="https://colab.research.google.com/github/jman4162/PyTorch-Vision-Transformers-ViT/blob/main/Introduction_to_Fine_tuning_Vision_Transformers_(ViT)_for_Robotics_Applications_with_PyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Fine-tuning Vision Transformers (ViT) for Robotics Applications with PyTorch

Name: John Hodge

Date: 04/23/24

## Introduction

In the fast-evolving field of robotics, visual perception is a critical component, enabling robots to navigate and interact with complex environments. One of the latest breakthroughs in computer vision is the adoption of Vision Transformers (ViTs), which have shown impressive results across a variety of tasks, rivaling traditional convolutional neural networks (CNNs). This tutorial aims to demonstrate the practical application of ViTs, specifically through fine-tuning the `vit_b_16` model for object recognition in robotics.

### What is a Vision Transformer?

Vision Transformers are a class of deep learning models adapted from transformers, which were originally developed for natural language processing. ViTs apply the transformer's self-attention mechanism to grids of image patches, allowing the model to weigh the importance of different parts of an image. This ability to focus on relevant image features adaptively is particularly useful in dynamic scenarios typical in robotics.
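To make the idea of weighing image patches concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. The three two-dimensional "patch embeddings" are made-up numbers, and the embeddings are used directly as queries, keys, and values. That is a simplifying assumption: a real ViT applies learned projections and several attention heads.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention for small lists of vectors."""
    d = len(queries[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # attention weights sum to 1
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy patch embeddings (hypothetical values, not taken from a real image).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(patches, patches, patches)

for i, row in enumerate(weights):
    print(f"patch {i} attends with weights {[round(w, 3) for w in row]}")
```

Each row of `weights` shows how strongly one patch attends to every patch, itself included. Patches with similar embeddings receive larger weights.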

The `vit_b_16` model, where "b" stands for "base" and "16" indicates the size of each image patch (16x16 pixels), is a medium-sized ViT suitable for a wide range of vision tasks. With 12 transformer encoder layers, a hidden size of 768, and roughly 86 million parameters, it offers a balanced trade-off between computational cost and accuracy.
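The patch size determines how many tokens the transformer processes per image. The short calculation below is illustrative arithmetic only; it assumes the 224x224 input resolution used later in this tutorial and the single class token that ViT prepends to the patch sequence.

```python
# Token count for a ViT with square patches (illustrative arithmetic only).
image_size = 224   # input resolution used later in this tutorial
patch_size = 16    # the "16" in vit_b_16

patches_per_side = image_size // patch_size      # patches along one image edge
num_patches = patches_per_side ** 2              # patches per image
seq_len = num_patches + 1                        # +1 for the class token
values_per_patch = patch_size * patch_size * 3   # RGB values flattened per patch

print(f"{patches_per_side} x {patches_per_side} = {num_patches} patches")
print(f"sequence length including the class token: {seq_len}")
print(f"raw values per patch: {values_per_patch}")
```

A 224x224 image therefore becomes a sequence of 197 tokens. Self-attention cost grows quadratically with this sequence length, which is why larger patches are cheaper.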

![ViT](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/vit_architecture.jpg)

Reference: [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929)

### Tutorial Overview

This tutorial will guide you through the steps of fine-tuning the `vit_b_16` model using PyTorch, a powerful and flexible machine learning framework. We will cover:

- **Setting up PyTorch and importing the ViT model**: How to load the pre-trained `vit_b_16` and prepare it for fine-tuning.
- **Data preparation**: Techniques for preparing your image data for training and evaluation, tailored for robotic applications.
- **Fine-tuning process**: Adjustments and optimization for the model specific to object recognition tasks in robotics.
- **Evaluation and testing**: How to assess the model's performance to ensure it meets the demands of robotics applications.

By the end of this tutorial, you will have a solid understanding of how to implement and adapt Vision Transformers for real-world robotic tasks, ensuring your robotic systems can benefit from the latest advancements in computer vision.

Let's dive into the world of Vision Transformers and unlock powerful visual capabilities for your robotic applications!

## Setup Environment
First, ensure you have Python installed, then install PyTorch, torchvision, torchsummary, and tqdm using pip:

In [None]:
!pip install torch torchvision torchsummary tqdm


## Import Necessary Libraries

In [None]:
import torch
import time
import random
import os
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from torch import nn
from torchvision import transforms, datasets
from torch.utils.data import DataLoader, random_split
import torch.optim as optim
from torchvision.models import vit_b_16  # Import a pre-trained ViT model
from torch.optim.lr_scheduler import StepLR
from torchsummary import summary as model_summary
from tqdm.auto import tqdm

In [None]:
print(f"Torch: {torch.__version__}")

Torch: 2.2.1+cu121


### Connect to Google Drive to save models (optional)

In [None]:
from google.colab import drive

In [None]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Set random seeds for repeatability

In [None]:
def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True

In [None]:
seed = 42  # You can choose any integer value
seed_everything(seed)

## Data Preparation

For robotic applications, you might be interested in a dataset that includes images of objects that a robot might interact with. For demonstration, let's use the CIFAR-10 dataset, which contains 32x32 color images in 10 classes. Here's a detailed breakdown of each part:

### 1. Define Transformations
```python
transform = transforms.Compose([
    transforms.Resize((224, 224)),  # Resize to fit ViT input dimensions
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
```
- **Purpose**: This block of code defines the transformations that will be applied to each image in the dataset. These transformations are crucial for preparing the data for processing by the Vision Transformer model.
- **Components**:
  - `transforms.Resize((224, 224))`: Resizes each image to 224x224 pixels, which is a typical input size for Vision Transformers that were pretrained on ImageNet.
  - `transforms.ToTensor()`: Converts the images to PyTorch tensors, which are a suitable format for model inputs.
  - `transforms.Normalize(...)`: Normalizes the image data. This step adjusts the pixel values so that their distribution will have a mean and standard deviation that match the distribution of the dataset used to train the model initially (typically ImageNet).
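As a rough illustration of what the normalization step does to individual pixels, the sketch below applies `(value - mean) / std` per channel in plain Python. It assumes the pixel values have already been scaled to the range [0, 1], which `ToTensor()` does for 8-bit images. The helper name `normalize_pixel` is hypothetical and not part of torchvision.

```python
# Per-channel statistics used in the transform above (ImageNet values).
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """Apply (value - mean) / std to one RGB pixel already scaled to [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)]

black = normalize_pixel([0.0, 0.0, 0.0])
white = normalize_pixel([1.0, 1.0, 1.0])
mean_pixel = normalize_pixel(IMAGENET_MEAN)

print("black:", [round(v, 3) for v in black])
print("white:", [round(v, 3) for v in white])
print("mean :", [round(v, 3) for v in mean_pixel])
```

A pixel equal to the dataset mean maps to zero in every channel. Darker and brighter pixels map to negative and positive values of roughly two standard deviations.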

### 2. Load Datasets
```python
train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
```
- **Purpose**: These lines load the CIFAR-10 dataset from disk, downloading it if it's not already available. The dataset is split into training data (`train=True`) and test data (`train=False`).
- **Functionality**: The `transform` argument applies the previously defined transformations to the images as they are loaded, ensuring they are in the correct format and scale for training and testing.

### 3. Split Training Dataset
```python
train_size = int(0.8 * len(train_dataset))
val_size = len(train_dataset) - train_size
train_dataset, val_dataset = random_split(train_dataset, [train_size, val_size])
```
- **Purpose**: This part of the code splits the training dataset into two subsets: a training set and a validation set. The split is 80% for training and 20% for validation, a common ratio that keeps most of the data for training while holding out enough examples for a stable validation estimate.
- **Functionality**: `random_split` randomly assigns data points to each subset based on the specified sizes, which makes it likely that the validation set is representative of the overall dataset. Because the random seed was fixed earlier, the split is repeatable when the notebook is run from the top.
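The size arithmetic and the idea of a seeded random split can be sketched in plain Python. This is a stand-in for illustration, not how `random_split` is implemented internally: it shuffles the indices with Python's `random` module rather than a PyTorch generator, and the seed value 42 simply mirrors the one used in this tutorial.

```python
import random

num_train_images = 50_000                  # size of the CIFAR-10 training set
train_size = int(0.8 * num_train_images)   # 80% for training
val_size = num_train_images - train_size   # remaining 20% for validation

# Shuffle the indices once with a fixed seed, then cut them into two disjoint parts.
indices = list(range(num_train_images))
random.Random(42).shuffle(indices)
train_indices = indices[:train_size]
val_indices = indices[train_size:]

print(f"train: {len(train_indices)}, validation: {len(val_indices)}")
```

Computing `val_size` as the remainder, not as `int(0.2 * n)`, guarantees that the two sizes always add up to the full dataset, even when the product is not a whole number.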

### 4. Create DataLoaders
```python
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
```
- **Purpose**: These lines create DataLoader objects for the train, validation, and test datasets. DataLoaders are used to efficiently manage batches of data during the training and evaluation phases, providing necessary functionality like shuffling and parallel data loading.
- **Functionality**:
  - `shuffle=True` for the training DataLoader helps in randomizing the input data, which is beneficial for reducing model overfitting and improving model generalizability.
  - `shuffle=False` for validation and test DataLoaders because the order of data does not impact performance evaluation and this helps in consistently evaluating the model.
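The number of batches each DataLoader yields per epoch follows directly from the dataset size and the batch size. The sketch below assumes the batch size of 64 and the dataset sizes used in this tutorial, together with the DataLoader default of keeping the final, smaller batch (`drop_last=False`).

```python
import math

batch_size = 64
dataset_sizes = {"train": 40_000, "val": 10_000, "test": 10_000}

# With drop_last=False (the default), a partial final batch is still yielded.
batches = {name: math.ceil(n / batch_size) for name, n in dataset_sizes.items()}
# Size of the final batch: the remainder, or a full batch if it divides evenly.
last_batch = {name: n % batch_size or batch_size for name, n in dataset_sizes.items()}

for name in dataset_sizes:
    print(f"{name}: {batches[name]} batches, last batch has {last_batch[name]} images")
```

These counts are what `len(train_loader)` and `len(val_loader)` return, so they are the denominators when a loss is averaged over batches.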

These steps comprehensively prepare the CIFAR-10 dataset for effective training and evaluation of a Vision Transformer model, ensuring data is in the proper format, normalized, and divided into appropriate sets for the training process.

In [None]:
# Define your transformations, assumed as 'transform' in your code
transform = transforms.Compose([
    transforms.Resize((224, 224)),  # Resize to fit ViT input dimensions
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Load datasets
train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)

# Split train dataset into train and validation sets
train_size = int(0.8 * len(train_dataset))
val_size = len(train_dataset) - train_size
train_dataset, val_dataset = random_split(train_dataset, [train_size, val_size])

print(f"Train Data: {len(train_dataset)}")
print(f"Validation Data: {len(val_dataset)}")
print(f"Test Data: {len(test_dataset)}")

# Create DataLoaders for train, validation, and test sets
batch_size = 64  # Same value as in the "Set Training Parameters" section
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz


100%|██████████| 170498071/170498071 [00:04<00:00, 40836914.97it/s]


Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
Train Data: 40000
Validation Data: 10000
Test Data: 10000


## Model Setup: Define the Pretrained ViT Model for Fine-tuning

We'll use a pre-trained Vision Transformer and adapt it to our specific task (e.g., classifying 10 types of objects). Let's break down each line and its purpose:

#### 1. Model Initialization
```python
model = vit_b_16(pretrained=True)
```
- **Purpose**: This line initializes the `vit_b_16` model, a Vision Transformer, with weights pretrained on ImageNet. Starting from a pretrained model lets us reuse learned features, which can considerably improve performance on similar tasks with less data.
- **Functionality**: The parameter `pretrained=True` instructs torchvision to load the model with its pretrained weights. This accelerates training and improves accuracy, especially when the target dataset is relatively small or not diverse. Note that recent torchvision releases deprecate `pretrained` in favor of the `weights` argument (for example `weights=ViT_B_16_Weights.IMAGENET1K_V1`); `pretrained=True` still works in the version used here but emits a deprecation warning.

#### 2. Adjusting the Classifier Head
```python
model.heads.head = nn.Linear(model.heads.head.in_features, 10)  # 10 output classes for CIFAR-10
```
- **Purpose**: This line replaces the output layer of the `vit_b_16` model so that it matches the number of classes in the CIFAR-10 dataset. CIFAR-10 has 10 classes, so the output layer must produce 10 outputs.
- **Explanation**: In the original `vit_b_16` model, the classifier (also known as the "head") matches the number of classes it was trained on (1000 for ImageNet). The new linear layer is randomly initialized and is learned during fine-tuning, so the model's output can be used directly to classify images into the ten CIFAR-10 categories.
- **Details**:
    - `model.heads.head.in_features`: Retrieves the number of input features of the original output layer, which is fixed by the model's internal architecture.
    - `nn.Linear`: Creates a new linear layer with that many input features and as many output features as there are target classes.

Pass the number of classes here, not an unrelated variable. Writing `epochs` in place of `10` would give the right layer size only because both values happen to be 10 in this tutorial; changing the number of training epochs would then silently change the size of the output layer.

Reference: [vit_b_16](https://pytorch.org/vision/main/models/generated/torchvision.models.vit_b_16.html) in PyTorch.

In [None]:
model = vit_b_16(pretrained=True)
model.heads.head = nn.Linear(model.heads.head.in_features, 10)  # 10 output classes for CIFAR-10

Downloading: "https://download.pytorch.org/models/vit_b_16-c867db91.pth" to /root/.cache/torch/hub/checkpoints/vit_b_16-c867db91.pth
100%|██████████| 330M/330M [00:01<00:00, 185MB/s]


In [None]:
# Define device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")
model.to(device)
if device.type == 'cuda':  # Compare the device type; a torch.device does not equal the string 'cuda'
    print(torch.cuda.get_device_name(0))

Device: cuda


### Define the loss function and the optimizer

The three lines below set up the training process: they define the loss function, the optimizer, and the learning rate scheduler. Here’s a breakdown of each component:

#### 1. Loss Function
```python
criterion = nn.CrossEntropyLoss()
```
- **Purpose**: This line of code defines the loss function that we will use to evaluate the difference between the model's predictions and the actual labels. `CrossEntropyLoss` is widely used for classification tasks because it combines `LogSoftmax` and `NLLLoss` in a single class.
- **Functionality**: It calculates the loss between the model outputs and the target labels, which is essential for training as it provides a measure to optimize during the learning process.
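To show what this loss computes for a single example, here is a plain-Python sketch of log-softmax followed by negative log-likelihood. The logits are made-up values. `nn.CrossEntropyLoss` additionally averages over the batch by default, which this single-example helper does not do.

```python
import math

def cross_entropy(logits, target):
    """Log-softmax followed by negative log-likelihood for one example."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_sum_exp for z in logits]
    return -log_probs[target]

# Uniform logits over 10 classes: the model has no preference.
uniform = cross_entropy([0.0] * 10, target=3)
# A confident prediction, scored against the right and the wrong label.
confident = cross_entropy([0.0, 5.0, 0.0], target=1)
wrong = cross_entropy([0.0, 5.0, 0.0], target=0)

print(f"uniform:   {uniform:.4f}")
print(f"confident: {confident:.4f}")
print(f"wrong:     {wrong:.4f}")
```

With uniform logits the loss equals ln(10), about 2.30, which is a useful reference point: an untrained 10-class classifier should start near that value.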

#### 2. Optimizer
```python
optimizer = optim.Adam(model.parameters(), lr=lr)
```
- **Purpose**: This line initializes the optimizer, which updates the model parameters based on the computed gradients. We use Adam, a widely used adaptive learning rate algorithm that copes well with sparse or noisy gradients.
- **Functionality**: `Adam` stands for Adaptive Moment Estimation. It keeps running estimates of the mean and the uncentered variance of each parameter's gradient and uses them to scale that parameter's update, so the effective step size adapts per parameter throughout training. In practice this often converges faster than plain SGD with a single global learning rate.
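The per-parameter adaptation can be seen in a single update step. The sketch below follows the standard Adam update rule for one scalar parameter, using PyTorch's default hyperparameters (`betas=(0.9, 0.999)`, `eps=1e-8`) and this tutorial's learning rate. The function `adam_step` is a hypothetical helper for illustration, not PyTorch's implementation.

```python
import math

def adam_step(param, grad, m, v, t, lr=3e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Two very different gradient magnitudes, both at the first step.
p_small, _, _ = adam_step(param=1.0, grad=0.01, m=0.0, v=0.0, t=1)
p_large, _, _ = adam_step(param=1.0, grad=100.0, m=0.0, v=0.0, t=1)

print(f"step with grad=0.01:  {1.0 - p_small:.3e}")
print(f"step with grad=100.0: {1.0 - p_large:.3e}")
```

On the first step both updates are approximately equal to the learning rate, even though the gradients differ by four orders of magnitude. This normalization is what makes the step size largely insensitive to the gradient scale.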

#### 3. Learning Rate Scheduler
```python
scheduler = StepLR(optimizer, step_size=1, gamma=gamma)
```
- **Purpose**: The learning rate scheduler adjusts the learning rate during training, which can lead to better performance and quicker convergence.
- **Functionality**: `StepLR` multiplies the learning rate by `gamma` (a number less than 1) every `step_size` epochs. With `step_size=1`, the decay is applied after every epoch. This allows larger steps early in training, when the weights are far from their optimal values, and smaller steps near convergence, which reduces the risk of overshooting a minimum.
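The resulting schedule can be written out by hand. The sketch below assumes the values used in this tutorial (initial learning rate 3e-5, `gamma` 0.7, `step_size` 1, 10 epochs) and one scheduler step at the end of each epoch. It reproduces the arithmetic only and does not call PyTorch.

```python
initial_lr = 3e-5
gamma = 0.7
step_size = 1
epochs = 10

# Learning rate in effect during each epoch (epoch index starts at 0).
schedule = [initial_lr * gamma ** (epoch // step_size) for epoch in range(epochs)]

for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch:2d}: lr = {lr:.3e}")
```

By the last epoch the learning rate has shrunk to about 4% of its initial value (0.7 to the ninth power), so late epochs make only small adjustments to the weights.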

By integrating these components into your training loop, the model is equipped to effectively learn from the data, adapt its weights during training, and improve its accuracy in object recognition tasks for robotics applications.

In [None]:
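lr = 3e-5    # Initial learning rate (same value as in "Set Training Parameters")
gamma = 0.7  # Learning rate decay factor applied after every epoch
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)
scheduler = StepLR(optimizer, step_size=1, gamma=gamma)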

## Train the model

The `train_model` function is a training loop that optimizes the model over several epochs. Each epoch has a training phase, which adjusts the model's weights, and a validation phase, which monitors performance on held-out data. Besides its arguments, the function reads the module-level `device` and `scheduler` objects and updates a global `best_loss`. Let's walk through each component:

#### Function Definition
```python
def train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs=10):
```
- **Parameters**:
  - `model`: the neural network model that will be trained.
  - `train_loader` & `val_loader`: DataLoader objects that provide batches of training and validation data, respectively.
  - `criterion`: the loss function used to evaluate how well the model fits the data.
  - `optimizer`: the mechanism for updating model weights based on the computed gradients.
  - `num_epochs`: the number of complete passes through the entire training dataset.

#### Initialization
```python
best_loss = float('inf')
```
- **Purpose**: Initializes the `best_loss` variable to infinity, which is used to track the lowest validation loss observed during training. This helps in saving the best model state.

#### Training Mode
```python
model.train()
```
- **Effect**: Sets the model to training mode, which is necessary for layers like dropout and batch normalization that have distinct behavior during training vs. testing.

#### Training Loop
```python
for epoch in range(num_epochs):
    running_loss = 0.0
    start_time = time.time()
    ...
```
- **Loop Structure**: Iterates over the number of epochs specified. Within each epoch, the function records the start time (for performance monitoring), performs training, evaluates validation loss, and checks for improvements.

#### Training Phase
```python
for images, labels in train_loader:
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    outputs = model(images)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    running_loss += loss.item()
```
- **Operations**:
  - **Data Loading**: Batches of images and labels are loaded from the training DataLoader.
  - **Device Assignment**: Data is transferred to the appropriate device (CPU/GPU).
  - **Gradient Reset**: Clears old gradients from the previous batch.
  - **Forward Pass**: Computes predictions by passing images through the model.
  - **Loss Calculation**: Computes the loss using the specified criterion.
  - **Backpropagation**: Calculates the gradients of the loss with respect to model parameters.
  - **Weight Update**: Adjusts the weights based on the gradients using the optimizer.
  - **Accumulate Loss**: Adds up the loss for later averaging.

#### Validation Phase
```python
model.eval()
with torch.no_grad():
    ...
```
- **Evaluation Mode and Gradient Lock**: Switches to evaluation mode and disables gradient computation to save memory and computations during the validation pass.
- **Validation Operations**: Similar to the training phase, but here the model does not update its weights; it only evaluates the performance on the validation set.

#### Scheduler and Model Saving
```python
scheduler.step()
...
torch.save(model.state_dict(), model_file_name)
```
- **Learning Rate Adjustment**: Updates the learning rate according to the schedule.
- **Model Checkpointing**: If the current validation loss is the lowest observed, saves the model's parameters. This model can later be reloaded for further training or evaluation.
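The checkpointing rule can be traced in isolation. The sketch below replays the validation losses reported in this tutorial's training log and records the epochs at which a checkpoint would be written. Appending to a list stands in for `torch.save`, so no model is involved.

```python
# Validation losses per epoch, copied from this tutorial's training log.
val_losses = [0.088346, 0.080057, 0.067019, 0.071983, 0.068002,
              0.068222, 0.0693, 0.070445, 0.071466, 0.072017]

best_loss = float("inf")
saved_epochs = []
for epoch, val_loss in enumerate(val_losses, start=1):
    if val_loss < best_loss:        # strictly better than anything seen so far
        best_loss = val_loss
        saved_epochs.append(epoch)  # stands in for torch.save(model.state_dict(), ...)

print(f"checkpoints written at epochs {saved_epochs}, best loss {best_loss}")
```

Because the file name is the same each time, every new best checkpoint overwrites the previous one, and the file on disk always holds the weights with the lowest validation loss.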

#### Performance Monitoring
```python
end_time = time.time()
epoch_time = end_time - start_time
print(...)
```
- **Timing**: Measures how long each epoch takes, providing insight into the computational cost of training.

#### Model Reset
```python
model.train()
```
- **Reset to Training Mode**: Ensures that the model is ready for the next epoch of training.

Together, these steps train the model, validate it after every epoch, keep the checkpoint with the lowest validation loss, and report the time spent on each epoch's training phase.

### Set Training Parameters

In the context of fine-tuning a Vision Transformer model for object recognition in robotics using PyTorch, the following training settings play a crucial role in defining how the model learns from the data:

#### 1. Batch Size
```python
batch_size = 64
```
- **Definition**: The batch size determines the number of training examples utilized in one iteration of the model training. For this setting, a batch size of 64 means that the model processes 64 images in each step of the training loop.
- **Impact**: A larger batch size can lead to faster training by taking advantage of parallel processing capabilities of modern GPUs, but it might also require more memory and can affect the model's ability to generalize. Conversely, a smaller batch size offers more updates per epoch, which can improve generalization but slow down the training process.

#### 2. Number of Epochs
```python
epochs = 10
```
- **Definition**: An epoch is a full cycle through the entire training dataset. This setting specifies that the training process will iterate through the whole dataset a total of 10 times.
- **Impact**: More epochs generally allow the model more opportunities to learn and adjust its weights from the data, potentially improving accuracy. However, too many epochs might lead to overfitting, especially if the dataset is not diverse enough or is relatively small.

#### 3. Learning Rate
```python
lr = 3e-5
```
- **Definition**: The learning rate defines the step size at which the optimizer updates the weights of the model during training. A learning rate of \(3 \times 10^{-5}\) is used here.
- **Impact**: The learning rate is crucial for training dynamics. Too high a rate can make training unstable or overwrite the pretrained features, while too low a rate slows training and can cause learning to stall. Small values such as this one are common when fine-tuning pretrained models.

#### 4. Learning Rate Decay Factor (Gamma)
```python
gamma = 0.7
```
- **Definition**: This setting specifies the factor by which the learning rate scheduler multiplies the learning rate. Because the scheduler is created with `step_size=1`, the learning rate is multiplied by 0.7 after every epoch.
- **Impact**: Adjusting the learning rate during training can help in fine-tuning the model more effectively by slowing down the updates as the model approaches optimal solutions. This can prevent overshooting the minimum loss and can lead to more stable convergence.

These parameters collectively define the framework for the training process, balancing the rate of learning, the computational efficiency, and the model's capacity to generalize from the training data to real-world scenarios in robotics applications.

In [None]:
# Training settings
batch_size = 64
epochs = 10
lr = 3e-5
gamma = 0.7

### Run the training loop

In [None]:
# Training loop
best_loss = float('inf')
def train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs=10):
    global best_loss  # To allow updates to best_loss across epochs
    model.train()  # Ensure the model is in training mode
    for epoch in range(num_epochs):
        running_loss = 0.0
        start_time = time.time()  # Start timer
        # Training phase
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        end_time = time.time()  # End timer
        epoch_time = end_time - start_time  # Calculate epoch time

        # Compute average training loss
        avg_train_loss = running_loss / len(train_loader)
        # print(f'Epoch {epoch+1}, Training Loss: {avg_train_loss}')

        # Validation phase
        model.eval()  # Set the model to evaluation mode
        val_running_loss = 0.0
        with torch.no_grad():  # Disable gradient computation during validation
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)
                outputs = model(images)
                loss = criterion(outputs, labels)
                val_running_loss += loss.item()

        # Compute average validation loss
        avg_val_loss = val_running_loss / len(val_loader)
        scheduler.step()  # Update the learning rate scheduler
        print(f'Epoch {epoch+1}, Training Loss: {np.round(avg_train_loss, 6)}, Validation Loss: {np.round(avg_val_loss, 6)}, Training Time: {np.round(epoch_time, 2)}')

        # Check if the current validation loss is the best we've seen so far
        if avg_val_loss < best_loss:
            best_loss = avg_val_loss
            # model_file_name = f'best_model_{epoch+1}_2402422_v01.pt'
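            model_file_name = 'best_model_2402422_v01.pt'
            torch.save(model.state_dict(), model_file_name)  # Save the best model locally
            # Also save to Google Drive when it is mounted (the Drive step is optional)
            drive_dir = '/content/drive/MyDrive/ViT_models/'
            if os.path.isdir('/content/drive/MyDrive'):
                os.makedirs(drive_dir, exist_ok=True)
                torch.save(model.state_dict(), drive_dir + model_file_name)
            print('New best model saved!')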

        model.train()  # Set the model back to training mode

In [None]:
train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs=epochs)

Epoch 1, Training Loss: 0.181454, Validation Loss: 0.088346, Training Time: 654.71
New best model saved!
Epoch 2, Training Loss: 0.033748, Validation Loss: 0.080057, Training Time: 655.47
New best model saved!
Epoch 3, Training Loss: 0.011561, Validation Loss: 0.067019, Training Time: 655.75
New best model saved!
Epoch 4, Training Loss: 0.004308, Validation Loss: 0.071983, Training Time: 655.46
Epoch 5, Training Loss: 0.001347, Validation Loss: 0.068002, Training Time: 655.27
Epoch 6, Training Loss: 0.000658, Validation Loss: 0.068222, Training Time: 655.29
Epoch 7, Training Loss: 0.000524, Validation Loss: 0.0693, Training Time: 654.86
Epoch 8, Training Loss: 0.000448, Validation Loss: 0.070445, Training Time: 655.09
Epoch 9, Training Loss: 0.000398, Validation Loss: 0.071466, Training Time: 655.13
Epoch 10, Training Loss: 0.00036, Validation Loss: 0.072017, Training Time: 655.13


## Evaluation

To evaluate the model, check its performance on the test dataset. The function `evaluate_model` computes the model's accuracy, which is the percentage of correctly predicted instances relative to the total number of instances evaluated. Here's a breakdown of how this function works:

### Function Definition
```python
def evaluate_model(model, test_loader):
```
- **Parameters**:
  - `model`: the neural network model that will be evaluated.
  - `test_loader`: a DataLoader object that provides batches of the test dataset, including both the input images and their corresponding labels.

### Set Model to Evaluation Mode
```python
model.eval()
```
- **Purpose**: This line sets the model to evaluation mode, which matters for layers that behave differently during training and testing, such as dropout and batch normalization. In evaluation mode, dropout is disabled and batch normalization uses its stored running statistics, so the same input always produces the same output.

### Initialize Counters
```python
total = 0
correct = 0
```
- **Usage**:
  - `total`: keeps track of the total number of examples processed.
  - `correct`: counts the number of examples for which the model's prediction matches the actual label.

### Evaluation Loop
```python
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
```
- **Context**:
  - **`torch.no_grad()`**: This context manager disables gradient computation, reducing memory usage and speeding up the process since gradients are not needed for model evaluation.
  - **Data Movement**:
    - `images.to(device), labels.to(device)`: Moves the data to the same computing device (CPU or GPU) as the model, which is required for the forward pass.
  - **Model Prediction**:
    - `outputs = model(images)`: Feeds the batch of images into the model and gets the output logits for each class.
    - `_, predicted = torch.max(outputs.data, 1)`: Finds the predicted class label for each image by selecting the class with the highest logit value. The `torch.max` function returns both the maximum value and the index of that value (the predicted class label) along the specified dimension (`1`, the class dimension, so one prediction is produced per image).
  - **Update Counters**:
    - `total += labels.size(0)`: Updates the total number of examples processed.
    - `correct += (predicted == labels).sum().item()`: Increases the count of correct predictions by the number of images in the current batch where the prediction matched the label.

### Calculate and Print Accuracy
```python
accuracy = 100 * correct / total
print(f'Accuracy on test images: {accuracy}%')
```
- **Calculation**:
  - Computes the percentage of correct predictions relative to the total number of predictions made.
- **Output**:
  - Prints the computed accuracy to provide feedback on how well the model is performing on the unseen test data.
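The argmax-and-count logic can be checked on a tiny hand-made batch. The sketch below uses plain Python lists in place of tensors. The logits and labels are made-up values, and the helper names `predict` and `accuracy` are hypothetical.

```python
def predict(logits_batch):
    """Index of the largest logit in each row (the predicted class)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits_batch]

def accuracy(logits_batch, labels):
    """Percentage of rows whose predicted class matches the label."""
    predicted = predict(logits_batch)
    correct = sum(int(p == y) for p, y in zip(predicted, labels))
    return 100 * correct / len(labels)

# Four examples, three classes (toy values).
logits = [[2.0, 0.1, -1.0],
          [0.3, 0.2, 1.5],
          [0.0, 3.0, 0.5],
          [1.0, 0.9, 0.8]]
labels = [0, 2, 1, 2]

print("predicted:", predict(logits))
print(f"accuracy: {accuracy(logits, labels)}%")
```

Three of the four predictions match their labels, so the reported accuracy is 75%. Accuracy depends only on which logit is largest, not on how confident the model is.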

This evaluation loop provides a straightforward and effective way to assess the accuracy of a model, allowing for the quantification of model performance in practical and operational terms.

### Run the evaluation loop on the best model

In [1]:
def evaluate_model(model, test_loader):
    model.eval()
    total = 0
    correct = 0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    accuracy = 100 * correct / total
    print(f'Accuracy on test images: {accuracy}%')

In [None]:
# Load the best model
model_file_name = 'best_model_2402422_v01.pt'
best_model_state_dict = torch.load(model_file_name, map_location=device)  # Load onto the current device
model.load_state_dict(best_model_state_dict)

<All keys matched successfully>

In [None]:
# Evaluate the model
evaluate_model(model, test_loader)

Accuracy on test images: 97.65%


### Results Analysis

The accuracy of 97.65% on the CIFAR-10 test images is a strong result for a fine-tuned `vit_b_16`. It indicates that the model learned from the training data and generalizes well to unseen images, which is crucial for practical applications. The training log also shows that the lowest validation loss was reached at epoch 3; in later epochs the training loss kept falling while the validation loss rose slightly, a sign of mild overfitting. The evaluated checkpoint is the one saved at epoch 3.

However, it is important to note that while this accuracy is commendable, it does not reach the state-of-the-art levels where accuracies exceed 99.5%. Such high-performance benchmarks are typically achieved by models that have undergone extensive fine-tuning on very large datasets and with substantial computational resources. These state-of-the-art models often involve:
- More complex architectures or ensemble methods that integrate outputs from multiple models to boost accuracy.
- Longer training times with numerous epochs, which allow the model to iteratively refine its weights and biases to better fit the data.
- Advanced regularization techniques and hyperparameter optimization strategies that can significantly improve model performance but require experimental tuning and computational power.
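
To make the ensemble idea from the list above concrete, here is a minimal sketch of one common variant: averaging the softmax probabilities of several models and taking the most likely class. The function `ensemble_predict` is a hypothetical helper, and the tiny linear classifiers are stand-ins chosen so the example runs quickly; they are not the fine-tuned `vit_b_16` from this tutorial.

```python
import torch
import torch.nn as nn

def ensemble_predict(models, images):
    """Average the softmax probabilities of several models; return one class index per image."""
    with torch.no_grad():  # inference only, no gradients needed
        probabilities = [torch.softmax(m(images), dim=1) for m in models]
    # Stack to shape (num_models, batch, num_classes), then average over the models.
    mean_probabilities = torch.stack(probabilities).mean(dim=0)
    return mean_probabilities.argmax(dim=1)

# Stand-in models: small linear classifiers with 10 output classes (illustrative only).
torch.manual_seed(0)
models = [nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10)).eval() for _ in range(3)]
images = torch.randn(4, 3, 8, 8)  # batch of 4 random 8x8 RGB "images"
predictions = ensemble_predict(models, images)
print(predictions.shape)  # torch.Size([4])
```

In practice, each entry of `models` would be a separately trained network, which multiplies training and inference cost by the number of members.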

Reaching these top-tier accuracies usually demands considerable GPU compute and time, which makes them less feasible on a limited budget, as is often the case in educational or small-scale research settings. For the purposes of this tutorial, reaching 97.65% with the available resources and within a reasonable time frame is a strong outcome. It shows that Vision Transformers can handle complex visual tasks effectively, and it offers a solid foundation for further optimization with more resources or for applications where very high accuracy is not the critical factor.

In summary, while the model does not achieve the pinnacle of current machine learning performance, it provides a robust and highly effective solution for many practical applications, especially where budget and computational resources are constrained.

State of the art (SOTA) benchmarks: [Image Classification on CIFAR-10](https://paperswithcode.com/sota/image-classification-on-cifar-10)

## Conclusions

This tutorial has demonstrated how to fine-tune a Vision Transformer (ViT) model, specifically `vit_b_16`, with PyTorch for object recognition tasks relevant to robotics applications. Each step has been explained and implemented, from setting up the environment and data, through training and validation, to evaluating the model on a test dataset.

The fine-tuned model reached 97.65% accuracy on the test images, which underscores the potential of Vision Transformers for image recognition. Although this falls short of state-of-the-art accuracies above 99.5%, which often require extensive computational resources and fine-tuning, the result is significant given the practical constraints of limited GPU compute and budget.

The tutorial also highlighted critical elements in training deep learning models, such as the importance of a well-considered loss function, optimizer, and learning rate scheduler. The discussions around the impact of batch size, number of epochs, learning rate, and data transformations have provided deeper insights into model training dynamics and the optimization process.

Moving forward, you can experiment with different architectures, hyperparameters, or larger datasets to push your models further. You could also integrate the trained models into actual robotic systems, examine how they perform in real-world scenarios, and iterate on design choices based on practical feedback.

We hope this tutorial has provided you with a solid foundation in using Vision Transformers for image recognition and inspired you to delve deeper into the field of machine learning and robotics. Whether for academic, personal, or commercial purposes, the skills and knowledge gained here should serve as a robust base for further exploration and development in the exciting intersection of AI and robotics.


## Additional Resources

- [Hugging Face ViT](https://huggingface.co/docs/transformers/en/model_doc/vit)
- [PyTorch ViT](https://pytorch.org/vision/main/models/vision_transformer.html)
- [D2L AI - Attention Mechanisms and Transformers](https://d2l.ai/chapter_attention-mechanisms-and-transformers/index.html)
- [lucidrains/vit-pytorch](https://github.com/lucidrains/vit-pytorch)