# Vision Transformer (ViT) for Image Classification [5 points]
Use a Vision Transformer to classify the Cats and Dogs dataset. You can use a pre-defined ViT model or implement one from scratch.


## Steps:

In [7]:
!unzip -q kagglecatsanddogs_5340.zip -d /content


In [8]:
import os
root='/content'
data_dir=None
for dirpath, dirnames,_ in os.walk(root):
    if 'Cat' in dirnames and 'Dog' in dirnames:
        data_dir=dirpath
        break
if data_dir is None:
    raise FileNotFoundError(f"Couldn’t locate the PetImages folder under {root}")
print(f"data_dir = {data_dir}")


data_dir = /content/PetImages


1. Load and preprocess the dataset. This may include resizing images, normalizing pixel values, and splitting the dataset into training, validation, and testing sets.

In [9]:
from PIL import Image, UnidentifiedImageError
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, Subset
import numpy as np
import os
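def clean_folder(folder):
    """Delete files in `folder` that PIL cannot open as images."""
    for fname in os.listdir(folder):
        path=os.path.join(folder,fname)
        try:
            # verify() checks file integrity without decoding the full image
            with Image.open(path) as img:
                img.verify()
        except (UnidentifiedImageError,OSError):
            print("Removing bad file:",path)
            os.remove(path)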

for cls in ['Cat','Dog']:
    clean_folder(os.path.join(data_dir,cls))
train_tfms=transforms.Compose([
    transforms.Resize((224,224)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.485,0.456,0.406],[0.229,0.224,0.225]),
])
test_tfms=transforms.Compose([
    transforms.Resize((224,224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485,0.456,0.406],[0.229,0.224,0.225]),
])

# Transform-free dataset, used only for its length and class names
base_ds=datasets.ImageFolder(data_dir)
num_samples=len(base_ds)
indices=np.arange(num_samples)
# Fixed seed so the 70/15/15 split is reproducible
np.random.seed(42)
np.random.shuffle(indices)
train_end=int(0.7*num_samples)
val_end=train_end+int(0.15*num_samples)

train_idx=indices[:train_end]
val_idx=indices[train_end:val_end]
test_idx=indices[val_end:]

# Separate ImageFolder instances so that only the training subset is augmented
train_ds=Subset(datasets.ImageFolder(data_dir,transform=train_tfms),train_idx)
val_ds=Subset(datasets.ImageFolder(data_dir,transform=test_tfms),val_idx)
test_ds=Subset(datasets.ImageFolder(data_dir,transform=test_tfms),test_idx)

batch_size=32
train_loader=DataLoader(train_ds,batch_size=batch_size,shuffle=True,num_workers=2)
val_loader=DataLoader(val_ds,batch_size=batch_size,shuffle=False,num_workers=2)
test_loader=DataLoader(test_ds,batch_size=batch_size,shuffle=False,num_workers=2)

print(f"Sizes → Train: {len(train_ds)}, Val: {len(val_ds)}, Test: {len(test_ds)}")
imgs, labels=next(iter(train_loader))
print("shapes:",imgs.shape,labels.shape)


Removing bad file: /content/PetImages/Cat/Thumbs.db
Removing bad file: /content/PetImages/Cat/666.jpg
Removing bad file: /content/PetImages/Dog/Thumbs.db
Removing bad file: /content/PetImages/Dog/11702.jpg




Sizes → Train: 17498, Val: 3749, Test: 3751
shapes: torch.Size([32, 3, 224, 224]) torch.Size([32])
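
The printed split sizes can be checked by hand. The sketch below repeats the 70/15/15 index arithmetic from the cell above on a bare index array. The total of 24998 images is inferred from the printed sizes (17498 + 3749 + 3751), and the helper name `split_indices` is hypothetical. It uses a local `RandomState` rather than the global seed, so it is a variation on the notebook code and not a copy of it.

```python
import numpy as np

def split_indices(num_samples, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle indices once, then cut them into train/val/test slices."""
    rng = np.random.RandomState(seed)          # local RNG (assumption, see lead-in)
    indices = rng.permutation(num_samples)
    train_end = int(train_frac * num_samples)  # int() truncates toward zero
    val_end = train_end + int(val_frac * num_samples)
    # The test slice takes whatever is left, so it absorbs both truncations
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]

# 24998 is inferred from the printed sizes, not read from disk
train_idx, val_idx, test_idx = split_indices(24998)
print(len(train_idx), len(val_idx), len(test_idx))  # 17498 3749 3751
```

Because both `int()` calls round down, the test set ends up two images larger than the validation set.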


2. Choose whether to use a pre-defined ViT model or implement one from scratch. You can use a built-in predefined model for this part.

In [10]:
import torch
import torch.nn as nn
import torch.optim as optim
import timm

device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name="vit_base_patch32_224"
model=timm.create_model(model_name,pretrained=True,num_classes=2)

model.to(device)
print("Loaded {} with {:.1f}M params.".format(model_name, sum(p.numel() for p in model.parameters()) / 1e6))


criterion=nn.CrossEntropyLoss()
optimizer=optim.AdamW(model.parameters(), lr=3e-5, weight_decay=1e-2)
scheduler=optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

print(model)



Loaded vit_base_patch32_224 with 87.5M params.
VisionTransformer(
  (patch_embed): PatchEmbed(
    (proj): Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32))
    (norm): Identity()
  )
  (pos_drop): Dropout(p=0.0, inplace=False)
  (patch_drop): Identity()
  (norm_pre): Identity()
  (blocks): Sequential(
    (0): Block(
      (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
      (attn): Attention(
        (qkv): Linear(in_features=768, out_features=2304, bias=True)
        (q_norm): Identity()
        (k_norm): Identity()
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj): Linear(in_features=768, out_features=768, bias=True)
        (proj_drop): Dropout(p=0.0, inplace=False)
      )
      (ls1): Identity()
      (drop_path1): Identity()
      (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
      (mlp): Mlp(
        (fc1): Linear(in_features=768, out_features=3072, bias=True)
        (act): GELU(approximate='none')
        (drop1): Dropout(
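
The `PatchEmbed` layer in the printout is a `Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32))`, so each non-overlapping 32x32 patch becomes one 768-dimensional token. As a rough sketch, the sequence length and the size of that projection can be worked out as follows. The function names are hypothetical, and the extra class token is an assumption based on the standard ViT design, since the truncated printout does not show it.

```python
def vit_token_count(image_size=224, patch_size=32, class_token=True):
    """Number of tokens for a square image cut into square, non-overlapping patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    # Assumption: one learnable class token is prepended, as in the standard ViT
    return num_patches + (1 if class_token else 0)

def patch_embed_params(in_chans=3, embed_dim=768, patch_size=32):
    """Weights plus biases of the Conv2d patch projection."""
    return in_chans * embed_dim * patch_size * patch_size + embed_dim

print(vit_token_count())     # 50 (7 x 7 patches + 1 class token)
print(patch_embed_params())  # 2360064
```

With 32-pixel patches the sequence is short (49 patches), which keeps attention cheap compared with a 16-pixel-patch variant at the same resolution.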

3. Train and evaluate your ViT model. Discuss your results.

In [11]:
import torch
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

num_epochs=5
patience=3
epochs_no_improve=0
best_val_acc=0.0
best_epoch=0
history={'train_loss':[],'train_acc':[],'val_loss':[],'val_acc':[]}

for epoch in range(1,num_epochs+1):
    model.train()
    running_loss=0.0
    running_corrects=0

    for images,labels in train_loader:
        images=images.to(device)
        labels=labels.to(device)
        optimizer.zero_grad()
        outputs=model(images)
        loss=criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        preds=outputs.argmax(dim=1)
        running_loss+=loss.item()*images.size(0)
        running_corrects+=torch.sum(preds==labels.data)

    train_loss=running_loss/len(train_loader.dataset)
    train_acc=running_corrects.double()/len(train_loader.dataset)
    history['train_loss'].append(train_loss)
    history['train_acc'].append(train_acc.item())

    model.eval()
    val_loss=0.0
    val_corrects=0

    with torch.no_grad():
        for images,labels in val_loader:
            images=images.to(device)
            labels=labels.to(device)

            outputs=model(images)
            loss=criterion(outputs, labels)

            preds=outputs.argmax(dim=1)
            val_loss+=loss.item()*images.size(0)
            val_corrects+=torch.sum(preds==labels.data)

    val_loss=val_loss/len(val_loader.dataset)
    val_acc=val_corrects.double() / len(val_loader.dataset)
    history['val_loss'].append(val_loss)
    history['val_acc'].append(val_acc.item())
    scheduler.step()
    print("Epoch{}/{}Train Loss:{:.4f}Acc:{:.4f} |Val Loss: {:.4f}  Acc: {:.4f}".format(epoch,num_epochs,train_loss,train_acc,val_loss,val_acc))
    if val_acc>best_val_acc:
        best_val_acc=val_acc
        best_epoch=epoch
        epochs_no_improve=0
        torch.save(model.state_dict(),"best_vit_cats_vs_dogs.pth")
    else:
        epochs_no_improve+=1
        print(f"  → No improvement for {epochs_no_improve}/{patience} epochs")

    if epochs_no_improve >= patience:
        print(f"\nEarly stopping triggered. Stopping at epoch {epoch}.")
        break

print(f"\nTraining complete. Best val_acc: {best_val_acc:.4f} (epoch {best_epoch})")
# Restore the best checkpoint before evaluating on the test set
model.load_state_dict(torch.load("best_vit_cats_vs_dogs.pth",map_location=device))
model.eval()
all_preds = []
all_labels = []
with torch.no_grad():
    for images,labels in test_loader:
        images=images.to(device)
        outputs=model(images)
        preds=outputs.argmax(dim=1).cpu().numpy()
        all_preds.extend(preds)
        all_labels.extend(labels.numpy())

test_acc=np.mean(np.array(all_preds)==np.array(all_labels))
print(f"\nTesting Accuracy: {test_acc:.4f}\n")

print("Classification Report of the Model:")
print(classification_report(all_labels, all_preds, target_names=base_ds.classes))

print("Confusion Matrix of the Model:")
print(confusion_matrix(all_labels, all_preds))


Epoch1/5Train Loss:0.0496Acc:0.9823 |Val Loss: 0.0289  Acc: 0.9880
Epoch2/5Train Loss:0.0112Acc:0.9962 |Val Loss: 0.0447  Acc: 0.9869
  → No improvement for 1/3 epochs
Epoch3/5Train Loss:0.0080Acc:0.9975 |Val Loss: 0.0356  Acc: 0.9899
Epoch4/5Train Loss:0.0047Acc:0.9982 |Val Loss: 0.0546  Acc: 0.9843
  → No improvement for 1/3 epochs
Epoch5/5Train Loss:0.0033Acc:0.9991 |Val Loss: 0.0288  Acc: 0.9917

Training complete. Best val_acc: 0.9917 (epoch 5)
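
The early-stopping bookkeeping in the training loop can be replayed on the logged validation accuracies alone. This is a minimal sketch with a hypothetical helper, `replay_early_stopping`, that mirrors the counter logic in the cell above without touching the model.

```python
def replay_early_stopping(val_accs, patience=3):
    """Return (best_acc, best_epoch, stopped_at) for a list of per-epoch accuracies."""
    best_acc, best_epoch, no_improve = 0.0, 0, 0
    stopped_at = None
    for epoch, acc in enumerate(val_accs, start=1):
        if acc > best_acc:
            # New best: record it and reset the patience counter
            best_acc, best_epoch, no_improve = acc, epoch, 0
        else:
            no_improve += 1
        if no_improve >= patience:
            stopped_at = epoch
            break
    return best_acc, best_epoch, stopped_at

# Validation accuracies copied from the training log above
logged = [0.9880, 0.9869, 0.9899, 0.9843, 0.9917]
print(replay_early_stopping(logged))  # (0.9917, 5, None)
```

The counter never exceeds 1 in this run, so early stopping does not trigger and the checkpoint from epoch 5 is the one that gets saved.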





Testing Accuracy: 0.9917

Classification Report of the Model:
              precision    recall  f1-score   support

         Cat       0.99      0.99      0.99      1859
         Dog       0.99      0.99      0.99      1892

    accuracy                           0.99      3751
   macro avg       0.99      0.99      0.99      3751
weighted avg       0.99      0.99      0.99      3751

Confusion Matrix of the Model:
[[1838   21]
 [  10 1882]]
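
One detail of the setup is worth checking: the scheduler is created with `T_max=10`, but training runs for only 5 epochs with one `scheduler.step()` per epoch. The sketch below evaluates the closed-form cosine annealing formula in plain Python, assuming `eta_min=0` (the PyTorch default); the function name is hypothetical and this is not PyTorch's own implementation.

```python
import math

def cosine_annealing_lr(step, base_lr=3e-5, t_max=10, eta_min=0.0):
    """Closed-form cosine annealing: the LR after `step` scheduler steps."""
    cosine = (1 + math.cos(math.pi * step / t_max)) / 2
    return eta_min + (base_lr - eta_min) * cosine

# Step k is the learning rate in effect after k calls to scheduler.step()
for step in range(6):
    print(step, f"{cosine_annealing_lr(step):.3e}")
```

After five steps the learning rate has fallen only to half of its starting value (1.5e-5), so the run covers the first half of the cosine curve. Setting `T_max` to the number of epochs would anneal it all the way to `eta_min`.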


<span style='color:green'>The fine-tuned, pretrained `vit_base_patch32_224` model learns and generalizes well on the Cat vs. Dog task. Training accuracy rose steadily from 98.23% in epoch 1 to 99.91% in epoch 5, while training loss fell from 0.0496 to 0.0033. Validation accuracy was less smooth: it fluctuated between 98.43% and 99.17% and peaked in epoch 5, and validation loss rose at epochs 2 and 4 (to 0.0447 and 0.0546) before returning to 0.0288. The gap between training and validation accuracy stayed below 1.5 percentage points, which points to only mild overfitting. On the held-out test set, the best checkpoint (epoch 5) reaches 99.17% accuracy, with precision, recall, and F1-score all about 0.99 for both classes. The confusion matrix agrees: 21 of 1,859 cats were misclassified as dogs and 10 of 1,892 dogs as cats, for 31 errors out of 3,751 test images.</span>
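
The per-class figures in the classification report are rounded to two decimals. As a cross-check, they can be recomputed from the confusion matrix, assuming scikit-learn's convention that rows are true labels and columns are predicted labels. The variable names below are illustrative.

```python
import numpy as np

# Confusion matrix copied from the test output (rows: true, columns: predicted)
cm = np.array([[1838, 21],
               [10, 1882]])
classes = ["Cat", "Dog"]

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # correct / everything predicted as the class
recall = true_pos / cm.sum(axis=1)     # correct / everything truly in the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

for name, p, r, f in zip(classes, precision, recall, f1):
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f:.4f}")
print(f"accuracy={accuracy:.4f}")  # accuracy=0.9917
```

The unrounded values show a small asymmetry that the report hides: the model is slightly more likely to call a cat a dog than the reverse, so Cat recall and Dog precision are both a little lower than their counterparts.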

6. References. Include details on all the resources used to complete this part.

<span style='color:green'>Kaggle. Dogs vs. Cats dataset. 2013. https://www.microsoft.com/en-us/download/details.aspx?id=54765
Hugging Face. Vision Transformer (ViT) documentation. https://huggingface.co/docs/transformers/model_doc/vit
</span>