# Vision Transformer (ViT) for Image Classification [5 points]
Use a Vision Transformer to classify images from the Cats and Dogs dataset. You can use a pre-defined ViT model or implement one from scratch.
Deploy the model and record a short video (~5 mins) on how it works.

In [9]:
import os
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader, random_split
from transformers import ViTForImageClassification, ViTFeatureExtractor, TrainingArguments, Trainer
from datasets import Dataset
from PIL import Image
import numpy as np
import kagglehub
from kagglehub import KaggleDatasetAdapter

## Steps:

1. Load and preprocess the dataset. This may include resizing images, normalizing pixel values, and splitting the dataset into training, validation, and testing sets.
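Step 1 mentions training, validation, and testing sets. A minimal sketch of a seeded three-way index split is shown here; the 80/10/10 fractions, the seed, and the helper name `split_indices` are illustrative assumptions rather than part of the assignment.

```python
import random

def split_indices(n, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle range(n) with a fixed seed and cut it into train/val/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split reproducible
    n_train = int(train_frac * n)
    n_val = int(val_frac * n)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Each index list could then be wrapped with `torch.utils.data.Subset` to obtain the three datasets.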

In [10]:
!pip install datasets



In [11]:
DATA_DIR = "/kaggle/input/kaggle-cat-vs-dog-dataset/kagglecatsanddogs_3367a/PetImages"

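transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),  # scales pixel values to [0, 1]
    # Match the preprocessing of the pretrained checkpoint (mean 0.5, std 0.5 per channel).
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])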

full_dataset = datasets.ImageFolder(DATA_DIR, transform=transform)
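`ToTensor` scales pixel values to [0, 1]. The model card for `google/vit-base-patch16-224` states that its training images were normalized with mean 0.5 and standard deviation 0.5 per channel, which maps that range to [-1, 1]. The arithmetic, as a small sketch (the helper name `normalize` is illustrative):

```python
def normalize(x, mean=0.5, std=0.5):
    """Per-pixel normalization: subtract the mean, then divide by the std."""
    return (x - mean) / std

# Darkest, mid-gray, and brightest pixel values after ToTensor scaling.
for value in (0.0, 0.5, 1.0):
    print(value, "->", normalize(value))
```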

In [12]:
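# Keep only samples whose tensor has 3 channels.
# This decodes every image once, so it can take a while.
valid_indices = [i for i, (x, _) in enumerate(full_dataset) if x.shape[0] == 3]
full_dataset = torch.utils.data.Subset(full_dataset, valid_indices)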

In [13]:
train_size = int(0.8 * len(full_dataset))
val_size = len(full_dataset) - train_size
train_ds, val_ds = random_split(full_dataset, [train_size, val_size])

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=2)
val_loader = DataLoader(val_ds, batch_size=64, shuffle=False, num_workers=2)
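With `drop_last` left at its default of `False`, a `DataLoader` over a map-style dataset yields `ceil(len(dataset) / batch_size)` batches. The sketch below uses that relation to bound the split sizes from the batch counts in the training logs (312 training and 78 validation batches); the helper names are illustrative.

```python
import math

def num_batches(n_samples, batch_size, drop_last=False):
    """Number of batches produced for a dataset of n_samples items."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)

def sample_range(n_batches, batch_size):
    """Smallest and largest dataset size giving n_batches when drop_last=False."""
    return (n_batches - 1) * batch_size + 1, n_batches * batch_size

print(sample_range(312, 64))  # bounds on the training split size
print(sample_range(78, 64))   # bounds on the validation split size
```

Because the last batch may be smaller than 64, averaging the loss over batches (as the training loop does) weights that batch slightly more per sample than the others.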

2. Choose whether to use a pre-defined ViT model or implement one from scratch. You can use a built-in predefined model for this part.

In [14]:
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=2,
    ignore_mismatched_sizes=True
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

print(model)

Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224 and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([1000]) in the checkpoint and torch.Size([2]) in the model instantiated
- classifier.weight: found shape torch.Size([1000, 768]) in the checkpoint and torch.Size([2, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


ViTForImageClassification(
  (vit): ViTModel(
    (embeddings): ViTEmbeddings(
      (patch_embeddings): ViTPatchEmbeddings(
        (projection): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
      )
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): ViTEncoder(
      (layer): ModuleList(
        (0-11): 12 x ViTLayer(
          (attention): ViTAttention(
            (attention): ViTSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
            )
            (output): ViTSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (intermediate): ViTIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermed
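
The printed architecture shows the patch projection `Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))`. A quick sketch of the token count and projection parameter count this implies for a 224x224 input follows. The extra classification token is part of the standard ViT design; it is stated here as an assumption because the truncated printout does not show it.

```python
image_size = 224
patch_size = 16
in_channels = 3
hidden_dim = 768

patches_per_side = image_size // patch_size  # non-overlapping patches along each side
num_patches = patches_per_side ** 2          # total patches per image
seq_len = num_patches + 1                    # plus one classification token (assumed)

# Conv2d parameters: weights (out * in * kH * kW) plus one bias per output channel.
proj_params = hidden_dim * in_channels * patch_size * patch_size + hidden_dim

print(num_patches, seq_len, proj_params)
```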

In [15]:
import torch.optim as optim
from torch import nn

optimizer = optim.Adam(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()

3. Train and evaluate your ViT model. Discuss your results.

In [16]:
from tqdm import tqdm

for epoch in range(3):
    model.train()
    total_loss = 0
    correct = 0
    for images, labels in tqdm(train_loader, desc=f"[Train] Epoch {epoch+1}"):
        images, labels = images.to(device), labels.to(device)

        inputs = {"pixel_values": images}
        outputs = model(**inputs)
        loss = criterion(outputs.logits, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        correct += (outputs.logits.argmax(dim=1) == labels).sum().item()

    train_loss = total_loss / len(train_loader)
    train_acc = correct / len(train_loader.dataset)

    model.eval()
    val_loss = 0
    val_correct = 0
    with torch.no_grad():
        for images, labels in tqdm(val_loader, desc=f"[Val] Epoch {epoch+1}"):
            images, labels = images.to(device), labels.to(device)
            inputs = {"pixel_values": images}
            outputs = model(**inputs)
            loss = criterion(outputs.logits, labels)

            val_loss += loss.item()
            val_correct += (outputs.logits.argmax(dim=1) == labels).sum().item()

    val_loss = val_loss / len(val_loader)
    val_acc = val_correct / len(val_loader.dataset)

    # Print epoch summary
    print(f"Epoch {epoch+1} Summary:")
    print(f"  Train Loss: {train_loss:.4f} | Train Acc: {train_acc*100:.2f}%")
    print(f"  Val   Loss: {val_loss:.4f} | Val   Acc: {val_acc*100:.2f}%")


[Train] Epoch 1: 100%|██████████| 312/312 [11:10<00:00,  2.15s/it]
[Val] Epoch 1: 100%|██████████| 78/78 [01:01<00:00,  1.28it/s]


Epoch 1 Summary:
  Train Loss: 0.0356 | Train Acc: 99.09%
  Val   Loss: 0.0238 | Val   Acc: 99.18%


[Train] Epoch 2: 100%|██████████| 312/312 [11:10<00:00,  2.15s/it]
[Val] Epoch 2: 100%|██████████| 78/78 [01:00<00:00,  1.28it/s]


Epoch 2 Summary:
  Train Loss: 0.0043 | Train Acc: 99.93%
  Val   Loss: 0.0121 | Val   Acc: 99.58%


[Train] Epoch 3: 100%|██████████| 312/312 [11:10<00:00,  2.15s/it]
[Val] Epoch 3: 100%|██████████| 78/78 [01:00<00:00,  1.28it/s]

Epoch 3 Summary:
  Train Loss: 0.0027 | Train Acc: 99.94%
  Val   Loss: 0.0156 | Val   Acc: 99.54%





In [17]:
os.makedirs("saved_model", exist_ok=True)
model.save_pretrained("saved_model")
print("Model saved to saved_model/")

Model saved to saved_model/


**Very high accuracy (train and validation)**
The model reached over 99% accuracy on both splits in the first epoch (99.09% train, 99.18% validation). That is expected when fine-tuning a pretrained ViT on a relatively easy dataset like Cats vs Dogs, where the two classes are visually distinctive.

**Low loss**
By epoch 3, training loss had dropped below 0.003 (0.0027). Validation loss also stayed low (0.0156), which indicates no major overfitting.

**Slight increase in validation loss at epoch 3**
Validation loss rose from 0.0121 to 0.0156 while validation accuracy dipped from 99.58% to 99.54%. Combined with the still-falling training loss, this could hint that minor overfitting is about to start.
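
The observation about epoch 3 can be turned into a simple check. The sketch below picks the epoch with the lowest validation loss from the recorded summaries and applies a patience-based stopping rule. The notebook itself does not implement early stopping, and the helper names and the `patience` value are assumptions for illustration.

```python
# Validation losses copied from the three epoch summaries.
val_losses = [0.0238, 0.0121, 0.0156]

def best_epoch(losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(losses)), key=lambda i: losses[i]) + 1

def should_stop(losses, patience=1):
    """True if none of the last `patience` epochs improved on the best earlier loss."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    return all(loss >= best_before for loss in losses[-patience:])

print(best_epoch(val_losses))
print(should_stop(val_losses, patience=1))
```

Under this rule, the checkpoint from the best epoch would be the one to save, rather than the weights from the final epoch.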

4. Deploy your trained ViT model. This could be a simple script or application that takes an image as input and predicts whether it's a cat or a dog.

In [None]:
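from fastapi import FastAPI, File, UploadFile
from PIL import Image
import torch
from transformers import ViTForImageClassification, ViTImageProcessor
import io

app = FastAPI()

# Directory written by model.save_pretrained(); adjust to where the model is stored.
model_path = "/Users/skdharaneeshwar/Desktop/Spring25/DL/cats_dogs/saved_model"
model = ViTForImageClassification.from_pretrained(model_path)
model.eval()

# model.save_pretrained() does not write a preprocessor config, so load the
# image processor from the base checkpoint the model was fine-tuned from.
# ViTImageProcessor replaces the deprecated ViTFeatureExtractor.
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

@app.get("/")
def read_root():
    return {"message": "Cat vs Dog Classifier is up!"}

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    # Decode the uploaded bytes and force 3-channel RGB.
    image_bytes = await file.read()
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")

    # Resize and normalize, then run a forward pass without gradients.
    inputs = image_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        predicted_class = torch.argmax(logits, dim=1).item()

    # ImageFolder assigns class indices alphabetically by folder name: Cat -> 0, Dog -> 1.
    label = "Cat" if predicted_class == 0 else "Dog"
    return {"prediction": label}

if __name__ == "__main__":
    import uvicorn
    # Assumes this file is saved as model_deploy.py.
    uvicorn.run("model_deploy:app", host="127.0.0.1", port=8000, reload=True)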
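
The `/predict` endpoint returns only the arg-max label. A response that also reports a confidence can be derived by applying softmax to the logits. The following is a minimal pure-Python sketch of that step; `format_prediction` and the example logits are hypothetical and not part of the deployed script.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def format_prediction(logits, labels=("Cat", "Dog")):
    """Map logits to a label and confidence, with index 0 -> Cat and 1 -> Dog."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=lambda i: probs[i])
    return {"prediction": labels[idx], "confidence": round(probs[idx], 4)}

# Hypothetical logits, for illustration only.
print(format_prediction([2.0, -1.0]))
```

In the endpoint, the list of logits could be obtained with `outputs.logits[0].tolist()` before calling such a helper.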

5. Record a short video (~5 mins) demonstrating how your deployed ViT model works. The video should showcase the model taking image inputs and providing predictions. Explain the key aspects of your implementation and deployment process in the video.
   a. Upload the video to UBbox and create a shared link
   b. Add the link at the end of your ipynb file.

**Shared UBbox Video Link:**

https://buffalo.box.com/s/ba7abdt04hajjivmnw9t6h0ouloy01h6

6. References. Include details on all the resources used to complete this part.

Hugging Face – Vision Transformer (ViT) Model:
https://huggingface.co/google/vit-base-patch16-224

PyTorch Vision ImageFolder:
https://pytorch.org/vision/stable/generated/torchvision.datasets.ImageFolder.html

FastAPI Documentation:
https://fastapi.tiangolo.com/

Uvicorn – ASGI server for FastAPI:
https://www.uvicorn.org/

Kaggle Dataset Source (Alternate):
https://www.kaggle.com/datasets/karakaggle/kaggle-cat-vs-dog-dataset