<a href="https://colab.research.google.com/github/kdmalc/intro-computer-vision/blob/main/HW5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Name:

NetID:

Collaborators:

# General instructions
Please copy this Colab notebook into your own Drive to edit. This notebook will also serve as your final submission report, so please ensure that code cells run correctly and that all non-code (text/LaTeX) blocks render correctly before submitting the file. Feel free to add any additional cells (code or text) you need. Please follow good coding, markdown, and presentation etiquette.

__Please do not use any AI tools for this assignment.__


## Submission instructions

- Before submitting, please `run-all` the code. This will re-run your entire Jupyter notebook cell by cell to produce all the outputs.

- You are required to download the Colab notebook as a `.ipynb` file and submit it to Canvas. Please name your `.ipynb` file `netid.ipynb`.

- Modify the text cell on top to include your name and the names of any collaborators from this class you worked with on this assignment.

- Download a PDF of the executed Colab notebook. You can use print -> save as PDF. Please name your `.pdf` file `netid.pdf`.

- Any extra images used in the homework should also be uploaded to Canvas.

- For simplicity, you can also upload a `netid.zip` file to Canvas containing all solution files.

# Problem 1: Segmentation


In [None]:
import numpy as np
import glob
import matplotlib.pyplot as plt
from PIL import Image

import torch
import torch.nn as nn
import torchvision.transforms.functional as TF
from torch.utils.data import Dataset
from torch.utils import data
from torchvision import transforms as T
from torchvision import models

torch.manual_seed(0)
np.random.seed(0)

In [None]:
!gdown "https://drive.google.com/uc?id=1eYYJ26R1S9Ln_ExwHFBqd3rbln9qVdi4&export=download"
!unzip -qq cityscapes.zip

In [None]:
class CityScapesDataset(Dataset):
  def __init__(self, images, labels, im_transform, mask_transform):
    self.images = images
    self.labels = labels
    self.im_transform = im_transform
    self.mask_transform = mask_transform

  def __getitem__(self, idx):
    im = Image.open(self.images[idx])
    mask = Image.open(self.labels[idx])
    im = self.im_transform(im)[0:3, ...] # Transform image

    # Add an extra first dimension to mask (needed for transforms), convert
    # to LongTensor b/c values are integers, and apply transforms.
    mask = np.asarray(mask)[None, ...]
    mask = torch.LongTensor(mask)
    mask = self.mask_transform(mask)

    # Apply random horizontal flip to image and mask
    if np.random.rand() > 0.5:
      im = TF.hflip(im)
      mask  = TF.hflip(mask)

    return im, mask

  def __len__(self):
    return len(self.images)
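
To make the mask shape handling in `__getitem__` concrete, here is a minimal sketch that uses a made-up 4 x 6 label array in place of a real Cityscapes mask (the array name, size, and contents are assumptions for illustration only). It shows that indexing with `[None, ...]` prepends a singleton channel axis, which is the shape the resize transform operates on.

```python
import numpy as np

# Hypothetical 4x6 label map standing in for a Cityscapes mask
# (each entry is an integer class id below 34).
toy_mask = np.arange(24, dtype=np.uint8).reshape(4, 6) % 34

# Same indexing as in __getitem__: prepend a singleton "channel" axis.
toy_mask_chw = toy_mask[None, ...]

print(toy_mask.shape)      # (4, 6)
print(toy_mask_chw.shape)  # (1, 4, 6)
```

With the transforms defined in this notebook, a batch of masks from the dataloader would therefore be expected to carry that extra axis, i.e. shape `(batch_size, 1, 256, 256)`. Keep this in mind when passing masks to `nn.CrossEntropyLoss`, which takes class-index targets without a channel axis.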

In [None]:
batch_size = 16

# Make image and mask transforms.
im_transform = [T.ToTensor()]
im_transform.append(T.Resize((256, 256), interpolation=T.InterpolationMode.BILINEAR))
im_transform = T.Compose(im_transform)

mask_transform = T.Resize((256, 256), interpolation=T.InterpolationMode.NEAREST)

def get_dataloader(im_path):
  images = sorted(glob.glob(im_path + '/*8bit.jpg'))
  labels = sorted(glob.glob(im_path + '/*labelIds.png'))
  dataset = CityScapesDataset(images, labels, im_transform, mask_transform)
  return data.DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=1)

# Create dataloaders
train_dataloader = get_dataloader('./cityscapes/train')
val_dataloader = get_dataloader('./cityscapes/val')

Problem 1a: Implement Segmentation model

In [None]:
class Segmenter(torch.nn.Module):
    def __init__(self, n_classes, encoder):
        super(Segmenter, self).__init__()
        self.encoder = encoder
        #self.decoder = Your code for Problem 1a goes here

    def forward(self, x):
        return None  # Your code for Problem 1a goes here

In [None]:
# Get features from VGG16 up through 3 downsampling (maxpool) operations.
vgg = models.vgg16(pretrained=True);
encoder = nn.Sequential(*vgg.features[:17]);  # layers 0-16 of vgg.features (ends at the 3rd maxpool)

# Create model
n_classes = 34
model = Segmenter(n_classes, encoder);
model.to('cuda');
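
As a quick sanity check on what the encoder hands to your decoder, the arithmetic below sketches the expected spatial size of the encoder output. It assumes the 256 x 256 input size used by the transforms above and that each of the three max-pool layers halves the resolution; the variable names are illustrative only.

```python
# Assumed values taken from the cells above: 256x256 inputs, 3 max-pool stages.
input_size = 256
num_maxpools = 3

# Each 2x2, stride-2 max-pool halves the spatial resolution.
downsample_factor = 2 ** num_maxpools
feature_size = input_size // downsample_factor

print(downsample_factor)  # 8
print(feature_size)       # 32
```

Under these assumptions the encoder output would be a 32 x 32 feature map (with 256 channels for this slice of VGG16). You can confirm this by passing a dummy tensor through `encoder` and printing its shape.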

Problem 1b: Train your segmentation model

In [None]:
lr = 1e-4
loss = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
num_epochs = 7

# Problem 1b: Your training loop code goes here

Problem 1c: Evaluate your model

In [None]:
# Problem 1c: Your IoU evaluation code goes here

Problem 1d: Visualize validation images

In [None]:
# Problem 1d: Your image results code goes here

Problem 1e: Look at the lines of code for resizing the images and masks to 256 x 256. We use bilinear interpolation when resizing the image, but nearest neighbor interpolation when resizing the mask. Why do we not use bilinear interpolation for the mask?

!!!YOUR ANSWER HERE!!!

Problem 1f: Look at the `__getitem__` function for the `CityScapesDataset` class and notice that we apply a horizontal flip augmentation to the image and mask using a random number generator. Why do we apply the flip in this way instead of simply adding a `T.RandomHorizontalFlip` to the sequence of transforms in `im_transform` and `mask_transform` (similar to what you did in HWK 4)?

!!! YOUR ANSWER HERE !!!

# Problem 2: Vision Transformers

In [None]:
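import urllib.request  # the request submodule must be imported explicitly

import io
import numpy as np
from PIL import Image


def load_image_from_url(url: str) -> Image.Image:
    # Download the image and decode it as RGB.
    with urllib.request.urlopen(url) as f:
        return Image.open(f).convert("RGB")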


EXAMPLE_IMAGE_URL = "https://dl.fbaipublicfiles.com/dinov2/images/example.jpg"
example_image = load_image_from_url(EXAMPLE_IMAGE_URL)
display(example_image)

In [None]:
import torch
from torchvision.models.feature_extraction import create_feature_extractor
import torchvision.transforms as transforms
import timm

# Load DinoV2 with registers model
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

dinov2_model = timm.create_model('vit_base_patch14_reg4_dinov2', pretrained=True)
dinov2_model.eval()
dinov2_model.to(DEVICE)

# Create feature extractor to output q, k values from last layer
# and final output tokens from vision transformer
dinov2_feature_extractor = create_feature_extractor(
    dinov2_model, return_nodes=['blocks.11.attn.q_norm',
                                'blocks.11.attn.k_norm',
                                'norm'],
)

# Useful variables for model
IMAGE_CHANNELS, IMAGE_HEIGHT, IMAGE_WIDTH = dinov2_model.pretrained_cfg['input_size']
PATCH_SIZE, _ = dinov2_model.patch_embed.patch_size
DINOV2_IMAGE_MEAN = dinov2_model.pretrained_cfg['mean']
DINOV2_IMAGE_STD = dinov2_model.pretrained_cfg['std']
NUM_PREFIX_TOKENS = dinov2_model.num_prefix_tokens
dinov2_transforms = transforms.Compose([
    transforms.Resize((IMAGE_HEIGHT, IMAGE_WIDTH),
                      interpolation=transforms.InterpolationMode.BICUBIC,
                      antialias=True),
    transforms.ToTensor(),
    transforms.Normalize(mean=DINOV2_IMAGE_MEAN, std=DINOV2_IMAGE_STD),
])

In [None]:
# Easy function to grab q, k, and final output tokens from DinoV2-Reg model
def get_features_from_dinov2(image: Image.Image):
  image_pt = dinov2_transforms(image).unsqueeze(0)
  with torch.no_grad():
    out = dinov2_feature_extractor(image_pt.to(DEVICE))
  q = out['blocks.11.attn.q_norm'].squeeze(0).cpu()
  k = out['blocks.11.attn.k_norm'].squeeze(0).cpu()
  out_tokens = out['norm'].squeeze(0).cpu()
  image_feat_tokens = out_tokens[NUM_PREFIX_TOKENS:, :]

  return q, k, image_feat_tokens

q, k, image_feat_tokens = get_features_from_dinov2(example_image)
print("Q:", q.shape)
print("K:", k.shape)
print("Image feature tokens:", image_feat_tokens.shape)

In [None]:
# Function to visualize self-attention with class embedding
def visualize_class_attention(
    attn_matrix: torch.Tensor,  # attention weights (n_heads, n_tokens, n_tokens)
    num_prefix_tokens=5,
    image_height=518,
    image_width=518,
    patch_size=14,
    ncols=3):

  assert (attn_matrix.ndim == 3), "Attention map should have shape (n_heads, n_tokens, n_tokens)"
  assert (attn_matrix.shape[1] == attn_matrix.shape[2]), "Attention map should be square"
  n_heads, n_tokens, _ = attn_matrix.shape
  nrows = n_heads // ncols

  fig, axs = plt.subplots(nrows, ncols, figsize=(10, 10))
  for i, ax in enumerate(axs.flatten()):
    # Get attention weights between class token and image tokens
    class_token_attn = attn_matrix[i, 0, num_prefix_tokens:]
    class_token_attn = class_token_attn.reshape(image_height // patch_size, image_width // patch_size)
    # Plotting
    ax.imshow(class_token_attn, cmap='hot', aspect='auto')
    ax.axis('off')
  plt.subplots_adjust(hspace=0.1, wspace=0.1)
  plt.show()
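
The reshape in `visualize_class_attention` only works if the number of image tokens matches the patch grid. The short sketch below works through that arithmetic using the function's default arguments (518 x 518 input, 14-pixel patches, 5 prefix tokens); the variable names are illustrative and not part of the provided code.

```python
# Values assumed from the defaults of visualize_class_attention.
image_height, image_width = 518, 518
patch_size = 14
num_prefix_tokens = 5  # tokens that precede the image patch tokens

# Number of patches along each side of the image.
patches_per_side = image_height // patch_size

# Total number of image patch tokens and of all tokens.
num_patch_tokens = patches_per_side * (image_width // patch_size)
num_tokens = num_prefix_tokens + num_patch_tokens

print(patches_per_side)   # 37
print(num_patch_tokens)   # 1369
print(num_tokens)         # 1374
```

If these defaults match the loaded model, the token dimension of `q` and `k` printed above should agree with `num_tokens`, and each head's class-token attention row (after dropping the prefix tokens) reshapes to a 37 x 37 map.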

Problem 2a: Visualize self-attention of vision transformer

In [None]:
def compute_attention_weight(q, k):
  # Your code for Problem 2a goes here
  pass

In [None]:
visualize_class_attention(compute_attention_weight(q, k),
                          num_prefix_tokens=NUM_PREFIX_TOKENS,
                          image_height=IMAGE_HEIGHT,
                          image_width=IMAGE_WIDTH,
                          patch_size=PATCH_SIZE)

Problem 2a: Comment on the similarities and differences between the attention maps across the different heads.

!!! YOUR ANSWER HERE !!!

(ELEC/COMP 546) Problem 2b: PCA analysis on output feature patches.

In [None]:
from sklearn.decomposition import PCA
# Your code for Problem 2b goes here

Problem 2b: Comment on how the feature patches from similar objects (i.e. the dogs) are colored.

!!! YOUR ANSWER HERE !!!

# Problem 3: Using CLIP for Zero-Shot Classification


In [None]:
! pip install git+https://github.com/openai/CLIP.git

In [None]:
import clip

model, preprocess = clip.load("ViT-B/32")
model.cuda().eval()
input_resolution = model.visual.input_resolution
context_length = model.context_length
vocab_size = model.vocab_size

print("Model parameters:", f"{np.sum([int(np.prod(p.shape)) for p in model.parameters()]):,}")
print("Input resolution:", input_resolution)
print("Context length:", context_length)
print("Vocab size:", vocab_size)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

cifar = np.load('/content/drive/MyDrive/ELEC 477/CIFAR.npz') # Replace with your path to CIFAR.
X,y,label_names = cifar['X'], cifar['y']*1.0, cifar['label_names']
print(label_names)

Problem 3a: Implement zero-shot classification with CLIP

In [None]:
from tqdm import tqdm
captions = None # Your code goes here.

# Iterate over all test examples.
for i in tqdm(range(50000, 60000)):
  image = preprocess(Image.fromarray(np.uint8(X[i,...]))).unsqueeze(0).to('cuda')
  text = clip.tokenize(captions).to('cuda')

  with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

    # Your code goes here.

# Your code goes here.
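
To clarify what `logits_per_image.softmax(dim=-1)` produces in the loop above, here is a small NumPy sketch of a softmax over made-up similarity scores for one image and three captions. The `softmax` helper and the numbers are assumptions for illustration and are not part of CLIP's API.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up similarity scores: 1 image (row) against 3 captions (columns).
toy_logits = np.array([[2.0, 0.5, -1.0]])
toy_probs = softmax(toy_logits)

print(toy_probs.shape)          # (1, 3)
print(toy_probs.sum(axis=-1))   # each row sums to 1 (up to rounding)
```

Each row of the result is a probability distribution over the captions, so a higher logit for a caption translates into a higher probability for it.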

Problem 3b: Prompt engineering for zero-shot classification

In [None]:
from tqdm import tqdm
engineered_captions = None # Your code goes here.

# Iterate over all test examples.
for i in tqdm(range(50000, 60000)):
  image = preprocess(Image.fromarray(np.uint8(X[i,...]))).unsqueeze(0).to('cuda')
  text = clip.tokenize(engineered_captions).to('cuda')

  with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

    # Your code goes here.

# Your code goes here.

# Problem 4: StyleGAN


In [None]:
# setup correct PyTorch version
!pip install -U torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
import torch

# Download the code
!git clone https://github.com/NVlabs/stylegan2-ada-pytorch.git
%cd stylegan2-ada-pytorch

# install other dependencies
!pip install ninja

print('PyTorch version: {}'.format(torch.__version__) )
!nvidia-smi -L
print('GPU Identified at: {}'.format(torch.cuda.get_device_name()))

In [None]:
# Download the model
import argparse
import numpy as np
import PIL.Image
import dnnlib
import re
import sys
from io import BytesIO
import IPython.display
from math import ceil
from PIL import Image, ImageDraw
import imageio
import matplotlib.pyplot as plt
import legacy
import cv2
import torch
from tqdm.autonotebook import tqdm

device = torch.device('cuda')

# Choose between these pretrained models
# https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/afhqcat.pkl
# https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/afhqdog.pkl
# https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/afhqwild.pkl
# https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/brecahad.pkl
# https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/cifar10.pkl
# https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/ffhq.pkl
# https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/metfaces.pkl

network_pkl = "https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/ffhq.pkl"

# If the download fails, you can try downloading the file manually and uploading it to the session directly
# network_pkl = "/content/ffhq.pkl"

print('Loading networks from "%s"...' % network_pkl)
with dnnlib.util.open_url(network_pkl) as f:
  G = legacy.load_network_pkl(f)['G_ema'].to(device) # type: ignore

In [None]:
# Useful utility functions...

# Generates an image from a style vector.
def generate_image_from_style(dlatent, noise_mode='none'):

  if len(dlatent.shape) == 1:
    dlatent = dlatent.unsqueeze(0)

  row_images = G.synthesis(dlatent, noise_mode=noise_mode)
  row_images = (row_images.permute(0, 2, 3, 1) * 127.5 + 128).clamp(0, 255).to(torch.uint8)
  return row_images[0].cpu().numpy()

# Converts a noise vector z to a style vector w.
def convert_z_to_w(latent, truncation_psi=0.7, truncation_cutoff=9, class_idx=None):
  label = torch.zeros([1, G.c_dim], device=device)
  if G.c_dim != 0:
    if class_idx is None:
      raise RuntimeError('Must specify class label with class_idx when using a conditional network')
    label[:, class_idx] = 1
  else:
    if class_idx is not None:
      print(f'warning: class_idx={class_idx} ignored when running on an unconditional network')
  return G.mapping(latent, label, truncation_psi=truncation_psi, truncation_cutoff=truncation_cutoff)
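
The pixel conversion in `generate_image_from_style` can be illustrated in isolation. The NumPy sketch below mirrors the `* 127.5 + 128` scaling and clamping on a few made-up values, assuming the generator output is nominally in the range [-1, 1]; the `to_uint8` helper is hypothetical and only stands in for the tensor operations used above.

```python
import numpy as np

def to_uint8(x: np.ndarray) -> np.ndarray:
    """Scale values nominally in [-1, 1] to [0, 255], clamp, and cast to uint8."""
    return np.clip(x * 127.5 + 128, 0, 255).astype(np.uint8)

# Made-up values, including one outside the nominal range.
toy_values = np.array([-1.0, 0.0, 1.0, 1.5])
toy_pixels = to_uint8(toy_values)

print(toy_pixels.tolist())  # [0, 128, 255, 255]
```

Values beyond the nominal range are clamped rather than wrapped, which is why the last entry saturates at 255.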

In [None]:
# Sample code to generate images.
np.random.seed(123) # You can change this random seed.

# Generate a random noise (z) vector.
z = torch.from_numpy(np.random.randn(1, G.z_dim)).to(device)

# Convert z vector to w vector.
w = convert_z_to_w(z, truncation_psi=0.7, truncation_cutoff=9)

# Generate and show image.
img = generate_image_from_style(w)
plt.imshow(img)

### LATENT SPACE FACE TRAVERSALS

In [None]:
# download
!gdown "1vekENF84yvVpKhMaChqTVEyttAckZ4PU" -O "../"

In [None]:
from torchvision import models as tv
cnn = tv.resnet50(pretrained=False, progress=True, num_classes = 1)
cnn.eval()
cnn.load_state_dict(torch.load('../ffhq-Gender.weights', map_location=lambda storage, loc: storage))

# Returns whether face is perceptually female (True) or male (False) given
# an input image of shape (H, W, 3).
def face_is_female(img):
  im = np.asarray(img)/255.0
  im = cv2.resize(im, (256, 256))
  im = np.expand_dims(np.transpose(im, (2,0,1)), 0)
  im = torch.FloatTensor(im)
  logits = cnn(im)[0, 0]
  return (logits < 0.5).numpy()
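
The preprocessing inside `face_is_female` converts an image of shape (H, W, 3) into a batch of shape (1, 3, H, W). The sketch below walks through those shape changes with NumPy on a made-up all-zero image that is already 256 x 256, so the `cv2.resize` step is skipped; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical 256x256 RGB image with uint8 pixels (height, width, channels).
toy_img = np.zeros((256, 256, 3), dtype=np.uint8)

# Scale pixels to [0, 1], as in face_is_female.
scaled = toy_img / 255.0

# Move channels first: (H, W, 3) -> (3, H, W).
chw = np.transpose(scaled, (2, 0, 1))

# Add a leading batch axis: (3, H, W) -> (1, 3, H, W).
batched = np.expand_dims(chw, 0)

print(chw.shape)      # (3, 256, 256)
print(batched.shape)  # (1, 3, 256, 256)
```

This means the image returned by `generate_image_from_style`, which has shape (H, W, 3), can be passed to `face_is_female` directly.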

Problem 4a: Interpolation between two faces and gender classification.


In [None]:
# Your code goes here.

Problem 4a: What differences do you notice when interpolating in style space? Do the intermediate faces look realistic?

!!! YOUR ANSWER HERE !!!

Problem 4b: Latent space traversals

In [None]:
# Your code goes here.

Problem 4b: Do you notice any facial attributes that seem to commonly change when moving between males and females? Why do you think that occurs?

!!! YOUR ANSWER HERE !!!