In [None]:
%load_ext autoreload
%autoreload 2

from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

try:
    import jupyter_black

    jupyter_black.load()
except ImportError:
    print("black not installed")

# Foundation Models

## Goals

- Download and set up a foundation model
- Perform zero-shot image classification

## Setup

Let's define paths, then install and load the necessary Python packages.

**Recommended: Save the notebook to your personal Google Drive to persist changes.**

**Recommended: Change the runtime to a GPU instance (if using Google Colab).**

Mount your Google Drive to store data and results (if running the code in Google Colab).

In [None]:
try:
    import google.colab

    IN_COLAB = True
except ImportError:
    IN_COLAB = False

print(f"In colab: {IN_COLAB}")

In [None]:
if IN_COLAB:
    from google.colab import drive

    drive.mount("/content/drive")

**Modify the following paths if necessary.**

This is where your data will be stored.

In [None]:
from pathlib import Path

if IN_COLAB:
    DATA_PATH = Path("/content/drive/MyDrive/cas-dl-module-compvis-part1")
else:
    DATA_PATH = Path("/workspace/code/data")

Install `dl_cv_lectures`

In [None]:
try:
    import dl_cv_lectures

    print("dl_cv_lectures installed, all good")
except ImportError:
    import os

    if Path("/workspace/code/src").exists():
        print("Installing from local repo")
        os.system("cd /workspace/code  && pip install -e .")
    else:
        print("Installing from git repo")
        os.system("pip install git+https://github.com/marco-willi/cas-dl-compvis-exercises-hs2024")

Load all packages

In [None]:
import math
import random
from typing import Callable

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchinfo
import torchshow as ts
import torchvision
from matplotlib import pyplot as plt
from PIL import Image
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from torchvision import transforms
from torchvision.transforms.v2 import functional as TF
from torchvision.utils import make_grid
from tqdm.notebook import tqdm

import dl_cv_lectures

Define a default device for your computations.

**A GPU is strongly recommended!** Otherwise, the images have to be restricted in size.

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using: {device}")

## 1) The CLIP Model

The CLIP model ([paper](https://arxiv.org/abs/2103.00020)) has had a profound impact on the deep learning community and on practical applications.

We are going to use it for zero-shot image classification.


In [None]:
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained(
    "openai/clip-vit-base-patch32", cache_dir=DATA_PATH.joinpath("hf_cache")
)
processor = CLIPProcessor.from_pretrained(
    "openai/clip-vit-base-patch32", cache_dir=DATA_PATH.joinpath("hf_cache")
)

We download an image.

In [None]:
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
image

Now we define a prompt for each class that we are interested in. In this example, the classes are `cat` and `dog`.

In [None]:
inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image,
    return_tensors="pt",
    padding=True,
)
# `inputs` now holds the tokenized prompts and the preprocessed image
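
When trying out many classes, it can be handy to generate the prompts from a list of class names instead of typing each one. A minimal sketch, assuming a hypothetical helper `build_prompts` and the same template as above; the helper is for illustration only and is not part of `transformers`:

```python
def build_prompts(class_names, template="a photo of a {}"):
    """Return one text prompt per class name (hypothetical helper)."""
    return [template.format(name) for name in class_names]


# Illustrative class names; replace them with the classes you care about.
class_names = ["cat", "dog", "remote control"]
prompts = build_prompts(class_names)
print(prompts)
```

The resulting list can be passed as `text=prompts` to the `processor` call.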

Now we evaluate the similarities of the image with respect to each prompt.

In [None]:
with torch.no_grad():  # inference only, so no gradients are needed
    outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
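
Conceptually, CLIP embeds the image and each prompt as vectors, L2-normalizes them, and scores each pair by cosine similarity multiplied by a learned temperature (the logit scale). The sketch below reproduces that computation in plain Python on made-up 3-dimensional embeddings; the vectors, the function names, and the scale of 100 are illustrative assumptions, not values read from the model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(image_emb, text_embs, logit_scale=100.0):
    """Scaled cosine similarity between one image and several prompts."""
    image_unit = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        text_unit = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(image_unit, text_unit))
        logits.append(logit_scale * cosine)
    return logits


# Made-up embeddings: the first prompt points in nearly the same
# direction as the image, the second one does not.
image_emb = [1.0, 2.0, 0.5]
text_embs = [[1.0, 2.0, 0.4], [-1.0, 0.5, 2.0]]

toy_logits = similarity_logits(image_emb, text_embs)
print(toy_logits)
```

Because the embeddings are normalized, only their direction matters; the scale factor controls how sharply the later softmax separates the classes.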

We can now evaluate the relative similarities and produce a probability distribution (softmax) over all classes.

In [None]:
probs = logits_per_image.softmax(dim=1)
probs
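
The softmax exponentiates each score and divides by the sum, so the results are positive and add up to 1. A minimal sketch in plain Python with made-up logits (not the model's actual output) that also maps the most probable entry back to its prompt:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_prompts = ["a photo of a cat", "a photo of a dog"]
example_logits = [24.0, 19.0]  # made-up similarity scores

example_probs = softmax(example_logits)
best = max(range(len(example_prompts)), key=lambda i: example_probs[i])

for prompt, p in zip(example_prompts, example_probs):
    print(f"{prompt}: {p:.3f}")
print(f"Predicted: {example_prompts[best]}")
```

With the real tensors, `probs.argmax(dim=1)` gives the index of the most likely prompt for each image.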

**Task**: Play around with the prompts. Can you also classify or detect other objects in the image? How about a different image?