# Miniproject 1: Multimodal Learning
## Noah Foster

First, we will import the necessary libraries and modules, then go through the five steps.

These imports are largely cannibalized from the examples. Some code has been modified to reflect my MacBook Air's lack of a GPU.

In [1]:
import os
import random
from collections import OrderedDict

import numpy as np
import torch
from pkg_resources import packaging
import clip

import skimage
import IPython.display
import matplotlib.pyplot as plt
from PIL import Image
from torchvision.datasets import CIFAR10

# Set random seeds for reproducibility:
random.seed(2952)
np.random.seed(2952)
torch.manual_seed(2952)

In [2]:
model, preprocess = clip.load("RN50")
model.eval()

cifar10_test = CIFAR10(os.path.expanduser("~/.cache"), transform=preprocess, download=True, train=False)

text_descriptions = [f"This is a photo of a {label}" for label in cifar10_test.classes]
text_tokens = clip.tokenize(text_descriptions)

with torch.no_grad():
    text_features = model.encode_text(text_tokens).float()
    text_features /= text_features.norm(dim=-1, keepdim=True)

Files already downloaded and verified


# Step 1: 0-Shot Image Classification (CIFAR 10)
## Task 1

Finding the image features is the first step. We will use the pretrained ResNet50 model to extract the features from 500 images.
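
The 500 images are drawn with `random.sample`, so the subset depends on the seed and on how many random draws happened before the call. A minimal sketch of that behaviour using only the standard library; the helper name `draw_indices` is hypothetical, and 10,000 is the size of the CIFAR-10 test set:

```python
import random


def draw_indices(seed, population_size=10_000, k=500):
    """Return k distinct indices, reproducible for a given seed."""
    rng = random.Random(seed)  # private generator, independent of the global state
    return rng.sample(range(population_size), k)


first = draw_indices(2952)
second = draw_indices(2952)
other = draw_indices(2953)

print(first == second)  # same seed, same subset
print(first == other)   # compares against the subset from a different seed
```

Using a private `random.Random` instance instead of the global seed is one way to keep the subset stable even when cells are re-run out of order.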

In [3]:
text_features.shape # (10, 1024) is for the 10 classes and 1024 dimensional embeddings

image_indicies = random.sample(range(len(cifar10_test)), 500) # This comes to 1/20th of the test set
images = [cifar10_test[idx] for idx in image_indicies] # 500 tuples of (image, label) where image is a preprocessed (3, 224, 224) tensor

test_labels = [image[1] for image in images]
images = [image[0] for image in images]

n = 10
imgs_batched_by_n = [ # 50 batches of (n, 3, 224, 224)
    torch.stack(images[i:i+n])
    for i in range(0, len(images), n)
] 


with torch.no_grad(): # Macbook air does not have fun here. Still only takes a minute and a half
    test_image_features = np.concatenate(
        [model.encode_image(img_batch).float() for img_batch in imgs_batched_by_n], axis=0
    ) # (500, 1024)

    test_image_features /= np.linalg.norm(test_image_features, axis=-1, keepdims=True) # I need the numpy version here because of my stacking

Dotting the normalized image features with the normalized text features gives us the cosine similarity between each image and each caption.
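
As a toy check of that identity, the sketch below compares cosine similarity computed from its definition with a plain dot product of L2-normalized vectors. The random arrays are stand-ins for the CLIP embeddings, not real features:

```python
import numpy as np

rng = np.random.default_rng(0)
text = rng.normal(size=(3, 8))   # 3 toy "caption" embeddings
image = rng.normal(size=(5, 8))  # 5 toy "image" embeddings

# Cosine similarity computed directly from its definition.
direct = (text @ image.T) / (
    np.linalg.norm(text, axis=1)[:, None] * np.linalg.norm(image, axis=1)[None, :]
)

# Same quantity via L2-normalized vectors and a plain dot product.
text_unit = text / np.linalg.norm(text, axis=-1, keepdims=True)
image_unit = image / np.linalg.norm(image, axis=-1, keepdims=True)
via_dot = text_unit @ image_unit.T  # shape (3, 5): one row per caption

predicted = via_dot.argmax(axis=0)  # index of the best caption for each image
print(np.allclose(direct, via_dot))
```

The `argmax` over `axis=0` mirrors the classification step in the next cell, where rows index classes and columns index images.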

In [4]:
similarities = text_features.numpy() @ test_image_features.T

likely_labels = np.argmax(similarities, axis=0)
accuracy = sum(likely_labels == test_labels) / len(test_labels)
print(f"Accuracy: {accuracy:.1%}")

Accuracy: 75.2%


This substantially beats the majority-class baseline, which is just guessing the most common class in the sample. (Uniform random guessing over the 10 classes would be right about 10% of the time.) This baseline is:

In [5]:
mode = max(set(test_labels), key=test_labels.count)
print(f"The most common class in my random sample is '{cifar10_test.classes[mode]}', which accounts for {test_labels.count(mode)} of the 500 images.")
print(f"This means that a model that always predicts '{cifar10_test.classes[mode]}' would have an accuracy of {test_labels.count(mode)/len(test_labels):.1%}.")

The most common class in my random sample is 'automobile', which accounts for 65 of the 500 images.
This means that a model that always predicts 'automobile' would have an accuracy of 13.0%.
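
The two baselines are easy to conflate, so here is a small standard-library sketch of both. The function names and the toy label list are hypothetical and are not the CIFAR sample used above:

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy of always predicting the most frequent label."""
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)


def chance_baseline(num_classes):
    """Expected accuracy of guessing uniformly at random."""
    return 1 / num_classes


toy_labels = [0, 1, 1, 2, 1, 0, 2, 1]  # hypothetical labels; class 1 appears 4 times
print(f"majority: {majority_baseline(toy_labels):.1%}")
print(f"chance:   {chance_baseline(3):.1%}")
```

On a small random sample the majority baseline is always at least as high as the chance rate, which is why it is the stricter of the two comparisons.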


The 75.2% zero-shot accuracy is a pretty good starting point, but we can do better, even at 0-shot learning. Let us try some variants of the captions. We will pick a caption format based on performance on a sample of the CIFAR-10 training set, then see if it can beat our accuracy on the test set. First we get our train image embeddings:

In [6]:
cifar10_train = CIFAR10(os.path.expanduser("~/.cache"), transform=preprocess, download=True, train=True)
image_indicies = random.sample(range(len(cifar10_train)), 1000) # This comes to 1/50th of the train set
images = [cifar10_train[idx] for idx in image_indicies] # 1000 tuples of (image, label) where image is a preprocessed (3, 224, 224) tensor

train_labels = [image[1] for image in images]
images = [image[0] for image in images]

n = 10
imgs_batched_by_n = [ # 100 batches of (n, 3, 224, 224)
    torch.stack(images[i:i+n])
    for i in range(0, len(images), n)
] 
with torch.no_grad(): # MacBook Air does not have fun here. Still only takes 3 minutes
    train_image_features = np.concatenate(
        [model.encode_image(img_batch).float() for img_batch in imgs_batched_by_n], axis=0
    ) # (1000, 1024)

    train_image_features /= np.linalg.norm(train_image_features, axis=-1, keepdims=True) # I need the numpy version here because of my stacking


Files already downloaded and verified


Then we build a little helper function to get the accuracy of a given caption format:

In [7]:
def test_format(prompt_format, image_embeddings, labels, classes):
    descriptions = [prompt_format.replace("{class}", label) for label in classes]
    text_tokens = clip.tokenize(descriptions)

    with torch.no_grad():
        text_features = model.encode_text(text_tokens).float()
        text_features /= text_features.norm(dim=-1, keepdim=True)

    similarities = text_features.numpy() @ image_embeddings.T

    likely_labels = np.argmax(similarities, axis=0)
    accuracy = sum(likely_labels == labels) / len(labels)
    return accuracy

Now let's try some different caption formats.
First, keeping it simple, what if we just use the name of the class with no effort to make a full sentence?

In [8]:
print(f"Accuracy on train set: {test_format('{class}', train_image_features, train_labels, cifar10_train.classes):.1%}")

Accuracy on train set: 68.6%


We're not quite there, so here are some other ideas:

In [9]:
formats = [
    "This is a photo of a {class}",
    "This is a photo of a {class}.",
    "The above image is a photo of a {class}.",
    "This picture is a photo of a {class}.",
    "Picture of a {class}.",
    "This image is a photo of a {class}.",
    "This is a {class}.",
    "This is an image of a {class}",
    "{class}",
    "This is a {class} picture.",
    "A {class}.",
    "A {class}"
]
training_accuracies = [test_format(format, train_image_features, train_labels, cifar10_train.classes) for format in formats]

In [10]:
print(f"Performance of our original caption format on the train set: {training_accuracies[0]:.1%}")

for format, accuracy in zip(formats[1:], training_accuracies[1:]):
    print(f"Using the format:   '{format}',\n we get a training accuracy of {accuracy:.1%}\n")

Performance of our original caption format on the train set: 72.6%
Using the format:   'This is a photo of a {class}.',
 we get a training accuracy of 71.5%

Using the format:   'The above image is a photo of a {class}.',
 we get a training accuracy of 73.7%

Using the format:   'This picture is a photo of a {class}.',
 we get a training accuracy of 74.4%

Using the format:   'Picture of a {class}.',
 we get a training accuracy of 71.8%

Using the format:   'This image is a photo of a {class}.',
 we get a training accuracy of 73.4%

Using the format:   'This is a {class}.',
 we get a training accuracy of 69.0%

Using the format:   'This is an image of a {class}',
 we get a training accuracy of 71.3%

Using the format:   '{class}',
 we get a training accuracy of 68.6%

Using the format:   'This is a {class} picture.',
 we get a training accuracy of 65.5%

Using the format:   'A {class}.',
 we get a training accuracy of 71.6%

Using the format:   'A {class}',
 we get a training accuracy 

Thus we see that our original format (72.6%) is marginally outperformed by ```'This picture is a photo of a {class}.'``` (74.4%). Applying this format to our testing data, we get an accuracy of:

In [11]:
print(f"Accuracy on test set: {test_format('This picture is a photo of a {class}.', test_image_features, test_labels, cifar10_test.classes):.1%}")

Accuracy on test set: 76.4%


Indeed, this marginally beats our original zero-shot accuracy (76.4% versus 75.2%). This is interesting because adding a period after the class name will likely affect the tokenization of the class name. Likely our class names are now being tokenized as more than one token, which I would not expect to increase performance. I tried a couple of other random seeds and consistently got the same result.
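
To put "marginally" in perspective, here is a rough back-of-the-envelope check of the sampling noise on a 500-image test sample. It uses the binomial standard error and treats the two accuracies as independent measurements, which is an assumption: both were measured on the same 500 images, so a paired comparison would be more appropriate. The function name is hypothetical:

```python
import math


def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n samples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)


n_test = 500
original, tuned = 0.752, 0.764  # test accuracies reported above

se = accuracy_standard_error(original, n_test)
gap = tuned - original
print(f"standard error: {se:.3f}, gap: {gap:.3f}")
```

Under this approximation the gap is smaller than one standard error, so the improvement is suggestive rather than conclusive on its own.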

IMPORTANT NOTE: The prompt format that worked well for one of {CIFAR-10, CIFAR-100} did not work well for the other. I tried a couple of different random seeds and consistently got the same result. I am not sure why this is the case, but it is interesting.

## Task 2: Linear Probing

Turns out those 1000 training images we already embedded will prove very useful! Let's use them to train up a little linear probe! In training, we often hit 100% accuracy, which suggests that we should probably also try closed-form linear regression. I did have to make some interesting hyperparameter choices here, though. I feel that using a grid search to perfect my hyperparameters, with even more images picked out of the training set as validation data, would be in some ways cheating. That is, it would allow the hyperparameters that just happen to work well on the validation set for this random seed to be chosen, and that validation set will correspond closely to the test data. So instead I pick hyperparameters so that the classifier I build works well not only on CIFAR-10 but also on CIFAR-100, where working well means a near-monotonic rise in training accuracy where not too many of the last epochs are done at perfect accuracy. I also believe that this data is realizable, so I stick with the large batch size of 1/10th of my training data.

In [12]:
# Building a dataloader from the train_image_features and train_labels

training_dataset = torch.utils.data.TensorDataset(torch.tensor(train_image_features), torch.tensor(train_labels))
train_loader = torch.utils.data.DataLoader(training_dataset, batch_size=100, shuffle=True)

linear_probe = torch.nn.Linear(1024, 10)

optimizer = torch.optim.Adam(linear_probe.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(20):
    for images, labels in train_loader:
        optimizer.zero_grad()
        logits = linear_probe(images)
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()

    # Note: the loss and accuracy printed here come from the last mini-batch of the epoch only
    print(f"Epoch {epoch} training loss: {loss.item():.2f}      Training accuracy: {sum(torch.argmax(logits, axis=1) == labels).item()/len(labels):.1%}")


Epoch 0 training loss: 0.99      Training accuracy: 71.0%
Epoch 1 training loss: 0.61      Training accuracy: 79.0%
Epoch 2 training loss: 0.58      Training accuracy: 82.0%
Epoch 3 training loss: 0.50      Training accuracy: 83.0%
Epoch 4 training loss: 0.41      Training accuracy: 88.0%
Epoch 5 training loss: 0.35      Training accuracy: 87.0%
Epoch 6 training loss: 0.35      Training accuracy: 91.0%
Epoch 7 training loss: 0.26      Training accuracy: 95.0%
Epoch 8 training loss: 0.31      Training accuracy: 94.0%
Epoch 9 training loss: 0.28      Training accuracy: 88.0%
Epoch 10 training loss: 0.24      Training accuracy: 93.0%
Epoch 11 training loss: 0.23      Training accuracy: 93.0%
Epoch 12 training loss: 0.18      Training accuracy: 95.0%
Epoch 13 training loss: 0.24      Training accuracy: 95.0%
Epoch 14 training loss: 0.28      Training accuracy: 93.0%
Epoch 15 training loss: 0.18      Training accuracy: 97.0%
Epoch 16 training loss: 0.17      Training accuracy: 95.0%
Epoch 1
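
The loop above reports the loss and accuracy of the last mini-batch of each epoch only, which is why the accuracies move in steps of 1%. As a sketch of epoch-level tracking, here is a NumPy-only softmax probe trained on synthetic data. The Gaussian blobs, the learning rate, and the plain SGD update are assumptions made for illustration; the notebook itself uses `torch.nn.Linear` with Adam on CLIP features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for image embeddings: 3 Gaussian blobs in 16 dimensions.
num_classes, dim, per_class = 3, 16, 100
centers = rng.normal(size=(num_classes, dim))
features = np.concatenate(
    [center + 0.3 * rng.normal(size=(per_class, dim)) for center in centers]
)
labels = np.repeat(np.arange(num_classes), per_class)

weights = np.zeros((dim, num_classes))
bias = np.zeros(num_classes)
learning_rate, batch_size = 0.1, 30

for epoch in range(20):
    order = rng.permutation(len(features))
    total_loss, correct = 0.0, 0
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        x, y = features[idx], labels[idx]
        rows = np.arange(len(y))

        logits = x @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
        exp = np.exp(logits)
        probs = exp / exp.sum(axis=1, keepdims=True)

        # Accumulate metrics over every mini-batch, not just the last one.
        total_loss += -np.log(probs[rows, y] + 1e-12).sum()
        correct += int((probs.argmax(axis=1) == y).sum())

        # Per-sample gradient of the cross-entropy with respect to the logits.
        grad = probs.copy()
        grad[rows, y] -= 1.0
        weights -= learning_rate * (x.T @ grad) / len(y)
        bias -= learning_rate * grad.mean(axis=0)

    epoch_loss = total_loss / len(features)
    epoch_accuracy = correct / len(features)
    print(f"Epoch {epoch}: loss {epoch_loss:.2f}, accuracy {epoch_accuracy:.1%}")
```

Averaging over the whole epoch gives a smoother curve, at the cost of mixing predictions made before and after each weight update within that epoch.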

In [13]:
from sklearn.linear_model import LinearRegression

labels_1hot = np.zeros((len(train_labels), 10))
labels_1hot[np.arange(len(train_labels)), train_labels] = 1

reg = LinearRegression().fit(train_image_features, labels_1hot)
print(f"Accuracy on train set: {sum(np.argmax(reg.predict(train_image_features), axis=1) == train_labels)/len(train_labels):.1%}")

Accuracy on train set: 100.0%


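We know that the optimization of the linear probe is convex, so it is tempting to just do closed-form linear regression (here, least squares on one-hot labels). This is a very simple operation but, interestingly, it does not yield results as good on the test set.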

In [14]:
linear_probe.eval()
test_image_features_tensor = torch.tensor(test_image_features)
probe_logits = linear_probe(test_image_features_tensor).detach().numpy()
linear_model_accuracy = sum(np.argmax(probe_logits, axis=1) == test_labels)/len(test_labels)
regression_accuracy = sum(np.argmax(reg.predict(test_image_features), axis=1) == test_labels)/len(test_labels)

print(f"Accuracy of linear probe: {linear_model_accuracy:.1%}")
print(f"Accuracy of linear regression: {regression_accuracy:.1%}")

Accuracy of linear probe: 87.2%
Accuracy of linear regression: 21.4%
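
One plausible reading of this gap, offered as an assumption and not something checked in this notebook: with 1000 training images and 1024 features, least squares has enough freedom to fit the one-hot targets exactly, so perfect training accuracy says little about generalization. The toy example below shows the effect in isolation by fitting labels that are pure noise; all sizes and names are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, dim, num_classes = 50, 200, 64, 5  # fewer samples than features

x_train = rng.normal(size=(n_train, dim))
x_test = rng.normal(size=(n_test, dim))
y_train = rng.integers(num_classes, size=n_train)  # labels carry no signal at all
y_test = rng.integers(num_classes, size=n_test)

one_hot = np.eye(num_classes)[y_train]
# Minimum-norm least-squares solution; with n_train < dim it interpolates the targets.
weights, *_ = np.linalg.lstsq(x_train, one_hot, rcond=None)

train_accuracy = np.mean((x_train @ weights).argmax(axis=1) == y_train)
test_accuracy = np.mean((x_test @ weights).argmax(axis=1) == y_test)
print(f"train: {train_accuracy:.1%}, test: {test_accuracy:.1%}")
```

Least squares on one-hot targets also optimizes a different loss than the cross-entropy used by the probe, so the two models are not expected to coincide even on the training set.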


# Step 1: 0-Shot Image Classification (CIFAR 100)
Largely the same code as before, but with a different dataset. Interestingly, a couple of the results differ.
## Task 1

Finding the image features is the first step. We will use the pretrained ResNet50 model to extract the features from 500 images.
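
Since this embedding step repeats the CIFAR-10 code, it could be factored into a helper. The sketch below is one possible shape for it; `embed_in_batches`, `encode_fn`, and `toy_encoder` are hypothetical names, and in this notebook `encode_fn` would wrap `torch.stack` and `model.encode_image` under `torch.no_grad()`:

```python
import numpy as np


def embed_in_batches(items, encode_fn, batch_size=10):
    """Encode items batch by batch and return L2-normalized row vectors."""
    chunks = [
        np.asarray(encode_fn(items[start:start + batch_size]), dtype=np.float32)
        for start in range(0, len(items), batch_size)
    ]
    features = np.concatenate(chunks, axis=0)
    return features / np.linalg.norm(features, axis=-1, keepdims=True)


# Stand-in encoder for demonstration only: maps each number to a 2-D vector.
def toy_encoder(batch):
    return [[value, 1.0] for value in batch]


features = embed_in_batches(list(range(25)), toy_encoder, batch_size=10)
print(features.shape)  # (25, 2)
```

Passing the encoder as a callable keeps the batching and normalization logic independent of the model, so the same helper would serve both the test and train samples.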

In [15]:
from torchvision.datasets import CIFAR100
cifar100_test = CIFAR100(os.path.expanduser("~/.cache"), transform=preprocess, download=True, train=False)

text_descriptions = [f"This is a photo of a {label}" for label in cifar100_test.classes]
text_tokens = clip.tokenize(text_descriptions)

with torch.no_grad():
    text_features = model.encode_text(text_tokens).float()
    text_features /= text_features.norm(dim=-1, keepdim=True)

Files already downloaded and verified


In [16]:
text_features.shape # (100, 1024) is for the 100 classes and 1024 dimensional embeddings

image_indicies = random.sample(range(len(cifar100_test)), 500) # This comes to 1/20th of the test set
images = [cifar100_test[idx] for idx in image_indicies] # 500 tuples of (image, label) where image is a preprocessed (3, 224, 224) tensor

test_labels = [image[1] for image in images]
images = [image[0] for image in images]

n = 10
imgs_batched_by_n = [ # 50 batches of (n, 3, 224, 224)
    torch.stack(images[i:i+n])
    for i in range(0, len(images), n)
] 


with torch.no_grad(): # Macbook air does not have fun here. Still only takes a minute and a half
    test_image_features = np.concatenate(
        [model.encode_image(img_batch).float() for img_batch in imgs_batched_by_n], axis=0
    ) # (500, 1024)

    test_image_features /= np.linalg.norm(test_image_features, axis=-1, keepdims=True) # I need the numpy version here because of my stacking

Dotting the normalized image features with the normalized text features gives us the cosine similarity between each image and each caption.

In [17]:
similarities = text_features.numpy() @ test_image_features.T

likely_labels = np.argmax(similarities, axis=0)
accuracy = sum(likely_labels == test_labels) / len(test_labels)
print(f"Accuracy: {accuracy:.1%}")

Accuracy: 35.2%


This substantially beats the majority-class baseline, which is just guessing the most common class in the sample. (Uniform random guessing over the 100 classes would be right about 1% of the time.) This baseline is:

In [18]:
mode = max(set(test_labels), key=test_labels.count)
print(f"The most common class in my random sample is '{cifar100_test.classes[mode]}', which accounts for {test_labels.count(mode)} of the 500 images.")
print(f"This means that a model that always predicts '{cifar100_test.classes[mode]}' would have an accuracy of {test_labels.count(mode)/len(test_labels):.1%}.")

The most common class in my random sample is 'dinosaur', which accounts for 11 of the 500 images.
This means that a model that always predicts 'dinosaur' would have an accuracy of 2.2%.


The 35.2% zero-shot accuracy is a pretty good starting point, but we can do better, even at 0-shot learning. Let us try some variants of the captions. We will pick a caption format based on performance on a sample of the CIFAR-100 training set, then see if it can beat our accuracy on the test set. First we get our train image embeddings:

In [19]:
cifar100_train = CIFAR100(os.path.expanduser("~/.cache"), transform=preprocess, download=True, train=True)
image_indicies = random.sample(range(len(cifar100_train)), 1000) # This comes to 1/50th of the train set
images = [cifar100_train[idx] for idx in image_indicies] # 1000 tuples of (image, label) where image is a preprocessed (3, 224, 224) tensor

train_labels = [image[1] for image in images]
images = [image[0] for image in images]

n = 10
imgs_batched_by_n = [ # 100 batches of (n, 3, 224, 224)
    torch.stack(images[i:i+n])
    for i in range(0, len(images), n)
] 
with torch.no_grad(): # MacBook Air does not have fun here. Still only takes 3 minutes
    train_image_features = np.concatenate(
        [model.encode_image(img_batch).float() for img_batch in imgs_batched_by_n], axis=0
    ) # (1000, 1024)

    train_image_features /= np.linalg.norm(train_image_features, axis=-1, keepdims=True) # I need the numpy version here because of my stacking


Files already downloaded and verified


Now let's try some different caption formats.
First, keeping it simple, what if we just use the name of the class with no effort to make a full sentence?

In [20]:
print(f"Accuracy on train set: {test_format('{class}', train_image_features, train_labels, cifar100_train.classes):.1%}")

Accuracy on train set: 28.7%


We're not quite there, so here are some other ideas:

In [21]:
formats = [
    "This is a photo of a {class}",
    "This is a photo of a {class}.",
    "The above image is a photo of a {class}.",
    "This picture is a photo of a {class}.",
    "Picture of a {class}.",
    "This image is a photo of a {class}.",
    "This is a {class}.",
    "This is an image of a {class}",
    "{class}",
    "This is a {class} picture.",
    "A {class}.",
    "A {class}"
]
training_accuracies = [test_format(format, train_image_features, train_labels, cifar100_train.classes) for format in formats]

In [22]:
print(f"Performance of our original caption format on the train set: {training_accuracies[0]:.1%}")

for format, accuracy in zip(formats[1:], training_accuracies[1:]):
    print(f"Using the format:   '{format}',\n we get a training accuracy of {accuracy:.1%}\n")

Performance of our original caption format on the train set: 40.1%
Using the format:   'This is a photo of a {class}.',
 we get a training accuracy of 41.1%

Using the format:   'The above image is a photo of a {class}.',
 we get a training accuracy of 41.0%

Using the format:   'This picture is a photo of a {class}.',
 we get a training accuracy of 38.2%

Using the format:   'Picture of a {class}.',
 we get a training accuracy of 40.9%

Using the format:   'This image is a photo of a {class}.',
 we get a training accuracy of 38.5%

Using the format:   'This is a {class}.',
 we get a training accuracy of 40.8%

Using the format:   'This is an image of a {class}',
 we get a training accuracy of 41.3%

Using the format:   '{class}',
 we get a training accuracy of 28.7%

Using the format:   'This is a {class} picture.',
 we get a training accuracy of 33.9%

Using the format:   'A {class}.',
 we get a training accuracy of 38.0%

Using the format:   'A {class}',
 we get a training accuracy 

Thus we see that our original format (40.1%) is marginally outperformed by several variants, including ```'The above image is a photo of a {class}.'``` (41.0%). Applying this format to our testing data, we get an accuracy of:

In [23]:
print(f"Accuracy on test set: {test_format('The above image is a photo of a {class}.', test_image_features, test_labels, cifar100_test.classes):.1%}")

Accuracy on test set: 37.6%


Indeed, this marginally beats our original zero-shot accuracy (37.6% versus 35.2%). This is interesting because adding a period after the class name will likely affect the tokenization of the class name. Likely our class names are now being tokenized as more than one token, which I would not expect to increase performance. I tried a couple of other random seeds and consistently got the same result.

IMPORTANT NOTE: The prompt format that worked well for one of {CIFAR-10, CIFAR-100} did not work well for the other. I tried a couple of different random seeds and consistently got the same result. I am not sure why this is the case, but it is interesting.
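
The disagreement between the two datasets can be read directly off the training accuracies printed above. The sketch below ranks each dataset's highest-scoring listed format on the other dataset. The numbers are copied from the two output listings; the twelfth format, `'A {class}'`, is left out because its accuracy is cut off in both outputs, and the helper `rank_of` is a hypothetical name:

```python
formats = [
    "This is a photo of a {class}",
    "This is a photo of a {class}.",
    "The above image is a photo of a {class}.",
    "This picture is a photo of a {class}.",
    "Picture of a {class}.",
    "This image is a photo of a {class}.",
    "This is a {class}.",
    "This is an image of a {class}",
    "{class}",
    "This is a {class} picture.",
    "A {class}.",
]
# Training accuracies (in percent) as printed in the two sections above.
cifar10 = [72.6, 71.5, 73.7, 74.4, 71.8, 73.4, 69.0, 71.3, 68.6, 65.5, 71.6]
cifar100 = [40.1, 41.1, 41.0, 38.2, 40.9, 38.5, 40.8, 41.3, 28.7, 33.9, 38.0]


def rank_of(index, scores):
    """1-based rank of scores[index], where rank 1 is the highest score."""
    return 1 + sum(score > scores[index] for score in scores)


best10 = max(range(len(formats)), key=lambda i: cifar10[i])
best100 = max(range(len(formats)), key=lambda i: cifar100[i])

print(f"{formats[best10]!r} ranks {rank_of(best10, cifar100)} of {len(formats)} on CIFAR-100")
print(f"{formats[best100]!r} ranks {rank_of(best100, cifar10)} of {len(formats)} on CIFAR-10")
```

This only compares rankings on one random sample per dataset, so it illustrates the note above and does not establish why the formats behave differently.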

## Task 2: Linear Probing

Turns out those 1000 training images we already embedded will prove very useful! Let's use them to train up a little linear probe! In training, we often hit 100% accuracy, which suggests that we should probably also try closed-form linear regression. I did have to make some interesting hyperparameter choices here, though. I feel that using a grid search to perfect my hyperparameters, with even more images picked out of the training set as validation data, would be in some ways cheating. That is, it would allow the hyperparameters that just happen to work well on the validation set for this random seed to be chosen, and that validation set will correspond closely to the test data. So instead I pick hyperparameters so that the classifier I build works well not only on CIFAR-10 but also on CIFAR-100, where working well means a near-monotonic rise in training accuracy where not too many of the last epochs are done at perfect accuracy. I also believe that this data is realizable, so I stick with the large batch size of 1/10th of my training data.

In [24]:
# Building a dataloader from the train_image_features and train_labels

training_dataset = torch.utils.data.TensorDataset(torch.tensor(train_image_features), torch.tensor(train_labels))
train_loader = torch.utils.data.DataLoader(training_dataset, batch_size=100, shuffle=True)

linear_probe = torch.nn.Linear(1024, 100)

optimizer = torch.optim.Adam(linear_probe.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(20):
    for images, labels in train_loader:
        optimizer.zero_grad()
        logits = linear_probe(images)
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()

    # Note: the loss and accuracy printed here come from the last mini-batch of the epoch only
    print(f"Epoch {epoch} training loss: {loss.item():.2f}      Training accuracy: {sum(torch.argmax(logits, axis=1) == labels).item()/len(labels):.1%}")


Epoch 0 training loss: 3.64      Training accuracy: 28.0%
Epoch 1 training loss: 2.17      Training accuracy: 46.0%
Epoch 2 training loss: 1.57      Training accuracy: 73.0%
Epoch 3 training loss: 1.24      Training accuracy: 72.0%
Epoch 4 training loss: 0.91      Training accuracy: 83.0%
Epoch 5 training loss: 0.73      Training accuracy: 89.0%
Epoch 6 training loss: 0.58      Training accuracy: 91.0%
Epoch 7 training loss: 0.52      Training accuracy: 96.0%
Epoch 8 training loss: 0.40      Training accuracy: 95.0%
Epoch 9 training loss: 0.35      Training accuracy: 96.0%
Epoch 10 training loss: 0.29      Training accuracy: 100.0%
Epoch 11 training loss: 0.28      Training accuracy: 98.0%
Epoch 12 training loss: 0.21      Training accuracy: 99.0%
Epoch 13 training loss: 0.18      Training accuracy: 99.0%
Epoch 14 training loss: 0.19      Training accuracy: 100.0%
Epoch 15 training loss: 0.13      Training accuracy: 100.0%
Epoch 16 training loss: 0.14      Training accuracy: 100.0%
Epo

In [25]:
from sklearn.linear_model import LinearRegression

labels_1hot = np.zeros((len(train_labels), 100))
labels_1hot[np.arange(len(train_labels)), train_labels] = 1

reg = LinearRegression().fit(train_image_features, labels_1hot)
print(f"Accuracy on train set: {sum(np.argmax(reg.predict(train_image_features), axis=1) == train_labels)/len(train_labels):.1%}")

Accuracy on train set: 100.0%


We know that the optimization of the linear probe is convex, so it is tempting to just do closed-form linear regression (here, least squares on one-hot labels). This is a very simple operation but, interestingly, it does not yield results as good on the test set.

In [26]:
linear_probe.eval()
test_image_features_tensor = torch.tensor(test_image_features)
probe_logits = linear_probe(test_image_features_tensor).detach().numpy()
linear_model_accuracy = sum(np.argmax(probe_logits, axis=1) == test_labels)/len(test_labels)
regression_accuracy = sum(np.argmax(reg.predict(test_image_features), axis=1) == test_labels)/len(test_labels)

print(f"Accuracy of linear probe: {linear_model_accuracy:.1%}")
print(f"Accuracy of linear regression: {regression_accuracy:.1%}")

Accuracy of linear probe: 44.4%
Accuracy of linear regression: 4.4%


# Step 2: Socratic Method for Image Captioning

Let's start by building a little infrastructure to load the Flickr Dataset, then we will proceed to consult Socrates.