### Image Similarity

This notebook will lead you through:

1. Downloading the image embedding model `vit_base_patch16_224.augreg2_in21k_ft_in1k` from Hugging Face using the Python `timm` library.
2. Passing digitized images to the model and retrieving their embeddings.
3. Measuring the cosines between embeddings to check that they match intuitions about how similar the content of those pictures is.

If you are running this notebook in an environment that may not have all the prerequisites installed, run the line below first. It will install the necessary libraries if needed:

In [None]:
!pip install numpy torch timm pillow requests

First, we import the Python PIL library, which contains functions for manipulating image files, and the `requests` library, which we will use to fetch images from the internet by their URI:

In [None]:
from PIL import Image
import requests

Next, we are going to fetch a small collection of images: two of apples, one of an orange, and one of a dog.

In [None]:
image_apple_1 = Image.open(requests.get('https://raw.githubusercontent.com/jina-ai/workshops/gymhouse/gymhouse/images/apple.jpg', stream=True).raw)
image_apple_2 = Image.open(requests.get('https://raw.githubusercontent.com/jina-ai/workshops/gymhouse/gymhouse/images/apple2.png', stream=True).raw)
image_orange = Image.open(requests.get('https://raw.githubusercontent.com/jina-ai/workshops/gymhouse/gymhouse/images/orange.png', stream=True).raw)
image_dog = Image.open(requests.get('https://raw.githubusercontent.com/jina-ai/workshops/gymhouse/gymhouse/images/dog.png', stream=True).raw)
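
The calls above hand the raw HTTP stream straight to PIL, so a failed download only shows up as a confusing image-decoding error. A slightly more defensive variant is sketched below. The helper names `image_from_bytes` and `fetch_image` and the 10-second timeout are illustrative assumptions, not part of the original notebook:

```python
from io import BytesIO

import requests
from PIL import Image


def image_from_bytes(data):
    """Decode raw image bytes and convert the result to 3-channel RGB."""
    return Image.open(BytesIO(data)).convert("RGB")


def fetch_image(url, timeout=10):
    """Download an image, raising an error on HTTP failures (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of passing an error page to PIL
    return image_from_bytes(response.content)
```

With this in place, an image could be loaded with `fetch_image('<your image URI>')`. Converting to RGB also drops any alpha channel that a PNG file may carry.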

You can inspect an individual image by entering its name in a notebook cell and running the cell (Shift+Enter):

In [None]:
image_apple_1

Now we will import the `timm` library, which loads and configures the model, and use it to download the `vit_base_patch16_224.augreg2_in21k_ft_in1k` model from Hugging Face.

In [None]:
import timm

model = timm.create_model('vit_base_patch16_224.augreg2_in21k_ft_in1k', pretrained=True, num_classes=0)
model = model.eval()  # switch to inference mode (disables training-only behavior such as dropout)

Passing `num_classes=0` removes the final layer that classifies the output into the 1,000 ImageNet categories the model was trained on, so the model returns the embedding vector directly.

Images also require some pre-processing before they can become input to the model. We will use the model’s data configuration to create a function `transformer()` that converts a PIL image into an appropriately sized and normalized input tensor for the model:

In [None]:
data_config = timm.data.resolve_model_data_config(model)
tf = timm.data.create_transform(**data_config, is_training=False)

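def transformer(image):
    # Ensure 3-channel RGB (PNG files may carry an alpha channel), then add a batch dimension.
    return tf(image.convert('RGB')).unsqueeze(0)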

You can see it in action by giving `transformer()` an image:

In [None]:
transformer(image_dog)
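
To give a feel for what the pre-processing produces, here is a rough NumPy sketch of the final steps only: scaling pixel values, normalizing each channel, and reordering the axes. The real `timm` transform also resizes and crops the image first. The mean and standard deviation of 0.5 used here are assumptions for illustration; the actual values are the ones stored in `data_config`.

```python
import numpy as np

# Assumed normalization constants; the real ones come from `data_config`.
MEAN = np.array([0.5, 0.5, 0.5], dtype=np.float32)
STD = np.array([0.5, 0.5, 0.5], dtype=np.float32)


def to_model_input(pixels):
    """Convert an H x W x 3 uint8 array into a 1 x 3 x H x W float32 array."""
    scaled = pixels.astype(np.float32) / 255.0      # [0, 255] -> [0, 1]
    normalized = (scaled - MEAN) / STD              # per-channel normalization
    channels_first = normalized.transpose(2, 0, 1)  # HWC -> CHW
    return channels_first[np.newaxis, ...]          # add a batch dimension


# A random stand-in for an already resized 224 x 224 RGB image.
rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
batch = to_model_input(fake_image)
print(batch.shape)  # (1, 3, 224, 224)
```

The leading dimension of size 1 is the batch dimension, which is what `.unsqueeze(0)` adds in `transformer()`.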

To get an embedding, we just pass this to the model:

(Don’t worry about the `.squeeze(0).detach().numpy()` part. It removes the batch dimension, detaches the result from PyTorch’s gradient tracking, and converts it to a NumPy array, which is more convenient for us to work with. Some AI software does this automatically; other software does not.)

In [None]:
model(transformer(image_dog)).squeeze(0).detach().numpy()

Each embedding from this model has 768 dimensions.

In [None]:
embedding_dog = model(transformer(image_dog)).squeeze(0).detach().numpy()
len(embedding_dog)

Let's get embeddings for the remaining three images:

In [None]:
embedding_apple1 = model(transformer(image_apple_1)).squeeze(0).detach().numpy()
embedding_apple2 = model(transformer(image_apple_2)).squeeze(0).detach().numpy()
embedding_orange = model(transformer(image_orange)).squeeze(0).detach().numpy()
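
The same expression is now repeated for every image, so it could be wrapped in a small helper. The sketch below is one possible way to do that. The name `embed` is hypothetical, and using `torch.no_grad()` in place of `.detach()` is a substitution on our part: it tells PyTorch not to track gradients at all, which is the usual choice when a model is only used for inference.

```python
import torch


def embed(model, preprocess, image):
    """Return a 1-D NumPy embedding for a single image.

    `model` and `preprocess` stand for the `model` and `tf` objects
    created earlier in this notebook.
    """
    batch = preprocess(image).unsqueeze(0)  # add the batch dimension
    with torch.no_grad():                   # inference only: skip gradient tracking
        features = model(batch)
    return features.squeeze(0).numpy()      # drop the batch dimension, convert to NumPy
```

In this notebook it would be called as, for example, `embed(model, tf, image_dog)`.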

Define the cosine function over pairs of vectors, so we can compare embeddings:

In [None]:
from numpy import dot
from numpy.linalg import norm

def cosine(a, b):
    return dot(a,b)/(norm(a)*norm(b))
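
Before applying the cosine to real embeddings, it can help to check it on a few toy vectors where the answer is known in advance. The vectors below are made up for illustration, and the function is restated so the snippet runs on its own:

```python
import numpy as np


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


x = np.array([1.0, 2.0, 3.0])

print(cosine(x, 2 * x))  # same direction, different length: close to 1
print(cosine(x, -x))     # opposite direction: close to -1
print(cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # perpendicular: 0
```

The first case shows that the cosine ignores vector length and depends only on direction, which is why it is a convenient way to compare embeddings.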

The two apple images have a pretty high cosine:

In [None]:
cosine(embedding_apple1, embedding_apple2)

At least when compared to the orange image:

In [None]:
print(f"Apple 1 to Orange: {cosine(embedding_apple1, embedding_orange)}")
print(f"Apple 2 to Orange: {cosine(embedding_apple2, embedding_orange)}")

And all three fruit pictures are far from the dog picture:

In [None]:
print(f"Apple 1 to Dog: {cosine(embedding_apple1, embedding_dog)}")
print(f"Apple 2 to Dog: {cosine(embedding_apple2, embedding_dog)}")
print(f"Orange to Dog: {cosine(embedding_orange, embedding_dog)}")
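
With more than a handful of images, printing each pair by hand gets tedious. One option is to compute all pairwise cosines at once as a matrix. The sketch below uses small made-up vectors in place of real embeddings, and the name `similarity_matrix` is hypothetical. In the notebook you would pass `[embedding_apple1, embedding_apple2, embedding_orange, embedding_dog]` instead.

```python
import numpy as np


def similarity_matrix(vectors):
    """Return the matrix of pairwise cosines for a list of 1-D vectors."""
    stacked = np.stack(vectors)                                      # shape: (n, d)
    unit = stacked / np.linalg.norm(stacked, axis=1, keepdims=True)  # scale each row to length 1
    return unit @ unit.T                                             # dot products of unit vectors


# Toy stand-ins for embeddings, chosen only to illustrate the output layout.
labels = ["apple 1", "apple 2", "orange", "dog"]
toy_embeddings = [
    np.array([1.0, 0.2, 0.0]),
    np.array([0.9, 0.3, 0.0]),
    np.array([0.7, 0.7, 0.0]),
    np.array([0.0, 0.1, 1.0]),
]

sims = similarity_matrix(toy_embeddings)
for label, row in zip(labels, sims):
    print(f"{label:>8}", np.round(row, 2))
```

Each row lists one image's cosine to every image, so the diagonal is always 1 and the matrix is symmetric.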


You can try this with your own images to see if it matches your intuitions.

To load an image from a URI, you can use:

```python
your_image = Image.open(requests.get('<your image URI>', stream=True).raw)
```

To load one from a file, use:

```python
your_image = Image.open('<your image file path>')
```