# Multimodal Embedding Models

## Setup

### Define a multimodal embedding model

This exercise uses a model called [**CLIP ViT-B/32 - LAION-2B**](https://huggingface.co/laion/CLIP-ViT-B-32-laion2B-s34B-b79K), a relatively small but capable model that embeds images and text into a shared vector space. To keep the code similar to previous embedding exercises, we'll access the model via the experimental LangChain bindings for [OpenCLIP](https://github.com/mlfoundations/open_clip), an open-source implementation of OpenAI's CLIP neural network. OpenCLIP offers several model architectures and pretrained checkpoints; this exercise uses the `ViT-B-32` architecture with the `laion2b_s34b_b79k` checkpoint.

In [None]:
from PIL import Image
from IPython.display import display
from langchain_experimental.open_clip import OpenCLIPEmbeddings

embeddings = OpenCLIPEmbeddings(model_name="ViT-B-32", checkpoint="laion2b_s34b_b79k")
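To see which other architectures and checkpoints could be passed as `model_name` and `checkpoint`, the underlying `open_clip` package can list them. This is a minimal sketch, assuming the `open_clip_torch` package is installed in the notebook environment; the variable names `available` and `vit_b_32` are illustrative only.

```python
# Assumption: the open_clip_torch package (imported as open_clip) is installed.
try:
    import open_clip

    # Each entry is a (model_name, checkpoint_tag) tuple.
    available = open_clip.list_pretrained()
except ImportError:
    available = []  # package not installed, so there is nothing to list

# Collect the checkpoint tags published for the ViT-B-32 architecture.
vit_b_32 = [tag for name, tag in available if name == "ViT-B-32"]

print(f"{len(available)} model/checkpoint pairs found")
print(f"ViT-B-32 checkpoints: {vit_b_32}")
```

Swapping in a larger architecture generally changes the embedding size, so embeddings from different models should not be compared with each other.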

## Generate embeddings

### Generate an embedding from a sample image

In [None]:
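file_path = "../../images/coffee.png"
# Keep a resized copy of the image so it can be displayed alongside the results
img = Image.open(file_path).convert('RGB').resize((256,256))
# embed_image takes a list of file paths and returns one embedding per image
image_embedding = embeddings.embed_image([file_path])[0]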

### Show the number of dimensions of the image embedding

In [None]:
len(image_embedding)

### Generate embeddings from text strings

In [None]:
texts = [
  "cup of black coffee",
  "laptop computer",
  "caffe latte",
  "caffe latte on a plate in front of a laptop",
  "laptop showing code",
  "laptop showing a movie",
  "laptop on a wooden table",
  "laptop on an airplane tray table",
  "Godzilla riding a roller coaster"
]
text_embeddings = embeddings.embed_documents(texts)

### Show the number of dimensions of one of the text embeddings

In [None]:
len(text_embeddings[0])

## Compare image and text embeddings

In [None]:
from langchain_community.utils.math import cosine_similarity

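# cosine_similarity compares two 2D arrays and returns a matrix of pairwise scores,
# so each vector is wrapped in a list and the single score is read from [0][0]
results = [
    { 'text': text, 'similarity': cosine_similarity([image_embedding], [text_embeddings[index]])[0][0] }
    for index, text in enumerate(texts)
]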
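The comparison above relies on cosine similarity, which measures the angle between two vectors and ignores their lengths. The following is a minimal pure-Python sketch of that calculation, not the library's implementation; the function name `cosine` and the toy 3-dimensional vectors are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # choice made for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


# Toy vectors, not real CLIP embeddings.
image_vec = [1.0, 2.0, 0.0]
text_close = [2.0, 4.0, 0.0]  # same direction as image_vec, different length
text_far = [0.0, 0.0, 3.0]    # orthogonal to image_vec

print(cosine(image_vec, text_close))
print(cosine(image_vec, text_far))
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, which is why the text that best describes the image should end up with the highest score.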

### Sort results with higher similarity first

In [None]:
results.sort(key=lambda x: x['similarity'], reverse=True)

In [None]:
display(img)
for result in results:
    print(f'Similarity between image and "{result["text"]}": {result["similarity"]}')

## Exercises

- Take what you've learned from `embeddings/01_comparing_embeddings` and experiment with comparing embeddings of images and/or text inputs.
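One possible starting point is to reverse the direction of the comparison: embed a single text query and rank several images against it. The sketch below uses toy vectors in place of real embeddings; the helper `rank_images` and the file names are hypothetical. In the notebook, the vectors would instead come from `embeddings.embed_image(...)` and from a text embedding method such as `embeddings.embed_documents(...)`.

```python
import math


def rank_images(query_vec, image_vecs):
    """Return (name, similarity) pairs sorted with the most similar image first."""

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norms

    scored = [(name, cosine(query_vec, vec)) for name, vec in image_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical file names with toy 3-dimensional vectors standing in for embeddings.
image_vecs = {
    "coffee.png": [0.9, 0.1, 0.0],
    "laptop.png": [0.1, 0.9, 0.1],
    "godzilla.png": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.0]  # stand-in for the embedding of a text query

ranking = rank_images(query_vec, image_vecs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```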

### Discussion Questions

- Images and text "living" in the same semantic space is powerful! What are some of the implications of adding multimodal capability to an embedding model?
- Search around the Internet for other modalities that people are talking about. Do any other modalities look intriguing for your collections or materials?