[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jamescalam/applied-ml-minicourse/blob/main/code/05-diffusion-search.ipynb)

# 05: Applying Semantic Search to Diffusion

Semantic search is best known from text retrieval, but it can be applied to any modality that carries human meaning. We could search across [text and images based on their *meaningful* content](https://www.pinecone.io/learn/clip/), perform [question-answering across video](https://www.pinecone.io/learn/openai-whisper/), and more.

The question now is, how do we apply search to make diffusion more efficient?

Ideally, we want each diffusion image to be represented by a vector. When a user requests a new image, we convert their prompt into a vector and compare it against the vectors of past generations. If something similar has already been generated, we can return that rather than waiting through the lengthy diffusion process.
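The lookup described above can be sketched in a few lines. This is only an illustration under stated assumptions: the 3-d vectors stand in for real prompt embeddings, the file names are hypothetical, and the `0.9` similarity threshold is an arbitrary value that would need tuning on real data.

```python
import numpy as np

def cosine_similarity(query, stored):
    """Cosine similarity between one query vector and each row of `stored`."""
    query = query / np.linalg.norm(query)
    stored = stored / np.linalg.norm(stored, axis=1, keepdims=True)
    return stored @ query

def lookup(query_vec, past_vecs, past_images, threshold=0.9):
    """Return a cached image if a past vector is similar enough, else None."""
    if len(past_vecs) == 0:
        return None
    scores = cosine_similarity(query_vec, np.asarray(past_vecs))
    best = int(np.argmax(scores))
    # only reuse a past generation when it clears the (assumed) threshold
    return past_images[best] if scores[best] >= threshold else None

# toy 3-d vectors standing in for real prompt embeddings
past_vecs = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
past_images = ["surfer.png", "mountain.png"]  # hypothetical file names

hit = lookup(np.array([0.9, 0.1, 0.0]), past_vecs, past_images)
miss = lookup(np.array([0.0, 0.0, 1.0]), past_vecs, past_images)
```

Here `hit` resolves to a cached image because the query points in nearly the same direction as a stored vector, while `miss` is `None` and would fall through to the full diffusion process.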

## Finding Diffusion Vectors

Let's start by initializing a stable diffusion pipeline.

In [None]:
import torch
from diffusers import StableDiffusionPipeline

# set the hardware device
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# init all of the pipeline models and move them to the selected device
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    use_auth_token="<<ACCESS_TOKEN>>"
)
pipe.to(device)

When looking at the diffusion pipeline, it isn't immediately clear where we can extract meaningful vectors from:

![Stable Diffusion pipeline](https://github.com/jamescalam/applied-ml-minicourse/raw/main/images/stable-diffusion-pipeline-annotated.png)

There is a model called *CLIP* that acts as one of the earliest steps in the pipeline. CLIP was trained with a contrastive loss function over *(text, image)* pairs, so that matching text and images end up close together in vector space.

That means CLIP should, in theory, be able to produce the meaningful vectors we need. However, we can see that CLIP *isn't* outputting a single vector embedding. CLIP is outputting *77* vectors, one for each token position in the padded input (77 is CLIP's maximum sequence length).

We can see this *77x768*-dimensional output when processing a prompt through the first two steps of the pipeline (the `tokenizer` and CLIP `text_encoder`):

In [None]:
prompt = "a person surfing"

# tokenize the prompt
prompt_inputs = pipe.tokenizer(
    prompt, return_tensors='pt',
    padding='max_length'
).to(device)
# create prompt encoding
out = pipe.text_encoder(**prompt_inputs)
# extract the per-token CLIP embeddings (one vector per token position)
prompt_embeds = out['last_hidden_state']
prompt_embeds.shape

To transform these 77 vectors into a single vector we can perform a *mean pooling* operation.

<img src="https://github.com/jamescalam/applied-ml-minicourse/raw/main/images/mean-pooling.png" style="width:60%">

Applying mean pooling means we average the values of the 77 token vectors along each of the 768 dimensions, producing a single 768-d vector.
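As a minimal sketch of that operation, the snippet below averages over the token axis. It uses a random NumPy array as a stand-in for the `(batch, tokens, dimensions)` output of the text encoder, so it runs without loading any model; the array contents are an assumption made purely for illustration.

```python
import numpy as np

# random stand-in for the (batch, tokens, dims) text encoder output
rng = np.random.default_rng(0)
token_embeds = rng.normal(size=(1, 77, 768))

# average over the token axis (axis=1), leaving one vector per prompt
pooled = token_embeds.mean(axis=1)
print(pooled.shape)  # (1, 768)
```

Note that a plain mean like this also averages in the padding positions. A common refinement is to weight the mean by the attention mask, which is not shown here.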

<img src="https://github.com/jamescalam/applied-ml-minicourse/raw/main/images/vector-extraction.png" style="width:80%">

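Fortunately, CLIP already produces a single pooled vector by default, even though the diffusion pipeline does not use it. Strictly speaking it is not a mean: in the Hugging Face implementation, the CLIP text encoder takes the hidden state at the end-of-text token position. Either way, the result is one 768-d vector per prompt, returned in the `'pooler_output'` tensor: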

In [None]:
# extract *pooled* CLIP embedding
prompt_embeds = out['pooler_output']
prompt_embeds.shape

Because these pooled *"prompt vectors"* are created before the long diffusion process begins, we can build them quickly and use them to retrieve any past, similar generations.

<img src="https://github.com/jamescalam/applied-ml-minicourse/raw/main/images/creating-embeddings.png" style="width:60%">

These *prompt vectors* are one side of what we will need to store from each diffusion generation. The other is the images themselves. For that, we will need to use another service, which we will explore in the next chapters.