# Embedding Multimodal Data for Similarity Search Using HuggingFace Transformers, Datasets, and FAISS

Embeddings are semantically meaningful compressions of information. They can be used for similarity search, zero-shot classification, or training a new model.

Use cases for similarity search include:
- searching for similar products in e-commerce,
- content search on social media,
- and similar retrieval tasks.

## Setup

In [None]:
!pip install -q datasets faiss-gpu transformers sentencepiece

In this example, we will use the [`clip`](https://huggingface.co/openai/clip-vit-base-patch16) model to extract the features.

**CLIP** introduced joint training of a text encoder and an image encoder to connect the two modalities in a shared embedding space.
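
To make the idea of connecting two modalities concrete, the sketch below compares one "text" vector against two "image" vectors using cosine similarity. The vectors are made-up 4-dimensional stand-ins, not real CLIP outputs, and `cosine_similarity` is a hypothetical helper written for this illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up stand-ins for embeddings living in one shared text-image space.
text_vec = np.array([0.9, 0.1, 0.0, 0.2])
matching_image_vec = np.array([0.8, 0.2, 0.1, 0.1])
unrelated_image_vec = np.array([-0.1, 0.9, 0.7, 0.0])

sim_match = cosine_similarity(text_vec, matching_image_vec)
sim_unrelated = cosine_similarity(text_vec, unrelated_image_vec)

# The matching pair should score higher than the unrelated pair.
print(sim_match > sim_unrelated)
```

Because both encoders write into the same space, a text query can be scored directly against image vectors in this way.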

In [None]:
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer
import faiss
import numpy as np

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# tokenizer that prepares text inputs
tokenizer = AutoTokenizer.from_pretrained('openai/clip-vit-base-patch16')
# image processor that prepares image inputs
processor = AutoImageProcessor.from_pretrained('openai/clip-vit-base-patch16')
# CLIP model containing both the text encoder and the image encoder
model = AutoModel.from_pretrained('openai/clip-vit-base-patch16').to(device)

We will use a small captioning dataset, [`jmhessel/newyorker_caption_contest`](https://huggingface.co/datasets/jmhessel/newyorker_caption_contest).

In [None]:
from datasets import load_dataset

ds = load_dataset('jmhessel/newyorker_caption_contest', 'explanation')

In [None]:
ds

In [None]:
ds['train']

In [None]:
ds['train'][0]['image']

We do not have to write much code to embed examples or build an index. The Hugging Face `datasets` library's FAISS integration abstracts these steps. We can use the `map` method to create a new column with the text embedding of each example:

In [None]:
dataset = ds['train']
ds_with_embeddings = dataset.map(
    lambda example: {
        'embeddings': model.get_text_features(
            **tokenizer(
                [example['image_description']],
                truncation=True,
                return_tensors='pt'
            ).to(device)
        )[0].detach().cpu().numpy()
    }
)
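
Calling the model once per example is simple but can be slow on larger datasets. `Dataset.map` also accepts `batched=True`, in which case the function receives a dict of lists instead of a single example. The sketch below shows the shape such a function would take. `fake_embed_texts` is a hypothetical stand-in for the tokenizer plus `model.get_text_features` call, used only so the snippet runs without downloading a model.

```python
import numpy as np

def fake_embed_texts(texts, dim=8):
    """Hypothetical stand-in for tokenizer + model.get_text_features.

    Returns one deterministic pseudo-embedding per input string.
    """
    rows = []
    for text in texts:
        seed = sum(ord(ch) for ch in text)  # simple deterministic seed
        rows.append(np.random.default_rng(seed).standard_normal(dim))
    return np.stack(rows).astype(np.float32)

def embed_batch(batch):
    """Batched map function: takes a dict of lists, returns a dict of lists."""
    vectors = fake_embed_texts(batch['image_description'])
    return {'embeddings': [row for row in vectors]}

# A plain dict shaped like the batch that map(..., batched=True) passes in.
batch = {'image_description': ['a snowy day', 'two people in an office']}
out = embed_batch(batch)
print(len(out['embeddings']), out['embeddings'][0].shape)
```

With a real model in place of the stand-in, this could be applied as `dataset.map(embed_batch, batched=True, batch_size=32)`; the batch size of 32 is an arbitrary choice, not a recommendation from the original notebook.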

We can do the same with the images to get image embeddings.

In [None]:
ds_with_embeddings = ds_with_embeddings.map(
    lambda example: {
        'image_embeddings': model.get_image_features(
            **processor(
                [example['image']],
                return_tensors='pt'
            ).to(device)
        )[0].detach().cpu().numpy()
    }
)

Now we can create an index for each column:

In [None]:
ds_with_embeddings.add_faiss_index(column='embeddings')
ds_with_embeddings.add_faiss_index(column='image_embeddings')
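
To our understanding, when neither `string_factory` nor `metric_type` is passed, `add_faiss_index` builds a flat (exhaustive) index with the L2 metric; treat that as an assumption and check it against your installed version. A flat index compares the query with every stored vector. The NumPy sketch below mimics that behaviour on random stand-in data; `flat_l2_search` is a hypothetical helper, not part of FAISS or `datasets`.

```python
import numpy as np

def flat_l2_search(vectors, query, k=1):
    """Exhaustive nearest-neighbour search using squared L2 distance."""
    diffs = vectors - query                      # (n, d) broadcast against (d,)
    dists = np.einsum('ij,ij->i', diffs, diffs)  # squared distance per row
    order = np.argsort(dists)[:k]                # indices of the k closest rows
    return dists[order], order

rng = np.random.default_rng(0)
vectors = rng.standard_normal((100, 16)).astype(np.float32)  # random stand-in data
query = vectors[42] + 0.01                                   # slightly perturbed copy of row 42

scores, indices = flat_l2_search(vectors, query, k=3)
print(indices[0])  # row 42 is the closest match
```

Exhaustive search is exact and works well for a dataset of this size; approximate index types trade some accuracy for speed on much larger collections.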

## Querying the data with text prompts

In [None]:
prompt = 'a snowy day'
prompt_embedding = model.get_text_features(
    **tokenizer(
        [prompt],
        truncation=True,
        return_tensors='pt'
    ).to(device)
)[0].detach().cpu().numpy()

scores, retrieved_examples = ds_with_embeddings.get_nearest_examples(
    'embeddings',
    prompt_embedding,
    k=1
)
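
How to read the returned `scores` depends on the index metric. Assuming a flat L2 index, FAISS reports squared Euclidean distances, so a smaller score means a closer match. For unit-length vectors, squared L2 distance and cosine similarity are linked by `||a - b||^2 = 2 - 2 * cos(a, b)`, which the sketch below checks on random stand-in vectors (not real CLIP embeddings).

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(16)
b = rng.standard_normal(16)

# Normalise both vectors to unit length.
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

squared_l2 = float(np.sum((a - b) ** 2))
cosine = float(np.dot(a, b))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.isclose(squared_l2, 2.0 - 2.0 * cosine))
```

It follows that if embeddings are normalised before indexing, ranking by L2 distance gives the same order as ranking by cosine similarity.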

In [None]:
def downscale_images(image):
    width = 200
    ratio = width / float(image.size[0])
    height = int((float(image.size[1]) * float(ratio)))
    img = image.resize((width, height), Image.Resampling.LANCZOS)

    return img

In [None]:
images = [downscale_images(image) for image in retrieved_examples['image']]
# see the closest text and image
print(retrieved_examples['image_description'])
display(images[0])

## Querying the data with image prompts

In [None]:
import requests

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/beaver.png"
image = Image.open(requests.get(url, stream=True).raw)
display(downscale_images(image))

In [None]:
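img_embedding = model.get_image_features(
    **processor(
        [image],
        return_tensors='pt'
    ).to(device)
)[0].detach().cpu().numpy()

# retrieve the nearest example using the image index
scores, retrieved_examples = ds_with_embeddings.get_nearest_examples(
    'image_embeddings',
    img_embedding,
    k=1
)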

In [None]:
images = [downscale_images(image) for image in retrieved_examples['image']]
# see the closest text and image
print(retrieved_examples['image_description'])
display(images[0])

## Saving, pushing, and loading the embeddings

Save the embeddings locally:

In [None]:
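import os

# make sure the target folder exists before writing the index files
os.makedirs('embeddings', exist_ok=True)
ds_with_embeddings.save_faiss_index(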
    'embeddings',
    'embeddings/embeddings.faiss'
)
ds_with_embeddings.save_faiss_index(
    'image_embeddings',
    'embeddings/image_embeddings.faiss'
)

Push the embeddings to the Hub:

In [None]:
from huggingface_hub import HfApi, snapshot_download

api = HfApi()
api.create_repo('<username>/faiss_embeddings', repo_type='dataset')
api.upload_folder(
    folder_path='embeddings',
    repo_id='<username>/faiss_embeddings',
    repo_type='dataset'
)

Download the embeddings from the Hub:

In [None]:
snapshot_download(
    repo_id="<username>/faiss_embeddings",
    repo_type="dataset",
    local_dir="downloaded_embeddings"
)

We can also attach a saved index to a dataset that has no embedding column using `load_faiss_index`:

In [None]:
ds = ds['train']
ds.load_faiss_index(
    'embeddings',
    './downloaded_embeddings/embeddings.faiss'
)

In [None]:
# test inference
prompt = 'people under the rain'
prompt_embedding = model.get_text_features(
    **tokenizer(
        [prompt],
        truncation=True,
        return_tensors='pt'
    ).to(device)
)[0].detach().cpu().numpy()

scores, retrieved_examples = ds.get_nearest_examples(
    'embeddings',
    prompt_embedding,
    k=1
)

In [None]:
display(retrieved_examples['image'][0])