# ImageBind example

**Environment**

- Raspberry Pi 5 (8GB) running Raspberry Pi OS (Bookworm, 64-bit), booting from an SSD.
- Swap size on the Pi is set to 2048MB.
- The GUI is disabled on the Pi.
- The ImageBind model is pre-downloaded to the path ImageBind expects (see Directory structure below).
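
The swap requirement can be checked before loading the model. The following is a minimal sketch, assuming a Linux-style `/proc/meminfo`; the helper name `swap_total_mb` is hypothetical and not part of ImageBind.

```python
from pathlib import Path


def swap_total_mb(meminfo_text: str) -> int:
    """Return SwapTotal from /proc/meminfo-style text in MB (0 if the field is absent)."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # The line looks like "SwapTotal:    2097148 kB"
            return int(line.split()[1]) // 1024
    return 0


# Assumption: the notebook runs on Linux, where /proc/meminfo exists.
meminfo = Path("/proc/meminfo")
if meminfo.exists():
    print(f"Swap: {swap_total_mb(meminfo.read_text())} MB")
```

The reported value may be slightly below the configured size, so a threshold a little under 2048 is probably safer than an exact comparison.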

**Directory structure**

```text
app
├── .packages
├── ImageBind                   <-- cloned ImageBind repository
│   ├── .checkpoints            <-- ImageBind looks for this directory when loading the model
│   │   └── imagebind_huge.pth  <-- pre-downloaded ImageBind model
│   └── ...
└── scripts
    └── example.ipynb           <-- this notebook
```
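
Because loading depends on the checkpoint being in place, it may be worth verifying the layout above before calling the model loader. This is a sketch only: the helper `find_checkpoint` is hypothetical, and it assumes the checkpoint is resolved relative to the repository root, as the `os.chdir` call in this notebook suggests.

```python
from pathlib import Path


def find_checkpoint(repo_root: str) -> Path:
    """Return the expected checkpoint path, or raise if the file is missing."""
    ckpt = Path(repo_root) / ".checkpoints" / "imagebind_huge.pth"
    if not ckpt.is_file():
        raise FileNotFoundError(f"Checkpoint not found: {ckpt}")
    return ckpt


# Example call for the layout shown above (left commented out so the
# sketch does not fail on machines without the checkpoint):
# find_checkpoint("/app/ImageBind")
```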

**References**

- https://github.com/facebookresearch/ImageBind?tab=readme-ov-file#usage
- https://jina.ai/news/cross-modal-search-with-imagebind-and-docarray/

<br />

In [None]:
import os
os.chdir('/app/ImageBind')

print(os.getcwd()) # expected output: /app/ImageBind

In [None]:
import torch

from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [None]:
# Load the ImageBind model.
# This can take 1-2 minutes. On an 8GB Pi 5, make sure 2GB of swap is configured; otherwise loading will likely crash the Pi.

model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

In [None]:
# Load data

text = ["A dog.", "A car.", "A bird."]
image_paths = [".assets/dog_image.jpg", ".assets/car_image.jpg", ".assets/bird_image.jpg"]
audio_paths = [".assets/dog_audio.wav", ".assets/car_audio.wav", ".assets/bird_audio.wav"]

inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}

## Basic example

In [None]:
# Predictions

with torch.no_grad():
    embeddings = model(inputs)

print(
    "Vision x Text: ",
    torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1),
)
print(
    "Audio x Text: ",
    torch.softmax(embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.TEXT].T, dim=-1),
)
print(
    "Vision x Audio: ",
    torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.AUDIO].T, dim=-1),
)
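
Each matrix printed above is a row-wise softmax over pairwise dot products: row *i* is a probability distribution over the items of the second modality for item *i* of the first. The sketch below reproduces that computation in plain Python on made-up 2-D vectors. The vectors and the helper names `softmax` and `similarity_probs` are illustrative assumptions, not ImageBind outputs or APIs.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_probs(a, b):
    """Row-wise softmax of a @ b.T, with a and b given as lists of vectors."""
    return [
        softmax([sum(x * y for x, y in zip(u, v)) for v in b])
        for u in a
    ]


# Toy "embeddings": item 0 of each modality points the same way, as does item 1.
vision = [[4.0, 0.0], [0.0, 4.0]]
text = [[4.0, 0.0], [0.0, 4.0]]

probs = similarity_probs(vision, text)
best = [row.index(max(row)) for row in probs]
print(best)  # [0, 1]
```

With well-aligned embeddings the largest entry of each row sits on the diagonal, which is the pattern to look for in the output of the cell above.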

## Cross-modal search example

In [None]:
!python3 -m pip install docarray

In [None]:
from typing import Union
from docarray.documents import TextDoc, ImageDoc, AudioDoc

def embed(doc: Union[TextDoc, ImageDoc, AudioDoc]):
    """Compute the ImageBind embedding for doc, store it in doc.embedding, and return doc."""
    with torch.no_grad():
        if isinstance(doc, TextDoc):
            embedding = model({ModalityType.TEXT: data.load_and_transform_text([doc.text], device)})[ModalityType.TEXT]
        elif isinstance(doc, ImageDoc):
            embedding = model({ModalityType.VISION: data.load_and_transform_vision_data([doc.url], device)})[ModalityType.VISION]
        elif isinstance(doc, AudioDoc):
            embedding = model({ModalityType.AUDIO: data.load_and_transform_audio_data([doc.url], device)})[ModalityType.AUDIO]
        else:
            raise ValueError('doc must be a TextDoc, ImageDoc, or AudioDoc')

    doc.embedding = embedding.detach().cpu().numpy()[0]

    return doc

In [None]:
# Text-to-image

from docarray.index.backends.in_memory import InMemoryExactNNIndex

image_index = InMemoryExactNNIndex[ImageDoc]()
image_index.index([embed(ImageDoc(url=path)) for path in image_paths])

match = image_index.find(embed(TextDoc(text='bird')).embedding, search_field='embedding', limit=1).documents[0]
match.url.display()

In [None]:
# Text-to-audio

from docarray.index.backends.in_memory import InMemoryExactNNIndex

audio_index = InMemoryExactNNIndex[AudioDoc]()
audio_index.index([embed(AudioDoc(url=path)) for path in audio_paths])

match = audio_index.find(embed(TextDoc(text='bird')).embedding, search_field='embedding', limit=1).documents[0]
match.url.display()

In [None]:
# Image-to-audio

from docarray.index.backends.in_memory import InMemoryExactNNIndex

audio_index = InMemoryExactNNIndex[AudioDoc]()
audio_index.index([embed(AudioDoc(url=path)) for path in audio_paths])

match = audio_index.find(embed(ImageDoc(url='.assets/dog_image.jpg')).embedding, search_field='embedding', limit=1).documents[0]
match.url.display()
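
The searches above amount to an exhaustive nearest-neighbour lookup: the query embedding is compared against every indexed embedding and the closest ones are returned. Below is a minimal sketch of that idea in plain Python, assuming cosine similarity as the ranking metric (the actual index may be configured differently). The toy vectors and the helpers `cosine` and `find_nearest` are hypothetical and not part of docarray.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def find_nearest(query, index, limit=1):
    """Return the names of the `limit` entries most similar to `query`."""
    ranked = sorted(index.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [name for name, _ in ranked[:limit]]


# Toy index: one made-up 3-D "embedding" per item.
index = {
    "dog": [1.0, 0.0, 0.0],
    "car": [0.0, 1.0, 0.0],
    "bird": [0.0, 0.0, 1.0],
}

print(find_nearest([0.1, 0.2, 0.9], index))  # ['bird']
```

Exhaustive search costs one comparison per indexed item, which is negligible for the three items used here.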