The [Audio-to-Image Search Plugin](https://github.com/jacobmarks/audio-retrieval-plugin) is a FiftyOne plugin that allows searching for images similar to a given audio clip. 

It works by:

- Using the `ImageBind` embedding model to embed images and audio clips into a shared 1024-dimensional space.

- Storing the embeddings in a `Qdrant` similarity index for fast similarity search.

- Providing a FiftyOne UI for uploading audio clips, pre-filtering, and searching the index.

The plugin supports `ogg` and `wav` audio files, but not `mp3`. It makes an API call to Replicate to avoid potential installation issues with running the embedding model locally.
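
To make the retrieval step concrete, here is a minimal sketch of nearest-neighbor search by cosine similarity. It uses tiny 3-dimensional toy vectors in place of real 1024-dimensional ImageBind embeddings and a brute-force loop in place of Qdrant. The function names, file names, and vector values are invented for illustration, and the plugin's actual index configuration and scoring may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, index, k=2):
    """Return the k (name, score) pairs most similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy "image embeddings" (hypothetical values, not real ImageBind output)
image_index = {
    "guitar.jpg": [0.9, 0.1, 0.0],
    "drums.jpg": [0.1, 0.9, 0.1],
    "piano.jpg": [0.0, 0.2, 0.9],
}

# Toy "audio embedding" for a clip that should land near the guitar image
audio_query = [0.8, 0.2, 0.1]

results = top_k(audio_query, image_index, k=2)
print(results)
```

Because images and audio live in the same space, the query vector can come from either modality. A vector database such as Qdrant serves the same purpose as `top_k` here, but is built to stay fast on large collections.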


### ImageBind learns a joint embedding space across multiple modalities


ImageBind exploits the natural co-occurrence of images with other modalities to learn a unified representation that binds them together without requiring exhaustive paired data. It uses the frozen CLIP image-text embedding space as the target for alignment. This simple yet powerful approach enables novel multimodal capabilities to emerge.

- ImageBind learns a single embedding space that binds together six modalities: images, text, audio, depth, thermal, and IMU data. This allows it to align and connect information from these various sources.

- Importantly, ImageBind does not require all combinations of paired data to train this joint embedding. It leverages the fact that images naturally co-occur with the other modalities. By training on image-paired data, it can implicitly align all the modalities together using the images as an anchor or "binding" modality.

- Each modality has its own encoder: a ViT for images, audio spectrograms, and depth maps, and a Transformer for IMU sequences. A modality-specific linear projection head is added to each encoder to obtain a fixed-dimensional embedding. This embedding is normalized and used in the contrastive InfoNCE loss during training. In more detail:

    1. **Separate encoders for each modality:**

        - ImageBind uses different encoder architectures tailored to each modality.

        - For images, audio spectrograms, and depth maps, they employ Vision Transformers (ViT) as the encoder backbone.

        - For Inertial Measurement Unit (IMU) sequences, which consist of accelerometer and gyroscope readings over time, they use a standard Transformer encoder.

    2. **Modality-specific linear projection heads:**

        - The raw output from each modality's encoder may have different dimensionalities.

        - To align them into a common embedding space, ImageBind adds a linear projection head on top of each encoder.

        - This projection head is just a simple fully-connected layer that maps the encoder output to a fixed target dimensionality.

        - The weights of these projection heads are learned during training.

    3. **Embedding normalization:**

        - After the linear projection, the embeddings from each modality are normalized.

        - Normalization here likely refers to L2 normalization, where the embedding vectors are scaled to have unit length.

        - Normalization makes the embeddings more comparable and is a common practice in contrastive learning.

    4. **InfoNCE contrastive loss:**

        - The normalized embeddings from all modalities are used to compute the InfoNCE contrastive loss during training.

        - InfoNCE is a popular choice for self-supervised contrastive learning. It encourages embeddings of positive pairs (e.g. an image and its caption) to be close, while pushing apart embeddings of negative pairs.

        - By using InfoNCE across embeddings from different modalities, ImageBind learns to align them into a shared semantic space.

- The image and text encoders are initialized from a pretrained CLIP model and frozen during training. The other modality encoders are learned to align with this frozen CLIP image-text embedding space.

- Through this training on image-paired data, ImageBind implicitly learns to align non-paired modalities, like text and audio. For example, training on (image, text) and (image, audio) pairs enables ImageBind to connect text and audio without seeing them paired.
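
The projection, normalization, and loss steps described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not ImageBind's training code: the encoder outputs, the projection weights, and the temperature of 0.07 are all made-up values, and the loss is computed in one direction only (a symmetric version would add the reverse direction).

```python
import math

def linear_project(x, weights):
    """Bias-free linear layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def l2_normalize(x):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def info_nce(anchors, others, temperature=0.07):
    """Mean InfoNCE loss; anchors[i] and others[i] form the positive pair,
    and every others[j] with j != i acts as a negative."""
    losses = []
    for i, a in enumerate(anchors):
        logits = [sum(p * q for p, q in zip(a, o)) / temperature for o in others]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Hypothetical encoder outputs with different dimensionalities per modality
image_features = [[1.0, 0.0], [0.0, 1.0]]            # 2-dim "image encoder" output
audio_features = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]  # 3-dim "audio encoder" output

# Hypothetical projection heads mapping both modalities to 2 dimensions
image_head = [[1.0, 0.0], [0.0, 1.0]]
audio_head = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

image_emb = [l2_normalize(linear_project(x, image_head)) for x in image_features]
audio_emb = [l2_normalize(linear_project(x, audio_head)) for x in audio_features]

aligned_loss = info_nce(image_emb, audio_emb)           # correct pairing
misaligned_loss = info_nce(image_emb, audio_emb[::-1])  # pairs swapped

print(aligned_loss, misaligned_loss)
```

The loss is near zero when each image embedding sits next to its paired audio embedding and large when the pairs are swapped, which is the signal that pulls matching pairs together during training.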

In [None]:
import os
import fiftyone as fo

You'll need to install the following plugin:

In [None]:
!fiftyone plugins download https://github.com/jacobmarks/audio-retrieval-plugin

You will need a Replicate API Token for this notebook. You can sign up [here](https://replicate.com/docs).

In [None]:
import getpass

os.environ['REPLICATE_API_TOKEN'] = getpass.getpass("Enter your Replicate API token: ")

FiftyOne has integrations with Hugging Face, which allow you to easily pull datasets from the hub! Learn more about the integration [here](https://docs.voxel51.com/integrations/huggingface.html) and how you can pull datasets from the hub [here](https://docs.voxel51.com/integrations/huggingface.html#loading-datasets-from-the-hub).

In [None]:
import fiftyone.utils.huggingface as fouh

instruments_dataset = fouh.load_from_hub(
    "YakupAkdin/instrument-images",
    split="train",
    format="ParquetFilesDataset",
    overwrite=True,
    persistent=True,
    name="instruments",
)

If you already have the dataset downloaded, you can just load it:

```python
instruments_dataset = fo.load_dataset("instruments")
```

For this example, you can use the dataset's `take` method, which randomly samples the given number of samples from the collection. Working with a small subset keeps your Replicate costs low and saves time.

In [None]:
# Randomly sample 250 images; the fixed seed makes the sample reproducible
dataset_sample = instruments_dataset.take(
    size=250,
    seed=51,
)

The following code overrides the default operator timeout, which is specified in milliseconds. This is necessary because generating embeddings can take a long time when they come from the Replicate API.

In [None]:
fo.config.operator_timeout = 6_000_000

The audio-retrieval-plugin allows you to search a dataset for images that are similar to a given audio clip. It works by using the `ImageBind` embedding model to embed images and audio clips into a shared embedding space.

To use the plugin, you need to:

1. Start a local Qdrant server using Docker: `docker run -p "6333:6333" -p "6334:6334" -d qdrant/qdrant`

2. Run the `create_imagebind_index` operator to create the similarity index

3. Run the `open_audio_retrieval_panel` operator to open the search panel
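
Before running the operators, it can help to confirm that something is listening on the port published by the Docker command in step 1. The helper below is a hypothetical convenience, not part of the plugin. It only checks that a TCP connection to `localhost:6333` succeeds, so it cannot tell whether the listener is actually Qdrant.

```python
import socket

def is_port_open(host="localhost", port=6333, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts
        return False

qdrant_reachable = is_port_open()
print("Qdrant port reachable:", qdrant_reachable)
```

If this prints `False`, check that the container is running (for example with `docker ps`) before creating the index.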

The search panel allows you to upload an audio clip (in `ogg` or `wav` format), and then searches the index for similar images. This plugin is a proof of concept and not intended for production use.
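
Since the panel only accepts `ogg` and `wav` files, a quick extension check can catch unsupported clips before you try to upload them. This is a minimal sketch; the helper name and the example file names are hypothetical, and it looks only at the file extension, not at the actual audio encoding.

```python
from pathlib import Path

# Formats the plugin is documented to support
SUPPORTED_AUDIO_EXTENSIONS = {".ogg", ".wav"}

def is_supported_audio(path):
    """Return True if the file extension is one the plugin accepts."""
    return Path(path).suffix.lower() in SUPPORTED_AUDIO_EXTENSIONS

for clip in ["violin.wav", "Flute.OGG", "trumpet.mp3"]:
    print(clip, is_supported_audio(clip))
```

Clips in other formats, such as `mp3`, need to be converted to `ogg` or `wav` first.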


In [None]:
session = fo.launch_app(dataset_sample)