# Multimodal Search Using CLIP

## Introduction

This notebook showcases the capabilities of SuperDuperDB for performing multimodal searches using the `VectorIndex`. SuperDuperDB's flexibility enables users and developers to integrate various models into the system and use them for vectorizing diverse queries during search and inference. In this demonstration, we leverage the [CLIP multimodal architecture](https://openai.com/research/clip).

## Prerequisites

Before diving into the implementation, ensure that you have the necessary libraries installed by running the following command:

In [None]:
!pip install pinnacledb[demo]

## Connect to datastore 

First, we need to establish a connection to a MongoDB datastore via SuperDuperDB. You can set the `MONGODB_URI` environment variable to match your specific setup; if it is unset, the code below falls back to `mongomock://test`.
Here are some examples of MongoDB URIs:

* For testing (default connection): `mongomock://test`
* Local MongoDB instance: `mongodb://localhost:27017`
* MongoDB with authentication: `mongodb://pinnacle:pinnacle@mongodb:27017/documents`
* MongoDB Atlas: `mongodb+srv://<username>:<password>@<atlas_cluster>/<database>`

In [None]:
import os
from pinnacledb import pinnacle
from pinnacledb.backends.mongodb import Collection

mongodb_uri = os.getenv("MONGODB_URI", "mongomock://test")
db = pinnacle(mongodb_uri, artifact_store='filesystem://./models/')

# Create a collection for Tiny ImageNet
imagenet_collection = Collection('tiny-imagenet')

## Load Dataset 

To make this notebook easily executable and interactive, we'll work with a sub-sample of the [Tiny-Imagenet dataset](https://paperswithcode.com/dataset/tiny-imagenet). The processes demonstrated here can be applied to larger datasets with higher-resolution images as well. For such use cases, however, it's advisable to use a machine with a GPU; otherwise there'll be some significant thumb twiddling to do.

To insert images into the database, we utilize the `Encoder`-`Document` framework, which allows saving Python class instances as blobs in the `Datalayer` and retrieving them as Python objects. To this end, SuperDuperDB contains pre-configured support for `PIL.Image` instances. This simplifies the integration of Python AI models with the datalayer. It's also possible to create your own encoders.


In [None]:
from pinnacledb import Document
from pinnacledb.ext.pillow import pil_image as i
from datasets import load_dataset
import random

# Load the dataset
dataset = load_dataset("zh-plus/tiny-imagenet")['valid']

# Wrap images into encodable objects
dataset = [Document({'image': i(r['image'])}) for r in dataset]

# Randomly sample 1000 images from the dataset
dataset = random.sample(dataset, 1000)

# Encode and insert images to the database
db.execute(imagenet_collection.insert_many(dataset), encoders=(i,))
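
The encoder idea described above (turn a Python object into a blob for storage, then turn the blob back into an object on retrieval) can be sketched in a few lines. This is only an illustration under stated assumptions: `ToyEncoder` is a hypothetical class, not the library's `Encoder`, and `pickle` stands in for whatever image format a real image encoder would write.

```python
import pickle
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToyEncoder:
    """Hypothetical stand-in for an encoder: a named pair of functions."""
    identifier: str
    encode: Callable[[Any], bytes]
    decode: Callable[[bytes], Any]


# pickle is used purely for illustration; a real image encoder would
# typically serialize to an image format instead.
toy = ToyEncoder("toy_pickle", encode=pickle.dumps, decode=pickle.loads)

record = {"size": (64, 64), "label": "mushroom"}
blob = toy.encode(record)     # bytes that a datastore could hold
restored = toy.decode(blob)   # back to an equivalent Python object
```

The point of pairing the two functions under one identifier is that the datastore only needs to remember which encoder produced a blob in order to decode it later.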

You can verify that the images are correctly stored as follows:

In [None]:
x = db.execute(imagenet_collection.find_one()).unpack()['image']
# Scale to 300 px wide; multiply before truncating so the aspect ratio is kept
display(x.resize((300, int(300 * x.size[1] / x.size[0]))))
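
The resize call above scales each image to a width of 300 pixels. The arithmetic for keeping the aspect ratio is easy to get wrong if the ratio is truncated to an integer before it is multiplied, so here is a minimal sketch of that computation on its own. The helper name `scaled_size` is hypothetical and not part of any library used in this notebook.

```python
def scaled_size(width, height, target_width=300):
    """Return (target_width, new_height) preserving the aspect ratio."""
    # Multiply first, then round, so non-square images keep their shape.
    new_height = max(1, round(target_width * height / width))
    return target_width, new_height


print(scaled_size(64, 64))    # (300, 300)
print(scaled_size(640, 480))  # (300, 225)
```

For the 64x64 Tiny-Imagenet images the result is always `(300, 300)`; the helper only matters once non-square images are involved.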

## Build Models

Now, let's prepare the CLIP model for multimodal search. It has two components: a text encoder and a visual encoder. Both produce vectors of the same shape, so once both components are set up you can search with either images or text to find matching items:

In [None]:
import clip
from pinnacledb import vector
from pinnacledb.ext.torch import TorchModel

# Load the CLIP model
model, preprocess = clip.load("RN50", device='cpu')

# Define a vector
e = vector(shape=(1024,))

# Create a TorchModel for text encoding
text_model = TorchModel(
    identifier='clip_text',
    object=model,
    preprocess=lambda x: clip.tokenize(x)[0],
    postprocess=lambda x: x.tolist(),
    encoder=e,
    forward_method='encode_text',    
)

# Create a TorchModel for visual encoding
visual_model = TorchModel(
    identifier='clip_image',
    object=model.visual,    
    preprocess=preprocess,
    postprocess=lambda x: x.tolist(),
    encoder=e,
)
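
Both models above share the same `vector(shape=(1024,))` encoder, which is what makes a text vector comparable with an image vector. As a rough sketch of that comparison, the snippet below computes cosine similarity between toy 3-dimensional vectors. The vectors are made up for illustration, and cosine similarity is assumed here as one common choice; the notebook does not state which measure the index uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real 1024-dimensional CLIP outputs.
text_vec = [0.9, 0.1, 0.0]
image_vec_match = [0.8, 0.2, 0.1]
image_vec_other = [0.0, 0.1, 0.9]

score_match = cosine_similarity(text_vec, image_vec_match)
score_other = cosine_similarity(text_vec, image_vec_other)
print(score_match > score_other)  # True
```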

## Create a Vector-Search Index

Let's create the index for vector-based searching. We register both models with the index, but only the `visual_model` is responsible for creating the vectors stored in the database (`indexing_listener`). The `compatible_listener` specifies an alternative model that can encode queries against those same vectors, which is what enables multimodal search: here, text queries are matched against image vectors.

In [None]:
from pinnacledb import VectorIndex
from pinnacledb import Listener

# Create a VectorIndex and add it to the database
db.add(
    VectorIndex(
        'my-index',
        indexing_listener=Listener(
            model=visual_model,
            key='image',
            select=imagenet_collection.find(),
        ),
        compatible_listener=Listener(
            model=text_model,
            key='text',
            active=False,
            select=None,
        )
    )
)
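
Conceptually, the index keeps one vector per image and, at query time, returns the ids of the `n` closest vectors. The sketch below shows that idea with a brute-force search: vectors are normalized once when the index is built, and queries are ranked by dot product. It is a toy illustration with made-up 2-dimensional vectors and hypothetical helper names, not a description of how the library implements `VectorIndex`.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def build_index(vectors):
    """Normalize every stored vector once, at indexing time."""
    return {item_id: normalize(v) for item_id, v in vectors.items()}


def search(index, query, n):
    """Return the ids of the n stored vectors closest to the query."""
    q = normalize(query)
    scores = {
        item_id: sum(a * b for a, b in zip(q, v))
        for item_id, v in index.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Made-up image vectors; the query plays the role of an encoded text.
index = build_index({
    "img-1": [1.0, 0.0],
    "img-2": [0.7, 0.7],
    "img-3": [0.0, 1.0],
})
print(search(index, [0.9, 0.1], n=2))  # ['img-1', 'img-2']
```

In this picture the `indexing_listener` corresponds to `build_index` and the `compatible_listener` corresponds to whatever produces the query vector.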

## Search Images Using Text

Now we can demonstrate searching for images using text queries:

In [None]:
from IPython.display import display
from pinnacledb import Document

# Define the search parameters
search_term = "mushroom"
num_results = 6

# Execute the query
search_results = db.execute(
    imagenet_collection.like(Document({'text': search_term}), vector_index='my-index', n=num_results).find({})
)

# Display the images from the search results
for r in search_results:
    x = r['image'].x
    # Scale to 300 px wide; multiply before truncating so the aspect ratio is kept
    display(x.resize((300, int(300 * x.size[1] / x.size[0]))))