# Search Image from Text via CLIP model

In this tutorial, we will build an image search system that retrieves images from short text queries.

The motivation is that conventional search requires a description or other metadata about each image to be indexed before the image can be retrieved with a text query. This is expensive because someone has to write those descriptions, and information about image content is not always available.

We need to look for another solution! What if we can directly compare text with images?

To do so, we need to figure out a way to match images and text. One way is finding related images with similar semantics to the query text. This requires us to represent both images and query text in the same embedding space to be able to do the matching. In this case, pre-trained cross-modal models can help us out.

For example, when we type the word "dog" as a query, we want to retrieve pictures of dogs using embedding similarity alone.



In this tutorial, we will guide you through building a text-to-image search application with Jina.

## ⏰ Installing & Importing Dependencies



We will start this tutorial by installing the necessary ***pip*** dependencies.

In [None]:
!pip install Pillow jina "transformers>=4.9.1" matplotlib "torch>=1.9.0" "torchvision>=0.10.0"

In [None]:
! rm -rf workspace images query data*.zip* SimpleIndexer

Next, we download the data used in this tutorial.

In [None]:
! wget https://open-images.s3.eu-central-1.amazonaws.com/data.zip
! unzip data.zip

We will import the necessary dependencies.

In [4]:
import os
from typing import Dict, Optional, Sequence, Tuple

In [5]:
import matplotlib.pyplot as plt
from jina import Executor, Flow, requests
from jina.logging.logger import JinaLogger
from jina.types.request import Request

In [6]:
from docarray import Document, DocumentArray

In [7]:
import torch
from transformers import CLIPFeatureExtractor, CLIPModel, CLIPTokenizer

## Defining the Indexer Flow

Now that we understand the problem and have an idea of how to solve it, let's imagine what the solution would look like:

1. We have a bunch of images with no text description about the content.
2. We use a model to create an embedding that represents those images. 
3. We index and save these embeddings, carried by objects called Documents, inside a workspace folder.

This is what we will call the index Flow, and we will show you how to create it with Jina.


### 1. Loading images into a DocumentArray

In [8]:
images = DocumentArray.from_files("images/*.jpg")

Here the `DocumentArray` `images` only contains the filenames of the images. We will see later that it is the `Executor` that loads each image from disk.

### 2. CLIPImageEncoder


We need to add an image encoder to our Flow (our pipeline), so we are going to create an `Executor` (see the detailed documentation [here](https://docs.jina.ai/concepts/executor/)).

This encoder encodes an image into embeddings using the CLIP model. 
We want an executor that loads the CLIP model and uses it to encode images during the index Flow.

Our executor should:
* support both **GPU** and **CPU**: That's why we will provision the `device` parameter and use it when encoding.
* be able to process documents in batches in order to use our resources effectively: To do so, we will use the parameter `batch_size`.
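
To make the batching requirement concrete, here is a minimal sketch in plain Python of what splitting the input into chunks of `batch_size` looks like. The helper `batched` and the file names are hypothetical stand-ins for illustration. The executor itself relies on `DocumentArray.batch` instead.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical file names standing in for Documents.
uris = [f"images/{i}.jpg" for i in range(70)]
batch_sizes = [len(batch) for batch in batched(uris, batch_size=32)]
print(batch_sizes)  # [32, 32, 6]
```

Each chunk is encoded in one forward pass of the model, so a larger `batch_size` means fewer passes but more memory used per pass.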



In [9]:
class CLIPImageEncoder(Executor):
    """Encode image into embeddings using the CLIP model."""

    def __init__(
        self,
        pretrained_model_name: str = "openai/clip-vit-base-patch32",
        device: str = "cpu",
        batch_size: int = 32,
        *args,
        **kwargs,
    ):
        super().__init__(*args, **kwargs)
        self.batch_size = batch_size

        self.device = device
        self.preprocessor = CLIPFeatureExtractor.from_pretrained(pretrained_model_name)
        self.model = CLIPModel.from_pretrained(
            pretrained_model_name
        )  # load the pretrained clip model from the transformer library

        self.model.to(
            self.device
        ).eval()  # we want to do only inference so we put the model in eval mode

    @requests
    @torch.inference_mode()  # we don't want to keep track of the gradient during inference
    def encode(self, docs: DocumentArray, parameters: dict, **kwargs):

        for batch_docs in docs.batch(
            batch_size=self.batch_size
        ):  # we want to compute the embedding by batch of size batch_size
            tensor = self._generate_input_features(
                batch_docs
            )  # Transformation from raw images to torch tensor
            batch_docs.embeddings = (
                self.model.get_image_features(**tensor).cpu().numpy()
            )  # we compute the embeddings and store it directly in the DocumentArray

    def _generate_input_features(self, docs: DocumentArray):
        docs.apply(lambda d: d.load_uri_to_image_tensor())
        input_features = self.preprocessor(
            images=[d.tensor for d in docs],
            return_tensors="pt",
        )
        input_features = {
            k: v.to(torch.device(self.device)) for k, v in input_features.items()
        }
        return input_features

### 3. SimpleIndexer


Now we want to index and save our embeddings so that we can search them later.
Here we will implement a SimpleIndexer that stores the embeddings on disk using the `SQLite` backend in `docarray` (see further details [here](https://docarray.jina.ai/advanced/document-store/sqlite/?highlight=sqlite)). This indexer is also available on the hub [here](https://hub.jina.ai/executor/zb38xlt4), but in this tutorial we prefer to show you how to create your own.

This indexer exposes two endpoints: `/search` and `/index`.
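
To give an intuition for what persisting embeddings in SQLite means, here is a simplified sketch that uses only the Python standard library. The column names, the JSON serialization, and the helper functions `index` and `load_all` are assumptions made for illustration. They are not `docarray`'s actual storage schema.

```python
import json
import sqlite3

# In-memory database for this sketch; the indexer writes to a file in its workspace.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clip (uri TEXT PRIMARY KEY, embedding TEXT)")


def index(docs):
    """Store (uri, embedding) pairs, serializing each embedding as JSON."""
    conn.executemany(
        "INSERT OR REPLACE INTO clip VALUES (?, ?)",
        [(uri, json.dumps(embedding)) for uri, embedding in docs],
    )
    conn.commit()


def load_all():
    """Read every stored embedding back as a {uri: list_of_floats} dict."""
    rows = conn.execute("SELECT uri, embedding FROM clip").fetchall()
    return {uri: json.loads(embedding) for uri, embedding in rows}


# Hypothetical two-dimensional embeddings.
index([("images/dog.jpg", [0.8, 0.2]), ("images/cat.jpg", [0.1, 0.9])])
stored = load_all()
print(sorted(stored))
```

Because the data lives in a database rather than in process memory, a second Flow started later can open the same file and search the embeddings written by the first.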

In [10]:
class SimpleIndexer(Executor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        print(os.path.join(self.workspace, 'index.db'))
        self._index = DocumentArray(
            storage='sqlite',
            config={
                'connection': os.path.join(self.workspace, 'index.db'),
                'table_name': 'clip',
            },
        )

    @requests(on='/index')
    def index(self, docs: DocumentArray, **kwargs):
        self._index.extend(docs)

    @requests(on='/search')
    def search(self, docs: DocumentArray, **kwargs):
        docs.match(self._index)

### 4. Building the index Flow

Here we compose our first `Flow`, which will be in charge of indexing the images in the database.

In [None]:
flow_index = (
    Flow()
    .add(uses=CLIPImageEncoder, name='encoder', uses_with={'device': "cpu"})
    .add(uses=SimpleIndexer, name='indexer', workspace='workspace')
)
flow_index

As the Flow diagram shows, each `DocumentArray` is first encoded by the encoder and then stored in the database by the indexer.

Now let's actually index the data by calling the Flow.

In [None]:
with flow_index:
    flow_index.post(on='/index', inputs=images)

The Flow has indexed the data in the database! Now let's define another Flow to query these images with text.

## Defining the Search Flow

Now, to search for an image using text, we do the following:

1. We embed the query text into the same embedding space as the image.
2. We compute similarity between the query embedding and previously saved embeddings. 
3. We return the best results.

This is our query Flow. 


### 1. CLIPTextEncoder


This part is very similar to the CLIPImageEncoder. However, instead of using the CLIP model to embed images, we are going to embed text. The code changes are small: mainly, we use a tokenizer instead of the image preprocessor.
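
The tokenizer call in the encoder below uses `padding='longest'`, `truncation=True` and `max_length=77`. As a rough illustration of what these options do to a batch, here is a toy sketch on hypothetical token ids. The function `pad_and_truncate` and the padding id `0` are assumptions. The real tokenizer also handles special tokens, which this sketch leaves out.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate to `max_length`, then pad every sequence to the longest one in the batch."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for two queries of different lengths.
batch = pad_and_truncate([[5, 6], [7, 8, 9, 10, 11]], max_length=4)
print(batch["input_ids"])       # [[5, 6, 0, 0], [7, 8, 9, 10]]
print(batch["attention_mask"])  # [[1, 1, 0, 0], [1, 1, 1, 1]]
```

Padding gives every query in a batch the same length so the batch can be stacked into one tensor, and truncation keeps long queries within the model's context length.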

In [13]:
class CLIPTextEncoder(Executor):
    """Encode text into embeddings using the CLIP model."""

    def __init__(
        self,
        encode_text=True,
        pretrained_model_name: str = 'openai/clip-vit-base-patch32',
        device: str = 'cpu',
        batch_size: int = 32,
        *args,
        **kwargs,
    ):
        super().__init__(*args, **kwargs)
        self.batch_size = batch_size
        self.device = device

        self.tokenizer = CLIPTokenizer.from_pretrained(
            pretrained_model_name
        )  # load the tokenizer from the transformer library

        self.model = CLIPModel.from_pretrained(
            pretrained_model_name
        )  # load the pretrained clip model from the transformer library

        self.model.eval().to(
            device
        )  # we want to do only inference so we put the model in eval mode

    @requests
    @torch.inference_mode()  # we don't want to keep track of the gradient during inference
    def encode(self, docs: Optional[DocumentArray], parameters: Dict, **kwargs):

        for docs_batch in docs.batch(
            batch_size=self.batch_size
        ):  # we want to compute the embedding by batch of size batch_size
            input_tokens = self._generate_input_tokens(
                docs_batch.texts
            )  # Transformation from raw texts to torch tensor
            docs_batch.embeddings = (
                self.model.get_text_features(**input_tokens).cpu().numpy()
            )  # we compute the embeddings and store it directly in the DocumentArray

    def _generate_input_tokens(self, texts: Sequence[str]):

        input_tokens = self.tokenizer(
            texts,
            max_length=77,
            padding='longest',
            truncation=True,
            return_tensors='pt',
        )
        input_tokens = {k: v.to(self.device) for k, v in input_tokens.items()}
        return input_tokens

### 2. Compute similarity between the query embedding and previously saved embeddings

This part will be done by the SimpleIndexer that we defined above. Let's take a closer look at how this indexer actually performs the search.

In [14]:
class DumbSimpleIndexer(
    Executor
):  # this executor is here for example and is never used
    @requests(on='/search')
    def search(self, docs: DocumentArray, **kwargs):
        docs.match(self._index)

The indexer just calls the [match method](https://docarray.jina.ai/api/docarray.array.match/?highlight=match#module-docarray.array.match) of `DocumentArray` to perform a cosine similarity search between the embedding of the query and the embeddings of the indexed images. The idea is to look for the vectors closest to our query in the semantic space defined by CLIP, in order to find the images that best correspond to our text query.
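
Conceptually, this kind of search can be sketched as a brute-force ranking by cosine distance. The snippet below is a plain-Python illustration under that assumption, with hypothetical two-dimensional embeddings and helper names. It is not how `docarray` implements `match` internally.

```python
import heapq
import math


def cosine_distance(a, b):
    """1 - cosine similarity, so smaller values mean closer vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_k_matches(query, index, k=3):
    """Return the k (uri, distance) pairs with the smallest cosine distance to `query`."""
    scored = ((uri, cosine_distance(query, emb)) for uri, emb in index.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Hypothetical image embeddings keyed by file name.
index = {
    "a.jpg": [1.0, 0.0],
    "b.jpg": [0.7, 0.7],
    "c.jpg": [0.0, 1.0],
    "d.jpg": [-1.0, 0.0],
}
matches = top_k_matches([1.0, 0.1], index, k=3)
print([uri for uri, _ in matches])  # ['a.jpg', 'b.jpg', 'c.jpg']
```

Cosine distance only depends on the direction of the vectors, not on their length, which is why a text embedding and an image embedding can be compared directly once both live in the same space.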

### 3. Define how to return the best results


Here we will define a simple function that plots the 3 closest images for each query text.

In [15]:
def plot_search_results(resp: Request):
    for doc in resp.docs:
        print(f'Query text: {doc.text}')
        print('Matches:')
        print('-' * 10)
        doc.matches[:3].plot_image_sprites()

This function will be called as a callback at the end of the search Flow that we will now define.

### 4. Building the search Flow


Let's define the search Flow.

In [None]:
flow_search = (
    Flow()
    .add(uses=CLIPTextEncoder, name='encoder', uses_with={'device': "cpu"})
    .add(uses=SimpleIndexer, name='indexer', workspace='workspace')
)
flow_search

And let's query it:

In [None]:
with flow_search:
    resp = flow_search.search(
        inputs=DocumentArray(
            [
                Document(text='dog'),
                Document(text='cat'),
                Document(text='kids on their bikes'),
            ]
        ),
        on_done=plot_search_results,
    )

Here you can see that we can easily retrieve images of cats, dogs, or even kids on their bikes!

In [18]:
# clean up
! rm -rf workspace images query
! rm data.zip