# Text-2-Image document retrieval using pretrained CLIP

In the first AI task we implement for this collection, we'll set up a model that retrieves relevant images from a text query. We'll use the data in the `captions` field to retrieve the `img` field.

In [1]:
import sys

sys.path.append('../../')

from pinnacledb.client import the_client

docs = the_client.coco.documents

Let's start by adding a model to the collection.
A nice open-source model for testing text-2-image retrieval is [CLIP](https://openai.com/blog/clip/), which understands both images and text and embeds them in a common vector space.

Note that we specify the type of the model output, so that the collection knows how to store the results, and that we "activate" the model with `active=True`. This means that whenever we add data that falls under the `filter`, it will be processed by the model, and the outputs will be added to the collection documents.

The `key` argument specifies which part of the document the model should act on. If `key="_base"`, the model takes the whole document as input. Since this model will be encoding the images in the documents, we choose `key="img"`.
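
To make the roles of `filter`, `key` and `active` concrete, here is a toy sketch in plain Python. It is not how the library implements these arguments: `matches`, `select_input` and `process_on_insert` are hypothetical helpers, and the filter is assumed to be a flat dictionary of equality conditions.

```python
# Toy illustration only: these helpers are hypothetical, not part of the library.

def matches(doc, filter_):
    """True if every key/value pair of the (flat, equality-only) filter holds for doc."""
    return all(doc.get(k) == v for k, v in filter_.items())


def select_input(doc, key):
    """Pick the part of the document the model sees; '_base' means the whole document."""
    return doc if key == '_base' else doc[key]


def process_on_insert(doc, model, key, filter_, active=True):
    """Mimic an active model: compute an output for documents that fall under the filter."""
    if active and matches(doc, filter_):
        return model(select_input(doc, key))
    return None


doc = {'img': 'pixels-of-a-dog', 'captions': ['A dog catches a frisbee']}
describe = lambda x: f'encoded({x})'  # stand-in for a real encoder

print(process_on_insert(doc, describe, key='img', filter_={}))
print(process_on_insert(doc, describe, key='img', filter_={}, active=False))
```

An empty filter matches every document, so the first call encodes the `img` field, while the inactive model in the second call leaves the document untouched.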

We import the CLIP model from the `examples` directory on GitHub. For completeness, we quote the code here; it's a thin wrapper around the OpenAI model.

In `examples.models`:

```python
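import torch
from clip import load as load_clip, tokenize as clip_tokenize


class CLIP(torch.nn.Module):
    def __init__(self, name):
        super().__init__()
        # load_clip returns the pretrained model and its image preprocessing transform
        self.model, self.image_preprocess = load_clip(name)

    def preprocess(self, r):
        # a single caption: tokenize and take the one resulting row of token ids
        if isinstance(r, str):
            return clip_tokenize(r, truncate=True)[0, :]
        # a list of captions: join them into one string before tokenizing
        elif isinstance(r, list) and isinstance(r[0], str):
            return clip_tokenize(' '.join(r), truncate=True)[0, :]
        # anything else is treated as an image
        return self.image_preprocess(r)

    def forward(self, r):
        # batched token ids are 2-dimensional; image batches have more dimensions
        if len(r.shape) == 2:
            return self.model.encode_text(r)
        return self.model.encode_image(r)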
```
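
The `forward` method tells text from images purely by the number of dimensions of the batch. The sketch below reproduces that rule on bare shape tuples, so it runs without `torch` or `clip`. Here `route` is a hypothetical helper, and the concrete shapes assume CLIP's default context length of 77 tokens, 224x224 RGB inputs for `RN50`, and a data loader that stacks samples along a leading batch dimension.

```python
def route(shape):
    """Mirror the rule in CLIP.forward: 2-d batches are token ids, anything else is images."""
    return 'encode_text' if len(shape) == 2 else 'encode_image'


# preprocess returns one row of token ids per caption,
# so a batch of 10 captions is assumed to have shape (10, 77)
print(route((10, 77)))

# a batch of 10 preprocessed RGB images is assumed to have shape (10, 3, 224, 224)
print(route((10, 3, 224, 224)))
```

One consequence of this rule is that a single unbatched row of token ids, with shape `(77,)`, would be routed to the image encoder, so inputs to `forward` need to be batched.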

In [None]:
from examples.models import CLIP

docs.create_model(
    name='clip',
    object=CLIP('RN50'),
    filter={},
    type='float_tensor',
    key='img',
    verbose=True,
    active=True,
    loader_kwargs={'batch_size': 10},
)

computing chunk (1/17)
finding documents under filter
done.
processing with clip
100%|██████████| 500/500 [07:49<00:00,  1.06it/s]
bulk writing...
done.
computing chunk (2/17)
finding documents under filter
done.
processing with clip
100%|██████████| 500/500 [07:43<00:00,  1.08it/s]
bulk writing...
done.
computing chunk (3/17)
finding documents under filter
done.
processing with clip
100%|██████████| 500/500 [07:39<00:00,  1.09it/s]
bulk writing...
done.
computing chunk (4/17)
finding documents under filter
done.
processing with clip
  2%|▏         | 12/500 [00:11<07:32,  1.08it/s]

We'll now create a companion model that uses the same underlying object as the previous model. We specify this by passing the name of the existing model, instead of an object, in the `object` argument. This model is not `active`, since we'll only use it for querying the collection. We don't need to specify a `type`, since that was done in the last step.
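
The idea of referring to an existing model by name can be pictured with a toy registry. This is only an illustration under assumptions: the `registry` dictionary and this stand-alone `create_model` function are hypothetical and say nothing about how the collection stores models internally.

```python
# Toy registry; names and structure are hypothetical, not the library's internals.
registry = {}


def create_model(name, obj, key, active=True):
    # A string refers to an already registered model and reuses its underlying object.
    if isinstance(obj, str):
        obj = registry[obj]['object']
    registry[name] = {'object': obj, 'key': key, 'active': active}


class Encoder:
    """Stand-in for the CLIP wrapper."""


create_model('clip', Encoder(), key='img', active=True)
create_model('clip_text', 'clip', key='captions', active=False)

# Both entries point at the very same object, so the weights exist only once.
print(registry['clip']['object'] is registry['clip_text']['object'])
```

Sharing one object means the two models differ only in which field they read (`img` or `captions`) and in whether they run automatically on incoming data.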

In [None]:
docs.create_model(
    name='clip_text',
    object='clip',
    key='captions',
    active=False,
)

We'll also create a measure, which scores how similar two outputs are to each other. Since CLIP was trained with cosine similarity, we'll use that here too.

In `examples.measures`:

```python

def dot(x, y):
    return x.matmul(y.T)


def css(x, y):
    x = x.div(x.norm(dim=1)[:, None])
    y = y.div(y.norm(dim=1)[:, None])
    return dot(x, y)

```
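
To see what `css` computes without needing `torch`, here is a minimal plain-Python sketch of the same calculation: normalise every row to unit length, then take all pairwise dot products. The helper `normalize` and the example vectors are made up for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def css(xs, ys):
    """Cosine similarity between every row of xs and every row of ys."""
    xs = [normalize(x) for x in xs]
    ys = [normalize(y) for y in ys]
    return [[sum(a * b for a, b in zip(x, y)) for y in ys] for x in xs]


# One query row compared against a parallel, an orthogonal and an opposite vector.
sims = css([[3.0, 4.0]], [[6.0, 8.0], [-4.0, 3.0], [-3.0, -4.0]])
print([round(s, 6) for s in sims[0]])
```

The parallel vector scores 1, the orthogonal one 0 and the opposite one -1, independently of the vectors' lengths. This is why the measure compares directions in the embedding space rather than magnitudes.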

In [None]:
from examples.measures import css

docs.create_measure('css', css)

In order to be able to measure performance on the validation set, we'll add a **metric**.

In `examples.metrics`:

```python
class PatK:
    def __init__(self, k):
        self.k = k

    def __call__(self, x, y):
        return y in x[:self.k]

```
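
As a small worked example, `PatK` can be applied to a few ranked result lists by hand. The ids below are invented, and averaging the per-query outcomes is an assumption about how a validation run would summarise them, not something stated above.

```python
class PatK:
    def __init__(self, k):
        self.k = k

    def __call__(self, x, y):
        return y in x[:self.k]


# Hypothetical ranked result ids for three queries, each paired with the correct id.
results = [
    (['a', 'b', 'c', 'd'], 'a'),  # correct item at rank 1
    (['b', 'c', 'a', 'd'], 'a'),  # correct item at rank 3, outside the top 2
    (['d', 'a', 'c', 'b'], 'a'),  # correct item at rank 2
]

p_at_2 = PatK(2)
hits = [p_at_2(ranked, target) for ranked, target in results]
print(hits)
print(sum(hits) / len(hits))  # fraction of queries with the target in the top 2
```

Each call returns a boolean for a single query, so with `k=2` two of the three queries count as hits.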

In [None]:
from examples.metrics import PatK

docs.create_metric('p_at_10', PatK(10))

Now we're ready to add a **semantic index**. This is a tuple of models, one of which is activated in order to populate the collection with vectors. The idea is that any of the models in the **semantic index** can be used to query the collection, using nearest-neighbour lookup based on the chosen **measure**.

In [None]:
from examples.models import CLIP

docs.create_semantic_index(
    'clip',
    models=['clip', 'clip_text'],
    measure='css',
    metrics=['p_at_10'],
)
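
The nearest-neighbour lookup behind a semantic index can be sketched in a few lines of plain Python. This is a brute-force illustration under assumptions, not the library's implementation: `find_like` here is a hypothetical stand-alone function, and the 2-d vectors are made up to stand in for stored image embeddings and an encoded text query.

```python
import math


def cosine(x, y):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


# Made-up vectors standing in for the stored outputs of the active image model.
stored = {
    'dog_photo': [0.9, 0.1],
    'cat_photo': [0.1, 0.9],
    'beach_photo': [-0.7, 0.2],
}


def find_like(query_vector, n):
    """Return the ids of the n stored vectors most similar to the query."""
    ranked = sorted(stored, key=lambda _id: cosine(query_vector, stored[_id]), reverse=True)
    return ranked[:n]


# Made-up output of the text model for a dog-related caption.
print(find_like([1.0, 0.0], n=2))
```

Because both models embed into the same vector space, the query vector may come from either of them, while only the stored vectors need to be computed ahead of time.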

Now that the semantic index has been created, we can search through the data using that index.

We can see that the CLIP model gives meaningful retrievals from short descriptive pieces of text.
This is very useful, since the model is now deployed to the database, listening for incoming queries.

In [None]:
from IPython.display import display

docs.semantic_index = 'clip'
r = docs.find_one()

# example using item id directly
for r in docs.find(like={'_id': r['_id']}, n=10):
    display(r['img'])
    
# or a query which is interpreted by the CLIP model
for r in docs.find(like={'captions': ['Dog catches a frisbee']}, n=10):
    display(r['img'])

Let's now evaluate the quality of this semantic index.

In [None]:
docs.create_validation_set(
    'text2image_retrieval', 
    filter={},
    splitter=lambda x: ({'img': x['img']}, {'captions': [x['captions'][0]]}),
    sample_size=1000,
)
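
To see what the `splitter` does, it helps to apply it to a single document. The document below is invented, with a string standing in for the image. Reading the first part as the side that gets indexed and the second as the side used as the query is an assumption based on the text-2-image task.

```python
splitter = lambda x: ({'img': x['img']}, {'captions': [x['captions'][0]]})

# Hypothetical document; in the collection `img` would hold an image, not a string.
doc = {
    'img': '<image>',
    'captions': ['A dog catches a frisbee', 'A dog jumping in a park'],
}

image_part, caption_part = splitter(doc)
print(image_part)
print(caption_part)
```

Only the first caption is kept, so each validation sample pairs one image with exactly one text query.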

In [None]:
docs.validate_semantic_index('clip', ['text2image_retrieval'], ['p_at_10'])

In [None]:
docs['_semantic_indexes'].find_one()

In the next section of this example, we'll train our own model from scratch. The model will be much simpler than CLIP, but will yield faster retrievals. It will be interesting to see how it compares to CLIP, and to showcase SuperDuperDB as a framework for easily integrating and benchmarking AI models, in particular for retrieval.