# Image retrieval, captioning and classification with CoCo

This tutorial uses the [CoCo dataset ("Common Objects in Context")](https://cocodataset.org/#home) to showcase some of the key features of SuperDuperDB. In this example, you'll learn how to:

- Prepare data in the best way for SuperDuperDB usage
- Define data types
- Upload data to and query data from the database
- Define multiple models on the database, including models with dependencies
- Define a searchable semantic index based on existing models
- Train a semantic index from scratch

If you haven't downloaded the data already, execute the lines of bash below. We've tried to keep it clean,
and for reasons of efficiency we resize the images using ImageMagick.

In [None]:
!mkdir -p data/coco/
!curl http://images.cocodataset.org/annotations/annotations_trainval2014.zip -o data/coco/raw.zip
!unzip data/coco/raw.zip -d data/coco/
!mv data/coco/annotations/captions_train2014.json data/coco/
!rm -rf data/coco/annotations
!rm data/coco/raw.zip
!curl http://images.cocodataset.org/zips/train2014.zip -o data/coco/images.zip
!unzip data/coco/images.zip -d data/coco/
!mv data/coco/train2014 data/coco/images
!rm data/coco/images.zip
!sudo apt install -y imagemagick
!mogrify -resize 224x data/coco/images/*.jpg

SuperDuperDB uses MongoDB for data storage. If you haven't done so already, install it using the following lines of bash.

In [None]:
!sudo apt-get install -y gnupg
!wget -qO - https://www.mongodb.org/static/pgp/server-6.0.asc | sudo apt-key add -
!echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/6.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-6.0.list
!sudo apt-get update
!sudo apt-get install -y mongodb-org

In case you haven't done so already, install the dependencies for this tutorial, including SuperDuperDB,
which is a simple pip install.

In [None]:
!pip install pandas
!pip install pillow
!pip install torch
!pip install pinnacledb

SuperDuperDB can handle data in any format, including images. The documents in the database are MongoDB `bson` documents, which mix `json` with raw bytes and `ObjectId` objects. SuperDuperDB takes advantage of this by
serializing more sophisticated objects to bytes and reinstantiating them in memory when the data is queried.

In order to tell SuperDuperDB what type an object has, one specifies this with a subdocument of the form:

```json
{
    "_content": {
        "bytes": ...,
        "type": "<my-type>",
    }
}
```

If, however, the content is located on the web or the filesystem, one can specify the URL directly:

```json
{
    "_content": {
        "url": "<url-or-file>",
        "type": "<my-type>",
    }
}
```
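As a minimal sketch of the two forms above, the helper below builds a `_content` subdocument from either raw bytes or a URL. The function name `make_content` is hypothetical and not part of SuperDuperDB; only the `_content`, `bytes`, `url` and `type` keys come from the text.

```python
def make_content(type_name, data=None, url=None):
    """Build a `_content` subdocument (hypothetical helper, not a SuperDuperDB API)."""
    # Exactly one of `data` (raw bytes) or `url` must be given.
    if (data is None) == (url is None):
        raise ValueError('pass exactly one of data or url')
    content = {'type': type_name}
    if data is not None:
        content['bytes'] = data
    else:
        content['url'] = url
    return {'_content': content}


# Content already in memory versus content referenced by location.
in_memory = make_content('image', data=b'\x89PNG...')
on_disk = make_content('image', url='file://data/coco/images/example.jpg')
```

Either result can be placed under any field of a document, as is done with the `img` field in the next cell.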

Let's see this now in action. We reformat the CoCo data, so that each image is associated in one document with all of the captions which describe it, and add the location of the images using the `_content` formalism.

In [None]:
import json

with open('data/coco/captions_train2014.json') as f:
    raw = json.load(f)
    
raw['images'] = {x['id']: x for x in raw['images']}

for im in raw['images']:
    raw['images'][im]['captions'] = []
    
for a in raw['annotations']:
    raw['images'][a['image_id']]['captions'].append(a['caption'])

raw = list(raw['images'].values())

for i, im in enumerate(raw):
    # if image is already in memory, then add 'bytes': b'...' instead of 'url': '...'
    # for content located on the web, use 'http://' or 'https://' instead of 'file://'
    im['img'] = {
        '_content': {'url': f'file://data/coco/images/{im["file_name"]}', 'type': 'image'}
    }
    raw[i] = {'captions': im['captions'], 'img': im['img']}

with open('data/coco/data.json', 'w') as f:
    json.dump(raw, f)

In [None]:
import json
import sys

sys.path.append('../../')

from pinnacledb.client import the_client
from IPython.display import display, clear_output
import torch

docs = the_client.coco_example.documents

We'll load the data and add most of it to the database. We'll hold back some data so that we can see how to update 
the database later.

In [None]:
with open('data/coco/data.json') as f:
    data = json.load(f)
    
docs.insert_many(data[:-1000], verbose=True)

We added the type `image` to the `_content` subrecords earlier.
So that we can load the data using this type, we need to add this type to the database.
You can see in `examples/types.py` how the class encodes and decodes data. Suffice it to say at this point
that each type has an `encode` and a `decode` method, which convert to and from `bytes`.
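To make the `encode`/`decode` contract concrete, here is a minimal sketch of a type for flat lists of floats. The class `FloatList` is hypothetical and is not one of the classes in `examples/types.py`; the float32 byte layout is an assumption chosen for illustration.

```python
import struct


class FloatList:
    """Hypothetical type: a flat list of floats stored as little-endian float32 bytes."""

    def encode(self, x):
        # Pack every value as a 4-byte float.
        return struct.pack(f'<{len(x)}f', *x)

    def decode(self, bytes_):
        # Each float32 occupies 4 bytes.
        n = len(bytes_) // 4
        return list(struct.unpack(f'<{n}f', bytes_))


t = FloatList()
payload = t.encode([0.5, -1.0, 2.25])  # bytes, suitable for a `bytes` field
restored = t.decode(payload)           # back to a Python list
```

The only requirement illustrated here is that `decode(encode(x))` gives back the original object.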

In [None]:
from examples.types import FloatTensor, Image

docs.create_type('float_tensor', FloatTensor())
docs.create_type('image', Image())

In the first AI task which we implement for the `docs` collection, we'll set up a model to retrieve relevant images using provided text. For this data, that means using the `captions` field to retrieve the `img` field. To keep an objective record of performance, we can set up an immutable validation dataset from the collection. We use a **splitter** to define how we'd like to test retrieval. This splits each document into a query and a retrieved document.

In [None]:
docs.create_validation_set(
    'text2image_retrieval', 
    filter={},
    splitter=lambda x: ({'img': x['img']}, {'captions': [x['captions'][0]]}),
    sample_size=1000,
)
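To see what the splitter does, the sketch below applies the same lambda by hand to a single made-up document. The document contents are invented for illustration; no database is involved.

```python
# Same splitter as in the call above.
splitter = lambda x: ({'img': x['img']}, {'captions': [x['captions'][0]]})

# A made-up document in the format produced by the data preparation step.
doc = {
    'img': {'_content': {'url': 'file://data/coco/images/example.jpg', 'type': 'image'}},
    'captions': ['A dog catches a frisbee', 'A dog jumping in a park'],
}

# The first part keeps the image, the second keeps only the first caption.
image_part, caption_part = splitter(doc)
```

Since captions are used to retrieve images in this task, the caption part plays the role of the query.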

We can see what the data points in the validation set look like by querying:

In [None]:
docs['_validation_sets'].find_one({'_validation_set': 'text2image_retrieval'})

You can see that the sample "query" is split into the `_other` field. This is important when evaluating semantic indexes.

Now let's start adding a model to the collection.
A nice open source model to test text-2-image retrieval is [CLIP](https://openai.com/blog/clip/) which understands images and texts and embeds these in a common vector space.

Note that we are specifying the type of the model output, so that the collection knows how to store the results, as well as "activating" the model with `active=True`. That means, whenever we add data which fall under the `filter`, then these will get processed by the model, and the outputs will be added to the collection documents.

The `key` argument specifies which part of the document the model should act on. If `key="_base"` then the model takes the whole document as input. Since we'll be encoding the images in the documents, we'll choose `key="img"`.

In [None]:
from examples.models import CLIP

docs.create_model(
    name='clip',
    object=CLIP('RN50'),
    filter={},
    type='float_tensor',
    key='img',
    verbose=True,
    active=True
)

We'll create a companion model which uses the same underlying object as the previous model. That's specified by adding the name instead of the object in the `object` argument. In this case the model is not `active`, since we'll only be using it for querying the collection. We don't need to specify a `type` since that was done in the last step.

In [None]:
docs.create_model(
    name='clip_text',
    object='clip',
    key='captions',
    active=False,
)

We'll also create a measure which tests how similar two outputs are to each other. Since CLIP was trained with cosine similarity, we'll use that here too.

In [None]:
from examples.measures import css

docs.create_measure('css', css)
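For reference, cosine similarity between two vectors can be sketched in plain Python as below. This is not the `css` function from `examples.measures`, whose implementation is not shown here; it only illustrates the quantity being measured.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Vectors pointing the same way score close to 1, orthogonal vectors score 0.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```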

In order to be able to measure performance on the validation set, we'll add a **metric**.

In [None]:
from examples.metrics import PatK

docs.create_metric('p_at_10', PatK(10))
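The source does not show how `PatK` is computed. One common reading for retrieval with a single correct document per query, assumed here, is to count a query as a hit when the correct document appears among the first `k` results, and to average over queries. The function names below are hypothetical.

```python
def hit_at_k(ranked_ids, target_id, k):
    """Return 1.0 if target_id is among the first k ranked ids, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0


def mean_hit_at_k(results, k):
    """Average hit_at_k over (ranked_ids, target_id) pairs."""
    return sum(hit_at_k(r, t, k) for r, t in results) / len(results)


results = [
    (['a', 'b', 'c'], 'b'),  # 'b' is within the top 2
    (['c', 'a', 'b'], 'b'),  # 'b' is outside the top 2
]
score = mean_hit_at_k(results, k=2)
```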

Now we're ready to add a **semantic index**. This is a tuple of models, one of which is activated in order to populate the collection with vectors. The idea is that any of the models in the **semantic index** can be used to query the collection using nearest neighbour lookup based on the **measure** chosen.

In [None]:
from examples.models import CLIP

docs.create_semantic_index(
    'clip',
    models=['clip', 'clip_text'],
    measure='css',
    metrics=['p_at_10'],
)
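The nearest neighbour lookup behind a semantic index can be sketched as a brute-force search: score every stored vector against the query vector with the measure and keep the best `n`. The names `nearest` and `dot` are hypothetical, and the real index may well use a different search strategy.

```python
import heapq


def dot(x, y):
    # Stand-in measure; for unit-length vectors this equals cosine similarity.
    return sum(a * b for a, b in zip(x, y))


def nearest(query_vector, stored, measure, n):
    """Return the ids of the n stored vectors most similar to query_vector."""
    scored = ((measure(query_vector, v), id_) for id_, v in stored.items())
    return [id_ for _, id_ in heapq.nlargest(n, scored)]


# Stored vectors would come from the active model, the query from any model in the index.
stored = {
    'img_1': [1.0, 0.0],
    'img_2': [0.0, 1.0],
    'img_3': [0.6, 0.8],
}
top = nearest([0.8, 0.6], stored, dot, n=2)
```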

In [None]:
from bson import ObjectId
from IPython.display import display

docs.semantic_index = 'clip'
for r in docs.find({'$like': {'document': {'_id': ObjectId('63d27372745cc274ef3518f2')}, 'n': 10}}):
    display(r['img'])

Let's now evaluate the quality of this semantic index:

In [None]:
docs.validate_semantic_index('clip', ['text2image_retrieval'], ['p_at_10'])

In [None]:
docs['_semantic_indexes'].find_one()

We can see that we can get nice meaningful retrievals using the CLIP model from short descriptive pieces of text.
This is very useful, since the model is now deployed to the database, listening for incoming queries.

In [None]:
docs.list_semantic_indexes()

In [None]:
from IPython.display import display

docs.semantic_index = 'clip'
for r in docs.find({'$like': {'document': {'captions': ['Dog catches a frisbee']}, 'n': 5}}):
    display(r['img'])

In the next section of this example, let us train our own model from scratch. The model will be much simpler than the CLIP model, but will yield faster retrievals. It will be interesting to see how this compares to CLIP, and it showcases SuperDuperDB as a framework for easily integrating and benchmarking AI models, in particular for retrieval.

First we will implement a simpler sentence embedding, using a simple word-embedding approach based around GloVe.
Please look at the model in `examples.models.AverageOfGloves`.

In [None]:
!curl -L https://nlp.stanford.edu/data/glove.6B.zip -o data/glove.6B.zip
!unzip data/glove.6B.zip -d data/glove.6B

We may register this model to the collection in the same way we did for the textual part of CLIP:

In [None]:
import numpy
import torch
from examples.models import AverageOfGloves

with open('data/glove.6B/glove.6B.50d.txt') as f:
    lines = f.read().split('\n')
    
lines = [x.split(' ') for x in lines[:-1]]
index = [x[0] for x in lines]
vectors = [[float(y) for y in x[1:]] for x in lines]
vectors = numpy.array(vectors)

glove = AverageOfGloves(torch.from_numpy(vectors).type(torch.float), index)
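The idea of averaging word vectors into a sentence embedding can be sketched in plain Python as follows. This is not the implementation of `AverageOfGloves`; lowercasing, whitespace splitting and returning zeros when no word is known are all assumptions made for the sketch.

```python
def average_embedding(sentence, index, vectors):
    """Mean of the vectors of all known words; zeros if none are known."""
    lookup = {w: i for i, w in enumerate(index)}
    dim = len(vectors[0])
    rows = [vectors[lookup[w]] for w in sentence.lower().split() if w in lookup]
    if not rows:
        return [0.0] * dim
    # Average column by column across the selected word vectors.
    return [sum(col) / len(rows) for col in zip(*rows)]


# Toy two-word vocabulary standing in for the GloVe table.
toy_index = ['dog', 'frisbee']
toy_vectors = [[1.0, 0.0], [0.0, 1.0]]
emb = average_embedding('Dog catches frisbee', toy_index, toy_vectors)
```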

In [None]:
docs.create_model(
    'average_glove',
    object=glove,
    key='captions',
    active=False,
)

In [None]:
docs.create_model(
    'clip_projection',
    object=torch.nn.Linear(1024, 50),
    active=True,
    key='img',
    type='float_tensor',
    features={'img': 'clip'},
    verbose=True,
)

Let's also create a loss function, in order to be able to perform the learning task:

In [None]:
from examples.losses import ranking_loss

docs.create_loss('ranking_loss', ranking_loss)

Training a semantic index requires:

- 1 or more models
- A measure function to measure similarity between model outputs
- A loss function
- One or more validation sets
- One or more metrics to measure performance

We now have all of these things ready and registered with the database, so we can start the training:

In [None]:
docs.create_semantic_index(
    'simple_image_search',
    models=['clip_projection', 'average_glove'],
    loss='ranking_loss',
    filter={},
    projection={'image': 0, '_like': 0},
    metrics=['p_at_10'],
    measure='css',
    validation_sets=['text2image_retrieval'],
    batch_size=250,
    num_workers=0,
    n_epochs=20,
    lr=0.001,
    log_weights=True,
    download=True,
    validation_interval=50,
    no_improve_then_stop=5,
    n_iterations=5000,
    use_grads={'clip_projection': True, 'average_glove': False},
)

We can now see that we've set up and trained our own semantic index. Let's take a look:

In [None]:
docs.list_semantic_indexes()

In [None]:
from matplotlib import pyplot as plt
info = docs['_semantic_indexes'].find_one({'name': 'simple_image_search'})

In [None]:
for k in info['metric_values']:
    if k == 'loss':
        print(info['metric_values'][k])
        plt.figure()
        plt.title('loss')
        plt.plot(info['metric_values'][k])
        continue
    for result in info['metric_values'][k]:
        plt.figure()
        plt.title(f'{k}/{result}')
        plt.plot(info['metric_values'][k][result])
plt.show()

In [None]:
for parameter in info['weights']:
    plt.figure()
    plt.title(parameter)
    plt.plot(info['weights'][parameter])

In [None]:
docs.list_models()

In [None]:
docs.refresh_model('clip_projection')

In [None]:
from IPython.display import display

docs.semantic_index = 'simple_image_search'
for r in docs.find({'$like': {'document': {'captions': ['Dog catches frisbee']}, 'n': 5}}):
    display(r['img'])

In [None]:
from examples.models import NounWords
docs.create_model('noun_words', NounWords(), verbose=True, key='captions')

In [None]:
docs.create_validation_set('attribute_prediction', sample_size=250)

In [None]:
import collections
import tqdm
all_nouns = []
for r in tqdm.tqdm(docs.find({'_fold': 'train'}, {'_outputs.captions.noun_words': 1}), total=docs.count_documents({})):
    all_nouns.extend(r['_outputs']['captions']['noun_words'])
    
counts = dict(collections.Counter(all_nouns))
all_nouns = [w for w in counts if counts[w] > 30]
total = docs.count_documents({})
pos_weights = [counts[w] / total for w in all_nouns]

In [None]:
from examples.models import FewHot, TopK
from examples.metrics import jacquard_index

docs.create_model('nouns_to_few_hot', FewHot(all_nouns))
docs.create_postprocessor('top_5', TopK(all_nouns, 5))
docs.create_forward('attribute_predictor', torch.nn.Linear(1024, len(all_nouns)))
docs.create_loss('nouns_loss', torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor(pos_weights)))
docs.create_metric('jacquard_index', jacquard_index)
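The cell above relies on two ideas: a "few-hot" target vector over the noun vocabulary, and a set-overlap metric. Both are sketched below in plain Python. These are not the `FewHot` or `jacquard_index` implementations from the examples package, and the behaviour for two empty sets is an assumption.

```python
def few_hot(words, vocabulary):
    """1.0 at the position of every vocabulary word present in words, else 0.0."""
    present = set(words)
    return [1.0 if w in present else 0.0 for w in vocabulary]


def jaccard(predicted, actual):
    """Size of the intersection divided by the size of the union of two sets."""
    predicted, actual = set(predicted), set(actual)
    union = predicted | actual
    # Two empty sets are treated as a perfect match (an assumption).
    return len(predicted & actual) / len(union) if union else 1.0


vocabulary = ['dog', 'frisbee', 'grass', 'car']
target = few_hot(['dog', 'frisbee'], vocabulary)
overlap = jaccard(['dog', 'grass'], ['dog', 'frisbee'])
```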

In [None]:
from examples.models import FewHot
docs.create_model('nouns_to_few_hot', FewHot(post.tokens), active=False,
                 key='_outputs.captions.noun_words')

In [None]:
docs.create_model('attribute_predictor', forward='attribute_predictor', postprocessor='top_5',
                  key='img', features={'img': 'clip'})

Let's test the model, using the `apply_model` method:

In [None]:
docs.apply_model('attribute_predictor', docs.find_one())

In [None]:
docs.create_imputation(
    'noun_prediction',
    model='attribute_predictor',
    target='nouns_to_few_hot',
    loss='nouns_loss',
    metrics=['jacquard_index'],
    validation_sets=['attribute_prediction'],
    lr=0.001,
    validation_interval=10,
    n_iterations=20,
)

We can view the results of learning (metrics, loss etc.) by looking in the `_imputations` subcollection:

In [1]:
import json
import sys

sys.path.append('../../')

from pinnacledb.client import the_client
from IPython.display import display, clear_output
import torch

docs = the_client.coco_example.documents

In [2]:
import tqdm 

all_captions = []
n = docs.count_documents({'_fold': 'train'})
for r in tqdm.tqdm_notebook(docs.find({'_fold': 'train'}, {'captions': 1, '_id': 0}), total=n):
    all_captions.extend(r['captions'])


In [3]:
import collections
import re
all_captions = [re.sub('[^a-z ]', '', x.lower()).strip() for x in all_captions]
words = ' '.join(all_captions).split(' ')
counts = dict(collections.Counter(words))
vocab = sorted([w for w in counts if counts[w] > 5 and w])

In [4]:
from examples.models import ConditionalLM, SimpleTokenizer
tokenizer = SimpleTokenizer(vocab)
m = ConditionalLM(tokenizer)
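As a rough picture of what a word-level tokenizer built from `vocab` might do, here is a hypothetical sketch. It is not the `SimpleTokenizer` from `examples.models`; the `<unk>` token and the `encode`/`decode` method names are assumptions, and only the `lookup['</s>']` stop token is taken from how the tokenizer is used in this notebook.

```python
class TinyTokenizer:
    """Hypothetical word-level tokenizer; not the SimpleTokenizer from examples.models."""

    def __init__(self, vocab):
        # Reserve ids for an unknown-word token and a stop token.
        self.tokens = ['<unk>', '</s>'] + list(vocab)
        self.lookup = {t: i for i, t in enumerate(self.tokens)}

    def encode(self, sentence):
        # Unknown words map to '<unk>'; every sequence ends with the stop token.
        ids = [self.lookup.get(w, self.lookup['<unk>']) for w in sentence.split()]
        return ids + [self.lookup['</s>']]

    def decode(self, ids):
        words = []
        for i in ids:
            if i == self.lookup['</s>']:
                break  # stop decoding at the stop token
            words.append(self.tokens[i])
        return ' '.join(words)


tok = TinyTokenizer(['a', 'dog', 'frisbee'])
ids = tok.encode('a dog catches a frisbee')
text = tok.decode(ids)
```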

In [None]:
docs.create_model('conditional_lm', object=m, active=False, features={'img': 'clip'}, key='img')
docs.create_model('captioning_tokenizer', tokenizer, key='caption', active=False)

In [None]:
from examples.losses import AutoRegressiveLoss
loss = AutoRegressiveLoss(stop_token=tokenizer.lookup['</s>'])
docs.create_loss('autoregressive_loss', loss)

In [None]:
docs.list_losses()

In [None]:
docs.list_splitters()

In [None]:
from examples.splitters import captioning_splitter
#docs.create_splitter('captioning_splitter', captioning_splitter)

In [None]:
captioning_splitter(docs.find_one())

In [None]:
docs.create_imputation(
    'image_captioner',
    model='conditional_lm',
    loss='autoregressive_loss',
    target='captioning_target',
    splitter='captioning_splitter',
)

Now we have trained and evaluated several models of various types. This includes multiple interacting models with mutual dependencies. In the case of our own efficient semantic search, and also the attribute predictor, these models are downstream of the image CLIP model, in the sense that at inference time, CLIP must be present in order to execute these models. In the case of attribute prediction, the training task was downstream of the
spaCy pipeline for part-of-speech tagging; these tags were used to produce targets for training. However, at run time, the spaCy pipeline won't be necessary.

The models which we've added and trained are now ready to go. When data is added to or updated in the collection, the active models will automatically process it and insert their outputs into the collection documents.

Here is the complete set of models which exist in the collection:

In [None]:
docs.list_models()

Not all of these respond to incoming data. To list only those that do, we specify the `active` argument:

In [None]:
docs.list_models(active=True)

We can see that these models have processed all documents and that their outputs have been saved:

In [None]:
docs.find_one()

Now, let's test what happens when we add new data to the collection, by adding the remaining data points from the 
CoCo data set:

In [None]:
update = [{**r, "update": True} for r in data[-1000:]]
docs.insert_many(update, verbose=True)