# Image retrieval, captioning and classification with CoCo

This tutorial uses the [CoCo dataset "Common objects in Context"](https://cocodataset.org/#home) to showcase some of the key features of SuperDuperDB. In this example, you'll learn how to:

- Prepare data in the best way for SuperDuperDB usage
- Define data types
- Upload data to the database and query it back
- Define multiple models on the database, including models with dependencies
- Define a searchable semantic index based on existing models
- Train a semantic index from scratch

## Setting up the data and installing requirements

If you haven't downloaded the data already, execute the lines of bash below. We've tried to keep it clean,
and for reasons of efficiency have resized the images using imagemagick.

In [None]:
!curl http://images.cocodataset.org/annotations/annotations_trainval2014.zip -o raw.zip
!unzip raw.zip
!curl http://images.cocodataset.org/zips/train2014.zip -o images.zip
!unzip -q images.zip
!sudo apt install -y imagemagick
!mogrify -resize 224x 'train2014/*.jpg'
!mkdir -p data/coco
!mv train2014 data/coco/images
!mv annotations/* data/coco
!rm raw.zip
!rm images.zip

SuperDuperDB uses MongoDB for data storage. If you haven't done so already, install it using the following lines of bash.

In [None]:
!sudo apt-get install gnupg
!wget -qO - https://www.mongodb.org/static/pgp/server-6.0.asc | sudo apt-key add -
!echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/6.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-6.0.list
!sudo apt-get update
!sudo apt-get install -y mongodb-org

In case you haven't done so already, install the dependencies for this tutorial, including SuperDuperDB,
which is a simple pip install.

In [None]:
!pip install pandas
!pip install pillow
!pip install torch
!pip install -r ../../requirements.txt

## Preparing the data for ingestion

SuperDuperDB can handle data in any format, including images. The documents in the database are MongoDB `bson` documents, which mix `json` with raw bytes and `ObjectId` objects. SuperDuperDB takes advantage of this by
serializing more sophisticated objects to bytes, and reinstantiating the objects in memory when the data is queried.

In order to tell SuperDuperDB what type an object has, one specifies this with a subdocument of the form:

```json
{
    "_content": {
        "bytes": ...,
        "type": "<my-type>",
    }
}
```

If, however, the content is located on the web or the filesystem, one can specify the URL directly:

```json
{
    "_content": {
        "url": "<url-or-file>",
        "type": "<my-type>",
    }
}
```
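
The two forms above can be produced by a small helper. The following is a minimal sketch: `make_content` is a hypothetical function written for illustration (it is not part of SuperDuperDB), and the file name and byte string are placeholders.

```python
import json


def make_content(type_name, *, data=None, url=None):
    """Build a `_content` subdocument from in-memory bytes or from a URL.

    Hypothetical helper for illustration; not part of SuperDuperDB.
    """
    if (data is None) == (url is None):
        raise ValueError("pass exactly one of `data` or `url`")
    if data is not None:
        return {"_content": {"bytes": data, "type": type_name}}
    return {"_content": {"url": url, "type": type_name}}


# Content on disk or on the web is referenced by URL ...
by_url = make_content("image", url="file://data/coco/images/example.jpg")
# ... while content already in memory is passed as raw bytes (placeholder here).
by_bytes = make_content("image", data=b"\x89PNG...")

print(json.dumps(by_url, indent=2))
```

Requiring exactly one of `data` and `url` keeps the two document shapes from being mixed up by accident.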

Let's see this now in action. We reformat the CoCo data, so that each image is associated in one document with all of the captions which describe it, and add the location of the images using the `_content` keyword.

In [3]:
import json

with open('data/coco/captions_train2014.json') as f:
    raw = json.load(f)
    
raw['images'] = {x['id']: x for x in raw['images']}

for im in raw['images']:
    raw['images'][im]['captions'] = []
    
for a in raw['annotations']:
    raw['images'][a['image_id']]['captions'].append(a['caption'])

raw = list(raw['images'].values())

for i, im in enumerate(raw):
    # if image is already in memory, then add 'bytes': b'...' instead of 'url': '...'
    # for content located on the web, use 'http://' or 'https://' instead of 'file://'
    im['img'] = {
        '_content': {'url': f'file://data/coco/images/{im["file_name"]}', 'type': 'image'}
    }
    raw[i] = {'captions': im['captions'], 'img': im['img']}
    
print(json.dumps(raw[:2], indent=2))

[
  {
    "captions": [
      "A restaurant has modern wooden tables and chairs.",
      "A long restaurant table with rattan rounded back chairs.",
      "a long table with a plant on top of it surrounded with wooden chairs ",
      "A long table with a flower arrangement in the middle for meetings",
      "A table is adorned with wooden chairs with blue accents."
    ],
    "img": {
      "_content": {
        "url": "file://data/coco/images/COCO_train2014_000000057870.jpg",
        "type": "image"
      }
    }
  },
  {
    "captions": [
      "A man preparing desserts in a kitchen covered in frosting.",
      "A chef is preparing and decorating many small pastries.",
      "A baker prepares various types of baked goods.",
      "a close up of a person grabbing a pastry in a container",
      "Close up of a hand touching various pastries."
    ],
    "img": {
      "_content": {
        "url": "file://data/coco/images/COCO_train2014_000000384029.jpg",
        "type": "image"
      }
    }
  }
]

In [4]:
with open('data/coco/data.json', 'w') as f:
    json.dump(raw, f)

## Inserting the data and retrieving data

Importing a collection with SuperDuperDB is exactly like importing a collection with PyMongo.

In [5]:
import json
import sys

sys.path.append('../../')

from pinnacledb.client import the_client
from IPython.display import display, clear_output
import torch

docs = the_client.coco.documents

We'll load the data and add most of it to the database. We'll hold back some data so that we can see how to update 
the database later.

In [8]:
with open('data/coco/data.json') as f:
    data = json.load(f)
    
docs.insert_many(data[:-1000], verbose=True)

downloading content from retrieved urls
found 81783 urls
number of workers 0


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 81783/81783 [04:53<00:00, 278.90it/s]


<pymongo.results.InsertManyResult at 0x118cfb5e0>

We added the type `image` to the `_content` subrecords earlier. This lets SuperDuperDB interpret the raw data within the `_content.bytes` field.


So that we can load the data using this type, we need to create this type and add it to the database.
Each type is a class with an `encode` and `decode` method, which convert to and from `bytes`.

We import the types from the `examples` directory on GitHub. However, in this notebook we include the code snippets for completeness:

In `examples.types`:

```python
import io
import numpy
import PIL.Image
import torch


class Image:

    @staticmethod
    def encode(x):
        buffer = io.BytesIO()
        x.save(buffer, format='png')
        return buffer.getvalue()

    @staticmethod
    def decode(bytes_):
        return PIL.Image.open(io.BytesIO(bytes_))


class FloatTensor:
    types = (torch.FloatTensor, torch.Tensor)

    @staticmethod
    def encode(x):
        x = x.numpy()
        assert x.dtype == numpy.float32
        return memoryview(x).tobytes()

    @staticmethod
    def decode(bytes_):
        array = numpy.frombuffer(bytes_, dtype=numpy.float32)
        return torch.from_numpy(array).type(torch.float)
```
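
The contract behind these classes is that `decode(encode(x))` gives back `x`. As a minimal sketch of that contract using only the standard library, here is a hypothetical `FloatList` type (not part of the `examples` package) which stores a flat list of floats as little-endian float32 values:

```python
import struct


class FloatList:
    """Hypothetical type: a flat list of floats stored as little-endian float32."""

    @staticmethod
    def encode(x):
        return struct.pack(f"<{len(x)}f", *x)

    @staticmethod
    def decode(bytes_):
        n = len(bytes_) // 4  # four bytes per float32
        return list(struct.unpack(f"<{n}f", bytes_))


values = [0.5, -1.25, 3.0]  # chosen to be exactly representable in float32
encoded = FloatList.encode(values)
round_trip = FloatList.decode(encoded)
print(len(encoded), round_trip)
```

Like `FloatTensor` above, a flat byte buffer does not record shape, so multi-dimensional data comes back one-dimensional unless the shape is stored separately.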

In [9]:
from examples.types import FloatTensor, Image

docs.create_type('float_tensor', FloatTensor())
docs.create_type('image', Image())

## Text-2-image retrieval with pre-trained model

In the first AI task which we implement for this collection of data, we'll set up a model to retrieve relevant images using provided text. We'll use the data from the `captions` field to retrieve the `img` field. To keep an objective record of performance, it's necessary to set up a validation dataset from the collection. We use a **splitter** to define how we'd like to test retrieval. This splits each document into a query part and a retrieved part.

You may have noticed that the records are already randomly split into "train" and "valid" folds during data insertion. The goal of creating a validation set is to fix the data used for model evaluation once and for all. Otherwise, when new data comes in, the set of documents with `"_fold": "valid"` may change.

In [20]:
r = docs.find(raw=False)[1000]

{'_id': ObjectId('63f36c946bb7bf3d196c86ce'),
 'captions': ['a table topped with bowls of food and a glass of water.',
  'Dishes of prepared food laid out on a table with empty glasses and plates',
  'Several bowls of food next to a stack of plates.',
  'Prepared Asian foods sit on a table with dinner plates and glasses.',
  'A black table holds white bowls and plates and different foods as water glasses also sit on the table.'],
 'img': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=224x149>,
 '_fold': 'train'}

In [10]:
from examples.splitters import retrieval_splitter

docs.create_splitter('retrieval_splitter', retrieval_splitter)
docs.create_validation_set(
    'text2image_retrieval', 
    filter={},
    splitter=lambda x: ({'img': x['img']}, {'captions': [x['captions'][0]]}),
    sample_size=1000,
)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 1013.08it/s]

downloading content from retrieved urls
found 0 urls





We can see what the data points in the validation set look like by querying:

In [12]:
docs['_validation_sets'].find_one()

You can see that the sample "query" is split into the `_other` field. This is important when evaluating semantic indexes.
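
To make the role of the splitter concrete, here is a minimal sketch in plain Python. The functions `retrieval_split` and `build_validation_set`, the toy documents, and the exact layout of the stored records are assumptions made for illustration; only the idea of keeping the query part under `_other` comes from the text above.

```python
import random


def retrieval_split(doc):
    """Split a document into the part to retrieve and the part used as query."""
    return {"img": doc["img"]}, {"captions": [doc["captions"][0]]}


def build_validation_set(docs, splitter, sample_size, seed=0):
    """Draw a fixed sample and store each query part under `_other`."""
    rng = random.Random(seed)  # fixed seed, so the sample never changes
    sample = rng.sample(docs, sample_size)
    out = []
    for doc in sample:
        retrieved, query = splitter(doc)
        out.append({**retrieved, "_other": query})
    return out


toy_docs = [
    {"img": f"image-{i}", "captions": [f"caption {i}a", f"caption {i}b"]}
    for i in range(10)
]
validation = build_validation_set(toy_docs, retrieval_split, sample_size=3)
print(validation[0])
```

Because the sample is drawn once and stored, later insertions into the collection cannot change what a model is evaluated on.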

Now let's start adding a model to the collection.
A nice open source model to test text-2-image retrieval is [CLIP](https://openai.com/blog/clip/) which understands images and texts and embeds these in a common vector space.

Note that we are specifying the type of the model output, so that the collection knows how to store the results, as well as "activating" the model with `active=True`. That means that whenever we add data which falls under the `filter`, it will be processed by the model, and the outputs will be added to the collection documents.

The `key` argument specifies which part of the document the model should act on. If `key="_base"` then the model takes the whole document as input. Since we'll be encoding documents as images, we'll choose `key="img"`.
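
The effect of `key` can be pictured with a few lines of plain Python. This is only a sketch of the behaviour described above: `select_input` is a hypothetical function, not the library's implementation.

```python
def select_input(doc, key):
    """Return the part of `doc` that a model would receive for a given `key`."""
    if key == "_base":
        return doc  # the model sees the whole document
    return doc[key]  # the model sees a single field


doc = {"img": "<image>", "captions": ["a dog", "a frisbee"]}

image_input = select_input(doc, "img")
whole_input = select_input(doc, "_base")
print(image_input, whole_input)
```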

We import the CLIP model from the `examples` directory on GitHub. However, for completeness, we quote the code here. It's a thin wrapper around the OpenAI model.

In `examples.models`:

```python
import torch
from clip import load as load_clip, tokenize as clip_tokenize

class CLIP(torch.nn.Module):
    def __init__(self, name):
        super().__init__()
        self.model, self.image_preprocess = load_clip(name)

    def preprocess(self, r):
        if isinstance(r, str):
            return clip_tokenize(r, truncate=True)[0, :]
        elif isinstance(r, list) and isinstance(r[0], str):
            return clip_tokenize(' '.join(r), truncate=True)[0, :]
        return self.image_preprocess(r)

    def forward(self, r):
        if len(r.shape) == 2:
            return self.model.encode_text(r)
        return self.model.encode_image(r)
```

In [24]:
docs.find_one()

{'_id': ObjectId('63f36c946bb7bf3d196c82e6'),
 'captions': ['A restaurant has modern wooden tables and chairs.',
  'A long restaurant table with rattan rounded back chairs.',
  'a long table with a plant on top of it surrounded with wooden chairs ',
  'A long table with a flower arrangement in the middle for meetings',
  'A table is adorned with wooden chairs with blue accents.'],
 'img': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=224x168>,
 '_fold': 'train'}

In [None]:
from examples.models import CLIP

docs.create_model(
    name='clip',
    object=CLIP('RN50'),
    filter={},
    type='float_tensor',
    key='img',
    verbose=True,
    active=True,
)

computing chunk (1/17)
finding documents under filter
done.
processing with clip


 38%|████████████████████████████████████████████████████████▋                                                                                              | 1878/5000 [04:48<07:57,  6.54it/s]

We'll create a companion model which uses the same underlying object as the previous model. That's specified by adding the name instead of the object in the `object` argument. In this case the model is not `active`, since we'll only be using it for querying the collection. We don't need to specify a `type` since that was done in the last step.

In [None]:
docs.create_model(
    name='clip_text',
    object='clip',
    key='captions',
    active=False,
)

We'll also create a measure which tests how similar two outputs are to each other. Since CLIP was trained with cosine similarity, we'll use that here too.

In `examples.measures`:

```python

def dot(x, y):
    return x.matmul(y.T)


def css(x, y):
    x = x.div(x.norm(dim=1)[:, None])
    y = y.div(y.norm(dim=1)[:, None])
    return dot(x, y)

```

In [None]:
from examples.measures import css
docs.create_measure('css', css)

In order to be able to measure performance on the validation set, we'll add a **metric**.

In `examples.metrics`:

```python
class PatK:
    def __init__(self, k):
        self.k = k

    def __call__(self, x, y):
        return y in x[:self.k]

```
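
`PatK` scores a single query: it is `True` when the expected item is among the first `k` results. The sketch below shows one way to turn this into a score for a whole validation set. The helper `mean_metric` is hypothetical, and averaging over queries is an assumption about how such a metric might be aggregated, not a statement about the library.

```python
class PatK:
    def __init__(self, k):
        self.k = k

    def __call__(self, x, y):
        return y in x[:self.k]


def mean_metric(metric, ranked_lists, targets):
    """Average a per-query metric over a validation set (hypothetical helper)."""
    hits = [metric(x, y) for x, y in zip(ranked_lists, targets)]
    return sum(hits) / len(hits)


# Ranked ids returned for three queries, and the id each query should find.
ranked = [["a", "b", "c"], ["c", "a", "b"], ["b", "c", "a"]]
targets = ["a", "a", "a"]

p_at_1 = mean_metric(PatK(1), ranked, targets)
p_at_2 = mean_metric(PatK(2), ranked, targets)
print(p_at_1, p_at_2)
```

Only the first query finds `"a"` in first place, while the first two queries find it within the top two, so the score rises with `k`.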

In [None]:
from examples.metrics import PatK
docs.create_metric('p_at_10', PatK(10))

Now we're ready to add a **semantic index**. This is a tuple of models, one of which is activated in order to populate the collection with vectors. The idea is that any of the models in the **semantic index** can be used to query the collection using nearest neighbour lookup based on the **measure** chosen.
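
Conceptually, the lookup works as in the following brute-force sketch. The names `cosine` and `nearest`, the ids and the two-dimensional vectors are all made up for illustration; a real index would hold the vectors written by the active model and may use a faster search strategy.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query_vector, stored, n):
    """Return the ids of the `n` stored vectors most similar to the query."""
    ranked = sorted(stored, key=lambda _id: cosine(query_vector, stored[_id]), reverse=True)
    return ranked[:n]


# Vectors written to the collection by the active (image) model.
stored = {"img-1": [1.0, 0.0], "img-2": [0.6, 0.8], "img-3": [0.0, 1.0]}
# Vector produced at query time by the companion (text) model.
query_vector = [0.9, 0.1]

top = nearest(query_vector, stored, n=2)
print(top)
```

The two models only need to agree on the vector space: one fills the collection, the other encodes the query.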

In [None]:
from examples.models import CLIP

docs.create_semantic_index(
    'clip',
    models=['clip', 'clip_text'],
    measure='css',
    metrics=['p_at_10'],
)

Now the semantic index has been created, we can search through the data using that index.

We can see that we can get nice meaningful retrievals using the CLIP model from short descriptive pieces of text.
This is very useful, since the model is now deployed to the database, listening for incoming queries.

In [None]:
from bson import ObjectId
from IPython.display import display

docs.semantic_index = 'clip'

# example using item id directly
for r in docs.find({'$like': {'document': {'_id': ObjectId('63d27372745cc274ef3518f2')}, 'n': 10}}):
    display(r['img'])
    
# or a query which is interpreted by the CLIP model
for r in docs.find({'$like': {'document': {'captions': ['Dog catches a frisbee']}, 'n': 5}}):
    display(r['img'])

Let's now evaluate the quality of this semantic index:

In [None]:
docs.validate_semantic_index('clip', ['text2image_retrieval'], ['p_at_10'])

In the next section of this example, let us train our own model from scratch. The model will be much simpler than the CLIP model, but will yield faster retrievals. It will be interesting to see how this compares to CLIP, and it showcases SuperDuperDB as a framework for easily integrating and benchmarking AI models, in particular for retrieval.

## Bespoke text-2-image retriever using custom PyTorch model

First we will implement a simpler sentence embedding, using a word-embedding approach based on GloVe.
Please look at the model in `examples.models.AverageOfGloves`.

In [None]:
!curl https://nlp.stanford.edu/data/glove.6B.zip -o data/glove.6B.zip
!unzip data/glove.6B.zip -d data/glove.6B

We may register this model to the collection in the same way we did for the textual part of CLIP. We quote the full code from the `examples` directory here for completeness:

In `examples.models`:

```python
import re

import torch


class AverageOfGloves:
    def __init__(self, embeddings, index):
        self.embeddings = embeddings
        self.index = index
        self.lookup = dict(zip(self.index, range(len(self.index))))

    def preprocess(self, sentence):
        if isinstance(sentence, list):
            sentence = ' '.join(sentence)
        cleaned = re.sub('[^a-z0-9 ]', ' ',  sentence.lower())
        cleaned = re.sub('[ ]+', ' ',  cleaned)
        words = cleaned.split()
        words = [x for x in words if x in self.lookup]  # dict membership avoids scanning the whole index list
        if not words:
            return torch.ones(50).type(torch.float)
        ix = list(map(self.lookup.__getitem__, words))
        vectors = self.embeddings[ix, :]
        return vectors.mean(0)
```
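
The averaging step can be illustrated without PyTorch. In this minimal sketch, `average_embedding` is a hypothetical stand-alone function and `toy_vectors` is a made-up two-dimensional vocabulary; the cleaning regex and the vector-of-ones fallback mirror the class above.

```python
import re


def average_embedding(sentence, vectors, dim):
    """Average the vectors of the known words in `sentence`."""
    cleaned = re.sub("[^a-z0-9 ]", " ", sentence.lower())
    words = [w for w in cleaned.split() if w in vectors]
    if not words:
        return [1.0] * dim  # same fallback as the class above: a vector of ones
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


toy_vectors = {"dog": [1.0, 0.0], "frisbee": [0.0, 1.0]}

known = average_embedding("Dog catches a frisbee!", toy_vectors, dim=2)
unknown = average_embedding("???", toy_vectors, dim=2)
print(known, unknown)
```

Words outside the vocabulary ("catches", "a") are simply dropped, so only "dog" and "frisbee" contribute to the average.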

In [None]:
import numpy
import torch
from examples.models import AverageOfGloves

with open('data/glove.6B/glove.6B.50d.txt') as f:
    lines = f.read().split('\n')
    
lines = [x.split(' ') for x in lines[:-1]]
index = [x[0] for x in lines]
vectors = [[float(y) for y in x[1:]] for x in lines]
vectors = numpy.array(vectors)

glove = AverageOfGloves(torch.from_numpy(vectors).type(torch.float), index)

docs.create_model(
    'average_glove',
    object=glove,
    key='captions',
    active=False,
)

We also need to create a projection layer, which takes the CLIP outputs, and projects them to the same dimensionality as our `average_glove` model:

In [None]:
docs.create_model(
    'clip_projection',
    object=torch.nn.Linear(1024, 50),
    active=True,
    key='img',
    type='float_tensor',
    features={'img': 'clip'},
    verbose=True,
)

Let's also create a loss function, in order to be able to perform the learning task. Here's the code in full:

In `examples.losses`:

```python
import torch


def ranking_loss(x, y):
    x = x.div(x.norm(dim=1)[:, None])
    y = y.div(y.norm(dim=1)[:, None])
    similarities = x.matmul(y.T)
    return -torch.nn.functional.log_softmax(similarities, dim=1).diag().mean()
```
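
To see what this loss rewards, the sketch below evaluates the same formula in plain Python, starting from a similarity matrix that is assumed to be already computed. `ranking_loss_rows` and the two toy matrices are illustrative only.

```python
import math


def ranking_loss_rows(similarities):
    """Mean over rows of -log_softmax(row), evaluated at the diagonal entry."""
    losses = []
    for i, row in enumerate(similarities):
        log_norm = math.log(sum(math.exp(s) for s in row))
        losses.append(log_norm - row[i])
    return sum(losses) / len(losses)


# Row i holds the similarities of item i to every candidate in the batch.
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each item is closest to its own partner
confused = [[0.0, 1.0], [1.0, 0.0]]  # each item is closest to the wrong partner

loss_aligned = ranking_loss_rows(aligned)
loss_confused = ranking_loss_rows(confused)
print(loss_aligned, loss_confused)
```

The loss is lower when each image vector is most similar to the text vector of its own document, which is the behaviour we want from a retrieval model.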

In [None]:
from examples.losses import ranking_loss

docs.create_loss('ranking_loss', ranking_loss)

Training a semantic index requires:

- One or more models
- A measure function to measure similarity between model outputs
- A loss function
- One or more validation sets
- One or more metrics to measure performance

We now have all of these things ready and registered with the database, so we can start the training:

In [None]:
docs.create_semantic_index(
    'simple_image_search',
    models=['clip_projection', 'average_glove'],
    loss='ranking_loss',
    filter={},
    projection={'image': 0, '_like': 0},
    metrics=['p_at_10'],
    measure='css',
    validation_sets=['text2image_retrieval'],
    batch_size=250,
    num_workers=0,
    n_epochs=20,
    lr=0.001,
    log_weights=True,
    download=True,
    validation_interval=50,
    no_improve_then_stop=5,
    n_iterations=5000,
    use_grads={'clip_projection': True, 'average_glove': False},
)

We can now see that we've set up and trained our own semantic index. Let's take a look at the results of the training:

In [None]:
from matplotlib import pyplot as plt
info = docs['_semantic_indexes'].find_one({'name': 'simple_image_search'})

We can visualize the improvement of metrics during training using standard functionality from the scientific Python ecosystem. No need for Tensorboards or special visualization interfaces!

In [None]:
for k in info['metric_values']:
    if k == 'loss':
        print(info['metric_values'][k])
        plt.figure()
        plt.title('loss')
        plt.plot(info['metric_values'][k])
        continue
    for result in info['metric_values'][k]:
        plt.figure()
        plt.title(f'{k}/{result}')
        plt.plot(info['metric_values'][k][result])
plt.show()

The same can be done for the progression of weights during training:

In [None]:
for parameter in info['weights']:
    plt.figure()
    plt.title(parameter)
    plt.plot(info['weights'][parameter])

In [None]:
docs.refresh_model('clip_projection')

We see that our model provides quick retrievals using its simpler architecture, and we succeeded in doing this with a very small resource footprint:

In [None]:
from IPython.display import display

docs.semantic_index = 'simple_image_search'
for r in docs.find({'$like': {'document': {'captions': ['Dog catches frisbee']}, 'n': 5}}):
    display(r['img'])

Let's see how well our model has done:

In [None]:
docs.validate_semantic_index('simple_image_search', ['text2image_retrieval'], ['p_at_10'])

## Attribute prediction using "imputations"; transfer learning, preparation using SpaCy and simple PyTorch linear predictor

Now we're going to train a different type of model, using a meta-class of machine
learning problems we call "imputation". The basic idea is to generate tags, using as training data
the nouns extracted from the captions attached to each image. For this we'll need one model to prepare
the data by extracting these nouns, and another model to predict the tags. Imputations aren't restricted to this, however: we can handle any machine learning problem which involves predicting one thing from another.
This subsumes classification, attribute prediction, generative adversarial learning, language modelling and more.

For the noun extraction, we'll use an external library `spacy`. You can see this in the source code `examples.models.NounWords`.

In `examples.models`:

```python
import spacy


class NounWords:
    def __init__(self):
        self.nlp = spacy.load('en_core_web_sm')

    def preprocess(self, sentences):
        sentence = ' '.join(sentences)
        nouns = []
        for w in self.nlp(sentence):
            if w.pos_ == 'NOUN':
                nouns.append(str(w).lower())
        nouns = sorted(list(set(nouns)))
        return nouns
```

In [None]:
from examples.models import NounWords
docs.create_model('noun_words', NounWords(), verbose=True, key='captions')

We'll also need a simple validation set:

In [None]:
docs.create_validation_set('attribute_prediction', sample_size=250)

Let's prepare the tag set we'd like to train on in the next few cells:

In [None]:
import collections
import tqdm
all_nouns = []
for r in tqdm.tqdm(docs.find({'_fold': 'train'}, {'_outputs.captions.noun_words': 1}), 
                   total=docs.count_documents({})):
    all_nouns.extend(r['_outputs']['captions']['noun_words'])
    
counts = dict(collections.Counter(all_nouns))
all_nouns = [w for w in counts if counts[w] > 30]
total = docs.count_documents({})
pos_weights = [counts[w] / total for w in all_nouns]
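
The same counting and thresholding logic is shown below on a toy example, so that the effect of the cut-off is easy to follow. The noun lists and the threshold `min_count` are made up for illustration; the cell above uses a threshold of 30 on the real data.

```python
import collections

# Nouns extracted from three toy documents.
noun_lists = [["dog", "frisbee"], ["dog", "park"], ["dog", "frisbee", "grass"]]

flat = [noun for nouns in noun_lists for noun in nouns]
counts = collections.Counter(flat)

min_count = 1  # keep nouns seen more often than this
kept = [w for w in counts if counts[w] > min_count]

total = len(noun_lists)
frequencies = [counts[w] / total for w in kept]
print(kept, frequencies)
```

Rare nouns ("park", "grass") fall below the threshold and are left out of the tag set.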

Now we've got our tag set ready, we can create the items necessary to train the model.

In `examples.models`:

```python
import torch


class FewHot:
    def __init__(self, tokens):
        self.tokens = tokens
        self.lookup = dict(zip(tokens, range(len(tokens))))

    def preprocess(self, x):
        x = [y for y in x if y in self.tokens]
        integers = list(map(self.lookup.__getitem__, x))
        empty = torch.zeros(len(self.tokens))
        empty[integers] = 1
        return empty


class TopK:
    def __init__(self, tokens, n=10):
        self.tokens = tokens
        self.n = n

    def __call__(self, x):
        pred = x.topk(self.n)[1].tolist()
        return [self.tokens[i] for i in pred]

```
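
The two classes are mirror images of each other: one turns tags into a 0/1 target vector, the other turns a score vector back into tags. A plain-Python sketch of both directions follows; the function names, the token list and the scores are invented for illustration.

```python
def few_hot(tags, tokens):
    """Encode a list of tags as a 0/1 vector over `tokens`."""
    lookup = {t: i for i, t in enumerate(tokens)}
    vector = [0.0] * len(tokens)
    for tag in tags:
        if tag in lookup:  # tags outside the tag set are ignored
            vector[lookup[tag]] = 1.0
    return vector


def top_k(scores, tokens, k):
    """Return the `k` tokens with the highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [tokens[i] for i in order[:k]]


tokens = ["dog", "frisbee", "grass", "table"]

target = few_hot(["dog", "grass", "unicorn"], tokens)
predicted = top_k([2.1, -0.3, 1.7, -1.0], tokens, k=2)
print(target, predicted)
```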

In `examples.metrics`:

```python
def jacquard_index(x, y):
    return len(set(x).intersection(set(y))) / len(set(x).union(set(y)))
```
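
As a worked example, the following sketch computes the same ratio on two small tag lists. The function `jaccard` is a stand-alone variant written for illustration; returning `1.0` for two empty inputs is an assumption added here to avoid a division by zero, which the version above does not guard against.

```python
def jaccard(x, y):
    """Size of the intersection divided by the size of the union."""
    x, y = set(x), set(y)
    if not x and not y:
        return 1.0  # assumption: two empty tag sets count as identical
    return len(x & y) / len(x | y)


predicted = ["dog", "grass"]
expected = ["dog", "frisbee", "grass"]

score = jaccard(predicted, expected)
print(score)
```

Two of the three distinct tags are shared, so the score is two thirds.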


In [None]:
from examples.models import FewHot, TopK
from examples.metrics import jacquard_index

docs.create_model('nouns_to_few_hot', FewHot(all_nouns))
docs.create_postprocessor('top_5', TopK(all_nouns, 5))
docs.create_forward('attribute_predictor', torch.nn.Linear(1024, len(all_nouns)))
docs.create_loss('nouns_loss', torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor(pos_weights)))
docs.create_metric('jacquard_index', jacquard_index)

We demonstrate a different way of creating a model. Instead of using a class with `preprocess`, `forward`
and `postprocess` methods, we supply these separately. Sometimes this is advantageous when it comes 
to sharing aspects between models.

In [None]:
docs.create_model('attribute_predictor', forward='attribute_predictor', postprocessor='top_5',
                  key='img', features={'img': 'clip'})

Let's test the model, using the `apply_model` method:

In [None]:
docs.apply_model('attribute_predictor', docs.find_one())

Now let's train the model:

In [None]:
docs.create_imputation(
    'noun_prediction',
    model='attribute_predictor',
    target='nouns_to_few_hot',
    loss='nouns_loss',
    metrics=['jacquard_index'],
    validation_sets=['attribute_prediction'],
    lr=0.001,
    validation_interval=10,
    n_iterations=20,
)

We can view the results of learning (metrics, loss etc.) by looking in the `_imputations` subcollection.

## Image captioning using recurrent language modelling and transfer learning based on CLIP

In the final modelling part of this tutorial, we show that a radically different type of model can be created
which also leverages the `create_imputation` functionality. This is possible because we use a
different loss and target, and a model whose inference `forward` pass differs from its training
`forward` pass. This is a very common occurrence in AI, for example when doing autoregressive
training.
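
The difference between the two passes can be sketched with a toy "model" that is just a lookup table. Everything in this example (the table `NEXT`, the token strings, the function names) is made up for illustration; it only demonstrates why training and inference need separate code paths.

```python
START, STOP = "<s>", "</s>"

# Toy "model": the most likely next token given the previous one.
NEXT = {START: "a", "a": "dog", "dog": "runs", "cat": "runs", "runs": STOP}


def next_token(previous):
    return NEXT.get(previous, STOP)


def teacher_forced(target):
    """Training: every step is conditioned on the true previous token."""
    inputs = [START] + target[:-1]
    return [next_token(tok) for tok in inputs]


def greedy_decode(max_length):
    """Inference: every step is conditioned on the model's own last output."""
    out, prev = [], START
    for _ in range(max_length):
        prev = next_token(prev)
        if prev == STOP:
            break
        out.append(prev)
    return out


train_predictions = teacher_forced(["a", "cat", "runs"])
generated = greedy_decode(max_length=5)
print(train_predictions, generated)
```

During training all steps can be computed at once, since the true caption is known. At inference time there is no caption, so the model has to feed its own predictions back in, one token at a time.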

The model we create takes an image as input and writes out a sentence describing that image in unconstrained English. This task is known as "captioning" in AI speak.

This model will leverage a fixed vocabulary of "allowed" words. Let us first create this quickly in the 
following cell:

In [None]:
import tqdm 
import collections
import re

all_captions = []
n = docs.count_documents({'_fold': 'train'})
for r in tqdm.tqdm_notebook(docs.find({'_fold': 'train'}, {'captions': 1, '_id': 0}), total=n):
    all_captions.extend(r['captions'])
    
all_captions = [re.sub('[^a-z ]', '', x.lower()).strip() for x in all_captions]
words = ' '.join(all_captions).split(' ')
counts = dict(collections.Counter(words))
vocab = sorted([w for w in counts if counts[w] > 5 and w])

Now we can create the model - it utilizes a "tokenizer" for preprocessing the captioning data.

In `examples.models`:

```python

import re

import torch


class SimpleTokenizer:
    def __init__(self, tokens, max_length=15):
        self.tokens = tokens
        if '<unk>' not in tokens:
            tokens.append('<unk>')
        self._set_tokens = set(self.tokens)
        self.lookup = dict(zip(self.tokens, range(len(self.tokens))))
        self.dictionary = {k: i for i, k in enumerate(tokens)}
        self.max_length = max_length

    def __len__(self):
        return len(self.tokens)

    def preprocess(self, sentence):
        sentence = re.sub('[^a-z ]', '', sentence.lower()).strip()
        words = [x for x in sentence.split(' ') if x]
        words = [x if x in self._set_tokens else '<unk>' for x in words]
        words = words[:self.max_length]
        tokenized = list(map(self.lookup.__getitem__, words))
        tokenized = tokenized + [len(self) + 1 for _ in range(self.max_length - len(words))]
        return torch.tensor(tokenized)


class ConditionalLM(torch.nn.Module):
    def __init__(self, tokenizer, n_hidden=512, max_length=15, n_condition=1024):
        super().__init__()

        self.tokenizer = tokenizer
        self.n_hidden = n_hidden
        self.embedding = torch.nn.Embedding(len(self.tokenizer) + 2, self.n_hidden)
        self.conditioning_linear = torch.nn.Linear(n_condition, self.n_hidden)
        self.rnn = torch.nn.GRU(self.n_hidden, self.n_hidden, batch_first=True)
        self.prediction = torch.nn.Linear(self.n_hidden, len(self.tokenizer) + 2)
        self.max_length = max_length

    def preprocess(self, r):
        out = {}
        if 'caption' in r:
            out['caption'] = [len(self.tokenizer)]  + self.tokenizer.preprocess(r['caption']).tolist()[:-1]
        else:
            out['caption'] = [len(self.tokenizer)]
        out['caption'] = torch.tensor(out['caption'])
        if 'img' in r:
            out['img'] = r['img']
        return out

    def train_forward(self, r):
        input_ = self.embedding(r['caption'])
        img_vectors = self.conditioning_linear(r['img']).unsqueeze(0)
        rnn_outputs = self.rnn(input_, img_vectors)[0]
        return self.prediction(rnn_outputs)

    def forward(self, r):
        hidden_states = self.conditioning_linear(r['img']).unsqueeze(0)
        predictions = \
            torch.zeros(r['caption'].shape[0], self.max_length).to(r['caption'].device).type(torch.long)
        predictions[:, 0] = r['caption'][:, 0]
        for i in range(self.max_length - 1):
            rnn_outputs, hidden_states = self.rnn(self.embedding(predictions[:, i]).unsqueeze(1),
                                                  hidden_states)
            logits = self.prediction(rnn_outputs)[:, 0, :]
            predictions[:, i + 1] = logits.topk(1, dim=1)[1][:, 0].type(torch.long)
        return predictions

    def postprocess(self, output):
        output = output.tolist()
        try:
            # the stop token is len(self.tokenizer) + 1; cut at its first position
            first_end_token = output.index(len(self.tokenizer) + 1)
            output = output[:first_end_token]
        except ValueError:
            pass
        output = [x for x in output if x < len(self.tokenizer)]
        return ' '.join(list(map(self.tokenizer.tokens.__getitem__, output)))

```
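
The token conventions used above (start token `len(vocab)`, stop and padding token `len(vocab) + 1`) are easier to follow on a tiny vocabulary. The function `tokenize` and the four-word vocabulary in this sketch are illustrative only.

```python
def tokenize(sentence, vocab, max_length):
    """Map words to ids, replace unknown words, and pad with the stop token."""
    lookup = {w: i for i, w in enumerate(vocab)}
    unk = lookup["<unk>"]
    stop = len(vocab) + 1  # start token is len(vocab), stop token is len(vocab) + 1
    ids = [lookup.get(w, unk) for w in sentence.lower().split()][:max_length]
    return ids + [stop] * (max_length - len(ids))


vocab = ["a", "dog", "runs", "<unk>"]
start = len(vocab)

target = tokenize("A dog sprints", vocab, max_length=5)
# The model input is the target shifted right by one, led by the start token.
model_input = [start] + target[:-1]
print(target, model_input)
```

Shifting the input by one position means that at every step the model is asked to predict the token which follows the ones it has seen so far.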

In [None]:
from examples.models import ConditionalLM, SimpleTokenizer

tokenizer = SimpleTokenizer(vocab)
m = ConditionalLM(tokenizer)

Let us now create the models required for training. One of the models is fairly
trivial, only used to create the prediction target for the learning task:

In [None]:
docs.create_model('conditional_lm', object=m, active=False, features={'img': 'clip'}, key='_base')
docs.create_model('captioning_tokenizer', tokenizer, key='caption', active=False)

We'll use a standard autoregressive loss, of the sort used as a matter of course in language modelling tasks.

In `examples.losses`:


```python
import torch


def auto_regressive_loss(x, y):
    # start token = x.shape[2] - 2, stop_token = x.shape[2] - 1 (by convention)
    stop_token = x.shape[2] - 1
    x = x.transpose(2, 1)
    losses = torch.nn.functional.cross_entropy(x, y, reduction='none')
    not_stops = torch.ones_like(losses)
    not_stops[:, 1:] = (y[:, :-1] != stop_token).type(torch.long)
    normalizing_factors = not_stops.sum(axis=1).unsqueeze(1)
    av_loss_per_row = (losses * not_stops).div(normalizing_factors).sum(axis=1)
    return av_loss_per_row.mean()
```
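
The masking step is the subtle part of this loss: a position counts unless the previous target token was already the stop token, so the first stop token is learned but the padding after it is ignored. The sketch below reproduces that rule for a single row in plain Python; `loss_mask`, `masked_mean` and the numbers are illustrative.

```python
def loss_mask(targets, stop_token):
    """1.0 for positions that count towards the loss, 0.0 for padding."""
    mask = [1.0] * len(targets)
    for i in range(1, len(targets)):
        if targets[i - 1] == stop_token:
            mask[i] = 0.0
    return mask


def masked_mean(losses, mask):
    """Average the per-token losses over the unmasked positions only."""
    return sum(l * m for l, m in zip(losses, mask)) / sum(mask)


stop = 5
targets = [0, 1, 3, 5, 5]  # three words, the stop token, then padding
per_token_losses = [0.2, 0.4, 0.6, 0.8, 9.0]

mask = loss_mask(targets, stop)
row_loss = masked_mean(per_token_losses, mask)
print(mask, row_loss)
```

The large loss on the final padding position has no effect on the result, which is the purpose of the mask.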

In [None]:
from examples.losses import auto_regressive_loss
docs.create_loss('autoregressive_loss', auto_regressive_loss)

Since each record in the database has several captions per image, we'll need to use a so-called "splitter", to 
align the prediction model and prediction target during training. You can see that the splitter randomly chooses
one of the captions to train on for an iteration.

In `examples.splitters`:

```python
import random


def captioning_splitter(r):
    index = random.randrange(len(r['captions']))
    target = {}
    target['caption'] = r['captions'][index]
    r['caption'] = r['captions'][index]
    return r, target
```
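
Note that the splitter above writes `caption` into the record `r` that it receives. The following sketch is a hypothetical variant, `split_for_captioning`, which leaves its input untouched and accepts a seeded random generator so that the choice can be reproduced; neither feature is claimed for the original.

```python
import random


def split_for_captioning(doc, rng=random):
    """Pick one caption at random; return (model input, target) without mutating `doc`."""
    caption = rng.choice(doc["captions"])
    return {**doc, "caption": caption}, {"caption": caption}


doc = {"img": "<image>", "captions": ["a dog runs", "dog on grass", "a puppy"]}
rng = random.Random(0)  # seeded, so repeated runs pick the same caption

model_input, target = split_for_captioning(doc, rng)
print(target)
```

Both halves carry the same caption: the model input uses it for the shifted training sequence, and the target uses it as the sequence to predict.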

In [None]:
from examples.splitters import captioning_splitter

docs.create_splitter('captioning_splitter', captioning_splitter)
captioning_splitter(docs.find_one())

Since we have this new splitter, we need to create a new validation data set:

In [None]:
docs.create_validation_set('captioning', splitter=docs.splitters['captioning_splitter'],
                           sample_size=500, chunk_size=100)

Now we're ready to start the training:

In [None]:
docs.create_imputation(
    'image_captioner',
    model='conditional_lm',
    loss='autoregressive_loss',
    target='captioning_tokenizer',
    splitter='captioning_splitter',
    validation_sets=['captioning'],
    batch_size=50,
    lr=0.001,
)

Let's test the model on a sample data point:

In [None]:
test_docs = list(docs.find().limit(20))
images = list(docs.find({}, {'img': 1}).limit(100))

results = docs.apply_model('conditional_lm', test_docs, batch_size=10)

for r, res in zip(images, results):
    display(r['img'])
    print(res)

We have now trained and evaluated several models of various types, including multiple interacting models with mutual dependencies. Our own efficient semantic search and the attribute predictor are downstream of the CLIP image model, in the sense that at inference time CLIP must be present in order to execute these models. In the case of attribute prediction, the training task was downstream of the spaCy pipeline for part-of-speech tagging; these tags were used to produce targets for training. However, at run-time the spaCy pipeline isn't necessary.

The models which we've added and trained are now ready to go. When data is added to or updated in the collection, the active models will automatically process it and insert the model outputs into the collection documents.

Here is the complete set of models which exist in the collection:

In [None]:
docs.list_models()

Not all of these respond to incoming data. To list only those that do, we specify the `active` argument:

In [None]:
docs.list_models(active=True)

We can see that these models have processed all documents and that their outputs have been saved:

In [None]:
docs.find_one()

## Conclusion and next steps

In this tutorial we showed how to deploy interdependent open-source and user-supplied PyTorch models on a collection of records containing images and text. SuperDuperDB can be used on arbitrary data types, and allows users to define their own models. These models can be set up so that they depend on one another. For example, we applied multiple models in this tutorial which were downstream of the CLIP model. These may additionally be trained or fine-tuned on the data at hand.