# Image Search by Caption

In this notebook, we will train a linear embedding matrix that maps shape-$(512,)$ ResNet descriptor vectors of COCO images into a word-embedding space.
This will enable us to query images based on user-submitted text.

```python
>>> searcher("horses on a beach");
```

![Horses on a beach](https://user-images.githubusercontent.com/29104956/126541255-1e1353a3-cd38-4e00-a0e9-baf9370a9eb6.png)


Image-descriptor vectors (which have already been processed for us) will be denoted by $\vec{d}$.
Vectors in the embedded language space will be denoted by $\vec{w}$.

That is, we want to learn the following linear encoding:

\begin{align}
&\begin{bmatrix}\leftarrow & \vec{d}_{\mathrm{image}} & \rightarrow \end{bmatrix} W_{\mathrm{embed}} = \begin{bmatrix}\leftarrow & \vec{w}_{\mathrm{image}} & \rightarrow \end{bmatrix}
\end{align}

where $\vec{d}_{\mathrm{image}}$ is a $512$-dimensional descriptor vector of an image, produced by a pre-trained ResNet-18 image-classification model.
$\vec{w}_{\mathrm{image}}$ is a $D$-dimensional embedding of this image descriptor.
We want this embedded vector to "live" in the same "semantic space" as word embeddings.

Suppose we want to search for pictures of "horses on a beach".
We can use the $D$-dimensional GloVe embedding for each word in this "caption", and sum these word-embeddings with weights determined by the inverse document frequency (IDF) of each word (we will discuss how these IDFs get computed later).
Thus we can form the embedding for this caption as:

\begin{equation}
\mathrm{IDF(\mathrm{horses}})\vec{w}_{\mathrm{horses}} + \mathrm{IDF(\mathrm{on}})\vec{w}_{\mathrm{on}} + \mathrm{IDF(\mathrm{a}})\vec{w}_{\mathrm{a}} + \mathrm{IDF(\mathrm{beach}})\vec{w}_{\mathrm{beach}} = \vec{w}_{\mathrm{caption}}
\end{equation}

where $\vec{w}_{\mathrm{horses}}$ is the $D$-dimensional GloVe embedding vector for the word "horses", and $\mathrm{IDF(\mathrm{horses}})$ is the inverse document-frequency for "horses" (a positive scalar quantity).

If we have a picture depicting horses on a beach and its corresponding descriptor vector, $\vec{d}_{\mathrm{image}}$ (these descriptor vectors have been pre-computed for us), then we want to embed that descriptor such that the caption's embedding, $\vec{w}_{\mathrm{caption}}$, overlaps substantially with the image's embedding, $\vec{w}_{\mathrm{image}}$.


\begin{align}
\vec{d}_{\mathrm{image}}W_{\mathrm{embed}} &\rightarrow \vec{w}_{\mathrm{image}}\\
\hat{w}_{\mathrm{image}}\cdot \hat{w}_{\mathrm{caption}} &\gg 0
\end{align}


## Our Data

We are going to be working with the [MSCOCO 2014 dataset](https://cocodataset.org/#home).
This dataset consists of 82,783 images, and each image has at least five plain-text captions that describe that image.
These images have also been processed using a pre-trained ResNet-18 classification model, such that we also have a $512$-dimensional descriptor vector, $\vec{d}_{\mathrm{\mathrm{image}}}$, associated with each image, which captures the contents of that image in an abstract way.

All of the pertinent data for this project is found in three data files:
1. Images and associated captions from the MSCOCO 2014 dataset. All of this information is stored in the `data/captions_train2014.json` JSON file. A few notes about this:
    - We won't download all of the images at once; rather, we will have a URL that we can use to download any given image.
    - Each image has associated with it at least one, but possibly more, plain-text captions that describe it.
2. A shape-$(1, 512)$ descriptor vector, $\vec{d}_{\mathrm{\mathrm{image}}}$,  for each image from the MSCOCO dataset. Each of these was produced by processing each image with a pre-trained ResNet-18 classification model. This serves as an enriched/abstract encoding for each image in the dataset. `data/resnet18_features.pkl` contains a dictionary of `image-ID -> descriptor-vector` mappings, where `image-ID` is a unique integer ID for each image in the COCO-dataset.
3. The GloVe-200 word embeddings for a broad vocabulary of words. These will be used to compute $D=200$-dimensional embedding vectors $\hat{w}_{\mathrm{caption}}$ for each caption. They are stored in `"data/glove.6B.200d.txt.w2v"`.


### Loading COCO Data

In [36]:
from pathlib import Path
import json 

# load COCO metadata
filename = "data/captions_train2014.json"
with Path(filename).open() as f:
    coco_data = json.load(f)


The `"data/captions_train2014.json"` JSON file has two fields that we care about: "images" and "annotations".

`coco_data["images"]` contains a list; each entry corresponds to a distinct **image**. For example `image_info = coco_data["images"][0]` stores information for the first image.
Each such entry contains:
- A unique integer ID for the image (`image_info["id"]`)
- The URL where you can download the image (`image_info["coco_url"]`)
- The shape of the image (`image_info["height"]`, `image_info["width"]`)

`coco_data["annotations"]` contains a list; each entry corresponds to a distinct **caption**. For example `caption_info = coco_data["annotations"][0]` stores information for the first caption.
Each such entry contains:
- A unique integer ID for the caption (`caption_info["id"]`)
- The ID of the image that this caption is associated with (`caption_info["image_id"]`)
- The caption, stored as a string (`caption_info["caption"]`)

Keep in mind that there are multiple captions associated with each image. Thus there are 82,783 entries in `coco_data["images"]` and 414,113 entries in `coco_data["annotations"]`.

#### Organizing this data

You should create functionality that stores:
- All the image IDs
- All the caption IDs
- Various mappings between image/caption IDs, and associating caption-IDs with captions
   - `image-ID -> [cap-ID-1, cap-ID-2, ...]`
   - `caption-ID -> image-ID`
   - `caption-ID -> "two dogs on the grass"`

In [61]:
# STUDENT CODE HERE
from collections import defaultdict
from typing import Dict, List

# all image IDs and all caption IDs
image_ids: List[int] = [img["id"] for img in coco_data["images"]]
caption_ids: List[int] = [cap["id"] for cap in coco_data["annotations"]]

# image-ID -> [cap-ID-1, cap-ID-2, ...]
image_id_to_caption_ids: Dict[int, List[int]] = defaultdict(list)
# caption-ID -> image-ID
caption_id_to_image_id: Dict[int, int] = {}
# caption-ID -> caption text
caption_id_to_caption: Dict[int, str] = {}

for cap in coco_data["annotations"]:
    image_id_to_caption_ids[cap["image_id"]].append(cap["id"])
    caption_id_to_image_id[cap["id"]] = cap["image_id"]
    caption_id_to_caption[cap["id"]] = cap["caption"]

# print only the sizes; printing every mapping exceeds the notebook's output rate limit
print(len(image_ids), len(caption_ids))



### Loading GloVe Embedding and Creating Embeddings of Our Captions
First, we will load the GloVe-200 embeddings:

In [38]:
from gensim.models import KeyedVectors
filename = "data/glove.6B.200d.txt.w2v"
glove = KeyedVectors.load_word2vec_format(filename, binary=False)

Note that we can query `glove` like a dictionary to get the shape-(D=200,) embedding for any word:

```python
>>> glove["cat"]
array([ 0.14823  , -0.53152  , -0.25973  , -0.44095  ,  0.38555  ,
       -0.4114   , -0.56649  , -0.024739 , -0.2788   , -0.051034 ,
       ...
       0.33923  , -0.071309 ,  0.33717  , -0.0037631, -0.23328  ],
      dtype=float32)
```

Because your functionality above should have access to all of the captions, we can compute a single embedding vector for each of our captions. Some notes on processing captions:
 - We will lowercase, remove punctuation, and tokenize any caption that we work with.
 - **We will not worry about removing stop (a.k.a. glue) words from our captions**.

We compute the inverse document frequency (IDF) of every term that appears in the captions, across all captions:

\begin{equation}
\mathrm{IDF(t)} = \log_{10}{\frac{N_{\mathrm{captions}}}{n_{t}}}
\end{equation}

where $n_{t}$ is the number of captions that term-$t$ appears in.
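As a quick illustration of this formula, the following sketch computes IDFs for a three-caption toy corpus. The toy captions and the helper name `compute_idf` are made up for illustration and are not part of the COCO data.

```python
import math
from collections import Counter

def compute_idf(tokenized_captions):
    """Return {term: log10(N / n_t)}, where n_t is the number of captions containing the term."""
    n_captions = len(tokenized_captions)
    doc_freq = Counter()
    for tokens in tokenized_captions:
        doc_freq.update(set(tokens))  # count each term at most once per caption
    return {term: math.log10(n_captions / n) for term, n in doc_freq.items()}

toy_captions = [
    ["horses", "on", "a", "beach"],
    ["a", "dog", "on", "a", "couch"],
    ["a", "beach", "at", "sunset"],
]
toy_idf = compute_idf(toy_captions)
print(toy_idf["a"])       # "a" appears in every caption, so its IDF is 0.0
print(toy_idf["horses"])  # log10(3 / 1), roughly 0.477
```

Note that a term appearing in every caption receives an IDF of zero, so it contributes nothing to the weighted sum described next.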

Each caption's embedding is created via an IDF-weighted sum of the GloVe embeddings of the words in the caption.
We then normalize this vector to unit length.
E.g., if the caption were "Horses on a beach", the following shape-($D=200$,) embedding would be formed:

\begin{equation}
\mathrm{IDF(\mathrm{horses}})\vec{w}_{\mathrm{horses}} + \mathrm{IDF(\mathrm{on}})\vec{w}_{\mathrm{on}} + \mathrm{IDF(\mathrm{a}})\vec{w}_{\mathrm{a}} + \mathrm{IDF(\mathrm{beach}})\vec{w}_{\mathrm{beach}} = \vec{w}_{\mathrm{caption}}
\end{equation}
\begin{equation}
\mathrm{norm}(\vec{w}_{\mathrm{caption}}) \rightarrow \hat{w}_{\mathrm{caption}}
\end{equation}
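The two equations above can be sketched on toy data as follows. The 3-dimensional vectors and the IDF values are hypothetical stand-ins for the 200-dimensional GloVe vectors and the IDFs computed from the captions; `embed_caption` is an illustrative name.

```python
import numpy as np

# Hypothetical 3-dimensional "word embeddings" and IDF values, standing in for
# the 200-dimensional GloVe vectors and the IDFs computed from the captions.
toy_vectors = {
    "horses": np.array([1.0, 0.0, 0.0]),
    "on": np.array([0.0, 1.0, 0.0]),
    "a": np.array([0.0, 0.0, 1.0]),
    "beach": np.array([1.0, 1.0, 0.0]),
}
toy_idfs = {"horses": 2.0, "on": 0.5, "a": 0.0, "beach": 1.5}

def embed_caption(tokens, vectors, idfs):
    """IDF-weighted sum of word vectors, scaled to unit length."""
    w_caption = sum(idfs[t] * vectors[t] for t in tokens)
    return w_caption / np.linalg.norm(w_caption)

w_hat = embed_caption(["horses", "on", "a", "beach"], toy_vectors, toy_idfs)
print(w_hat, np.linalg.norm(w_hat))
```

With real captions, some words may be missing from the GloVe vocabulary; how to handle them (for example, by skipping them) is a design choice this sketch leaves out.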

In [None]:
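# STUDENT CODE HERE
from collections import Counter
import numpy as np
import re, string

punc_regex = re.compile('[{}]'.format(re.escape(string.punctuation)))

def strip_punc(corpus):
    """Remove all punctuation characters from a string."""
    return punc_regex.sub('', corpus)

def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    return strip_punc(text).lower().split()

def to_idf(token_lists):
    """Map each term to log10(N / n_t), where n_t counts the captions containing it."""
    N = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per caption
    return {t: np.log10(N / n) for t, n in doc_freq.items()}

def embed_text(tokens, idfs, glove, dim=200):
    """IDF-weighted sum of GloVe vectors, normalized to unit length."""
    vec = np.zeros(dim, dtype=np.float32)
    for t in tokens:
        # assumption: words missing from GloVe or from the caption vocabulary are skipped
        if t in glove and t in idfs:
            vec += idfs[t] * glove[t]
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# caption-ID -> list of tokens
caption_tokens = {c["id"]: tokenize(c["caption"]) for c in coco_data["annotations"]}
idfs = to_idf(list(caption_tokens.values()))
# caption-ID -> unit-length caption embedding
caption_id_to_embedding = {
    cid: embed_text(tokens, idfs, glove) for cid, tokens in caption_tokens.items()
}

print(caption_id_to_embedding[coco_data["annotations"][0]["id"]])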

### Loading Image Descriptor Vectors

In [None]:
# load saved features
import pickle
with Path('data/resnet18_features.pkl').open('rb') as f:
    resnet18_features = pickle.load(f)
image_keys = sorted(resnet18_features.keys()) # The list of image keys in ascending order

`resnet18_features` is simply a dictionary that stores a $\vec{d}_{\mathrm{image}}$ for each image:

```
image-ID -> shape-(512,) descriptor
```

where the image-IDs correspond to those in the COCO dataset.
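During training we will need the descriptors for a whole batch of image IDs at once. A minimal sketch of such a lookup is shown below, using a made-up dictionary in place of `resnet18_features`; the helper name `get_descriptors` is an assumption. Because the descriptors are described both as shape-(512,) and shape-(1, 512) in this notebook, the sketch accepts either form.

```python
import numpy as np

# Hypothetical stand-in for `resnet18_features`: image-ID -> descriptor vector.
toy_features = {
    101: np.random.rand(512),
    205: np.random.rand(1, 512),  # a (1, 512) entry is handled as well
    309: np.random.rand(512),
}

def get_descriptors(image_id_batch, features):
    """Stack the descriptors for a list of image IDs into a shape-(N, 512) array."""
    return np.vstack([np.asarray(features[i]).reshape(1, -1) for i in image_id_batch])

batch = get_descriptors([309, 101, 205], toy_features)
print(batch.shape)  # (3, 512)
```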

## Training Data
We form our training data through the following process:
- **Separate out image IDs into distinct sets for training and validation**
- Pick a random training image and one of its associated captions. We'll call these our "true image" and "true caption"
- Pick a different image. We'll call this our "confusor image".
    - We can also use a fancier way to find a "harder" confusor image (THIS IS OPTIONAL)
        - For some set "tournament size", $n$ (e.g. $n=4$)...
        - Pick $n$ random images (all must be distinct from the true image!) and pick an associated caption for each.
        - Compare the embeddings of the true caption and the $n$ confusor captions using cosine-similarity.
        - The confusor caption with the highest cosine-similarity is the "hardest" confusor. We will use the image associated with this caption as our confusor image.
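The optional "tournament" step can be sketched as follows. This is only an illustration on hand-made 2D unit vectors; the function name `pick_hard_confusor` and the image IDs are hypothetical.

```python
import numpy as np

def pick_hard_confusor(true_caption_vec, candidate_caption_vecs, candidate_image_ids):
    """Return the image ID whose caption is most similar to the true caption.

    All caption vectors are assumed to be unit-length, so a dot product
    is a cosine similarity.
    """
    sims = candidate_caption_vecs @ true_caption_vec  # shape-(n,)
    return candidate_image_ids[int(np.argmax(sims))]

# toy tournament of size n=3 with hand-made 2D unit vectors
true_vec = np.array([1.0, 0.0])
candidates = np.array([[0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]])
hardest = pick_hard_confusor(true_vec, candidates, [11, 22, 33])
print(hardest)  # 22: its caption has the highest cosine similarity (0.6)
```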

Thus our training and validation data each consist of triplets: `(true-caption-ID, true-image-ID, confusor-image-ID)`.
We will use batches of these triplets to train our model.

In [50]:
# STUDENT CODE HERE
import random

# split the distinct image IDs 80/20 into training and validation sets
unique_image_ids = sorted(set(image_ids))
random.shuffle(unique_image_ids)
split = int(0.8 * len(unique_image_ids))
train_image_ids = unique_image_ids[:split]
validation_image_ids = unique_image_ids[split:]

# pick a random training image and one of its captions
true_image_id = random.choice(train_image_ids)
true_caption_id = random.choice(image_id_to_caption_ids[true_image_id])

# pick a confusor image that differs from the true image
confusor_image_id = random.choice(train_image_ids)
while confusor_image_id == true_image_id:
    confusor_image_id = random.choice(train_image_ids)

# one training triplet: (true-caption-ID, true-image-ID, confusor-image-ID)
triplet = (true_caption_id, true_image_id, confusor_image_id)

## Training

### Our Model

Our model simply consists of one matrix that maps a shape-(512,) image descriptor into a shape-(D=200,) embedded vector, and normalizes that vector.

\begin{align}
&\begin{bmatrix}\leftarrow & \vec{d}_{\mathrm{image}} & \rightarrow \end{bmatrix} W_{\mathrm{embed}} = \begin{bmatrix}\leftarrow & \vec{w}_{\mathrm{image}} & \rightarrow \end{bmatrix} \\
\end{align}
\begin{align}
\mathrm{norm}(\vec{w}_{\mathrm{image}}) \rightarrow \hat{w}_{\mathrm{image}}
\end{align}
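To make the shapes concrete, here is a NumPy-only sketch of this forward pass. It does not track gradients, so it cannot be trained as written; the trainable version is expected to be built from the MyGrad/MyNN tools imported in the training cell. The class name `LinearEmbedder` and the initialization scale are assumptions made for illustration.

```python
import numpy as np

class LinearEmbedder:
    """NumPy-only sketch: d_image @ W_embed, followed by row-wise normalization."""

    def __init__(self, input_dim=512, output_dim=200, seed=0):
        rng = np.random.default_rng(seed)
        # Glorot-style scaling, chosen here only for illustration
        scale = np.sqrt(2.0 / (input_dim + output_dim))
        self.W_embed = rng.normal(0.0, scale, size=(input_dim, output_dim))

    def __call__(self, descriptors):
        """Map shape-(N, 512) descriptors to shape-(N, 200) unit vectors."""
        w_image = descriptors @ self.W_embed
        return w_image / np.linalg.norm(w_image, axis=1, keepdims=True)

embedder = LinearEmbedder()
w_hat_image = embedder(np.random.rand(4, 512))
print(w_hat_image.shape)  # (4, 200)
```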



### Our Loss Function

Recall that we have formed triplets of `(true-caption-ID, true-image-ID, confusor-image-ID)`.
We will use these to form a triplet of embedding vectors.
We can simply look up the embedding vector for our caption:
 - `true-caption-ID` $\rightarrow \hat{w}^{\mathrm{(true)}}_{\mathrm{caption}}$

And we can retrieve the descriptor vector for both of our images
- `true-image-ID` $\rightarrow \vec{d}^{\mathrm{(true)}}_{\mathrm{image}}$
- `confusor-image-ID` $\rightarrow \vec{d}^{\mathrm{(confusor)}}_{\mathrm{image}}$

Processing these descriptors with our model will embed them in the same $D=200$-dimensional space as our captions:

\begin{align}
\mathrm{model}(\vec{d}^{\mathrm{(true)}}_{\mathrm{image}}) &= \hat{w}^{\mathrm{(true)}}_{\mathrm{image}} \\
\mathrm{model}(\vec{d}^{\mathrm{(confusor)}}_{\mathrm{image}}) &= \hat{w}^{\mathrm{(confusor)}}_{\mathrm{image}}\\
\end{align}

We want to embed our image's descriptor in a meaningful way, such that the contents of the image reflect the semantics of its captions.
Thus we want

\begin{equation}
\hat{w}^{\mathrm{(true)}}_{\mathrm{image}} \cdot \hat{w}^{\mathrm{(true)}}_{\mathrm{caption}} > \hat{w}^{\mathrm{(confusor)}}_{\mathrm{image}} \cdot \hat{w}^{\mathrm{(true)}}_{\mathrm{caption}}
\end{equation}

We can enforce this using a margin ranking loss:

\begin{align}
\mathrm{sim}_{\mathrm{true}} &=  \hat{w}^{\mathrm{(true)}}_{\mathrm{image}} \cdot \hat{w}^{\mathrm{(true)}}_{\mathrm{caption}} \\
\mathrm{sim}_{\mathrm{confusor}} &=  \hat{w}^{\mathrm{(confusor)}}_{\mathrm{image}} \cdot \hat{w}^{\mathrm{(true)}}_{\mathrm{caption}} \\
\end{align}

\begin{align}
\mathscr{L}(\mathrm{sim}_{\mathrm{true}}, \mathrm{sim}_{\mathrm{confusor}}; \Delta) = \max(0, \Delta - (\mathrm{sim}_{\mathrm{true}} - \mathrm{sim}_{\mathrm{confusor}}))
\end{align}

Note that all of our dot products involve unit vectors, thus we are computing cosine-similarities.
Observe that this loss function encourages $\mathrm{sim}_{\mathrm{true}}$ to be larger than $\mathrm{sim}_{\mathrm{confusor}}$ by at least a margin of $\Delta$.

Of course, we will be training on **batches** of triplets. MyGrad's [margin ranking loss](https://mygrad.readthedocs.io/en/latest/generated/mygrad.nnet.losses.margin_ranking_loss.html) will automatically compute the mean over the batch dimension.
Note that [einsum](https://mygrad.readthedocs.io/en/latest/generated/mygrad.einsum.html#mygrad.einsum) can be used to take pair-wise dot products across two batches of vectors.
E.g. `mg.einsum("ni,ni -> n", a, b)` will take two shape-$(N, D)$ arrays and compute $N$ dot products between corresponding pairs of shape-($D$,) vectors.
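The loss and the batched dot products can be checked by hand on a tiny batch. The sketch below uses plain NumPy rather than MyGrad, so it only illustrates the arithmetic; `margin_ranking_loss_np` is a hypothetical helper, and the accuracy shown (the fraction of triplets where the true image scores higher) is one reasonable definition rather than a prescribed one.

```python
import numpy as np

def margin_ranking_loss_np(sim_true, sim_confusor, margin):
    """Mean over the batch of max(0, margin - (sim_true - sim_confusor))."""
    return np.mean(np.maximum(0.0, margin - (sim_true - sim_confusor)))

# toy batch of N=2 triplets of 2D unit vectors
w_caption = np.array([[1.0, 0.0], [0.0, 1.0]])
w_true = np.array([[1.0, 0.0], [0.6, 0.8]])
w_confusor = np.array([[0.0, 1.0], [0.0, 1.0]])

# pair-wise dot products across the batch dimension
sim_true = np.einsum("ni,ni -> n", w_true, w_caption)          # [1.0, 0.8]
sim_confusor = np.einsum("ni,ni -> n", w_confusor, w_caption)  # [0.0, 1.0]

loss = margin_ranking_loss_np(sim_true, sim_confusor, margin=0.25)
accuracy = np.mean(sim_true > sim_confusor)
print(loss, accuracy)  # roughly 0.225 and 0.5
```

The first triplet already satisfies the margin and contributes zero loss; the second ranks the confusor higher and contributes $0.25 - (0.8 - 1.0) = 0.45$, giving a batch mean of about $0.225$.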

In [None]:
# These imports will assist in the writing of your model class.  
# You should use the below layer type, loss function, optimizer, and initializer
from mynn.optimizers import SGD
import mygrad as mg
from mygrad.nnet.losses import margin_ranking_loss
from mynn.layers import dense
from mygrad.nnet.initializers import glorot_normal

# Create your model class here (you can use the class structure from 
# your autoencoder word embeddings notebook for reference, 
# though they will not be exactly the same)

# STUDENT CODE HERE

num_epochs = #
batch_size = #
margin = #
# STUDENT CODE HERE

In [None]:
%matplotlib notebook
from noggin import create_plot
plotter, fig, ax = create_plot(["loss", "accuracy"])

In [None]:
# Please write a function for retrieving the descriptor vectors of each 
# image in an input batch in the form of a list of image IDs
# and a function to process your batch using your model and loss function.  
# The expected inputs to this function can be seen in the training loop code below
# in the input parameters of the process_batch function

# STUDENT CODE HERE

# Here is the actual training loop code which should save you some time.  
# You may need to change some of the variable names to match the ones you used above.  
# Please ask for help with this if you get stuck

for epoch in range(num_epochs):
    for i in range(0, len(triples_train), batch_size):
        loss, acc = process_batch(
            triples_train[i : i + batch_size],
            model,
            margin,
            coco=coco,
            resnet18_features=resnet18_features,
        )

        loss.backward()
        optimizer.step()

        plotter.set_train_batch(
            dict(loss=loss.item(), accuracy=acc),
            batch_size=len(triples_train[i : i + batch_size]),
        )
        mg.turn_memory_guarding_off()  # slightly speeds up training
    
    with mg.no_autodiff:
        for i in range(0, len(triples_valid), batch_size):
            loss, acc = process_batch(
                triples_valid[i : i + batch_size],
                model,
                margin,
                coco=coco,
                resnet18_features=resnet18_features,
            )
            plotter.set_test_batch(
                dict(loss=loss.item(), accuracy=acc),
                batch_size=len(triples_valid[i : i + batch_size]),
            )
    plotter.set_train_epoch()
    plotter.set_test_epoch()
plotter.plot()

In [None]:
# You can use this code to save your model
with open("image_embed_model.npy", "wb") as f:
    np.save(f, model.dense.weight.data)

## Searching Our Database

It is time to create a database of images that we can search through based on user-written queries.
We will populate this database **using only images from our validation set** so that we know that the quality of our results isn't from "overfitting" on our data.

Now that we have trained our embedding matrix, $W_{\mathrm{embed}}$, we can embed each of the image descriptors from our validation set into the caption semantic space.

\begin{align}
&\begin{bmatrix}\leftarrow & \vec{d}^{(image)}_1 & \rightarrow \\ \leftarrow & \vec{d}^{(image)}_2 & \rightarrow \\ \vdots & \vdots & \vdots \\ \leftarrow & \vec{d}^{(image)}_{N_{val}} & \rightarrow\end{bmatrix} \rightarrow \mathrm{model(\dots)} \rightarrow \begin{bmatrix}\leftarrow & \hat{w}^{(image)}_1 & \rightarrow \\ \leftarrow & \hat{w}^{(image)}_2 & \rightarrow \\ \vdots & \vdots & \vdots \\ \leftarrow & \hat{w}^{(image)}_{N_{val}} & \rightarrow\end{bmatrix}
\end{align}

This is our "database" of images.
How do we search for relevant images given a user-supplied query?
First, we embed the query in the same way that we embedded the captions (using an IDF-weighted sum of GloVe embeddings).

\begin{equation}
\mathrm{"horses \; on \; a \; beach"} \rightarrow \mathrm{IDF(\mathrm{horses}})\vec{w}_{\mathrm{horses}} + \mathrm{IDF(\mathrm{on}})\vec{w}_{\mathrm{on}} + \mathrm{IDF(\mathrm{a}})\vec{w}_{\mathrm{a}} + \mathrm{IDF(\mathrm{beach}})\vec{w}_{\mathrm{beach}} \rightarrow \hat{w}_{\mathrm{query}}
\end{equation}

Then we compute the dot product of this query's embedding against all of our image embeddings in our database.

\begin{align}
\begin{bmatrix}\hat{w}_{\mathrm{query}} \cdot \hat{w}^{(image)}_1 \\ \hat{w}_{\mathrm{query}} \cdot \hat{w}^{(image)}_2 \\ \vdots \\ \hat{w}_{\mathrm{query}} \cdot \hat{w}^{(image)}_{N_{val}}\end{bmatrix} \rightarrow \mathrm{top-}k\;\mathrm{similarity \; scores}
\end{align}

The top-$k$ cosine-similarities point us to the top-$k$ most relevant images for this query!
We need the image-IDs associated with these matches, and then we can fetch their associated URLs from our `CocoData` instance.
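The search step can be sketched with NumPy as follows. The toy database, the image IDs, and the function name `top_k_images` are made up for illustration; the real database would hold the shape-(200,) embeddings of the validation images.

```python
import numpy as np

def top_k_images(query_vec, image_embeddings, image_ids, k=2):
    """Return the k image IDs with the highest cosine similarity to the query.

    `image_embeddings` is shape-(N, D) with unit-length rows and
    `query_vec` is a unit-length shape-(D,) vector.
    """
    sims = image_embeddings @ query_vec
    best = np.argsort(sims)[::-1][:k]  # indices sorted by descending similarity
    return [image_ids[i] for i in best]

# toy database of three 2D unit vectors with made-up image IDs
db = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
ids = [10, 20, 30]
print(top_k_images(np.array([0.0, 1.0]), db, ids, k=2))  # [20, 30]
```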

In [None]:
# Write a function to generate your database of length-200 output vectors, one per image, as produced by your model
# STUDENT CODE HERE

# Write a function to determine the n most relevant images for a given input text
# STUDENT CODE HERE

In [None]:
# At this point, you should have everything you need to create the full pipeline from input query to output image.  
# If you need help putting the finishing touches together, please don't hesitate to ask.  This is where things can 
# very easily get jumbled, so it's better to ask for help early than later.  