In [None]:
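from cogworks_data.language import get_data_path
from gensim.models import KeyedVectors

# 50-dimensional GloVe vectors, matching the 50-dimensional semantic space
filename = "glove.6B.50d.txt.w2v"

# this takes a while to load -- keep this in mind when designing your capstone project
glove = KeyedVectors.load_word2vec_format(get_data_path(filename), binary=False)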

## 4 Image Embedding
### 4.1 Image Features

To bootstrap our image embedding capability, we're going to make use of a pre-trained computer vision model that was trained to do image classification on the ImageNet dataset. (For more info on ImageNet, see: http://www.image-net.org/).

In particular, we're going to use the ResNet-18 model (implemented in PyTorch). ResNet-18 is a type of convolutional neural network (CNN). The last layer of ResNet-18 is a fully connected layer ("dense" layer in MyNN terminology; "Linear" layer in PyTorch terminology) that projects the final 512-dimensional "abstract features" to 1000-dimensional scores used for computing probabilities that an input image is one of the 1000 possible ImageNet classes. We're going to use these 512-dimensional abstract features (which have distilled useful properties/characteristics/aspects of the image) as a starting point for our image embedder.

You can imagine that we have a function called `extract_features_resnet18(image)` that takes an image (e.g., in PIL format), runs it through the ResNet-18 model, and then returns a NumPy array of shape (1, 512). You don't have to write this function yourself, though! We've already pre-extracted these 512-dimensional image features for images in the COCO dataset (described below).

The file `resnet18_features.pkl` contains a dictionary that maps image ids (from the COCO dataset) to extracted image features, imported below: 

In [None]:
# Load saved image descriptor vectors
import pickle
from pathlib import Path
from cogworks_data.language import get_data_path

with Path(get_data_path('resnet18_features.pkl')).open('rb') as f:
    resnet18_features = pickle.load(f)

### 4.2 Semantic Embedding

We will learn a mapping from the 512-dimensional image feature space to the common 50-dimensional semantic space of the following form:

&nbsp;&nbsp;&nbsp;&nbsp; `se_image(img_features) = img_features M + b`

where `M` is a parameter of shape `(512, 50)` and `b` is a parameter of shape `(1, 50)`.
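
As a rough illustration of the shapes involved, the mapping can be sketched in plain NumPy. The random `M`, the zero `b`, and the fake feature arrays are placeholders (assumptions for illustration only); in the project these would be trainable parameters of your MyNN model and real ResNet-18 features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder (untrained) parameters with the shapes given above
M = rng.normal(scale=0.01, size=(512, 50))
b = np.zeros((1, 50))


def se_image(img_features):
    """Affine map from image features, shape (N, 512), to embeddings, shape (N, 50)."""
    return img_features @ M + b


# Fake features standing in for ResNet-18 descriptors
single = rng.normal(size=(1, 512))
batch = rng.normal(size=(3, 512))

single_embedding = se_image(single)  # shape (1, 50)
batch_embedding = se_image(batch)    # shape (3, 50); b broadcasts over the rows
```

Because `b` has shape `(1, 50)`, the same function handles a single image or a whole batch of images.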

### 4.3 Training

To find good values for the parameters `M` and `b`, we'll create a training set containing triples of the form:

&nbsp;&nbsp;&nbsp;&nbsp; `(text, good_image, bad_image)`

We want the similarity in semantic space between `text` and `good_image` to be greater than the similarity between `text` and `bad_image`, i.e.,

&nbsp;&nbsp;&nbsp;&nbsp; `sim(se_text(text), se_image(good_image)) > sim(se_text(text), se_image(bad_image))`

To encourage this relationship to be true for a triple, we'll use a loss function called the "margin ranking loss" (e.g., `mygrad.nnet.margin_ranking_loss`). This loss penalizes the model when the two values are ordered incorrectly, and stops penalizing once the ordering is right "enough", i.e., once the correct value exceeds the other by at least the desired margin. The reasoning is that once the ordering between values is sufficiently right, we don't need to waste effort trying to make it even more right.

The margin ranking loss is defined as:

&nbsp;&nbsp;&nbsp;&nbsp; `loss(x1, x2, y, margin) = maximum(0, margin - y * (x1 - x2))`

where `y = 1` means `x1` should be ranked higher than `x2` and `y = -1` means `x2` should be ranked higher than `x1`.
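
A minimal NumPy sketch of this formula may help, assuming the usual convention that `y` is either `1` or `-1`. The margin of `0.25` and the similarity values are made-up numbers. It only illustrates the behavior; for training you would want an autodiff-aware version such as the MyGrad one mentioned above.

```python
import numpy as np


def margin_ranking_loss(x1, x2, y, margin):
    """Elementwise margin ranking loss: max(0, margin - y * (x1 - x2))."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.maximum(0.0, margin - y * (x1 - x2))


# Three hypothetical (x1, x2) pairs where x1 should rank higher (y = 1)
x1 = np.array([0.9, 0.3, 0.1])
x2 = np.array([0.2, 0.2, 0.4])
losses = margin_ranking_loss(x1, x2, y=1, margin=0.25)
# pair 0: right order by more than the margin -> no penalty
# pair 1: right order but by less than the margin -> small penalty
# pair 2: wrong order -> largest penalty
```

The three pairs give losses of approximately `0.0`, `0.15`, and `0.55`, which shows how the penalty vanishes once the ordering is correct by at least the margin.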

If we let

&nbsp;&nbsp;&nbsp;&nbsp; `sim_to_good = sim(se_text(text), se_image(good_image))`

&nbsp;&nbsp;&nbsp;&nbsp; `sim_to_bad = sim(se_text(text), se_image(bad_image))`

then the loss for a single triple would be:

&nbsp;&nbsp;&nbsp;&nbsp; `loss(sim_to_good, sim_to_bad, 1, margin)`

### 4.4 Enhanced training set

Researchers have found that generating totally random triples for training doesn't usually result in the best performance. In the context of our image search project, notice that picking a random image and one of its captions for `text` and `good_image`, and then picking a totally random other image for `bad_image` will often result in an "easy" triple.

For example, `good_image` and `text` might be "a dog catching a frisbee", while `bad_image` is a picture of a pizza. During training, the model will learn to make these kinds of easy distinctions, but might not be able to make harder ones. For example, it might have trouble properly ranking images of dogs with frisbees versus dogs swimming (in response to queries like "dog with frisbee" or "dog in water") because somewhat similar images won't be paired very often when the training set is constructed totally randomly.

Here's a simple approach for generating more "challenging" triples for the training set. Once a `good_image` and `text` (one of the good image's captions) are chosen, randomly sample a small set of potential bad images. Then pick the bad image that has a caption that's most similar to `text` (in terms of cosine similarity between the semantic embeddings of the captions). This should result in better query performance.
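
The selection step described above could look roughly like the following sketch. It assumes, for simplicity, one unit-length caption embedding per image stored in a dictionary; the function name `sample_hard_negative`, the parameter `num_candidates`, and the toy vectors are all hypothetical.

```python
import numpy as np


def sample_hard_negative(good_id, text_embedding, caption_embeddings, rng, num_candidates=10):
    """Sample a few candidate bad images and keep the one whose caption
    embedding is most similar to the text embedding.

    caption_embeddings: dict mapping image id -> unit-length caption embedding
    """
    other_ids = [image_id for image_id in caption_embeddings if image_id != good_id]
    k = min(num_candidates, len(other_ids))
    candidates = [int(i) for i in rng.choice(other_ids, size=k, replace=False)]
    # Dot product equals cosine similarity for unit-length vectors
    sims = [float(caption_embeddings[i] @ text_embedding) for i in candidates]
    return candidates[int(np.argmax(sims))]


# Toy 2-D "caption embeddings" (made up for illustration)
toy_caption_embeddings = {
    1: np.array([1.0, 0.0]),  # e.g. "dog with frisbee" (the good image)
    2: np.array([0.8, 0.6]),  # e.g. "dog in water"
    3: np.array([0.0, 1.0]),  # e.g. "pizza"
}
rng = np.random.default_rng(0)
hard_id = sample_hard_negative(
    1, toy_caption_embeddings[1], toy_caption_embeddings, rng, num_candidates=2
)
```

Here both other images are sampled as candidates, and image `2` is chosen because its caption is closer to the text than the pizza caption is.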

## 5 Dataset

We'll be using the Microsoft COCO dataset. From the website (http://cocodataset.org/):

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; "COCO is a large-scale object detection, segmentation, and captioning dataset."

The file `captions_train2014.json` contains all the COCO image metadata and annotations (captions) for the official training set from 2014.

Use `json.load()` to convert this file to a dictionary with keys:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; `['info', 'images', 'licenses', 'annotations']`

In [None]:
from cogworks_data.language import get_data_path

from pathlib import Path
import json

# Load COCO metadata
filename = get_data_path("captions_train2014.json")
with Path(filename).open() as f:
    coco_data = json.load(f)
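
Once the metadata is loaded, it is often convenient to index it by image id. The sketch below assumes the field names of the COCO captions format as commonly documented (`id` and `coco_url` on images; `image_id` and `caption` on annotations); it is worth checking them against the keys of your loaded `coco_data`. The tiny `sample` dictionary is an invented stand-in, not real COCO data.

```python
from collections import defaultdict


def captions_by_image(data):
    """Map each image id to the list of its captions."""
    captions = defaultdict(list)
    for annotation in data["annotations"]:
        captions[annotation["image_id"]].append(annotation["caption"])
    return dict(captions)


def url_by_image(data):
    """Map each image id to the URL it can be downloaded from."""
    return {image["id"]: image["coco_url"] for image in data["images"]}


# Invented stand-in with the same structure as the loaded metadata
sample = {
    "images": [
        {"id": 1, "coco_url": "http://example.com/1.jpg"},
        {"id": 2, "coco_url": "http://example.com/2.jpg"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "caption": "A dog catching a frisbee."},
        {"id": 11, "image_id": 1, "caption": "A dog jumps for a disc."},
        {"id": 12, "image_id": 2, "caption": "A pizza on a table."},
    ],
}

image_captions = captions_by_image(sample)
image_urls = url_by_image(sample)
```

With the real data you would call `captions_by_image(coco_data)` and `url_by_image(coco_data)` instead.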

# Semantic Image Search -- Overview

## 1 Goal

The goal of this week's capstone project is to combine techniques from computer vision and natural language processing (NLP) to build a system that allows users to query for images using keywords. For example, a search for "**pizza**" would return results such as:

<img src="pizza_crop.png" width="480">

## 2 Approach

How are we going to achieve this? We're going to map (or "embed") both textual captions and images to a common "semantic" space. Let's assume this semantic space has dimension 50. Let

&nbsp;&nbsp;&nbsp;&nbsp; `se_text()` be a function that maps a piece of text to its semantic embedding, and

&nbsp;&nbsp;&nbsp;&nbsp; `se_image()` be a function that maps an image to its semantic embedding.

We want these mappings to have the property that a caption and image with similar meanings will map to semantic embeddings that are near each other in the semantic space. (This concept should sound familiar from our earlier work with word embeddings...) For example, we would like `se_text('pizza')` to be close to `se_image(pizza_slice_img)`. 

We'll use cosine similarity to measure the similarity between two vectors: 

\begin{equation}
\text{sim}(\vec{x}, \vec{y}) = \cos{\theta} = \frac{\vec{x} \cdot \vec{y}}{\lVert \vec{x} \rVert \lVert \vec{y} \rVert}
\end{equation}

where $\lVert \vec{x} \rVert = \sqrt{x_0^2 + x_1^2 + ...}$ is the magnitude of $\vec{x}$ and $\vec{x} \cdot \vec{y}$ is the *dot product* of the two vectors. Note that if both vectors are already normalized to have unit length, then cosine similarity is equivalent to the dot product between the two vectors.
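
The formula and the unit-length shortcut can be checked with a small NumPy sketch. The helper names `cosine_similarity` and `normalize` and the example vectors are chosen here for illustration.

```python
import numpy as np


def cosine_similarity(x, y):
    """Cosine of the angle between two 1-D vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


def normalize(x):
    """Scale a vector to unit length."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x)


a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

sim = cosine_similarity(a, b)                  # 24 / (5 * 5)
sim_unit = float(normalize(a) @ normalize(b))  # plain dot product of unit vectors
```

Both values agree (up to floating-point error), which is why normalizing embeddings once up front lets later similarity computations use a simple dot product.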

Once we have `se_text()` and `se_image()`, we can build our image search system:
- **Preprocessing**: Create an image database by running a collection of images through `se_image()` and saving their semantic embeddings.
- **Search by text**: To process a text query, compute the cosine similarity between the semantic embedding of the query, `se_text(query)`, and every semantic embedding in the image database, then return the images with the highest scores.
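
The search step can be sketched as follows, assuming the database is stored as a list of image ids plus a matrix of unit-length embeddings (one row per image). The function name `top_k_images` and the toy 2-D embeddings are hypothetical.

```python
import numpy as np


def top_k_images(query_embedding, image_ids, image_embeddings, k=4):
    """Return the ids of the k images most similar to the query.

    query_embedding:  shape (D,), unit length
    image_embeddings: shape (N, D), each row unit length
    """
    scores = image_embeddings @ query_embedding  # cosine similarities, shape (N,)
    order = np.argsort(scores)[::-1][:k]         # indices of the highest scores first
    return [image_ids[i] for i in order]


# Toy database of three images with 2-D embeddings
db_ids = [101, 102, 103]
db_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
query = np.array([0.0, 1.0])

results = top_k_images(query, db_ids, db_embeddings, k=2)
```

A single matrix-vector product scores the whole database at once, so the embeddings only need to be computed and stacked one time during preprocessing.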

## 3 Text Embedding

We'll embed text by essentially averaging the word embeddings (e.g., GloVe embeddings of dimension 50) for all words in the string.

We'll weight each word by the inverse document frequency (IDF) of the word (computed across all captions in the training data) in order to down-weight common words.

We'll also normalize the weighted sum to have unit length so that similarities can be computed with the dot product.
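
Put together, the text embedding might look like the sketch below. Several choices here are assumptions rather than requirements: the simple regex tokenizer, the IDF variant `log(N / n_t)`, skipping words that lack a vector, and the toy 2-D vectors standing in for GloVe (`word_vectors` only needs to support `in` and indexing by word).

```python
import re
from collections import Counter

import numpy as np


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def compute_idf(captions):
    """IDF per word: log(N / number of captions containing the word)."""
    n = len(captions)
    doc_freq = Counter()
    for caption in captions:
        doc_freq.update(set(tokenize(caption)))
    return {word: float(np.log(n / count)) for word, count in doc_freq.items()}


def se_text(text, word_vectors, idf, dim=50):
    """IDF-weighted sum of word vectors, normalized to unit length."""
    total = np.zeros(dim)
    for word in tokenize(text):
        if word in word_vectors and word in idf:
            total += idf[word] * word_vectors[word]
    norm = np.linalg.norm(total)
    return total / norm if norm > 0 else total


toy_captions = ["a dog on a beach", "a pizza on a table", "a dog with a frisbee"]
toy_vectors = {
    "a": np.array([1.0, 0.0]),
    "dog": np.array([0.0, 1.0]),
    "pizza": np.array([1.0, 1.0]),
}

idf = compute_idf(toy_captions)
# "a" appears in every caption, so its IDF weight is 0 and it drops out
embedding = se_text("a dog", toy_vectors, idf, dim=2)
```

In this toy example the common word "a" contributes nothing, so the embedding of "a dog" points entirely in the direction of the "dog" vector.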

Note that there are no trainable parameters here. You can load the GloVe embeddings with gensim's `KeyedVectors.load_word2vec_format`, as done in the loading cell at the start of this notebook.


## 6 Tasks for Team

These tasks are the minimum set of things that need to be completed as part of the capstone project. You should coordinate with your team about how to divide them up. (Note: Some might naturally be combined.)

* create capability to embed text (captions and queries)
* create training and validation sets of triples
* create function to compute loss, accuracy (in terms of triples correct)
* create MyNN model for embedding images
* train model
 * embed caption
 * embed good image
 * embed bad image
 * compute similarities from caption to good and caption to bad
 * compute loss with margin ranking loss
 * take optimization step
* create image "database" by mapping whole set of image features to semantic features with trained model
* create function to query database and return top k images
* create function to display set of images (given image ids)
 * note that the image metadata (contained in `captions_train2014.json`) includes a property called "coco_url" that can be used to download a particular image on demand for display
 * maybe display their captions, too (for debugging)
* create function that finds top k similar images to a query image
 * maybe give option for doing similarity search in either semantic space or original image feature space
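
For the loss-and-accuracy task, one plausible reading of "accuracy in terms of triples correct" is the fraction of triples in which the good image scores higher than the bad image. A small sketch under that assumption (the function name `triple_accuracy` and the similarity values are hypothetical):

```python
import numpy as np


def triple_accuracy(sim_to_good, sim_to_bad):
    """Fraction of triples where the good image is ranked above the bad one."""
    sim_to_good = np.asarray(sim_to_good, dtype=float)
    sim_to_bad = np.asarray(sim_to_bad, dtype=float)
    return float(np.mean(sim_to_good > sim_to_bad))


# Similarities for four made-up triples
good = [0.9, 0.3, 0.1, 0.7]
bad = [0.2, 0.2, 0.4, 0.7]

accuracy = triple_accuracy(good, bad)  # first two triples correct; a tie counts as wrong
```

Tracking this number on a validation set alongside the loss gives a more interpretable picture of training progress than the loss alone.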