# Session 12: Embedding the Corpora

These notes embed a corpus of images using the VGG19 model.

In [1]:
%pylab inline

import numpy as np
import scipy as sp
import pandas as pd
import sklearn
from sklearn import linear_model
import urllib

import os
from os.path import join

Populating the interactive namespace from numpy and matplotlib


In [2]:
import matplotlib.pyplot as plt
import matplotlib.patches as patches

plt.rcParams["figure.figsize"] = (8,8)

In [3]:
from keras.applications.vgg19 import VGG19
from keras.preprocessing import image
from keras.applications.vgg19 import preprocess_input, decode_predictions
from keras.models import Model

Using TensorFlow backend.


In [4]:
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

## Transfer learning

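We start by loading the VGG19 model with its ImageNet weights and building a
second model that outputs the penultimate layer, fc2, which produces a
4096-dimensional vector for each image.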

In [5]:
vgg19_full = VGG19(weights='imagenet')
vgg_fc2 = Model(inputs=vgg19_full.input, outputs=vgg19_full.get_layer('fc2').output)

Instructions for updating:
Colocations handled automatically by placer.


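Next, we select a corpus and apply the model to it. The function below wraps
up the whole process: it reads the corpus metadata, loads each image at
224x224 pixels, runs the images through the model, and saves the embeddings
to disk.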

In [None]:
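def create_embed(corpus_name):
    """Embed every image in a corpus with VGG19 fc2 and save the result."""
    # Metadata file with one row per image; it must have a 'filename' column.
    df = pd.read_csv(join("..", "data", corpus_name + ".csv"))
    output = np.zeros((len(df), 224, 224, 3))

    for i, filename in enumerate(df.filename):
        img_path = join("..", "images", corpus_name, filename)
        # VGG19 expects 224x224 RGB input.
        img = image.load_img(img_path, target_size=(224, 224))
        output[i, :, :, :] = image.img_to_array(img)
        if (i % 100) == 0:
            print("Loaded image {0:03d}".format(i))

    # Apply the VGG19 input preprocessing, then compute the fc2 embeddings.
    output = preprocess_input(output)
    img_embed = vgg_fc2.predict(output, verbose=True)

    # One row per image, saved alongside the metadata file.
    np.save(join("..", "data", corpus_name + "_vgg19_fc2.npy"), img_embed)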
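
Because np.zeros defaults to 64-bit floats, the input array allocated in
create_embed grows quickly with the size of the corpus. The sketch below
works through that arithmetic. The helper name input_array_bytes is
hypothetical and not part of the original notes.

```python
import numpy as np

def input_array_bytes(n_images, height=224, width=224, channels=3,
                      dtype=np.float64):
    """Bytes needed for an array of shape (n_images, height, width, channels)."""
    return n_images * height * width * channels * np.dtype(dtype).itemsize

# One 224x224 RGB image stored as float64, the np.zeros default.
print(input_array_bytes(1))        # 1204224

# A corpus of 1000 images needs roughly 1.2 GB before the model even runs.
print(input_array_bytes(1000))     # 1204224000
```

For a corpus too large to hold in memory at once, one option is to load and
embed the images in smaller groups and stack the resulting embeddings.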

Let's apply this to the example corpus, which is relatively small and should
not take very long to embed.

In [None]:
create_embed("example")

If you look in your data directory, you should see a file called "example_vgg19_fc2.npy".
We can read this back into Python using:

In [None]:
X = np.load(join("..", "data", "example_vgg19_fc2.npy"))
X.shape
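
As a quick sanity check of the save and load steps, the sketch below
round-trips a stand-in array through np.save and np.load in a temporary
directory. The random array and the file name "demo_vgg19_fc2.npy" are
placeholders rather than real embeddings; the 4096 columns match the width
of the fc2 layer in VGG19.

```python
import os
import tempfile

import numpy as np

# Stand-in for real embeddings: 5 images, 4096 features each.
rng = np.random.default_rng(0)
fake_embed = rng.random((5, 4096)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_vgg19_fc2.npy")
    np.save(path, fake_embed)
    loaded = np.load(path)   # fully read into memory before the directory is removed

print(loaded.shape)          # (5, 4096)
```

With a real corpus, the first dimension of the shape should equal the number
of rows in the corpus CSV file.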

## Iterating over the data

As you saw in the last session, it can take a while to embed a corpus
using a large neural network. Here is the code that would create embeddings
for the three datasets that we will look at in the next set of notes. We
have already run it and can simply distribute the npy files, but it is
useful to have this code if you want to embed a new dataset.

In [None]:
# create_embed("wikiart")
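
One way to avoid recomputing embeddings that already exist is to check for
the npy file before calling the embedding function. The helper below is a
sketch, not part of the original notes: embed_missing and its embed_fn
argument are hypothetical names, and embed_fn is assumed to behave like
create_embed, taking a corpus name and writing the matching
"_vgg19_fc2.npy" file.

```python
import os

def embed_missing(corpus_names, embed_fn, data_dir=os.path.join("..", "data")):
    """Call embed_fn for each corpus whose embedding file is not in data_dir.

    Returns the list of corpus names that were embedded.
    """
    embedded = []
    for name in corpus_names:
        out_path = os.path.join(data_dir, name + "_vgg19_fc2.npy")
        if os.path.exists(out_path):
            # The embeddings were already computed (or distributed); skip.
            print("Skipping {0}: found {1}".format(name, out_path))
            continue
        embed_fn(name)
        embedded.append(name)
    return embedded
```

With the function defined in these notes, the call would look like
embed_missing(["wikiart"], create_embed).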

## Recommendation system

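Now that we have the embeddings, we can build a recommendation system similar
to the one that we used with hue, saturation, and value. For a reference
image, we compute the squared Euclidean distance from its embedding to every
embedding in the corpus and display the nine closest images. The reference
image itself is among them, at distance zero.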

In [None]:
ref_img_num = 1       # change this number!

corpus_name = "wikiart"
df = pd.read_csv(join("..", "data", corpus_name + ".csv"))
X = np.load(join("..", "data", corpus_name + "_vgg19_fc2.npy"))

print(df.iloc[ref_img_num])

# Squared Euclidean distance from the reference embedding to every row of X;
# keep the indices of the nine smallest distances.
idx = np.argsort(np.sum((X - X[ref_img_num, :])**2, axis=1))[:9]

# Show the nine closest images in a 3x3 grid with no margins.
plt.figure(figsize=(14, 14))
plt.subplots_adjust(left=0, right=1, bottom=0, top=1)
for ind, i in enumerate(idx):
    plt.subplot(3, 3, ind + 1)

    img = plt.imread(join('..', 'images', corpus_name, df.filename[i]))
    plt.imshow(img)
    plt.axis("off")

Try some different numbers and see what happens!
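
The lookup in the cell above can be wrapped in a small function so that it
is easy to reuse and to check on toy data. This is a sketch using the same
squared Euclidean distance; the name nearest_images is hypothetical, and the
tiny array stands in for real embeddings.

```python
import numpy as np

def nearest_images(X, ref_img_num, k=9):
    """Return the indices of the k rows of X closest to row ref_img_num."""
    dist = np.sum((X - X[ref_img_num, :]) ** 2, axis=1)
    return np.argsort(dist)[:k]

# Four made-up 2-dimensional "embeddings".
X_demo = np.array([[0.0, 0.0],
                   [1.0, 0.0],
                   [0.0, 3.0],
                   [5.0, 5.0]])

# Distances from row 1 are 1, 0, 10 and 41, so row 1 itself comes first.
print(nearest_images(X_demo, ref_img_num=1, k=3))   # [1 0 2]
```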