## Explore Text2Image models

This notebook explores computing image similarity as the dot product of embeddings from a ResNet50 model pretrained on ImageNet.
This similarity computation is a necessary component of potential Text2Image-based language drift metrics and models.

More specifically, as a proof of concept, I checked whether the dot product of ResNet-based embeddings is higher for images from the same categories than for images from different categories. This check indicates whether a simple dot product of ResNet embeddings is viable as a similarity metric.

The exploration is as follows:
* sample images were retrieved from the COCO dataset
* based on captions similar to the original captions, images were generated with the CogView Text2Image transformer model (Ding et al., 2021, see [here](https://pythonawesome.com/a-pretrained-transformer-for-text-to-image-generation-in-general-domain/)); the online demo [here](https://wudao.aminer.cn/CogView/index.html) was used to generate the images
    * this demo only works for Chinese text, so the sample input texts were translated into Simplified Chinese using Google Translate
* images from texts with unrelated categories were generated as distractors
* a ResNet instance pretrained on ImageNet was loaded and embeddings were created for all the images
* dot products were computed for all pairs of embeddings
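
The last step, computing dot products for all pairs of embeddings, can be sketched in one matrix product. This is only an illustration under assumptions: `pairwise_dot_products` is a hypothetical helper that does not appear in the cells below, and the vectors are toy stand-ins rather than real ResNet embeddings.

```python
import numpy as np

def pairwise_dot_products(embeddings):
    """Return the matrix G with G[i, j] = dot(embeddings[i], embeddings[j])."""
    # Flatten each embedding and stack them into a (num_images, dim) matrix
    E = np.stack([np.asarray(e, dtype=np.float64).reshape(-1) for e in embeddings])
    return E @ E.T

# Toy stand-ins for flattened image embeddings (NOT real ResNet outputs)
toy_embeddings = [
    np.array([1.0, 2.0, 0.0]),
    np.array([2.0, 1.0, 0.0]),
    np.array([0.0, 0.0, 3.0]),
]
gram = pairwise_dot_products(toy_embeddings)
print(gram)
```

The result is symmetric, so only the entries above the diagonal need to be inspected; the diagonal holds each embedding's squared norm.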

First, a COCO image containing the categories "skateboard", "person", and "dog" was retrieved. Then, two sample images were generated from the CogView model. Finally, the dot products between the embeddings were computed.

In [3]:
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions
import numpy as np
import tensorflow as tf

In [4]:
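# instantiate pretrained ResNet model
# `pooling` accepts None, 'avg', or 'max' (strings, not the builtin `max`).
# With pooling=None the model returns the 7x7x2048 feature map, which
# flattens to the 100352-dimensional vectors shown in the outputs.
model = ResNet50(weights='imagenet', include_top=False, pooling=None)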

# load image from unrelated category
img_path = 'cog-view-apple.jpg'
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)
# compute embedding
features_apple = model.predict(x)
# flatten the embedding
features_apple_flat = tf.reshape(features_apple, -1)
features_apple_flat.shape

TensorShape([100352])
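
The length 100352 equals 7 · 7 · 2048, which is the size of the unpooled ResNet50 feature map for a 224×224 input. The sketch below uses a random NumPy array as a stand-in for that feature map (an assumption, so that no model weights are needed) to contrast flattening with global pooling, which would give a 2048-dimensional embedding instead.

```python
import numpy as np

# Random stand-in with the shape ResNet50 returns for a 224x224 input when
# include_top=False and no pooling is applied: (batch, 7, 7, 2048).
rng = np.random.default_rng(0)
feature_map = rng.random((1, 7, 7, 2048)).astype(np.float32)

# Flattening keeps every spatial position (what tf.reshape(features, -1) does)
flat = feature_map.reshape(-1)

# Global pooling collapses the two spatial axes and keeps one value per channel
max_pooled = feature_map.max(axis=(1, 2)).reshape(-1)
avg_pooled = feature_map.mean(axis=(1, 2)).reshape(-1)

print(flat.shape, max_pooled.shape, avg_pooled.shape)
```

Flattened embeddings are sensitive to where objects appear in the image, while pooled embeddings discard position. Which one suits a similarity metric better is an open question for this exploration.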

In [26]:
# load CogView sample for skateboard+person+dog
img2_path = 'cog-view-sample2.jpg'
img2 = image.load_img(img2_path, target_size=(224, 224))
x2 = image.img_to_array(img2)
x2 = np.expand_dims(x2, axis=0)
x2 = preprocess_input(x2)
# compute embedding
features_cog_skateboard = model.predict(x2)
features_cog_skateboard.shape
features_cog_skateboard_flat = tf.reshape(features_cog_skateboard, -1)
features_cog_skateboard_flat.shape

TensorShape([100352])

In [14]:
# load original COCO image for skateboard+person+dog
img3_path = 'person-skateboard-dog-coco.png'
img3 = image.load_img(img3_path, target_size=(224, 224))
x3 = image.img_to_array(img3)
x3 = np.expand_dims(x3, axis=0)
x3 = preprocess_input(x3)
# compute embedding
features_coco_skateboard = model.predict(x3)
features_coco_skateboard.shape
features_coco_skateboard_flat = tf.reshape(features_coco_skateboard, -1)
features_coco_skateboard_flat.shape

TensorShape([100352])

In [31]:
# compute dot product of the original COCO image and the CogView category-related image
tf.tensordot(features_coco_skateboard_flat, features_cog_skateboard_flat, 1)

<tf.Tensor: shape=(), dtype=float32, numpy=31308.82>

In [29]:
# compute dot product of the original COCO image and the CogView category-UNRELATED image
tf.tensordot(features_coco_skateboard_flat, features_apple_flat, 1)

<tf.Tensor: shape=(), dtype=float32, numpy=19613.654>

--> We can see that the dot product is higher for the category-related images.
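
One caveat with raw dot products is that they grow with the norm of each embedding, so an image with large activations can score high against everything. A minimal sketch of a length-normalized variant (cosine similarity) follows. It uses toy NumPy vectors rather than the ResNet embeddings, and `cosine_similarity` is a hypothetical helper that is not part of this notebook.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b after scaling both to unit length."""
    a = np.asarray(a, dtype=np.float64).reshape(-1)
    b = np.asarray(b, dtype=np.float64).reshape(-1)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(a @ b / denom)

# Toy vectors: b points in the same direction as a but is 10x longer
a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a
c = np.array([3.0, 2.0, 1.0])

print("raw dot products:", a @ b, a @ c)
print("cosine similarities:", cosine_similarity(a, b), cosine_similarity(a, c))
```

The raw dot product rewards `b` for its length alone, whereas the cosine similarity depends only on direction. Whether this changes the ranking for the images here would have to be checked on the actual embeddings.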

Now, the same procedure is applied to a second example. 

In [12]:
# load original COCO image for apple+sandwich+cup
img4_path = 'apple-sandwich-cup-coco1.png'
img4 = image.load_img(img4_path, target_size=(224, 224))
x4 = image.img_to_array(img4)
x4 = np.expand_dims(x4, axis=0)
x4 = preprocess_input(x4)
# compute embedding
features_coco_food = model.predict(x4)
features_coco_food_flat = tf.reshape(features_coco_food, -1)
features_coco_food_flat.shape

TensorShape([100352])

In [6]:
# load another COCO image for apple+sandwich+cup
img5_path = 'apple-sandwich-cup-coco2.png'
img5 = image.load_img(img5_path, target_size=(224, 224))
x5 = image.img_to_array(img5)
x5 = np.expand_dims(x5, axis=0)
x5 = preprocess_input(x5)
# compute embedding
features_coco_food2 = model.predict(x5)
features_coco_food2_flat = tf.reshape(features_coco_food2, -1)
features_coco_food2_flat.shape

TensorShape([100352])

In [8]:
# load first image generated by CogView for apple+sandwich+cup
img6_path = 'cog-view-food-sample1.jpg'
img6 = image.load_img(img6_path, target_size=(224, 224))
x6 = image.img_to_array(img6)
x6 = np.expand_dims(x6, axis=0)
x6 = preprocess_input(x6)
# compute embedding
features_cog_food1 = model.predict(x6)
features_cog_food1_flat = tf.reshape(features_cog_food1, -1)
features_cog_food1_flat.shape

TensorShape([100352])

In [9]:
# load second image generated by CogView for apple+sandwich+cup
img7_path = 'cog-view-food-sample2.jpg'
img7 = image.load_img(img7_path, target_size=(224, 224))
x7 = image.img_to_array(img7)
x7 = np.expand_dims(x7, axis=0)
x7 = preprocess_input(x7)
# compute embedding
features_cog_food2 = model.predict(x7)
features_cog_food2_flat = tf.reshape(features_cog_food2, -1)
features_cog_food2_flat.shape

TensorShape([100352])

In [10]:
# compute dot product between first COCO food image and both generated images
print(tf.tensordot(features_coco_food_flat, features_cog_food1_flat, 1))
print(tf.tensordot(features_coco_food_flat, features_cog_food2_flat, 1))

tf.Tensor(61500.562, shape=(), dtype=float32)
tf.Tensor(66899.2, shape=(), dtype=float32)


In [15]:
# compute dot product between first COCO food image and skateboard+person+dog COCO image
print(tf.tensordot(features_coco_food_flat, features_coco_skateboard_flat, 1))

tf.Tensor(43730.094, shape=(), dtype=float32)


--> We can see that, again, within-category images have higher dot products.
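
To make the size of the gap easier to compare across the two examples, the reported dot products can be turned into ratios. The sketch below only re-uses the numbers printed in the cell outputs above; the variable names are mine and purely illustrative.

```python
# Dot products copied from the cell outputs above
coco_skateboard_vs_cog_skateboard = 31308.82   # same categories
coco_skateboard_vs_cog_apple = 19613.654       # unrelated category
coco_food_vs_cog_food = [61500.562, 66899.2]   # same categories
coco_food_vs_coco_skateboard = 43730.094       # unrelated categories

# Ratio of within-category to cross-category dot product; for the food
# example the weaker of the two within-category matches is used.
skateboard_ratio = coco_skateboard_vs_cog_skateboard / coco_skateboard_vs_cog_apple
food_ratio = min(coco_food_vs_cog_food) / coco_food_vs_coco_skateboard

print(f"skateboard example: {skateboard_ratio:.2f}x")
print(f"food example (weakest match): {food_ratio:.2f}x")
```

Both ratios are above 1 (roughly 1.6 and 1.4), which matches the observation above. With only two examples this is an indication, not a statistical result.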

In [None]:
# compute dot product between second COCO food image and 