## Probing Image-Text Models


This Colab evaluates pretrained image-text models, in a zero-shot setting, on fine-grained **s**ubject, **v**erb, and **o**bject understanding using the [SVO-Probes](https://arxiv.org/abs/2106.09141) dataset.

For a given sentence, the SVO-Probes dataset provides two images, one positive and one negative. The negative image differs from the positive one with respect to either subject, verb, or object. Here are some examples from the SVO-Probes dataset:
<figure>
<center>
<img src='https://storage.googleapis.com/dm-mmt-models/svo_probes_examples.png'>
<figcaption>Examples from the SVO-Probes dataset</figcaption></center>
</figure>


Given a sentence, we test whether a model correctly classifies both the positive and the negative image. The notebook outputs raw accuracy numbers on the SVO-Probes dataset and generates bar charts to visualize the results.


### Setting Up Your Models

You can find the SVO-Probes raw data [here](svo_probes.csv). Each row of the data contains two datapoints, the `<sentence, positive-image>` and `<sentence, negative-image>` pairs. Each image is identified by a URL and a unique id: `pos_image_id` (`pos_url`) for the positive image and `neg_image_id` (`neg_url`) for the negative image.
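
As a rough illustration of the row layout described above, the sketch below expands one row into its two datapoints. The sentence, ids, and URLs are made-up placeholders, and the label column (1 for positive, 0 for negative) is added here only for illustration.

```python
import pandas as pd

# One made-up row using the column names described above.
# The ids and URLs are placeholders, not real SVO-Probes entries.
row_df = pd.DataFrame([{
    'sentence': 'A woman rides a horse',
    'pos_image_id': 11, 'pos_url': 'https://example.com/pos.jpg',
    'neg_image_id': 22, 'neg_url': 'https://example.com/neg.jpg',
}])

# Each row yields two (sentence, image_id, url, label) datapoints.
datapoints = []
for row in row_df.itertuples(index=False):
  datapoints.append((row.sentence, row.pos_image_id, row.pos_url, 1))
  datapoints.append((row.sentence, row.neg_image_id, row.neg_url, 0))

for point in datapoints:
  print(point)
```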

To evaluate your model on SVO-Probes, it needs to output, for each image and sentence pair, whether or not the image and text match.
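
If your model produces a continuous image-text score rather than a binary decision, one possible way to get a match/no-match output is to threshold it. The sketch below is only an assumption about how this could be done: the helper name `to_match` and the default threshold of 0.5 are hypothetical, and a suitable threshold depends on your model.

```python
def to_match(similarity, threshold=0.5):
  """Maps a continuous score to 1 (match) or 0 (no match).

  The threshold value is an assumption; choose it for your own model.
  """
  return 1 if similarity >= threshold else 0


# Made-up scores for illustration.
for score in [0.9, 0.1]:
  print(score, '->', to_match(score))
```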

You can also download the images using their URLs listed [here](image_urls.txt).

### Using the Notebook

To use the notebook, evaluate your model on all image and sentence pairs. Save the results in a JSON dictionary with key:value pairs formatted as `sentence|image_id`: 1 if the sentence matches the image (and 0 if it does not). The notebook looks up each key using the lowercased sentence with repeated spaces collapsed into one.
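
A minimal sketch of writing such a results file is shown below. The key construction mirrors the lookup done in `get_score_dataframe` in this notebook; the helper name `make_key`, the example predictions, and the output file name are assumptions made for illustration.

```python
import json
import os
import re
import tempfile


def make_key(sentence, image_id):
  """Builds a 'sentence|image_id' key: lowercased, repeated spaces collapsed."""
  return re.sub(' +', ' ', '%s|%d' % (sentence.lower(), image_id))


# Hypothetical model outputs as (sentence, image_id, match) triples.
predictions = [
    ('A dog  runs on the beach', 101, 1),
    ('A dog  runs on the beach', 202, 0),
]

results = {make_key(s, i): match for s, i, match in predictions}

# The file name is an assumption; point `json_path` in the notebook at it.
out_path = os.path.join(tempfile.gettempdir(), 'my_model_svo_results.json')
with open(out_path, 'w') as f:
  json.dump(results, f)
```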






In [None]:
import numpy as np
import sys
import json
import csv
import pandas as pd
import re
import matplotlib.pyplot as plt


import seaborn as sns

csv.field_size_limit(sys.maxsize)


Read the image-sentence scores from a JSON file.


In [None]:
def scores(path):
  """Loads a JSON dict mapping 'sentence|image_id' keys to float scores."""
  with open(path, 'r') as f:
    d = json.load(f)
  # id --> score
  return {item: float(d[item]) for item in d}

Read a model's scores on each sentence-image pair

In [None]:
def get_score_dataframe(path, df):
  """Adds pos/neg score columns to df, dropping rows that lack either score."""
  pair_to_scores = scores(path)

  pos_scores = []
  neg_scores = []
  missing_idxs = []
  for idx, sentence, pos_image_id, neg_image_id in zip(df.index,
                                                       df.sentence.values,
                                                       df.pos_image_id.values,
                                                       df.neg_image_id.values):
    # Keys are '<lowercased sentence>|<image id>' with repeated spaces collapsed.
    neg_key = re.sub(' +', ' ', '%s|%d' % (sentence.lower(), neg_image_id))
    pos_key = re.sub(' +', ' ', '%s|%d' % (sentence.lower(), pos_image_id))
    if (pos_key in pair_to_scores) and (neg_key in pair_to_scores):
      pos_scores.append(pair_to_scores[pos_key])
      neg_scores.append(pair_to_scores[neg_key])
    else:
      missing_idxs.append(idx)

  # Drop all rows without both scores in a single pass.
  df = df.drop(missing_idxs)
  df['pos_scores'] = pos_scores
  df['neg_scores'] = neg_scores
  return df

## SVO-Probes: How do models perform on subject, verb, and object pairs?

Read our raw data.

In [None]:
# SVO Probes dataset
!wget  https://storage.googleapis.com/dm-mmt-models/svo_probes.csv  --no-check-certificate  -P '/tmp'
%ls /tmp
df = pd.read_csv('/tmp/svo_probes.csv')


Computing accuracy across different types.

In [None]:
# Change this path to include the scores from your model
!wget https://storage.googleapis.com/dm-mmt-models/mmt_cc_svo_results.json  --no-check-certificate  -P '/tmp'
json_path = '/tmp/mmt_cc_svo_results.json'

In [None]:
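def accuracy(frame):
  """Returns (macro accuracy, positive accuracy, negative accuracy)."""
  # A sentence-image pair can appear in several rows; count each pair once.
  neg = frame[['sentence', 'neg_image_id', 'neg_scores']].drop_duplicates()
  pos = frame[['sentence', 'pos_image_id', 'pos_scores']].drop_duplicates()

  # A negative pair is correct when scored 0, a positive pair when scored 1.
  neg_acc = np.mean([item == 0 for item in neg['neg_scores'].values])
  pos_acc = np.mean([item == 1 for item in pos['pos_scores'].values])
  # Macro average: weight the positive and negative classes equally.
  acc = (neg_acc + pos_acc)/2
  return acc, pos_acc, neg_acc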


data_df = get_score_dataframe(json_path, df)

subj_neg_df = data_df[data_df['subj_neg'] & ~data_df['obj_neg'] 
                      & ~data_df['verb_neg']]
verb_neg_df = data_df[data_df['verb_neg'] & ~data_df['obj_neg'] 
                      & ~data_df['subj_neg']]
obj_neg_df = data_df[data_df['obj_neg'] & ~data_df['verb_neg'] 
                     & ~data_df['subj_neg']]

all_df = pd.concat([subj_neg_df, verb_neg_df, obj_neg_df])

acc_all, pos_acc_all, neg_acc_all = accuracy(all_df)
acc_subj, pos_acc_subj, neg_acc_subj = accuracy(subj_neg_df)
acc_verb, pos_acc_verb, neg_acc_verb = accuracy(verb_neg_df)
acc_obj, pos_acc_obj, neg_acc_obj = accuracy(obj_neg_df)

results = [['All', acc_all, pos_acc_all, neg_acc_all], 
           ['Subj', acc_subj, pos_acc_subj, neg_acc_subj],
           ['Verb', acc_verb, pos_acc_verb, neg_acc_verb],
           ['Obj', acc_obj, pos_acc_obj, neg_acc_obj]]

results_df = pd.DataFrame.from_records(results, 
                                       columns=['Type', 'Avg Accuracy', 
                                                'Pos Accuracy', 'Neg Accuracy'])
results_df
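
To make the accuracy computation concrete, here is a small worked example on made-up data. It follows the same steps as the `accuracy` function above (deduplicate pairs, score positives against 1 and negatives against 0, then average the two). The toy sentences, ids, and scores are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values). The first positive pair appears twice because it
# is paired with two different negative images.
toy = pd.DataFrame({
    'sentence': ['a dog runs', 'a dog runs', 'a man sits'],
    'pos_image_id': [1, 1, 3],
    'neg_image_id': [2, 4, 5],
    'pos_scores': [1.0, 1.0, 0.0],
    'neg_scores': [0.0, 1.0, 0.0],
})

# Count each sentence-image pair once.
pos = toy[['sentence', 'pos_image_id', 'pos_scores']].drop_duplicates()
neg = toy[['sentence', 'neg_image_id', 'neg_scores']].drop_duplicates()

pos_acc = float(np.mean(pos['pos_scores'].values == 1))  # 1 of 2 unique pairs
neg_acc = float(np.mean(neg['neg_scores'].values == 0))  # 2 of 3 unique pairs
macro_acc = (pos_acc + neg_acc) / 2

print(pos_acc, neg_acc, macro_acc)
```

Without the deduplication step, the repeated positive pair would be counted twice and the positive accuracy would be 2/3 instead of 1/2.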

Plot positive, negative, and average results for subjects, verbs, and objects.

In [None]:
tmp_df = results_df
tmp_df = tmp_df.rename(columns={"Avg Accuracy": "Avg", 
                                "Pos Accuracy": "Pos", 
                                "Neg Accuracy": "Neg"})

melt_df = pd.melt(tmp_df, id_vars = "Type")
melt_df = melt_df.rename(columns={"variable": "Accuracy Type", 
                                  'value': 'Accuracy', 
                                  'Type': 'Word Type'})
_ = sns.barplot(data=melt_df, x='Accuracy Type', y='Accuracy', hue='Word Type')

fig = plt.gcf()
_ = fig.set_size_inches(10, 5, forward=True)

Plot average, positive, or negative accuracy for different word types.

In [None]:
input_or_select = "Avg Accuracy"  # @param ["Avg Accuracy", "Pos Accuracy", "Neg Accuracy"]
_ = sns.barplot(data=results_df, x='Type', y=input_or_select)

fig = plt.gcf()
_ = fig.set_size_inches(10, 5, forward=True)