# Visual Relationship Detection

In this tutorial, we focus on the task of classifying visual relationships between objects in an image. A given image may contain many such relationships, each defined formally as a `subject <predicate> object` triple (e.g. `person <riding> bike`). For example, in the relationship `man riding bicycle`, "man" and "bicycle" are the subject and object, respectively, and "riding" is the relationship predicate.

![Visual Relationships](https://cs.stanford.edu/people/ranjaykrishna/vrd/dataset.png)

In the examples of the relationships shown above, the red box represents the _subject_ while the green box represents the _object_. The _predicate_ (e.g. kick) denotes what relationship connects the subject and the object.

For the purpose of this tutorial, we operate over the [Visual Relationship Detection (VRD) dataset](https://cs.stanford.edu/people/ranjaykrishna/vrd/) and focus on action relationships. We define our classification task as **identifying which of three relationships holds between the objects represented by a pair of bounding boxes.**

In [1]:
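# Active experiment configuration (alternative settings are kept commented out below)
PARAM = 0.5          # passed to the dependency finder as `sig` or `thresh`
N_EPS = 50           # number of label model training epochs
LR = 0.1             # label model learning rate
DEPS_NAME = 'CDGAM'  # which dependency-finding method to use
FLIP = True          # if True, swap the valid and test splits when loading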

#PARAM = 1
#N_EPS = 50
#LR = 0.05
#DEPS_NAME = 'Varma'
#FLIP = True

#PARAM = -1
#N_EPS = 50
#LR = 0.05
#DEPS_NAME = 'Empty' 
#FLIP = True

#PARAM = 0.5
#N_EPS = 100
#LR = 0.025
#DEPS_NAME = 'NM_NP'# New Method "New" Policy (where old policy is delta heuristic and new policy is discarding matrix if zero cols/rows)
#FLIP = True

#PARAM = 0.5
#N_EPS = 100
#LR = 0.1
#DEPS_NAME = 'Varma_Gold' 
#FLIP = True

In [2]:
import os

if os.path.basename(os.getcwd()) == "snorkel-tutorials":
    os.chdir("visual_relation")

## 1. Load Dataset
We load the VRD dataset and keep only images that contain at least one action predicate, since these are more difficult to classify than geometric relationships like `above` or `next to`. We load the train, valid, and test sets as Pandas `DataFrame` objects with the following fields:
- `label`: The relationship between the objects. 0: `RIDE`, 1: `CARRY`, 2: `OTHER` action predicates
- `object_bbox`: coordinates of the bounding box for the object `[ymin, ymax, xmin, xmax]`
- `object_category`: category of the object
- `source_img`: filename for the corresponding image the relationship is in
- `subject_bbox`: coordinates of the bounding box for the subject `[ymin, ymax, xmin, xmax]`
- `subject_category`: category of the subject
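
To make the bounding-box layout concrete, here is a minimal sketch that unpacks the `[ymin, ymax, xmin, xmax]` format. The helper `bbox_size` is hypothetical (it is not part of the tutorial's `utils`), and the dict simply mirrors the fields listed above for one person/bench relationship from the training set.

```python
# Hypothetical helper (not part of the tutorial's utils): unpack a VRD-style
# bounding box stored as [ymin, ymax, xmin, xmax].
def bbox_size(bbox):
    ymin, ymax, xmin, xmax = bbox
    height = ymax - ymin
    width = xmax - xmin
    return height, width


# One relationship written out as a plain dict, mirroring the DataFrame fields.
row = {
    "subject_category": "person",
    "object_category": "bench",
    "subject_bbox": [159, 594, 504, 767],
    "object_bbox": [200, 479, 109, 846],
    "label": 2,  # OTHER
    "source_img": "8054281885_ebbbfa2672_b.jpg",
}

subject_hw = bbox_size(row["subject_bbox"])
object_hw = bbox_size(row["object_bbox"])
print("subject (height, width):", subject_hw)
print("object  (height, width):", object_hw)
```

Note that the vertical coordinates come first in this layout, which is easy to get wrong when indexing the boxes by position.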

If you are running this notebook for the first time, it will take ~15 mins to download all the required sample data.

The sampled version of the dataset **uses the same 26 data points across the train, dev, and test sets.
This setting is meant to demonstrate quickly how Snorkel works with this task, not to demonstrate performance.**

In [3]:
from utils import load_vrd_data
# changed IMAGES_URL in download_full_data.sh to "http://imagenet.stanford.edu/internal/jcjohns/scene_graphs/sg_dataset.zip"
# setting sample=False will take ~3 hours to run (downloads full VRD dataset)
sample = False
is_test = os.environ.get("TRAVIS") == "true" or os.environ.get("IS_TEST") == "true"

if FLIP:
    df_train, df_test, df_valid = load_vrd_data(sample, is_test)
else:
    df_train, df_valid, df_test = load_vrd_data(sample, is_test)
    
print("Train Relationships: ", len(df_train))
print("Dev Relationships: ", len(df_valid))
print("Test Relationships: ", len(df_test))

df_train.head()

Train Relationships:  635
Dev Relationships:  194
Test Relationships:  216


|   | subject_category | object_category | subject_bbox | object_bbox | label | source_img |
|---|---|---|---|---|---|---|
| 0 | umbrella | table | [94, 175, 306, 590] | [336, 489, 324, 458] | 2 | 2113966890_c65030a7e7_o.jpg |
| 1 | person | bench | [159, 594, 504, 767] | [200, 479, 109, 846] | 2 | 8054281885_ebbbfa2672_b.jpg |
| 2 | person | table | [152, 540, 342, 648] | [539, 767, 1, 1023] | 2 | 5813297357_f210a455f9_b.jpg |
| 3 | person | train | [275, 346, 440, 489] | [226, 641, 254, 712] | 2 | 3572969356_2b01616f71_b.jpg |
| 4 | train | person | [226, 641, 254, 712] | [320, 353, 345, 375] | 1 | 3572969356_2b01616f71_b.jpg |


Note that the training `DataFrame` will have a labels field with all -1s. This denotes the lack of labels for that particular dataset. In this tutorial, we will assign probabilistic labels to the training set by writing labeling functions over attributes of the subjects and objects.

## 2. Writing Labeling Functions
We now write labeling functions to detect what relationship exists between pairs of bounding boxes. To do so, we can encode various intuitions into the labeling functions:
* _Categorical_ intuition: knowledge about the categories of subjects and objects usually involved in these relationships (e.g., `person` is usually the subject for predicates like `ride` and `carry`)
* _Spatial_ intuition: knowledge about the relative positions of the subject and objects (e.g., subject is usually higher than the object for the predicate `ride`)
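
The spatial intuition can be sketched as a plain comparison of vertical box centers. This is only an illustration, not one of the tutorial's labeling functions: `y_center` and `subject_is_above` are hypothetical names, the two boxes are made up, and the sketch assumes image coordinates in which y grows downward.

```python
# Assumed bbox layout, as in the dataset description: [ymin, ymax, xmin, xmax].
YMIN, YMAX, XMIN, XMAX = 0, 1, 2, 3


def y_center(bbox):
    """Vertical midpoint of a bounding box."""
    return (bbox[YMIN] + bbox[YMAX]) / 2


def subject_is_above(subject_bbox, object_bbox):
    # Assumes y grows downward in image coordinates, so "higher in the
    # image" means a smaller y value.
    return y_center(subject_bbox) < y_center(object_bbox)


rider = [100, 300, 200, 320]  # made-up box for a person
bike = [220, 420, 180, 360]   # made-up box for a bike

print(subject_is_above(rider, bike))  # rider center y=200, bike center y=320
```

A check like this could be wrapped in a labeling function that votes `RIDE` when it holds and abstains otherwise.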

In [4]:
RIDE = 0
CARRY = 1
OTHER = 2
ABSTAIN = -1

We begin with labeling functions that encode categorical intuition: we use knowledge about subject-object category pairs that are common for `RIDE` and `CARRY`, as well as knowledge about which subjects or objects are unlikely to be involved in the two relationships.

In [5]:
from snorkel.labeling import labeling_function

# Category-based LFs
@labeling_function()
def lf_ride_object(x):
    if x.subject_category == "person":
        if x.object_category in [
            "bike",
            "snowboard",
            "motorcycle",
            "horse",
            "bus",
            "truck",
            "elephant",
        ]:
            return RIDE
    return ABSTAIN


@labeling_function()
def lf_carry_object(x):
    if x.subject_category == "person":
        if x.object_category in ["bag", "surfboard", "skis"]:
            return CARRY
    return ABSTAIN


@labeling_function()
def lf_carry_subject(x):
    if x.object_category == "person":
        if x.subject_category in ["chair", "bike", "snowboard", "motorcycle", "horse"]:
            return CARRY
    return ABSTAIN


@labeling_function()
def lf_not_person(x):
    if x.subject_category != "person":
        return OTHER
    return ABSTAIN

We now encode our spatial intuition, which includes measuring the distance between the bounding boxes and comparing their relative areas.

In [6]:
YMIN = 0
YMAX = 1
XMIN = 2
XMAX = 3

In [7]:
import numpy as np

# Distance-based LFs
@labeling_function()
def lf_ydist(x):
    if x.subject_bbox[XMAX] < x.object_bbox[XMAX]:
        return OTHER
    return ABSTAIN


@labeling_function()
def lf_dist(x):
    if np.linalg.norm(np.array(x.subject_bbox) - np.array(x.object_bbox)) <= 1000:
        return OTHER
    return ABSTAIN


def area(bbox):
    return (bbox[YMAX] - bbox[YMIN]) * (bbox[XMAX] - bbox[XMIN])


# Size-based LF
@labeling_function()
def lf_area(x):
    if area(x.subject_bbox) / area(x.object_bbox) <= 0.5:
        return OTHER
    return ABSTAIN
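
As a worked example, the quantities used by `lf_dist` and `lf_area` can be computed by hand for a single relationship. The sketch below redefines `area` and the index constants so that it runs on its own, and uses the umbrella/table boxes from the first row of the training preview. Note that the distance here is the Euclidean norm of the difference between the two four-element coordinate vectors, not a distance between box centers.

```python
import numpy as np

YMIN, YMAX, XMIN, XMAX = 0, 1, 2, 3


def area(bbox):
    return (bbox[YMAX] - bbox[YMIN]) * (bbox[XMAX] - bbox[XMIN])


subject_bbox = [94, 175, 306, 590]   # umbrella
object_bbox = [336, 489, 324, 458]   # table

# Quantity thresholded by lf_dist (<= 1000 votes OTHER).
dist = np.linalg.norm(np.array(subject_bbox) - np.array(object_bbox))

# Quantity thresholded by lf_area (<= 0.5 votes OTHER).
ratio = area(subject_bbox) / area(object_bbox)

print("distance:", float(dist))
print("area ratio:", ratio)
print("lf_dist votes OTHER:", bool(dist <= 1000))
print("lf_area votes OTHER:", ratio <= 0.5)
```

For this pair the distance is well under 1000, so `lf_dist` votes `OTHER`, while the area ratio is above 0.5, so `lf_area` abstains.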

Note that the labeling functions have varying empirical accuracies and coverages. Due to class imbalance in our chosen relationships, labeling functions that label the `OTHER` class have higher coverage than labeling functions for `RIDE` or `CARRY`. This reflects the distribution of classes in the dataset as well.

In [8]:
from snorkel.labeling import PandasLFApplier

lfs = [
    lf_ride_object,
    lf_carry_object,
    lf_carry_subject,
    lf_not_person,
    lf_ydist,
    lf_dist,
    lf_area,
]

applier = PandasLFApplier(lfs)
L_train = applier.apply(df_train)
L_valid = applier.apply(df_valid)

100%|██████████| 635/635 [00:00<00:00, 6063.34it/s]
100%|██████████| 194/194 [00:00<00:00, 5648.93it/s]


In [9]:
from snorkel.labeling import LFAnalysis

Y_valid = df_valid.label.values
LFAnalysis(L_valid, lfs).lf_summary(Y_valid)

|   | j | Polarity | Coverage | Overlaps | Conflicts | Correct | Incorrect | Emp. Acc. |
|---|---|---|---|---|---|---|---|---|
| lf_ride_object | 0 | [0] | 0.164948 | 0.164948 | 0.164948 | 24 | 8 | 0.75 |
| lf_carry_object | 1 | [1] | 0.113402 | 0.113402 | 0.113402 | 18 | 4 | 0.818182 |
| lf_carry_subject | 2 | [1] | 0.020619 | 0.020619 | 0.020619 | 4 | 0 | 1.0 |
| lf_not_person | 3 | [2] | 0.252577 | 0.252577 | 0.020619 | 40 | 9 | 0.816327 |
| lf_ydist | 4 | [2] | 0.618557 | 0.618557 | 0.175258 | 86 | 34 | 0.716667 |
| lf_dist | 5 | [2] | 0.994845 | 0.845361 | 0.298969 | 123 | 70 | 0.637306 |
| lf_area | 6 | [2] | 0.324742 | 0.324742 | 0.061856 | 51 | 12 | 0.809524 |
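
The coverage and empirical accuracy columns can be reproduced with a few lines of NumPy. The sketch below is a simplified stand-in for what `LFAnalysis` reports, not its actual implementation; the function names and the toy vote vector are made up for illustration. Coverage is the fraction of data points on which an LF does not abstain, and empirical accuracy is the fraction of its non-abstain votes that match the gold label.

```python
import numpy as np

ABSTAIN = -1


def coverage(votes):
    """Fraction of data points on which the LF did not abstain."""
    return float(np.mean(votes != ABSTAIN))


def empirical_accuracy(votes, gold):
    """Fraction of non-abstain votes that agree with the gold labels."""
    labeled = votes != ABSTAIN
    if not labeled.any():
        return float("nan")
    return float(np.mean(votes[labeled] == gold[labeled]))


# Toy column of a label matrix: one LF's votes on eight data points.
votes = np.array([0, -1, 0, -1, 0, 0, -1, -1])
gold = np.array([0, 2, 1, 0, 0, 0, 2, 1])

print(coverage(votes))                  # 4 of 8 points labeled
print(empirical_accuracy(votes, gold))  # 3 of 4 votes correct

# Cross-check against the lf_ride_object row: 24 correct + 8 incorrect
# votes out of 194 dev relationships.
print(round((24 + 8) / 194, 6), 24 / (24 + 8))
```

The last line recovers the reported coverage of 0.164948 and empirical accuracy of 0.75 for `lf_ride_object`.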


## 2b. Find dependencies

In [10]:
%load_ext autoreload
%autoreload 2
import sys
sys.path.append('../')
from Our_Monitors.CD_Monitor import CDM, Informed_LabelModel
from Our_Monitors.CDGA_Monitor import CDGAM
from Our_Monitors.New_Monitor import NM
from Our_Monitors.utils import ModVarma_InCov

L_dev = L_valid
Y_dev = Y_valid

from dependency_model.varma_deps_functions import get_varma_edges, get_varma_with_gold_edges

In [11]:
def overall_deps_fn(deps_name, param):
    if deps_name == 'CDM':
        deps = CDM(L_dev, Y_dev, k=3, sig=param, policy = 'new', verbose=False, return_more_info = False)
    elif deps_name == 'CDGAM':
        deps = CDGAM(L_dev, k=3, sig=param, policy = 'new', verbose = False, return_more_info = False)
    elif deps_name == 'NM':
        deps = NM(L_dev, Y_dev, k=3, sig=param, policy = 'old', verbose=False, return_more_info = False)
    elif deps_name == 'NM_NP':
        deps = NM(L_dev, Y_dev, k=3, sig=param, policy = 'new', verbose=False, return_more_info = False)
    elif deps_name == 'Mod_Varma':
        deps = ModVarma_InCov(L_dev, Y_dev, thresh=param)
    elif deps_name == 'Varma':
        deps = get_varma_edges(L_dev, thresh=param)
    elif deps_name == 'Varma_Gold':
        deps = get_varma_with_gold_edges(L_dev, Y_dev, thresh=param)
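    elif deps_name == 'Empty':
        deps = []
    else:
        # Fail loudly instead of hitting an unbound `deps` below
        raise ValueError(f"Unknown DEPS_NAME: {deps_name!r}")
    return deps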

deps = overall_deps_fn(DEPS_NAME, PARAM)
print(deps)

[(0, 1), (0, 3), (0, 4), (0, 5), (1, 3), (1, 4), (1, 6), (2, 3), (3, 6), (4, 6), (5, 6)]
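
The printed edges are pairs of integer indices. Assuming these indices refer to the column order of the label matrix (the `j` column in the LF summary), which the source does not state explicitly, they can be mapped back to LF names for easier reading. This is a small sketch with the names and edges copied from the outputs above.

```python
# LF names in the order they were passed to the applier (the `j` column).
lf_names = [
    "lf_ride_object",
    "lf_carry_object",
    "lf_carry_subject",
    "lf_not_person",
    "lf_ydist",
    "lf_dist",
    "lf_area",
]

# Edge list as printed by the dependency finder.
deps = [(0, 1), (0, 3), (0, 4), (0, 5), (1, 3), (1, 4), (1, 6),
        (2, 3), (3, 6), (4, 6), (5, 6)]

# Assumption: each index is a column of the label matrix.
named_deps = [(lf_names[i], lf_names[j]) for i, j in deps]

for a, b in named_deps:
    print(f"{a} <-> {b}")
```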


## 3. Train and evaluate Label Model

In [12]:
from snorkel.labeling.model import LabelModel
from snorkel.classification import DictDataLoader
from model import SceneGraphDataset, create_model
from snorkel.utils import probs_to_preds # added
import torchvision.models as models
from snorkel.classification import Trainer
from sklearn.metrics import precision_recall_fscore_support, accuracy_score, confusion_matrix, roc_auc_score

In [13]:
%%capture output2

label_model = Informed_LabelModel(edges = deps, cardinality=3, verbose=True)
label_model.fit(L_train, seed=12345, lr=LR, log_freq=N_EPS // 10, n_epochs=N_EPS)

In [14]:
L_test = applier.apply(df_test)
Y_test = df_test.label.values

score = label_model.score(L_dev, Y_dev)['accuracy']
score_test = label_model.score(L_test, Y_test)['accuracy']

print("label model's score on valid set: ", score)
print("label model's score on test set: ", score_test)

100%|██████████| 216/216 [00:00<00:00, 2545.59it/s]

label model's score on valid set:  0.8170731707317073
label model's score on test set:  0.7988505747126436





### Set up dataloaders for end extraction model

In [15]:
# generate labels
df_train["labels"] = probs_to_preds(label_model.predict_proba(L_train))

# ./model.py (line 79) was changed to read the `labels` column instead of `label`,
# so the valid and test DataFrames need a `labels` column that duplicates `label`.
df_valid["labels"] = df_valid["label"]
df_test["labels"] = df_test["label"]

# Also set up dataloaders
if sample:
    TRAIN_DIR = "data/VRD/sg_dataset/samples"
else:
    TRAIN_DIR = "data/VRD/sg_dataset/sg_train_images"
# added test dl
TEST_DIR = "data/VRD/sg_dataset/sg_test_images"

if FLIP:
    DIR2 = TEST_DIR
    DIR3 = TRAIN_DIR
else:
    DIR2 = TRAIN_DIR
    DIR3 = TEST_DIR

dl_train = DictDataLoader(
    SceneGraphDataset("train_dataset", "train", TRAIN_DIR, df_train),
    batch_size=16,
    shuffle=True,
)
dl_valid = DictDataLoader(
    SceneGraphDataset("valid_dataset", "valid", DIR2, df_valid),
    batch_size=16,
    shuffle=False,
)
dl_test = DictDataLoader(
    SceneGraphDataset("test_dataset", "test", DIR3, df_test),
    batch_size=16,
    shuffle=False,
)
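
The label-generation step at the top of this cell turns the label model's class probabilities into hard labels. Conceptually this is an argmax over classes, as in the sketch below. The probability values are made up, and Snorkel's `probs_to_preds` additionally handles ties, which this sketch does not.

```python
import numpy as np

# Made-up class probabilities for three data points over (RIDE, CARRY, OTHER).
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# Pick the most probable class for each row.
preds = probs.argmax(axis=1)
print(preds)
```

Hardening the labels this way discards the label model's confidence, so a data point with probabilities `[0.4, 0.3, 0.3]` is treated the same as one with `[0.9, 0.05, 0.05]`.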

## 4. Train and evaluate end extraction model

In [16]:
# CLF VALIDATION!
n_clf_epochs = 4 # from validation analysis notebook

# define clf architecture
# initialize pretrained feature extractor
cnn = models.resnet18(pretrained=True)
model = create_model(cnn)

# train clf
trainer = Trainer(
    n_epochs=n_clf_epochs,  # increase for improved performance
    lr=1e-3,
    checkpointing=True,
    checkpointer_config={"checkpoint_dir": "checkpoint"},
)
trainer.fit(model, [dl_train, dl_valid])

Epoch 0:: 100%|██████████| 40/40 [01:28<00:00,  2.22s/it, model/all/train/loss=0.925, model/all/train/lr=0.001, visual_relation_task/valid_dataset/valid/f1_micro=0.665]
Epoch 1:: 100%|██████████| 40/40 [01:40<00:00,  2.51s/it, model/all/train/loss=0.521, model/all/train/lr=0.001, visual_relation_task/valid_dataset/valid/f1_micro=0.691]
Epoch 2:: 100%|██████████| 40/40 [01:43<00:00,  2.60s/it, model/all/train/loss=0.383, model/all/train/lr=0.001, visual_relation_task/valid_dataset/valid/f1_micro=0.711]
Epoch 3:: 100%|██████████| 40/40 [01:04<00:00,  1.62s/it, model/all/train/loss=0.368, model/all/train/lr=0.001, visual_relation_task/valid_dataset/valid/f1_micro=0.691]


In [17]:
# evaluate clf additions
results = model.predict(dl_test, return_preds = True)
gold = results['golds']['visual_relation_task']
preds = results['preds']['visual_relation_task']
print(accuracy_score(gold, preds))
#print(precision_recall_fscore_support(gold, preds, average='micro'))
print(confusion_matrix(gold, preds))

0.6898148148148148
[[  4  12  33]
 [  0  23  14]
 [  0   8 122]]
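
The confusion matrix can be unpacked into per-class metrics. In scikit-learn's convention, rows are true labels and columns are predicted labels; the class order `RIDE`, `CARRY`, `OTHER` is assumed here from the integer labels 0, 1, 2. The sketch below copies the matrix printed above and recomputes the accuracy along with per-class recall and precision.

```python
import numpy as np

# Confusion matrix from the test set (rows: true class, columns: predicted).
cm = np.array([
    [4, 12, 33],    # true RIDE
    [0, 23, 14],    # true CARRY
    [0, 8, 122],    # true OTHER
])

accuracy = np.trace(cm) / cm.sum()
recall = np.diag(cm) / cm.sum(axis=1)     # correct / all true members of class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predictions of class

print("accuracy:", accuracy)
for name, r, p in zip(["RIDE", "CARRY", "OTHER"], recall, precision):
    print(f"{name}: recall={r:.3f} precision={p:.3f}")
```

Only 4 of the 49 true `RIDE` relationships are recovered, while 122 of the 130 `OTHER` relationships are, so the overall accuracy is driven mostly by the majority class.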
