# Demo of GazeNet

GazeNet involves a two-step process: first, obtaining head and body predictions, and then predicting gaze directions. This separation allows for independent experimentation with gaze predictions.

This notebook shows how to run GazeNet inference when the head and body rotations have already been predicted.

In [None]:
import torch

from matplotlib import pyplot as plt
from matplotlib import animation
from IPython.display import Video
from collections import defaultdict

import os
import cv2
import pickle
import numpy as np

import sys
sys.path.append("GazeNet")
from models.gazenet import GazeNet

sys.path.append("sixDDirect_H_B")
from GAFA.valid_frames import get_frames

## Load GazeNet from Pre-Trained Weights

You can download the pre-trained weights [here](https://huggingface.co/noanonk/6DDirect_H_B/resolve/main/gazenet.ckpt). Place them in the `GazeNet/output/` folder.

In [2]:
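# Load the model on the GPU when one is available, otherwise on the CPU.
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model = GazeNet(n_frames=7)
model.load_state_dict(torch.load(
    "GazeNet/output/gazenet.ckpt", map_location=device)["state_dict"]
)
model.to(device)  # the inputs are moved to `device` later, so the model must live there too
model.eval()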

GazeNet(
  (gazemodule): GazeModule(
    (lstm): LSTM(14, 128, num_layers=2, bidirectional=True)
    (direction_layer): Sequential(
      (0): Linear(in_features=1792, out_features=64, bias=True)
      (1): ReLU()
      (2): Linear(in_features=64, out_features=21, bias=True)
    )
    (kappa_layer): Sequential(
      (0): Linear(in_features=1792, out_features=64, bias=True)
      (1): ReLU()
      (2): Linear(in_features=64, out_features=7, bias=True)
      (3): Softplus(beta=1, threshold=20)
    )
  )
)
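The layer sizes in the printout above can be checked with a little arithmetic. The sketch below is only a reading of the printed module, not the model's source code. It assumes that each frame's feature vector is the concatenation of `head_dirs` (6), `body_dirs` (6), `head_scores` (1) and `body_scores` (1), and that the LSTM outputs of all frames are flattened before the linear layers.

```python
# Sizes read off the printed GazeNet module.
N_FRAMES = 7        # n_frames passed to GazeNet
LSTM_INPUT = 14     # LSTM(14, ...)
HIDDEN = 128        # LSTM(..., 128, ...)
NUM_DIRECTIONS = 2  # bidirectional=True

# Assumption: 6 head values + 6 body values + 1 head score + 1 body score per frame.
per_frame_features = 6 + 6 + 1 + 1

# Assumption: the bidirectional outputs of all frames are flattened together.
flattened_lstm_output = N_FRAMES * NUM_DIRECTIONS * HIDDEN

# One 3D direction and one concentration value (kappa) per frame.
direction_outputs = N_FRAMES * 3
kappa_outputs = N_FRAMES * 1

print(per_frame_features, flattened_lstm_output, direction_outputs, kappa_outputs)
# prints: 14 1792 21 7
```

Under these assumptions the numbers match `in_features=1792`, `out_features=21` and `out_features=7` in the printout.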

## Load Video
Each video is stored frame by frame as a dictionary in a pickle file. We load the file for the chosen scene and video and convert the dictionary to a NumPy array.

If you haven't run the preprocessing yet, you can download the preprocessed `kitchen` folder [here](https://huggingface.co/noanonk/6DDirect_H_B/resolve/main/kitchen.zip). Unzip this file in the `GazeNet/data/preprocessed/` folder.

In [3]:
root_dir = "GazeNet/data/preprocessed/"
scene = "kitchen/1022_2"
video_dir = "Camera_3_5"

with open(os.path.join(root_dir, scene, video_dir, "images.pkl"), "rb") as f:
    video = pickle.load(f)
video = np.array([v for v in video.values()])

video.shape

(800, 960, 720, 3)

## Load Annotations
To visualize the gaze predictions, we load the head bounding boxes from our cleaned version of the GAFA annotations.

In [26]:
with open(os.path.join(root_dir, scene, video_dir, "clean_annotations.pkl"), "rb") as f:
    annot = pickle.load(f)
annot = annot["head_bb"]

annot.shape

(800, 4)

## Visualize Video
To visualize the video, we render the frames one by one and write them to `test.mp4`.

In [5]:
writer = animation.FFMpegWriter(fps=30, codec='libx264')

fig = plt.figure()
plt.axis('off')
im = plt.imshow(video[0,:,:,:])

plt.close() # this is required to not display the generated image

def init():
    im.set_data(video[0,:,:,:])

def animate(i):
    im.set_data(video[i,:,:,:])
    return im

with writer.saving(fig, "test.mp4", 100):
    for i in range(len(video)):
        animate(i)
        writer.grab_frame()

In [6]:
Video("test.mp4")

## Load Head and Body Rotation Predictions
The head and body rotations were precomputed to make testing easy.
We load all the predictions for the video we are looking at.

In [7]:
H_B_preds = os.path.join(root_dir, scene, video_dir, "6DDirect_H_B_preds.pkl")

Because the head and body model returns multiple predictions per frame, we keep only the most confident head and body prediction for each frame.

In [9]:
gaze_input = get_frames(H_B_preds, video_dir, n_frames=7)
for k in gaze_input[0]:
    print(k, ":", gaze_input[0][k])

head_dirs : [[0.7815450429916382, -0.10810178518295288, -0.773358017206192, -0.04168832302093506, 0.8880559206008911, -0.26126569509506226], [0.8055877685546875, -0.11023050546646118, -0.7511477023363113, -0.042988717555999756, 0.8862242698669434, -0.2765321731567383], [0.8169564008712769, -0.11016595363616943, -0.7324377000331879, -0.04296988248825073, 0.8828508853912354, -0.28466349840164185], [0.8230882883071899, -0.11145138740539551, -0.7197496294975281, -0.04553323984146118, 0.8797483444213867, -0.29135894775390625], [0.8220312595367432, -0.11462944746017456, -0.7145802080631256, -0.051094889640808105, 0.8765996694564819, -0.29847949743270874], [0.8312361240386963, -0.1076962947845459, -0.7064539790153503, -0.03572636842727661, 0.8703086376190186, -0.2807849049568176], [0.852353572845459, -0.10051202774047852, -0.6792432963848114, -0.02672553062438965, 0.8701001405715942, -0.27628225088119507]]
head_scores : [0.83739, 0.83575, 0.83231, 0.82739, 0.81911, 0.8186, 0.82673]
body_dirs 
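Keeping only the most confident prediction per frame can be sketched as follows. This is an illustration, not the implementation of `get_frames`: the `detections` layout, the `most_confident` helper and the toy values are all hypothetical.

```python
def most_confident(candidates):
    """Return the (direction, score) pair with the highest score."""
    return max(candidates, key=lambda pair: pair[1])

# Hypothetical layout: frame index -> list of (6-value direction, confidence score).
detections = {
    0: [([0.78, -0.11, -0.77, -0.04, 0.89, -0.26], 0.84),
        ([0.10, 0.20, 0.30, 0.40, 0.50, 0.60], 0.35)],
    1: [([0.81, -0.11, -0.75, -0.04, 0.89, -0.28], 0.83)],
}

# Keep one prediction per frame: the one the model is most certain about.
best = {frame: most_confident(cands) for frame, cands in detections.items()}
best_scores = [best[frame][1] for frame in sorted(best)]
```

Here `best_scores` holds one score per frame, in the same spirit as the `head_scores` list printed above.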

This gives 753 valid input sequences for GazeNet from a video of 800 frames.

In [10]:
len(gaze_input), len(video)

(753, 800)
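One way to see why there are fewer inputs than frames is to count sliding windows. The sketch below assumes that an input is valid only when all 7 consecutive frames have a usable prediction. This is an assumption about `get_frames`, whose logic is not shown here, and `valid_windows` is a hypothetical helper.

```python
def valid_windows(valid_frames, n_frames=7):
    """Return every run of n_frames consecutive indices fully contained in valid_frames."""
    valid = set(valid_frames)
    windows = []
    for start in sorted(valid):
        window = list(range(start, start + n_frames))
        if all(idx in valid for idx in window):
            windows.append(window)
    return windows

# With all 800 frames usable there are 800 - 7 + 1 windows.
all_frames = valid_windows(range(800))

# Dropping a single frame removes every window that contains it.
one_missing = valid_windows(i for i in range(800) if i != 100)

print(len(all_frames), len(one_missing))
# prints: 794 787
```

Under this assumption, 800 frames allow at most 794 inputs, so a count of 753 would mean that some frames had no usable head or body prediction.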

## Gaze Direction Estimation

This is all the information GazeNet needs to predict gaze direction.

In [12]:
# Each head_dirs / body_dirs entry is n_frames x 6; the scores get a trailing dimension of size 1.
GazeNet_input = {
    "head_dirs": torch.tensor([gaze_input[idx]["head_dirs"] for idx in range(len(gaze_input))]).to(device),
    "body_dirs": torch.tensor([gaze_input[idx]["body_dirs"] for idx in range(len(gaze_input))]).to(device),
    "head_scores": torch.tensor([gaze_input[idx]["head_scores"] for idx in range(len(gaze_input))]).unsqueeze(dim=-1).to(device),
    "body_scores": torch.tensor([gaze_input[idx]["body_scores"] for idx in range(len(gaze_input))]).unsqueeze(dim=-1).to(device),
    }

We run a single forward pass over all inputs.
Because each input is a 7-frame sequence $\{t-3, t-2, t-1, t, t+1, t+2, t+3\}$, we are interested in the gaze prediction for frame $t$.
We therefore keep only the prediction for the center frame of each sequence.

In [13]:
with torch.no_grad():
    # Index 7 // 2 = 3 selects the center frame of each 7-frame sequence.
    y_hat = model(GazeNet_input)["direction"][:, 7 // 2]
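The index `7 // 2` used above is the position of frame $t$ inside its window. A minimal sketch of that bookkeeping, where `window_around` is a hypothetical helper introduced only for illustration:

```python
n_frames = 7
center = n_frames // 2  # integer division gives 3 for a 7-frame window

# Offsets of each window position relative to the center frame t.
offsets = [position - center for position in range(n_frames)]

def window_around(t, n_frames=7):
    """Frame indices of the window centered on frame t."""
    half = n_frames // 2
    return list(range(t - half, t + half + 1))

print(offsets)
# prints: [-3, -2, -1, 0, 1, 2, 3]
print(window_around(10))
# prints: [7, 8, 9, 10, 11, 12, 13]
```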

In [14]:
y_hat = y_hat.cpu().numpy()  # move off the GPU before handing the tensor to NumPy
gaze_2d = y_hat[:, :2] / np.linalg.norm(y_hat, axis=-1, keepdims=True)

In [15]:
len(gaze_2d)

753
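The 2D gaze vector is the x and y part of the unit-length 3D direction. The sketch below repeats that computation on toy vectors. `project_to_image_plane` is a hypothetical helper, and it assumes, as the notebook does, that the first two components line up with the image axes.

```python
import numpy as np

def project_to_image_plane(directions):
    """Normalize 3D directions to unit length and keep their x and y components."""
    directions = np.asarray(directions, dtype=float)
    unit = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    return unit[:, :2]

# Toy directions, not model output.
example = np.array([[3.0, 0.0, 4.0],
                    [0.0, 2.0, 0.0]])
gaze_xy = project_to_image_plane(example)
# First row: norm 5, so (x, y) = (0.6, 0.0). Second row: norm 2, so (x, y) = (0.0, 1.0).
```

A direction that points mostly along the third axis keeps only a small x and y part, so its arrow in the image is short.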

## Visualize Gaze Predictions in 2D

For each valid input, we collect the head center and the 2D gaze vector of its center frame for plotting.

In [28]:
center_frames = {}
for in_gaze, gaze_pred in zip(gaze_input, gaze_2d):
    center_frame = in_gaze["indices"][7 // 2]
    
    head_center_x = int(annot[center_frame][0] + (annot[center_frame][2] / 2))
    head_center_y = int(annot[center_frame][1] + (annot[center_frame][3] / 2))

    center_frames[center_frame] = [(head_center_x, head_center_y), gaze_pred]
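The head-center computation above can be isolated in a small helper. The sketch assumes that each `head_bb` row is `(x, y, width, height)` with `(x, y)` the top-left corner, which is what the arithmetic in the cell implies but is not stated in the annotations. `bbox_center` is a hypothetical name.

```python
def bbox_center(bbox):
    """Center of an (x, y, width, height) box, truncated to integer pixels."""
    x, y, width, height = bbox
    return int(x + width / 2), int(y + height / 2)

# Toy box, not taken from the annotations.
center = bbox_center((100, 50, 40, 60))
print(center)
# prints: (120, 80)
```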

In [31]:
writer = animation.FFMpegWriter(fps=30, codec='libx264')

fig = plt.figure()
plt.axis('off')
im = plt.imshow(video[0,:,:,:])

plt.close() # this is required to not display the generated image

def init():
    im.set_data(video[0,:,:,:])

def arrow(i):
    head_center = center_frames[i][0]
    gaze_pred = center_frames[i][1]

    # Arrow tip: head center plus the 2D gaze vector scaled to 50 pixels.
    des = (int(head_center[0] + gaze_pred[0]*50), int(head_center[1] + gaze_pred[1]*50))

    video[i,:,:,:] = cv2.arrowedLine(video[i,:,:,:], head_center, des, (0, 255, 0), 3, tipLength=0.3)
    return video[i,:,:,:]

def animate(i):
    if i in center_frames:
        im.set_data(arrow(i))
        return im
    im.set_data(video[i,:,:,:])
    return im

with writer.saving(fig, "test_w_annotations.mp4", 100):
    for i in range(len(video)):
        animate(i)
        writer.grab_frame()

In [32]:
Video("test_w_annotations.mp4")
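The arrow drawn in each frame starts at the head center and ends at the head center plus the scaled 2D gaze vector. A minimal sketch of the endpoint computation, with the 50-pixel scale exposed as a parameter. `arrow_endpoint` is a hypothetical helper and the inputs are toy values.

```python
def arrow_endpoint(head_center, gaze_xy, length=50):
    """Pixel position of the arrow tip for a 2D gaze vector scaled by `length`."""
    return (int(head_center[0] + gaze_xy[0] * length),
            int(head_center[1] + gaze_xy[1] * length))

tip = arrow_endpoint((120, 80), (0.5, -0.5))
print(tip)
# prints: (145, 55)
```

Image y coordinates grow downward, so a negative y component moves the tip up in the frame.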