# DenseAV Demonstration Notebook

> ⚠️ Change your Colab runtime to a T4 GPU before running this notebook.

In this notebook, we walk through how to load, visualize, and work with our catalog of pretrained models.

## Set up Google Colab
> ⚠️ Skip this section if you are not on Google Colab.
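
One way to decide programmatically whether the setup cells below apply is to check for the `google.colab` module, which is a commonly used signal that a kernel is running on Colab. The sketch below does that and, as a rough proxy for an attached GPU runtime, looks for the `nvidia-smi` executable on the `PATH`. Both helper names are illustrative and not part of DenseAV.

```python
import shutil
import sys


def running_in_colab() -> bool:
    """Return True if the google.colab module is already loaded in this kernel."""
    return "google.colab" in sys.modules


def nvidia_smi_on_path() -> bool:
    """Rough proxy for a GPU runtime: is the nvidia-smi executable on PATH?"""
    return shutil.which("nvidia-smi") is not None


if running_in_colab():
    print("Colab detected: run the setup cells below.")
else:
    print("Not on Colab: skip to the import section.")

if not nvidia_smi_on_path():
    print("nvidia-smi not found: a GPU runtime may not be attached.")
```

Once PyTorch is imported, `torch.cuda.is_available()` is the more direct check for a usable GPU.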


In [ ]:
!git clone https://github.com/mhamilton723/DenseAV

In [ ]:
import os
os.chdir("DenseAV/")

In [1]:
%pip install -e .



## Import dependencies and load a pretrained DenseAV Model


In [1]:
from os.path import join

import torch
import torchvision
import torchvision.transforms as T
from PIL import Image
from torchaudio.functional import resample

from denseav.plotting import plot_attention_video, plot_2head_attention_video, plot_feature_video
from denseav.shared import norm, crop_to_divisor, blur_dim

In [2]:
model_name = "sound_and_language"
video_path = "samples/puppies.mp4"
result_dir = "results"

In [3]:
model = torch.hub.load('mhamilton723/DenseAV', model_name).cuda()

Using cache found in /home/marhamil/.cache/torch/hub/mhamilton723_DenseAV_main

trainable params: 147,456 || all params: 21,817,728 || trainable%: 0.6758540577644016
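
The last line of the output reports how many of the model's parameters are trainable. As a sketch of how such a summary can be computed (an assumption about the arithmetic, not DenseAV's own logging code), the helper below works on any iterable of objects exposing `numel()` and `requires_grad`, as PyTorch parameters do. `FakeParam` and `param_summary` are hypothetical names; `FakeParam` is a stand-in so the example runs without loading a model.

```python
from dataclasses import dataclass


@dataclass
class FakeParam:
    """Hypothetical stand-in for a torch.nn.Parameter (illustration only)."""
    n: int
    requires_grad: bool

    def numel(self) -> int:
        return self.n


def param_summary(params):
    """Return (trainable, total, trainable_percent) for an iterable of parameters."""
    trainable = 0
    total = 0
    for p in params:
        count = p.numel()
        total += count
        if p.requires_grad:
            trainable += count
    return trainable, total, 100.0 * trainable / total


# Reproduce the numbers printed above: 147,456 trainable out of 21,817,728.
params = [FakeParam(147_456, True), FakeParam(21_817_728 - 147_456, False)]
trainable, total, pct = param_summary(params)
print(f"trainable params: {trainable:,} || all params: {total:,} || trainable%: {pct}")
```

With a loaded model, `param_summary(model.parameters())` would apply the same helper to the real parameters.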


## Load a sample video and prepare it for DenseAV

In [4]:
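# read_video returns frames as (T, H, W, C) uint8, audio as (channels, samples), and a metadata dict
original_frames, audio, info = torchvision.io.read_video(video_path, pts_unit='sec')
sample_rate = 16000

# Resample the audio to 16 kHz if needed, then keep only the first channel
if info["audio_fps"] != sample_rate:
    audio = resample(audio, info["audio_fps"], sample_rate)
audio = audio[0].unsqueeze(0)

# Model input: resize the shorter side to 448, crop so height and width are
# divisible by 8, scale to [0, 1], then normalize
img_transform = T.Compose([
    T.Resize(224 * 2, Image.BILINEAR),
    lambda x: crop_to_divisor(x, 8),
    lambda x: x.to(torch.float32) / 255,
    norm])

# Apply the transform frame by frame (HWC -> CHW) and stack into (T, C, H, W)
frames = torch.cat([img_transform(f.permute(2, 0, 1)).unsqueeze(0) for f in original_frames], dim=0)

# Higher-resolution, un-normalized frames used only for plotting
plotting_img_transform = T.Compose([
    T.Resize(224 * 4, Image.BILINEAR),
    lambda x: crop_to_divisor(x, 8),
    lambda x: x.to(torch.float32) / 255])

frames_to_plot = plotting_img_transform(original_frames.permute(0, 3, 1, 2))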
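
To make the shapes in this cell concrete, the sketch below estimates two quantities without loading any data: the frame size after `T.Resize` and `crop_to_divisor`, and the audio length after resampling. It assumes that `T.Resize` with an integer scales the shorter side to that value and keeps the aspect ratio, that `crop_to_divisor(x, 8)` trims height and width down to the nearest multiples of 8, and that resampling changes the sample count in proportion to the rate, rounded up. The helper names and the 720×1280, 44.1 kHz input are hypothetical.

```python
def resized_size(height: int, width: int, short_side: int) -> tuple[int, int]:
    """Size after scaling the shorter side to `short_side`, keeping aspect ratio."""
    if height <= width:
        return short_side, int(short_side * width / height)
    return int(short_side * height / width), short_side


def cropped_size(height: int, width: int, divisor: int) -> tuple[int, int]:
    """Largest size not exceeding (height, width) with both sides divisible by `divisor`."""
    return height - height % divisor, width - width % divisor


def resampled_length(num_samples: int, orig_rate: int, new_rate: int) -> int:
    """Approximate sample count after resampling (ceiling of the scaled length)."""
    return -(-num_samples * new_rate // orig_rate)


# Hypothetical 720x1280 frame, processed like img_transform (shorter side -> 448).
h, w = resized_size(720, 1280, 224 * 2)
print(cropped_size(h, w, 8))  # (448, 792)

# Hypothetical 6 s of 44.1 kHz audio resampled to the 16 kHz used above.
print(resampled_length(6 * 44_100, 44_100, 16_000))  # 96000
```

The real tensors remain the authority; printing `frames.shape` and `audio.shape` after running the cell confirms the actual sizes for a given video.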

## Use DenseAV to obtain dense AV-aligned features

In [5]:
with torch.no_grad():
    audio_feats = model.forward_audio({"audio": audio.cuda()})
    image_feats = model.forward_image({"frames": frames.unsqueeze(0).cuda()}, max_batch_size=2)

    sim_by_head = model.sim_agg.get_pairwise_sims(
        {**image_feats, **audio_feats},
        raw=False,
        agg_sim=False,
        agg_heads=False
    ).mean(dim=-2).cpu()

    sim_by_head = blur_dim(sim_by_head, window=3, dim=-1)
    print(sim_by_head.shape)

torch.Size([181, 2, 28, 28, 33])
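
The `blur_dim` call smooths the similarities along the last dimension with a window of 3. As an illustration of what such smoothing does, here is a plain moving average with truncated windows at the edges. This is a pure-Python sketch under that assumption and may differ from how `blur_dim` is actually implemented; the function name is hypothetical.

```python
def moving_average(values, window=3):
    """Smooth a sequence with a centered moving average.

    Windows are truncated at the edges, so the output has the same length
    as the input.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# A single spike is spread over its neighbours.
print(moving_average([0, 0, 3, 0, 0], window=3))  # [0.0, 1.0, 1.0, 1.0, 0.0]
```

Smoothing of this kind reduces frame-to-frame flicker in the rendered attention maps at the cost of slightly blurring sharp onsets.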


## Visualize Cross-Modal Attention

In [6]:
plot_attention_video(
    sim_by_head,
    frames_to_plot,
    audio,
    info["video_fps"],
    sample_rate,
    join(result_dir, "attention", model_name, f'{video_path.split("/")[-1]}'))

Moviepy - Building video results/attention/sound_and_language/puppies.mp4.
MoviePy - Writing audio in puppiesTEMP_MPY_wvf_snd.mp3
MoviePy - Done.
Moviepy - Writing video results/attention/sound_and_language/puppies.mp4
Moviepy - Done !
Moviepy - video ready results/attention/sound_and_language/puppies.mp4
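
The output path in the call above combines `result_dir`, the visualization type, `model_name`, and the video's file name. A minimal sketch of the same construction follows, using `os.path.basename` in place of `split("/")` and creating the target directory up front. Whether `plot_attention_video` creates missing directories itself is not shown here, so the `makedirs` call is only a precaution. `output_path` is a hypothetical helper.

```python
import os
import tempfile


def output_path(result_dir, kind, model_name, video_path, prefix=""):
    """Build <result_dir>/<kind>/<model_name>/<prefix><video file name>."""
    out_dir = os.path.join(result_dir, kind, model_name)
    os.makedirs(out_dir, exist_ok=True)  # no error if it already exists
    return os.path.join(out_dir, prefix + os.path.basename(video_path))


# Demonstrated in a temporary directory so nothing is written to the project.
with tempfile.TemporaryDirectory() as tmp:
    path = output_path(tmp, "attention", "sound_and_language", "samples/puppies.mp4")
    print(os.path.relpath(path, tmp))
```

The optional `prefix` argument covers names such as `2head_puppies.mp4` or `visual_puppies.mp4` used in the later cells.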


## Visualize Cross-Modal Attention by Head to Disentangle Sound and Language

In [7]:
plot_2head_attention_video(
    sim_by_head,
    frames_to_plot,
    audio,
    info["video_fps"],
    sample_rate,
    join(result_dir, "attention", model_name, f'2head_{video_path.split("/")[-1]}'))

Moviepy - Building video results/attention/sound_and_language/2head_puppies.mp4.
MoviePy - Writing audio in 2head_puppiesTEMP_MPY_wvf_snd.mp3
MoviePy - Done.
Moviepy - Writing video results/attention/sound_and_language/2head_puppies.mp4
Moviepy - Done !
Moviepy - video ready results/attention/sound_and_language/2head_puppies.mp4


## Plot Deep Features

In [8]:
plot_feature_video(
    image_feats["image_feats"].cpu(),
    audio_feats['audio_feats'].cpu(),
    frames_to_plot,
    audio,
    info["video_fps"],
    sample_rate,
    join(result_dir, "features", model_name, f'visual_{video_path.split("/")[-1]}'),
    join(result_dir, "features", model_name, f'audio_{video_path.split("/")[-1]}')
)

Moviepy - Building video results/features/sound_and_language/visual_puppies.mp4.
MoviePy - Writing audio in visual_puppiesTEMP_MPY_wvf_snd.mp3
MoviePy - Done.
Moviepy - Writing video results/features/sound_and_language/visual_puppies.mp4
Moviepy - Done !
Moviepy - video ready results/features/sound_and_language/visual_puppies.mp4
Moviepy - Building video results/features/sound_and_language/audio_puppies.mp4.
MoviePy - Writing audio in audio_puppiesTEMP_MPY_wvf_snd.mp3
MoviePy - Done.
Moviepy - Writing video results/features/sound_and_language/audio_puppies.mp4
Moviepy - Done !
Moviepy - video ready results/features/sound_and_language/audio_puppies.mp4
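
After the cells above have run, the rendered videos sit under `results/`. Below is a small sketch for listing them with `pathlib`; the function name is hypothetical, and the directory layout is taken from the log output above.

```python
from pathlib import Path


def list_videos(result_dir="results"):
    """Return all .mp4 files under result_dir as sorted POSIX-style relative paths."""
    root = Path(result_dir)
    if not root.is_dir():
        return []  # nothing has been rendered yet
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.mp4"))


for name in list_videos("results"):
    print(name)
```

If all four plotting cells succeeded, the listing should include the attention, 2-head attention, visual feature, and audio feature videos named in the logs.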


