Tests the integration of LipNet with the classes in dataset.py and with torch.utils.data.DataLoader.

In [1]:
import os
import time 
import random 

import cv2
import torch
import librosa
import numpy as np
import torchsummary 
from torchvision import transforms
from torch.utils.data import Dataset

import cheapfake.lipnet.models as models
import cheapfake.contrib.dataset as dataset
import cheapfake.contrib.video_processor as video_processor

In [2]:
random_seed = 41
root_path = "/Users/shu/Documents/Datasets/DFDC_small_subset_raw"

In [5]:
# Test to see what shape is expected by LipNet
model = models.LipNet()
print(torchsummary.summary(model=model, input_size=(3, 75, 270, 480), batch_size=1))

[INFO] Starting foward pass
[INFO] Finished convolution layer 1
[INFO] Finished convolution layer 2
[INFO] Finished convolution layer 3
[INFO] Finished recurrent unit layers
[INFO] Starting fully connected layer
[INFO] Finished forward pass
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv3d-1      [1, 32, 75, 135, 240]           7,232
              ReLU-2      [1, 32, 75, 135, 240]               0
         Dropout3d-3      [1, 32, 75, 135, 240]               0
         MaxPool3d-4       [1, 32, 75, 67, 120]               0
            Conv3d-5       [1, 64, 75, 67, 120]         153,664
              ReLU-6       [1, 64, 75, 67, 120]               0
         Dropout3d-7       [1, 64, 75, 67, 120]               0
         MaxPool3d-8        [1, 64, 75, 33, 60]               0
            Conv3d-9        [1, 96, 75, 33, 60]         165,984
             ReLU-10        [1, 96, 75, 33, 60]       

LipNet appears to take an input tensor of shape (Color, Frames, Height, Width), or possibly (Color, Frames, Width, Height); I suspect the order of the two spatial dimensions does not matter. dataset.py currently loads the frames as (Frames, Color, Height, Width) or (Frames, Height, Width, Color), so the axes need to be reordered before the frames are fed into LipNet.
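A minimal sketch of that reordering step, using random arrays in place of real frames. It assumes the clip arrives as a NumPy array in one of the two layouts mentioned above; the helper name `to_lipnet_layout` is hypothetical and not part of dataset.py, and the spatial size is shrunk to keep the example light.

```python
import numpy as np
import torch


def to_lipnet_layout(frames, channels_last=False):
    """Reorder one clip to (1, Color, Frames, Height, Width) as a float tensor.

    Hypothetical helper for illustration; not part of dataset.py.
    """
    if channels_last:
        # (Frames, Height, Width, Color) -> (Color, Frames, Height, Width)
        frames = np.transpose(frames, (3, 0, 1, 2))
    else:
        # (Frames, Color, Height, Width) -> (Color, Frames, Height, Width)
        frames = np.transpose(frames, (1, 0, 2, 3))
    # np.transpose returns a view, so copy into contiguous memory first.
    tensor = torch.from_numpy(np.ascontiguousarray(frames)).float()
    return tensor.unsqueeze(0)  # add the batch dimension


# Stand-in clips: 75 frames, 3 colour channels, small spatial size.
clip_cf = np.random.rand(75, 3, 27, 48).astype(np.float32)
clip_cl = np.random.rand(75, 27, 48, 3).astype(np.float32)

shape_cf = tuple(to_lipnet_layout(clip_cf).shape)
shape_cl = tuple(to_lipnet_layout(clip_cl, channels_last=True).shape)
print(shape_cf, shape_cl)
```

Both calls should end up with the same layout, so the rest of the pipeline does not need to know which one dataset.py produced.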

### Testing with dataset.py

In [7]:
start_time = time.time()
dfdataset = dataset.DeepFakeDataset(
    root_path=root_path, 
    return_tensor=False, 
    random_seed=random_seed, 
    sequential_frames=False, 
    sequential_audio=True, 
    stochastic=True,
)

# Need to change the ordering of the frames to match (Color, Frames, Height, Width).
frames, audio, _ = dfdataset[0]
# Keep the first 75 frames and swap the first two axes.
frames = np.einsum("ijkl->jikl", frames[:75])
frames = torch.from_numpy(frames)
frames = frames.unsqueeze(0)  # add the batch dimension
print(frames.shape)
prediction = model(frames.float())
end_time = time.time()

print("Entire operation took {} seconds".format(end_time - start_time))
print(prediction.shape)

torch.Size([1, 3, 75, 270, 480])
[INFO] Starting foward pass
[INFO] Finished convolution layer 1
[INFO] Finished convolution layer 2
[INFO] Finished convolution layer 3
[INFO] Finished recurrent unit layers
[INFO] Starting fully connected layer
[INFO] Finished forward pass
Entire operation took 28.975037336349487 seconds
torch.Size([75, 1, 512])


The above works, but only after some minor modifications to LipNet's architecture. As a result, we either have to resize the frames to (64, 128) and use the pretrained weights, or rewrite LipNet and train new network weights.
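The first option, resizing the frames to (64, 128), could be sketched roughly as follows. This is only an illustration under assumptions: the clip is taken to be a float tensor in (Frames, Color, Height, Width) order, bilinear interpolation is an arbitrary choice, and `resize_clip` is a hypothetical helper rather than anything in dataset.py or LipNet.

```python
import torch
import torch.nn.functional as F


def resize_clip(frames, size=(64, 128)):
    """Resize a (Frames, Color, Height, Width) float tensor to `size`.

    Hypothetical helper; each frame is treated as one image in a batch.
    """
    return F.interpolate(frames, size=size, mode="bilinear", align_corners=False)


clip = torch.rand(75, 3, 270, 480)  # stand-in for one decoded clip
resized = resize_clip(clip)  # (Frames, Color, 64, 128)
# Reorder to (Color, Frames, Height, Width) and add the batch dimension.
batch = resized.permute(1, 0, 2, 3).unsqueeze(0)
print(batch.shape)
```

Note that going from 270x480 to 64x128 changes the aspect ratio, so cropping before resizing may be preferable.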