#### Note that this notebook isn't meant to be run

In [None]:
import data_handling
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import torch
import torchvision
import numpy as np
import torch.nn as nn

### Dataset
Dataset Produced from: https://www.cs.columbia.edu/CAVE/databases/pubfig/

Downloaded from: https://www.kaggle.com/datasets/kaustubhchaudhari/pubfig-dataset-256x256-jpg

The public figure dataset reportedly has 58,797 images of public figures, but the downloaded copy contains only 11,640. The images are in color and 256 px by 256 px.

In [None]:
DATASET_DIRECTORY = "./CelebDataProcessed"
ANNOTATIONS_DIRECTORY = "./annotations.csv"
NAME = ""
BATCH_SIZE = 64
TRANSFORM = torchvision.transforms.Compose([
    torchvision.transforms.ToPILImage(),
    torchvision.transforms.ToTensor(),
])

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

pubfig = data_handling.PublicFigureDataset(ANNOTATIONS_DIRECTORY, DATASET_DIRECTORY, NAME, transform=TRANSFORM)

# 80-20 train test split
train_size = int(0.8 * len(pubfig))
test_size = len(pubfig) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(pubfig, [train_size, test_size])

train_dl = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_dl = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=True)
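
The 80-20 split above draws a different partition on every run. The sketch below shows one way to make it reproducible by passing a seeded generator to `random_split`. The `placeholder` dataset and the seed value are assumptions for illustration only; it stands in for `pubfig` and simply has the same length as the downloaded dataset.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in with the same length as the downloaded PubFig copy (11,640 items).
# A real run would pass the PublicFigureDataset instance instead.
placeholder = TensorDataset(torch.arange(11_640))

train_size = int(0.8 * len(placeholder))
test_size = len(placeholder) - train_size

# A seeded generator makes the partition identical across runs (seed chosen arbitrarily).
generator = torch.Generator().manual_seed(0)
train_subset, test_subset = random_split(
    placeholder, [train_size, test_size], generator=generator
)

print(len(train_subset), len(test_subset))  # 9312 2328
```

With a fixed seed, test losses from different training runs are measured on the same held-out images.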

### Model 1
The basis for deepfakes is to use an autoencoder to extract the core features of person A into a latent space, and then decode that latent representation with a decoder trained on person B. This first model is what I used to test how well autoencoders trained on the public figure dataset can reconstruct faces.
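
Since the `encdec` module is not shown in this notebook, here is a minimal sketch of the encode-then-decode idea on its own. `TinyAutoEncoder`, its layer sizes, and the 64x64 input are all hypothetical choices made to keep the example small; they are not the architecture of `ed.Encoder` or `ed.Decoder`.

```python
import torch
import torch.nn as nn


class TinyAutoEncoder(nn.Module):
    """Hypothetical stand-in for the encdec-based model; sizes are illustrative."""

    def __init__(self, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=4, stride=2, padding=1),   # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=4, stride=2, padding=1),  # 32 -> 16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * 16 * 16, latent_dim),                   # compress to the latent vector
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 16 * 16 * 16),
            nn.ReLU(),
            nn.Unflatten(1, (16, 16, 16)),
            nn.ConvTranspose2d(16, 8, kernel_size=4, stride=2, padding=1),  # 16 -> 32
            nn.ReLU(),
            nn.ConvTranspose2d(8, 3, kernel_size=4, stride=2, padding=1),   # 32 -> 64
            nn.Sigmoid(),                                                   # pixel values in [0, 1]
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))


model = TinyAutoEncoder()
images = torch.rand(4, 3, 64, 64)          # random stand-in for a batch of faces
latent = model.encoder(images)
reconstruction = model(images)
loss = nn.functional.mse_loss(reconstruction, images)  # reconstruction error to minimize
print(latent.shape, reconstruction.shape)
```

The training target is the input itself, which is why the labels from the dataloader are not needed.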

In [None]:
import encdec as ed

class AutoEncoder(nn.Module):
    def __init__(self, layer_count, latent_dim, input_dim, dropout_odds=0.5):
        super().__init__()
        self.encoder = ed.Encoder(layer_count, latent_dim, input_dim, dropout_odds=dropout_odds)
        self.decoder = ed.Decoder(layer_count, latent_dim, input_dim, self.encoder.getOut(), dropout_odds=dropout_odds)

    def forward(self, x):
        z = self.encode(x)
        y = self.decode(z)
        return y

    def encode(self, x):
        output = self.encoder(x)
        return output

    def decode(self, input):
        y = self.decoder(input)
        return y




### Model 2
This model is based on the model from the DeepFaceLab paper (https://paperswithcode.com/paper/deepfacelab-a-simple-flexible-and-extensible). Source code: https://github.com/iperov/DeepFaceLab

It also draws on the model from Faceswap-GAN (https://github.com/shaoanlu/faceswap-GAN/blob/master/networks/faceswap_gan_model.py).

Both models use a single shared encoder and two decoders, with each decoder trained on a different person. Because both faces are encoded into the same latent space, either decoder can render its own person's face from a latent code produced from the other person's image.
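
To make the swap step concrete, the sketch below routes one image through a shared encoder and then through either decoder. `TinySwapper`, its fully connected layers, and the `identity` argument are hypothetical and far smaller than the real model; the weights are untrained, so only the data flow is meaningful.

```python
import torch
import torch.nn as nn


class TinySwapper(nn.Module):
    """Hypothetical miniature of the shared-encoder, two-decoder layout."""

    def __init__(self, latent_dim=16):
        super().__init__()
        pixels = 3 * 32 * 32
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(pixels, latent_dim), nn.ReLU())

        def make_decoder():
            return nn.Sequential(
                nn.Linear(latent_dim, pixels),
                nn.Sigmoid(),
                nn.Unflatten(1, (3, 32, 32)),
            )

        # One decoder per person; both read the same latent space.
        self.decoders = nn.ModuleDict({"a": make_decoder(), "b": make_decoder()})

    def forward(self, x, identity="a"):
        return self.decoders[identity](self.encoder(x))


swapper = TinySwapper().eval()
face_a = torch.rand(1, 3, 32, 32)  # random stand-in for an image of person A

with torch.no_grad():
    reconstructed = swapper(face_a, identity="a")  # A encoded, A decoded: reconstruction
    swapped = swapper(face_a, identity="b")        # A encoded, B decoded: the face swap

print(reconstructed.shape, swapped.shape)
```

The only difference between reconstruction and swap is which decoder receives the latent code.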

In [None]:
class SingleEnc(nn.Module):
    def __init__(self, latent_dim, leak=True, dropout_odds=0.5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 9, 4),
            nn.BatchNorm2d(64),
            nn.LeakyReLU() if leak else nn.ReLU(),
            nn.Dropout2d(p=dropout_odds),

            nn.Conv2d(64, 128, 5, 2, 1),
            nn.BatchNorm2d(128),
            nn.LeakyReLU() if leak else nn.ReLU(),
            nn.Dropout2d(p=dropout_odds),

            nn.Conv2d(128, 256, 5, 2, 1),
            nn.BatchNorm2d(256),
            nn.LeakyReLU() if leak else nn.ReLU(),
            nn.Dropout2d(p=dropout_odds),

            nn.Conv2d(256, 512, 3, 2, 1),
            nn.BatchNorm2d(512),
            nn.LeakyReLU() if leak else nn.ReLU(),
            nn.Dropout2d(p=dropout_odds),

            nn.Conv2d(512, 1024, 3, 2, 1),
            nn.BatchNorm2d(1024),
            nn.LeakyReLU() if leak else nn.ReLU(),
            nn.Dropout2d(p=dropout_odds),

            nn.Conv2d(1024, 512, 1, 1),
            nn.BatchNorm2d(512),
            nn.LeakyReLU() if leak else nn.ReLU(),
            nn.Dropout2d(p=dropout_odds),

            nn.Flatten(),
        )

        self.inter = nn.Sequential(
            nn.Linear(8192, 2048),
            nn.LeakyReLU() if leak else nn.ReLU(),
            nn.Linear(2048, latent_dim),
            nn.LeakyReLU() if leak else nn.ReLU(),
            nn.Linear(latent_dim, 2048),
            nn.LeakyReLU() if leak else nn.ReLU(),
            nn.Linear(2048, 8192),
            nn.LeakyReLU() if leak else nn.ReLU(),
        )

        self.decoderA = nn.Sequential(
            nn.Unflatten(dim=1, unflattened_size=(512, 4, 4)),
            nn.ConvTranspose2d(512, 1024, 1, 1),
            nn.BatchNorm2d(1024),
            nn.LeakyReLU() if leak else nn.ReLU(),
            nn.Dropout2d(p=dropout_odds),

            nn.ConvTranspose2d(1024, 512, 3, 2, 1),
            nn.BatchNorm2d(512),
            nn.LeakyReLU() if leak else nn.ReLU(),
            nn.Dropout2d(p=dropout_odds),

            nn.ConvTranspose2d(512, 256, 3, 2, 1),
            nn.BatchNorm2d(256),
            nn.LeakyReLU() if leak else nn.ReLU(),
            nn.Dropout2d(p=dropout_odds),

            nn.ConvTranspose2d(256, 128, 5, 2, 1),
            nn.BatchNorm2d(128),
            nn.LeakyReLU() if leak else nn.ReLU(),
            nn.Dropout2d(p=dropout_odds),

            nn.ConvTranspose2d(128, 64, 5, 2, 1),
            nn.BatchNorm2d(64),
            nn.LeakyReLU() if leak else nn.ReLU(),
            nn.Dropout2d(p=dropout_odds),

            nn.ConvTranspose2d(64, 3, 9, 4),
            nn.BatchNorm2d(3),
            nn.Sigmoid(),
            nn.Dropout2d(p=dropout_odds),
        )
        self.decoderB = nn.Sequential(
            nn.Unflatten(dim=1, unflattened_size=(512, 4, 4)),
            nn.ConvTranspose2d(512, 1024, 1, 1),
            nn.BatchNorm2d(1024),
            nn.LeakyReLU() if leak else nn.ReLU(),
            nn.Dropout2d(p=dropout_odds),

            nn.ConvTranspose2d(1024, 512, 3, 2, 1),
            nn.BatchNorm2d(512),
            nn.LeakyReLU() if leak else nn.ReLU(),
            nn.Dropout2d(p=dropout_odds),

            nn.ConvTranspose2d(512, 256, 3, 2, 1),
            nn.BatchNorm2d(256),
            nn.LeakyReLU() if leak else nn.ReLU(),
            nn.Dropout2d(p=dropout_odds),

            nn.ConvTranspose2d(256, 128, 5, 2, 1),
            nn.BatchNorm2d(128),
            nn.LeakyReLU() if leak else nn.ReLU(),
            nn.Dropout2d(p=dropout_odds),

            nn.ConvTranspose2d(128, 64, 5, 2, 1),
            nn.BatchNorm2d(64),
            nn.LeakyReLU() if leak else nn.ReLU(),
            nn.Dropout2d(p=dropout_odds),

            nn.ConvTranspose2d(64, 3, 9, 4),
            nn.BatchNorm2d(3),
            nn.Sigmoid(),
            nn.Dropout2d(p=dropout_odds),
        )
        #self.discriminatorA = nn.Sequential(

        #)
        #self.discriminatorB = nn.Sequential(
            
        #)

    def forward(self, x, type='a'):
        x = self.encoder(x)
        x = self.inter(x)
        if type == 'a':
            x = self.decoderA(x)
        else:
            x = self.decoderB(x)
        x = torchvision.transforms.Resize((256, 256))(x)
        return x
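
The `nn.Linear(8192, 2048)` input size and the final `Resize` both follow from the layer arithmetic. The sketch below traces the spatial size through the encoder and decoder using the standard output-size formulas for `nn.Conv2d` and `nn.ConvTranspose2d`, assuming the PyTorch defaults `dilation=1` and `output_padding=0` and a 256x256 input. The helper names are made up for this example.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output side length of nn.Conv2d (dilation=1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride, padding=0):
    """Output side length of nn.ConvTranspose2d (dilation=1, output_padding=0)."""
    return (size - 1) * stride - 2 * padding + kernel


# (kernel, stride, padding) for each layer, copied from SingleEnc
ENCODER_LAYERS = [(9, 4, 0), (5, 2, 1), (5, 2, 1), (3, 2, 1), (3, 2, 1), (1, 1, 0)]
DECODER_LAYERS = [(1, 1, 0), (3, 2, 1), (3, 2, 1), (5, 2, 1), (5, 2, 1), (9, 4, 0)]


def trace(size, layers, step):
    """Apply `step` layer by layer and record every intermediate side length."""
    sizes = []
    for kernel, stride, padding in layers:
        size = step(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


encoder_sizes = trace(256, ENCODER_LAYERS, conv_out)
decoder_sizes = trace(encoder_sizes[-1], DECODER_LAYERS, conv_transpose_out)
flattened = 512 * encoder_sizes[-1] ** 2  # channels * height * width after nn.Flatten

print(encoder_sizes)  # [62, 30, 14, 7, 4, 4]
print(decoder_sizes)  # [4, 7, 13, 27, 55, 225]
print(flattened)      # 8192
```

The decoder ends at 225x225 rather than 256x256, which is why `forward` resizes the output before returning it.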

Unlike model 1, which has a fairly standard training procedure, model 2 requires two backward passes per batch, one for each decoder.
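
A minimal sketch of one such step is shown here, under the assumption that each decoder receives a batch of its own person, as described for Model 2. This differs from the notebook's `train_epoch`, which feeds the same batch to both decoders. `TinyTwoDecoder` and `train_step` are hypothetical names, and the model is deliberately tiny.

```python
import torch
import torch.nn as nn


class TinyTwoDecoder(nn.Module):
    """Hypothetical miniature with the same forward(x, type=...) interface as SingleEnc."""

    def __init__(self):
        super().__init__()
        pixels = 3 * 16 * 16
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(pixels, 8), nn.ReLU())
        self.decoderA = nn.Sequential(nn.Linear(8, pixels), nn.Sigmoid(), nn.Unflatten(1, (3, 16, 16)))
        self.decoderB = nn.Sequential(nn.Linear(8, pixels), nn.Sigmoid(), nn.Unflatten(1, (3, 16, 16)))

    def forward(self, x, type="a"):
        z = self.encoder(x)
        return self.decoderA(z) if type == "a" else self.decoderB(z)


def train_step(model, optimizer, loss_fn, batch_a, batch_b):
    """Run one backward pass per decoder; the shared encoder is updated in both."""
    losses = {}
    for key, batch in (("a", batch_a), ("b", batch_b)):
        optimizer.zero_grad()
        loss = loss_fn(model(batch, type=key), batch)  # reconstruct each person with their own decoder
        loss.backward()
        optimizer.step()
        losses[key] = loss.item()
    return losses


torch.manual_seed(0)
model = TinyTwoDecoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
batch_a = torch.rand(4, 3, 16, 16)  # random stand-in for faces of person A
batch_b = torch.rand(4, 3, 16, 16)  # random stand-in for faces of person B
losses = train_step(model, optimizer, nn.MSELoss(), batch_a, batch_b)
print(losses)
```

Each pass only produces gradients for the encoder and the decoder that was used, so the two decoders specialize while the encoder sees both people.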

In [None]:
import torchvision.transforms as T
from PIL import Image
from torchvision.utils import save_image
from numpy.random import randint
import time
import gc

def train_epoch(model, device, trainloader, loss_fn, optimizer, testloader, dataset, epochs=5, default_dtype=torch.FloatTensor, video=False):

    start_time = time.time()
    iters = 0

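    if video:
        index = randint(len(dataset)) # Pick a random image from the dataset to track across training
        # NOTE: `h` is a helper module that is not imported anywhere in this notebook
        image, name = h.getImage(index, dataset) 
        image = image.unsqueeze(0)
        save_image(image, "./Outputs/Video/deepfaker/{}.png".format(name))
        image = image.type(default_dtype).to(device) # Match the dtype and device of the training batches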

    # Set train mode for both the encoder and the decoder
    model.train()
    for ep in range(epochs):
        train_loss_a = []
        train_loss_b = []

        # Iterate the dataloader (we do not need the label values, this is unsupervised learning)
        for i, (image_batch, _) in enumerate(trainloader): # with "_" we just ignore the labels (the second element of the dataloader tuple)
            if video: # TODO: This only works for the Autoencoder class atm
                model.eval()
                with torch.no_grad(): # Snapshots do not need gradients
                    output = model(image, "a")
                    save_image(output, "./Outputs/Video/deepfaker/a/{}_{}.png".format(ep, i))
                    output = model(image, "b")
                    save_image(output, "./Outputs/Video/deepfaker/b/{}_{}.png".format(ep, i))
                model.train()

            iters += 1

            # Move tensor to the proper device
            image_batch = image_batch.type(default_dtype).to(device)
            #labels = labels.type(default_dtype).to(device)

            # Encode data
            output = model(image_batch, type="a")

            # Evaluate loss
            loss_a = loss_fn(output, image_batch)

            # Backward pass
            optimizer.zero_grad()
            loss_a.backward()
            optimizer.step()

            output = model(image_batch, type="b")

            # Evaluate loss
            loss_b = loss_fn(output, image_batch)

            # Backward pass
            optimizer.zero_grad()
            loss_b.backward()
            optimizer.step()

            time_lapse = time.strftime('%H:%M:%S', time.gmtime(time.time() - start_time))
            if i % 20 == 0:
                print('Epoch:{:2d} | Iter:{:5d} | Time: {} | Train_A Loss: {:.4f} | Train_B Loss: {:.4f}'.format(ep+1, i, time_lapse, loss_a.item(), loss_b.item()))

            # Record the batch losses for the epoch average
            train_loss_a.append(loss_a.item())
            train_loss_b.append(loss_b.item())
        gc.collect()
        test_loss_a, test_loss_b = test_epoch(model, device, testloader, loss_fn)
        print('\n EPOCH {}/{} \t Avg. Train_A loss this Epoch {} \t Avg. Train_B loss this Epoch {} \t Test loss A {} \t Test loss B {}'.format(ep + 1, epochs, np.mean(train_loss_a),np.mean(train_loss_b), test_loss_a, test_loss_b))
    return



def test_epoch(model, device, dataloader, loss_fn, default_dtype=torch.FloatTensor):
    # Set evaluation mode for encoder and decoder
    model.eval()
    with torch.no_grad(): # No need to track the gradients
        # Define the lists to store the outputs for each batch
        conc_out_a = []
        conc_label_a = []
        for image_batch, _ in dataloader:
            # Move tensor to the proper device
            image_batch = image_batch.type(default_dtype).to(device)#image_batch.type(torch.HalfTensor).to(device)
            # Reconstruct the batch through decoder A
            output = model(image_batch, type="a")

            # Append the network output and the original image to the lists
            conc_out_a.append(output.cpu())
            conc_label_a.append(image_batch.cpu())
        # Create a single tensor with all the values in the lists
        conc_out_a = torch.cat(conc_out_a)
        conc_label_a = torch.cat(conc_label_a) 
        # Evaluate global loss
        val_loss_a = loss_fn(conc_out_a, conc_label_a)

        conc_out_b = []
        conc_label_b = []
        for image_batch, _ in dataloader:
            # Move tensor to the proper device
            image_batch = image_batch.type(default_dtype).to(device)#image_batch.type(torch.HalfTensor).to(device)
            # Reconstruct the batch through decoder B
            output = model(image_batch, type="b")

            # Append the network output and the original image to the lists
            conc_out_b.append(output.cpu())
            conc_label_b.append(image_batch.cpu())
        # Create a single tensor with all the values in the lists
        conc_out_b = torch.cat(conc_out_b)
        conc_label_b = torch.cat(conc_label_b) 
        # Evaluate global loss
        val_loss_b = loss_fn(conc_out_b, conc_label_b)

    return val_loss_a.item(), val_loss_b.item()
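
`test_epoch` keeps every output and input of the test set in memory before computing a single loss. As an alternative sketch, the loss can be accumulated batch by batch. `mean_reconstruction_loss` is a hypothetical helper, and it assumes `loss_fn` averages over all elements (as `nn.MSELoss()` does by default) and that every image has the same shape, in which case the weighted average matches the global value.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def mean_reconstruction_loss(model, dataloader, loss_fn, device="cpu"):
    """Average a mean-reduced loss over a dataloader without storing the outputs."""
    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for image_batch, _ in dataloader:
            image_batch = image_batch.float().to(device)
            output = model(image_batch)
            # Weight each batch by its size so a smaller final batch is handled correctly.
            total += loss_fn(output, image_batch).item() * image_batch.size(0)
            count += image_batch.size(0)
    return total / count


torch.manual_seed(0)
images = torch.rand(10, 3, 8, 8)                       # random stand-in for test images
loader = DataLoader(TensorDataset(images, torch.zeros(10)), batch_size=4)
model = nn.Sequential(nn.Conv2d(3, 3, kernel_size=3, padding=1), nn.Sigmoid())

streamed = mean_reconstruction_loss(model, loader, nn.MSELoss())
with torch.no_grad():
    global_loss = nn.MSELoss()(model(images), images).item()
print(streamed, global_loss)
```

For a two-decoder model, the same loop would be run once per decoder by passing the decoder key to the model call.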

### Results
As of the creation of this notebook, the models are unable to successfully swap faces. While training the autoencoders, it has become increasingly clear that, whether due to time, hardware, or data constraints, the training process is extremely slow. During training, I ran a fixed image through the model and saved the output images, then compiled them into the GIFs below.
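
The notebook does not show how the saved frames were turned into GIFs. One possible way, sketched here with Pillow, is to load the PNGs in order and save them as an animated GIF. `frames_to_gif` is a hypothetical helper, and it assumes the file names sort in training order (for example, zero-padded step numbers), which the `{ep}_{i}.png` pattern used in `train_epoch` does not guarantee.

```python
import tempfile
from pathlib import Path

from PIL import Image


def frames_to_gif(frame_dir, gif_path, frame_ms=100):
    """Combine every PNG in frame_dir, sorted by file name, into one animated GIF."""
    frame_paths = sorted(Path(frame_dir).glob("*.png"))
    frames = [Image.open(path).convert("RGB") for path in frame_paths]
    frames[0].save(
        gif_path,
        save_all=True,             # write all frames, not just the first
        append_images=frames[1:],
        duration=frame_ms,         # display time per frame in milliseconds
        loop=0,                    # loop forever
    )
    return len(frames)


with tempfile.TemporaryDirectory() as tmp:
    # Three solid-color images stand in for saved model outputs.
    for step in range(3):
        Image.new("RGB", (16, 16), (step * 80, 0, 0)).save(Path(tmp) / f"{step:04d}.png")

    gif_path = Path(tmp) / "progress.gif"
    frame_count = frames_to_gif(tmp, gif_path)
    with Image.open(gif_path) as gif:
        gif_frames = gif.n_frames

print(frame_count, gif_frames)
```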

Below is an autoencoder trained on images of Donald Trump. As expected, given the small amount of data, it takes a significant amount of time before the model begins to pick out facial features.

![SegmentLocal](trump.gif "segment")


Below is a model trained on the entire pubfig dataset. It produces a much more defined face but is still missing details, likely because of insufficient training time.

![SegmentLocal](double.gif "segment")

Below is a model that was pretrained on the entire dataset and then trained only on a single individual. The GIF begins with a clear image of the generalized face that the pretrained model learned, which the model quickly discards as it begins the learning process over. Unlike the Donald Trump model, this one begins to form a face soon after a period of noise, which is promising.

![SegmentLocal](png_to_gif.gif "segment")

Although the outputs currently lack the detail needed for a proper face swap, the results show the model is capable of learning faces given sufficient time. The results also suggest that pretraining an autoencoder on reconstructing faces in general can speed up the model's subsequent learning of specific faces for one-to-one deepfake models.
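
The pretrain-then-specialize workflow can be sketched as copying the general model's weights into a fresh model and continuing training on one person. Everything below is illustrative: `make_autoencoder` is a hypothetical toy model, the data is random, and the lower fine-tuning learning rate is an assumption rather than a setting taken from this notebook.

```python
import copy

import torch
import torch.nn as nn


def make_autoencoder():
    """Hypothetical toy autoencoder standing in for the real models."""
    pixels = 3 * 16 * 16
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(pixels, 8), nn.ReLU(),
        nn.Linear(8, pixels), nn.Sigmoid(),
        nn.Unflatten(1, (3, 16, 16)),
    )


torch.manual_seed(0)
general_model = make_autoencoder()
# ... training on the full multi-person dataset would happen here ...
checkpoint = copy.deepcopy(general_model.state_dict())  # torch.save / torch.load would work across sessions

# Start the single-person model from the general weights instead of a random init.
specific_model = make_autoencoder()
specific_model.load_state_dict(checkpoint)
weights_match_before = all(
    torch.equal(p, q) for p, q in zip(general_model.parameters(), specific_model.parameters())
)

# One fine-tuning step on a stand-in batch for a single individual.
optimizer = torch.optim.Adam(specific_model.parameters(), lr=1e-4)
single_person_batch = torch.rand(4, 3, 16, 16)
loss = nn.functional.mse_loss(specific_model(single_person_batch), single_person_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()

weights_changed = any(
    not torch.equal(p, q) for p, q in zip(general_model.parameters(), specific_model.parameters())
)
print(weights_match_before, weights_changed)
```

Keeping the general checkpoint untouched means it can seed a separate model for each new person.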