# Video Generation Using a Fine-Tuned StyleGAN3 Model

This notebook contains code adapted from Class 7, titled *02_StyleGAN_inference.ipynb*, with help from Copilot.

## 00. Setup

First, let's clone StyleGAN3:

In [None]:
!git clone https://github.com/NVlabs/stylegan3.git

Then install the additional libraries:

In [None]:
!pip install ninja

I've already trained two models: one for the title sections and one for the rest of the film. You can [download them here](https://artslondon-my.sharepoint.com/:u:/g/personal/l_plowden0620231_arts_ac_uk/Eb4lYgu8sB9BoxJDk8udrLABdo-hux3x6U3h-vlhXvCMdw?e=U7lsmF) and extract them to `./models/`. For the training data, I took my original film, roughly face-tracked the characters on screen, and cropped the frames to 1024x1024px so they would work with the model.

To train (or fine-tune) your own, [here is the documentation I followed](https://github.com/NVlabs/stylegan3/blob/main/docs/configs.md). The command I used was:

In [None]:
!python ./stylegan3/train.py --outdir=conor-output --cfg=stylegan3-t --gpus=1 --batch=32 --gamma=32 --batch-gpu=4 --snap=5 --resume=stylegan3-t-ffhqu-1024x1024.pkl --freezed=10 --kimg=2000 --data=wtfdata/individual_frames/main/ --metrics=none

If you want to fine-tune, you will also need to download the appropriate pretrained model from [this link](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/research/models/stylegan3) (this is the file passed to the `--resume` flag above).

## Imports

We have to temporarily add the StyleGAN3 repository to `sys.path` in order to import some functions from `torch_utils`.

I had to insert it at the beginning of my path to make it work, otherwise I was getting import errors. (thanks to Copilot)

In [28]:
import sys
import os


base_dir = "./stylegan3"
if base_dir not in sys.path:
    sys.path.insert(0, base_dir)

In [None]:
import torch
import numpy as np

from torch_utils import misc
from torch_utils.ops import upfirdn2d

from torchvision.transforms import ToTensor
from PIL import Image

from torchvision.transforms.functional import to_pil_image
from torch.nn.functional import interpolate
from torchvision.utils import make_grid
from IPython.display import display, HTML

import legacy
import dnnlib

In [None]:
device = "cpu"

if torch.cuda.is_available():
    device = "cuda"

elif torch.backends.mps.is_available():
    device = "mps"

print(f'torch version {torch.__version__}')
print(f'Using device: {device}')

## 01. Loading the models

In [None]:
# Forward slashes work on Windows, macOS and Linux alike
network_pkl_main = "./models/main_model.pkl"
network_pkl_titles = "./models/titles_model.pkl"

with dnnlib.util.open_url(network_pkl_main) as f:
    model = legacy.load_network_pkl(f)
    g_model_1 = model['G'].eval().requires_grad_(False).to(device)

with dnnlib.util.open_url(network_pkl_titles) as f:
    model = legacy.load_network_pkl(f)
    g_model_2 = model['G'].eval().requires_grad_(False).to(device)

## 02. Projecting into latent space and generating frames

There are a couple more imports needed for the projection and to interpolate between projected frames. 

Unfortunately there's not enough time to project into the latent space for every frame of the video, so we have to fill in the blanks with interpolation.
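
To make the interpolation step concrete, here is a minimal sketch of spherical linear interpolation (slerp) in plain Python. It is not the `slerp` from `utils.py`, which isn't reproduced here and is used on NumPy arrays in this notebook; the name `slerp_sketch`, the list-based vectors, and the `eps` threshold are all assumptions for illustration. It also builds the 25 interpolation weights used later, which run from 1/25 to 1 so that each segment ends exactly on the next projected frame.

```python
import math

def slerp_sketch(t, a, b, eps=1e-8):
    """Spherically interpolate between vectors a and b (lists of floats) at weight t in [0, 1]."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a < eps or norm_b < eps:
        # A zero-length vector has no direction: fall back to linear interpolation
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    # Angle between the two vectors, from their normalised dot product
    dot = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, dot)))
    sin_omega = math.sin(omega)
    if abs(sin_omega) < eps:
        # Vectors are (nearly) parallel: fall back to linear interpolation
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    w_a = math.sin((1.0 - t) * omega) / sin_omega
    w_b = math.sin(t * omega) / sin_omega
    return [w_a * x + w_b * y for x, y in zip(a, b)]

# 25 weights per one-second gap: 0.04, 0.08, ..., 1.0
num_interp = 25
interp_vals = [(k + 1) / num_interp for k in range(num_interp)]

# Halfway between two orthogonal unit vectors stays on the unit circle
halfway = slerp_sketch(0.5, [1.0, 0.0], [0.0, 1.0])
```

Compared with plain linear interpolation, slerp follows the arc between the two vectors instead of cutting straight across, which is why it is a common choice for moving between latent vectors.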

In [None]:
# ./stylegan3 is already on sys.path, so dnnlib can be imported directly
from dnnlib.util import open_url
from utils import image_path_to_tensor
from utils import run_projector

from utils import slerp
from base64 import b64encode
import torchvision.transforms as transforms

# these are my added functions as I have a few more requirements
from utils import image_directory_to_tensors
from utils import get_ws_emas_for_scene

#### Fetch a feature extractor

In order to move closer to the target style vector, we'll be using a pre-trained feature extractor to tell us how close we are. We'll use [VGG16](https://www.geeksforgeeks.org/vgg-16-cnn-model/) for this.

In [None]:
url = 'https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/metrics/vgg16.pt'
with open_url(url) as f:
    vgg16 = torch.jit.load(f).eval().to(device)
print('Using device:', device, file=sys.stderr)

## Extract the images for projection

I put all my video files, separated by cuts, into a folder. Then I got Copilot to generate a script that splits each video into frames at 1 frame per second. This script is in the repository: [extract_frames.py](./extract_frames.py).

The script expects the following file structure, where each subdirectory is named after the key of the model we want to use for those videos:

![A picture of file structure containing a base directory titled "original videos" and two subdirectories titled respective to the AI model we want to use for those scenes.](images/original_videos.png)

After running the script we should end up with a directory titled [./extracted_frames/](./extracted_frames/) which is structured similarly to the input files, except now each video has become a directory containing a frame for every second of the original video, plus the last frame of the video:

![A picture of file structure containing a base directory titled "extracted_frames", two subdirectories titled respective to the AI model we want to use for those scenes, and further subdirectories for each of the extracted frames](images/extracted_frames.png)
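
As a rough illustration of which frames end up in each directory, the sketch below computes the frame indices for one clip: one frame per second, plus the final frame if it isn't already included. This is not the code in `extract_frames.py`; the function name `keyframe_indices` and its parameters are hypothetical, and it assumes the first extracted frame is frame 0.

```python
from typing import List

def keyframe_indices(total_frames: int, fps: int = 25) -> List[int]:
    """Indices of the frames to extract: one per second, plus the last frame."""
    if total_frames <= 0:
        return []
    # One frame at the start of every second
    indices = list(range(0, total_frames, fps))
    # Always keep the final frame so the scene ends where the original does
    last = total_frames - 1
    if indices[-1] != last:
        indices.append(last)
    return indices

print(keyframe_indices(112))  # [0, 25, 50, 75, 100, 111]
```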

## Make tensors from the extracted frames

We also generate the tensors with a utility function I added on top of the ones provided in the Class 7 `utils.py`. It calls `image_path_to_tensor()` on the frames of every scene in the film (really cuts more than scenes), picks the ML model to use (as defined by the file's subdirectory), and returns a list of tuples. Each tuple represents one scene: the tensors generated from its frames and the model to project them into.
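
The directory walk behind this can be sketched as follows. This is only an illustration of the structure, not the real `image_directory_to_tensors()`: it returns file paths rather than tensors, the name `list_scenes` is hypothetical, and it assumes the frames are `.jpg` files and that the model key is simply the name of the top-level subdirectory (for example `MAIN`).

```python
from pathlib import Path

def list_scenes(base_path):
    """Return [(sorted frame paths, model key), ...], one tuple per scene directory."""
    scenes = []
    for model_dir in sorted(p for p in Path(base_path).iterdir() if p.is_dir()):
        for scene_dir in sorted(p for p in model_dir.iterdir() if p.is_dir()):
            # Assumption: frames are saved as .jpg and sort correctly by name
            frames = sorted(scene_dir.glob("*.jpg"))
            if frames:
                # The top-level directory name is the key that selects the model
                scenes.append((frames, model_dir.name))
    return scenes
```

Calling `list_scenes("extracted_frames/")` would give one `(frames, key)` pair per cut, in the same shape that the projection loop below unpacks.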

In [None]:
target_images_base_path = r"extracted_frames/"

# Build a list of (frame tensors, model identifier) tuples, one per scene subdirectory
all_subdir_tensors_and_models = image_directory_to_tensors(target_images_base_path, g_model_1, g_model_2, device)

### We can also make a folder for the generated frames now

In [None]:
# Create a directory to save the generated frames
os.makedirs('generated_frames', exist_ok=True)

## Projecting into the $w$ space

"We are now going to take our image and project it into the $w$ space of StyleGAN. This process will start with a random vector and make changes to the latent vector and noise input until it converges on the closest matching image in StyleGAN space to our input image. This is quite a long process; to shorten it, you can set the `steps` variable to a smaller number, which reduces the number of steps taken to find the closest match."

I have modified the code to work with our file structure, which has one directory per cut in the film. It now goes through the film scene by scene and writes the output files at the end of each scene, so that we can save progress along the way and preview the results.

This takes a long time and is really the main part of the project. It took me about 8 hours! Good luck!

In [None]:
steps = 500
num_interp = 25  # The original film was at 25fps, and we extracted a single frame for every second

# Initial loop goes through all of the "scenes"
for i, (scene, model_identifier) in enumerate(all_subdir_tensors_and_models):
    
    # First we check whether it's a scene from the main part of the film, or from the titles, and use the appropriate model.
    # You could also experiment with this a lot more
    if model_identifier == 'MAIN':
        g_model = g_model_1
    else:
        g_model = g_model_2


    # Then we get all of the w space moving averages. This is the long part where we're really projecting into the w space as I understand it.
    this_scene_ws_emas = []
    for j, frame_tensor in enumerate(scene):
        ws_ema = run_projector(projection_target=frame_tensor,
                               g_model=g_model, 
                               steps=steps,
                               perceptual_model=vgg16, 
                               device=device, 
                               save_path=None)
        this_scene_ws_emas.append(ws_ema)
        print(f'Scene {i+1}/{len(all_subdir_tensors_and_models)}, Frame {j+1}/{len(scene)} complete')

    # Now we convert the projected w tensors to NumPy arrays and slerp between each pair of consecutive one-second frames
    interp_vals = np.linspace(1./num_interp, 1, num=num_interp)
    this_scene_latent_interps = []

    for j in range(len(this_scene_ws_emas) - 1):
        latent_a_np = this_scene_ws_emas[j].cpu().detach().numpy().squeeze()
        latent_b_np = this_scene_ws_emas[j+1].cpu().detach().numpy().squeeze()
        latent_interp = np.array([slerp(v, latent_a_np, latent_b_np) for v in interp_vals], dtype=np.float32)
        this_scene_latent_interps.append(latent_interp)

    # Define a folder to store this scene's images (one folder per scene, so scenes don't overwrite each other)
    image_folder_name = f"generated_frames/{model_identifier}/scene_{i:03}"
    os.makedirs(image_folder_name, exist_ok=True)

    # Finally, save all of the tensors as images.
    # In my case, I had to retroactively add the last 25 frames, so I made this start_index variable to offset the names,
    # but it can be left at 0 now.
    start_index = 0
    for k, latent_interp in enumerate(this_scene_latent_interps):
        for j, step in enumerate(latent_interp):
            step = torch.tensor(step).unsqueeze(0).to(device)
            image_tensor = g_model.synthesis(step, noise_mode='const')
            image = transforms.functional.to_pil_image(image_tensor.clamp(-1, 1).add(1).div(2).cpu().squeeze(0))
            # Calculate the image name index based on start_index and the current loop iteration
            image_name_index = start_index + k * len(latent_interp) + j
            # Save the image with the calculated name index
            image.save(f'./{image_folder_name}/{image_name_index:04}.jpg')            

## 03. Generate a video from the resulting frames

I used Copilot again to make a script that generates videos from the output frames. The script is called [vid_from_frames.py](./vid_from_frames.py).

Final notes about the video: I wanted to maintain the cuts in the video so we keep some sense of the original film. However, because of time constraints we only extract one frame per second and then interpolate between them. Since most (if not all) of our scenes are unlikely to end exactly on a multiple of 25 frames, the total runtime would otherwise come out shorter than the original. As a quick fix, I decided to let the [vid_from_frames.py](./vid_from_frames.py) script also grab the final frame from the scene and record that too. Then, in video editing software, we can just trim off any extra frames.

This does make the editing process a bit manual for my liking, but it's an okay workaround!
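
To see how many frames need trimming, here is a small worked calculation. It is a sketch under assumptions rather than part of either script: the function name `frames_to_trim` is hypothetical, and it assumes keyframes are taken at frame 0 and every 25 frames after that, plus the final frame, with 25 generated frames per pair of consecutive keyframes (as in the projection loop above). For a hypothetical 112-frame scene, six keyframes give 125 generated frames, so 13 frames need trimming; without the extra final keyframe there would only be 100 frames, 12 short of the original.

```python
def frames_to_trim(total_frames, fps=25, num_interp=25):
    """Return (generated frame count, frames to trim) for one scene."""
    if total_frames <= 0:
        return 0, 0
    # Keyframes: one per second starting at frame 0...
    num_keyframes = (total_frames - 1) // fps + 1
    # ...plus the final frame if it is not already one of them
    if (total_frames - 1) % fps != 0:
        num_keyframes += 1
    # Each pair of consecutive keyframes yields num_interp generated frames
    generated = (num_keyframes - 1) * num_interp
    return generated, max(generated - total_frames, 0)

print(frames_to_trim(112))  # (125, 13)
```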

Since the videos I produced are large, I will just put some Vimeo links here:
- [Side by side comparison with the original film](https://vimeo.com/962221187/9dc567cfb0?share=copy)
- [Generated video](https://vimeo.com/962159643/ce36485b86?share=copy)
