# **Few-Shot Adversarial Learning of Realistic Neural Talking Head Models**
## Explanation, Implementation and Discussion

When [MyHeritage.com](https://www.myheritage.com) announced [Deep Nostalgia](https://www.myheritage.com/deep-nostalgia), a tool that animates faces in still photos, the response was overwhelming. The website reported that over 1 million photos were animated in the first 48 hours alone, and the count was approaching 3 million by the third day ([source](https://blog.myheritage.com/2021/02/deep-nostalgia-goes-viral/)). The world was going gaga over this and honestly, rightly so. It is only natural for people to grab any chance of seeing their lost loved ones come alive again, and when I found out about this tool while scrolling through my [Reddit](https://www.reddit.com) feed, I did not hesitate to jump on the bandwagon either. Here are two examples of photos animated using Deep Nostalgia:

![example of Deep Nostalgia](https://media.giphy.com/media/vNEulvl9tMP2GDKpoz/giphy.gif) ![Another example](https://media.giphy.com/media/JRnVuIZKw5tPFPgdaH/giphy.gif)

I tried this tool on a photo of my grandfather, who passed away in 2003, and there he was: moving, looking around, smiling almost naturally. I showed my mother the animation on her birthday a few days later and saw first-hand what people meant in their reviews of the tool. Being a student of Artificial Intelligence, I couldn't help but wonder how it worked, so I started looking for the tech behind it. That led me to the company that developed it: [D-ID](https://www.d-id.com/). D-ID started out anonymizing facial features in photos using AI and went on to develop [Talking Heads and the Live Portrait](https://www.d-id.com/reenactment/), which are part of its reenactment suite.

Now, obviously the tech is proprietary, so I could not find many details, but I came across a [similar paper](https://arxiv.org/abs/1905.08233) from researchers at the Samsung AI Center, Moscow. This blog post is about that paper: _Few-Shot Adversarial Learning of Realistic Neural Talking Head Models_.

Feeling overwhelmed by the title of the paper? I was too. Let's change that.

# Overview

The researchers at the Samsung AI lab developed a [GAN (Generative Adversarial Network)](https://arxiv.org/abs/1406.2661) that makes realistic talking heads from photos of people with as few training shots as possible. The model takes a source image and a target (driver) video or image, extracts the facial landmarks from the target and applies them to the source image, animating it into a talking head. When applied, it looks something like this:

Driver video / Training video: <p style="text-align: center;"> ![driver image](https://media.giphy.com/media/5AXoAsfkw7ESOYzPH6/giphy.gif) </p> 
    
Extracted landmarks:<p style="text-align: center;">![landmarks gif](https://media.giphy.com/media/4V6KKfZT0OlOCWEN7z/giphy.gif)</p>
The extracted landmarks were then applied to this set of images:<p style="text-align: center;">![training shots](https://i.ibb.co/hM3Hq8y/IMG-5213.jpg)</p> and the resulting animated talking head came out as: <p style="text-align: center;">![result](https://media.giphy.com/media/6siZvkfigPHnX9zQdA/giphy.gif)</p>

Compare the driver video, the landmarks and the result:<p style="text-align: center;">
![driver image](https://media.giphy.com/media/5AXoAsfkw7ESOYzPH6/giphy.gif) ![landmarks gif](https://media.giphy.com/media/4V6KKfZT0OlOCWEN7z/giphy.gif) ![result](https://media.giphy.com/media/6siZvkfigPHnX9zQdA/giphy.gif)</p>

# BUT HOWWWW??
### Underlying concepts
#### 1. GAN - Generative Adversarial Network

The underlying principle of such systems in recent times has been GANs, or Generative Adversarial Networks. GANs are, in simple words, a machine learning framework introduced by [Ian J. Goodfellow et al.](https://arxiv.org/abs/1406.2661) in 2014 that can generate new data which resembles the training data without being a copy of any of it. Nvidia Corp. recently made a GAN which generates human faces that do not exist in real life. The implementation can be seen on www.thispersondoesnotexist.com. Every time you reload the page, a new face appears which belongs to nobody. A creepy choice of words, I know, but interesting nonetheless. Surprisingly, these GANs have somehow overcome the [Uncanny Valley Effect](https://spectrum.ieee.org/automaton/robotics/humanoids/what-is-the-uncanny-valley) and look real.

GANs have two parts:
- A Generator 
- A Discriminator

Informally, a generator is a "model that learns from the training data and uses what it learned to generate new data", as convincingly as possible, and a discriminator is a "model that has to tell the original data apart from the generated data and provide feedback to the generator".

In even simpler words, it's like a game of hide-and-seek; the generator has to generate new data and try to fool the discriminator, while the discriminator has to try its best to not be fooled by the data generated by the generator.
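To make the game a bit more concrete, here is a minimal sketch of the two loss functions from the original GAN formulation (binary cross-entropy, with the commonly used non-saturating generator loss). The function names and the example scores are my own choices for illustration, and the talking-head paper uses its own set of losses, so treat this only as a picture of the game itself.

```python
import math


def discriminator_loss(d_real: float, d_fake: float) -> float:
    """Loss the discriminator tries to minimize.

    d_real: probability the discriminator assigns to a real sample being real.
    d_fake: probability the discriminator assigns to a generated sample being real.
    """
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake: float) -> float:
    """Non-saturating generator loss: small when the discriminator is fooled."""
    return -math.log(d_fake)


# Illustrative scores (assumptions, not measurements).
# Discriminator is not fooled: real rated 0.9, fake rated 0.1.
loss_when_not_fooled = discriminator_loss(0.9, 0.1)
# Discriminator is fooled: it rates the fake at 0.9 as well.
loss_when_fooled = discriminator_loss(0.9, 0.9)

print(loss_when_not_fooled < loss_when_fooled)       # discriminator prefers not being fooled
print(generator_loss(0.9) < generator_loss(0.1))     # generator prefers fooling it
```

The two players pull the same number, the discriminator's score on generated data, in opposite directions, which is exactly the hide-and-seek described above.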

Mathematically,

> **Generators** (generative models) have to capture the joint probability, that is, _P(X, Y)_ of the data _X_ and its labels _Y_, or just _P(X)_ if there are no labels

whereas

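> **Discriminators** (discriminative models) only have to capture the conditional probability, that is, _P(Y | X)_: the probability of a label _Y_ given the data _X_. In a GAN, that label is simply "real" or "generated".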

#### 2. Embedder

In the paper, an embedder network is also used in the meta-learning stage. An embedder network outputs the embedding vectors, which are then fed into the generator and help in creating new data. The embedding vector is computed as:
  
<p style="text-align: center;">  
<img src = "https://render.githubusercontent.com/render/math?math=\hat{e}_{i} = \frac{1}{K}\sum_{k=1}^{K}E(x_{i}(s_{k}), y_{i}(s_{k}), \phi)"><br></p>

where 
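<img src = "https://render.githubusercontent.com/render/math?math=\hat{e}_{i}"> is the embedding vector, calculated by averaging the outputs of the embedder **_E_** over **_K_** frames of the same video. **_K_** is the number of frames (shots) used in K-shot learning; for few-shot learning, the value of K is low. Here **_i_** is a randomly chosen video from the training set, <img src = "https://render.githubusercontent.com/render/math?math=s_{k}"> are the K frames sampled from it, <img src = "https://render.githubusercontent.com/render/math?math=x_{i}(s_{k})"> and <img src = "https://render.githubusercontent.com/render/math?math=y_{i}(s_{k})"> are a sampled frame and its landmark image, and **_t_** is another randomly chosen frame from the same video, which the generator later has to reconstruct. The embedder's network parameters, which are learned during this meta-learning stage, are denoted by <img src = "https://render.githubusercontent.com/render/math?math=\phi">.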

In plain English, **an embedder network (with learned parameters phi) extracts person-specific information from the training video frames and their landmarks, and passes it to the Generator network as an embedding vector.**
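The averaging step in the embedder equation can be sketched in a few lines of plain Python. Everything here is a stand-in: `toy_embedder` is a hypothetical placeholder for the real convolutional embedder **_E_**, and the "frames" and "landmarks" are just short lists of numbers rather than images. Only the averaging over K shots mirrors the equation.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Frame = Sequence[float]       # stand-in for a video frame x_i(s_k)
Landmarks = Sequence[float]   # stand-in for a landmark image y_i(s_k)


def toy_embedder(frame: Frame, landmarks: Landmarks) -> Vector:
    """Hypothetical E(x, y): NOT the paper's network, only a placeholder
    that maps a (frame, landmarks) pair to a 2-dimensional vector."""
    return [sum(frame) / len(frame), sum(landmarks) / len(landmarks)]


def average_embedding(
    shots: Sequence[Tuple[Frame, Landmarks]],
    embedder: Callable[[Frame, Landmarks], Vector],
) -> Vector:
    """Compute e_hat = (1/K) * sum over k of E(x(s_k), y(s_k))."""
    k = len(shots)
    if k == 0:
        raise ValueError("need at least one shot")
    per_shot = [embedder(x, y) for x, y in shots]
    dim = len(per_shot[0])
    return [sum(e[d] for e in per_shot) / k for d in range(dim)]


# K = 2 shots from the same (made-up) video.
shots = [
    ([0.0, 2.0], [1.0, 1.0]),   # toy_embedder gives [1.0, 1.0]
    ([2.0, 4.0], [5.0, 5.0]),   # toy_embedder gives [3.0, 5.0]
]
e_hat = average_embedding(shots, toy_embedder)
print(e_hat)  # [2.0, 3.0]
```

Because the result is an average, the embedding has the same size whether K is 1 or 32, which is what lets the same generator work with any number of shots.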

# Okaayyyy.. go off, I guess?
### I have the general idea now but what about it? How is everything implemented??

#### 1. Generator
In the previous step, we got the embedding vector, which will be fed into the generator network. 

The Generator takes the landmark image of a different frame (one the embedder has not seen) and the _embedding vector_ from the embedder network. The output is a new video frame generated by the Generator. The generator is represented as:

<p style="text-align: center;">  
<img src = "https://render.githubusercontent.com/render/math?math=\hat{x}_{i}(t) = G(y_{i}(t), \hat{e}_{i}, \psi, P)"><br></p>

where <br>
<img src = "https://render.githubusercontent.com/render/math?math=y_{i}(t)"> is the landmark image, <br> 
<img src = "https://render.githubusercontent.com/render/math?math=\hat{e}_{i}"> is the embedding vector, <br>
<img src = "https://render.githubusercontent.com/render/math?math=\psi"> are the _person-generic_ parameters that are learnt during meta-learning itself and <br>
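<img src = "https://render.githubusercontent.com/render/math?math=\hat{\psi}_{i}"> are the _person-specific_ parameters, computed from the embedding vector through the trainable projection matrix **_P_** as <img src = "https://render.githubusercontent.com/render/math?math=\hat{\psi}_{i} = P\hat{e}_{i}">. <br>
The output is <img src = "https://render.githubusercontent.com/render/math?math=\hat{x}_{i}(t)">, which is a synthesized video frame.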
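As I read the paper, the person-specific parameters come from a plain matrix-vector product, _psi_hat = P · e_hat_, where **_P_** is a trainable projection matrix. Here is a tiny sketch of just that step. The sizes and numbers are made up for illustration (the real matrix is far larger), and `project_embedding` is a name I chose, not something from the paper's code.

```python
from typing import List


def project_embedding(P: List[List[float]], e_hat: List[float]) -> List[float]:
    """Compute psi_hat = P @ e_hat, producing one value per row of P."""
    if any(len(row) != len(e_hat) for row in P):
        raise ValueError("each row of P must have the same length as e_hat")
    return [sum(p * e for p, e in zip(row, e_hat)) for row in P]


# Toy sizes (assumption): a length-2 embedding mapped to 3 person-specific parameters.
P = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]
e_hat = [2.0, 4.0]

psi_hat = project_embedding(P, e_hat)
print(psi_hat)  # [2.0, 4.0, 3.0]
```

The person-generic parameters stay the same for everyone, while `psi_hat` changes with each person's embedding, so one generator can be steered towards different faces.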

Once we have the synthesized frame, it is sent to the discriminator, which predicts whether it is a real frame or a synthesized one.

#### 2. Discriminator

The discriminator, used to differentiate between real frames and fake outputs from the generator, takes an input frame (either from the generator output or from the training dataset), the corresponding landmark image and the index of the video from which the frame is taken. The output is _r_, a.k.a. the realism score, a single scalar value indicating how much the discriminator "_thinks_" that the frame is real. Yes, I really made that pun.

The discriminator is represented as:

<p style="text-align: center;">  
<img src = "https://render.githubusercontent.com/render/math?math=r = D(x_{i}(t), y_{i}(t),i, \theta, W, w_{0}, b)"><br></p>

where,<br>
<img src = "https://render.githubusercontent.com/render/math?math=x_{i}(t)">  is the video frame, <br>
<img src = "https://render.githubusercontent.com/render/math?math=y_{i}(t)"> is the landmark image, <br>
<img src = "https://render.githubusercontent.com/render/math?math=i"> is the index of the video sequence in the dataset and<br>
<img src = "https://render.githubusercontent.com/render/math?math=\theta, W, w_{0}, b"> represent the learnable parameters of the discriminator network.
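
To show how these parameters fit together: as I understand the paper, a convolutional part of the discriminator (parameters theta) first turns the frame and landmark image into a vector _v_, and the realism score is then _r = v · (W_i + w0) + b_, where _W_i_ is the part of _W_ that belongs to video _i_. The sketch below covers only that last step. The vector `v` is a made-up stand-in for the network output, `W` is stored with one row per video for convenience, and all numbers are illustrative assumptions.

```python
from typing import List


def realism_score(
    v: List[float],
    W: List[List[float]],
    i: int,
    w0: List[float],
    b: float,
) -> float:
    """Compute r = v . (W_i + w0) + b, with W_i stored as row i of W."""
    combined = [w_i + w_0 for w_i, w_0 in zip(W[i], w0)]
    return sum(v_j * c_j for v_j, c_j in zip(v, combined)) + b


v = [1.0, 2.0]                    # stand-in for the ConvNet output on (frame, landmarks)
W = [[0.5, 0.0], [0.0, 0.5]]      # one row per training video (toy values)
w0 = [1.0, 1.0]                   # shared across all videos
b = 0.5                           # scalar bias

r = realism_score(v, W, 0, w0, b)
print(r)  # 4.0
```

The same frame gets a different score depending on which video index it is paired with, so the discriminator judges both "does this look real" and "does this look like person _i_".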