# GANs


Generative Adversarial Networks (GANs) are neural networks that take random noise as input and generate outputs (e.g. a picture of a human face) that appear to be samples from the distribution of the training set (e.g. a set of other human faces).

A GAN achieves this feat by training two models simultaneously:

+ A **generative model** that captures the distribution of the training set.
+ A **discriminative model** that estimates the probability that a sample came from the training data rather than from the generative model above.

### Why GANs?
+ When training data is scarce, GANs can learn the data distribution and generate synthetic images to augment your dataset.
+ They can create images that look like photographs of human faces, even though the faces don’t belong to any real person. Isn’t that incredible?
+ They can generate images from descriptions (text-to-image synthesis).
+ They can improve the resolution of a video so that it captures finer details (low resolution to high resolution).
+ Even in the audio domain, GANs can be used to produce synthetic, high-fidelity audio or perform voice translation.

This is not all. Other applications include image-to-image translation, face ageing, and super-resolution, among many others.


### Advantages of GANs over Other Generative Models

GANs are among the most widely used generative models today. Let’s see why:

+ Data labelling is an expensive task. GANs are unsupervised, so no labelled data is required to train them.
+ GANs are known for generating sharp images, which adversarial training makes possible. Models trained with a Mean Squared Error loss, by contrast, tend to produce blurry images.
+ Both networks in a GAN can be trained using only backpropagation.

Let’s try to understand GANs with some simple analogies.


### Intuition behind GANs
There are two ways to look at a GAN. 

+ Call it an artist that sketches realistic images from scratch. Like many successful artists, it needs a mentor to reach higher levels of proficiency. Seen thus, a GAN consists of:
    + An artist, i.e., the Generator 
    + And a mentor, i.e., the Discriminator 
    
The Discriminator helps the Generator in generating realistic images from what is merely noise.

![](./fig/gan1.jpg)

+ What if the GAN were not really an artist, but an ‘art forger’? Wouldn’t it then need an inspector to check what is genuine and what is not? Look at a GAN this way, then:
    + The generator plays the role of the art forger. The aim of this network is to mimic realistic art.
    + The discriminator plays the role of the inspector, who checks whether the art is real or fake. Its job is to look at the real artwork as well as the fake artwork generated by the forger, and to differentiate between the two. Further, the art inspector employs a feedback mechanism to help the forger generate more realistic images.
    
![](./fig/gen_dis.jpg)  

In short, as shown above, a GAN is a contest between two adversaries: the generator and the discriminator.

The generator tries to learn the data distribution, by taking random noise as input, and producing realistic-looking images. 
On the other hand, the discriminator tries to classify whether the sample has come from the real dataset, or is fake (generated by the generator).

![](./fig/gen_dis1.jpg)  

When GAN training starts, the generator produces gibberish, having no clue what a realistic observation might look like. All through the training, noise is the only input to the generator. Not once does it get to see the original observations. Initially, even the discriminator cannot distinguish between real and fake, though it does come across both real and fake observations during the training.
    
### Components of a GAN
The idea of GANs has revolutionized the generative modeling domain. Ian Goodfellow et al. of Université de Montréal first published a paper on Generative Adversarial Networks in 2014, at the NIPS conference. The paper introduced the GAN as a new framework for estimating generative models via an adversarial process, in which a generative model $G$ captures the data distribution, while a discriminative model $D$ estimates the probability that a sample came from the training data rather than from $G$.

$x \rightarrow Training\ Sample$  

$z \rightarrow Noise\ Vector$  

$x_{real} \rightarrow Real\ Images$  

$x_{fake} \rightarrow Output\ of\ the\ generator,\ G(z)$  

$D \rightarrow Discriminator$  

$G \rightarrow Generator$  

$D(x) \in (0,1) \rightarrow Discriminator's\ output$  

A **GAN** comprises a Generator $G$ and a Discriminator $D$, which are trained simultaneously. Given a dataset $X_{real}$, the generator $G$ tries to capture the dataset distribution, by producing images $X_{fake}$ from noise $Z$. The discriminator $D$ tries to discriminate between the original dataset images $X_{real}$ and the images produced by the generator $X_{fake}$. Through this adversarial process, the end goal is to mimic the dataset distribution as realistically as possible. For instance, when provided with a dataset of car images $X_{real}$, a **GAN** aims to generate plausible car images $X_{fake}$.

### Generator
The generator in a GAN is a neural network that, given a random set of values, performs a series of non-linear computations to produce realistic-looking images. It produces fake images $X_{fake}$ when fed a random vector $Z$ sampled from a multivariate Gaussian distribution.

![](./fig/gen.png)

The generator’s role is to:

+ Fool the discriminator
+ Produce realistic-looking images
+ Achieve high performance by the end of the training process

For example, if you train a GAN on lots of dog images, your generator should then be able to produce diverse, realistic-looking dog images.
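The mapping from a noise vector to a fake image described above can be sketched in a few lines. This is only an illustrative sketch using the Python standard library: the layer sizes, the ReLU and tanh activations, and the helper names `make_dense`, `dense`, and `generator` are assumptions made for this example, and the weights are random (untrained), so the output is noise rather than a realistic image.

```python
import math
import random

def make_dense(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer with small random weights."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(x, layer, activation):
    """Apply one dense layer followed by an element-wise activation."""
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def relu(v):
    return max(0.0, v)

rng = random.Random(0)
NOISE_DIM, HIDDEN_DIM, IMAGE_DIM = 8, 16, 28 * 28  # hypothetical sizes

hidden_layer = make_dense(NOISE_DIM, HIDDEN_DIM, rng)
output_layer = make_dense(HIDDEN_DIM, IMAGE_DIM, rng)

def generator(z):
    """Map a noise vector z to a flattened fake image with pixels in [-1, 1]."""
    h = dense(z, hidden_layer, relu)
    return dense(h, output_layer, math.tanh)

z = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]  # z sampled from N(0, I)
x_fake = generator(z)
print(len(x_fake), min(x_fake), max(x_fake))
```

Training would adjust the weights of both layers so that `generator(z)` lands closer to the distribution of real images.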


### Discriminator
The discriminator is based on the concept of discriminative modeling, which, as you learned earlier, describes a classifier that learns to separate the classes in a dataset using class-specific labels. In essence, training it is similar to a supervised classification problem. The discriminator’s ability to classify observations is not limited to images; it extends to video, text, and many other domains (multi-modal).

![](./fig/dis.png)

The discriminator’s role in a GAN is to solve a binary classification problem: discriminating between real and fake images. It does this by:

+ Predicting whether the observation is generated by the generator (fake), or comes from the original data distribution (real).
+ Learning a set of parameters or weights ($\theta_d$) while doing so. The weights keep getting updated as the training progresses.

A **Binary Cross-Entropy (BCE)** loss function is used to train the discriminator. This function is discussed in detail later in this section.

The original GAN used fully-connected (dense) layers in the discriminator, and so will you in the coding section here. However, in 2015 came the Deep Convolutional GAN (DCGAN), which demonstrated that convolutional layers work better than fully-connected layers in a GAN.
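As a rough sketch of what a dense discriminator computes, the snippet below maps a flattened image to a single probability. The class name `DenseDiscriminator`, the layer sizes, and the leaky-ReLU slope of 0.2 are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import math
import random

def sigmoid(a):
    """Squash a real-valued logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

class DenseDiscriminator:
    """Hypothetical two-layer dense discriminator: image -> hidden -> D(x)."""

    def __init__(self, image_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(image_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
        self.b2 = 0.0

    def __call__(self, x):
        # Hidden layer with a leaky ReLU (the 0.2 slope is an assumed choice).
        hidden = []
        for row, b in zip(self.w1, self.b1):
            a = sum(w * v for w, v in zip(row, x)) + b
            hidden.append(a if a > 0 else 0.2 * a)
        # Single output logit, squashed to the probability that x is real.
        logit = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return sigmoid(logit)

rng = random.Random(1)
image = [rng.uniform(-1.0, 1.0) for _ in range(16)]  # stand-in for a flattened image
D = DenseDiscriminator(image_dim=16, hidden_dim=8)
p_real = D(image)
print(f"D(x) = {p_real:.3f}")
```

The sigmoid on the final layer is what makes the output usable as $D(x)$, a probability that can be fed directly into the BCE loss.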

### Training Procedure
Let's denote the combined set of fake and real images as $X$. Given real images ($X_{real}$) and fake images ($X_{fake}$), the discriminator, which is a binary classifier, tries to classify each image as fake or real. Does the image belong to the true data distribution $P_{data}$ or the model distribution $P_{model}$? That’s what the discriminator tries to determine.

The training of the generator and discriminator in GAN is done in an alternating fashion.  

In the first step:

+ The images produced by the generator $X_{fake}$ and the original images $X_{real}$ are passed to the discriminator.
+ The discriminator predicts $Y_{pred}$ (a probability score) for each image in $X$, indicating how likely it is to be real.
+ The predictions are compared with the ground truth {0: fake, 1: real}, and a **Binary Cross-Entropy (BCE)** loss is calculated.
+ The gradient of the loss is backpropagated only through the discriminator, and its parameters are optimized accordingly.

In the second step:

+ The generator produces images $X_{fake}$, which are again passed through the discriminator.
+ The discriminator again outputs a prediction $Y_{pred}$, and the BCE loss is computed.

In this second step, because you want the generator to produce images that are as similar to the real images as possible (i.e., close to the true distribution), the ground-truth labels for the fake images are all set to ‘real’, or 1. The loss is therefore low only when the generator fools the discriminator into believing that its images are real. The gradient of this loss flows back through the discriminator to the generator, but only the generator’s parameters are updated.

It is important to note that for the generator to produce realistic images, the discriminator has to be there to guide it (the loss for fake images is backpropagated to the generator). Thus, both networks need to be strong enough. If:

+ The discriminator is a weak classifier, then even non-plausible images produced by the generator will be classified as real. As a result, the generator produces low-quality images.
+ The generator is weak, it will not be able to fool the discriminator, as it will not generate images similar to the true distribution.
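The label assignment in the two alternating steps can be illustrated with a small sketch. The discriminator outputs below are made-up numbers, and `bce` is a helper written for this example; in a real training loop a framework would also take care of updating only the discriminator's parameters in step 1 and only the generator's in step 2.

```python
import math

def bce(predictions, labels):
    """Mean binary cross-entropy between probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(predictions, labels):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(predictions)

# Hypothetical discriminator outputs for one batch.
d_real = [0.9, 0.8, 0.95]  # D(x_real): the discriminator wants these near 1
d_fake = [0.2, 0.1, 0.3]   # D(G(z)):   the discriminator wants these near 0

# Step 1 (discriminator update): real images are labelled 1, fake images 0.
d_loss = bce(d_real + d_fake, [1] * len(d_real) + [0] * len(d_fake))

# Step 2 (generator update): the same kind of fake predictions,
# but now every fake image is labelled as real (1).
g_loss = bce(d_fake, [1] * len(d_fake))

print(f"discriminator loss: {d_loss:.3f}")
print(f"generator loss:     {g_loss:.3f}")
```

With these numbers the discriminator is doing well, so its loss is small while the generator's loss is large: the generator is penalized precisely because its images are being recognized as fake.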

### Objective function of GAN

As you have seen, both the generator and the discriminator are trained based on the classification score given by the discriminator’s final layer, which indicates how real or fake its input appears. That makes cross-entropy a natural choice of loss for training such a network. Since this is a two-class classification problem, the Binary Cross-Entropy (BCE) function is used.



\begin{equation*}L(\hat{y}, y) = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i \log\hat{y_i} + (1-y_i) \log (1-\hat{y_i}) \right]\end{equation*}

Let's break down the above equation and understand various components of it.

+ The negative sign at the beginning of the equation keeps the loss non-negative. As the neural-network output is normalized between 0 and 1, the log of a value in this range is less than or equal to zero. Hence, we minimize the negative log-likelihood.
+ Remember that we train our neural network in batches. The summation from 1 to $N$ means that the loss is computed for the $N$ training samples in a batch, and dividing by $N$ then averages the loss across the batch.
+ The $\hat{y_i}$ is the prediction made by the model (the discriminator in a GAN), while $y_i$ is the true label, irrespective of the sample being real or fake.
+ Did you notice that there are two terms in the loss function, but only one is relevant for any given sample? That’s because the first term is active when the true label is 1 (real), and the second term is active when the true label is 0 (fake).
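The points above can be checked numerically with a small sketch. The predictions and labels are made-up values, and `bce_terms` and `bce_loss` are helper names introduced only for this example.

```python
import math

def bce_terms(y_hat, y):
    """Return the two per-sample BCE terms (before the minus sign and averaging)."""
    real_term = math.log(y_hat) if y == 1 else 0.0      # active only when y == 1
    fake_term = math.log(1 - y_hat) if y == 0 else 0.0  # active only when y == 0
    return real_term, fake_term

def bce_loss(y_hats, ys):
    """Average negative log-likelihood over a batch of N samples."""
    n = len(ys)
    return -sum(sum(bce_terms(p, y)) for p, y in zip(y_hats, ys)) / n

batch_predictions = [0.9, 0.6, 0.2, 0.4]  # hypothetical discriminator outputs
batch_labels = [1, 1, 0, 0]               # 1 = real, 0 = fake

for p, y in zip(batch_predictions, batch_labels):
    real_term, fake_term = bce_terms(p, y)
    print(f"y={y} y_hat={p}: real_term={real_term:.3f} fake_term={fake_term:.3f}")

print(f"batch loss = {bce_loss(batch_predictions, batch_labels):.3f}")
```

For every sample exactly one of the two terms is non-zero, each active term is a log of a value below 1 and therefore negative, and the leading minus sign turns the average into a positive loss.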

Now that you have understood the BCE loss function, see how it is modeled in GAN.

+ The generator’s goal is to learn a distribution $p_g$ over the data $x$.
+ A prior $p_z(z)$ is defined on the input noise variables; here $z$ is sampled from a normal distribution.
+ The input noise vector is then mapped to data space as $G(z; \theta_g)$, where $G$ is a differentiable function represented by a fully-connected network with learnable parameters $\theta_{g}$.
+ A second fully-connected network $D(x; \theta_d)$ outputs a single scalar value in $[0, 1]$. $D(x)$ represents the probability that $x$ came from the true data distribution rather than from $p_g$, i.e. the generator $G$. The network is trained to maximize the probability of $D$ assigning the correct label to both training examples and samples produced by $G$.
+ At the same time, we train $G$ to minimize $\log(1 - D(G(z)))$.

In other words, $D$ and $G$ play the following two-player minimax game with value function $V(D, G)$:

\begin{equation*}\min_{G}\max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))]\end{equation*}
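
As a sketch of how this value function relates to the BCE loss: substitute $y = 1$, $\hat{y} = D(x)$ for real samples and $y = 0$, $\hat{y} = D(G(z))$ for fake samples into the BCE formula. Assuming a minibatch of $m$ real and $m$ fake samples (the equal batch sizes are an assumption made here for simplicity), and up to a constant averaging factor, the discriminator's loss becomes:

\begin{equation*}L_D = -\frac{1}{m}\sum_{i=1}^{m}\left[\log D(x^{(i)}) + \log\left(1 - D(G(z^{(i)}))\right)\right]\end{equation*}

Minimizing $L_D$ with respect to the discriminator is therefore the same as maximizing a minibatch estimate of $V(D, G)$, which is the inner $\max_{D}$ of the minimax game.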

As noted in the original GAN paper, in practice the above equation may not provide sufficient gradient for the generator to learn well. Training this way achieves only half the objective: the discriminator becomes more powerful, since it can easily discriminate the real from the fake, but the generator lags behind and has still not learned to produce realistic-looking images.

Early in learning, when $G$ is poor, $D$ can reject samples with high confidence because they are clearly different from the training data. In this case, $\log(1 - D(G(z)))$ saturates. Hence, rather than training $G$ to minimize $\log(1 - D(G(z)))$, the authors train $G$ to maximize $\log D(G(z))$, which is equivalent to minimizing $-\log D(G(z))$.
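The saturation effect can be made concrete with a small numerical sketch. It assumes the discriminator's output is a sigmoid of a logit $a$, so that $D(G(z)) = \sigma(a)$, and compares the gradient of the two generator objectives with respect to that logit: $\frac{d}{da}\log(1 - \sigma(a)) = -\sigma(a)$ for the original objective and $\frac{d}{da}\left[-\log \sigma(a)\right] = -(1 - \sigma(a))$ for the alternative. The function names are chosen for this example only.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(logit):
    """d/da of log(1 - sigmoid(a)): the original minimax generator objective."""
    return -sigmoid(logit)

def grad_non_saturating(logit):
    """d/da of -log(sigmoid(a)): the alternative generator loss."""
    return -(1.0 - sigmoid(logit))

# Very negative logits correspond to the discriminator confidently rejecting a fake.
for logit in (-6.0, -2.0, 0.0, 2.0):
    d = sigmoid(logit)
    print(f"D(G(z))={d:.4f}  "
          f"saturating grad={grad_saturating(logit):+.4f}  "
          f"non-saturating grad={grad_non_saturating(logit):+.4f}")
```

When $D(G(z))$ is close to 0, the gradient of the original objective is close to 0 as well, so the generator receives almost no learning signal, while the gradient of $-\log D(G(z))$ stays close to $-1$.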

Let's examine the above objective function in greater detail.

The discriminator is a binary classifier that given an input $x$, outputs a probability $D(x)$ between 0 and 1.

As the true label for $X_{real}$ is 1 and the true label for $X_{fake}$ is 0:

+ A probability $D(x)$ closer to 1 means the discriminator predicts that the input is a real image.

+ A probability closer to 0 means the discriminator predicts that the input is fake.

**Thus, the objective of the discriminator becomes:**

+ Maximizing the probability $D(X_{real})$, i.e. bringing it closer to 1.
+ Minimizing the probability $D(X_{fake})$, where $X_{fake}$ is $G(Z)$, i.e. bringing it closer to 0.

The generator wants the images generated by it to be classified as real by the discriminator.

**Thus, the objective of the generator becomes:**

+ Maximizing the probability $D(G(z))$, i.e. bringing it closer to 1.


Now, let’s have a look at the minibatch stochastic gradient descent training of generative adversarial nets. The number of steps to apply to the discriminator, $k$, is a hyperparameter. A value of $k = 1$ is used, as it is the least expensive option.
```
for number of training iterations do
    for k steps do
        1. Sample minibatch of m noise samples {z^{(1)}, ..., z^{(m)}}
           from noise prior p_{z}(z).
        2. Sample minibatch of m examples {x^{(1)}, ..., x^{(m)}}
           from data generating distribution p_{data}(x).
        3. Update the discriminator by minimizing the
           Discriminator loss, D_{loss} = -log D(X_{real}) - log(1 - D(G(z)))
    end for
    1. Sample minibatch of m noise samples {z^{(1)}, ..., z^{(m)}}
       from noise prior p_{z}(z).
    2. Update the generator by minimizing the Generator loss,
       G_{loss} = -log D(G(z))
end for
```
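
To make the loop above concrete, here is a deliberately tiny, runnable sketch of the same procedure on one-dimensional data. Everything specific in it is an assumption made for illustration, not part of the original algorithm: the real data is drawn from a Gaussian with mean 3, the generator is the linear map $G(z) = a z + b$, the discriminator is the logistic model $D(x) = \sigma(w x + c)$, gradients are derived by hand, and the function name `train_toy_gan` and all hyperparameter values are hypothetical. Toy setups like this may oscillate rather than settle, so no claim is made about convergence.

```python
import math
import random

def sigmoid(a):
    # Clamp the logit so math.exp cannot overflow on extreme values.
    a = max(-30.0, min(30.0, a))
    return 1.0 / (1.0 + math.exp(-a))

def train_toy_gan(iterations=2000, k=1, m=16, lr=0.05, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator G(z) = a * z + b
    w, c = 0.1, 0.0   # discriminator D(x) = sigmoid(w * x + c)
    eps = 1e-8        # keeps log() away from zero
    d_loss = g_loss = 0.0

    for _ in range(iterations):
        for _ in range(k):
            # Steps 1-2: sample m noise values and m real examples.
            zs = [rng.gauss(0.0, 1.0) for _ in range(m)]
            xs = [rng.gauss(3.0, 0.5) for _ in range(m)]  # assumed p_data
            grad_w = grad_c = d_loss = 0.0
            for z, x_real in zip(zs, xs):
                x_fake = a * z + b
                d_real = sigmoid(w * x_real + c)
                d_fake = sigmoid(w * x_fake + c)
                d_loss += -(math.log(d_real + eps) + math.log(1 - d_fake + eps)) / m
                # Gradient of -log D(x_real) - log(1 - D(x_fake)) w.r.t. w and c.
                grad_w += ((d_real - 1) * x_real + d_fake * x_fake) / m
                grad_c += ((d_real - 1) + d_fake) / m
            # Step 3: update only the discriminator's parameters.
            w -= lr * grad_w
            c -= lr * grad_c

        # Generator step: fresh noise, loss -log D(G(z)), update only a and b.
        zs = [rng.gauss(0.0, 1.0) for _ in range(m)]
        grad_a = grad_b = g_loss = 0.0
        for z in zs:
            d_fake = sigmoid(w * (a * z + b) + c)
            g_loss += -math.log(d_fake + eps) / m
            # Chain rule: dL/dlogit = d_fake - 1, dlogit/dx_fake = w.
            grad_a += (d_fake - 1) * w * z / m
            grad_b += (d_fake - 1) * w / m
        a -= lr * grad_a
        b -= lr * grad_b

    return {"a": a, "b": b, "w": w, "c": c, "d_loss": d_loss, "g_loss": g_loss}

result = train_toy_gan()
print(result)
```

Note how the generator's gradient is computed through the discriminator (the factor `w`), yet only `a` and `b` change in the generator step. This mirrors the alternating updates in the pseudocode.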