# 17. Representation Learning and Generative Learning Using Autoencoders and GANs

Autoencoders are artificial neural networks capable of learning dense representations of the input data, called **latent representations** or **codings**, without any supervision.  

These codings typically have a much lower dimensionality than the input data, which makes autoencoders useful for:
* Dimensionality reduction (especially for visualization)
* Feature detection (for example, unsupervised pretraining of deep neural networks)
* Generating new data: some autoencoders are generative models

Generative Adversarial Networks (GANs) are a more recent class of generative models, already widely adopted, with applications in image editing and creation, super resolution, colorization, and more.

Key differences between the two, in short:

1. Autoencoders are made of an encoder and a decoder. Codings are byproducts of the autoencoder learning the identity function under some constraints.
2. GANs, on the other hand, are made of a generator and a discriminator that are trained against each other.

### Efficient Data Representations

An autoencoder looks at the inputs, converts them to an efficient latent representation, and then spits out something that (hopefully) looks very close to the inputs.

Autoencoder outputs are called **reconstructions**, and the cost function contains a **reconstruction loss** that penalizes the model when the reconstructions are different from the inputs. When the internal representation has a lower dimensionality than the input data, the autoencoder is said to be **undercomplete**.

### Performing PCA with an Undercomplete Linear Autoencoder

If the autoencoder uses only linear activations and the cost function is the mean squared error (MSE), then it ends up performing Principal Component Analysis (PCA).

In [1]:
from tensorflow import keras

# Encoder: 3D inputs -> 2D codings; decoder: 2D codings -> 3D reconstructions.
# No activation function is set, so every layer is linear.
encoder = keras.models.Sequential([keras.layers.Dense(2, input_shape=[3])])
decoder = keras.models.Sequential([keras.layers.Dense(3, input_shape=[2])])
autoencoder = keras.models.Sequential([encoder, decoder])
autoencoder.compile(loss="mse", optimizer=keras.optimizers.SGD(learning_rate=0.1))

This actually looks very similar to an MLP. Now let’s train the model on a simple generated 3D dataset and use it to encode that same dataset (in 2D):

In [2]:
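# X_train is assumed to hold the generated 3D dataset (shape [m, 3]),
# defined before running this cell.
# The inputs are also the targets, since the model learns to reconstruct them.
history = autoencoder.fit(X_train, X_train, epochs=20)
codings = encoder.predict(X_train)   # 2D codings, shape [m, 2]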
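
The training cell above needs a 3D `X_train` array, which these notes do not build. As a minimal sketch, the snippet below generates a noisy 3D dataset lying close to a plane and projects it to 2D with PCA, computed through NumPy's SVD. The helper `generate_3d_data`, its parameters, and the seed are illustrative assumptions, not the original dataset. A trained linear autoencoder should find codings that span the same 2D subspace, although its axes need not coincide with the principal components.

```python
import numpy as np

def generate_3d_data(m, w1=0.1, w2=0.3, noise=0.1, seed=4):
    """Hypothetical helper: m noisy 3D points lying close to a 2D plane."""
    rng = np.random.default_rng(seed)
    angles = rng.random(m) * 3 * np.pi / 2 - 0.5
    data = np.empty((m, 3))
    data[:, 0] = np.cos(angles) + np.sin(angles) / 2 + noise * rng.standard_normal(m) / 2
    data[:, 1] = np.sin(angles) * 0.7 + noise * rng.standard_normal(m) / 2
    data[:, 2] = data[:, 0] * w1 + data[:, 1] * w2 + noise * rng.standard_normal(m)
    return data

X_train = generate_3d_data(60)
X_train = X_train - X_train.mean(axis=0)    # PCA assumes centered data

# PCA via SVD: the rows of Vt are the principal components
U, s, Vt = np.linalg.svd(X_train, full_matrices=False)
W2 = Vt[:2].T                         # 3x2 projection matrix (plays the encoder's role)
codings_pca = X_train @ W2            # 2D codings, shape (60, 2)
reconstructions = codings_pca @ W2.T  # back to 3D (plays the decoder's role)

reconstruction_mse = np.mean((X_train - reconstructions) ** 2)
total_variance = np.mean(X_train ** 2)
print(codings_pca.shape, reconstruction_mse, total_variance)
```

The reconstruction error left by PCA is the variance along the discarded third component, which is the lowest error a linear 2D bottleneck can reach under the MSE loss.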

### Stacked Autoencoders

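A stacked (or deep) autoencoder has multiple hidden layers, which lets it learn more complex codings. Be careful not to make the autoencoder too powerful, though: it may then reconstruct the training data very well without learning any useful representation, and it is unlikely to generalize well to new instances.

The architecture is usually symmetrical with respect to the central hidden layer (the coding layer).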

#### Implementing a Stacked Autoencoder Using Keras

In [3]:
# encoder
stacked_encoder = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(100, activation="selu"),
    keras.layers.Dense(30, activation="selu"),
])

# decoder
stacked_decoder = keras.models.Sequential([
    keras.layers.Dense(100, activation="selu", input_shape=[30]),
    keras.layers.Dense(28 * 28, activation="sigmoid"),
    keras.layers.Reshape([28, 28])
])

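stacked_ae = keras.models.Sequential([stacked_encoder, stacked_decoder])
stacked_ae.compile(loss="binary_crossentropy",
                   optimizer=keras.optimizers.SGD(learning_rate=1.5))
# X_train and X_valid are assumed to be image arrays of shape [m, 28, 28] scaled
# to [0, 1], such as the Fashion MNIST splits loaded in cell In [34] further down.
# The inputs are also the targets, since the model learns to reconstruct them.
history = stacked_ae.fit(X_train, X_train, epochs=10,
                         validation_data=(X_valid, X_valid))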

#### Visualizing the Reconstructions

Example code to check how our reconstruction fares against the actual inputs:

In [6]:
from matplotlib import pyplot as plt

def plot_image(image):
    plt.imshow(image, cmap="binary")
    plt.axis("off")
    
def show_reconstructions(model, n_images=5):
    reconstructions = model.predict(X_valid[:n_images])
    fig = plt.figure(figsize=(n_images * 1.5, 3))
    for image_index in range(n_images):
        plt.subplot(2, n_images, 1 + image_index)
        plot_image(X_valid[image_index])
        plt.subplot(2, n_images, 1 + n_images + image_index)
        plot_image(reconstructions[image_index])

# X_valid must be defined before this call (for example, by running cell In [34] first).
show_reconstructions(stacked_ae)

#### Unsupervised Pretraining Using Stacked Autoencoders

If you have a large dataset but most of it is unlabeled, you can:
1. First train a stacked autoencoder using all the data
2. Reuse the lower layers to create a neural network for your actual task 
3. Train it using the labeled data

![Pretraining using Autoencoders](images/17.Autoencoders_Pretraining.png)

#### Tying Weights

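Simple intuition: if the encoder and decoder layers are symmetrical, we can tie the weights of each decoder layer to those of the corresponding encoder layer (the decoder uses the transpose of the encoder's weight matrix). This halves the number of weights in the model, which speeds up training and limits the risk of overfitting.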
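
To make the idea concrete, here is a minimal NumPy sketch of the forward pass of a tied autoencoder. The layer sizes mirror the 784-100-30 stack used above. The `tanh` hidden activation, the random initialization, and the function name `tied_autoencoder` are assumptions made for illustration. Each layer keeps its own bias vector, and only the weight matrices are shared.

```python
import numpy as np

rng = np.random.default_rng(42)

n_inputs, n_hidden, n_codings = 784, 100, 30

# Only the encoder owns weight matrices; the decoder reuses their transposes.
W1 = rng.normal(scale=0.05, size=(n_inputs, n_hidden))
W2 = rng.normal(scale=0.05, size=(n_hidden, n_codings))

# Every layer still has its own bias vector.
b1, b2 = np.zeros(n_hidden), np.zeros(n_codings)   # encoder biases
b3, b4 = np.zeros(n_hidden), np.zeros(n_inputs)    # decoder biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tied_autoencoder(X):
    h1 = np.tanh(X @ W1 + b1)            # encoder layer 1
    codings = np.tanh(h1 @ W2 + b2)      # coding layer
    h3 = np.tanh(codings @ W2.T + b3)    # decoder layer 1 reuses W2 (transposed)
    return sigmoid(h3 @ W1.T + b4)       # output layer reuses W1 (transposed)

X = rng.random((5, n_inputs))            # stand-in for 5 flattened images
reconstructions = tied_autoencoder(X)

tied_weights = W1.size + W2.size         # weights stored once
untied_weights = 2 * tied_weights        # what an untied decoder would add
print(reconstructions.shape, tied_weights, untied_weights)
```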

### Convolutional Autoencoders

For images. The encoder is a regular CNN composed of convolutional layers and pooling layers. It typically reduces the spatial dimensionality of the inputs (i.e., height and width) while increasing the depth (i.e., the number of feature maps). The decoder must do the reverse (upscale the image and reduce its depth back to the original dimensions).
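
The shape bookkeeping can be checked with a few lines of plain Python. This is only a sketch: the layer stack (three 3x3 "same" convolutions each followed by 2x2 max pooling, then three stride-2 transposed convolutions) is a hypothetical architecture for 28x28 images, and the size formulas follow the conventions Keras uses for "same" and "valid" padding.

```python
def conv_same(size, stride=1):
    """Output size of a convolution with "same" padding: ceil(size / stride)."""
    return -(-size // stride)

def max_pool(size, pool=2):
    """Output size of max pooling with "valid" padding and stride == pool size."""
    return size // pool

def conv_transpose(size, kernel, stride, padding):
    """Output size of a transposed convolution (Keras convention)."""
    if padding == "same":
        return size * stride
    return size * stride + max(kernel - stride, 0)   # "valid" padding

# Hypothetical encoder: three blocks of Conv2D(3x3, "same") + MaxPool2D(2)
size = 28
encoder_sizes = [size]
for _ in range(3):
    size = max_pool(conv_same(size))
    encoder_sizes.append(size)

# Hypothetical decoder: three Conv2DTranspose(3x3, strides=2) layers
decoder_sizes = [size]
for padding in ("valid", "same", "same"):
    size = conv_transpose(size, kernel=3, stride=2, padding=padding)
    decoder_sizes.append(size)

print(encoder_sizes)   # [28, 14, 7, 3]
print(decoder_sizes)   # [3, 7, 14, 28]
```

The first transposed convolution uses "valid" padding because 7 is odd: a "same" layer with stride 2 would produce 6 instead of 7.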

### Recurrent Autoencoders

For sequences. The encoder is typically a sequence-to-vector RNN which compresses the input sequence down to a single vector. The decoder is a vector-to-sequence RNN that does the reverse. 

### Denoising Autoencoders

Another way to force the autoencoder to learn useful features is to add noise to its inputs, training it to recover the original, noise-free inputs.

The noise can be pure Gaussian noise added to the inputs, or it can be randomly switched-off inputs, just like in dropout. 

In [7]:
dropout_encoder = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(100, activation="selu"),
    keras.layers.Dense(30, activation="selu")
])
dropout_decoder = keras.models.Sequential([
    keras.layers.Dense(100, activation="selu", input_shape=[30]),
    keras.layers.Dense(28 * 28, activation="sigmoid"),
    keras.layers.Reshape([28, 28])
])
dropout_ae = keras.models.Sequential([dropout_encoder, dropout_decoder])
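
The two corruption schemes described above can be sketched in NumPy as follows. The function names, the noise standard deviation of 0.2, and the random stand-in images are assumptions for illustration. The dropout variant rescales the surviving inputs by `1 / (1 - rate)`, which is what inverted dropout does during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(X, stddev=0.2):
    """Corrupt the inputs with additive zero-mean Gaussian noise."""
    return X + rng.normal(scale=stddev, size=X.shape)

def dropout_noise(X, rate=0.5):
    """Switch off a random fraction `rate` of the inputs and rescale the
    survivors by 1 / (1 - rate), as inverted dropout does at training time."""
    keep_mask = rng.random(X.shape) >= rate
    return X * keep_mask / (1.0 - rate)

X = rng.random((4, 28, 28))            # stand-in for a batch of images in [0, 1]
X_gaussian = add_gaussian_noise(X)
X_dropped = dropout_noise(X)

# The autoencoder is trained to map the corrupted inputs back to X:
# inputs = X_gaussian (or X_dropped), targets = X
fraction_off = np.mean(X_dropped == 0.0)
print(X_gaussian.shape, X_dropped.shape, fraction_off)
```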

### Sparse Autoencoders

By adding an appropriate term to the cost function, the autoencoder is pushed to reduce the number of active neurons in the coding layer. As a result, each neuron in the coding layer typically ends up representing a useful feature, which reduces the model's reliance on complex combinations of activations.
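
The notes do not say which term is added. Two common choices are an L1 penalty on the coding activations and a Kullback–Leibler penalty between a target sparsity and the mean activation of each coding neuron. The sketch below shows both in NumPy. The function names, the penalty weights, and the 10% target sparsity are assumptions.

```python
import numpy as np

def l1_activity_penalty(codings, weight=1e-3):
    """L1 penalty on the coding-layer activations (pushes many of them to 0)."""
    return weight * np.sum(np.abs(codings))

def kl_sparsity_penalty(codings, target=0.1, weight=0.05):
    """KL divergence between the target sparsity and the mean activation of
    each coding neuron over the batch (activations assumed to lie in (0, 1))."""
    q = np.clip(codings.mean(axis=0), 1e-7, 1 - 1e-7)
    p = target
    kl = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
    return weight * np.sum(kl)

rng = np.random.default_rng(1)
dense_codings = rng.uniform(0.4, 0.6, size=(32, 30))     # most neurons active
sparse_codings = rng.uniform(0.05, 0.15, size=(32, 30))  # close to the 10% target

print(kl_sparsity_penalty(dense_codings), kl_sparsity_penalty(sparse_codings))
print(l1_activity_penalty(dense_codings), l1_activity_penalty(sparse_codings))
```

Either penalty would be added to the reconstruction loss. The KL version grows quickly as the mean activation moves away from the target, while the L1 version grows linearly.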

### Variational Autoencoders

They are different from what we have seen so far since they are:

* Probabilistic
* Generative

The process looks like this: inputs > encoder > mean coding $\mu$ and standard deviation $\sigma$ > actual coding sampled from a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$ > decoder > reconstruction

The cost function in this case is composed of two parts:

1. Reconstruction loss (as before)
2. Latent loss (the Kullback–Leibler divergence between the target Gaussian distribution and the actual distribution of the codings)
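
The sampling step and the latent loss can be sketched in NumPy as shown below. This sketch assumes the encoder outputs the log-variance $\log \sigma^2$ rather than $\sigma$ itself (a common choice for numerical stability) and that the target distribution is a standard Gaussian. The function names and the random stand-in encoder outputs are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_codings(mean, log_var):
    """Sample codings from N(mean, sigma^2), with sigma = exp(log_var / 2)."""
    epsilon = rng.standard_normal(mean.shape)      # Gaussian noise
    return mean + np.exp(log_var / 2) * epsilon

def latent_loss(mean, log_var):
    """KL divergence between N(mean, sigma^2) and N(0, 1), summed over the
    coding dimensions and averaged over the batch."""
    kl_per_instance = -0.5 * np.sum(1 + log_var - np.exp(log_var) - mean ** 2, axis=-1)
    return kl_per_instance.mean()

mean = rng.normal(size=(8, 10))        # stand-in for the encoder's mean output
log_var = rng.normal(size=(8, 10))     # stand-in for the encoder's log-variance output
codings = sample_codings(mean, log_var)

loss_random = latent_loss(mean, log_var)
loss_standard = latent_loss(np.zeros((8, 10)), np.zeros((8, 10)))
print(codings.shape, loss_random, loss_standard)
```

The latent loss vanishes only when every coding distribution already matches the standard Gaussian, which is what pushes the codings toward a region the decoder can later be sampled from.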

### Generative Adversarial Networks

The original [paper](https://homl.info/gan) was published in 2014. Intuition: make neural networks compete against each other, in the hope that the competition pushes them to excel. A GAN is made of two parts:

1. **Generator**: takes a random distribution (e.g., Gaussian) as input, which we can think of as the latent representation (coding) of the image to generate, and outputs some data (typically an image).
2. **Discriminator**: takes either a fake image from the generator or a real image from the training set, and must guess whether it is fake or real.

Training differs from that of a regular neural network because the two parts have conflicting objectives. Each training iteration is split into two phases:

1. Train the discriminator on a batch containing an equal number of real and fake images, labeled 1 (real) and 0 (fake).
2. Train the generator: it produces a batch of fake images that the discriminator judges, **but** this time all of them are labeled 1 (real), since we want the generator to produce images that the discriminator believes are real. Crucially, the weights of the discriminator are frozen during this step, so backpropagation only affects the weights of the generator.
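
The batches and labels used in the two phases can be sketched with NumPy alone. In this sketch, `fake_generator` is a hypothetical stand-in that returns random arrays instead of a trained network, and the Keras training calls appear only as comments to show where each batch would be used.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, codings_size = 32, 30

def fake_generator(noise):
    """Stand-in for a real generator network: maps codings to 28x28 'images'."""
    return rng.random((len(noise), 28, 28))

real_images = rng.random((batch_size, 28, 28))   # stand-in for a batch of real images

# Phase 1: train the discriminator on fake images (label 0) and real images (label 1)
noise = rng.standard_normal((batch_size, codings_size))
generated_images = fake_generator(noise)
X_fake_and_real = np.concatenate([generated_images, real_images], axis=0)
y_discriminator = np.concatenate([np.zeros((batch_size, 1)), np.ones((batch_size, 1))])
# discriminator.train_on_batch(X_fake_and_real, y_discriminator)  # with real Keras models

# Phase 2: train the generator through the frozen discriminator, every label set to 1
noise = rng.standard_normal((batch_size, codings_size))
y_generator = np.ones((batch_size, 1))
# gan.train_on_batch(noise, y_generator)  # with real Keras models

print(X_fake_and_real.shape, y_discriminator.mean(), y_generator.mean())
```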

Example implementation:

In [32]:
codings_size = 30

generator = keras.models.Sequential([
    keras.layers.Dense(100, activation="selu", input_shape=[codings_size]),
    keras.layers.Dense(150, activation="selu"),
    keras.layers.Dense(28 * 28, activation="sigmoid"),
    keras.layers.Reshape([28, 28])
])
discriminator = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(150, activation="selu"),
    keras.layers.Dense(100, activation="selu"),
    keras.layers.Dense(1, activation="sigmoid")
])
gan = keras.models.Sequential([generator, discriminator])

Compiling: since the discriminator is a binary classifier, we can naturally use the binary cross-entropy loss. The generator is only ever trained through the `gan` model, so it does not need to be compiled on its own.

The `gan` model is also a binary classifier, so it uses the binary cross-entropy loss as well. Importantly, the discriminator should not be trained during the second phase, so we make it non-trainable before compiling the `gan` model:

In [33]:
discriminator.compile(loss="binary_crossentropy", optimizer="rmsprop")
discriminator.trainable = False
gan.compile(loss="binary_crossentropy", optimizer="rmsprop")

Since the training loop is unusual, we cannot use the regular `fit()` method. Instead, we need a custom training loop, and a Dataset to iterate through the images:

In [34]:
import tensorflow as tf

fashion_mnist = keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()

X_valid, X_train = X_train_full[:5000] / 255.0, X_train_full[5000:] / 255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

In [36]:
batch_size = 32
dataset = tf.data.Dataset.from_tensor_slices(X_train).shuffle(1000)
dataset = dataset.batch(batch_size, drop_remainder=True).prefetch(1)

#### Difficulties in training GANs

According to the paper, a GAN can only reach a single Nash equilibrium: the generator produces perfectly realistic images and the discriminator is forced to guess (50% real, 50% fake). Perfect, right? Not quite: nothing guarantees that training will ever reach this equilibrium.

Enter **mode collapse**: let's say that our generator is better at fooling the discriminator with one class of images than the others. In order to win consistently, it will just gradually stop producing other classes of images. Then the discriminator will catch up, force the generator to change class, and forget about the first, in an endless cycle. 

Solutions include:

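1. _Experience replay_: store the images produced by the generator at each iteration in a replay buffer, and train the discriminator on real images plus fake images drawn from this buffer, rather than only on the fake images produced by the current generator
2. _Mini-batch discrimination_: measure how similar the images are across the batch and give this statistic to the discriminator, so it can easily reject a whole batch of fake images that lacks diversity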
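
As an illustration of experience replay, here is a minimal sketch of a replay buffer. The class `ReplayBuffer`, its methods, and the buffer size are hypothetical, and random arrays stand in for the generator's output.

```python
import numpy as np

class ReplayBuffer:
    """Hypothetical fixed-size buffer of previously generated images."""

    def __init__(self, max_size=1000, seed=0):
        self.max_size = max_size
        self.images = []
        self.rng = np.random.default_rng(seed)

    def add(self, batch):
        """Store a batch of generated images, dropping the oldest if full."""
        self.images.extend(batch)
        self.images = self.images[-self.max_size:]

    def sample(self, n):
        """Draw n stored images at random (with replacement only if n exceeds the buffer)."""
        idx = self.rng.choice(len(self.images), size=n, replace=len(self.images) < n)
        return np.stack([self.images[i] for i in idx])

buffer = ReplayBuffer(max_size=100)
rng = np.random.default_rng(1)
for step in range(5):
    generated = rng.random((32, 28, 28))          # stand-in for the generator's output
    buffer.add(generated)
    fakes_for_discriminator = buffer.sample(32)   # mix of old and recent fakes

print(len(buffer.images), fakes_for_discriminator.shape)
```

Because the discriminator keeps seeing older fakes, it is less likely to overfit to the generator's latest outputs and forget how to reject earlier ones.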

#### Deep Convolutional GANs

Main guidelines for building stable convolutional GANs:

* Replace any pooling layers with strided convolutions (in the discriminator) and transposed convolutions (in the generator)
* Use Batch Normalization in both the generator and the discriminator, except in the generator’s output layer and the discriminator’s input layer
* Remove fully connected hidden layers for deeper architectures
* Use ReLU activation in the generator for all layers except the output layer, which should use tanh
* Use leaky ReLU activation in the discriminator for all layers
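
One practical consequence of the tanh output layer is that generated pixels range from -1 to 1, so real images scaled to [0, 1] need to be rescaled to the same range before being shown to the discriminator. The sketch below shows this rescaling, together with a NumPy version of leaky ReLU. The helper names and the slope `alpha=0.2` are assumptions for illustration.

```python
import numpy as np

def to_tanh_range(images):
    """Rescale pixel values from [0, 1] to [-1, 1] to match a tanh output layer."""
    return images * 2.0 - 1.0

def from_tanh_range(images):
    """Map generator outputs from [-1, 1] back to [0, 1] for plotting."""
    return (images + 1.0) / 2.0

def leaky_relu(z, alpha=0.2):
    """Leaky ReLU: identity for positive inputs, small slope alpha otherwise."""
    return np.where(z > 0, z, alpha * z)

rng = np.random.default_rng(0)
X = rng.random((16, 28, 28))                 # stand-in for images scaled to [0, 1]
X_dcgan = to_tanh_range(X)[..., np.newaxis]  # add a channel axis: (16, 28, 28, 1)

activations = leaky_relu(np.array([-1.0, 0.0, 2.0]))
print(X_dcgan.shape, X_dcgan.min(), X_dcgan.max(), activations)
```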