In [2]:
'''
== Background ==

Introduced by Ian Goodfellow et al. at the University of Montreal (2014).
Yann LeCun called them the most interesting idea in machine learning in the last 10 years.

== Intuition ==

GAN == 2 networks in a forger-police relationship
G == forger == generator
D == police == discriminator

D learns a conditional distribution P(Y|X), i.e. a function y = f(x) that maps inputs to outputs.
G learns a JOINT distribution of inputs and outputs: P(X, Y). Bayes' rule can convert it to P(Y|X),
but it can also be used to create new samples (x, y).

G generates images to try to trick D into believing they're real (from the training set).
D looks at an image and estimates whether it's fake or real. At the start, G's images
are essentially random noise.

If the training set is large enough, D (police) cannot just memorize the training data.
Regularization strategies also help with this. So it must learn a generalizing
function: rules that govern the look of the images, while G learns to create
new images that look like ones from the training set in order to trick D.

D & G play a game: show D a mixed batch of real images from training and fake
images from G. Optimize D to say NO to fake images and YES to real images.
Optimize G to fool D into believing the fakes are real.

In mathematical terms: minimize the classification error w/r/t D and maximize
it w/r/t G. Put another way, G minimizes an f-divergence (a measure of the difference
between two probability distributions) between the real and generated data.

Converges when D is no longer able to tell real from fake images.


== Laplacian Pyramid Technique ==

Train on images of increasing size. Start off with normal sized images:
64x64. Convert them to 8x8. Start training on these, then in each step
increase their size: 8x8 --> 16x16 --> 32x32 --> 64x64.

In each step, another G and D are trained, and they're trained to learn good
refinements of the upscaled, blurry images. D gets fed refined/sharpened images
and needs to tell if they're real (blurry from training set w/ refinements) or
fake (blurry but refinement was done by G).

So G ends up learning how to generate good refinements.
D learns what a good refined image looks like.

== Architecture ==

G == full Laplacian pyramid in one network. Its layers scale the images up.
D == convolutional network with multiple branches. Different branches are
supposed to focus on different regions of the image; the last branch analyzes the image in its totality.


== Tricks ==

To avoid producing gibberish noise, make sure neither G nor D becomes too strong compared to the other.
Don't let D win and classify all images correctly: G won't be able to learn from it.
If G wins, it is usually by exploiting a meaningless weakness in D (ex: coloring the entire image blue).

Keep track of:
1. How good G is at fooling D
2. How good D is at classifying fake as fakes (TN)
3. How good D is at classifying real as real (TP)

If one of the networks is too good, skip updating its params:
# margin: user-defined margin
# err_F / err_R: D's error on fake / real images

if err_F < margin or err_R < margin:
    D.optimize = False

if err_F > 1 - margin or err_R > 1 - margin:
    G.optimize = False

if not G.optimize and not D.optimize:
    G.optimize = True
    D.optimize = True


Another trick: if G is performing poorly, try to regularize (penalize) D.
Increase D's L2 penalty if G's error isn't within a certain target range.
If G fools D 50% of the time, err would be -log(0.5) = 0.69. Set the target
range to [0.9, 1.2] so that D is better than G but not by too much.

The GAN generator takes uniform(-1, 1) noise as its input, so make sure this
is what's being input when creating test images (after training).

Other tricks:
- Batch normalization in G but NOT in D (that will make D too powerful)*
- Dropout (especially in D)**
- Decrease # of features of D such that G has more params
- In first few epochs it's normal if it generates nonsense, but probably 
  bad news if it generates equal or nearly equal images.


== Resources ==

- Cat generator in lua torch: https://github.com/aleju/cat-generator
- Face generator in lua torch: http://torch.ch/blog/2015/11/13/gan.html
    - Includes tricks & small explanation about variational autoencoders
- Code from: http://blog.aylien.com/introduction-generative-adversarial-networks-code-tensorflow/
- http://blog.evjang.com/2016/06/generative-adversarial-nets-in.html
- "REAL" code used for sigilizer: https://github.com/carpedm20/DCGAN-tensorflow


== Wasserstein GANs ==

Paper: https://arxiv.org/abs/1701.07875

GANs are notoriously unstable. WGANs attempt to improve upon stability. 

A normal GAN applies a nonlinearity (sigmoid/softplus) to the output of the last matrix
multiplication when computing the loss. A WGAN takes that output directly and restricts the
range of D's weights to enforce Lipschitz continuity: a strong form of uniform continuity
that limits how fast the function can change (for every pair of points on the graph, the
magnitude of the slope cannot exceed a certain constant). So they regularize/squeeze the
weights. With D constrained this way, the loss is continuous w/r/t the parameters and
differentiable almost everywhere, which makes the metric more interpretable. It also means
D can be trained strongly before each update to the generator, to get more reliable
gradients (with regular GANs, a strong D means vanishing gradients for G).

Basically, it seems to allow you to make the discriminator stronger without imperiling the generator's loss.

Changes:
- D (the "critic") doesn't produce sigmoid/probabilistic output
    - instead, loss in D == mean(D(fake)) - mean(D(real)) -- just the diff between D's scores on real/fake images
- train D multiple times for each G update
- clamp weights in D to a small range around 0 (e.g. [-0.01, 0.01])
- low learning rate
- optimizers should not use momentum (the paper uses RMSProp)
'''
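
The WGAN changes listed in the notes above can be sketched in plain NumPy. This is only an illustration under stated assumptions, not the paper's implementation: the linear critic and the names `critic_scores`, `critic_loss` and `clip_weights` are hypothetical, and the clip value 0.01 is just an example of "a small range around 0".

```python
import numpy as np

def critic_scores(x, w, b):
    # Hypothetical linear critic: returns a raw score, no sigmoid on the output.
    return x @ w + b

def critic_loss(real_scores, fake_scores):
    # The critic wants high scores on real samples and low scores on fakes,
    # so it minimizes mean(fake) - mean(real).
    return np.mean(fake_scores) - np.mean(real_scores)

def clip_weights(w, c=0.01):
    # Clamp every weight into [-c, c] to bound how fast the critic can change.
    return np.clip(w, -c, c)

rng = np.random.RandomState(0)
w = rng.normal(size=(1, 1))
b = np.zeros(1)
real = rng.normal(4.0, 0.5, size=(8, 1))   # stand-in for real data
fake = rng.normal(0.0, 1.0, size=(8, 1))   # stand-in for generated data

w = clip_weights(w)
loss = critic_loss(critic_scores(real, w, b), critic_scores(fake, w, b))
```

In a training loop, the clip would be applied after every critic update, and the critic would be updated several times per generator update.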

In [3]:
import numpy as np
import tensorflow as tf

In [4]:
seed = 42
np.random.seed(seed)
tf.set_random_seed(seed)

In [5]:
# We'll learn to approximate a Gaussian distribution

# This is a real Gaussian with mean 4 and standard deviation 0.5
# (np.random.normal takes the standard deviation, so the variance is 0.25).
# This is the training data.
class DataDistribution(object):
    def __init__(self):
        self.mu = 4
        self.sigma = 0.5

    def sample(self, N):
        samples = np.random.normal(self.mu, self.sigma, N)
        samples.sort()
        return samples

# Noise input for the generator: evenly spaced points in [-range, range]
# with a small random jitter added to each
class GeneratorDistribution(object):
    def __init__(self, range):
        self.range = range

    def sample(self, N):
        return np.linspace(-self.range, self.range, N) + \
            np.random.random(N) * 0.01

In [6]:
def linear(input, output_dim, scope='linear', stddev=1.0):
    # variable_scope returns a context manager that namespaces the
    # variables created inside it; get_variable creates a variable in
    # that scope, or reuses the existing one when the scope has reuse=True
    with tf.variable_scope(scope):
        # weights -- initialized from a random normal
        w = tf.get_variable(
            'w',
            [input.get_shape()[1], output_dim],
            initializer=tf.random_normal_initializer(stddev=stddev)
        )
        # bias -- initialized to zero
        b = tf.get_variable(
            'b',
            [output_dim],
            initializer=tf.constant_initializer(0.0)
        )
        # affine transformation -- input * w + b
        return tf.matmul(input, w) + b

In [7]:
# Generator G is a linear transformation passed through a
# nonlinearity (softplus), followed by a second linear
# transformation. For an image, this would be multiplying the image
# by weights, adding a bias, applying the nonlinear fn, then
# applying another set of weights and bias.
# softplus is simply: log(exp(x) + 1) == log(e^x + 1)
def generator(input, h_dim):
    # hidden layer pre-activation -- x * w + b
    pred = linear(input, h_dim, 'g0')
    # hidden layer activation
    h0 = tf.nn.softplus(pred)
    # output layer -- softplus(x * w + b) * w1 + b1
    h1 = linear(h0, 1, 'g1')
    return h1
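
For reference, softplus itself can be checked outside TensorFlow. The sketch below is a NumPy stand-in for `tf.nn.softplus` (the helper name `softplus` is ours); it uses `np.logaddexp(0, x)`, which equals log(exp(0) + exp(x)) = log(1 + e^x) but avoids overflow for large inputs.

```python
import numpy as np

def softplus(x):
    # log(exp(x) + 1), written as logaddexp(0, x) so large x does not overflow
    return np.logaddexp(0.0, x)

# softplus is a smooth version of relu: near 0 for very negative inputs,
# close to x for large positive inputs
values = softplus(np.array([-50.0, 0.0, 50.0]))
```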

In [8]:
# D has more hidden layers than G because it needs to be more
# powerful: for this type of problem, a weaker D is bad at
# discriminating between real and fake. Uses tanh and sigmoid.
def discriminator_nobatch(input, hidden_size):
    h0 = tf.tanh(linear(input, hidden_size * 2, 'd0'))
    h1 = tf.tanh(linear(h0, hidden_size * 2, 'd1'))
    h2 = tf.tanh(linear(h1, hidden_size * 2, 'd2'))
    h3 = tf.sigmoid(linear(h2, 1, 'd3'))
    return h3

def discriminator(input, h_dim, minibatch_layer=True):
    # relu is a type of activation function: https://www.tensorflow.org/api_guides/python/nn#Activation_Functions
    # these are all nonlinearities, like sigmoid; the final sigmoid below makes the classification decision
    h0 = tf.nn.relu(linear(input, h_dim * 2, 'd0'))
    h1 = tf.nn.relu(linear(h0, h_dim * 2, 'd1'))

    # without the minibatch layer, the discriminator needs an additional layer
    # to have enough capacity to separate the two distributions correctly
    if minibatch_layer:
        # use minibatch on an intermediate layer
        h2 = minibatch(h1)
    else:
        h2 = tf.nn.relu(linear(h1, h_dim * 2, scope='d2'))

    # Final decision
    h3 = tf.sigmoid(linear(h2, 1, scope='d3'))
    return h3

In [9]:
# One of the main failure modes of GANs is their tendency to collapse to a small
# range of points. This can be circumvented by allowing D to look at multiple points
# at once: minibatch discrimination == any method where D looks at an entire batch
# of samples to determine if they are real or fake. One type of algorithm that does
# this looks at a sample and then that sample's distance from all other samples in the
# batch. Distance measures are then combined with the input (as an extra feature) and
# D can use this information while making a decision.
# more info here: http://blog.aylien.com/introduction-generative-adversarial-networks-code-tensorflow/
def minibatch(input, num_kernels=5, kernel_dim=3):
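    # input is an intermediate layer. Project it to num_kernels * kernel_dim
    # features, then reshape to (batch, num_kernels, kernel_dim)
    x = linear(input, num_kernels * kernel_dim, scope='minibatch', stddev=0.02)
    activation = tf.reshape(x, (-1, num_kernels, kernel_dim))
    # pairwise differences between samples: (batch, num_kernels, kernel_dim, batch)
    diffs = tf.expand_dims(activation, 3) - \
        tf.expand_dims(tf.transpose(activation, [1, 2, 0]), 0)
    # L1 distance per kernel between each pair of samples: (batch, num_kernels, batch)
    abs_diffs = tf.reduce_sum(tf.abs(diffs), 2)
    # sum of exp(-distance) over the batch: (batch, num_kernels)
    minibatch_features = tf.reduce_sum(tf.exp(-abs_diffs), 2)
    # append the batch-level features to each sample's input features
    return tf.concat([input, minibatch_features], 1)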
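
To make the tensor shapes in `minibatch` easier to follow, here is a NumPy sketch of the same distance computation, leaving out the learned projection. The function name `minibatch_features` and the toy sizes are assumptions for illustration only.

```python
import numpy as np

def minibatch_features(activation):
    # activation: (batch, num_kernels, kernel_dim)
    # Differences between every pair of samples:
    # (batch, num_kernels, kernel_dim, batch)
    diffs = activation[:, :, :, None] - activation.transpose(1, 2, 0)[None, :, :, :]
    # L1 distance per kernel between sample i and sample j:
    # (batch, num_kernels, batch)
    l1 = np.abs(diffs).sum(axis=2)
    # Sum the similarity exp(-distance) over all samples j: (batch, num_kernels)
    return np.exp(-l1).sum(axis=2)

rng = np.random.RandomState(0)
batch, num_kernels, kernel_dim = 4, 5, 3

# A spread-out batch versus a fully collapsed batch (all samples identical)
feats = minibatch_features(rng.normal(size=(batch, num_kernels, kernel_dim)))
collapsed = minibatch_features(np.ones((batch, num_kernels, kernel_dim)))
```

Each feature always includes the sample's similarity to itself (exp(0) = 1), so it lies between 1 and the batch size. A collapsed batch hits the maximum everywhere, which gives D a direct signal that the samples are too similar.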

In [10]:
# Use Adam optimizer
def optimizer(loss, var_list):
    learning_rate = 0.001
    step = tf.Variable(0, trainable=False)
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(
        loss,
        global_step=step,
        var_list=var_list
    )
    return optimizer

In [11]:
# Sometimes discriminator outputs (or 1 - output) can reach values close to
# (or even slightly less than) zero due to numerical rounding.
# This clamps them to a small positive minimum so that we don't
# end up with NaNs during optimization.
def log(x):
    return tf.log(tf.maximum(x, 1e-5))

In [13]:
# Define the GAN!
class GAN(object):
    def __init__(self, params):
        # This defines the generator network - it takes samples from a noise
        # distribution as input, and passes them through an MLP.
        with tf.variable_scope('G'):
            self.z = tf.placeholder(tf.float32, shape=(params.batch_size, 1))
            self.G = generator(self.z, params.hidden_size)

        # The discriminator tries to tell the difference between samples from
        # the true data distribution (self.x) and the generated samples
        # (self.G).
        #
        # Here we create two copies of the discriminator network
        # that share parameters, as you cannot use the same network with
        # different inputs in TensorFlow. One copy of D processes x and
        # the other processes the generated samples G(z).

        # Input of D1 is a batch of real data. Input of D2
        # is a batch of fake data. So when optimizing D,
        # we want to maximize D1 and minimize D2.
        
        self.x = tf.placeholder(tf.float32, shape=(params.batch_size, 1))
        with tf.variable_scope('D'):
            self.D1 = discriminator(
                self.x,
                params.hidden_size,
                params.minibatch
            )
        with tf.variable_scope('D', reuse=True):
            self.D2 = discriminator(
                self.G,
                params.hidden_size,
                params.minibatch
            )

        # Define the loss for discriminator and generator networks
        # (see the original paper for details), and create optimizers for both
        # For discriminator:
            # min(-log(D1)) ==> maximize D1
            # min(-log(1 - D2)) ==> minimize D2
        # For generator:
            # min(-log(D2)) ==> maximize D2
        self.loss_d = tf.reduce_mean(-log(self.D1) - log(1 - self.D2))
        self.loss_g = tf.reduce_mean(-log(self.D2))

        vars = tf.trainable_variables()
        self.d_params = [v for v in vars if v.name.startswith('D/')]
        self.g_params = [v for v in vars if v.name.startswith('G/')]

        self.opt_d = optimizer(self.loss_d, self.d_params)
        self.opt_g = optimizer(self.loss_g, self.g_params)
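
As a sanity check on the two loss expressions, the sketch below evaluates them in NumPy for fixed discriminator outputs. The helper names `safe_log` and `gan_losses` are ours; `safe_log` mirrors the clamp in the `log` helper defined earlier.

```python
import numpy as np

def safe_log(x):
    # Same clamp as the notebook's log() helper, in NumPy
    return np.log(np.maximum(x, 1e-5))

def gan_losses(d_real, d_fake):
    # d_real plays the role of D1, d_fake the role of D2
    loss_d = np.mean(-safe_log(d_real) - safe_log(1.0 - d_fake))
    loss_g = np.mean(-safe_log(d_fake))
    return loss_d, loss_g

# D cannot tell real from fake: it outputs 0.5 everywhere
loss_d, loss_g = gan_losses(np.full(8, 0.5), np.full(8, 0.5))

# D is perfectly confident and correct: the clamp keeps loss_g finite
perfect_d, perfect_g = gan_losses(np.ones(8), np.zeros(8))
```

When D outputs 0.5 everywhere, `loss_d` is 2 * ln(2) (about 1.386) and `loss_g` is ln(2) (about 0.693), so logged values near those numbers suggest the two networks are roughly balanced.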

In [14]:
def train(model, data, gen, params):
    anim_frames = []

    with tf.Session() as session:
        tf.local_variables_initializer().run()
        tf.global_variables_initializer().run()

        for step in range(params.num_steps + 1):
            ## Update discriminator

            # Get real data
            x = data.sample(params.batch_size)
            # Get noise to feed the generator
            z = gen.sample(params.batch_size)
            # Run one optimizer step on D and fetch its loss
            loss_d, _ = session.run([model.loss_d, model.opt_d], {
                model.x: np.reshape(x, (params.batch_size, 1)),
                model.z: np.reshape(z, (params.batch_size, 1))
            })

            ## Update generator

            # Draw a fresh noise batch for the G step (reusing z from above would also work)
            z = gen.sample(params.batch_size)
            # Run one optimizer step on G and fetch its loss
            loss_g, _ = session.run([model.loss_g, model.opt_g], {
                model.z: np.reshape(z, (params.batch_size, 1))
            })

            if step % params.log_every == 0:
                print('{}: {:.4f}\t{:.4f}'.format(step, loss_d, loss_g))

            if params.anim_path and (step % params.anim_every == 0):
                anim_frames.append(
                    samples(model, session, data, gen.range, params.batch_size)
                )

        if params.anim_path:
            save_animation(anim_frames, params.anim_path, gen.range)
        else:
            samps = samples(model, session, data, gen.range, params.batch_size)
            plot_distributions(samps, gen.range)
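
`train` and `GAN` read their settings from a `params` object that is not defined in the cells shown here. One way to build it is sketched below: the attribute names are taken from the code above, but every value is a hypothetical placeholder, not the author's setting.

```python
from argparse import Namespace

# Hypothetical values; only the attribute names come from the notebook code.
params = Namespace(
    num_steps=5000,     # number of training iterations
    hidden_size=4,      # h_dim passed to generator / discriminator
    batch_size=8,       # samples per update
    minibatch=False,    # whether D uses the minibatch layer
    log_every=10,       # print losses every N steps
    anim_path=None,     # set to a file path to save an animation instead of plotting
    anim_every=1,       # record an animation frame every N steps
)

# With the notebook's definitions loaded, training would be started roughly as:
# model = GAN(params)
# train(model, DataDistribution(), GeneratorDistribution(range=8), params)
```

Note that `train` also calls `samples`, `save_animation` and `plot_distributions`, so those need to be defined before it runs.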

In [15]:

def samples(model, session, data, sample_range, batch_size, num_points=10000, num_bins=100):
    '''
    Return a tuple (db, pd, pg), where db is the current decision
    boundary, pd is a histogram of samples from the data distribution,
    and pg is a histogram of generated samples.
    '''
    xs = np.linspace(-sample_range, sample_range, num_points)
    bins = np.linspace(-sample_range, sample_range, num_bins)

    # decision boundary
    db = np.zeros((num_points, 1))
    for i in range(num_points // batch_size):
        db[batch_size * i:batch_size * (i + 1)] = session.run(
            model.D1,
            {
                model.x: np.reshape(
                    xs[batch_size * i:batch_size * (i + 1)],
                    (batch_size, 1)
                )
            }
        )

    # data distribution
    d = data.sample(num_points)
    pd, _ = np.histogram(d, bins=bins, density=True)

    # generated samples
    zs = np.linspace(-sample_range, sample_range, num_points)
    g = np.zeros((num_points, 1))
    for i in range(num_points // batch_size):
        g[batch_size * i:batch_size * (i + 1)] = session.run(
            model.G,
            {
                model.z: np.reshape(
                    zs[batch_size * i:batch_size * (i + 1)],
                    (batch_size, 1)
                )
            }
        )
    pg, _ = np.histogram(g, bins=bins, density=True)

    return db, pd, pg
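
The histograms returned by `samples` use `density=True`, which is worth a quick check in isolation. The sketch below uses made-up data with the same mean and standard deviation as `DataDistribution`; it shows that `num_bins` edges produce `num_bins - 1` bins and that the resulting densities integrate to 1 over the binned range.

```python
import numpy as np

rng = np.random.RandomState(42)
sample_range, num_bins = 8, 100

# num_bins evenly spaced EDGES, so the histogram has num_bins - 1 bins
bins = np.linspace(-sample_range, sample_range, num_bins)
d = rng.normal(4.0, 0.5, 10000)

pd, edges = np.histogram(d, bins=bins, density=True)

# density * bin width summed over all bins gives the total probability mass
area = np.sum(pd * np.diff(edges))
```

Because both `pd` and `pg` are normalized this way, they can be plotted on the same axis and compared directly, regardless of how many points were sampled.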