<a href="https://colab.research.google.com/github/kashyapsanket/Variational-AutoEncoders-meets-NeuralNets/blob/master/VAE_and_FFNN_comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>




#  Variational Autoencoders meet MNIST

---

*Sanket Kashyap*

In [0]:
import numpy as np

import matplotlib.pyplot as plt
from matplotlib import cm

import keras
from keras import backend as K
from keras import layers
from keras.datasets import mnist
from keras.models import Model, Sequential
from keras.utils import to_categorical
from keras.optimizers import RMSprop
from keras.layers import Dense

from sklearn.model_selection import train_test_split



**Data Input and Partition**

---
The dataset used here is the MNIST handwritten digits dataset that ships with the Keras library. The images are 28×28 grayscale images with class labels 0 to 9.

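The next cell reads the input data, merges the original train and test sets (70,000 images in total), and partitions each class into three sets:

1.   X_70, Y_70 : approximately 70% of the images of each class
2.   X_20, Y_20 : approximately 20% of the images of each class
3.   X_10, Y_10 : approximately 10% of the images of each class

The pixel values are scaled to [0, 1]. The first and third partitions are also flattened to 784-dimensional vectors for training a simple feed-forward neural network (FFNN).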
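
The per-class partitioning described above can be sketched without scikit-learn. This is a minimal illustration, not the notebook's code: the helper `stratified_three_way_split` and the toy label array are assumptions introduced here, and the sketch shuffles indices with NumPy instead of calling `train_test_split` twice per class.

```python
import numpy as np

def stratified_three_way_split(labels, fractions=(0.7, 0.2, 0.1), seed=0):
    """Return three index arrays that split every class by `fractions`."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    parts = ([], [], [])
    for cls in np.unique(labels):
        # Shuffle the indices of this class, then cut them into three pieces.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_first = int(round(fractions[0] * len(idx)))
        n_second = int(round(fractions[1] * len(idx)))
        parts[0].append(idx[:n_first])
        parts[1].append(idx[n_first:n_first + n_second])
        parts[2].append(idx[n_first + n_second:])  # remainder goes to the last part
    return tuple(np.concatenate(p) for p in parts)

# Toy labels: 10 classes with 100 samples each (a stand-in for the MNIST labels).
toy_labels = np.repeat(np.arange(10), 100)
idx_70, idx_20, idx_10 = stratified_three_way_split(toy_labels)
print(len(idx_70), len(idx_20), len(idx_10))
```

Splitting class by class keeps the class proportions the same in all three partitions, which is the property the notebook's loop over `data_dict` is after.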



In [0]:
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

Y = list(Y_train) + list(Y_test)
print(len(Y))

X = np.vstack((X_train, X_test))
print(len(X))

data_dict = {}
for i in range(len(Y)):
  data_dict[Y[i]] = []

for i in range(len(X)):
  data_dict[Y[i]].append(X[i])

X_70 = []
X_20 = []
X_10 = []

Y_70 = []
Y_20 = []
Y_10 = []

for key, value in data_dict.items():
  x = value
  y = [key]*len(value)
  x_70, x_30, y_70, y_30 = train_test_split(x, y, test_size=0.3)
  x_20, x_10, y_20, y_10 = train_test_split(x_30, y_30, test_size =0.34)
  X_70 += x_70
  Y_70 += y_70
  X_20 += x_20
  Y_20 += y_20
  X_10 += x_10
  Y_10 += y_10

  
X_70 = np.array(X_70)
X_20 = np.array(X_20)
X_10 = np.array(X_10)

X_70 = X_70.astype('float32') / 255.
X_70 = X_70.reshape(X_70.shape + (1,))

X_20 = X_20.astype('float32') / 255.
X_20 = X_20.reshape(X_20.shape + (1,))

X_10 = X_10.astype('float32') / 255.
X_10 = X_10.reshape(X_10.shape + (1,))

Y_70_ohe = keras.utils.to_categorical(Y_70, 10)
Y_20_ohe = keras.utils.to_categorical(Y_20, 10)
Y_10_ohe = keras.utils.to_categorical(Y_10, 10)

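# Flatten each 28x28x1 image to a 784-vector; use the actual partition
# size instead of a hard-coded sample count.
X_70_flat = X_70.reshape(len(X_70), 784)
X_10_flat = X_10.reshape(len(X_10), 784)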


70000
70000


In [0]:
image_shape = (28, 28, 1)
latent_dim = 10
batch_size = 128

**Part 1 : Building blocks of the VAE**

---

The following cell creates all the required components of the VAE and the helper functions.

*   Encoder
*   Decoder
*   Sampler
*   Loss Function (as discussed in slides)
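
To make the sampler and the regularization term concrete, here is a small NumPy sketch of the reparameterization trick and of the closed-form KL divergence between a diagonal Gaussian and the standard normal prior. The function names `sample_latent` and `kl_to_standard_normal` are illustrative assumptions; the notebook's own versions are written with the Keras backend.

```python
import numpy as np

def sample_latent(z_mean, z_log_var, rng):
    """Draw z = mean + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(z_mean.shape)
    return z_mean + np.exp(0.5 * z_log_var) * eps

def kl_to_standard_normal(z_mean, z_log_var):
    """Per-sample KL( N(mean, diag(exp(log_var))) || N(0, I) )."""
    return -0.5 * np.sum(1.0 + z_log_var - z_mean**2 - np.exp(z_log_var), axis=-1)

rng = np.random.default_rng(0)
z_mean = np.zeros((4, 10))     # batch of 4, latent_dim = 10
z_log_var = np.zeros((4, 10))  # log-variance 0 means unit variance

z = sample_latent(z_mean, z_log_var, rng)
print(z.shape)
# The KL term vanishes here because the posterior equals the prior.
print(kl_to_standard_normal(z_mean, z_log_var))
```

Sampling through `mean + sigma * eps` keeps the randomness in `eps`, so gradients can flow through the mean and the log-variance during training.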



In [0]:
def create_encoder():
    encoder_input = layers.Input(shape=image_shape)
    
    x = layers.Conv2D(32, 3, padding='same', activation='relu')(encoder_input)
    # The stride-2 convolution halves the spatial size: 28x28 -> 14x14
    x = layers.Conv2D(64, 3, padding='same', activation='relu', strides=(2, 2))(x)
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
    x = layers.Flatten()(x)
    x = layers.Dense(32, activation='relu')(x)

    # Two heads: mean and log-variance of the approximate posterior
    z_mean = layers.Dense(latent_dim)(x)
    z_log_var = layers.Dense(latent_dim)(x)

    return Model(encoder_input, [z_mean, z_log_var], name='encoder')

def create_decoder():
    decoder_input = layers.Input(shape=(latent_dim,))
    
    # 12544 = 14 * 14 * 64, so the dense output can be reshaped to a 14x14x64 feature map
    x = layers.Dense(12544, activation='relu')(decoder_input)
    x = layers.Reshape((14, 14, 64))(x)
    x = layers.Conv2DTranspose(32, 3, padding='same', activation='relu', strides=(2, 2))(x)
    x = layers.Conv2D(1, 3, padding='same', activation='sigmoid')(x)
    
    return Model(decoder_input, x, name='decoder')
  
def sample(args):
    z_mean, z_log_var = args
    z_sigma = K.sqrt(K.exp(z_log_var))
    epsilon = K.random_normal(shape=K.shape(z_mean), mean=0., stddev=1.)
    return z_mean + z_sigma * epsilon

def create_sampler():
    return layers.Lambda(sample, name='sampler')

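def neg_variational_lower_bound(x, t_decoded):
    # z_mu and z_log_sigma are not arguments: they are the encoder output
    # tensors defined globally in the next cell. Despite its name,
    # z_log_sigma holds the log-variance (the encoder's z_log_var output).

    # Reconstruction loss
    rc_loss = K.sum(K.binary_crossentropy(
        K.batch_flatten(x), 
        K.batch_flatten(t_decoded)), axis=-1)

    # Regularization term (KL divergence)
    kl_loss = -0.5 * K.sum(1 + z_log_sigma \
                             - K.square(z_mu) \
                             - K.exp(z_log_sigma), axis=-1)
    
    # Average over mini-batch
    return K.mean(rc_loss + kl_loss)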
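
For reference, the per-image quantity that `neg_variational_lower_bound` computes can be written as below. The notation is introduced here for illustration: $x_i$ are the 784 flattened input pixels, $\hat{x}_i$ the corresponding decoder outputs, $d$ is `latent_dim`, and $\mu_j$, $\log\sigma_j^2$ are the two encoder outputs (`z_mu` and `z_log_sigma` in the code, the latter holding the log-variance).

$$
\mathcal{L}(x) = -\sum_{i=1}^{784}\Big[x_i \log \hat{x}_i + (1 - x_i)\log(1 - \hat{x}_i)\Big] \;-\; \frac{1}{2}\sum_{j=1}^{d}\Big(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\Big)
$$

The first sum is the reconstruction loss (binary cross-entropy summed over pixels) and the second is the KL divergence between the approximate posterior and the standard normal prior. The value reported during training is the mean of $\mathcal{L}(x)$ over the mini-batch.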


The next cell wires the encoder, sampler and decoder into an end-to-end model. The VAE is then trained for 35 epochs on Partition 1 (X_70), with the images serving as both input and target.

In [0]:
enc = create_encoder()
dec = create_decoder()
sampler = create_sampler()

x = layers.Input(shape=image_shape)
z_mu, z_log_sigma = enc(x)
z = sampler([z_mu, z_log_sigma])
z_decoded = dec(z)

vae = Model(x, z_decoded, name='vae')

vae.compile(optimizer='rmsprop', loss=neg_variational_lower_bound)
vae.fit(x=X_70, 
         y=X_70,
         epochs=35,
         shuffle=True,
         batch_size=batch_size,
         verbose=2)



Epoch 1/35
 - 12s - loss: 957.2691
Epoch 2/35
 - 11s - loss: 145.0232
Epoch 3/35
 - 11s - loss: 134.5560
Epoch 4/35
 - 11s - loss: 127.3346
Epoch 5/35
 - 11s - loss: 122.8461
Epoch 6/35
 - 11s - loss: 119.1327
Epoch 7/35
 - 11s - loss: 115.8267
Epoch 8/35
 - 11s - loss: 113.6010
Epoch 9/35
 - 11s - loss: 112.0785
Epoch 10/35
 - 11s - loss: 110.9181
Epoch 11/35
 - 11s - loss: 109.9425
Epoch 12/35
 - 11s - loss: 109.1430
Epoch 13/35
 - 11s - loss: 108.4722
Epoch 14/35
 - 11s - loss: 107.8045
Epoch 15/35
 - 11s - loss: 107.3070
Epoch 16/35
 - 11s - loss: 106.7499
Epoch 17/35
 - 11s - loss: 106.2563
Epoch 18/35
 - 11s - loss: 105.9177
Epoch 19/35
 - 11s - loss: 105.4952
Epoch 20/35
 - 11s - loss: 105.1228
Epoch 21/35
 - 11s - loss: 104.8265
Epoch 22/35
 - 11s - loss: 104.4963
Epoch 23/35
 - 11s - loss: 104.2160
Epoch 24/35
 - 11s - loss: 103.9358
Epoch 25/35
 - 11s - loss: 103.7139
Epoch 26/35
 - 11s - loss: 103.4659
Epoch 27/35
 - 11s - loss: 103.2755
Epoch 28/35
 - 11s - loss: 103.0636
E

<keras.callbacks.History at 0x7fc2e0e63630>

**Part 2 : FFNN on the latent representation**
We first obtain the encoder's predictions (the latent means, i.e. the first encoder output) for Partitions 2 and 3 and use them to train and validate an FFNN, with the number of neurons in the hidden layer as the hyperparameter to vary. The input size of this network equals the latent_dim defined in an earlier cell.

In [0]:
X_20_latent = enc.predict(X_20)[0]
X_10_latent = enc.predict(X_10)[0]

X_20_latent = np.array(X_20_latent)
X_10_latent = np.array(X_10_latent)

print(X_20_latent.shape)


(13859, 10)


In [0]:
model = Sequential()
model.add(Dense(200, activation='relu', input_shape=(latent_dim,)))
model.add(Dense(10, activation='softmax'))
model.summary()

model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])

history = model.fit(X_20_latent, Y_20_ohe,
                    batch_size=256,
                    epochs=50,
                    verbose=1,
                    validation_data=(X_10_latent, Y_10_ohe))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_31 (Dense)             (None, 200)               2200      
_________________________________________________________________
dense_32 (Dense)             (None, 10)                2010      
Total params: 4,210
Trainable params: 4,210
Non-trainable params: 0
_________________________________________________________________
Train on 13859 samples, validate on 7145 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoc

**Part 3 : FFNN on raw pixels**
We create a simple FFNN, train it on Partition 1 and validate it on Partition 3. The hyperparameter to vary here is the number of neurons in the hidden layer.

In [0]:
model2 = Sequential()
model2.add(Dense(200, activation='relu', input_shape=(784,)))
model2.add(Dense(10, activation='softmax'))
model2.summary()

model2.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])

history = model2.fit(X_70_flat, Y_70_ohe,
                    batch_size=256,
                    epochs=30,
                    verbose=1,
                    validation_data=(X_10_flat, Y_10_ohe))


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_29 (Dense)             (None, 200)               157000    
_________________________________________________________________
dense_30 (Dense)             (None, 10)                2010      
Total params: 159,010
Trainable params: 159,010
Non-trainable params: 0
_________________________________________________________________
Train on 48996 samples, validate on 7145 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


**Part 4 : Comparison between methods used in Part 2 and Part 3**
I used only single-hidden-layer FFNNs for Part 3 and analysed the results after changing the number of neurons in the hidden layer. For state-of-the-art performance on the MNIST dataset, deep CNNs can be used, which reach accuracies upwards of 99.5%.


*   20 neurons (15,910 parameters) - 95.3% accuracy
*   100 neurons (79,510 parameters) - 97.4% accuracy
*   200 neurons (159,010 parameters) - 97.7% accuracy

For the VAE I tried four different latent dimensions (2, 8, 10 and 16) with several hidden-layer sizes:

*   2 dimensions, 100 neurons, ~1000 parameters - 75.2% accuracy 
*   8 dimensions, 100 neurons, ~2000 parameters - 95.6% accuracy 
*   8 dimensions, 200 neurons, ~4000 parameters - 96.2% accuracy 
*   16 dimensions, 400 neurons, ~10000 parameters - 93.95% accuracy
*   16 dimensions, 100 neurons, ~2000 parameters - 95.2% accuracy 
*   10 dimensions, 1000 neurons, ~21000 parameters - 97.2% accuracy 
*   10 dimensions, 200 neurons, ~4000 parameters - 96.9% accuracy 
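
The parameter counts in the two lists follow directly from the layer sizes: a dense layer with `n_in` inputs and `n_out` units has `(n_in + 1) * n_out` weights and biases. The helper below, `ffnn_param_count`, is a hypothetical name introduced here to recompute them; for the two 200-neuron models it reproduces the totals printed by `model.summary()` earlier (159,010 and 4,210).

```python
def ffnn_param_count(n_inputs, n_hidden, n_classes=10):
    """Weights plus biases of a Dense(n_hidden) -> Dense(n_classes) network."""
    hidden = (n_inputs + 1) * n_hidden
    output = (n_hidden + 1) * n_classes
    return hidden + output

# Raw-pixel classifier from Part 3 (784 inputs).
for n_hidden in (20, 100, 200):
    print(784, n_hidden, ffnn_param_count(784, n_hidden))

# Latent-code classifier from Part 2 (latent_dim inputs).
configs = ((2, 100), (8, 100), (8, 200), (16, 400), (16, 100), (10, 1000), (10, 200))
for latent_dim, n_hidden in configs:
    print(latent_dim, n_hidden, ffnn_param_count(latent_dim, n_hidden))
```

Note that these counts cover only the classifier; the encoder that produces the latent codes has its own parameters, which are not included.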



The main reason, I believe, that the FFNNs trained on raw pixels tend to perform better is that a Gaussian latent distribution is not expressive enough to capture the full complexity of the MNIST dataset: beyond a certain point, the performance of the latent-code classifier peaks at ~97% accuracy. Using a Gaussian prior with more diverse image datasets (e.g. ImageNet, MS COCO) would likely lead to worse reconstructions.

The training method of Part 2 is very useful when labelled data is scarce. Deep CNNs and other deep networks require a lot of labelled data to perform well without overfitting. With the VAE approach, we can train a VAE on unlabelled data, use its encoder to map the labelled data to latent codes, and then train a simpler neural network on those codes with the labels to achieve good results. This form of learning is known as semi-supervised learning.
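
The two-stage idea can be illustrated on synthetic data with NumPy alone. This is only a sketch under loud assumptions: PCA stands in for the VAE encoder, a nearest-centroid rule stands in for the FFNN, and the data is three synthetic Gaussian clusters, not MNIST. What carries over is the structure: the encoder is fitted without labels, and only a handful of labelled examples are used for the classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 3 classes in 20 dimensions.
centers = rng.normal(scale=5.0, size=(3, 20))

def make_data(n_per_class):
    labels = np.repeat(np.arange(3), n_per_class)
    points = centers[labels] + rng.normal(size=(len(labels), 20))
    return points, labels

x_unlabelled, _ = make_data(200)       # labels discarded on purpose
x_labelled, y_labelled = make_data(5)  # only 15 labelled examples
x_test, y_test = make_data(100)

# Stage 1: fit the "encoder" on unlabelled data only (PCA to 2 dimensions).
mean = x_unlabelled.mean(axis=0)
_, _, vt = np.linalg.svd(x_unlabelled - mean, full_matrices=False)
components = vt[:2]

def encode(x):
    return (x - mean) @ components.T

# Stage 2: fit a small classifier on the encoded labelled data.
z_labelled = encode(x_labelled)
centroids = np.stack([z_labelled[y_labelled == c].mean(axis=0) for c in range(3)])

def predict(x):
    z = encode(x)
    distances = np.linalg.norm(z[:, None, :] - centroids[None, :, :], axis=-1)
    return distances.argmin(axis=1)

accuracy = (predict(x_test) == y_test).mean()
print(f"test accuracy: {accuracy:.3f}")
```

In the notebook the same roles are played by `enc` (trained on X_70 as part of the VAE, without labels) and `model` (trained on the latent codes of X_20 with labels).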




