# Simple Variational AutoEncoder training and inference example

**Description**: in this notebook, we showcase the training process and inference capabilities of a simple variational auto-encoder model.

## Imports, definitions and setup

The first block is needed only when the current environment doesn't have the `dlproject` package installed.
Therefore, if you already cloned the whole repository and run the `pip install -e .` command, you can skip the first block.

If you're running this notebook only on a Jupyter server, run the first block as well in order to obtain the necessary dependencies.

In [None]:
!git clone https://github.com/peiva-git/deep_learning_project.git
%cd deep_learning_project
!pip install -e .

In [2]:
import dlproject as dlp
import tensorflow as tf

import os.path

## Load the MNIST dataset

In [3]:
dataset_builder = dlp.data.MNISTDatasetBuilder()
dataset_builder.preprocess_dataset_simple_vae()
train_x, test_x = dataset_builder.train_x, dataset_builder.test_x
train_y, test_y = dataset_builder.train_y, dataset_builder.test_y

## Instantiate the model

In [4]:
vae_wrapper = dlp.models.SimpleVAE(input_dim=28 * 28, latent_dim=2)
vae_model = vae_wrapper.vae
vae_model.compile(optimizer=tf.keras.optimizers.Adam())

2023-11-06 18:18:37.063045: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-11-06 18:18:37.069650: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-11-06 18:18:37.069808: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysf

## Train the model

Train the instantiated model on the MNIST dataset.

This block also saves a backup and a checkpoint every 20 epochs, so that you can automatically resume the training if it gets interrupted.

In [None]:
if not os.path.exists(os.path.join(os.getcwd(), 'output', 'training-callback-results', f'{vae_model.name}_mnist')):
    os.makedirs(os.path.join(os.getcwd(), 'output', 'training-callback-results', f'{vae_model.name}_mnist', 'backup'))
    os.makedirs(os.path.join(os.getcwd(), 'output', 'training-callback-results', f'{vae_model.name}_mnist', 'model_checkpoints'))

model_dir_path = os.path.join(os.getcwd(), 'output', 'training-callback-results', f'{vae_model.name}_mnist')

vae_model.fit(
    train_x, train_x,
    epochs=100,
    batch_size=32,
    validation_data=(test_x, test_x),
    callbacks=[
        tf.keras.callbacks.BackupAndRestore(
            backup_dir=os.path.join(model_dir_path, 'backup'),
            save_freq=37500 # 20 * 1875, each 20 epochs
        ),
        tf.keras.callbacks.ModelCheckpoint(model_dir_path, 'model_checkpoints', save_freq=37500)
    ]
)

Epoch 1/100


2023-11-06 18:18:43.778940: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-11-06 18:18:44.098705: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f07508bf7f0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-11-06 18:18:44.098729: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce GTX 1050, Compute Capability 6.1
2023-11-06 18:18:44.102634: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-11-06 18:18:44.113523: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:442] Loaded cuDNN version 8700
2023-11-06 18:18:44.183335: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100


INFO:tensorflow:Assets written to: /home/peiva/PycharmProjects/deep_learning_project/output/training-callback-results/vae_mlp_mnist/assets


Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100


INFO:tensorflow:Assets written to: /home/peiva/PycharmProjects/deep_learning_project/output/training-callback-results/vae_mlp_mnist/assets


Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100


INFO:tensorflow:Assets written to: /home/peiva/PycharmProjects/deep_learning_project/output/training-callback-results/vae_mlp_mnist/assets


Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
Epoch 79/100
Epoch 80/100


INFO:tensorflow:Assets written to: /home/peiva/PycharmProjects/deep_learning_project/output/training-callback-results/vae_mlp_mnist/assets


Epoch 81/100
Epoch 82/100
Epoch 83/100

## Save the trained model

Save the just trained model for later use.

In [None]:
if not os.path.exists(os.path.join(os.getcwd(), 'output', 'models')):
    os.mkdir(os.path.join(os.getcwd(), 'output', 'models'))

vae_model.save(os.path.join(os.getcwd(), 'output', 'models', f'{vae_model.name}_mnist.keras'))
vae_wrapper.encoder.save(os.path.join(os.getcwd(), 'output', 'models', f'{vae_wrapper.encoder.name}_mnist.keras'))
vae_wrapper.decoder.save(os.path.join(os.getcwd(), 'output', 'models', f'{vae_wrapper.decoder.name}_mnist.keras'))

## Load the model

Instead of training the model, you can load it from a previously saved `.keras` file.

In [8]:
vae_model = tf.keras.saving.load_model(os.path.join(os.getcwd(), 'output', 'models', 'vae_mlp_mnist.keras'), safe_mode=False)
vae_encoder = tf.keras.saving.load_model(os.path.join(os.getcwd(), 'output', 'models', 'vae_encoder_mnist.keras'), safe_mode=False)
vae_decoder = tf.keras.saving.load_model(os.path.join(os.getcwd(), 'output', 'models', 'vae_decoder_mnist.keras'), safe_mode=False)

ValueError: Layer 'dense' expected 2 variables, but received 0 variables during loading. Expected: ['dense/kernel:0', 'dense/bias:0']

## Visualization

Display a scatter plot of the encoded test data.

In [22]:
dlp.data.show_mnist_scatter_plot(test_x, test_y, vae_wrapper.encoder, batch_size=32)



TypeError: list indices must be integers or slices, not tuple

<Figure size 600x600 with 0 Axes>

Display artificially generated digits.

In [23]:
dlp.data.show_latent_plane_sampling_digits(vae_wrapper.decoder)

ValueError: Exception encountered when calling layer 'vae_decoder' (type Functional).

Input 0 of layer "dense_3" is incompatible with the layer: expected min_ndim=2, found ndim=1. Full shape received: (2,)

Call arguments received by layer 'vae_decoder' (type Functional):
  • inputs=tf.Tensor(shape=(2,), dtype=float64)
  • training=None
  • mask=None