# **Full GPU-Accelerated Variational AutoEncoder Implementation in CUDA**

## Experimental Environment Setup

In this section we prepare and validate the experimental environment used
for all subsequent benchmarks and analyses.

We first verify that a CUDA-capable GPU is available and that the CUDA compiler (`nvcc`) is correctly installed.  
This step ensures that the benchmarks will run on the expected hardware.

In [None]:
!nvidia-smi
!nvcc --version
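
The same check can be scripted so that later cells fail early when the toolchain is missing. The helper below is a minimal sketch, and the name `check_cuda_toolchain` is hypothetical (it is not part of the repository). It only looks for the two executables on `PATH` and does not query the GPU itself.

```python
import shutil


def check_cuda_toolchain(tools=("nvidia-smi", "nvcc")):
    """Return a dict mapping each tool name to True if it is found on PATH."""
    # shutil.which returns the executable's path, or None when it is absent.
    return {tool: shutil.which(tool) is not None for tool in tools}


status = check_cuda_toolchain()
for tool, found in status.items():
    print(f"{tool}: {'found' if found else 'MISSING'}")
```

A missing `nvcc` with `nvidia-smi` present usually means the driver is installed but the CUDA toolkit is not.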

We clone the project repository from GitHub and place it in the working directory.
This step recreates the exact codebase used for the experiments.

In [12]:
REPO_URL="https://github.com/massimo-ruggiero/vae-cuda"
PROJECT_DIR="VAE"

In [16]:
%cd /content
!rm -rf "$PROJECT_DIR"
!git clone --depth 1 "$REPO_URL" "$PROJECT_DIR"
%cd "$PROJECT_DIR"

/content
Cloning into 'VAE'...
remote: Enumerating objects: 62, done.
remote: Counting objects: 100% (62/62), done.
remote: Compressing objects: 100% (57/57), done.
remote: Total 62 (delta 14), reused 14 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (62/62), 30.84 KiB | 15.42 MiB/s, done.
Resolving deltas: 100% (14/14), done.
/content/VAE


We inspect the directory structure of the repository to verify that all expected
modules and scripts are present.

In [None]:
!sudo apt-get update -y >/dev/null
!sudo apt-get install -y tree >/dev/null
!tree -L 4

The repository includes helper scripts for running the main training pipeline and the *micro* and *macro* benchmark suites.

In [None]:
!ls -la scripts

total 16
drwxr-xr-x 2 root root 4096 Jan  3 09:07 .
drwxr-xr-x 6 root root 4096 Jan  3 09:07 ..
-rw-r--r-- 1 root root  981 Jan  3 09:07 run_main.sh
-rw-r--r-- 1 root root 1017 Jan  3 09:07 run_micro_bench.sh


The VAE implementation expects the MNIST dataset to be provided in a custom
binary format for fast loading during training and benchmarking.

In [17]:
import os
import numpy as np
from tensorflow.keras.datasets import mnist

def save_to_bin(images, labels, filename):
    images_flat = images.reshape(images.shape[0], -1).astype(np.uint8)
    labels = labels.astype(np.uint8)

    num_samples = images.shape[0]

    header = np.array([num_samples], dtype=np.int32)

    print(f"Writing {filename}...")
    print(f"  - Samples: {num_samples}")
    print(f"  - Data shape: {images_flat.shape}")
    print(f"  - Labels shape: {labels.shape}")

    with open(filename, 'wb') as f:
        header.tofile(f)
        images_flat.tofile(f)
        labels.tofile(f)

    size_mb = os.path.getsize(filename) / (1024 * 1024)
    print(f"  -> Done! ({size_mb:.2f} MB)\n")


if not os.path.exists('data'):
    os.makedirs('data')
    print("Directory 'data/' created.")

print("Downloading MNIST from Keras...")
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Generate the binary files
save_to_bin(x_train, y_train, 'data/train.bin')
save_to_bin(x_test, y_test,  'data/test.bin')

print("All done. You can now run the C++ program.")


Directory 'data/' created.
Downloading MNIST from Keras...
Writing data/train.bin...
  - Samples: 60000
  - Data shape: (60000, 784)
  - Labels shape: (60000,)
  -> Done! (44.92 MB)

Writing data/test.bin...
  - Samples: 10000
  - Data shape: (10000, 784)
  - Labels shape: (10000,)
  -> Done! (7.49 MB)

All done. You can now run the C++ program.
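
As written by `save_to_bin`, each file consists of a 4-byte `int32` sample count, followed by all images as flattened `uint8` pixels, followed by all labels as `uint8`. The sketch below reads a file back to confirm that layout. The names `load_from_bin` and `expected_size_bytes` are hypothetical helpers introduced here for illustration, and the header is assumed to be in the machine's native byte order, which is what `tofile` produces.

```python
import os
import numpy as np

IMG_PIXELS = 28 * 28  # each MNIST image is flattened to 784 bytes


def expected_size_bytes(num_samples, pixels=IMG_PIXELS):
    """4-byte int32 header + one uint8 per pixel + one uint8 label per sample."""
    return 4 + num_samples * pixels + num_samples


def load_from_bin(filename, pixels=IMG_PIXELS):
    """Read a file written by save_to_bin and return (images, labels)."""
    with open(filename, "rb") as f:
        num_samples = int(np.fromfile(f, dtype=np.int32, count=1)[0])
        images = np.fromfile(f, dtype=np.uint8, count=num_samples * pixels)
        labels = np.fromfile(f, dtype=np.uint8, count=num_samples)
    # np.fromfile returns fewer items than requested if the file ends early.
    if images.size != num_samples * pixels or labels.size != num_samples:
        raise ValueError(f"{filename}: file is truncated")
    return images.reshape(num_samples, pixels), labels


# Only runs when the file from the previous cell is present.
if os.path.exists("data/train.bin"):
    images, labels = load_from_bin("data/train.bin")
    assert os.path.getsize("data/train.bin") == expected_size_bytes(len(labels))
    print(images.shape, labels.shape)
```

For 60,000 samples the expected size is 4 + 60000 × 784 + 60000 = 47,100,004 bytes, which is the 44.92 MB reported above.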


## End-to-End Sanity Check


Before running the full benchmark suite, we perform a quick end-to-end test to verify that:
- the project compiles and runs correctly on the current GPU
- training executes without runtime errors
- the VAE produces a valid reconstruction
- the sampling pipeline generates plausible outputs

This step is not meant to optimize performance; it is a correctness and pipeline validation check.

In [None]:
!chmod +x scripts/run_sanity_check.sh
!bash scripts/run_sanity_check.sh

In [19]:
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path


IMG_SIZE = 28
IMG_PIXELS = IMG_SIZE * IMG_SIZE


def load_raw_image(path: str) -> np.ndarray:
    data = np.fromfile(path, dtype=np.float32)

    if data.size != IMG_PIXELS:
        raise ValueError(
            f"{path}: expected {IMG_PIXELS} values, found {data.size}"
        )

    return data.reshape(IMG_SIZE, IMG_SIZE)


def show_reconstruction(original_path: str, reconstructed_path: str):
    img_orig = load_raw_image(original_path)
    img_recon = load_raw_image(reconstructed_path)

    plt.figure(figsize=(10, 5))

    plt.subplot(1, 2, 1)
    plt.title("Original input")
    plt.imshow(img_orig, cmap="gray", vmin=0, vmax=1)
    plt.axis("off")

    plt.subplot(1, 2, 2)
    plt.title("VAE reconstruction")
    plt.imshow(img_recon, cmap="gray", vmin=0, vmax=1)
    plt.axis("off")

    plt.tight_layout()
    plt.show()


def show_sample(sample_path: str, title: str = "VAE sample"):
    img = load_raw_image(sample_path)

    plt.figure(figsize=(4, 4))
    plt.title(title)
    plt.imshow(img, cmap="gray", vmin=0, vmax=1)
    plt.axis("off")
    plt.show()

In [None]:
print("📂 Loading raw images...")

try:
    # --- paths ---
    base_dir = Path("images/Warp Reduction")
    original = base_dir / "original.raw"
    reconstructed = base_dir / "reconstructed.raw"

    sample_0 = base_dir / "sample_0.raw"

    # --- visualizations ---
    show_reconstruction(original, reconstructed)
    show_sample(sample_0, title="VAE sample")

except FileNotFoundError as e:
    print("❌ File not found:", e)
    print("Make sure you have run the C++ program first.")
except ValueError as e:
    print("❌ Data error:", e)
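
The plots give a visual check only. A numeric check can complement them, and one possible version is sketched below. The function name `reconstruction_report` is hypothetical, and the [0, 1] range test is an assumption based on the `vmin=0, vmax=1` settings used for plotting, not on a documented property of the C++ output.

```python
import numpy as np


def reconstruction_report(original, reconstructed):
    """Return simple numeric checks for an (original, reconstruction) pair."""
    original = np.asarray(original, dtype=np.float32)
    reconstructed = np.asarray(reconstructed, dtype=np.float32)
    if original.shape != reconstructed.shape:
        raise ValueError("shape mismatch between original and reconstruction")
    return {
        # No NaN or Inf values in the reconstruction.
        "finite": bool(np.isfinite(reconstructed).all()),
        # Assumed pixel range; NaN values also fail this test.
        "in_unit_range": bool(((reconstructed >= 0) & (reconstructed <= 1)).all()),
        # Mean squared error per pixel.
        "mse": float(np.mean((original - reconstructed) ** 2)),
    }


# Synthetic demo; with real data, pass the arrays returned by load_raw_image.
demo = np.full((28, 28), 0.5, dtype=np.float32)
print(reconstruction_report(demo, demo))
```

With the files from the sanity check, the call would be `reconstruction_report(load_raw_image(original), load_raw_image(reconstructed))`.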

## Micro-Benchmark Suite Execution

After validating the end-to-end execution of the VAE pipeline, we run a dedicated
micro-benchmark suite to evaluate the performance of individual CUDA kernels.

The micro-benchmark script supports a configurable output directory.

- **Default output directory:** `results/`
- **Custom output directory:** specified via the `--outdir <path>` option

All benchmark results are stored as CSV files inside the selected directory.

In [None]:
!chmod +x scripts/run_micro_bench.sh
!bash scripts/run_micro_bench.sh
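
Once the suite has finished, the CSV files can be listed and loaded for inspection. The sketch below assumes only that each file has a header row. The column names are not documented in this section, so the code discovers them rather than hard-coding them, and `load_csv_results` is a hypothetical helper name.

```python
import csv
from pathlib import Path


def load_csv_results(outdir="results"):
    """Map each CSV file name in outdir to its rows (a list of dicts)."""
    tables = {}
    for csv_path in sorted(Path(outdir).glob("*.csv")):
        with open(csv_path, newline="") as f:
            tables[csv_path.name] = list(csv.DictReader(f))
    return tables


# Prints nothing if the directory is missing or contains no CSV files.
for name, rows in load_csv_results("results").items():
    columns = list(rows[0].keys()) if rows else []
    print(f"{name}: {len(rows)} rows, columns = {columns}")
```

If the script was run with `--outdir <path>`, pass the same path to `load_csv_results`.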

## Macro-Benchmark Suite Execution