# **Full GPU-Accelerated Variational AutoEncoder Implementation in CUDA**

## Experimental Environment Setup

In this section we prepare and validate the experimental environment used
for all subsequent benchmarks and analyses.

We first verify that a CUDA-capable GPU is available and that the CUDA compiler (`nvcc`) is correctly installed.  
This step ensures that the benchmarks will run on the expected hardware.

In [None]:
!nvidia-smi
!nvcc --version
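
The same check can also be done programmatically, which is convenient when later cells should fail early on a machine without CUDA. The following is a minimal sketch that only looks the two tools up on `PATH`; the helper names `find_cuda_tools` and `missing_tools` are hypothetical and not part of the repository.

```python
import shutil


def find_cuda_tools(tools=("nvidia-smi", "nvcc")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def missing_tools(found):
    """Return the names of the tools that could not be located."""
    return [name for name, path in found.items() if path is None]


found = find_cuda_tools()
for name, path in found.items():
    print(f"{name}: {path or 'NOT FOUND'}")
```

Finding a tool on `PATH` does not prove that a GPU is attached, so the output of `nvidia-smi` remains the authoritative check.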

We clone the project repository from GitHub and place it in the working directory.
This step recreates the exact codebase used for the experiments.

In [None]:
REPO_URL="https://github.com/massimo-ruggiero/vae-cuda"
PROJECT_DIR="VAE"

In [None]:
%cd /content
!rm -rf "$PROJECT_DIR"
!git clone --depth 1 "$REPO_URL" "$PROJECT_DIR"
%cd "$PROJECT_DIR"

We inspect the directory structure of the repository to verify that all expected
modules and scripts are present.

In [None]:
!sudo apt-get update -y >/dev/null
!sudo apt-get install -y tree >/dev/null
!tree -L 4

The repository includes helper scripts for running the main training pipeline and the *micro* and *macro* benchmark suites.

In [None]:
!ls -la scripts

The VAE implementation expects the MNIST dataset to be provided in a custom
binary format for fast loading during training and benchmarking.

In [None]:
import os
import numpy as np
from tensorflow.keras.datasets import mnist

def save_to_bin(images, labels, filename):
    """Write a dataset as: int32 sample count, uint8 pixels, uint8 labels."""
    # Flatten each 28x28 image into a row of 784 bytes
    images_flat = images.reshape(images.shape[0], -1).astype(np.uint8)
    labels = labels.astype(np.uint8)

    num_samples = images.shape[0]

    # 4-byte header holding the number of samples
    header = np.array([num_samples], dtype=np.int32)

    print(f"Writing {filename}...")
    print(f"  - Samples: {num_samples}")
    print(f"  - Data shape: {images_flat.shape}")
    print(f"  - Labels shape: {labels.shape}")

    with open(filename, 'wb') as f:
        header.tofile(f)
        images_flat.tofile(f)
        labels.tofile(f)

    size_mb = os.path.getsize(filename) / (1024 * 1024)
    print(f"  -> Done! ({size_mb:.2f} MB)\n")


if not os.path.exists('data'):
    os.makedirs('data')
    print("Created 'data/' directory.")

print("Downloading MNIST from Keras...")
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Generate the binary files
save_to_bin(x_train, y_train, 'data/train.bin')
save_to_bin(x_test, y_test,  'data/test.bin')

print("All done. You can now run the C++ program.")
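
Since the C++ loader depends on this exact layout, it can be worth confirming that each file has the size implied by its header. The sketch below uses only the standard library and assumes the layout written by `save_to_bin` (a 4-byte `int32` sample count, then 784 bytes per image, then one byte per label) on a little-endian host; `expected_bin_size` and `check_bin_layout` are hypothetical helper names.

```python
import os
import struct

IMG_PIXELS = 28 * 28  # each MNIST image is flattened to 784 bytes


def expected_bin_size(num_samples, pixels=IMG_PIXELS):
    """Bytes in a file with a 4-byte header, uint8 images and uint8 labels."""
    return 4 + num_samples * pixels + num_samples


def check_bin_layout(path):
    """Return the sample count if the file size matches its header, else raise."""
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) != 4:
        raise ValueError(f"{path}: file too short to contain a header")

    # Assumption: the header was written in little-endian byte order
    (num_samples,) = struct.unpack("<i", header)

    actual = os.path.getsize(path)
    expected = expected_bin_size(num_samples)
    if actual != expected:
        raise ValueError(
            f"{path}: header says {num_samples} samples "
            f"({expected} bytes) but the file has {actual} bytes"
        )
    return num_samples
```

Calling `check_bin_layout('data/train.bin')` after the cell above should return the number of training samples, and a truncated or partially written file raises a `ValueError` instead.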


## End-to-End Sanity Check


Before running the full benchmark suite, we perform a quick end-to-end test to verify that:
- the project compiles and runs correctly on the current GPU
- training executes without runtime errors
- the VAE produces a valid reconstruction
- the sampling pipeline generates plausible outputs

This step is not meant to measure or optimize performance; it only validates correctness and the pipeline as a whole.

In [None]:
!chmod +x scripts/run_sanity_check.sh
!bash scripts/run_sanity_check.sh

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path


IMG_SIZE = 28
IMG_PIXELS = IMG_SIZE * IMG_SIZE


def load_raw_image(path: str) -> np.ndarray:
    data = np.fromfile(path, dtype=np.float32)

    if data.size != IMG_PIXELS:
        raise ValueError(
            f"{path}: expected {IMG_PIXELS} values, found {data.size}"
        )

    return data.reshape(IMG_SIZE, IMG_SIZE)


def show_reconstruction(original_path: str, reconstructed_path: str):
    img_orig = load_raw_image(original_path)
    img_recon = load_raw_image(reconstructed_path)

    plt.figure(figsize=(10, 5))

    plt.subplot(1, 2, 1)
    plt.title("Original input")
    plt.imshow(img_orig, cmap="gray", vmin=0, vmax=1)
    plt.axis("off")

    plt.subplot(1, 2, 2)
    plt.title("VAE reconstruction")
    plt.imshow(img_recon, cmap="gray", vmin=0, vmax=1)
    plt.axis("off")

    plt.tight_layout()
    plt.show()


def show_sample(sample_path: str, title: str = "VAE sample"):
    img = load_raw_image(sample_path)

    plt.figure(figsize=(4, 4))
    plt.title(title)
    plt.imshow(img, cmap="gray", vmin=0, vmax=1)
    plt.axis("off")
    plt.show()

In [None]:
print("📂 Loading raw images...")

try:
    # --- paths ---
    base_dir = Path("images/Warp Reduction")
    original = base_dir / "original.raw"
    reconstructed = base_dir / "reconstructed.raw"

    sample_0 = base_dir / "sample_0.raw"

    # --- visualizations ---
    show_reconstruction(original, reconstructed)
    show_sample(sample_0, title="VAE sample")

except FileNotFoundError as e:
    print("❌ File not found:", e)
    print("Make sure you have run the C++ program first.")
except ValueError as e:
    print("❌ Data error:", e)
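
The visual comparison can be complemented with a single number. The sketch below computes the mean squared error between two `.raw` files, assuming they hold 784 `float32` values each, as `load_raw_image` expects. It relies only on the standard library, and the helper names `read_raw_floats` and `mean_squared_error` are hypothetical.

```python
from array import array
from pathlib import Path

IMG_PIXELS = 28 * 28


def read_raw_floats(path):
    """Read a .raw file as a flat sequence of 32-bit floats."""
    data = array("f")
    data.frombytes(Path(path).read_bytes())
    if len(data) != IMG_PIXELS:
        raise ValueError(f"{path}: expected {IMG_PIXELS} values, found {len(data)}")
    return data


def mean_squared_error(a, b):
    """Average of the squared per-pixel differences between two images."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
```

For example, `mean_squared_error(read_raw_floats(original), read_raw_floats(reconstructed))` gives the error for the pair shown above. What counts as an acceptable value depends on the training budget, so the number is most useful for comparing runs against each other.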

## Micro-Benchmark Suite Execution

After validating the end-to-end execution of the VAE pipeline, we run a dedicated
micro-benchmark suite to evaluate the performance of individual CUDA kernels.

The micro-benchmark script supports a configurable output directory.

- **Default output directory:** `results/`
- **Custom output directory:** specified via the `--outdir <path>` option

All benchmark results are stored as CSV files inside the selected directory.

In [None]:
!chmod +x scripts/run_micro_bench.sh
!bash scripts/run_micro_bench.sh
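
Once the script has finished, it is useful to confirm which CSV files were actually produced before plotting them. The sketch below searches an output directory recursively; the helper name `list_result_csvs` is hypothetical, and the default `results` mirrors the default output directory described above.

```python
from pathlib import Path


def list_result_csvs(outdir="results"):
    """Return every CSV file under outdir, sorted by path."""
    root = Path(outdir)
    if not root.is_dir():
        raise FileNotFoundError(f"output directory not found: {root}")
    return sorted(root.rglob("*.csv"))
```

Printing each entry of `list_result_csvs()` together with `path.stat().st_size` gives a quick overview of the results; an empty list suggests that the benchmarks did not run or wrote to a different directory.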

### Roofline Analysis

The roofline model provides an upper bound on the attainable performance of a kernel
by relating **arithmetic intensity** (FLOPs per byte of memory traffic) to the
hardware limits of the target architecture.

Given a kernel with work $W$ (in FLOPs) and memory traffic $Q$ (in bytes),
the arithmetic intensity is defined as:
$$
AI = \frac{W}{Q}
$$
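
As a worked example of this definition, consider an element-wise FP32 vector addition `c[i] = a[i] + b[i]`. This kernel is chosen purely for illustration and is not taken from the repository. It performs one FLOP per element and moves 12 bytes per element (two 4-byte reads and one 4-byte write, ignoring caching effects), so its arithmetic intensity is 1/12 FLOP/byte regardless of the vector length. The function names below are hypothetical.

```python
def arithmetic_intensity(flops, bytes_moved):
    """AI = W / Q, in FLOP per byte."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


def vector_add_cost(n, bytes_per_element=4):
    """Work and memory traffic of c[i] = a[i] + b[i] over n elements."""
    flops = n                                # one addition per element
    bytes_moved = 3 * n * bytes_per_element  # read a, read b, write c
    return flops, bytes_moved


def achieved_gflops(flops, seconds):
    """Measured performance for a kernel that ran in `seconds`."""
    return flops / seconds / 1e9


flops, traffic = vector_add_cost(1 << 20)
print(f"AI = {arithmetic_intensity(flops, traffic):.4f} FLOP/byte")
```

An intensity this low sits far to the left on a roofline plot, which is typical of element-wise kernels.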

The attainable performance $P$ is bounded by:
$$
P \le \min \left( \pi,\; \beta \times AI \right)
$$
where:
- $\pi$ is the peak compute performance (GFLOP/s),
- $\beta$ is the peak memory bandwidth (GB/s).

The intersection of the two bounds defines the **ridge point**
$AI = \pi / \beta$, which separates **memory-bound** from **compute-bound**
execution regimes.
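
The bound and the ridge point are easy to evaluate numerically. The sketch below uses peaks of 8100 GFLOP/s and 320 GB/s, which are the fallback values hard-coded in this notebook's plotting helper. The actual peaks are read from the CSV metadata and depend on the GPU, so these numbers are illustrative only, and the function names are hypothetical.

```python
PEAK_GFLOPS = 8100.0  # assumed peak FP32 compute, GFLOP/s
PEAK_BW_GBPS = 320.0  # assumed peak memory bandwidth, GB/s


def ridge_point(peak_gflops, peak_bw_gbps):
    """Arithmetic intensity at which the two bounds intersect."""
    return peak_gflops / peak_bw_gbps


def attainable_gflops(ai, peak_gflops, peak_bw_gbps):
    """Roofline bound: min(pi, beta * AI)."""
    return min(peak_gflops, peak_bw_gbps * ai)


def regime(ai, peak_gflops, peak_bw_gbps):
    """Classify a kernel by which side of the ridge point it falls on."""
    if ai < ridge_point(peak_gflops, peak_bw_gbps):
        return "memory-bound"
    return "compute-bound"


print("ridge point:", ridge_point(PEAK_GFLOPS, PEAK_BW_GBPS), "FLOP/byte")
for ai in (0.25, 1.0, 64.0):
    bound = attainable_gflops(ai, PEAK_GFLOPS, PEAK_BW_GBPS)
    label = regime(ai, PEAK_GFLOPS, PEAK_BW_GBPS)
    print(f"AI={ai}: at most {bound} GFLOP/s ({label})")
```

With these peaks the ridge point is about 25.3 FLOP/byte, so a kernel at 0.25 FLOP/byte can reach at most 80 GFLOP/s no matter how well its arithmetic is optimized.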

In the following plot, each point represents the measured performance of a kernel
implementation positioned according to its arithmetic intensity.

**<font color="red">⚠️
Kernels with zero arithmetic intensity perform no floating-point operations and therefore cannot be meaningfully positioned on the roofline in terms of GFLOPS.
</font>**

In [None]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

def read_bench_csv(path: str | Path):
    """
    Reads your CSV format:
      # key=value
      # key=value
      op,strategy,... (real CSV header)
      ...
    Returns: (df, meta_dict)
    """
    path = Path(path)
    meta = {}

    with path.open("r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            # The metadata block ends at the first non-comment line
            if not line.startswith("#"):
                break
            m = re.match(r"#\s*([A-Za-z0-9_]+)\s*=\s*(.*)\s*$", line.strip())
            if m:
                k, v = m.group(1), m.group(2)
                # Store numeric values as floats, everything else as strings
                try:
                    meta[k] = float(v)
                except ValueError:
                    meta[k] = v

    df = pd.read_csv(path, comment="#")
    return df, meta


def plot_roofline_from_csv(
    csv_paths,
    strategy="NAIVE",
    ops=None,
    title="Roofline",
    ai_limits=(1/32, 256),
    perf_limits=None,
    label_col="Kernel",
    figsize=(9, 6),
):
    """
    Draws a roofline plot in the same style as your reference image.
    - black roofline
    - thin dashed horizontal and diagonal guide lines
    - red dashed vertical ridge line
    - annotations: π and β×I
    - legend includes ONLY points (kernels)
    """
    dfs = []
    metas = []
    for p in csv_paths:
        df, meta = read_bench_csv(p)
        dfs.append(df)
        metas.append(meta)
    df = pd.concat(dfs, ignore_index=True)

    meta0 = metas[0] if metas else {}
    peak_gflops = float(meta0.get("peak_gflops_fp32", 8100.0))
    peak_bw     = float(meta0.get("peak_bandwidth_gbps", 320.0))
    ridge       = float(meta0.get("ridge_point", peak_gflops / peak_bw))

    # Filter strategy and ops
    if "strategy" in df.columns:
        df = df[df["strategy"] == strategy]
    if ops is not None and "op" in df.columns:
        df = df[df["op"].isin(ops)]

    if "flops" in df.columns:
        df = df[df["flops"] > 0]
    else:
        raise ValueError("Column 'flops' not found: cannot filter zero-FLOP kernels.")

    # Need ai + gflops
    required = {"ai", "gflops", label_col}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns in CSV(s): {missing}. Found: {list(df.columns)}")

    # Roofline curve
    ai = np.logspace(np.log10(ai_limits[0]), np.log10(ai_limits[1]), 600)
    roof = np.minimum(peak_gflops, peak_bw * ai)
    ai_star = peak_gflops / peak_bw  # theoretical ridge from peaks (for the dashed diagonal)

    plt.figure(figsize=figsize)

    # Main roofline (thick)
    plt.plot(ai, roof, color="black", linewidth=2)

    # Thin dashed guide lines like the paper figure:
    # - horizontal at π (peak compute)
    plt.hlines(
        peak_gflops, ai_limits[0], ai_limits[1],
        colors="black", linestyles="--", linewidth=1, alpha=0.6
    )

    # - diagonal continuation beyond ridge (dashed)
    ai_diag = np.logspace(np.log10(ai_star), np.log10(ai_limits[1]), 200)
    plt.plot(ai_diag, peak_bw * ai_diag, color="black", linestyle="--", linewidth=1, alpha=0.6)

    # Ridge point vertical line (red dashed)
    plt.axvline(ridge, linestyle="--", color="red", linewidth=2)

    # Scatter points (legend = points only)
    handles, labels = [], []
    for key, grp in df.groupby(label_col):
        sc = plt.scatter(grp["ai"], grp["gflops"], s=70, marker="o")
        handles.append(sc)
        labels.append(str(key))

    # Scales + labels
    plt.xscale("log")
    plt.yscale("log")
    plt.xlabel("Operational / Arithmetic Intensity [FLOP/byte]")
    plt.ylabel("Performance [GFLOP/s]")
    plt.title(f"{title} - {strategy}")

    if perf_limits is not None:
        plt.ylim(perf_limits)
    plt.xlim(ai_limits)

    # Grid (subtle)
    plt.grid(True, which="both", linestyle="--", alpha=0.30)

    plt.text(ai_limits[1] / 3, peak_gflops * 1.03, r"$\pi$", fontsize=18)

    ai_txt = np.sqrt(ai_limits[0] * (peak_gflops / peak_bw))
    y_txt = peak_bw * ai_txt
    plt.text(ai_txt * 1.1, y_txt * 1.1, r"$\beta \times I$", fontsize=16, rotation=35)

    if handles:
        plt.legend(handles, labels, title="Kernels", loc="lower right", frameon=True)

    plt.tight_layout()
    plt.show()

    return {"peak_gflops_fp32": peak_gflops, "peak_bandwidth_gbps": peak_bw, "ridge_point": ridge}

In [None]:
meta = plot_roofline_from_csv(
    csv_paths=[
        "results/micro_bench/csv/bench_linalg.csv",
        "results/micro_bench/csv/bench_activations.csv",
        "results/micro_bench/csv/bench_loss.csv",
        "results/micro_bench/csv/bench_reparam.csv",
        "results/micro_bench/csv/bench_optimizers.csv",
    ],
    strategy="NAIVE",
    title="Roofline",
    ai_limits=(1/64, 256),
    perf_limits=(50, 20000),
)

## Macro-Benchmark Suite Execution