# 6. Post-Hoc Discretization Pipeline

This notebook implements a two-stage VAE approach:
1. **Train a Continuous VAE:** Learn a smooth, high-quality Gaussian latent space without the constraints of quantization.
2. **Post-Hoc Quantization:** Train an RBF-based Codebook to cluster the learned continuous space into discrete tokens.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import sys
import os

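# Path fix: make the project root (the directory containing `src`) importable.
# If `src` is not in the current working directory, assume it is one level up.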
project_root = os.path.abspath(os.getcwd())
if 'src' not in os.listdir(project_root):
    project_root = os.path.abspath(os.path.join(project_root, '..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

from src.train_continuous_vae import train_continuous_vae
from src.train_posthoc_rbf import train_posthoc_quantizer

## Step 1: Train Continuous VAE
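We train a standard VAE whose loss combines a reconstruction term with a KL divergence term. The goal at this stage is to minimize reconstruction loss, while the KL term keeps the latent space smooth and approximately Gaussian.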
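
As a rough illustration of the objective described above, here is a minimal sketch of a per-sample VAE loss in plain Python. It assumes a diagonal Gaussian posterior, a standard normal prior, and a squared-error reconstruction term. The names `vae_loss` and `beta` are hypothetical and chosen for this example; the actual loss used by `train_continuous_vae` lives in `src/train_continuous_vae.py` and may differ.

```python
import math


def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Return (total, recon, kl) for one sample.

    Assumptions for this sketch: squared-error reconstruction,
    diagonal Gaussian posterior N(mu, exp(logvar)), prior N(0, I).
    """
    # Reconstruction term: summed squared error between input and output.
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    # Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dims.
    kl = 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))
    # beta weights the KL term; beta=1.0 gives the usual negative ELBO.
    return recon + beta * kl, recon, kl


total, recon, kl = vae_loss(
    x=[1.0, 2.0], x_recon=[0.5, 2.0], mu=[1.0, 0.0], logvar=[0.0, 0.0]
)
print(total, recon, kl)
```

A perfect reconstruction with `mu = 0` and `logvar = 0` yields a loss of zero, which is a quick sanity check for any implementation of this objective.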

In [None]:
# You might need 10-20 epochs for a good continuous manifold
train_continuous_vae(num_epochs=10, batch_size=16, lr=1e-4)

## Step 2: Train RBF Quantizer
Now we freeze the VAE, encode the data with it, and use the RBF module to find `1024` clusters (centroids) in the resulting latent space. Each centroid acts as one discrete token.
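
To make the assignment step concrete, the sketch below shows one plausible way an RBF codebook could map a continuous latent vector to a token: score each centroid with a Gaussian kernel on the squared distance and pick the highest-scoring one. This is an assumption about how such a module might work, not the code in `src/train_posthoc_rbf.py`. The function name `rbf_assign` and the bandwidth `sigma` are hypothetical.

```python
import math


def rbf_assign(z, centroids, sigma=1.0):
    """Return (token_index, soft_weights) for latent vector z.

    Each centroid is scored with exp(-||z - c||^2 / (2 * sigma^2));
    the scores are normalized so the weights sum to 1.
    """
    sq_dists = [sum((a - b) ** 2 for a, b in zip(z, c)) for c in centroids]
    # Subtract the smallest distance before exponentiating to avoid underflow.
    d_min = min(sq_dists)
    scores = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(scores)
    weights = [s / total for s in scores]
    # The hard token is the centroid with the largest weight (nearest centroid).
    token = max(range(len(weights)), key=weights.__getitem__)
    return token, weights


# Toy codebook with 3 centroids instead of 1024.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
token, weights = rbf_assign([0.9, 1.2], codebook)
print(token, weights)
```

Because the kernel decreases monotonically with distance, the hard assignment matches a nearest-centroid lookup. The soft weights are only needed if the quantizer is trained with a differentiable objective.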

In [None]:
# 5 epochs is usually enough for K-Means to converge
train_posthoc_quantizer(num_epochs=5, batch_size=32)
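
The comment above mentions K-Means convergence. Assuming the centroid updates behave like Lloyd's algorithm (an assumption; the real update rule is whatever `train_posthoc_quantizer` implements), a single iteration can be sketched as follows. The function name `kmeans_step` is hypothetical.

```python
def kmeans_step(latents, centroids):
    """Run one Lloyd iteration and return the updated centroids."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for z in latents:
        # Assignment: index of the centroid nearest to z (squared Euclidean).
        k = min(
            range(len(centroids)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(z, centroids[i])),
        )
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += z[d]
    updated = []
    for k, c in enumerate(centroids):
        if counts[k] == 0:
            # Leave a centroid unchanged if no latent was assigned to it.
            updated.append(list(c))
        else:
            # Update: move the centroid to the mean of its assigned latents.
            updated.append([s / counts[k] for s in sums[k]])
    return updated


latents = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
centroids = kmeans_step(latents, [[1.0, 1.0], [9.0, 9.0]])
print(centroids)
```

Repeating this step until the centroids stop moving is what "converge" means here. With mini-batches, each epoch applies many such partial updates, which may be why a small number of epochs suffices.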