## Explainer notebook

A walkthrough of the computations, algorithms, and results for this project.

#### Data

The Tasic et al. (Nature, 2018) single-cell RNA-seq dataset is used. It contains gene expression profiles from 23,822 cells isolated from the adult mouse visual cortex (Kobak and Berens, 2019). The dataset is used in preprocessed form, with its dimensionality reduced to 50 principal components by PCA. The PCA-reduced data is available via the accompanying GitHub repository (Berens and Kobak, 2019).

There are 133 unique classes in the data.

#### Motivation

Deep latent variable models (DLVMs) are powerful tools for uncovering low-dimensional structure in high-dimensional data. The goal is not merely to compress data, but to learn representations that reflect the underlying mechanisms of the observed phenomena.

However, in VAEs and similar deep generative models, the latent variables themselves are not identifiable: multiple different latent representations can yield the same observed data distribution. As a result, the learned latent variables are not unique and can vary significantly across training runs, which makes interpretation difficult (Hauberg, 2025).

Recent work has shown that while the latent variables themselves are not identifiable, certain geometric relationships between them (such as distances, angles, and volumes, when defined using the pullback metric via the decoder) can be statistically identifiable under mild assumptions (Syrota et al., 2025).
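
For reference, the pullback construction can be written out in its standard form. The symbols below ($f$ for the decoder mean, $J_f$ for its Jacobian, $G$ for the metric, $c$ for a latent curve) are notation introduced here for illustration and are not taken from the project code.

$$
G(z) = J_f(z)^\top J_f(z), \qquad
L(c) = \int_0^1 \sqrt{\dot{c}(t)^\top G(c(t))\, \dot{c}(t)}\; dt
     = \int_0^1 \left\lVert \frac{d}{dt} f(c(t)) \right\rVert dt
$$

$$
d(z_0, z_1) = \inf_{c \,:\, c(0) = z_0,\; c(1) = z_1} L(c)
$$

The geodesic distance between two latent points is thus the length of the shortest curve connecting them, measured in data space after decoding.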

#### Objective and Aim

Under the manifold hypothesis, it is assumed that the data lie near a lower-dimensional manifold embedded in a high-dimensional space. A variational autoencoder (VAE) or an ensemble VAE is used to learn this manifold.

The objective is to produce distance matrices and design methods to compute geodesic distances.

* The first experiment uses a single VAE: one encoder and one decoder.
* The second experiment uses an ensemble VAE with 10 decoders.

#### Single VAE

Two variational autoencoders are trained independently with different random seeds using the code in `src/single_decoder/vae_train.py`. Each model maps the 50-dimensional PCA-reduced RNA-seq data to a 2-dimensional latent space. Each model is trained for 100 epochs using the Adam optimizer with a learning rate of 0.001 and a batch size of 64.

The goal is to study the geometry learned by each VAE by comparing geodesic curves between fixed point pairs in latent space. In the latent space of one of the models, each class is represented by a single point chosen from the center of its cluster. With one point for each of the 133 unique classes, this gives 133 · 132 / 2 = 8778 point pairs whose geodesic distances are approximated (`src/select_representative_pairs.py`).
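
The pair count can be checked with a minimal sketch. It assumes that the representative point is the mean of each class in latent space; whether `src/select_representative_pairs.py` uses the mean or the data point closest to it is not stated here, and the function names and toy data below are hypothetical.

```python
from itertools import combinations

import numpy as np


def cluster_centers(latents, labels):
    """Return the sorted unique labels and the mean latent point of each class."""
    classes = np.unique(labels)
    centers = np.stack([latents[labels == c].mean(axis=0) for c in classes])
    return classes, centers


def all_pairs(n_points):
    """Index pairs (i, j) with i < j; there are n * (n - 1) / 2 of them."""
    return list(combinations(range(n_points), 2))


# Toy stand-in for the 2-D latent codes: 133 classes with 5 points each.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(133), 5)
latents = rng.normal(size=(labels.size, 2))

classes, centers = cluster_centers(latents, labels)
pairs = all_pairs(len(classes))
print(centers.shape, len(pairs))  # (133, 2) 8778
```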

**Spline Initialization**

Inspired by Detlefsen et al. (2022), the splines for the single-decoder VAE are initialized using the parameterization of Syrota et al. (2025). This formulation is embedded in the `GeodesicSpline` class. The splines are initialized from shortest paths in the model's latent space. The shortest paths between point pairs are found with Dijkstra's algorithm (from SciPy) on a 200x200 grid over the latent space, where each node is connected to its k=8 nearest neighbors. Each spline is initialized with 4 segments and fitted to its Dijkstra path using the L-BFGS algorithm. Some example splines are visualized below.
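
The graph-search step can be sketched as follows, under some assumptions: the graph is built as an 8-connected regular grid with Euclidean edge lengths (on a regular grid this matches k=8 nearest neighbors for interior nodes), and the helper names `build_grid_graph` and `shortest_path` are hypothetical. Only `scipy.sparse.csgraph.dijkstra` is the actual SciPy routine. The demo uses a 21x21 grid to keep it fast, and the spline fit itself is not shown.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import dijkstra


def build_grid_graph(xs, ys):
    """8-connected grid graph with Euclidean edge lengths as weights."""
    nx, ny = len(xs), len(ys)
    gx, gy = np.meshgrid(xs, ys, indexing="ij")
    nodes = np.column_stack([gx.ravel(), gy.ravel()])  # node id = i * ny + j
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    rows, cols, weights = [], [], []
    for i in range(nx):
        for j in range(ny):
            for di, dj in offsets:
                a, b = i + di, j + dj
                if 0 <= a < nx and 0 <= b < ny:
                    u, v = i * ny + j, a * ny + b
                    rows.append(u)
                    cols.append(v)
                    weights.append(np.linalg.norm(nodes[u] - nodes[v]))
    n = nx * ny
    graph = coo_matrix((weights, (rows, cols)), shape=(n, n)).tocsr()
    return nodes, graph


def shortest_path(nodes, graph, start, end):
    """Snap start/end to their nearest grid nodes and return the Dijkstra path."""
    s = int(np.argmin(np.linalg.norm(nodes - start, axis=1)))
    e = int(np.argmin(np.linalg.norm(nodes - end, axis=1)))
    dist, pred = dijkstra(graph, directed=True, indices=s,
                          return_predecessors=True)
    path = [e]
    while path[-1] != s:  # walk the predecessor chain back to the start node
        path.append(int(pred[path[-1]]))
    return nodes[path[::-1]], float(dist[e])


# Small demo grid over the unit square (the notebook uses 200x200).
xs = ys = np.linspace(0.0, 1.0, 21)
nodes, graph = build_grid_graph(xs, ys)
path, length = shortest_path(nodes, graph,
                             np.array([0.0, 0.0]), np.array([1.0, 1.0]))
print(len(path), length)
```

The returned `path` is an ordered array of grid coordinates, which is the kind of polyline a spline can then be fitted to.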


<div style="display: flex; justify-content: space-around; align-items: center;">
  <div style="text-align: center;">
    <img src="../src/plots/splines_init_dijkstra_seed12.png" alt="Splines Init Dijkstra Seed 12" style="max-width: 45%; height: auto;">
    <p><em>Seed 12: Initial spline paths differ slightly from the Dijkstra paths. </em></p>
  </div>
  <!-- <div style="text-align: center;">
    <img src="../src/plots/splines_init_dijkstra_seed123.png" alt="Splines Init Dijkstra Seed 123" style="max-width: 45%; height: auto;">
    <p><em>Seed 123</em></p>
  </div> -->
</div>

**Energy Optimization**

The splines are expressed using the parameterization of Syrota et al. (2025). This means that the optimization step (`src/single_decoder/optimize_energy_batched.py`) optimizes the omega parameters of each spline. The optimization is run over batches of splines.

The energy is minimized with the Adam optimizer for 500 steps at a learning rate of 0.001. Each spline is optimized with the same number of segments it was initialized with (the same for all splines), using a discretization of 2000 points along the curve (`t_vals = 2000`). The energy is computed with the model's decoder, using the decoder mean at each discretization step.
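
A discretized curve energy of this kind can be sketched in NumPy. This is an illustration only, with several assumptions: the energy is taken as the integral of the squared speed of the decoded curve (some references include a factor 1/2), the function name and arguments are hypothetical, and the identity map stands in for the decoder mean. It evaluates the energy and length of a given curve and does not perform the Adam optimization.

```python
import numpy as np


def curve_energy_and_length(decode, curve, n_steps=2000):
    """Discretized energy and length of a latent curve under a decoder.

    decode: maps an (n, d) array of latent points to an (n, D) array of decoder means.
    curve:  maps an (n,) array of t in [0, 1] to an (n, d) array of latent points.
    """
    t = np.linspace(0.0, 1.0, n_steps)
    decoded = decode(curve(t))
    steps = np.diff(decoded, axis=0)          # f(c(t_{i+1})) - f(c(t_i))
    dt = t[1] - t[0]
    energy = np.sum(steps ** 2) / dt          # approximates the integral of ||d/dt f(c(t))||^2
    length = np.sum(np.linalg.norm(steps, axis=1))
    return energy, length


# Toy check: identity "decoder" and a straight line from (0, 0) to (3, 4).
start, end = np.array([0.0, 0.0]), np.array([3.0, 4.0])


def line(t):
    return start + t[:, None] * (end - start)


energy, length = curve_energy_and_length(lambda z: z, line)
print(round(energy, 6), round(length, 6))  # 25.0 5.0
```

For a constant-speed curve the energy equals the squared length, which is why minimizing energy also yields a length-minimizing curve.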

The optimized splines are saved together with the approximate geodesic lengths. Below are some examples of optimized splines (solid) and initial splines (dashed), each drawn on a density-based background for the respective model.

<div style="display: flex; justify-content: space-around; align-items: center;">
  <div style="text-align: center;">
    <img src="../src/plots/density_illustration_examples12.png" alt="Initial and optimized splines seed 12" style="max-width: 80%; height: auto;">
    <p><em>Seed 12</em></p>
  </div>
  <div style="text-align: center;">
    <img src="../src/plots/density_illustration_examples123.png" alt="Initial and optimized splines seed 123" style="max-width: 80%; height: auto;">
    <p><em>Seed 123</em></p>
  </div>
</div>
<p style="text-align: center; margin-top: 1em;"><em>The initial paths are optimized and seem to be pushed towards denser regions of the latent space.</em></p>



**Distance Matrices Geodesics**

The approximated geodesic lengths are visualized for comparison below. 

<div style="display: flex; justify-content: space-between;">
  <div style="flex: 0 0 49%; text-align: center;">
    <img src="../src/plots/geodesic_distance_seed12_p133.png" alt="Seed 12" style="width: 80%; height: auto;">
    <p><em>Seed 12</em></p>
  </div>
  <div style="flex: 0 0 49%; text-align: center;">
    <img src="../src/plots/geodesic_distance_seed123_p133.png" alt="Seed 123" style="width: 80%; height: auto;">
    <p><em>Seed 123</em></p>
  </div>
</div>
<p style="text-align: center; margin-top: 1em;"><em>Although the matrices appear to have a similar structure, the geodesic distances for corresponding point pairs differ numerically.</em></p>


To explore the geodesic distances in this dataset further, the next step is to compute them via ensemble VAEs.

#### Ensemble VAE

While the latent codes themselves are non-identifiable, the geometry defined via the decoder could be stabilized through ensembling. This approach reduces the variance of the learned Riemannian metric, which will hopefully lead to more consistently approximated geodesic paths.

**Latent spaces**


<div style="display: flex; justify-content: space-between;">
  <div style="flex: 0 0 49%; text-align: center;">
    <img src="../experiment/plots/latent_plot_seed12.png" alt="Seed 12" style="width: 80%; height: auto;">
    <p><em>Seed 12 EVAE 10 decoders</em></p>
  </div>
  <div style="flex: 0 0 49%; text-align: center;">
    <img src="../experiment/plots/latent_plot_seed123.png" alt="Seed 123" style="width: 80%; height: auto;">
    <p><em>Seed 123 EVAE 10 decoders</em></p>
  </div>
</div>
<p style="text-align: center; margin-top: 1em;"><em>The latent spaces look very similar across seeds. This unfortunately suggests that the Euclidean distances will be much alike, and therefore that the approximated geodesic distances may not be a big improvement over Euclidean distances for this dataset. The similarity of the latent spaces is most likely due to the PCA-reduced data.</em></p>



**Initializing Splines - Two methods**

With an ensemble, several decoders are available, so a better initial guess can be made by exploiting where the decoders agree. Previously, the shortest-path graph was built only from nearest neighbors on the latent-space grid (called Euclidean initialization). A new method based on Detlefsen et al. (2022) is proposed: the entropy of the decoders' agreement is used to construct a weighted graph, from which the initial splines are built.



<div style="display: flex; justify-content: space-between;">
  <div style="flex: 0 0 49%; text-align: center;">
    <img src="../experiment/splines_init_model_seed12/spline_plot_init_euclidean_10.png" alt="Seed 12" style="width: 80%; height: auto;">
    <p><em>Initialized using the Euclidean-based graph</em></p>
  </div>
  <div style="flex: 0 0 49%; text-align: center;">
    <img src="../experiment/splines_init_model_seed12/spline_plot_init_entropy_10.png" alt="Seed 12" style="width: 80%; height: auto;">
    <p><em>Initialized using the entropy-weighted graph</em></p>
  </div>
</div>
<p style="text-align: center; margin-top: 1em;"><em>Both are for the EVAE model with seed 12. The splines initialized from the entropy-weighted graph follow the structure of the latent data much more closely than those initialized from the Euclidean-based graph.</em></p>

**Geodesic Distances vs Euclidean Distances**

Although the geodesic distance matrices cannot be compared directly to the Euclidean distance matrices, their similarity is hard to overlook: all four plots are very similar.

**CoV analysis**

A brief look at evaluating performance using a coefficient of variation (CoV) analysis.

<div style="display: flex; justify-content: space-around; align-items: center;">
  <div style="text-align: center;">
    <img src="../experiment/plots/cov_plot_15_alldec.png" alt="CoV plot" style="max-width: 45%; height: auto;">
    <p><em>CoV analysis using 6 models, each trained with 10 decoders, evaluating geodesic and Euclidean distances with an increasing number of decoders. The results are not statistically robust enough to draw conclusions from.</em></p>
  </div>
</div>
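
As a rough illustration of the quantity involved, the coefficient of variation for each point pair can be computed as the standard deviation of its distance across models divided by the mean. The sketch below assumes this per-pair definition and a population standard deviation; the function name and the toy numbers are hypothetical and are not results from this project.

```python
import numpy as np


def coefficient_of_variation(distances):
    """CoV per point pair across models.

    distances: array of shape (n_models, n_pairs), one row per trained model.
    """
    mean = distances.mean(axis=0)
    std = distances.std(axis=0)  # population std; the choice of ddof is an assumption
    return std / mean


# Toy distances for 3 hypothetical models and 2 point pairs.
d = np.array([[1.0, 2.0],
              [1.0, 4.0],
              [1.0, 6.0]])
cov = coefficient_of_variation(d)
print(cov)
```

A pair whose distance is identical across models has a CoV of zero, so lower values indicate distances that are more stable across training runs.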

#### Next work 