# Note 

Prof. Oganisian -- as discussed, I have attached the abstract and the introduction of my paper. The paper already exists in rough form, but I have not yet written the causal section. Please let me know what suggestions and thoughts you have; I would be eager to incorporate them.

# Abstract

This paper introduces an approach for incorporating directional prior information into manifold learning, specifically within the principal manifolds framework. We formulate a Bayesian method that leverages directional statistics to update prior knowledge about the orientation of data points projected onto manifolds. With a spherical example in $\mathbb{R}^3$, we demonstrate how directional uncertainty can be modeled using von Mises distributions. Importantly, we let the concentration parameter encode the (fitted) local Gaussian curvature of the manifold at a projection point. Within this (empirical) Bayesian paradigm, we explore the manifold analogue of a causal regression discontinuity estimator in which pre- and post-treatment manifolds are fitted. Treating the pre-treatment manifolds as prior information, we explore the identifiability of a causal estimator defined as the change in the posterior angles post-treatment. The approach lays the foundation for incorporating domain expertise in causal questions about directional trends in applications such as cancer medicine, while preserving the topological advantages of manifold learning as a dimensionality reduction technique.

# Introduction

Manifold learning is a dimensionality reduction technique that has proven useful in settings where data are high-dimensional and non-linear. Manifold learning algorithms are often used when the topological structure of the data is to be preserved in a statistical learning task ^[For an introduction, see (Meila2023)]. Within the manifold learning framework, data are assumed to live on a lower-dimensional manifold and to be corrupted by high-dimensional noise. We say that $D$-dimensional data can be embedded in a $d$-dimensional manifold, where $d \leq D$ but generally $d \ll D$.


```{tikz}
\begin{tikzpicture}[scale=0.85]
  % Define styles
  \tikzset{
    manifold/.style={blue, thick, fill=blue!10},
    noise/.style={gray, thick},
    observed/.style={blue, thick, fill=blue!10, opacity=0.7},
    arrow/.style={->, >=stealth, thick},
    label/.style={font=\normalfont\normalsize}
  }
  % Low-dimensional manifold (left)
  \begin{scope}[xshift=-5cm]
    \draw[manifold] plot[smooth, tension=0.8] coordinates {(-1.5,0) (-0.5,1) (1,1.2) (1.5,0) (1,-1) (-0.5,-1) (-1.5,0)};
    
    % Draw some points on the manifold
    \node[circle, draw, inner sep=1pt, fill=blue!50] at (-0.7,0.3) {$f$};
    \node[circle, draw, inner sep=1pt, fill=blue!50] at (0.8,0.5) {$f$};
    \node[circle, draw, inner sep=1pt, fill=blue!50] at (0.2,-0.6) {$f$};
  \end{scope}
  % High-dimensional noise (center)
  \begin{scope}
    % Draw noise as radial lines from center point
    \fill (0,0) circle (0.1);
    \foreach \i in {0,10,...,350} {
      \draw[noise] (0,0) -- (\i:1.2+0.3*rnd);
    }
  \end{scope}
  % Observed data (right)
  \begin{scope}[xshift=5cm]
    % First draw the noise pattern
    \fill (0,0) circle (0.1);
    \foreach \i in {0,10,...,350} {
      \draw[noise] (0,0) -- (\i:1.2+0.3*rnd);
    }
    
    % Then overlay the manifold with transparency
    \draw[observed] plot[smooth, tension=0.8] coordinates {(-1.5,0) (-0.5,1) (1,1.2) (1.5,0) (1,-1) (-0.5,-1) (-1.5,0)};
    
    % Draw some points on the manifold
    \node[circle, draw, inner sep=1pt, fill=blue!50, opacity=0.7] at (-0.7,0.3) {$f$};
    \node[circle, draw, inner sep=1pt, fill=blue!50, opacity=0.7] at (0.8,0.5) {$f$};
    \node[circle, draw, inner sep=1pt, fill=blue!50, opacity=0.7] at (0.2,-0.6) {$f$};
  \end{scope}
  % Draw the operation symbols
  \node at (-2.5,0) {$+$};
  \draw[arrow] (2,0.5) -- (3,0.5);
  \draw[arrow] (3,-0.5) -- (2,-0.5);
  % Draw the curved arrows for parametrization and embedding
  \draw[arrow, green!50!black, thick] (-4,2.5) to[bend left=30] node[above, font=\normalfont\normalsize] {Parameterization} (4,2.5);
  \draw[arrow, orange, thick] (4,-2.5) to[bend left=30] node[below, font=\normalfont\normalsize] {Embedding} (-4,-2.5);
  % Labels with consistent position and size
  \node[label, text width=3cm, align=center] at (-5,-2) {Low-dimensional manifold};
  \node[label, text width=3cm, align=center] at (0,-2) {High-dimensional noise};
  \node[label, text width=3cm, align=center] at (5,-2) {Observed data};
\end{tikzpicture}
```


Within the principal manifolds framework (Meng2021) -- a replicable and flexible framework for manifold learning -- fitting a manifold to our data involves multiple steps. The key idea is that we fit a $d$-dimensional manifold to our $D$-dimensional data by minimizing the sum of squared distances between our data and the proposed manifold. An important extension over linear dimensionality reduction, i.e. principal component analysis (PCA), is that we allow our proposed manifold to preserve the underlying topological structure of our data. In a way, manifold learning reduces the dimensionality of data with an explicit focus on its topology. We note that -- although a spatial reading is certainly the most intuitive -- this topological structure is not limited to spatial abstraction, but may be extended to arbitrary dimensions of interest. This framework was pioneered as an extension of PCA to curves (HastieStuetzle1989, Tibshirani1992) and has since found a myriad of applications in higher-dimensional extensions.

Now, consider a setting where we have fit a manifold $\mathcal{M}_d$ to our $D$-dimensional data by minimizing the orthogonal distance between the data and the manifold. We consider this manifold fixed and will not touch on the fitting procedure itself. Given $\mathcal{M}_d$, for each data point $x_i$, e.g. the first observation $x_1 = [x_{11} \cdots x_{1D}]^T$, we can now define the corresponding point on $\mathcal{M}_d$, say $f(x_1) = f\left(\left[x_{11} \cdots x_{1D}\right]^T\right)$. This point minimizes the distance between $x_i$ and $f(x_i)$. We want to stress again that this procedure does not mean we are fitting the manifold to the data; we are simply retrieving the distance-minimizing projection point. We write

$$
\operatorname*{arg\,min}_{f \in \mathcal{F}} \|x_i - f(x_i)\|_2
$$

where we consider each projection function $f$ to be a member of an arbitrary function space $\mathcal{F}$. We measure distance with the Euclidean ($\ell^2$) norm.
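As a minimal sketch of this projection step, assume the fitted manifold is the unit sphere centered at the origin, as in the spherical example of this paper. In that special case the distance-minimizing point has the closed form $x_i / \|x_i\|_2$, so no numerical optimization is needed. The function names and the sample point below are hypothetical and serve only as illustration.

```python
import math


def project_to_unit_sphere(x):
    """Return the point on the unit sphere closest to x in Euclidean distance."""
    norm = math.sqrt(sum(c * c for c in x))
    if norm == 0.0:
        # The origin is equidistant from every point of the sphere,
        # so the projection is not unique there.
        raise ValueError("projection of the origin onto the sphere is not unique")
    return [c / norm for c in x]


def euclidean_distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Hypothetical noisy observation in R^3 (illustration only).
x_i = [0.0, 3.0, 4.0]
f_x_i = project_to_unit_sphere(x_i)
min_distance = euclidean_distance(x_i, f_x_i)
```

For this sample point the projection is approximately $(0, 0.6, 0.8)$ and the minimized distance is $\|x_i\|_2 - 1 = 4$, up to floating-point rounding. The origin is excluded because every point of the sphere is equally close to it, which is one instance of a non-unique projection.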

If there exists only one projection point $f(x_i)$ for every $x_i$, every $f \in \mathcal{F}$ is a one-to-one and onto ("bijective") mapping. We find it interesting to highlight that the projection functions in the PCA algorithm are inherently bijective, and for inferential purposes, this is a property that is often taken for granted ^[This is because a principal axis is a straight line, i.e. neither convex nor concave. Although a point's distance to its projection may be co-minimal across $\leq 2$ dimensions, it only has one $f(x_i)$ in one principal axis.]. In a manifold learning framework, this is no longer the case. Although this is highly interesting, given the limited scope of this paper we shall treat this scenario as an edge case, reserving rigorous treatment for the blissful times that follow the author's qualifying exam.

Now, given the data $[\{x_i\}_{i=1}^n, \{f(x_i)\}_{i=1}^n]$, we can reparameterize our space into polar (hyperspherical) coordinates to obtain a vector representation of the collection $f \in \mathcal{F}$. Converting a Cartesian parameterization of a space with $D$ dimensions into polar coordinates yields the $D$-dimensional vector $[r_i^*; \theta_{i, 1}, \ldots, \theta_{i, D-1}]$, i.e. one radius $r_i^*$ and a set of $D-1$ angles suffice to characterize each point $x_i$'s location in space.
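The conversion can be sketched as follows. The angle convention is an assumption, since several are in common use: here the first $D-2$ angles lie in $[0, \pi]$ and are measured from the leading coordinate axes, while the last angle lies in $(-\pi, \pi]$ and carries the sign information. The function names are hypothetical, and the inverse map is included only to check the round trip.

```python
import math


def to_hyperspherical(x):
    """Convert a Cartesian vector of length D >= 2 into (r, [theta_1, ..., theta_{D-1}])."""
    D = len(x)
    if D < 2:
        raise ValueError("need at least two dimensions")
    r = math.sqrt(sum(c * c for c in x))
    angles = []
    for k in range(D - 2):
        # Angle between coordinate axis k and the point, in [0, pi].
        tail = math.sqrt(sum(c * c for c in x[k + 1:]))
        angles.append(math.atan2(tail, x[k]))
    # The last angle lies in (-pi, pi] and fixes the sign of the final coordinate.
    angles.append(math.atan2(x[-1], x[-2]))
    return r, angles


def from_hyperspherical(r, angles):
    """Inverse map: rebuild the Cartesian vector from a radius and D-1 angles."""
    x = []
    sin_prod = 1.0
    for theta in angles:
        x.append(r * sin_prod * math.cos(theta))
        sin_prod *= math.sin(theta)
    x.append(r * sin_prod)
    return x


# Hypothetical point in R^3: one radius and two angles describe it.
r_i, theta_i = to_hyperspherical([1.0, 1.0, math.sqrt(2.0)])
```

For this sample point the radius is $2$ and the first angle is $\pi/3$, up to floating-point rounding. Under a different convention, for example measuring the polar angle from the third axis in $\mathbb{R}^3$, the angles would differ while the radius would not.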

Recognize that the parameter $r^*$ is not random. This is because it is simply the result of our previous distance-minimizing projection procedure. Usually, polar parameterizations assume that all angles and radii are centered at the origin. Luckily, simple vector addition and subtraction readily generalize our parameterizations in space. For instance, to obtain the vector from the point $x_i$ to $f(x_i)$, we simply compute $f(x_i) - x_i$. It is important that the issue of defining the origin is explicitly clarified when dealing with directional information. For simplicity, we will henceforth consider data centered at the origin.
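As a worked illustration, and not a general result, assume that the manifold is the unit sphere centered at the origin and that $x_i \neq 0$. If $r_i^*$ is read as the minimized projection distance, the offset vector and its length then have closed forms:

$$
f(x_i) - x_i = \left(\frac{1}{\|x_i\|_2} - 1\right) x_i, \qquad r_i^* = \|f(x_i) - x_i\|_2 = \bigl|\, 1 - \|x_i\|_2 \,\bigr|
$$

The offset is a scalar multiple of $x_i$: it points away from the origin when $\|x_i\|_2 < 1$ and towards it when $\|x_i\|_2 > 1$. In this special case the direction of the offset is fully determined by the angles of $x_i$ itself, which is why fixing the origin matters before any directional quantity is interpreted.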

![Noisy data projected on $\mathcal{M}_3$; the unit sphere in $\mathbb{R}^3$](../fig/data-with-noise.png){#fig-noisy-data width=75%}