Derivation of the equations for Gaussian Process Factor Analysis (GPFA), as described in: Yu, B.M., Cunningham, J.P., Santhanam, G., Ryu, S., Shenoy, K.V., Sahani, M., 2008. Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity, in: Advances in Neural Information Processing Systems. Curran Associates, Inc.

## Notation

- $T$  Number of time steps
- $N$  Number of variables observed
- $K$  Number of dimensions of latent variable
- $x_{:,t}$ vector of all the $N$ variables at time $t$, $\in \mathbb{R}^N$
- $x_{n,:}$ vector of the $n$th variable at all $T$ time steps, $\in \mathbb{R}^T$
- $x_{n,t}$ $n$th variable at time $t$, $\in \mathbb{R}$
- $X = [x_{:,1}, ... x_{:, T}]$ Matrix with all the $N$ variables at all time steps $T$, $\in \mathbb{R}^{N \times T}$
- $t$ time step
- $z_{k, t}$ $k$th latent variable at time $t$, $\in \mathbb{R}$
- $Z = [z_{:,1}, ... z_{:,T}]$ Matrix with all the $K$ latent variables at all time steps $T$, $\in \mathbb{R}^{K \times T}$



## Gaussian Processes Factor Analysis model

We model the observed variables at each time step as a linear transformation of the latent variable plus noise:
 $$x_{:,t} = \Lambda z_{:,t} + \epsilon_t $$
where:

- $\Lambda$ Matrix for linear transformation of $z$ into $x$, $\in \mathbb{R}^{N \times K}$
- $\epsilon_t$ Random noise at time $t$, $\in \mathbb{R}^N$. The noise is independent between different time steps and independent of $z$:
    - $p(\epsilon_t) = \mathcal{N}(0, \psi)$ distribution of the noise
    - $\psi$ covariance matrix of the noise. It is a diagonal matrix, $\in \mathbb{R}^{N \times N}$

The formulas assume that $\langle X \rangle = 0$ (if $X$ doesn't have zero mean, it can easily be centered by subtracting the mean).
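As a minimal sketch of the centering step, assuming $X$ is stored as a NumPy array of shape $(N, T)$ and that the mean is taken per variable over time (both are assumptions, and the values of `X` are made up for illustration):

```python
import numpy as np

# Hypothetical data: N = 3 variables observed at T = 4 time steps.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 10.0, 12.0, 12.0],
              [-1.0, 0.0, 1.0, 0.0]])

# Subtract each variable's mean over time so that every row has zero mean.
X_centered = X - X.mean(axis=1, keepdims=True)
```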

The latent variable $z$ is modelled over time using a Gaussian Process, one process for each dimension $k$.
For simplicity we assume that $z$ has only one dimension ($K = 1$).

$$p(Z) = \mathcal{GP}(0, k(t, t \prime))$$
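To make the generative model concrete, here is a small sketch that draws one sample of $Z$ and $X$ for $K = 1$. The squared-exponential kernel, its lengthscale, the jitter term and all parameter values are assumptions chosen for illustration; the notes above do not fix a particular kernel.

```python
import numpy as np

def rbf_kernel(times, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix k(t, t') for a 1-D array of times."""
    diff = times[:, None] - times[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

rng = np.random.default_rng(0)
N, T = 3, 50                          # observed variables, time steps (K = 1)
times = np.arange(T, dtype=float)

Lambda = rng.normal(size=(N, 1))      # loading matrix, shape (N, K)
psi = np.diag([0.1, 0.2, 0.3])        # diagonal noise covariance, shape (N, N)

# Draw the latent trajectory z ~ GP(0, k).
# The small jitter on the diagonal keeps the Cholesky factorization stable.
K_t = rbf_kernel(times, lengthscale=5.0)
L = np.linalg.cholesky(K_t + 1e-6 * np.eye(T))
z = (L @ rng.normal(size=T))[None, :]             # shape (K, T) = (1, T)

# Independent noise at each time step, then x_{:,t} = Lambda z_{:,t} + eps_t.
eps = np.sqrt(np.diag(psi))[:, None] * rng.normal(size=(N, T))
X = Lambda @ z + eps                              # shape (N, T)
```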

## Derivation of $p(X)$

$p(x_{:,t}|z_{:,t}) = \mathcal{N}(\Lambda z_{:,t}, \psi)$ follows directly from the model, and $p(x_{:,t})$ and $p(z_{:,t}|x_{:,t})$ can then be obtained using the standard rules for Gaussian marginals and conditionals.

However, what is interesting is to have the analytical form of $p(X)$, which models both the relation between $z$ and $x$ and the relation between $z$ and $t$. The likelihood $p(X)$ can then be maximized to obtain the parameters of the latent transformation and the kernel hyperparameters.

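$p(X)$ is a Gaussian distribution over all $N \times T$ entries of $X$, so its covariance matrix consists of $T \times T$ blocks of size $N \times N$, one for each pair of time steps $(t, t \prime)$.

$p(X) = \mathcal{N}(\langle x_{:,:} \rangle, \langle x_{:,:}x_{:, :}^T \rangle)$

where $x_{:,:}$ denotes the columns of $X$ stacked into a single vector.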




### Diagonal of the covariance matrix

Let's start with the diagonal blocks of the covariance matrix ($t = t \prime$)

$\langle x_{:,t}x_{:,t}^T \rangle = \langle (\Lambda z_{:,t} + \epsilon_{t})(\Lambda z_{:,t} + \epsilon_{t})^T \rangle$

by multiplying out the two factors we obtain

$\langle x_{:,t}x_{:,t}^T \rangle = \langle \Lambda z_{:,t} z_{:,t}^T \Lambda^T + \Lambda z_{:,t} \epsilon_{t}^T + \epsilon_t z_{:,t}^T \Lambda^T  + \epsilon_t \epsilon_{t}^T \rangle$

Then using the properties of the [expectation](https://www.statlect.com/fundamentals-of-probability/expected-value-properties) we can: 1) transform the expectation of a sum into a sum of expectations 2) move $\Lambda$ out of the expectations, as it is a constant (it is not random and doesn't depend on $t$) 3) factorize $\langle z_{:,t} \epsilon_t^T \rangle = \langle z_{:,t} \rangle \langle \epsilon_t^T \rangle$ because $z_{:,t}$ and $\epsilon_t$ are independent random variables

$\langle x_{:,t}x_{:,t}^T \rangle = \Lambda \langle z_{:,t} z_{:,t}^T\rangle \Lambda^T + \Lambda \langle z_{:,t} \rangle \langle \epsilon_{t}^T \rangle + \langle \epsilon_{t} \rangle \langle z_{:,t}^T  \rangle \Lambda^T + \langle \epsilon_t \epsilon_{t}^T \rangle$

Then considering that $\langle z_{:,t} \rangle = 0$ and that $\langle \epsilon_t \rangle = 0$
the expression can be simplified as:

$\langle x_{:,t}x_{:,t}^T \rangle = \Lambda \langle z_{:,t} z_{:,t}^T\rangle \Lambda^T + \langle \epsilon_t \epsilon_{t}^T \rangle$

Then substituting 1) $\langle z_{:,t} z_{:,t}^T\rangle = k(t, t)$, as that is the variance of the Gaussian process at time $t$ (a scalar, since $K = 1$) 2) $\langle \epsilon_t \epsilon_t^T \rangle= \psi$

$\langle x_{:,t}x_{:,t}^T \rangle = \Lambda k(t,t)  \Lambda^T + \psi$
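The diagonal block can be sanity-checked numerically. The sketch below is a Monte Carlo check, not part of the derivation: it samples $z_{:,t}$ and $\epsilon_t$ independently for a single time step and compares the empirical second moment of $x_{:,t}$ with $\Lambda k(t,t) \Lambda^T + \psi$. The values of `Lambda`, `psi` and `k_tt` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, S = 3, 200_000                     # observed variables, Monte Carlo samples

Lambda = np.array([[1.0], [-0.5], [2.0]])    # shape (N, K) with K = 1
psi = np.diag([0.1, 0.2, 0.3])               # diagonal noise covariance
k_tt = 1.5                                   # assumed value of k(t, t)

# Sample z_t ~ N(0, k(t, t)) and eps_t ~ N(0, psi) independently.
z = np.sqrt(k_tt) * rng.normal(size=(1, S))
eps = np.sqrt(np.diag(psi))[:, None] * rng.normal(size=(N, S))
x = Lambda @ z + eps

empirical = x @ x.T / S                          # estimate of <x_t x_t^T>
analytic = k_tt * (Lambda @ Lambda.T) + psi      # Lambda k(t,t) Lambda^T + psi
max_abs_error = np.abs(empirical - analytic).max()
```

With this many samples `max_abs_error` should be small compared with the entries of `analytic`, and it shrinks further as `S` grows.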

### Other elements

The off-diagonal blocks ($t \ne t \prime$) are obtained with the same steps as above.

$\langle x_{:,t}x_{:,t \prime}^T \rangle = \langle (\Lambda z_{:,t} + \epsilon_{t})(\Lambda z_{:,t \prime} + \epsilon_{t \prime})^T \rangle$

by multiplying out the two factors we obtain

$\langle x_{:,t}x_{:,t \prime}^T \rangle = \langle \Lambda z_{:,t} z_{:,t \prime}^T \Lambda^T + \Lambda z_{:,t} \epsilon_{t \prime}^T + \epsilon_t z_{:,t \prime}^T \Lambda^T  + \epsilon_t \epsilon_{t \prime}^T \rangle$

Then using the properties of the [expectation](https://www.statlect.com/fundamentals-of-probability/expected-value-properties) we can: 1) transform the expectation of a sum into a sum of expectations 2) move $\Lambda$ out of the expectations, as it is a constant (it is not random and doesn't depend on $t$) 3) factorize $\langle z_{:,t} \epsilon_{t \prime}^T \rangle = \langle z_{:,t} \rangle \langle \epsilon_{t \prime}^T \rangle$ because $z_{:,t}$ and $\epsilon_{t \prime}$ are independent random variables

$\langle x_{:,t}x_{:,t \prime}^T \rangle = \Lambda \langle z_{:,t} z_{:,t \prime}^T\rangle \Lambda^T + \Lambda \langle z_{:,t} \rangle \langle \epsilon_{t \prime}^T \rangle + \langle \epsilon_{t} \rangle \langle z_{:,t \prime}^T  \rangle \Lambda^T + \langle \epsilon_t \epsilon_{t \prime}^T \rangle$

Then considering that $\langle z_{:,t} \rangle = 0$ and that $\langle \epsilon_t \rangle = 0$
the expression can be simplified as:

$\langle x_{:,t}x_{:,t \prime}^T \rangle = \Lambda \langle z_{:,t} z_{:,t \prime}^T\rangle \Lambda^T + \langle \epsilon_t \epsilon_{t \prime}^T \rangle$

Then substituting 1) $\langle z_{:,t} z_{:,t \prime}^T\rangle = k(t,t \prime)$, as that is the covariance of the Gaussian process between times $t$ and $t \prime$. 2) $\langle \epsilon_t \epsilon_{t \prime}^T \rangle = \langle \epsilon_t \rangle \langle \epsilon_{t \prime}^T \rangle = 0$, as $\epsilon_t$ and $\epsilon_{t \prime}$ are independent and $\langle \epsilon_t \rangle = 0$

$\langle x_{:,t}x_{:,t \prime}^T \rangle = \Lambda k(t,t \prime) \Lambda^T$
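The same kind of Monte Carlo check works for an off-diagonal block. In this sketch the pair $(z_{:,t}, z_{:,t \prime})$ is drawn from a $2 \times 2$ Gaussian whose covariance plays the role of the kernel values; that matrix `k`, `Lambda` and `psi_diag` are assumed values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
N, S = 3, 200_000                     # observed variables, Monte Carlo samples

Lambda = np.array([[1.0], [-0.5], [2.0]])    # shape (N, K) with K = 1
psi_diag = np.array([0.1, 0.2, 0.3])         # diagonal of psi

# Assumed GP covariance of (z_t, z_t'): [[k(t,t), k(t,t')], [k(t',t), k(t',t')]].
k = np.array([[1.0, 0.6],
              [0.6, 1.0]])
z = np.linalg.cholesky(k) @ rng.normal(size=(2, S))   # row 0: z_t, row 1: z_t'

# Independent noise draws at the two time steps.
eps_t = np.sqrt(psi_diag)[:, None] * rng.normal(size=(N, S))
eps_tp = np.sqrt(psi_diag)[:, None] * rng.normal(size=(N, S))
x_t = Lambda @ z[0:1] + eps_t
x_tp = Lambda @ z[1:2] + eps_tp

empirical = x_t @ x_tp.T / S                   # estimate of <x_t x_t'^T>
analytic = k[0, 1] * (Lambda @ Lambda.T)       # Lambda k(t,t') Lambda^T, no psi term
max_abs_error = np.abs(empirical - analytic).max()
```

The noise covariance does not appear in `analytic` because the noise at different time steps is independent.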


### Result
Therefore $p(X)$ can be written as:

$$p(X) = \mathcal{N}\left (0 , {\begin{array}{cccc}
    \Lambda k(t_1,t_1) \Lambda^T + \psi & \Lambda k(t_{1},t_{2}) \Lambda^T & \cdots & \Lambda k(t_1 ,t_T) \Lambda^T\\
    \Lambda k(t_{2},t_{1}) \Lambda^T &  \Lambda k(t_{2},t_{2}) \Lambda^T + \psi & \cdots & \Lambda k(t_{2},t_{T}) \Lambda^T\\
    \vdots & \vdots & \ddots & \vdots\\
    \Lambda k(t_{T}, t_{1}) \Lambda^T & \Lambda k(t_{T},t_{2}) \Lambda^T & \cdots & \Lambda k(t_{T},t_{T}) \Lambda^T + \psi\\
    \end{array} } \right )$$

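This can also be described as a Gaussian Process with a "special" kernel. A sketch of why the resulting matrix is a valid covariance: since $K = 1$, $k(t, t \prime)$ is a scalar and the covariance matrix above equals $K_t \otimes \Lambda \Lambda^T + I_T \otimes \psi$, where $K_t$ is the $T \times T$ matrix with entries $k(t_i, t_j)$ and $I_T$ is the $T \times T$ identity. Both terms are positive semi-definite (the Kronecker product of two positive semi-definite matrices is positive semi-definite, and $\psi$ is diagonal with non-negative entries), so their sum is positive semi-definite as well.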

If we define the new kernel as $$K(t,t \prime) = \Lambda k(t,t \prime) \Lambda^T + P(t,t \prime)$$

where $P(t, t \prime) = \begin{cases}
                \psi & \text{if } t = t \prime \\
                0    & \text{if } t \ne t \prime \\
            \end{cases}$

(note: should prove that this is a valid kernel)

thus

$$ p(X) = \mathcal{GP}(0, K(t, t\prime))$$
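As an illustration of the result, the sketch below assembles the full covariance matrix of $p(X)$ for $K = 1$ and checks numerically that it is symmetric and positive definite. It relies on stacking the columns $x_{:,1}, \dots, x_{:,T}$ into one vector, so that block $(t, t \prime)$ of the matrix is $\Lambda k(t,t \prime) \Lambda^T + P(t,t \prime)$. The squared-exponential kernel and all parameter values are assumptions, and a numerical check for one parameter setting is not a proof.

```python
import numpy as np

def rbf_kernel(times, lengthscale=1.0):
    """Squared-exponential kernel matrix k(t, t') for a 1-D array of times."""
    diff = times[:, None] - times[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

N, T = 3, 5
times = np.arange(T, dtype=float)
Lambda = np.array([[1.0], [-0.5], [2.0]])    # shape (N, K) with K = 1
psi = np.diag([0.1, 0.2, 0.3])               # diagonal noise covariance

K_t = rbf_kernel(times, lengthscale=2.0)     # (T, T) matrix of k(t, t')

# np.kron(A, B) places A[i, j] * B in block (i, j), so block (t, t') is
# k(t, t') * Lambda Lambda^T, and psi is added on the diagonal blocks only.
cov = np.kron(K_t, Lambda @ Lambda.T) + np.kron(np.eye(T), psi)   # (N*T, N*T)

# One diagonal and one off-diagonal block, to compare with the derived formulas.
block_00 = cov[0:N, 0:N]
block_01 = cov[0:N, N:2 * N]

# Smallest eigenvalue of the symmetric matrix; positive means positive definite.
min_eig = np.linalg.eigvalsh(cov).min()
```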

## Next steps

- The parameters of the final GP ($\Lambda, \psi$ and the kernel hyperparameters) can be fitted by maximizing the likelihood $p(X)$, for example with gradient descent on the negative log-likelihood
- An implementation challenge is that at each time step the variable predicted by the GP, $x_{:,t}$, has multiple dimensions ($N$), whereas a standard GP has a scalar output
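One possible way to handle the multi-dimensional output, sketched here under assumptions, is to stack the columns of $X$ into a single vector of length $N \times T$ and evaluate the log-likelihood of an ordinary multivariate Gaussian. The function name `gpfa_log_likelihood`, the kernel and the parameter values are hypothetical, and the sketch only covers $K = 1$. It builds the full $NT \times NT$ covariance, which is only practical for small $N$ and $T$.

```python
import numpy as np

def gpfa_log_likelihood(X, K_t, Lambda, psi):
    """Log p(X) for zero-mean X of shape (N, T) with a single latent (K = 1)."""
    N, T = X.shape
    cov = np.kron(K_t, Lambda @ Lambda.T) + np.kron(np.eye(T), psi)
    x = X.reshape(-1, order="F")           # stack x_{:,1}, ..., x_{:,T}
    L = np.linalg.cholesky(cov)            # cov = L L^T
    alpha = np.linalg.solve(L, x)          # alpha^T alpha = x^T cov^{-1} x
    log_det = 2.0 * np.log(np.diag(L)).sum()
    return -0.5 * (alpha @ alpha + log_det + N * T * np.log(2.0 * np.pi))

# Example with made-up data and parameters.
rng = np.random.default_rng(0)
N, T = 3, 6
times = np.arange(T, dtype=float)
K_t = np.exp(-0.5 * ((times[:, None] - times[None, :]) / 2.0) ** 2)
Lambda = rng.normal(size=(N, 1))
psi = np.diag([0.1, 0.2, 0.3])
X = rng.normal(size=(N, T))
X = X - X.mean(axis=1, keepdims=True)      # center each variable over time

log_lik = gpfa_log_likelihood(X, K_t, Lambda, psi)
```

The negative of this value is the objective that a gradient-based optimizer would minimize with respect to $\Lambda$, $\psi$ and the kernel hyperparameters.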