# Introduction and Background

In [the previous notebook](1_LinearMethods.ipynb), we showed linear methods for learning atomic structures. In this notebook, we combine these supervised (LR) and unsupervised (PCA) models into one. 

As before, for each model, we first go step-by-step through the derivation, with equations, embedded links, and citations supplied where useful. At the end of the notebook, we demonstrate the `PCovR` class for the model, which is found in the skcosmo module and contains all necessary functions.

In [None]:
#!/usr/bin/env python3

import sys

# Maths things
import numpy as np

# Plotting
import matplotlib.pyplot as plt

# Local Utilities for Notebook
sys.path.append("../")
from utilities.general import load_variables
from utilities.plotting import (
    plot_projection,
    plot_regression,
    check_mirrors,
    get_cmaps,
)
from sklearn.linear_model import Ridge
from sklearn.decomposition import PCA

from skcosmo.decomposition import PCovR


cmaps = get_cmaps()
plt.style.use("../utilities/kernel_pcovr.mplstyle")
dbl_fig = (2 * plt.rcParams["figure.figsize"][0], plt.rcParams["figure.figsize"][1])

First, we must load the data. For a step-by-step explanation of this, please see [Importing Data](X_ImportingData.ipynb).


In [None]:
var_dict = load_variables()
locals().update(var_dict)

# Fundamentals of Principal Covariates Regression (PCovR)

## Constructing the Loss and Similarity Functions

Principal Covariates Regression (PCovR) is a mathematical model that combines principal component analysis and linear regression (or more specifically, 
[principal components regression](https://en.wikipedia.org/wiki/Principal_component_regression)), with a parameter $\alpha$ that is used to 
tune the relative weight of each of the two tasks and their corresponding losses [(de Jong, 1992)](https://www.doi.org/10.1016/0169-7439(92)80100-I), 
[(Vervloet, 2015)](https://www.doi.org/10.18637/jss.v065.i08). 

\begin{equation}
\ell =
\alpha {\left\lVert \mathbf{X} - \mathbf{X}\mathbf{P}_{XT}\mathbf{P}_{TX}\right\rVert^2}
 +
(1-\alpha){\left\lVert \mathbf{Y} - \mathbf{X}\mathbf{P}_{XT}\mathbf{P}_{TY}\right\rVert^2}.
\end{equation}

It is easier to minimize this by looking for a projection $\tilde{\mathbf{T}}$ in a latent space if we require $\tilde{\mathbf{T}}^T\tilde{\mathbf{T}} = \mathbf{I}$, i.e. for the column vectors in $\tilde{\mathbf{T}}$ to be orthonormal. 

\begin{equation}
\ell =
\alpha {\left\lVert \mathbf{X} - \mathbf{X}\mathbf{P}_{X\tilde{T}}\mathbf{P}_{\tilde{T}X}\right\rVert^2}
 +
(1-\alpha){\left\lVert \mathbf{Y} - \mathbf{X}\mathbf{P}_{X\tilde{T}}\mathbf{P}_{\tilde{T}Y}\right\rVert^2}.
\end{equation}

where $\tilde{\mathbf{T}}$ is the [whitened](https://en.wikipedia.org/wiki/Whitening_transformation) version of our earlier projection $\mathbf{T}$.

By definition, $\mathbf{X}\mathbf{P}_{X\tilde{T}} = \tilde{\mathbf{T}}$. Due to orthonormality, the definition $\tilde{\mathbf{T}}\mathbf{P}_{\tilde{T}X} = \mathbf{X}$ implies that $\mathbf{P}_{\tilde{T}X} = \tilde{\mathbf{T}}^T
 \mathbf{X}$. Similarly, we find  $\mathbf{P}_{\tilde{T}Y} = \tilde{\mathbf{T}}^T\mathbf{Y}$, leading to

\begin{equation}
    \ell = \alpha\lVert\mathbf{X} - \tilde{\mathbf{T}}\tilde{\mathbf{T}}^T\mathbf{X}\rVert^2 + (1 - \alpha)\lVert\mathbf{Y} - \tilde{\mathbf{T}}\tilde{\mathbf{T}}^T\mathbf{Y}\rVert^2.
\end{equation}

**Note**: if the features and the properties were not normalized, it would be advisable to do so here, by dividing the first term by ${\lVert \mathbf{X} \rVert^2}$ and the second by ${\lVert \mathbf{Y} \rVert^2}$, to make sure that the two components are compared on an equal footing, without introducing a dependence on their absolute magnitude.
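
A minimal sketch of that normalization on synthetic data. The arrays, their sizes, and the helper `center_and_scale` are illustrative assumptions, not part of the notebook's utilities, and the centering step is assumed here because it is usual for PCA-like methods. Dividing each matrix by its Frobenius norm is equivalent to dividing the corresponding loss term by the squared norm.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8)) * 100.0  # synthetic features on a large scale
Y = rng.normal(size=(50, 2)) * 0.01   # synthetic properties on a small scale


def center_and_scale(A):
    """Hypothetical helper: center the columns of A, then divide by its Frobenius norm."""
    A = A - A.mean(axis=0)
    return A / np.linalg.norm(A)


Xn = center_and_scale(X)
Yn = center_and_scale(Y)

# After scaling, both squared norms equal 1, so alpha weights the two
# loss terms independently of the original units of X and Y.
norm_X = np.linalg.norm(Xn) ** 2
norm_Y = np.linalg.norm(Yn) ** 2
```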


Just like with [PCA](1_LinearMethods.ipynb), instead of minimizing loss, we maximize the similarity measure, leading to:

\begin{equation}
\rho = \operatorname{Tr}\left(\alpha \cdot \tilde{\mathbf{T}}\tilde{\mathbf{T}}^T\mathbf{X}\mathbf{X}^T + \left(1-\alpha\right)\tilde{\mathbf{T}}\tilde{\mathbf{T}}^T\mathbf{Y}\mathbf{Y}^T\right),
\end{equation}

where $\rho$ is obtained from $\ell$ by exploiting the invariance of the trace to circular permutations, and dropping constant terms. 
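
The relation between $\ell$ and $\rho$ can be checked numerically. The sketch below uses synthetic data and an arbitrary orthonormal $\tilde{\mathbf{T}}$ (both are assumptions made for illustration) and confirms that $\ell = \alpha\lVert\mathbf{X}\rVert^2 + (1-\alpha)\lVert\mathbf{Y}\rVert^2 - \rho$, so minimizing the loss and maximizing the similarity are the same problem.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
Y = rng.normal(size=(40, 2))
alpha = 0.3

# Any matrix with orthonormal columns will do; take one from a QR factorization.
T, _ = np.linalg.qr(rng.normal(size=(40, 3)))
proj = T @ T.T

loss = (
    alpha * np.linalg.norm(X - proj @ X) ** 2
    + (1 - alpha) * np.linalg.norm(Y - proj @ Y) ** 2
)
rho = np.trace(alpha * proj @ X @ X.T + (1 - alpha) * proj @ Y @ Y.T)

# The terms dropped as "constant" do not depend on T.
const = alpha * np.linalg.norm(X) ** 2 + (1 - alpha) * np.linalg.norm(Y) ** 2
gap = abs(loss - (const - rho))
```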

## Combining X- and Y-Space

An important detail to consider is that we are looking for an approximation of the properties based on a reduced-dimensional latent space that depends only on the  $\mathbf{X}$ features. In order to avoid considering components of the properties that cannot be represented even in the full feature space, we must first project our properties onto $\mathbf{X}$.

For this, we use the least-squares approximation of $\mathbf{Y}$ as found in [linear regression](1_LinearMethods.ipynb).

\begin{equation}
\hat{\mathbf{Y}} = \mathbf{X}\mathbf{P}_{XY} = \mathbf{X}\left(\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I}\right)^{-1}\mathbf{X}^T \mathbf{Y}
\end{equation}

where $\lambda$ is the regularization parameter.

In [None]:
regularization = 1e-6

lr = Ridge(alpha=regularization)
lr.fit(X_train, Y_train)
Yhat_train = lr.predict(X_train).reshape((-1, Y_train.shape[1]))

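When $\lambda$ is small, $\operatorname{Tr}(\tilde{\mathbf{T}}\tilde{\mathbf{T}}^T\mathbf{Y}\mathbf{Y}^T) \approx \operatorname{Tr}(\tilde{\mathbf{T}}\tilde{\mathbf{T}}^T\mathbf{\hat{Y}}\mathbf{\hat{Y}}^T)$, because $\tilde{\mathbf{T}}$ lies in the column space of $\mathbf{X}$ and $\hat{\mathbf{Y}}$ is the projection of $\mathbf{Y}$ onto that space, so we write the similarity function as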

\begin{equation}
\rho = \operatorname{Tr}\left(\alpha \cdot \tilde{\mathbf{T}}\tilde{\mathbf{T}}^T\mathbf{X}\mathbf{X}^T + \left(1-\alpha\right)\tilde{\mathbf{T}}\tilde{\mathbf{T}}^T\mathbf{\hat{Y}}\mathbf{\hat{Y}}^T\right),
\end{equation}
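
The replacement of $\mathbf{Y}$ by $\hat{\mathbf{Y}}$ can be verified on a small example. This sketch assumes synthetic data and picks $\tilde{\mathbf{T}}$ as the leading left singular vectors of $\mathbf{X}$, which is just one convenient orthonormal matrix inside the column space of $\mathbf{X}$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
Y = X @ rng.normal(size=(5, 2)) + rng.normal(size=(60, 2))  # noisy linear target
lam = 1e-10  # small regularization

# Ridge estimate of Y, following the closed-form expression above.
Yhat = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

# Orthonormal T inside the column space of X.
U, _, _ = np.linalg.svd(X, full_matrices=False)
T = U[:, :2]

tr_Y = np.trace(T @ T.T @ Y @ Y.T)
tr_Yhat = np.trace(T @ T.T @ Yhat @ Yhat.T)
```

The two traces agree because the part of $\mathbf{Y}$ that is orthogonal to the column space of $\mathbf{X}$ is annihilated by $\tilde{\mathbf{T}}^T$.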

## Formulations of PCovR

In the similarity measure we have the matrix product $\mathbf{XX}^T$. This matrix becomes too large to handle if $\mathbf{X}$ has many more rows (samples) than columns (features). Therefore, two formulations of PCovR have been proposed: one for cases where $n_{samples} \gg n_{features}$, and the other for cases where $n_{features} \gg n_{samples}$. The former we refer to as **feature space PCovR** (which is analogous to PCA), and we refer to the latter as **sample space PCovR** (which is analogous to MDS).
We begin by discussing sample space PCovR.

# Sample-Space PCovR

To compute PCovR using all $n_{samples}$ components of our sample space, we must combine the sample-space kernel ($\mathbf{X}\mathbf{X}^T$) and the outer product of the regressed properties ($\hat{\mathbf{Y}}\hat{\mathbf{Y}}^T$). 

We define the modified [Gram matrix](https://en.wikipedia.org/wiki/Gramian_matrix)

\begin{equation}
    \mathbf{\tilde{K}} = \alpha {\mathbf{X} \mathbf{X}^T}
    + (1 - \alpha) {\hat{\mathbf{Y}} \hat{\mathbf{Y}}^T},
\end{equation}

Since the trace of the matrix is invariant under cyclic permutation ($\operatorname{Tr}(\mathbf{A B C}) = \operatorname{Tr}(\mathbf{B C A})  = \operatorname{Tr}(\mathbf{C A B})$), we can write 
$\rho = \operatorname{Tr}\left(\tilde{\mathbf{T}}\tilde{\mathbf{T}}^T\mathbf{\tilde{K}}\right) = \operatorname{Tr}\left(\tilde{\mathbf{T}}^T\mathbf{\tilde{K}}\tilde{\mathbf{T}}\right)$, which is maximized when $\tilde{\mathbf{T}}$ contains the principal eigenvectors of $\mathbf{\tilde{K}}$ (i.e., the eigenvectors associated with the largest eigenvalues). 
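
As a quick sanity check of this statement, the sketch below (synthetic data, illustrative variable names) compares the similarity obtained from the leading eigenvectors of $\mathbf{\tilde{K}}$ with the best value found among 100 random orthonormal matrices of the same shape.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
Yhat = X @ rng.normal(size=(5, 1))  # stands in for the regressed properties
alpha, n_keep = 0.5, 2

K = alpha * X @ X.T + (1 - alpha) * Yhat @ Yhat.T

# eigh returns eigenvalues in ascending order, so reverse the columns.
w, U = np.linalg.eigh(K)
T_best = U[:, ::-1][:, :n_keep]
rho_best = np.trace(T_best.T @ K @ T_best)

# Best similarity among random orthonormal candidates.
rho_random = max(
    np.trace(Q.T @ K @ Q)
    for Q in (np.linalg.qr(rng.normal(size=(30, n_keep)))[0] for _ in range(100))
)
```

Here `rho_best` equals the sum of the `n_keep` largest eigenvalues, and no random candidate exceeds it.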

In [None]:
alpha = 0.5

K_pca = X_train @ X_train.T
K_lr = Yhat_train @ Yhat_train.T

K = (alpha * K_pca) + (1.0 - alpha) * K_lr

What follows is analogous to multi-dimensional scaling, with $\mathbf{\tilde{K}}$ acting as the Gram matrix, modified to weight structural and property correlations. 
First, we diagonalize $\mathbf{\tilde{K}}=\mathbf{U_\tilde{K}}\mathbf{\Lambda_\tilde{K}}\mathbf{U_\tilde{K}}^T$.

In [None]:
v_Kt, U_Kt = np.linalg.eigh(K)

# U_Kt/v_Kt are already sorted, but in *increasing* order, so reverse them
U_Kt = np.flip(U_Kt, axis=1)
v_Kt = np.flip(v_Kt, axis=0)

U_Kt = U_Kt[:,v_Kt>0]
v_Kt = v_Kt[v_Kt>0]

We then set $\tilde{\mathbf{T}} = \mathbf{\hat{U}_\tilde{K}}$, where $\mathbf{\hat{U}_\tilde{K}}$ contains the first $n_{PCA}$ components of $\mathbf{U_\tilde{K}}$ (`n_PC` in the code).

While it is useful to construct our loss and similarity using the whitened matrix $\tilde{\mathbf{T}}$, in doing so we lose information on the relative variance of the different components of the input space. To recover the MDS limit when $\alpha=1$, we build projections by "de-whitening" the principal eigenvectors by multiplying by a factor of $\hat{\mathbf{\Lambda}}_\mathbf{\tilde{K}}^{1/2}$, i.e.
$\mathbf{T} = \tilde{\mathbf{T}} \hat{\mathbf{\Lambda}}_\mathbf{\tilde{K}}^{1/2} = \mathbf{X} \mathbf{P}_{X\tilde{T}}\hat{\mathbf{\Lambda}}_\mathbf{\tilde{K}}^{1/2}$.

In [None]:
T = U_Kt[:, :n_PC] @ np.diagflat(np.sqrt(v_Kt[:n_PC]))
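
To illustrate the $\alpha=1$ limit, the following sketch (synthetic, centered data; all names are illustrative) builds the de-whitened projection from the Gram matrix and compares it with the PCA scores obtained from an SVD of $\mathbf{X}$. The two agree up to the sign of each component.

```python
import numpy as np

rng = np.random.default_rng(0)
# Scale the columns so that the leading eigenvalues are well separated.
X = rng.normal(size=(30, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
X -= X.mean(axis=0)
n_keep = 2

K = X @ X.T  # modified Gram matrix for alpha = 1
w, U = np.linalg.eigh(K)
w, U = w[::-1][:n_keep], U[:, ::-1][:, :n_keep]
T = U * np.sqrt(w)  # same as U @ diag(sqrt(w))

# PCA scores: projection of X onto its leading right singular vectors.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
T_pca = X @ Vt[:n_keep].T

# Eigenvector signs are arbitrary, so align them before comparing.
signs = np.sign(np.sum(T * T_pca, axis=0))
max_diff = np.abs(T * signs - T_pca).max()
gram = T.T @ T  # equals diag(w), i.e. the retained eigenvalues
```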

## Determining the Projector $\mathbf{P}_{XT}$

We also derive a projector from $X$ space directly to the latent space in the form of $\mathbf{T} = \mathbf{XP}_{XT}$ (and vice versa). For that we need to solve 
\begin{equation}
\mathbf{X}\mathbf{P}_{XT} = \tilde{\mathbf{T}} \hat{\mathbf{\Lambda}}_\mathbf{\tilde{K}}^{1/2} =  \mathbf{\tilde{K}}\mathbf{\hat{U}_\tilde{K}}\hat{\mathbf{\Lambda}}_\mathbf{\tilde{K}}^{-1/2}
\end{equation}

where we used the fact that $\tilde{\mathbf{T}}$ is built from columns of $\mathbf{\hat{U}_{\tilde{K}}}$.
Writing explicitly $\mathbf{\tilde{K}}$ and $\hat{\mathbf{Y}}$,

\begin{align}
\mathbf{X}\mathbf{P}_{XT} &= \left( 
\alpha \color{red}{\mathbf{X}} \mathbf{X}^T
+ (1 - \alpha) 
\color{red}{\mathbf{X}}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T \mathbf{Y}
\hat{\mathbf{Y}}^T
\right) \mathbf{\hat{U}_\tilde{K}}\hat{\mathbf{\Lambda}}_\mathbf{\tilde{K}}^{-1/2} \\
&= \color{red}{\mathbf{X}} \left( 
\alpha \mathbf{X}^T
+ (1 - \alpha) 
\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T \mathbf{Y}
\hat{\mathbf{Y}}^T
\right) \mathbf{\hat{U}_\tilde{K}} \hat{\mathbf{\Lambda}}_\mathbf{\tilde{K}}^{-1/2}.
\end{align}

we obtain the projector from feature to latent space $\mathbf{P}_{XT}$ explicitly 

\begin{equation}
\mathbf{P}_{XT} = \left(
\alpha{\mathbf{X}^T}
+ (1 - \alpha)
\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T \mathbf{Y}
\hat{\mathbf{Y}}^T
\right)  \mathbf{\hat{U}_\tilde{K}}\hat{\mathbf{\Lambda}}_\mathbf{\tilde{K}}^{-1/2}.
\end{equation}



In [None]:
P_lr = X_train.T @ X_train + np.eye(X_train.shape[1]) * regularization
P_lr = np.linalg.pinv(P_lr)
P_lr = (P_lr @ X_train.T @ Y_train).reshape((-1, Y_train.shape[1])) @ Yhat_train.T

P_pca = X_train.T

P = (alpha * P_pca) + (1.0 - alpha) * P_lr
PXT = P @ U_Kt[:, :n_PC] @ np.diag(1 / np.sqrt(v_Kt[:n_PC]))

We verify that $\mathbf{T}\approx \mathbf{X}\mathbf{P}_{XT}$.

In [None]:
print(np.linalg.norm(X_train @ PXT - T))

We then plot the projection. Note how the positions of points in the PCovR latent space correlate much better with the properties than they do in the PCA latent space.

In [None]:
T_pcovr_test = X_test @ PXT

In [None]:
fig, axes = plt.subplots(1, 2, figsize=dbl_fig)

ref = PCA(n_components=n_PC)
ref.fit(X_train)
t_ref = ref.transform(X_test)

plot_projection(
    Y_test,
    check_mirrors(T_pcovr_test, t_ref),
    fig=fig,
    ax=axes[0],
    title=r"PCovR (Sample Space, $\alpha={}$)".format(alpha),
    **cmaps
)
plot_projection(Y_test, t_ref, fig=fig, ax=axes[1], title="PCA", **cmaps)

fig.suptitle(
    r"These are not the same unless $\alpha = 1.0.$",
    y=0.0,
    fontsize=plt.rcParams["font.size"] + 6,
)
plt.show()

The image on the left (PCovR) is the 2D projection that is weighted by a factor of $(1-\alpha)$ towards those axes which best discern the property vector $\mathbf{Y}$. Note how the color of the points changes with their position.

The image to the right (PCA) is the 2D projection which best preserves the variance of the feature matrix $\mathbf{X}$.

## Predicting the Properties
How do we get the weights to predict the properties? We need to solve the linear regression problem in the projected space, i.e., solve for the weights $\mathbf{P}_{TY}$ such that:

\begin{equation}
\mathbf{Y_{PCovR}} = \mathbf{T} \mathbf{P}_{TY} = \mathbf{X} \mathbf{P}_{XT} \mathbf{P}_{TY}. 
\end{equation}

We determine the regression parameter $\mathbf{P}_{TY}$ by applying the usual LR expression using the projections $\mathbf{T}$.

\begin{equation}
\mathbf{P}_{TY} = (\mathbf{T}^T\mathbf{T})^{-1}\mathbf{T}^T\mathbf{Y} = 
 \hat{\mathbf{\Lambda}}_\mathbf{\tilde{K}}^{-1}\mathbf{T}^T\mathbf{Y},
\end{equation}

where the factor of $\hat{\mathbf{\Lambda}}_\mathbf{\tilde{K}}^{-1}$ arises from the fact that
$\mathbf{T}^T\mathbf{T}=\hat{\mathbf{\Lambda}}_\mathbf{\tilde{K}}^{1/2}\mathbf{\hat{U}_\tilde{K}}^T\mathbf{\hat{U}_\tilde{K}}\hat{\mathbf{\Lambda}}_\mathbf{\tilde{K}}^{1/2}=
\hat{\mathbf{\Lambda}}_\mathbf{\tilde{K}}$. Note that $\mathbf{Y}$ corresponds to the property vector for the train set.

**Note:** this differs from the expression in the [2015 paper on PCovR](https://www.doi.org/10.18637/jss.v065.i08), which uses the whitened PCA projections $\tilde{\mathbf{T}}=\mathbf{U_\tilde{K}}$ with variance 1. Here, we construct our PCA projections without this normalization, so we must divide by the eigenvalue matrix to compute regression weights.
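
The closed-form weights can be compared against a generic least-squares solver. The sketch below uses synthetic data and NumPy's `np.linalg.lstsq` as the reference; the variable names are illustrative and independent of the notebook's variables.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
Y = X @ rng.normal(size=(6, 2)) + 0.1 * rng.normal(size=(40, 2))
alpha, n_keep = 0.5, 3

# Regressed properties and modified Gram matrix.
Yhat = X @ np.linalg.lstsq(X, Y, rcond=None)[0]
K = alpha * X @ X.T + (1 - alpha) * Yhat @ Yhat.T

w, U = np.linalg.eigh(K)
w, U = w[::-1][:n_keep], U[:, ::-1][:, :n_keep]
T = U * np.sqrt(w)  # de-whitened projection

# Closed form: Lambda^{-1} T^T Y (divide each row by its eigenvalue).
PTY_closed = (T.T @ Y) / w[:, None]

# Reference: ordinary least squares of Y on T.
PTY_lstsq = np.linalg.lstsq(T, Y, rcond=None)[0]
```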

In [None]:
PTY = np.diagflat(1 / (v_Kt[:n_PC])) @ T.T @ Y_train
Y_pcovr_test = X_test @ PXT @ PTY

fig, axes = plt.subplots(1, 2, figsize=dbl_fig)

ref_lr = Ridge(alpha=regularization)
ref_lr.fit(X_train, Y_train)
yref = ref_lr.predict(X_test)

plot_regression(
    Y_test[:, 0],
    Y_pcovr_test[:, 0],
    fig=fig,
    ax=axes[0],
    title=r"PCovR (Sample Space, $\alpha={}$)".format(alpha),
    **cmaps
)
plot_regression(Y_test[:, 0], yref[:, 0], fig=fig, ax=axes[1], title="LR", **cmaps)

fig.suptitle(
    r"These become more dissimilar as $\alpha \to 1.0.$",
    y=0.01,
    fontsize=plt.rcParams["font.size"] + 6,
)
plt.show()

# Feature-Space PCovR

## The Similarity Function in Feature-Space

In the case where the number of samples is much greater than the number of features ($n_{samples} \gg n_{features}$), computing the eigenvectors of the $n_{samples} \times n_{samples}$ matrix $\mathbf{\tilde{K}}$ may be undesirable. In this case, we instead compute the eigenvectors of an $n_{features} \times n_{features}$ matrix, i.e., we work in feature space. To formulate a feature-space version of PCovR, we again consider how PCovR maximizes the similarity ($\rho$) between our projection and the original data,

\begin{equation}
\rho = \operatorname{Tr}\left(\tilde{\mathbf{T}}^T\mathbf{\tilde{K}}\tilde{\mathbf{T}}\right)
\end{equation}

and rewrite it in terms of the decomposition $\tilde{\mathbf{T}} = \mathbf{X}\mathbf{P}_{X\tilde{T}}$

\begin{align}
\rho &= \operatorname{Tr}\left(\mathbf{P}_{X\tilde{T}}^T\mathbf{X}^T\mathbf{\tilde{K}}\mathbf{X}\mathbf{P}_{X\tilde{T}}\right)
\end{align}

$\mathbf{X}^T\mathbf{\tilde{K}}\mathbf{X}$ is a $n_{features}\times n_{features}$ matrix, so why not solve for $\mathbf{P}_{X\tilde{T}}$ through an eigendecomposition of $\mathbf{X}^T\mathbf{\tilde{K}}\mathbf{X}$?

For this trick to work, we need to work with a matrix whose column vectors are orthonormal, but $\mathbf{P}_{X\tilde{T}}^T\mathbf{P}_{X\tilde{T}} \neq \mathbf{I}$! However, 
\begin{equation}
\mathbf{P}_{X\tilde{T}}^T\mathbf{C}\mathbf{P}_{X\tilde{T}} = \mathbf{P}_{X\tilde{T}}^T\mathbf{X}^T\mathbf{X}\mathbf{P}_{X\tilde{T}} = \tilde{\mathbf{T}}^T\tilde{\mathbf{T}} = \mathbf{I}
\end{equation}

In [None]:
C = X_train.T @ X_train

PXV = PXT @ np.diagflat(1.0/np.sqrt(v_Kt[:n_PC]))

print(np.linalg.norm(PXV.T @ C @ PXV - 
                     np.eye(PXV.shape[1])))

Therefore $\mathbf{C}^{1/2}\mathbf{P}_{X\tilde{T}}$ contains orthonormal column vectors! 

We calculate $\mathbf{C}^{1/2}$ and $\mathbf{C}^{-1/2}$ from the eigendecomposition $\mathbf{C} = \mathbf{U}_C \mathbf{\Lambda}_C \mathbf{U}_C^T$:

\begin{align}
& \mathbf{C}^{1/2} = \mathbf{U}_C \mathbf{\Lambda}_C^{1/2} \mathbf{U}_C^T\\
& \mathbf{C}^{-1/2} = \mathbf{U}_C \mathbf{\Lambda}_C^{-1/2} \mathbf{U}_C^T
\end{align}

In [None]:
v_C, U_C = np.linalg.eigh(C)

# U_C/v_C are already sorted, but in *increasing* order, so reverse them
U_C = np.flip(U_C, axis=1)
v_C = np.flip(v_C, axis=0)

U_C = U_C[:,v_C>0]
v_C = v_C[v_C>0]

Csqrt = U_C @ np.diagflat(np.sqrt(v_C)) @ U_C.T
iCsqrt = U_C @ np.diagflat(1.0 / np.sqrt(v_C)) @ U_C.T
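
The same construction can be wrapped in a small reusable function. The helper below, `sym_power`, is a hypothetical name introduced for illustration. It also swaps the `v_C > 0` filter for a relative threshold (an assumption, not the notebook's choice), since eigenvalues that are tiny but positive would otherwise be amplified in $\mathbf{C}^{-1/2}$.

```python
import numpy as np


def sym_power(C, power, tol=1e-12):
    """Hypothetical helper: C**power for a symmetric positive semi-definite C.

    Eigenvalues below tol * (largest eigenvalue) are discarded.
    """
    w, U = np.linalg.eigh(C)
    keep = w > tol * w.max()
    return (U[:, keep] * w[keep] ** power) @ U[:, keep].T


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
C = X.T @ X

Csqrt = sym_power(C, 0.5)
iCsqrt = sym_power(C, -0.5)
```

For a full-rank $\mathbf{C}$, `Csqrt @ Csqrt` reproduces `C` and `Csqrt @ iCsqrt` is the identity.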

We then write

\begin{align}
\rho &= \operatorname{Tr}\left(\mathbf{P}_{X\tilde{T}}^T\mathbf{C}^{1/2}\mathbf{C}^{-1/2}\mathbf{X}^T\mathbf{\tilde{K}}\mathbf{X}\mathbf{C}^{-1/2}\mathbf{C}^{1/2}\mathbf{P}_{X\tilde{T}}\right)
\end{align}

and introduce a modified covariance matrix $\tilde{\mathbf{C}}$

\begin{equation}
\tilde{\mathbf{C}} = \mathbf{C}^{-1/2}\mathbf{X}^T\mathbf{\tilde{K}}\mathbf{X}\mathbf{C}^{-1/2}
\end{equation}

In [None]:
Ct = iCsqrt @ X_train.T
Ct = Ct @ K @ Ct.T

v_Ct, U_Ct = np.linalg.eigh(Ct)
U_Ct = np.flip(U_Ct, axis=1)
v_Ct = np.flip(v_Ct, axis=0)

U_Ct = U_Ct[:,v_Ct>0]
v_Ct = v_Ct[v_Ct>0]

We verify that $\mathbf{\tilde{C}}$ and $\mathbf{\tilde{K}}$ have the same eigenvalues, as they're connected to each other by the same relation that links the covariance and the Gram matrix. As such, the blue and red lines in the figure below should be indistinguishable.

In [None]:
fig, ax = plt.subplots(1)
ax.set_yscale("log")
ax.plot(v_Kt, marker="o", c="b", label=r"$\mathbf{\tilde{K}}$")
ax.plot(v_Ct, marker="o", c="r", label=r"$\mathbf{\tilde{C}}$")
ax.set_xlabel("n")
ax.set_ylabel(r"$\lambda_n$")
ax.set_title(
    r"Eigenvalues of $\mathbf{\tilde{C}}$ and $\mathbf{\tilde{K}}$ as a function of n"
)
ax.legend()
plt.show()
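
The same comparison can be made without a plot. This sketch, on synthetic data with illustrative names, builds both matrices from scratch and compares the $n_{features}$ largest eigenvalues of $\mathbf{\tilde{K}}$ with the full spectrum of $\mathbf{\tilde{C}}$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
Y = X @ rng.normal(size=(5, 1)) + 0.1 * rng.normal(size=(40, 1))
alpha = 0.5

Yhat = X @ np.linalg.lstsq(X, Y, rcond=None)[0]
K = alpha * X @ X.T + (1 - alpha) * Yhat @ Yhat.T  # 40 x 40

# C^{-1/2} from the eigendecomposition of C = X^T X (full rank here).
w_C, U_C = np.linalg.eigh(X.T @ X)
iCsqrt = (U_C / np.sqrt(w_C)) @ U_C.T
Ct = iCsqrt @ X.T @ K @ X @ iCsqrt  # 5 x 5

# K has at most n_features non-zero eigenvalues; compare those with Ct's.
eig_K = np.linalg.eigvalsh(K)[::-1][: X.shape[1]]
eig_Ct = np.linalg.eigvalsh(Ct)[::-1]
```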

## Computing the Projectors

The similarity is maximized when the orthonormal column vectors in matrix $\mathbf{C}^{1/2}\mathbf{P}_{X\tilde{T}}$ match the principal eigenvectors of $\mathbf{\tilde{C}}$, i.e. $\mathbf{P}_{X\tilde{T}} = \mathbf{C}^{-1/2}\hat{\mathbf{U}}_{\mathbf{\tilde{C}}}$. Similar to [PCA](1_LinearMethods.ipynb), we require that $\mathbf{X}_{PCovR} = \mathbf{\tilde{T}}\mathbf{P}_{\tilde{T}X}$ approximates the portion of the feature space projected to the latent space, i.e., $\mathbf{X}_{PCovR} \mathbf{P}_{X\tilde{T}} = \mathbf{\tilde{T}}$, which implies that $\mathbf{P}_{\tilde{T}X}\mathbf{P}_{X\tilde{T}}=\mathbf{I}$ and $\mathbf{P}_{\tilde{T}X}=\hat{\mathbf{U}}_{\mathbf{\tilde{C}}}^T \mathbf{C}^{1/2}$.

In [None]:
PXV = iCsqrt @ U_Ct[:, :n_PC]

In general,
$\mathbf{P}_{X\tilde{T}} \mathbf{P}_{\tilde{T}X} = \mathbf{C}^{-1/2}\hat{\mathbf{U}}_{\mathbf{\tilde{C}}}\hat{\mathbf{U}}_{\mathbf{\tilde{C}}}^T \mathbf{C}^{1/2}$ is not a symmetric matrix, and so it is not possible to define a $\mathbf{P}_{XT}$ such that $\mathbf{P}_{TX}=\mathbf{P}_{XT}^{T}$. 
Consistent with the case of sample-space PCovR, we define 

\begin{equation}
\begin{split}
\mathbf{P}_{XT} =& \mathbf{C}^{-1/2}\hat{\mathbf{U}}_{\mathbf{\tilde{C}}}\hat{\mathbf{\Lambda}}_{\mathbf{\tilde{C}}}^{1/2} \\
\mathbf{P}_{TX} =& \hat{\mathbf{\Lambda}}_{\mathbf{\tilde{C}}}^{-1/2} \hat{\mathbf{U}}_{\mathbf{\tilde{C}}}^T  \mathbf{C}^{1/2} \\
\mathbf{P}_{TY} =&(\mathbf{T}^T\mathbf{T})^{-1}\mathbf{T}^T\mathbf{Y} \\
           =&\hat{\mathbf{\Lambda}}_{\mathbf{\tilde{C}}}^{-1/2} \hat{\mathbf{U}}_{\mathbf{\tilde{C}}}^T  \mathbf{C}^{-1/2} \mathbf{X}^T \mathbf{Y} \\
\end{split}
\label{eq:pcovr-projectors}
\end{equation}

In [None]:
PXT = PXV @ np.diagflat(np.sqrt(v_Ct[:n_PC]))
PTX = np.diagflat(1.0 / np.sqrt(v_Ct[:n_PC])) @ U_Ct[:, :n_PC].T @ Csqrt
PTY = (
    np.diagflat(1.0 / np.sqrt(v_Ct[:n_PC]))
    @ U_Ct[:, :n_PC].T
    @ iCsqrt
    @ X_train.T
    @ Y_train
)
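
As a self-contained check of these projectors, the sketch below (synthetic data, illustrative names) builds the feature-space $\mathbf{P}_{XT}$ and $\mathbf{P}_{TX}$, confirms that $\mathbf{P}_{TX}\mathbf{P}_{XT} = \mathbf{I}$, and compares the resulting projection with the sample-space one. The two should agree up to the sign of each component.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
Y = X @ rng.normal(size=(5, 1)) + 0.1 * rng.normal(size=(40, 1))
alpha, n_keep = 0.5, 2

Yhat = X @ np.linalg.lstsq(X, Y, rcond=None)[0]
K = alpha * X @ X.T + (1 - alpha) * Yhat @ Yhat.T

# Sample space: T = U_K Lambda^{1/2}
w_K, U_K = np.linalg.eigh(K)
T_sample = U_K[:, ::-1][:, :n_keep] * np.sqrt(w_K[::-1][:n_keep])

# Feature space: eigendecomposition of the modified covariance.
w_C, U_C = np.linalg.eigh(X.T @ X)
Csqrt = (U_C * np.sqrt(w_C)) @ U_C.T
iCsqrt = (U_C / np.sqrt(w_C)) @ U_C.T
w_Ct, U_Ct = np.linalg.eigh(iCsqrt @ X.T @ K @ X @ iCsqrt)
w_Ct, U_Ct = w_Ct[::-1][:n_keep], U_Ct[:, ::-1][:, :n_keep]

PXT = (iCsqrt @ U_Ct) * np.sqrt(w_Ct)    # C^{-1/2} U Lambda^{1/2}
PTX = (U_Ct / np.sqrt(w_Ct)).T @ Csqrt   # Lambda^{-1/2} U^T C^{1/2}
T_feature = X @ PXT

# Align the arbitrary eigenvector signs before comparing.
signs = np.sign(np.sum(T_sample * T_feature, axis=0))
max_diff = np.abs(T_sample * signs - T_feature).max()
```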

## Projecting the Data
Projecting and regressing in feature-space PCovR proceeds similarly to Sample Space PCovR:

In [None]:
T_fspcovr_test = X_test @ PXT

Y_fspcovr_test = T_fspcovr_test @ PTY

We compare with the Sample Space PCovR and see that the results are essentially identical:

In [None]:
fig, axes = plt.subplots(1, 2, figsize=dbl_fig)

ref = PCovR(mixing=alpha, n_components=2, space="sample")
ref.fit(X_train, Y_train)
tref = ref.transform(X_test)
yref = ref.predict(X_test)
xref = ref.inverse_transform(tref)

plot_projection(
    Y_test,
    check_mirrors(T_fspcovr_test, tref),
    fig=fig,
    ax=axes[0],
    title="PCovR (Feature Space)",
    **cmaps
)
plot_projection(
    Y_test, tref, fig=fig, ax=axes[1], title="PCovR (Sample Space)", **cmaps
)

fig, axes = plt.subplots(1, 2, figsize=dbl_fig)

plot_regression(
    Y_test[:, 0],
    Y_fspcovr_test[:, 0],
    fig=fig,
    ax=axes[0],
    title=r"PCovR (Feature Space, $\alpha={}$)".format(alpha),
    cbar=False,
    **cmaps
)
plot_regression(
    Y_test[:, 0],
    yref[:, 0],
    fig=fig,
    ax=axes[1],
    title=r"PCovR (Sample Space, $\alpha={}$)".format(alpha),
    cbar=False,
    **cmaps
)

plt.show()

If you find a mirror reflection between the upper-left and upper-right subfigures,  it is because the signs of the eigenvectors are not fixed. Their second principal components $PC_2$ may have the opposite signs.

# PCovR Performance

To determine the sensitivity of PCovR to the value of $\alpha$ and the number of PCA components, it is instructive to visualize how the projections and the different components of the loss depend on these parameters. We do so using the skcosmo classes.

In [None]:
n_alphas = 11
alphas = np.linspace(0.0, 1.0, n_alphas)
components = np.arange(0, 5, 1, dtype=int)[2::]
n_components = components.size

pcovr_calculators = np.array(
    [
        [PCovR(mixing=a, n_components=c, space="feature") for a in alphas]
        for c in components
    ]
)

for cdx, c in enumerate(components):
    for adx, a in enumerate(alphas):
        pcovr_calculators[cdx][adx].fit(X_train, Y_train)

## Comparison of PCovR Projections and Regressions

To use PCovR, it's useful to get an intuitive sense for the change in the projections and regressions across $\alpha$, and see the trade-off. Below we plot both projections and regressions for $n_{\alpha}$ different values of $\alpha$.

**Note**: Remember that in PCA-like projections, projections of the same data may sometimes be related by mirror reflection. This may also occur for the projections below.

In [None]:
n_plots = int(n_alphas ** 0.5)
scale = 3

t_ref = pcovr_calculators[0][-3].transform(X_test)
y_ref = pcovr_calculators[0][-3].predict(X_test)
x_ref = pcovr_calculators[0][-3].inverse_transform(t_ref)

pfig, pax = plt.subplots(
    n_plots,
    int(np.ceil(n_alphas / n_plots)),
    figsize=(
        scale * int(np.ceil(n_alphas / n_plots)),
        scale * n_plots,
    ),
)

rfig, rax = plt.subplots(
    n_plots,
    int(np.ceil(n_alphas / n_plots)),
    figsize=(
        scale * int(np.ceil(n_alphas / n_plots)),
        scale * n_plots,
    ),
)
for p, r, pcovr in zip(pax.flatten(), rax.flatten(), pcovr_calculators[0]):

    t = pcovr.transform(X_test)
    y = pcovr.predict(X_test)
    x = pcovr.inverse_transform(t)

    plot_projection(
        Y_test, check_mirrors(t, t_ref), fig=pfig, ax=p, alpha=1.0, s=20, **cmaps
    )

    plot_regression(
        Y_test[:, 0],
        y[:, 0],
        fig=rfig,
        ax=r,
        cbar=False,
        vmin=0,
        vmax=5,
        alpha=1.0,
        s=20,
        **cmaps,
    )

    p.set_title(r"$\alpha=$" + str(round(pcovr.mixing, 3)))
    r.set_title(r"$\alpha=$" + str(round(pcovr.mixing, 3)))


for p, r in zip(pax.flatten()[n_alphas:], rax.flatten()[n_alphas:]):
    p.axis("off")
    r.axis("off")

pfig.subplots_adjust(wspace=0.6, hspace=0.6)
pfig.suptitle(r"Projections across $\alpha$")
rfig.subplots_adjust(wspace=0.6, hspace=0.6)
rfig.suptitle(r"Regressions across $\alpha$")

plt.show()

## Comparison of PCovR Loss Terms
To get a more quantitative assessment of the behavior of PCovR as a function of $\alpha$, we plot the errors from the linear regression and PCA terms of the PCovR loss. The linear regression loss $\ell_{regr}$ (labeled $\ell_{LR}$ in the plots) is the squared error between the known and predicted properties,

\begin{align}
\ell_{regr} = 
 {\left\lVert \mathbf{Y} - \mathbf{X}\mathbf{P}_{XT}\mathbf{P}_{TY}\right\rVert^2},
\end{align}

and the projection loss $\ell_{proj}$ (labeled $\ell_{PCA}$ in the plots) is the squared error in the reconstruction of the input features,

\begin{equation}
\ell_{proj}=
 {\left\lVert \mathbf{X} - \mathbf{X}\mathbf{P}_{XT}\mathbf{P}_{TX}\right\rVert^2}.
\end{equation}


In [None]:
L_pca = np.zeros((n_components, n_alphas))
L_lr = np.zeros((n_components, n_alphas))

for cdx, c in enumerate(components):
    for adx, a in enumerate(alphas):
        calculator = pcovr_calculators[cdx][adx]
        
        # TODO: remove except when score is merged into skcosmo
        try:
            L_pca[cdx, adx], L_lr[cdx, adx] = calculator.score(X_test, Y_test)
        except Exception:
            xr = calculator.inverse_transform(calculator.transform(X_test))
            yp = calculator.predict(X_test)

            L_pca[cdx, adx] = (
                np.linalg.norm(X_test - xr) ** 2.0 / np.linalg.norm(X_test) ** 2.0
            )
            L_lr[cdx, adx] = (
                np.linalg.norm(Y_test - yp) ** 2.0 / np.linalg.norm(Y_test) ** 2.0
            )

In [None]:
fig, [axsLR, axsPCA] = plt.subplots(1, 2, figsize=dbl_fig, sharex=True)

for cdx, c in enumerate(components):
    axsLR.plot(alphas, L_lr[cdx, :], marker="o", label="{:d} PCs".format(c))
    axsPCA.plot(alphas, L_pca[cdx, :], marker="o", label="{:d} PCs".format(c))

axsLR.set_ylabel(r"$\ell_{LR}$")
axsLR.set_xlabel(r"$\alpha$")
axsPCA.set_ylabel(r"$\ell_{PCA}$")
axsPCA.set_xlabel(r"$\alpha$")

fig.subplots_adjust(wspace=0.3)
axsLR.legend()
plt.show()

Given that the two terms vary in opposite directions (as expected, since $\alpha$ sets how much each contributes to the total loss), choosing a value means making a trade-off between LR and PCA performance. The least biased choice is then to look for the value that minimizes the (non-weighted) sum of the PCA and LR terms. 

This is the optimal choice if one gives equal importance to the two tasks, and is (roughly) equivalent to plotting a $\lambda$ curve, i.e. a plot of $\ell_{LR}$ against $\ell_{PCA}$ for different values of the mixing $\alpha$, and looking for the elbow of the curve, which gives the value at which the improvement in one of the losses balances the degradation of the other.
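
The trade-off can be reproduced with a few lines of NumPy. This is a rough sketch on synthetic data, evaluated on the training set rather than a test set. The function `pcovr_train_losses` is a hypothetical helper (not part of skcosmo) that uses the whitened sample-space projection, and both losses are normalized by the squared norm of the target matrix.

```python
import numpy as np


def pcovr_train_losses(X, Y, alpha, n_keep):
    """Hypothetical helper: normalized (l_proj, l_regr) on the training data."""
    Yhat = X @ np.linalg.lstsq(X, Y, rcond=None)[0]
    K = alpha * X @ X.T + (1 - alpha) * Yhat @ Yhat.T
    _, U = np.linalg.eigh(K)
    T = U[:, ::-1][:, :n_keep]  # whitened projection (orthonormal columns)
    l_proj = np.linalg.norm(X - T @ (T.T @ X)) ** 2 / np.linalg.norm(X) ** 2
    l_regr = np.linalg.norm(Y - T @ (T.T @ Y)) ** 2 / np.linalg.norm(Y) ** 2
    return l_proj, l_regr


rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6)) * np.array([4.0, 3.0, 2.0, 1.0, 0.5, 0.25])
Y = X @ rng.normal(size=(6, 2)) + 0.1 * rng.normal(size=(60, 2))

# Two targets and two components, so alpha = 0 can capture Yhat exactly.
losses = {a: pcovr_train_losses(X, Y, a, n_keep=2) for a in (0.0, 0.5, 1.0)}
for a, (l_proj, l_regr) in losses.items():
    print("alpha = {:.1f}: l_proj = {:.4f}, l_regr = {:.4f}".format(a, l_proj, l_regr))
```

On the training data, $\alpha = 1$ gives the smallest projection loss and $\alpha = 0$ the smallest regression loss, and intermediate values fall between these two extremes for the loss that each extreme optimizes.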

In [None]:
fig = plt.figure(figsize=dbl_fig)
axsLoss = fig.add_subplot(1, 2, 1)
axsSum = fig.add_subplot(1, 2, 2)

for cdx, c in enumerate(components):
    axsLoss.loglog(L_pca[cdx, :], L_lr[cdx, :], marker="o", label="{:d} PCs".format(c))

axsLoss.set_xlabel(r"$\ell_{PCA}$")
axsLoss.set_ylabel(r"$\ell_{LR}$")
axsLoss.legend()

for cdx, c in enumerate(components):
    loss_sum = L_lr[cdx, :] + L_pca[cdx, :]
    axsSum.semilogy(alphas, loss_sum, marker="o", label="{:d} PCs".format(c))
    print("Optimal alpha for {:d} PCs = {:.2f}".format(c, alphas[np.argmin(loss_sum)]))

axsSum.set_xlabel(r"$\alpha$")
axsSum.set_ylabel(r"$\ell_{LR} + \ell_{PCA}$")

fig.subplots_adjust(wspace=0.5, hspace=0.3)

plt.show()

# Next: Kernel Methods

Continue on to the [next notebook](3_KernelMethods.ipynb)!

# Implementation in `skcosmo`

Classes from the skcosmo module enable computing PCovR with a scikit-learn-like syntax. 

The PCovR class takes a parameter `space` to designate whether projectors should be computed in sample or feature space. If the parameter is not supplied, the class detects the shape of the input data when `pcovr.fit(X,Y)` is called and chooses the most efficient method.
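
The shape-based choice can be pictured with the following sketch. The function `choose_space` is hypothetical and only mirrors the reasoning given earlier in this notebook (feature space when samples outnumber features, sample space otherwise). The exact rule used inside skcosmo is an assumption here and should be checked against the library itself.

```python
def choose_space(n_samples, n_features):
    """Hypothetical heuristic: pick the space with the smaller eigenproblem."""
    # Feature space diagonalizes an n_features x n_features matrix,
    # sample space an n_samples x n_samples one.
    return "feature" if n_samples > n_features else "sample"


print(choose_space(10000, 50))   # many samples, few features
print(choose_space(200, 5000))   # few samples, many features
```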

In [None]:
from skcosmo.decomposition import PCovR

***To avoid confusion with scikit-learn naming conventions, where `alpha` is a regularization parameter, the keyword `mixing` is used to specify the PCovR $\alpha$.***

In [None]:
alpha = 0.5
pcovr = PCovR(mixing=alpha, n_components=2)

In [None]:
pcovr.fit(X_train, Y_train)

`pcovr.transform(X)` returns the projection $\mathbf{T}$.

In [None]:
t = pcovr.transform(X_test)

`pcovr.predict(X)` returns the regressed properties $\mathbf{Y}_p$.

In [None]:
y = pcovr.predict(X_test)

`pcovr.inverse_transform(T)` returns the reconstructed input data $\mathbf{X}_r$.

In [None]:
x = pcovr.inverse_transform(t)

In [None]:
fig, ax = plt.subplots(1, 2, figsize=dbl_fig)
plot_projection(
    Y_test, t, title=r"PCovR ($\alpha$={})".format(alpha), fig=fig, ax=ax[0], **cmaps
)
plot_regression(
    Y_test[:, 0],
    y[:, 0],
    title=r"PCovR ($\alpha$={})".format(alpha),
    fig=fig,
    ax=ax[1],
    **cmaps
)