Projection to Latent Structures (PLS)
===

Author: Nathan A. Mahynski

Date: 2023/09/12

Description: Discussion and examples of [PLS](https://en.wikipedia.org/wiki/Partial_least_squares_regression).

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mahynski/pychemauth/blob/main/docs/jupyter/gallery/pls.ipynb)

[Partial least squares (PLS) regression](https://en.wikipedia.org/wiki/Partial_least_squares_regression) is also known as "projection to latent structures".  There are many explanations for the PLS algorithm, but in essence it is a scheme to project both $X$ and $Y$ while taking each other into account (which makes it a supervised method).  This is motivated by the [PCR](pca_pcr.ipynb) example, which showed that just because the data ($X$) is naturally spread out along certain dimensions does not necessarily mean that the response variable ($Y$) is correlated with that spread.  Assume we model $X$ and $Y$ as follows:

$$X = TP^T + E,$$

$$Y = UQ^T + F,$$

where $E$ and $F$ are error terms (assumed to be IID). $X$ has dimensions $n \times p$, and $Y$ has dimensions $n \times l$; $T$ and $U$ are the $n \times k$ projections (scores) of $X$ and $Y$, respectively.  Here, $k \le p$ represents a dimensionality reduction, while $P$ is $p \times k$ and $Q$ is $l \times k$.

This is particularly useful when either (1) we have more regressors than instances ($p > n$), or (2) the output is correlated with **dimensions that have low variance**, in which case unsupervised PCA will discard those dimensions and, with them, PCR's predictive power. Note that (1) often occurs when we have many correlated dimensions, so that in truth there are very few independent regressors.

The PLS algorithm rotates and projects $X$ into a lower, $k$-dimensional space, represented by the scores matrix $T$ (x-scores), and similarly projects $Y$ into a space of the same dimension, represented by $U$ (y-scores). The projections ($P$ and $Q$ matrices) are determined [in a non-linear fashion](/examples/common_chemometrics/frank_friedman.pdf) by sharing information about the two decompositions above with each other. Essentially, the decomposition is designed to maximize the covariance between the x- and y-scores ($T$ and $U$ matrices).

Originally, this was inspired by the [non-linear iterative partial least squares](https://cran.r-project.org/web/packages/nipals/vignettes/nipals_algorithm.html) (NIPALS) algorithm as described [here](/examples/common_chemometrics/1-s2.0-0003267086800289-main.pdf); NIPALS is simply an iterative algorithm for finding principal components.  Essentially, one could use NIPALS to get an eigendecomposition of $X$ and $Y$ independently, then do a regression between them, similar to PCR; however, $X$ and $Y$ would then have independent decompositions which might not be good representations of each other, as we saw in the PCR example.  One way to circumvent this is to "trade" information during the iteration, letting these decompositions influence each other.  Later, it was shown that this can be interpreted as finding the decomposition that maximizes the variance of the product, or cross-covariance matrix, $C = Y^TX$, during each step (described below). See Hastie et al. "The Elements of Statistical Learning." or [here](https://www.jstor.org/stable/pdf/1269656.pdf?casa_token=cr6WyBAN7acAAAAA:OknO2R0VHg41_q0UOTo4Ribt2Z7dH8b5meHacIqSdT69GXeQJ_zd8N26I2EVexM0-bOMNqpaD4YYWuwMgw-1BenvoDcDSG10sfoimwCYOETgz9a8Lw) for more details.
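To make the "iterative algorithm to find principal components" concrete, here is a minimal sketch of a NIPALS-style iteration for the first principal component only. The function name `nipals_first_pc`, its starting guess, and its convergence test are illustrative assumptions, not taken from the references above; the result is compared against the leading right singular vector from a direct SVD.

```python
import numpy as np

def nipals_first_pc(X, tol=1e-10, max_iter=1000):
    """Hypothetical helper: first principal component via NIPALS-style iteration."""
    Xc = X - X.mean(axis=0)
    # Assumed starting guess: the column of X with the largest variance.
    t = Xc[:, np.argmax(Xc.var(axis=0))].copy()
    for _ in range(max_iter):
        p = Xc.T @ t / (t @ t)      # regress the columns of X on the scores
        p /= np.linalg.norm(p)      # normalize the loading vector
        t_new = Xc @ p              # update the scores
        converged = np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new)
        t = t_new
        if converged:
            break
    return t, p

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
t1, p1 = nipals_first_pc(X_demo)

# Reference: leading right singular vector of the centered data.
_, _, Vt = np.linalg.svd(X_demo - X_demo.mean(axis=0), full_matrices=False)
print(abs(p1 @ Vt[0]))  # close to 1 when the two directions agree (up to sign)
```

Further components would be obtained by deflating `Xc` with `np.outer(t, p)` and repeating the iteration.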

scikit-learn has a nice discussion on the differences between [variations on this algorithm](https://scikit-learn.org/stable/modules/cross_decomposition.html#cross-decomposition).  What is described below should correspond to [PLSRegression](https://scikit-learn.org/stable/modules/cross_decomposition.html#cross-decomposition) aka "PLS1" or "PLS2" (depending on the number of targets) which is the most common variation used in the chemometric literature.

Algorithmically, the formulation of PLS goes like this:

1. Mean-center X and Y.  Certain PLS algorithms do not require explicit Y centering as it is done automatically. **Y is not scaled, only centered - because X's loadings are going to be used below we can then use the eigenvectors directly (I believe); Y is incorporated via OLS so its scale is built in.**

Then for $k$ steps:

2. Compute the first right and left singular vectors, $\vec{p}$ and $\vec{q}$, of the cross-covariance matrix $C = Y^TX$; these maximize the covariance between $T$ and $U$. NIPALS does this iteratively, but it can also be done directly with SVD, for example (small numerical differences may result). Save the $\vec{p}$ and $\vec{q}$ vectors as columns in the $P$ and $Q$ matrices.
3. Use these vectors to project $X$ and $Y$: $\vec{t} = X \vec{p}$ and $\vec{u} = Y \vec{q}$.
4. Deflate the matrices: $X \rightarrow X - \vec{t}\vec{p}^T$, $Y \rightarrow Y - \vec{u}\vec{q}^T$. Note that some algorithms do not require the deflation of $Y$ (see [wiki](https://en.wikipedia.org/wiki/Partial_least_squares_regression)).

End loop

5. Finally, seek the regression that relates the score matrices, $U = T\beta$.  Recall that (1) $Y = UQ^T$ and (2) $X = TP^T$, neglecting the error terms; because the columns of $P$ are orthonormal, $P^TP = I$, so that $XP = TP^TP = T$. Substitution yields:

$Y = UQ^T = T \beta Q^T = X P \beta Q^T$.

If we reformulate this as if it were an OLS problem, $Y = X B$, then $B = P \beta Q^T$. Recall the standard OLS solution, $\beta = (T^T T)^{-1}T^TU$. Since $UQ^T = Y$, the product $\beta Q^T = (T^T T)^{-1}T^TY$, which leads to

$B = P (T^T T)^{-1}T^TY$.
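Steps 1 to 5 can be transcribed almost literally into NumPy. The sketch below is only that, a transcription of the steps as written here; the function name `pls_sketch` is hypothetical. It assumes $k \le \min(p, l)$, because deflating $Y$ as in step 4 removes one direction of $Y$ per iteration. It is not a reimplementation of scikit-learn's `PLSRegression`, which keeps separate weight and loading vectors, so the two should not be expected to give identical numbers.

```python
import numpy as np

def pls_sketch(X, Y, k):
    """Literal transcription of steps 1-5; assumes k <= min(p, l)."""
    # Step 1: mean-center X and Y (no scaling).
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Xk, Yk = Xc.copy(), Yc.copy()
    P_cols, Q_cols, T_cols, U_cols = [], [], [], []
    for _ in range(k):
        # Step 2: leading singular vectors of C = Y^T X (shape l x p).
        left, _, right_t = np.linalg.svd(Yk.T @ Xk, full_matrices=False)
        q, p = left[:, 0], right_t[0]
        # Step 3: project to get the scores.
        t, u = Xk @ p, Yk @ q
        # Step 4: deflate both matrices.
        Xk = Xk - np.outer(t, p)
        Yk = Yk - np.outer(u, q)
        P_cols.append(p)
        Q_cols.append(q)
        T_cols.append(t)
        U_cols.append(u)
    P, Q = np.column_stack(P_cols), np.column_stack(Q_cols)
    T, U = np.column_stack(T_cols), np.column_stack(U_cols)
    # Step 5: B = P (T^T T)^{-1} T^T Y, via a linear solve rather than an inverse.
    B = P @ np.linalg.solve(T.T @ T, T.T @ Yc)
    return B, P, Q, T, U

# Synthetic data, made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
Y = X[:, :3] @ rng.normal(size=(3, 3)) + 0.1 * rng.normal(size=(50, 3))

B, P, Q, T, U = pls_sketch(X, Y, k=2)
Y_hat = (X - X.mean(axis=0)) @ B + Y.mean(axis=0)  # predictions on the original scale
```

In this sketch the columns of `P` come out orthonormal, and `T` equals the centered `X` times `P`, which is the property used in step 5.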

More references:
* See code at [cimcb](https://github.com/CIMCB/cimcb).
* A nice [blog post](https://nirpyresearch.com/partial-least-squares-regression-python/).
* An old, but thorough [tutorial](/examples/pca_pls/1-s2.0-0003267086800289-main).



<!--SVD formulation.
1. Mean-center X and Y.  Certain PLS algorithms do not require explicit Y centering as it is done automatically. **Y is not scaled, only centered - because X's loadings are going to be used below we can then use the eigenvectors directly (I believe); Y is incorporated via OLS so its scale is built in.**
2. Find directions equivalent to maximizing (co)variance of X "and" Y instead of just X. This is not the logic behind how PLS was originally derived, but it is a common modern statistical explanation for it (see Hastie et al. "The Elements of Statistical Learning." or [here](https://www.jstor.org/stable/pdf/1269656.pdf?casa_token=cr6WyBAN7acAAAAA:OknO2R0VHg41_q0UOTo4Ribt2Z7dH8b5meHacIqSdT69GXeQJ_zd8N26I2EVexM0-bOMNqpaD4YYWuwMgw-1BenvoDcDSG10sfoimwCYOETgz9a8Lw) for more details).
Following the logic from the PCA section ending with the Rayleigh quotient, we are now looking for eigenvectors/values of $(Y^TX)^T(Y^TX)$ (instead of $X^TX$ for PCA). So do SVD on $Y^TX$ to get [eigenvalue decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition#SVD_and_spectral_decomposition) of $(Y^TX)^T(Y^TX)$. Via SVD, we have $Y^TX = USW^T$, where the columns of $W$ (right-singular vectors) are eigenvectors of $(Y^TX)^T(Y^TX)$.  
3. Project X to T by using W matrix (only take top k eigenvectors). Note that Y was already used to produce T.
4. Linearly regress Y vs T (T=XW), so we have Y = TA (can use exact OLS matrix result).
5. Convert these coefficients so we have Y vs X, Y = XB.
 - Observe: Y = TA = (XW)A
 - Since we want Y = XB, B = WA

Note that PLS1 and others may do this iteratively via the NIPALS algorithm for estimating principal components, but the above is based on SVD which is more memory efficient.-->

In [1]:
if 'google.colab' in str(get_ipython()):
    !pip install git+https://github.com/mahynski/pychemauth@main
    import os
    os.kill(os.getpid(), 9) # Automatically restart the runtime to reload libraries

In [2]:
if 'google.colab' in str(get_ipython()):
    !wget https://github.com/mahynski/pychemauth/raw/main/docs/jupyter/gallery/utils.py
        
try:
    import pychemauth
except ImportError as err:
    # Catch only import failures and keep the original traceback attached
    raise ImportError("pychemauth not installed") from err

import matplotlib.pyplot as plt
%matplotlib inline

import watermark
%load_ext watermark

%load_ext autoreload
%autoreload 2

In [2]:
%watermark -t -m -v --iversions

