# Rank Estimation

This experiment applies rank estimation to a toy dataset and to the raw UV-Vis dataset, following the method described in @zhang_flexibleimplementationtrilinearity_2022. First we recreate the GC-MS dataset as described and build an implementation to preprocess, unfold, decompose and display the estimated singular values.

## Toy Dataset

The GC-MS data was downloaded from [here](https://ucphchemometrics.com/2023/06/01/wine-samples-analyzed-by-gc-ms-and-ft-ir-instruments/) and was originally prepared by @skov_multiblockvariancepartitioning_2008. It is stored in MATLAB 7.3 format and required the use of the [pymatreader](https://gitlab.com/obob/pymatreader/) library. The file contains GC-MS, FT-IR and physicochemical univariate measurements. The GC-MS data consists of 44 samples x 2700 elution time-points x 200 mass channels.

The authors narrowed the scope to the interval from 16.52 to 16.76 mins (corresponding to a range of 25 units), which contains two compounds (peaks). They identified three significant components (chemical rank), attributing two to the compounds stated and one to the background. We expect to find the same.

In [None]:
%reload_ext autoreload
%autoreload 3

import numpy as np
from pymatreader import read_mat
import xarray as xr
from sklearn.decomposition import PCA

from pca_analysis.definitions import DATA_DIR

xr.set_options(display_expand_data=False)

mat_path = DATA_DIR / "Wine_v7.mat"
nc_path = DATA_DIR / "Wine_v7.nc"

if nc_path.exists():
    full_data = xr.open_dataarray(nc_path)

else:
    mat_data = read_mat(mat_path)

    key_it = [
        "Label_Wine_samples",
        "Label_Elution_time",
        "Label_Mass_channels",
    ]

    full_data = xr.DataArray(
        mat_data["Data_GC"],
        dims=["sample", "time", "mz"],
        coords=[mat_data[k] for k in key_it],
    )

    full_data.to_netcdf(nc_path)

display(full_data)


The data is organised in a labelled xarray `DataArray` for ease of handling.

## Preprocessing

To perform the SVD the data needs to be scaled and centered. It doesn't appear that @zhang_flexibleimplementationtrilinearity_2022 did this, or at least they didn't report it, so I will begin PCA without it. If the results are poor, I will integrate this preprocessing stage. I think I would mean center columns (observations) and scale rows (samples). TODO: find resources for this.
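The link between centering, the SVD and the PCA explained variance can be sketched in NumPy. This is an illustration on synthetic data (the array `X` and its column scales are made up), not the preprocessing used by the authors: the squared singular values of the column-centered matrix, divided by `n - 1`, match the eigenvalues of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: 50 observations x 4 features with different spreads and an offset
X = rng.normal(size=(50, 4)) * np.array([5.0, 2.0, 1.0, 0.1]) + 10.0

# mean-center each column before the SVD
Xc = X - X.mean(axis=0)

# singular values of the centered matrix, in descending order
singular_values = np.linalg.svd(Xc, compute_uv=False)

# explained variance per component
explained_variance_demo = singular_values**2 / (X.shape[0] - 1)

# the same quantity from the eigenvalues of the sample covariance matrix
cov_eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]

print(explained_variance_demo)
print(cov_eigvals)
```

Without the centering step the first singular value would mostly describe the column offsets rather than the variation between observations.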

### Slicing

I will use [tensorly](https://tensorly.org/stable/user_guide/tensor_basics.html#unfolding) to unfold the numpy array, leaving the xarray aside for now as it has a potentially different API. That is fine, but the modes then become unlabelled, so the labels need to be reapplied. It is probably easiest to extract them and find the indexes manually.

In [None]:
full_data


In [None]:
import matplotlib.pyplot as plt

sliced_data = full_data.sel(time=slice(16.52, 16.76))  # interval taken from article
display(sliced_data)
sliced_data[0].plot.line(x="time", add_legend=False);


Which looks correct.

### Unfolding

Unfolding along the sample mode such that the matrix is stacked sample-wise, with (sample, time) pairs as rows and mass channels (mz) as columns.
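The same unfolding can be sketched on a plain NumPy array, assuming the axes are ordered (sample, time, mz). The small `cube` below is synthetic and only illustrates how the row index of the unfolded matrix relates to the original sample and time indexes.

```python
import numpy as np

n_sample, n_time, n_mz = 3, 4, 5

# synthetic three-way array with axes (sample, time, mz)
cube = np.arange(n_sample * n_time * n_mz).reshape(n_sample, n_time, n_mz)

# stack the sample and time axes into rows, keep mz as columns
unfolded = cube.reshape(n_sample * n_time, n_mz)

# the row for (sample=1, time=2) sits at index 1 * n_time + 2
row = 1 * n_time + 2
print(unfolded.shape)
print(np.array_equal(unfolded[row], cube[1, 2]))
```

Because the reshape is in C order, the rows are grouped sample by sample, with time varying fastest within each sample.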

In [None]:
sample_aug = sliced_data.stack(sample_aug=("sample", "time")).transpose(..., "mz")
sample_aug


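As we are skipping explicit scaling and centering for now, we will proceed with PCA. Note that `sklearn`'s `PCA` mean-centers each column internally but does not scale the data.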

## PCA 

In [130]:
pca = PCA()
sample_aug.pipe(pca.fit_transform);


In [131]:
explained_variance = pca.explained_variance_


We will crudely define the inflection point as being the location where the finite difference becomes less than 2% of the maximum.
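A minimal sketch of this rule, assuming that "the maximum" means the largest explained variance and that the finite difference is taken between consecutive components. `count_significant_components` is a hypothetical helper written for illustration and is not the `pca_scree_plot` implementation; the explained variances in the example are invented.

```python
import numpy as np


def count_significant_components(explained_variance, threshold=0.02):
    """Count components up to the last drop that is >= threshold * max variance."""
    ev = np.asarray(explained_variance, dtype=float)

    # drops[i] is the absolute change between component i and component i + 1
    drops = np.abs(np.diff(ev))

    # indexes of the drops that are still considered significant
    significant = np.where(drops >= threshold * ev.max())[0]

    if significant.size == 0:
        # no significant drop at all: fall back to a single component
        return 1

    # the last significant drop separates component idx from component idx + 1
    return int(significant[-1] + 1)


# invented scree profile: three large components followed by a flat tail
demo_variance = [100.0, 40.0, 10.0, 0.5, 0.4, 0.3]
n_significant = count_significant_components(demo_variance)
print(n_significant)
```

The threshold is relative, so the rule is insensitive to the absolute scale of the data but sensitive to a single dominant component, which inflates the cutoff for all later drops.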

In [None]:
from pca_analysis.notebooks.experiments.zhang_pca_scree_plot import pca_scree_plot

pca_scree_plot(explained_variance)


According to the rule of <2% change, the number of significant components is deemed to be {eval}`sig_component_idxs[0][-1]+1`, which is in agreement with the authors, and no preprocessing was necessary.

This example will remain as a test case. We will now apply the method to the entire toy dataset, then to my dataset.

## Full Dataset

Do the same thing but for the full dataset.

### Visual Estimation of Number of Peaks

How many components are expected? The number should be close to the number of peaks in the mass channel with the highest mean intensity.
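As an illustration of counting peaks in a single chromatogram, the sketch below uses `scipy.signal.find_peaks` on a synthetic signal built from three Gaussians. The signal, the `gaussian` helper and the 5% prominence threshold are assumptions made for this example; the notebook's own `FindPeaks` helper may work differently.

```python
import numpy as np
from scipy.signal import find_peaks


def gaussian(t, center, width, height):
    """Gaussian peak used to build a synthetic chromatogram."""
    return height * np.exp(-0.5 * ((t - center) / width) ** 2)


time = np.linspace(0, 10, 1000)

# synthetic chromatogram with three well-separated peaks
signal = (
    gaussian(time, 2.0, 0.2, 1.0)
    + gaussian(time, 5.0, 0.3, 0.5)
    + gaussian(time, 8.0, 0.2, 0.8)
)

# keep only peaks whose prominence is at least 5% of the signal maximum
peak_idx, properties = find_peaks(signal, prominence=0.05 * signal.max())

n_peaks_demo = len(peak_idx)
print(n_peaks_demo, time[peak_idx])
```

On real data the prominence threshold needs tuning, since a value that is too low counts noise as peaks and a value that is too high misses minor compounds.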

#### MZ Maxima


In [None]:
# average mz over time and samples

mean_max_mz_label = full_data.mean("time").mean("sample").idxmax().pipe(int)

f"the mz with the highest mean amplitude is {mean_max_mz_label}"


And which sample has the highest average for that channel?

In [None]:
# select by coordinate label (sel), since idxmax returns a label rather than a position
max_sample_label = full_data.sel(mz=mean_max_mz_label).mean("time").idxmax().item()
f"the sample with the highest mean amplitude at {mean_max_mz_label} is: {max_sample_label}"


The chromatogram of that sample at mz 44 looks like:


In [None]:
max_sample = full_data.sel(sample=max_sample_label, mz=mean_max_mz_label)
max_sample.plot.line();


In [None]:
from pca_analysis.notebooks.experiments.zhang_experiment_find_peaks import FindPeaks


fp = FindPeaks()
fp.find_peaks(sample=max_sample)
fp.plot_peaks(sample=max_sample)


In [None]:
n_peaks = len(fp.peaks_x)
n_peaks


In [None]:
f"Visually it appears that all peaks are accounted for, thus the number of peaks is around {n_peaks}. Thus we should expect around {n_peaks} significant components through PCA."


### Unfold Full Dataset

In [None]:
sample_aug_full = full_data.stack({"aug": ("sample", "time")}).transpose(..., "mz")
sample_aug_full


In [None]:
from pca_analysis.notebooks.experiments.zhang_pca_scree_plot import MyPCA

pca = MyPCA()
pca = pca.run_pca(sample_aug_full)
pca.scree_plot()


As we can see, the profile is coincidentally similar to that of the two-peak slice, with the vast majority of the variance explained by the first three components, in fact by the first two. We can presume that this is because of the dominance of the maximum peak. What happens if we remove it from the set?

## Full Dataset Without Maxima

In [None]:
def exclude_global_maxima(
    xa: xr.DataArray, peaks_x: np.ndarray, peaks_y: np.ndarray
) -> xr.DataArray:
    """
    Select the subset of the input signal running from halfway between the global
    maximum and the next peak until the end of the signal.
    """
    maxima_idx = peaks_y.argmax()
    next_peak_idx = maxima_idx + 1

    if next_peak_idx >= len(peaks_x):
        raise ValueError("the global maximum is the last peak; there is no next peak")

    # the cut point is the time halfway between the maximum and the next peak
    maxima_time = peaks_x[maxima_idx]
    next_peak_time = peaks_x[next_peak_idx]

    cut_time_start = maxima_time + ((next_peak_time - maxima_time) / 2)

    # keep everything from the cut point to the end of the time axis
    return xa.sel(time=slice(cut_time_start, None))


without_global_maxima = exclude_global_maxima(
    xa=full_data, peaks_x=fp.peaks_x, peaks_y=fp.peaks_y
)
display(without_global_maxima)
without_global_maxima.sel(mz=44).isel(sample=slice(0, 44, 11)).plot(
    col="sample", col_wrap=2
);


Looks good. What does the PCA show?

In [None]:
(
    without_global_maxima.stack({"aug": ("sample", "time")})
    .transpose(..., "mz")
    .pipe(pca.run_pca)
    .scree_plot()
)


An improvement, but evidently the PCA is still being dominated by a subset of latent variables.

It's time to implement scaling and centering, then, if that doesn't work, binning.

## Scaled and Centered PCA Global Maxima


Scaling and centering can help models such as PCA fit noisy or otherwise aberrant data. In this context, where after unfolding the dataset is represented by samples (sample- and time-wise rows) and features (spectral dimension columns), centering is the subtraction of each column's mean from every row and scaling is the division of each column by its standard deviation. This is implemented by `sklearn`'s `StandardScaler`. The result is that each column has a mean of 0 and a standard deviation of 1.
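The column-wise operation can be sketched in plain NumPy on synthetic data (the array `X` and its column means and spreads are invented for this example). The population standard deviation (`ddof=0`) is used here, which is also what `StandardScaler` uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic unfolded matrix: 200 rows, 3 feature columns with different offsets and spreads
X = rng.normal(loc=[10.0, -3.0, 0.5], scale=[5.0, 0.2, 1.0], size=(200, 3))

# center: subtract each column's mean from every row
col_mean = X.mean(axis=0)

# scale: divide each column by its standard deviation (ddof=0)
col_std = X.std(axis=0)

X_scaled = (X - col_mean) / col_std

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))  # approximately 1 for every column
```

After this transformation every mass channel contributes equally to the total variance, so low-intensity channels carry as much weight in the PCA as the dominant ones.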

In [None]:
from sklearn_xarray import wrap
from sklearn_xarray.preprocessing import BaseTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def unfold(x):
    return x.stack({"aug": ("sample", "time")}).transpose(..., "mz")


class Unfolder(BaseTransformer):
    def __init__(self):
        # required for API
        self.groupby = None
        pass

    def _transform(self, X):
        unfolded = unfold(X)
        return unfolded


prepro_pipeline = Pipeline(
    [
        ("unfold", Unfolder()),
        ("scale", wrap(StandardScaler, reshapes="feature")),
    ]
)

Xt = prepro_pipeline.fit_transform(full_data)

pca2 = MyPCA()
(Xt.pipe(pca2.run_pca).scree_plot(exp_var_change_prop=0.01))


A reversal in results. It appears that PCA with this user-defined metric estimates an optimal number of components equal to the number of peaks detected earlier. Evidently, standard scaling the data meaningfully increases the number of significant components estimated from the scree plot. This suggests that it is an appropriate method of estimation for unaligned GC-MS data. To further test the result we would need to estimate the total number of unique peaks across the sample dimension, as this verification has only considered the sample with the highest mean intensity.

## CORCONDIA

As the PCA approaches are failing to produce the expected results without extensive manual handling, we will move on to CORCONDIA, a method of estimating the number of components of PARAFAC and PARAFAC-like models (e.g. PARAFAC2) through inspection of a Tucker3 core [@bro_newefficientmethod_2003]. It iterates through the number of components, starting at 1, up to a limit (?), and we are looking for a steep drop-off in core consistency as the indication of the optimal number of components.
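The core consistency value itself can be sketched in NumPy. This is an illustration of the idea as I understand it, not the `corcondia_3d` implementation: it assumes the PARAFAC loadings `A`, `B` and `C` are already available with any component weights absorbed into them, and the hypothetical helper `core_consistency` compares the least-squares Tucker3 core against an ideal superdiagonal core of ones.

```python
import numpy as np


def core_consistency(X, A, B, C):
    """Core consistency (%) of trilinear loadings A, B, C for a 3-way array X."""
    n_components = A.shape[1]

    # least-squares Tucker3 core for fixed loadings:
    # G = X x1 pinv(A) x2 pinv(B) x3 pinv(C)
    G = np.einsum(
        "fi,gj,hk,ijk->fgh",
        np.linalg.pinv(A),
        np.linalg.pinv(B),
        np.linalg.pinv(C),
        X,
    )

    # ideal core of a perfectly trilinear model: ones on the superdiagonal
    T = np.zeros_like(G)
    idx = np.arange(n_components)
    T[idx, idx, idx] = 1.0

    return 100.0 * (1.0 - np.sum((G - T) ** 2) / n_components)


rng = np.random.default_rng(0)

# synthetic, exactly trilinear rank-2 array built from random loadings
A = rng.normal(size=(6, 2))
B = rng.normal(size=(5, 2))
C = rng.normal(size=(4, 2))
X = np.einsum("if,jf,kf->ijk", A, B, C)

cc_demo = core_consistency(X, A, B, C)
print(cc_demo)
```

For an exactly trilinear array with the correct number of components the value is close to 100. In practice the loadings come from a fitted model, and the value is expected to fall sharply once too many components are requested.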


In [None]:
from corcondia import corcondia_3d


In [None]:
corcondia_3d(X=raw_data.transpose("sample", "time", "mz").to_numpy(), k=22)


CORCONDIA hasn't worked. The impression so far is that none of these tools work for high numbers of peaks, so binning is required.
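One possible form of binning, sketched under the assumption that it means averaging consecutive points along the time axis of a (sample, time, mz) array. The helper `bin_time_axis` and the bin size are hypothetical, and the input is synthetic.

```python
import numpy as np


def bin_time_axis(cube, bin_size):
    """Average consecutive groups of `bin_size` points along axis 1 (time)."""
    n_sample, n_time, n_mz = cube.shape

    # drop trailing time points that do not fill a complete bin
    n_bins = n_time // bin_size
    trimmed = cube[:, : n_bins * bin_size, :]

    # split the time axis into (bin, within-bin) and average within each bin
    return trimmed.reshape(n_sample, n_bins, bin_size, n_mz).mean(axis=2)


# synthetic array: 2 samples x 10 time points x 3 mass channels
cube_demo = np.arange(2 * 10 * 3, dtype=float).reshape(2, 10, 3)
binned_demo = bin_time_axis(cube_demo, bin_size=4)

print(binned_demo.shape)
```

Binning reduces the number of time points and smooths out small retention time shifts between samples, at the cost of chromatographic resolution.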

## PARAFAC2

In [None]:
from tensorly.decomposition import parafac2

best_err = np.inf
decomposition = None

true_rank = 22

for run in range(1):
    print(f"Training model {run}...")
    trial_decomposition, trial_errs = parafac2(
        raw_data[0:5, :, :].to_numpy(),
        true_rank,
        return_errors=True,
        tol=1e-8,
        n_iter_max=500,
        random_state=run,
        verbose=True,
    )
    print(f"Number of iterations: {len(trial_errs)}")
    print(f"Final error: {trial_errs[-1]}")
    if best_err > trial_errs[-1]:
        best_err = trial_errs[-1]
        err = trial_errs
        decomposition = trial_decomposition
    print("-------------------------------")
print(f"Best model error: {best_err}")


In [None]:
decomposition
