# Replication of @zhang_flexibleimplementationtrilinearity_2022 Rank Estimation

This study replicated the reported results of @zhang_flexibleimplementationtrilinearity_2022 on a Wine GC-MS dataset. The dataset is described in section 3.2 and the results in 4.2.

We first recreated the dataset and developed methods to preprocess, unfold, decompose and display the estimated singular values.

## Dataset

The GC-MS data was downloaded from [here](https://ucphchemometrics.com/2023/06/01/wine-samples-analyzed-by-gc-ms-and-ft-ir-instruments/) and was originally prepared by @skov_multiblockvariancepartitioning_2008. It is stored in MATLAB 7.3 format, which required the [mat73](https://gitlab.com/obob/pymatreader/) library to read. The file contains GC-MS, FT-IR and physicochemical univariate measurements. The GC-MS data consists of 44 samples × 2700 elution time points × 200 mass channels.

The authors narrowed the scope to the interval from 16.52 to 16.76 min (corresponding to a range of 25 units), which contains two compounds (peaks) and is described in detail in section 3.2. They identified three significant components (the chemical rank), attributing two to the stated compounds and one to the background. We expect to find the same.

In [None]:
%reload_ext autoreload
%autoreload 3

import numpy as np
import xarray as xr
from pca_analysis.cabernet.shiraz.shiraz import Shiraz
from pca_analysis.decomposers import PCA
from pca_analysis.definitions import DATA_DIR
from pca_analysis import xr_plotly
from pca_analysis.cabernet.cabernet import Cabernet
from pca_analysis import cabernet
from pca_analysis.get_dataset import load_zhang_data
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

xr.set_options(display_expand_data=False)

full_data = load_zhang_data()
cabernet.chrom_dims.TIME = "time"
cabernet.chrom_dims.SPECTRA = "mz"
cabernet.chrom_dims.SAMPLE = "sample"

cab = Cabernet(da=full_data)
shz = cab["zhang_data"]

assert isinstance(shz, Shiraz)

demo = shz.sel(time=slice(16.52 - 0.08, 16.76 + 0.08))

display(demo)
display(demo.sel(mz=slice(0, 100)).isel(sample=slice(0, 44, 4)).viz.heatmap(n_cols=3))


The data interval is displayed above and corresponds to that shown by @zhang_flexibleimplementationtrilinearity_2022 in figure 5. The two peaks are best viewed at mz = 44, as below:

In [None]:
demo.sel(mz=44).viz.line(x="time", overlay_dim="sample")


The peaks are present in all samples at varying intensities.


## Preprocessing



### Unfolding 

@zhang_flexibleimplementationtrilinearity_2022 describe in section 2.5 the application of SVD (PCA) to different unfoldings of the data tensor. They state that if the data is perfectly trilinear, all three unfoldings will have the same number of significant components; otherwise the SVD will produce a different number of significant components depending on which mode is unfolded. They further state that $X_{\text{caug}}$ produces the most accurate estimate of the number of significant components in the face of noise and trilinear disruption, assuming that each chemical species has a unique spectrum and that their relative concentrations are independent.

$X_{\text{caug}}$ is the unfolding $(I \times K, J)$, which in the context of this dataset is $(\text{retention times} \times \text{samples}, \text{mz})$. Thus we first need to produce the augmented (unfolded) matrix $X_{\text{caug}}$, unfolding along the sample mode.
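As a minimal sketch of this kind of unfolding, the NumPy snippet below stacks each sample's (time, mz) slice vertically. It assumes the layout used in the scaling section of this notebook, where rows are sample-and-time pairs and columns are mass channels. The function name `unfold_column_wise` and the toy tensor are hypothetical and are not part of `pca_analysis`.

```python
import numpy as np


def unfold_column_wise(tensor):
    """Reshape a (sample, time, mz) tensor into a (sample * time, mz) matrix.

    Each sample's (time, mz) slice is stacked below the previous one, so the
    spectral (mz) mode is shared by all rows.
    """
    n_samples, n_times, n_mz = tensor.shape
    return tensor.reshape(n_samples * n_times, n_mz)


# Toy tensor: 4 samples, 5 time points, 3 mass channels.
toy = np.arange(4 * 5 * 3, dtype=float).reshape(4, 5, 3)
unfolded = unfold_column_wise(toy)
```

Rows 5 to 9 of `unfolded` hold the second sample's slice, which gives a quick way to check that the stacking order is the intended one.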


In [None]:
demo.decomp.pca().scree_plot()


## Estimating the number of Components

While @zhang_flexibleimplementationtrilinearity_2022 specify the use of the SVD, a useful interface for estimating the number of components is sklearn's `PCA`. A rudimentary scree plot can be used to observe the inflection point described in figure 3. The authors state in section 4.1 that when the explained variance is plotted against the number of components, the point where the explained variance does not "change much anymore" is the point where the components start describing the noise of the dataset rather than chemical species.

As the scree plot shows, we are able to recreate the reported result of three significant components if the cutoff for the magnitude of change is set to 0.005.
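One way to turn the "does not change much anymore" criterion into a number is sketched below. It is an assumption about how the cutoff could be applied, not the implementation behind `scree_plot`: the count of significant components is taken as the position of the first drop in explained variance ratio between successive components that falls below the cutoff. The helper names and the synthetic matrix with known singular values are hypothetical.

```python
import numpy as np


def explained_variance_ratio(X):
    """Fraction of the total sum of squares captured by each singular value."""
    s = np.linalg.svd(X, compute_uv=False)
    return s**2 / np.sum(s**2)


def n_significant(evr, cutoff=0.005):
    """Count components before the first successive drop smaller than `cutoff`."""
    drops = evr[:-1] - evr[1:]
    below = np.flatnonzero(drops < cutoff)
    return int(below[0]) if below.size else len(evr)


# Synthetic matrix with three large singular values and five small "noise" ones.
rng = np.random.default_rng(0)
singular_values = np.array([10.0, 6.0, 3.0, 0.1, 0.09, 0.08, 0.07, 0.06])
U, _ = np.linalg.qr(rng.normal(size=(200, 8)))
V, _ = np.linalg.qr(rng.normal(size=(50, 8)))
X = U @ np.diag(singular_values) @ V.T

evr = explained_variance_ratio(X)
n_components = n_significant(evr, cutoff=0.005)
```

A known weakness of this rule is that two genuine components with nearly equal variance produce a small drop between them, which would end the count early.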

## Full Dataset

Now we'll see what happens if the same method is applied to the full dataset, as it is only valid if it works for any number of peaks. If it does not, then some other underlying mechanism is at work. To reiterate, we're expecting the number of significant components == the number of peaks == the number of unique chemical species.

### Visual Estimation of Number of Peaks

How many components should we expect? The number should be close to the number of peaks in the maximum mass channel, because it is a fair assumption that the sample with the most abundant chemical species is also the most intense. To find this value, we first find the mass channel with the highest amplitude, then the sample with the highest average amplitude at that mass channel. That sample and mz are displayed below, with the peaks highlighted in red.
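The selection described above can be sketched in plain NumPy as follows. This is one possible reading of the procedure and is not necessarily what `stats.mean_max_spectral_label` and `stats.max_sample_label` compute. The function name and toy tensor are hypothetical, and the tensor is assumed to be laid out as (sample, time, mz).

```python
import numpy as np


def most_intense_channel_and_sample(tensor):
    """Return (mz index, sample index) for a (sample, time, mz) tensor.

    The mz index is the channel holding the global maximum amplitude; the
    sample index is the sample with the highest mean amplitude in that channel.
    """
    channel_max = tensor.max(axis=(0, 1))            # max per mass channel
    mz_idx = int(np.argmax(channel_max))
    sample_mean = tensor[:, :, mz_idx].mean(axis=1)  # mean over time per sample
    sample_idx = int(np.argmax(sample_mean))
    return mz_idx, sample_idx


# Toy tensor: 3 samples, 4 time points, 2 mass channels.
toy = np.zeros((3, 4, 2))
toy[0, 1, 1] = 9.0  # global maximum sits in channel 1 (sample 0, mean 2.25)
toy[2, :, 1] = 5.0  # sample 2 has the highest mean in channel 1 (mean 5.0)
mz_idx, sample_idx = most_intense_channel_and_sample(toy)
```

In the toy data the sample holding the single largest value is not the one with the highest average, which is why the two steps are kept separate.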

In [None]:
max_spectra = shz.stats.mean_max_spectral_label
max_sample_label = shz.stats.max_sample_label
max_sample = shz.sel(sample=max_sample_label, mz=max_spectra)
max_sample.sel(time=slice(4, 24)).viz.line(x="time")


In [None]:
cab1 = cab.copy()
max_sample = cab.sel(
    sample=cab["zhang_data"].stats.max_sample_label,
    mz=cab["zhang_data"].stats.mean_max_spectral_label,
).pick_peaks("zhang_data", find_peaks_kwargs=dict(height=0.005))

display(max_sample.peak_array_as_df("peaks"))
display(max_sample.viz.overlay_peaks("zhang_data", "peaks"))


In [None]:
n_peaks = max_sample.peak_array_as_df("peaks").shape[0]

display(f"{n_peaks=}")


In [None]:
from IPython.display import Markdown

Markdown(
    f"Visually it appears that all peaks are accounted for, thus the number of peaks is around {n_peaks}. Thus we should expect around {n_peaks} significant components through PCA."
)


### Unfold Full Dataset

In [None]:
zhang_data = shz
zhang_data.decomp.pca().scree_plot()


But as we can see, the profile is coincidentally similar to that of the two-peak slice, with the vast majority of the variance explained by the first three components. We presume that this is because the lack of scaling and centering distorts the model towards the largest features.

What does the PCA look like once the data is scaled and centered?

## Scaled and Centered PCA Global Maxima


Scaling and centering can help models such as PCA fit noisy or otherwise aberrant data. In this context, where after unfolding the dataset is represented by samples (sample and timewise rows) and features (spectral dimension columns), centering is the subtraction of each column's mean from every value in that column, and scaling is the division of each value by its column's standard deviation. This is implemented by `sklearn`'s `StandardScaler`. The result is that each column has a mean of 0 and a standard deviation of 1.
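The column-wise arithmetic can be written out directly in NumPy, as in the sketch below. It uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows by default. The function name and the random test matrix are hypothetical.

```python
import numpy as np


def standardize_columns(X):
    """Center each column to mean 0 and scale it to standard deviation 1."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)                 # population standard deviation (ddof=0)
    std = np.where(std == 0, 1.0, std)  # leave constant columns unscaled
    return (X - mean) / std


# Three columns with very different spreads, as with intense vs weak mass channels.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(200, 3))
Z = standardize_columns(X)
```

After this step every column contributes equally to the total variance, so the PCA is no longer dominated by the most intense mass channels. The standardized values are not confined to a fixed interval; they take both negative values and values above 1.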

In [None]:
shz.decomp.pca(standardscale=True).scree_plot()


A reversal in results. It appears that PCA with this user-defined cutoff estimates an optimal number of components equal to the number of peaks detected earlier. Evidently, standard scaling the data meaningfully increases the number of significant components estimated by the scree plot. This indicates that it is an appropriate method of estimation for unaligned GC-MS data. To further explore the veracity of the result, we would need to estimate the total number of unique peaks across the sample dimension, as this verification has only observed the sample with the maximum intensity.

## CORCONDIA

CORCONDIA is a method of approximating the number of components of PARAFAC and PARAFAC-like models (e.g. PARAFAC2) through observation of a Tucker3 core [@bro_newefficientmethod_2003]. The algorithm iterates through the number of components, starting at 1 and ending at a user-specified limit, and we are looking for a steep drop-off as the indication of the optimal number of components. In the interest of speed we will restrict the signal to the 15 to 25 minute interval.
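For orientation, the core consistency diagnostic for a plain PARAFAC model with factor matrices $A$, $B$, $C$ and $F$ components can be sketched as below: the least-squares Tucker3 core $G$ is compared against the ideal superdiagonal core $T$, giving $100\,(1 - \sum (g_{def} - t_{def})^2 / F)$. This is a simplified illustration that assumes full column rank factors; it is not the routine behind `rank_estimation.corcondia`, and the PARAFAC2 case needs additional handling. The function name and the synthetic tensor are hypothetical.

```python
import numpy as np


def core_consistency(X, A, B, C):
    """Core consistency (percent) of a PARAFAC model with factors A, B, C."""
    n_components = A.shape[1]
    # Least-squares Tucker3 core: project X onto the pseudo-inverse of each factor.
    G = np.einsum(
        "ijk,di,ej,fk->def",
        X,
        np.linalg.pinv(A),
        np.linalg.pinv(B),
        np.linalg.pinv(C),
    )
    # Ideal core: ones on the superdiagonal, zeros elsewhere.
    T = np.zeros((n_components,) * 3)
    idx = np.arange(n_components)
    T[idx, idx, idx] = 1.0
    return 100.0 * (1.0 - np.sum((G - T) ** 2) / n_components)


# A noise-free trilinear tensor built from three random components.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
B = rng.normal(size=(15, 3))
C = rng.normal(size=(10, 3))
X = np.einsum("if,jf,kf->ijk", A, B, C)
score = core_consistency(X, A, B, C)
```

For a perfectly trilinear tensor with the correct factors, the score is 100 up to floating-point error. In practice the factors come from a fitted model, and the score typically drops sharply once the rank exceeds what the data supports.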

In [None]:
demo.sel(mz=44).isel(sample=slice(0, 44, 11)).viz.line(x="time", overlay_dim="sample")

# display(zhang_demo_intvl)
# display(
#     zhang_demo_intvl.sel(
#         mz=44,
#     )
#     .isel(sample=slice(0, 44, 11))
#     .plotly.line(x="time", color="sample")
# )


In [None]:
corcondia_results = demo.to_cabernet().rank_estimation.corcondia(
    "zhang_data", rank_range=(1, 10)
)
display(corcondia_results.diagnostic_over_rank)


As we can see, the CORCONDIA results are ambiguous over a wide range but indicate that a rank of 3 maintains model stability, which is in agreement with the PCA results. Note that while we haven't demonstrated it here, the results are semi-random, which deserves more study.

## PARAFAC2

### PARAFAC2 Demo Dataset

A demonstration of PARAFAC2 on the demo dataset. We demonstrated earlier that a rank of 3 is appropriate.


In [None]:
rank_range = 3

demo = demo.to_cabernet().decomp.parafac2(
    path="zhang_data", rank=rank_range, n_iter_max=500, nn_modes="all", linesearch=False
)
parafac2 = demo["parafac2"]
parafac2


In [None]:
components = parafac2["components"]
components.sel(mz=44).viz.line(
    x="time",
    facet_dim="sample",
    overlay_dim="component",
    n_cols=3,
).update_layout(
    title="samples per component of the Zhang et al. demo interval",
)


This result corresponds to the deconstruction shown in @zhang_flexibleimplementationtrilinearity_2022, Figure 6.

TODO make viz better.


### PARAFAC2 Full Dataset

We demonstrated earlier that through the PCA approach, we estimate 22 components for the full dataset.


The decomposition of the full dataset with rank = 22 and n_iter_max = 500 takes 7m 25s, which is too slow to retain for testing purposes. Furthermore, the results are disappointing, with some components capturing multiple peaks while others capture nothing.

In [None]:
parafac2_results_path = DATA_DIR / "zhang_full_data_parafac2_decomp_results_v2.nc"

if not parafac2_results_path.exists():
    cab = cab.decomp.parafac2(
        path="zhang_data", rank=22, verbose=True, n_iter_max=1, nn_modes="all"
    )
    cab.to_netcdf(filepath=parafac2_results_path)
else:
    cab = Cabernet.from_file(filename_or_obj=parafac2_results_path)
cab


In [None]:
components = cab["parafac2/components"]
components.sel(mz=44).isel(sample=slice(0, 44, 5)).viz.line(
    x="time",
    facet_dim="sample",
    overlay_dim="component",
    n_cols=3,
).update_layout(
    title="samples per component of the Zhang et al. full dataset",
)


As we can see from the visualization above, the decomposition is far from perfect, with some components capturing multiple peaks and many others capturing nothing. Furthermore, there are multiple negative peaks, which is indicative of a very poor model. Further research will be required to explain these results.

## Conclusion

We were able to recreate the results of @zhang_flexibleimplementationtrilinearity_2022 and to demonstrate that the scree test appears to predict the number of components well when appropriate scaling is used. We found that CORCONDIA gives semi-random results, which warrant more investigation, and that while PARAFAC2 performs admirably on a small number of peaks, it fails dramatically when applied to more complicated data.