# Rank Estimation

This is a rank estimation experiment on a toy dataset and the raw UV-Vis dataset, following the method described in @zhang_flexibleimplementationtrilinearity_2022. First we will recreate the GC-MS dataset as described and build an implementation to preprocess, unfold, decompose and display the estimated singular values.

## Toy Dataset

The GC-MS data was downloaded from [here](https://ucphchemometrics.com/2023/06/01/wine-samples-analyzed-by-gc-ms-and-ft-ir-instruments/) and originally prepared by @skov_multiblockvariancepartitioning_2008. It is stored in MATLAB 7.3 format and required the use of the [pymatreader](https://gitlab.com/obob/pymatreader/) library. The file contains GC-MS, FT-IR and physicochemical univariate measurements. The GC-MS data consists of 44 samples x 2700 elution time-points x 200 mass channels.

The authors narrowed the scope to range from 16.52 to 16.76 mins (corresponding to a range of 25 units) containing two compounds (peaks). They identified three significant components (chemical rank) attributing two to the compounds stated and one to the background. We expect to find the same.

In [10]:
%reload_ext autoreload
%autoreload 3

import numpy as np
from pymatreader import read_mat
import xarray as xr
from tensorly import unfold
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from scipy import signal
from copy import deepcopy

from pca_analysis.definitions import DATA_DIR, ROOT

xr.set_options(display_expand_data=False)

mat_path = DATA_DIR / "Wine_v7.mat"
nc_path = DATA_DIR / "Wine_v7.nc"

if nc_path.exists():
    # reuse the cached netCDF copy
    full_data = xr.open_dataarray(nc_path)

else:
    mat_data = read_mat(mat_path)

    key_it = [
        "Label_Wine_samples",
        "Label_Elution_time",
        "Label_Mass_channels",
    ]

    full_data = xr.DataArray(
        mat_data["Data_GC"],
        dims=["sample", "time", "mz"],
        coords=[mat_data[k] for k in key_it],
    )

    full_data.to_netcdf(nc_path)

# later cells refer to the same array as `raw_data`
raw_data = full_data


The data is organised in a labelled xarray `DataArray` for ease of handling.

## Preprocessing

To perform the SVD the data needs to be scaled and centered. It doesn't appear that @zhang_flexibleimplementationtrilinearity_2022 did this, or at least they didn't report it, so I will begin PCA without it. If the results are poor, I will integrate this preprocessing stage. I think I would mean-center the columns (observations) and scale the rows (samples). TODO: find resources for this.
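As a minimal sketch of the preprocessing being considered, assuming an unfolded matrix with one row per (sample, time) observation and one column per mass channel: the rows are scaled to unit norm and the columns are then mean-centered. The function name `scale_rows_center_columns` and the toy matrix are hypothetical, and the order (scale first, then center) is an assumption chosen so that the column means end up at zero.

```python
import numpy as np


def scale_rows_center_columns(X):
    """Scale each row to unit Euclidean norm, then mean-center each column."""
    X = np.asarray(X, dtype=float)
    row_norms = np.linalg.norm(X, axis=1, keepdims=True)
    row_norms[row_norms == 0] = 1.0  # leave all-zero rows untouched
    scaled = X / row_norms
    return scaled - scaled.mean(axis=0, keepdims=True)


# hypothetical toy matrix: 3 observations x 3 channels
toy = np.array([[1.0, 2.0, 2.0], [3.0, 0.0, 4.0], [0.0, 5.0, 12.0]])
pre = scale_rows_center_columns(toy)
print(pre.mean(axis=0))  # column means, numerically zero
```

Centering after scaling means the rows are no longer exactly unit norm; whether that matters for the rank estimate is something to check against the literature.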

### Slicing

I will use [tensorly](https://tensorly.org/stable/user_guide/tensor_basics.html#unfolding) to unfold the numpy array, leaving the xarray for now as it has a potentially different API. That is fine, but the modes are then unlabelled, so the labels need to be reapplied afterwards. It is probably easiest just to find the indexes manually.

In [None]:
full_data


In [None]:
import matplotlib.pyplot as plt

sliced_data = full_data.sel(time=slice(16.52, 16.76))  # interval taken from article
display(sliced_data)
sliced_data[0].plot.line(x="time", add_legend=False);


Which looks correct.

### Unfolding

Unfolding along the sample mode so that the matrix is stacked sample-wise, with rows as (sample, time) pairs and columns as mass channels.

In [None]:
sample_aug = sliced_data.stack(sample_aug=("sample", "time")).transpose(..., "mz")
sample_aug
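The same sample-wise unfolding can be sketched on a toy numpy array, assuming the mode order (sample, time, mz). The array sizes are arbitrary placeholders, not those of the wine data.

```python
import numpy as np

n_sample, n_time, n_mz = 3, 4, 5
cube = np.arange(n_sample * n_time * n_mz).reshape(n_sample, n_time, n_mz)

# stack the (time x mz) slab of each sample on top of the previous one
unfolded = cube.reshape(n_sample * n_time, n_mz)

# the row for a given (sample, time) pair is sample * n_time + time
s, t = 2, 1
row = s * n_time + t
print(unfolded.shape)  # (12, 5)
```

With C-ordered arrays the reshape keeps each mass spectrum intact as a row, which is what the `stack(...).transpose(..., "mz")` call does with labels attached.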


As we are skipping scaling and centering for now, we will proceed with PCA.

## PCA 

In [67]:
pca = PCA()
sample_aug.pipe(pca.fit_transform);


In [68]:
explained_variance = pca.explained_variance_


We will crudely define the inflection point as being the location where the finite difference becomes less than 2% of the maximum.

In [None]:
from pca_analysis.notebooks.experiments.zhang_pca_scree_plot import pca_scree_plot

pca_scree_plot(explained_variance)
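The 2% rule can be sketched as follows. This is not the `pca_scree_plot` implementation, which is not shown here. The function name `n_significant_components` and the example values are hypothetical, and taking "the maximum" to mean the largest absolute finite difference is an assumption.

```python
import numpy as np


def n_significant_components(explained_variance, threshold=0.02):
    """Count components before the first drop smaller than threshold * max drop."""
    ev = np.asarray(explained_variance, dtype=float)
    drops = np.abs(np.diff(ev))  # drops[i] is the change from component i to i + 1
    below = np.flatnonzero(drops < threshold * drops.max())
    # components before the first negligible drop are counted as significant
    return int(below[0]) if below.size else ev.size


# hypothetical scree values: three large components, then a flat noise floor
example = [100.0, 40.0, 10.0, 0.5, 0.4, 0.3]
print(n_significant_components(example))  # 3
```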


According to the rule of <2% change, the number of significant components is deemed to be {eval}`sig_component_idxs[0][-1]+1`, which is in agreement with the authors, and no preprocessing was necessary.

This example will remain as a test case. We will now test the method on the entire toy dataset, then on my dataset.

## Full Dataset

Do the same thing but for the full dataset.

### Visual Estimation of Number of Peaks

What is the expected number of components? It should be close to the number of peaks in the maximum mass channel.

In [None]:
# average mz over time and samples

mean_max_mz = raw_data.mean("time").mean("sample").idxmax().item()
mean_max_mz


The mass channel with the highest mean intensity is: {eval}`mean_max_mz`

And which sample has the highest average for that channel?

In [None]:
max_sample = raw_data.sel({"mz": mean_max_mz}).mean("time").idxmax().item()
max_sample


which is {eval}`max_sample`

A visual observation:

In [None]:
max_sample = raw_data.sel({"mz": mean_max_mz, "sample": max_sample})

max_sample.plot.line()


In [None]:
max_sample.shape


In [None]:
class FindPeaks:
    def find_peaks(self, sample, height_ratio=0.001):
        self.height = (sample.max() * height_ratio).item()
        self.peaks, self.properties = signal.find_peaks(sample, height=self.height)

        self.peaks_x = sample["time"][self.peaks].to_numpy()
        self.peaks_y = sample[self.peaks].to_numpy()

        return self

    def plot_peaks(self, sample):
        sample.plot.line()
        plt.scatter(self.peaks_x, self.peaks_y)
        plt.xlim(0, 25)

        return self


fp = FindPeaks()
fp.find_peaks(sample=max_sample)
fp.plot_peaks(sample=max_sample)
peaks_x, peaks_y = fp.peaks_x, fp.peaks_y
peaks_x, peaks_y
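To illustrate how the `height_ratio` threshold behaves, here is a small sketch on a synthetic, noise-free signal made of three Gaussian peaks. The signal and its parameters are invented for illustration. `scipy.signal.find_peaks` keeps only the local maxima at or above the `height` threshold.

```python
import numpy as np
from scipy import signal

# synthetic chromatogram: one dominant peak and two minor ones
t = np.linspace(0, 20, 2001)
centers, heights, width = [5.0, 10.0, 15.0], [100.0, 1.0, 0.5], 0.3
y = sum(h * np.exp(-0.5 * ((t - c) / width) ** 2) for c, h in zip(centers, heights))

# a low ratio keeps the minor peaks, a higher ratio keeps only the dominant one
loose, _ = signal.find_peaks(y, height=y.max() * 0.001)
strict, _ = signal.find_peaks(y, height=y.max() * 0.05)
print(len(loose), len(strict))
```

On real data with baseline noise a ratio this low will also pick up noise maxima, so the count is best treated as an upper estimate.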


In [None]:
fp


Visually it appears that all peaks are accounted for, so the number of peaks is around {eval}`len(peaks_x)`. We should therefore expect around 22 significant components through PCA.

### Unfold Full Dataset

In [None]:
caug = raw_data.stack({"aug": ("sample", "time")}).transpose(..., "mz")
caug


In [None]:
from sklearn import decomposition


class MyPCA:
    def run_pca(self, data):
        obj = deepcopy(self)
        obj.pca = decomposition.PCA()

        obj.pca.fit_transform(data)

        return obj

    def scree_plot(self):
        if not hasattr(self, "pca"):
            raise RuntimeError("call `run_pca` first")

        xp_var = self.pca.explained_variance_[:10]
        x = range(1, len(xp_var) + 1)
        plt.bar(x, xp_var)
        plt.xlabel("components")
        plt.ylabel("explained variance")
        plt.title("explained variance vs. explained components")
        plt.plot(x, np.cumsum(xp_var), "r")


pca = MyPCA()
pca = pca.run_pca(caug)
pca.scree_plot()


But as we can see, the profile is coincidentally similar to the two-peak slice, with the vast majority of the variance explained by the first three components, in fact by the first two. We can presume that this is because of the dominance of the maxima peak. If we remove it from the set...

## Without Maxima

In [None]:
# cut the maxima off while retaining a decent amount of baseline:
# start the cut half way between the maxima and the next peak

maxima_idx = peaks_y.argmax()
next_peak_idx = maxima_idx + 1
cut_time_start = (
    peaks_x[maxima_idx] + (peaks_x[next_peak_idx] - peaks_x[maxima_idx]) / 2
)

# chop the empty tail of the signal off - empty is defined as anything further
# from the last peak than the average gap between peaks

mean_distance = int(np.ceil(np.diff(peaks_x).mean()))
last_peak = peaks_x[-1]
cut_time_end = last_peak + mean_distance

shortened = raw_data.where(
    (raw_data.time >= cut_time_start) & (raw_data.time <= cut_time_end)
).dropna("time")
shortened.sel(mz=44).plot(col="sample", col_wrap=4)


Looks good. What does the PCA look like?

In [None]:
stacked_shortened = shortened.stack({"aug": ("sample", "time")}).transpose(..., "mz")

pca_shortened = pca.run_pca(data=stacked_shortened)
pca_shortened.scree_plot()


Um. Kind of better? At least the drop-off isn't so harsh, but evidently the PCA is still being dominated by only a few latent variables. It's time to implement scaling and centering and then, if that doesn't work, binning *shudder*.

We want to mean-center the columns and scale the sample rows.

In [None]:
from sklearn.preprocessing import StandardScaler, Normalizer

# Normalizer scales each row to unit norm;
# StandardScaler then centers each column and scales it to unit variance
normalizer = Normalizer()
scaler = StandardScaler()
normed = normalizer.fit_transform(stacked_shortened)
scaled_normed = scaler.fit_transform(normed)
scaled_normed.shape
normed[0:10, 0:10]


In [None]:
# it is mean centered if the following equals zero.

np.mean(np.abs(np.round(np.mean(scaled_normed, axis=0), 9)))


In [None]:
pca_sn = pca.run_pca(scaled_normed)
pca_sn.scree_plot()  # plot the scaled and normalised result, not the earlier `pca`


Same. Fairly evident that binning is required.

## CORCONDIA

As the PCA approaches are failing to produce the expected results without extensive manual handling, we will move on to CORCONDIA (the core consistency diagnostic), a method of estimating the number of components of PARAFAC and PARAFAC-like models (e.g. PARAFAC2) through observation of a Tucker3 core [@bro_newefficientmethod_2003]. The diagnostic is evaluated for an increasing number of components, starting at 1, until a limit (?), and we are looking for a steep drop-off as the indication of the optimal number of components.
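As a rough sketch of the core consistency idea (not the `corcondia` package's implementation, which is not shown here): given PARAFAC factor matrices `A`, `B` and `C`, with any component weights absorbed into the factors, the least-squares Tucker3 core is computed from their pseudo-inverses and compared against the ideal superdiagonal core of ones. The function name `core_consistency` and the random toy tensor are hypothetical.

```python
import numpy as np


def core_consistency(X, A, B, C):
    """Core consistency (%) of a 3-way PARAFAC model with factors A, B, C."""
    rank = A.shape[1]
    # least-squares Tucker3 core for the given factor matrices
    G = np.einsum(
        "ijk,pi,qj,rk->pqr",
        X,
        np.linalg.pinv(A),
        np.linalg.pinv(B),
        np.linalg.pinv(C),
    )
    # ideal PARAFAC core: ones on the superdiagonal, zeros elsewhere
    T = np.zeros((rank, rank, rank))
    idx = np.arange(rank)
    T[idx, idx, idx] = 1.0
    return 100.0 * (1.0 - np.sum((G - T) ** 2) / rank)


# toy check: an exactly trilinear rank-2 tensor evaluated with its true factors
rng = np.random.default_rng(0)
A, B, C = rng.random((6, 2)), rng.random((7, 2)), rng.random((8, 2))
X = np.einsum("ir,jr,kr->ijk", A, B, C)
cc = core_consistency(X, A, B, C)
print(round(cc, 6))
```

For an exactly trilinear tensor with its true factors the value should sit at essentially 100. In practice the factors come from a PARAFAC model fitted at each candidate number of components, and the value drops sharply once too many components are used.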


In [None]:
from corcondia import corcondia_3d


In [None]:
corcondia_3d(X=raw_data.transpose("sample", "time", "mz").to_numpy(), k=22)


CORCONDIA hasn't worked. The vibe of things is that none of these tools work for high numbers of peaks. Binning is required.

## PARAFAC2

In [None]:
from tensorly.decomposition import parafac2

best_err = np.inf
decomposition = None

true_rank = 22

for run in range(1):
    print(f"Training model {run}...")
    trial_decomposition, trial_errs = parafac2(
        raw_data[0:5, :, :].to_numpy(),
        true_rank,
        return_errors=True,
        tol=1e-8,
        n_iter_max=500,
        random_state=run,
        verbose=True,
    )
    print(f"Number of iterations: {len(trial_errs)}")
    print(f"Final error: {trial_errs[-1]}")
    if best_err > trial_errs[-1]:
        best_err = trial_errs[-1]
        err = trial_errs
        decomposition = trial_decomposition
    print("-------------------------------")
print(f"Best model error: {best_err}")


In [None]:
decomposition
