# Brute Force PARAFAC2 Decomposition of Shiraz Dataset


Note: this is a reconstruction of an experiment which was accidentally deleted.

Attempts to use CORCONDIA to estimate the rank of the dataset have failed, so we will attempt to 'brute force' a solution instead. This will also provide insight into how the model changes with increasing rank.

In [None]:
%reload_ext autoreload
%autoreload 3 --print

# get the test data as two tables: metadata and a samplewise stacked img table
import logging

import duckdb as db
import plotly.express as px
import polars as pl
from database_etl import get_data
from pca_analysis import xr_signal

from pca_analysis.definitions import DB_PATH_UV
from pca_analysis.get_sample_data import get_ids_by_varietal
import plotly.io as pio
import xarray as xr
import darkdetect

logger = logging.getLogger(__name__)

logger.setLevel(logging.DEBUG)

xr.set_options(display_expand_data=False, display_expand_coords=False)

if darkdetect.isDark():
    pio.templates.default = "plotly_dark"


with db.connect(DB_PATH_UV) as conn:
    ids = get_ids_by_varietal("shiraz", conn)

    ds = get_data(output="xr", con=conn, runids=ids)

# replace id with id_rank to be more human friendly
ds = (
    ds.assign_coords(
        id_rank=lambda x: (
            "id",
            x.coords["id"].to_dataframe()["id"].rank(method="dense").astype(int),
        )
    )
    .sel(wavelength="256")
    .swap_dims({"id": "id_rank"})
)
ds = ds.rename({"imgs": "raw_data"})


In [None]:
ds.pipe(
    xr_signal.facet_plot_multiple_traces,
    grouper=["id_rank"],
    data_keys=["raw_data"],
    x="mins",
    trace_kwargs=[dict(mode="lines", line=dict(color="cadetblue"))],
    col_wrap=3,
    fig_kwargs=dict(y_title="au"),
)


As we can see, sample 2 is very much an outlier when compared to the other samples, and will be removed.


In [None]:
ds = ds.where(lambda x: x.id_rank != 2, drop=True)


## Binning (Peak Picking)

- @ianeselli_completeanalysispipeline_2024 used hierarchical agglomerative clustering (HAC). They preprocessed with smoothing and peak alignment, then applied unsupervised HAC with a Euclidean distance metric and linkage, cutting the dendrogram at a height equal to the average width of the peaks.
- @sinanian_multivariatecurveresolutionalternating_2016 binned to unit masses, but also discussed the Regions of Interest (ROI) approach of @bedia_compressionstrategieschemometric_2016, stating that while it is useful for eliminating a large amount of unnecessary data, it can miss low-intensity peaks if the threshold is set too high.
- @bedia_compressionstrategieschemometric_2016 describes a feature detection and data compression method based on the *centWave* algorithm.
- @anthony_libraryintegratedsimplismaalsdeconvolution_2022 manually binned the peaks.
- @anthony_libraryintegratedsimplismaalsdeconvolution_2022 state that simple models are better at peak picking than complicated, abstract ones. They describe current issues with peak picking as a significant bottleneck in metabolomic studies.
- @haas_opensourcechromatographicdata_2023 released a Python GUI package for analysis of HPLC-DAD data including peak picking.

## Baseline Subtraction

To simplify tool development, we should first subtract the baseline from each sample. Whether there is a true baseline is questionable; however, the rise and fall roughly corresponds with the change in methanol concentration in the mobile phase, which potentially introduces background absorption. Either way, the data will be easier to work with once the baselines are zeroed.
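
For intuition, the core idea of SNIP (statistics-sensitive non-linear iterative peak-clipping) can be sketched in a few lines of NumPy. This is a simplified illustration only, not the `pybaselines` implementation used in the cell below; the function name `snip_sketch` and the synthetic ramp-plus-peak signal are assumptions made for the example.

```python
import numpy as np


def snip_sketch(y, max_half_window):
    """Simplified peak clipping: for growing half-windows p, replace each
    point by the mean of its neighbours at distance p when that is lower."""
    baseline = np.asarray(y, dtype=float).copy()
    n = baseline.size
    for p in range(1, max_half_window + 1):
        if 2 * p >= n:
            break
        # mean of the points p samples to the left and right (previous iteration)
        neighbour_mean = 0.5 * (baseline[: -2 * p] + baseline[2 * p :])
        baseline[p : n - p] = np.minimum(baseline[p : n - p], neighbour_mean)
    return baseline


# synthetic chromatogram: a linear ramp (the "baseline") plus one Gaussian peak
x = np.linspace(0, 10, 501)
ramp = 0.2 * x + 1.0
y = ramp + 10 * np.exp(-0.5 * ((x - 5.0) / 0.1) ** 2)

baseline = snip_sketch(y, max_half_window=30)
corrected = y - baseline
```

The half-window needs to exceed the peak half-width for the peak to be clipped down to the underlying ramp, which is why `max_half_window` is data dependent.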

In [None]:
import numpy as np
from pybaselines.smooth import snip


def apply_snip(da: xr.DataArray, **kwargs):
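    """
    Estimate a SNIP baseline for each sample in `da` and subtract it.

    Expects `da` to be named "raw_data". Returns a Dataset containing
    `raw_data`, `baselines` and `data_corr`. `kwargs` are passed to `snip`.

    docs: https://pybaselines.readthedocs.io/en/latest/api/pybaselines/smooth/index.html
    """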
    blines = []
    for x in da:
        bline, _ = snip(x, **kwargs)
        blines.append(bline)

    blines_ = np.stack(blines)
    blines_da = da.copy(data=blines_)
    blines_da.name = "baselines"

    da_ = xr.merge([da, blines_da])
    da_ = da_.assign(data_corr=lambda x: x["raw_data"] - x["baselines"])
    return da_


ds = ds.raw_data.pipe(apply_snip, max_half_window=30).where(
    lambda x: x.mins < 30, drop=True
)
(
    ds.pipe(
        xr_signal.facet_plot_multiple_traces,
        grouper=["id_rank"],
        data_keys=["raw_data", "data_corr"],
        x="mins",
        trace_kwargs=[
            dict(mode="lines", line=dict(color="cadetblue"), opacity=0.55),
            dict(mode="lines", line=dict(color="red"), opacity=0.95),
        ],
        col_wrap=3,
        fig_kwargs=dict(
            y_title="au",
        ),
    ).update_layout(height=1500)
)


In [None]:
ds


We first need to choose a test set: a region containing a cluster of three or so peaks, with a reasonably flat baseline on either side of the cluster. One method of finding the points between clusters is to find local minima. The easiest way to do this is to invert the signal and run a peak finding routine.
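
A minimal sketch of the inversion trick, assuming SciPy's `find_peaks` as the peak finding routine and a synthetic three-peak signal invented for the example:

```python
import numpy as np
from scipy.signal import find_peaks


def gauss(x, mu, sigma, amp):
    return amp * np.exp(-0.5 * ((x - mu) / sigma) ** 2)


# synthetic signal: two overlapping peaks and one isolated peak
x = np.linspace(0, 10, 1001)
signal = gauss(x, 2.0, 0.3, 5) + gauss(x, 3.0, 0.3, 4) + gauss(x, 7.0, 0.3, 6)

# local minima of `signal` are local maxima of `-signal`
minima_idx, _ = find_peaks(-signal)
minima_x = x[minima_idx]
```

On real, noisy data every small dip would be reported, so arguments such as `prominence` or prior smoothing would likely be needed to keep only the valleys between clusters.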

## Smoothing

After gross baseline removal comes smoothing. The criterion is that, with the default `find_peaks` parameters, no peaks are detected within the first 0.77 seconds. This can be achieved through Savitzky-Golay (savgol) smoothing.
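
One way this criterion could be checked is sketched below, assuming SciPy's `savgol_filter` and `find_peaks`. The helper `n_peaks_before`, the synthetic noisy signal and the candidate window lengths are all hypothetical, and the cutoff of 0.77 is simply taken to be in the same units as the x axis.

```python
import numpy as np
from scipy.signal import find_peaks, savgol_filter


def n_peaks_before(y, x, cutoff):
    """Count peaks found with default `find_peaks` parameters where x < cutoff."""
    idx, _ = find_peaks(y)
    return int(np.sum(x[idx] < cutoff))


# synthetic signal: one peak on a flat, noisy baseline
rng = np.random.default_rng(0)
x = np.arange(0, 5, 0.01)
clean = 10 * np.exp(-0.5 * ((x - 2.5) / 0.1) ** 2)
noisy = clean + rng.normal(scale=0.05, size=x.size)

cutoff = 0.77  # assumed to share the units of `x`
raw_count = n_peaks_before(noisy, x, cutoff)

# count early peaks after smoothing with progressively wider windows
counts = {}
for window_length in (11, 31, 71):
    smoothed = savgol_filter(noisy, window_length=window_length, polyorder=2)
    counts[window_length] = n_peaks_before(smoothed, x, cutoff)
```

Comparing `counts` against `raw_count` shows how much smoothing is required before the early region stops producing spurious peaks; the smallest window that meets the criterion distorts the real peaks least.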

## Clustering


In [None]:
from pca_analysis.preprocessing.pipeline import clustering_by_maxima

ds_, box = clustering_by_maxima(
    ds=ds,
    signal_key="data_corr",
    x_key="mins",
    grouper=["id_rank"],
    by_maxima=True,
    # savgol_kwargs=dict(
    #     polyorder=2,
    #     window_length=70,
    # ),
    display_facet_peaks_plot=True,
    display_cluster_table=True,
    facet_peak_plot_kwargs=dict(
        col_wrap=3,
    ),
    find_peaks_kws=dict(rel_height=0.5, prominence=2),
    clustering_kws=dict(
        n_clusters=None,
        distance_threshold=2,
        linkage="average",
    ),
)

# when not using `n_clusters`, a lower `distance_threshold` increases the number of
# clusters.

# when clustering by minima, if one signal's minimum coincides with another's maximum,
# that peak will be cut. This is not desired.
box


In [None]:
ds


The results are quite acceptable. Without much fiddling, ranges are identified within which a reasonable number of peaks fall (2 < x < 6). The only drawback is that some of the parameter values are currently hard-coded, data-dependent values, meaning that a different baseline subtraction will require different values. This is a problem to be solved down the track, but for now it means that every run will require a little manual tuning.

We will now take a moment to study the clustering of 1D arrays; see [clustering](../clustering.ipynb).

@ianeselli_completeanalysispipeline_2024 used hierarchical agglomerative clustering to cluster the peak maxima, with a Euclidean distance metric, average linkage, and a distance threshold equal to the average peak width.
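
A minimal sketch of this kind of clustering on 1D peak positions, using SciPy's `linkage` and `fcluster`. The peak positions are invented, the threshold of 2 mirrors the `distance_threshold` used earlier, and this is not necessarily how `clustering_by_maxima` is implemented internally.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# hypothetical retention times (mins) of detected peak maxima
peak_mins = np.array([1.0, 1.4, 1.9, 6.0, 6.5, 12.0, 12.3, 12.9])

# average linkage on the Euclidean distance between 1D positions
Z = linkage(peak_mins.reshape(-1, 1), method="average", metric="euclidean")

# cut the dendrogram at a distance threshold; a lower threshold yields more clusters
labels = fcluster(Z, t=2.0, criterion="distance")
labels_fine = fcluster(Z, t=0.45, criterion="distance")
```

Because only the x positions of the maxima are passed in, the clusters form along the time axis rather than by peak height.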


Observations:
- clustering on the whole dataset doesn't make sense: clustering on the whole signal would simply cluster according to peak heights, i.e. along the y-axis rather than the x-axis. That's why we detect peaks first, either the minima or the maxima. Ianeselli chose to cluster the peak maxima rather than the minima.
- maximising the number of peaks maximises the extent of the signal captured into all clusters. This is actually beneficial for rough binning.
- more often than not, the center of an inter-cluster region approximates a local minimum, meaning that splitting the inter-peak area between the two neighbouring clusters is a good method of ensuring that the full peak width is captured by its cluster region.
- the problem of cluster labelling is akin to a [gaps and islands problem](https://mattboegner.com/improve-your-sql-skills-master-the-gaps-islands-problem/)
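
The midpoint-splitting and labelling ideas in these observations can be sketched with NumPy as follows. The peak positions, the cluster labels (assumed to increase along x) and the sampling grid are hypothetical values chosen for the example.

```python
import numpy as np

# hypothetical peak maxima (mins) and their cluster labels, ordered along x
peak_mins = np.array([1.0, 1.4, 1.9, 6.0, 6.5, 12.0, 12.3, 12.9])
labels = np.array([0, 0, 0, 1, 1, 2, 2, 2])

# "islands": the extent of each cluster along x
cluster_ids = np.unique(labels)
starts = np.array([peak_mins[labels == k].min() for k in cluster_ids])
ends = np.array([peak_mins[labels == k].max() for k in cluster_ids])

# "gaps": split each inter-cluster region at its midpoint
boundaries = 0.5 * (ends[:-1] + starts[1:])

# assign every sample on the time axis to the cluster region it falls in
x = np.arange(0, 15, 0.5)
x_labels = np.searchsorted(boundaries, x)
```

A padding parameter, as listed in the TODOs, could be layered on top by trimming the first and last regions to `starts[0] - pad` and `ends[-1] + pad`.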

- TODO add a padding parameter
- TODO add a smoothing and sharpening routine.
- TODO achieve average peak density of 3 peaks per cluster. With sufficient smoothing/sharpening this should be possible.
- TODO add demonstrations for other cluster methods (integrate into function?)