---
title: "My Document"
format: html
bibliography: references.bib
link-citations: true
---


# Retention Time Alignment

Due to systemic error, chromatograms of samples run under the same experimental conditions often exhibit retention time shifts for the same compound peak. Inter-sample data analysis requires that features are aligned within the same vector. The magnitude and direction of the run-to-run shift is unique to each peak in each run, within a distribution, which makes alignment a complex problem. A traditional approach is to reduce the dimensionality of the chromatograms through an aggregate measure such as peak area, discarding the time axis in favor of an element-wise ordering. This approach has two downsides: the first is the manual marking, grouping and ordering of peaks across the sample set, which is subjective and often irreproducible [@zheng_2017]; the second is the arbitrary loss of information fed to the statistical model, specifically the signal shapes [@nielsen_1998]. @bos-RecentApplicationsChemometrics-2020 lists correlation-optimized warping (COW), dynamic time warping (DTW), and correlation-optimized shifting (COSHIFT) as the most popular methods of alignment.

## DTW

Dynamic time warping is a method of aligning two time series, a reference series (to be aligned to) and a query series (to be aligned), originally developed in the context of speech recognition technology [@velichko_1970]. Alignment is achieved by localized warping, that is, stretching and compressing the query series until the distance between the two series is minimized. The distance between the two signals is measured as the sum of the distances between each element-wise pairing of the series [@giorgino_2009]. @jiao_2015 showed that DTW is an appropriate method of aligning organic sample chromatogram datasets, and @bork_2013 discussed the importance of DTW for process monitoring. They describe how traditional DTW can alter y-axis values while aligning the x-axis; however, an extension to the method termed 'Derivative Dynamic Time Warping' [@keogh_2001] respects the shape of each series by observing their first derivatives, reducing unnecessary modifications.
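
As a rough illustration of why working on first derivatives helps, the sketch below estimates the derivative in the way I understand @keogh_2001 to describe it (the average of the left slope and the centred slope). The function name `keogh_derivative` and the toy series are assumptions made for this example; they are not part of `dtwalign` or of the notebook's own code.

```python
import numpy as np


def keogh_derivative(q):
    """Estimate the first derivative of a series for derivative DTW.

    Interior points use the mean of the left slope and the centred slope;
    the two end points copy their neighbour's estimate.
    """
    q = np.asarray(q, dtype=float)
    if q.size < 3:
        raise ValueError("need at least three points")
    d = np.empty_like(q)
    d[1:-1] = ((q[1:-1] - q[:-2]) + (q[2:] - q[:-2]) / 2.0) / 2.0
    d[0] = d[1]
    d[-1] = d[-2]
    return d


# two toy series with identical shape but a constant offset of 10
offset_free = keogh_derivative([0, 1, 2, 3])
shifted = keogh_derivative([10, 11, 12, 13])
```

Both toy series give the same derivative estimate, so a local distance computed on derivatives ignores a constant vertical offset between the query and the reference.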

### Outputting Aligned Tensors Through DTW

Tomasi et al. discuss signal alignment through DTW [@tomasi_2004, p. 7, sec. 2.3.3], noting that DTW does not itself produce aligned series of the same length; the output is shortened or lengthened depending on the warping path taken. Stacking sample signals into a tensor, however, requires signals of equal length. Producing them is not the design intent of the DTW algorithm, which is rather used to output the cost of aligning the series in the form of a distance metric. They do state that the desired synchronization can be achieved either by taking the mean value over intervals of stretching in the query (removing repeated time points in the warping), or by an asymmetric warping algorithm which maps the query directly onto the reference, but note, quoting @kassidas_1997, that this can cause discontinuities in the warped signal.
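
The averaging option can be sketched as follows. This is a minimal illustration under my own assumptions, not Tomasi et al.'s implementation: the warping path is assumed to be a list of zero-based `(query_index, reference_index)` pairs, and `synchronize_to_reference` is a hypothetical helper name.

```python
import numpy as np


def synchronize_to_reference(query, path, ref_len):
    """Return a query resampled to the reference length.

    Every query value that the path maps onto the same reference index
    is averaged, so the output has exactly ref_len points.
    """
    query = np.asarray(query, dtype=float)
    path = np.asarray(path, dtype=int)
    sums = np.zeros(ref_len)
    counts = np.zeros(ref_len)
    # accumulate query values per reference index (handles repeated indices)
    np.add.at(sums, path[:, 1], query[path[:, 0]])
    np.add.at(counts, path[:, 1], 1)
    if np.any(counts == 0):
        raise ValueError("path does not visit every reference index")
    return sums / counts


query = [0.0, 2.0, 4.0, 1.0]
# hypothetical path: query points 1 and 2 both map to reference point 1,
# and query point 3 is repeated for reference points 2 and 3
path = [(0, 0), (1, 1), (2, 1), (3, 2), (3, 3)]
aligned = synchronize_to_reference(query, path, ref_len=4)
```

Because the output always has the reference length, series synchronized this way against a common reference can be stacked directly into a matrix or tensor.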

## COW

@skov_2006 discussed the use of the COW algorithm for alignment of chromatographic data. The Correlation Optimized Warping (COW) algorithm was developed by Nielsen et al. [@nielsen_1998] for the purpose of data preparation for multivariate statistical analysis.

COW can be used on 2D and 3D data, and the output can be fed directly into models such as PCA [@nielsen_1998].

COW is similar to DTW constrained to a number of windows along the time axis [@nielsen_1998].

What is COW?

COW aligns one signal onto another through localized linear stretching and compression of its time axis.
COW was developed by Nielsen et al. who demonstrated its use on single and multi-channel HPLC-DAD chromatograms of fungal extracts [@nielsen_1998].

The theory of COW is as follows (using notation from @nielsen_1998): for two signals, the 'target' ($T$) and the 'profile' ($P$), the two signals are divided up into a finite number of sections ($N=\frac{L_P}{m}$, where $L_P$ is the length of $P$ and $m$ is the section length), each of which is internally warped to maximise alignment, resulting in an aligned signal ($P'$). This operation is constrained to ensure that time ordering is retained. Warping can either stretch or compress a section, and in the case of a length mismatch, $P'$ is linearly interpolated to match the length of $T$. The warping magnitude is constrained through the parameter $t$, called 'the slack'. For $P$ and $T$ of different lengths, the change in section length lies within the range $\Delta \pm t$, where $\Delta=\frac{L_T}{N}-m$ and $L_T$ is the length of $T$. Warping is performed section by section: the correlation coefficient of each pair of sections is calculated, and the end point of the process is the maximisation of the sum of these correlation coefficients. The optimal combination of section warpings is identified through dynamic programming. The calculation only requires specification of the segment length and slack parameters. Nielsen et al. demonstrated that COW performed better when fitting 3D data than 2D, as the 'spectral information' restricted overfitting. In their example, they showed that subsetting the dataset to the target interval and baseline correction improved alignment. They also recommended using the cubed correlation coefficient over the base form, as it selects for optimized alignment without compromise [@nielsen_1998].
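
To make the section arithmetic concrete, here is a simplified sketch of a single COW step: compute $N$ and $\Delta$, then try every end point allowed by the slack for one profile section and keep the one whose linearly interpolated shape correlates best with the target section. It is not the full algorithm, which couples all sections through dynamic programming. The function names, the floor division used for $N$, and the synthetic peak are all assumptions for illustration.

```python
import numpy as np


def cow_layout(len_p, len_t, m, slack):
    """Section count N, offset delta, and the allowed range delta +/- slack."""
    n_sections = len_p // m              # N = L_P / m (floor division: a simplification)
    delta = len_t / n_sections - m       # delta = L_T / N - m
    return n_sections, delta, (delta - slack, delta + slack)


def stretch(segment, new_len):
    """Linearly interpolate a segment onto new_len evenly spaced points."""
    segment = np.asarray(segment, dtype=float)
    old_x = np.linspace(0.0, 1.0, segment.size)
    new_x = np.linspace(0.0, 1.0, new_len)
    return np.interp(new_x, old_x, segment)


def best_section_warp(profile, target_section, start, m, slack):
    """Search one section's end point within the slack, maximising correlation."""
    best_u, best_r = None, -np.inf
    for u in range(-slack, slack + 1):
        end = start + m + u
        if end - start < 2 or end > len(profile):
            continue
        warped = stretch(profile[start:end], len(target_section))
        r = np.corrcoef(warped, target_section)[0, 1]
        if r > best_r:
            best_u, best_r = u, r
    return best_u, best_r


# hypothetical data: the profile holds the target peak stretched from 20 to 24 points
grid = np.arange(20)
target_section = np.exp(-0.5 * ((grid - 9.5) / 2.0) ** 2)
profile = np.concatenate([stretch(target_section, 24), np.zeros(6)])

n_sections, delta, slack_range = cow_layout(len_p=120, len_t=100, m=20, slack=5)
best_u, best_r = best_section_warp(profile, target_section, start=0, m=20, slack=5)
```

In the real algorithm the end point chosen for one section is the start point of the next, which is why the sum of correlations has to be optimised globally rather than section by section as done here.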

DTW and COW were compared by Tomasi et al. [@tomasi_2004]. The MATLAB code the group wrote for COW can be found on the website of [Chemometrics Group of Copenhagen](https://ucphchemometrics.com/warping/).

COW has been adapted for 2D chromatography by several independent research groups [@zhang_2008; @gros_2012].


## DTW Algorithm

Notes from @giorgino_ComputingVisualizingDynamic_2009.

### Introduction

$X$: the *test* or *query*
$Y$: the *reference*

*i*: index of $X$
*j*: index of $Y$

*f*: *local dissimilarity function* defined between pairs of $x_i$ and $y_j$. Non negative: $$d(i, j)=f(x_i, y_j) \geq 0$$

$d$: cross-distance matrix between $X$ and $Y$

$\phi(k)$: warping curve $$\phi(k)=(\phi_x(k), \phi_y(k))$$

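Where $\phi_x(k)$ and $\phi_y(k)$ output integer time indices, $\phi_x(k) \in \{1 \ldots N\}$ and $\phi_y(k) \in \{1 \ldots M\}$, for $k = 1 \ldots T$, with $N$ and $M$ the lengths of $X$ and $Y$.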

$\phi_x$ remaps $X$ time indices
$\phi_y$ remaps $Y$ time indices

Given $\phi$, there is an average accumulated distortion between the warped $X$ and $Y$: $$d_\phi(X, Y) = \sum_{k=1}^{T} d(\phi_x(k), \phi_y(k)) \frac{m_\phi(k)}{M_\phi}$$

$m_\phi(k)$: per-step weighting coefficient

$M_\phi$: the corresponding normalizing constant

 $\phi$ is constrained to ensure reasonable results. One constraint is monotonicity to ensure time ordering/avoid unnecessary loops: $$\phi_x(k+1) \geq \phi_x(k)$$  $$\phi_y(k+1) \geq \phi_y(k)$$

The goal of DTW is to minimize the distance between $X$ and $Y$: $$D(X, Y)=\min_{\phi} d_\phi(X, Y)$$

Note: this mentions that Y is also deformed "The deformation of the time axes of $X$ **and** $Y$" (emphasis mine).

DTW can be computed in $O(N \cdot M)$ time.

$D(X, Y)$: "minimum global dissimilarity", "DTW distance". Stretch insensitive measure of the 'inherent difference' between $X$ and $Y$
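
The definitions above can be sketched as a small dynamic program. This is a minimal illustration rather than the `dtwalign` implementation: it assumes $f(x_i, y_j) = |x_i - y_j|$, unit step weights ($m_\phi(k) = 1$) with no normalization, and zero-based indices. The name `dtw_distance` is chosen to avoid shadowing `dtwalign.dtw`.

```python
import numpy as np


def dtw_distance(x, y):
    """Return the accumulated DTW cost and the warping path [(i, j), ...]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = x.size, y.size
    d = np.abs(x[:, None] - y[None, :])      # cross-distance matrix d(i, j)
    acc = np.full((n, m), np.inf)            # accumulated cost matrix
    acc[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1, j] if i > 0 else np.inf,
                acc[i, j - 1] if j > 0 else np.inf,
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            acc[i, j] = d[i, j] + best_prev

    # backtrack from the end point to the origin along the cheapest predecessors
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1, j], i - 1, j))
        if j > 0:
            candidates.append((acc[i, j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return acc[-1, -1], path


# toy example: y is x delayed by one sample
x_demo = [0, 0, 1, 2, 1, 0]
y_demo = [0, 1, 2, 1, 0, 0]
distance, path = dtw_distance(x_demo, y_demo)
```

The double loop visits each cell of the $N \times M$ matrix once, which is where the $O(N \cdot M)$ cost comes from, and the three allowed predecessors enforce the monotonicity constraint on $\phi_x$ and $\phi_y$.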


## dtwalign

`dtwalign` is a Python package for DTW whose output includes the alignment (warping) path.


In [None]:
%load_ext autoreload
%autoreload 2

from pybaselines import Baseline
from dtwalign import dtw
import pandas as pd
import numpy as np
from wine_analysis_hplc_uv import definitions
import seaborn as sns
import matplotlib.pyplot as plt
from wine_analysis_hplc_uv.notebooks.dtw_methods import DTWNotebookMethods
from wine_analysis_hplc_uv.signal_processing.mindex_signal_processing import (
    SignalProcessor,
)

scipro = SignalProcessor()

nb_mtds = DTWNotebookMethods()

df = pd.read_parquet(definitions.XPRO_YPRO_DOWNSAMPLED_PARQ_PATH)
df.head()

To develop an optimal alignment, we will focus on the alignment of 2021 John Duval Shiraz with 2021 Torbreck Struie. 


In [None]:
# join x, y and aligned x on index

x = df.loc[:, ["176"]].pipe(
    lambda df: df.set_axis(
        axis=1,
        labels=pd.MultiIndex.from_arrays(
            [["176"], ["query"], ["NA"], ["abs"]],
            names=["sample", "status", "window_size", "unit"],
        ),
    )
)

y = df.loc[:, ["177"]].pipe(
    lambda df: df.set_axis(
        axis=1,
        labels=pd.MultiIndex.from_arrays(
            [["177"], ["ref"], ["NA"], ["abs"]],
            names=["sample", "status", "window_size", "unit"],
        ),
    )
)

In [None]:
# align 176 on 177 without any constraints

align_x = nb_mtds.dtw_align_series(x, y)

align_x.plot()

As we can see from the alignment plane plot and series overlays, a significant section of the query series has been compressed and then interpolated as a flat line, losing a number of peaks in the process, see below:


In [None]:
plotdata, g = nb_mtds.query_ref_align_plot(x, y, align_x)

As we can see, numerous peaks are lost post-alignment, which is unacceptable.


## Windowing

When aligning a query and reference chromatogram of similar samples, we expect that peaks within $d_m$ of each other, where $d_m$ is the misalignment distance, are the same compound and will be aligned to the same retention time after warping. Conversely, we do not expect regions where the query has peaks but the reference does not to be altered. Unfortunately, the algorithm does not contain that information by default, and without a global constraint it will drastically alter the query along the entire series to minimize what it perceives as the distance between the two, including compressing peaks present in the query but not the reference:


As we can see, the query is modified by the algorithm to match the reference in both the x and y axes. This leads to an unexpectedly deformed query. What is needed is to restrict warping to localized regions in the form of a windowing operation with windows of specific geometry. One such window is the Sakoe-Chiba band [@sakoe_1978], which restricts how far apart two elements can be when matched: $$|\phi_x(k)-\phi_y(k)| \leq T_0$$ where $T_0$ is the maximum allowable absolute time deviation between two matched elements, specified by the user. As described by @giorgino_ComputingVisualizingDynamic_2009, this creates a boundary within the alignment plane within which the warping path can exist. For a window of size 10:


In [None]:
# get the windowed aligned x, assign a regular frequency time index, rename axes


sakoechiba_10_x_align = nb_mtds.dtw_align_series(
    x, y, dict(window_type="sakoechiba", window_size=10)
)
sakoechiba_10_x_align.plot()
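
For intuition, the constraint $|\phi_x(k)-\phi_y(k)| \leq T_0$ can be pictured as a boolean mask over the alignment plane. The sketch below is an assumption-laden illustration with hypothetical helper names, not how `dtwalign` builds its windows internally.

```python
import numpy as np


def sakoe_chiba_mask(n, m, window_size):
    """Boolean (n, m) matrix that is True where |i - j| <= window_size."""
    i = np.arange(n)[:, None]
    j = np.arange(m)[None, :]
    return np.abs(i - j) <= window_size


def path_in_band(path, window_size):
    """Check that every matched pair (i, j) of a warping path lies in the band."""
    return all(abs(i - j) <= window_size for i, j in path)


mask = sakoe_chiba_mask(6, 6, window_size=1)

# hypothetical paths: the diagonal stays in the band, the second drifts outside it
diagonal_path = [(k, k) for k in range(6)]
wide_path = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 2), (5, 3), (5, 4), (5, 5)]
```

Cells where the mask is `False` are excluded from the accumulated cost computation, so a path such as `wide_path`, which matches elements two steps apart, cannot be selected with a window of size 1.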

Looks promising, how does it compare to the query and the reference?


In [None]:
plotdata, __ = nb_mtds.query_ref_align_plot(x=x, y=y, x_align=sakoechiba_10_x_align)

In [None]:
(
    plotdata.loc[:, pd.IndexSlice[:, :, :, :, "query and aligned query"]]
    .melt(ignore_index=False, value_name="mAU")
    .pipe(lambda df: df.set_index(keys=df.index.total_seconds() / 60))
    .pipe(sns.lineplot, x="mins", y="mAU", hue="status")
    .set_title("query and aligned query")
)

Success! Selecting a Sakoe-Chiba window of size 10 dramatically alters the behavior of the warping, preserving peaks that were otherwise lost. It also appears that, in the default setting at least, baseline height differences affect the warp, so reducing the distance/cost before warping should reduce overall warping errors. Before we continue we should examine the effects of baseline subtraction.


## DTW with Subtracted Baselines


For the same x and y, assess the effect of DTW on the baseline-subtracted series.


In [None]:
data = pd.concat([x, y], axis=1)
data

In [None]:
data = data.rename({"abs": "value"}, axis=1).pipe(
    lambda df: df.set_axis(df.columns.set_names("signal", level="unit"), axis=1)
)
data

A `lam` of 1000 has been chosen, as it appears to fit the baseline to the base of each peak without fitting to the internal area of the peak.
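
To show what `lam` controls, here is a small dense-matrix sketch of the asymmetric least squares idea (a second-difference smoothness penalty scaled by `lam`, with points above the current baseline down-weighted by `p`). The notebook itself uses `pybaselines`, whose `asls` implementation differs in detail; `asls_baseline`, the value `p=0.01`, the iteration count and the synthetic signal are assumptions for illustration only.

```python
import numpy as np


def asls_baseline(y, lam=1000.0, p=0.01, n_iter=10):
    """Estimate a baseline by iteratively reweighted penalized least squares."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # second-order difference matrix, shape (n - 2, n)
    diff2 = np.diff(np.eye(n), n=2, axis=0)
    penalty = lam * diff2.T @ diff2
    w = np.ones(n)
    z = y.copy()
    for _ in range(n_iter):
        # solve (W + lam * D'D) z = W y for the current weights
        z = np.linalg.solve(np.diag(w) + penalty, w * y)
        # points above the baseline (peaks) get the small weight p
        w = np.where(y > z, p, 1.0 - p)
    return z


# synthetic signal: a sloping baseline plus one Gaussian peak of height 10
t = np.arange(200)
true_baseline = 1.0 + 0.01 * t
signal = true_baseline + 10.0 * np.exp(-0.5 * ((t - 100) / 3.0) ** 2)

bline = asls_baseline(signal, lam=1000.0)
bcorr = signal - bline
```

A larger `lam` stiffens the baseline so it cannot bend up into a peak, while a smaller `lam` lets it follow the signal more closely; the dense solve used here is only practical for short series.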


In [None]:
# assign the baseline

from pybaselines import Baseline


def assign_baseline_correction(df) -> pd.DataFrame:
    df = df.pipe(
        lambda df: df.melt(ignore_index=False, value_name="raw")
        .drop("signal", axis=1)
        .pipe(
            lambda df: df.groupby("sample", group_keys=False).apply(
                lambda grp: grp.assign(
                    bline=Baseline(grp.index.total_seconds()).asls(grp.raw, lam=1000)[0]
                )
            )
        )
        .pipe(
            lambda df: df.groupby("sample", group_keys=False).apply(
                lambda grp: grp.assign(bcorr=grp.raw - grp.bline)
            )
        )
        .pivot(
            columns=["sample", "status", "window_size"],
            values=["raw", "bline", "bcorr"],
        )
        .pipe(lambda df: df.set_axis(df.columns.set_names("signal", level=0), axis=1))
        .reorder_levels(order=["sample", "status", "window_size", "signal"], axis=1)
        .sort_index(axis=1)
    )
    (
        df.melt(ignore_index=False)
        .pipe(sns.FacetGrid, col="sample")
        .map_dataframe(sns.lineplot, x="mins", y="value", hue="signal")
        .add_legend()
    )
    return df


data = assign_baseline_correction(data)

Now let's observe how the warping behaves for the same parameters.


In [None]:
dtw_align_series = nb_mtds.dtw_align_series(
    x=data.loc[:, pd.IndexSlice["176", :, :, "bcorr"]],
    y=data.loc[:, pd.IndexSlice["177", :, :, "bcorr"]],
)
data = (
    pd.concat([data.loc[:, pd.IndexSlice[:, :, :, "bcorr"]], dtw_align_series], axis=1)
    .sort_index(axis=1)
    .droplevel(axis=1, level=3)
)

In [None]:
idx = pd.IndexSlice


def plot_bcorr_dtw(data):
    x = data.loc[:, idx[:, "query", :]]
    y = data.loc[:, idx[:, "ref", :]]
    x_align = data.loc[:, idx[:, "aligned", :]]

    sp1 = data.loc[:, pd.IndexSlice[:, ["query", "ref"]]].pipe(
        lambda df: df.set_axis(
            axis=1,
            labels=pd.MultiIndex.from_frame(
                df.columns.to_frame().assign(subplot="query and reference")
            ),
        )
    )

    # sp2 is aligned and ref

    sp2 = data.loc[:, pd.IndexSlice[:, ["aligned", "ref"]]].pipe(
        lambda df: df.set_axis(
            axis=1,
            labels=pd.MultiIndex.from_frame(
                df.columns.to_frame().assign(subplot="aligned query and reference")
            ),
        )
    )

    ## sp3 is query and aligned

    sp3 = data.loc[:, pd.IndexSlice[:, ["query", "aligned"]]].pipe(
        lambda df: df.set_axis(
            axis=1,
            labels=pd.MultiIndex.from_frame(
                df.columns.to_frame().assign(subplot="query and aligned query")
            ),
        )
    )

    # concatenate the three subplot dataframes, melt, renaming the value column to 'mAU', convert the date time index to a minutes float, create a column-wise facetgrid on 'subplot', map a lineplot to the facetgrids of 'mins', 'mAU', 'status' for hue, set the subplot titles to subplot value.

    plotdata = pd.concat([sp1, sp2, sp3], axis=1)

    g = (
        plotdata.melt(ignore_index=False, value_name="mAU")
        .pipe(lambda df: df.set_index(df.index.total_seconds() / 60))
        .pipe(
            lambda df: sns.FacetGrid(df, col="subplot")
            .map_dataframe(sns.lineplot, x="mins", y="mAU", hue="status")
            .set_titles(col_template="{col_name}")
            .add_legend()
        )
    )


plot_bcorr_dtw(data)

I would call that successful without any further modifications to the DTW algorithm. At this point we need a number of tools:

1. a method of evaluating alignment
2. a method of producing a matrix of subplots, with one row per sample in the set.
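
As a starting point for the first item, one candidate measure is the change in Pearson correlation with the reference before and after warping. This is only a sketch of one possible metric, not a validated choice; `correlation_gain` and the synthetic peaks are hypothetical.

```python
import numpy as np


def correlation_gain(ref, query, aligned):
    """Pearson correlation with the reference before and after alignment."""
    before = float(np.corrcoef(ref, query)[0, 1])
    after = float(np.corrcoef(ref, aligned)[0, 1])
    return before, after, after - before


def gaussian_peak(t, centre, width=3.0):
    """Synthetic chromatographic peak used only for this example."""
    return np.exp(-0.5 * ((t - centre) / width) ** 2)


t = np.arange(100)
ref_demo = gaussian_peak(t, 50)
query_demo = gaussian_peak(t, 55)       # same peak, shifted by 5 points
aligned_demo = gaussian_peak(t, 50)     # stand-in for a perfectly warped query

before, after, gain = correlation_gain(ref_demo, query_demo, aligned_demo)
```

A high correlation after warping does not by itself rule out mismatched peaks, so a measure like this would need to be paired with a check on how much the peak shapes or areas were changed by the warp.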


In [None]:
# baseline correct all samples

df = (
    df.melt(ignore_index=False, value_name="raw")
    .groupby("samplecode", group_keys=False)
    .apply(
        lambda grp: grp.assign(
            bline=Baseline(grp.index.total_seconds()).asls(grp.raw, lam=1000)[0]
        )
    )
    .assign(bcorr=lambda df: df.raw - df.bline)
    .pivot(columns=["samplecode", "wine"], values=["raw", "bline", "bcorr"])
    .pipe(lambda df: df.set_axis(df.columns.set_names(level=0, names="signal"), axis=1))
    .reorder_levels(axis=1, order=["samplecode", "wine", "signal"])
    .sort_index(axis=1)
)
df

In [None]:
# generate a 2x2 plot grid of each sample with an overlay of the raw, the baseline and
# the bcorr signals

(
    df.melt(ignore_index=False, value_name="mAU")
    #  .pipe(lambda df: df.set_index())
    .pipe(sns.FacetGrid, col="wine", col_wrap=2)
    .map_dataframe(sns.lineplot, hue="signal", x="mins", y="mAU")
    .set_titles(col_template="{col_name}")
    .add_legend()
)

The baseline-corrected signals appear acceptable to me. Does baseline correction change which sample is selected as the reference?


In [None]:
ref = scipro.most_correlated(df)
ref

In [None]:
# add '(ref)' suffix to the reference wine of the dataset


def label_reference(df, ref):
    """
    add suffix '(ref)' to wine string of reference sample
    """
    oname = df[ref].columns.get_level_values("wine")[0]
    new_name = oname + " (ref)"

    return df.rename({oname: new_name}, axis=1)


df = label_reference(df, ref)

It does: 176 is now the most correlated.


In [None]:
y = df.loc[:, idx[ref, :, "bcorr"]]

# add aligned series to original df through concatenation. Need to subset the original
# with the warping path, reindex to 2S timedelta range, rename signal level to 'aligned',
# then concatenate with original df

df = (
    df.loc[:, idx[:, :, "bcorr"]]
    .pipe(
        lambda df: df.groupby(["wine"], group_keys=False, axis=1).apply(
            lambda df: pd.concat(
                [
                    df,
                    df.iloc[dtw(x=df, y=y).get_warping_path(), :].pipe(
                        lambda df: df.set_index(
                            pd.timedelta_range(
                                start=df.index[0], end=df.index[-1], freq="2S"
                            )
                        )
                        .rename_axis("mins")
                        .rename({"bcorr": "aligned"}, axis=1)
                    ),
                ],
                axis=1,
            )
        )
    )
    .pipe(
        lambda df: df.reindex(
            axis=1,
            labels=pd.MultiIndex.from_frame(
                df.columns.to_frame()
                .assign(role="query")
                .assign(
                    role=lambda df: df.loc[:, "role"].where(
                        ~(df.samplecode.isin(ref)), "ref"
                    )
                )
            ),
        )
    )
    .reorder_levels(axis=1, order=["samplecode", "wine", "role", "signal"])
    .sort_index(axis=1)
)
df

In [None]:
# now we want to form the sets of three again.
"""
we are expecting to create a column 'subplot' with a pattern of 1,2,3 and for each
sample a combination of ('query','bcorr'), ('query','aligned'), and ('ref','bcorr').

specifically,

1: ('query','bcorr'), ('ref','bcorr')
2: ('query','aligned'), ('ref','bcorr')
3: ('query','bcorr'), ('query','aligned')

For each we're expecting 2 rows per sample, 8 rows for subplot 1, 8 rows for subplot 2, 8 rows for subplot 3.
We need combinations with repeats using the cartesian product (?)
"""


# get the reference

new_index_ = (
    df.columns.to_frame(index=False)
    .assign(
        state=lambda df: df.signal.where(~(df.signal == "aligned"), "aligned")
        .where(~(df.signal == "bcorr"), "query")
        .where(~(df.role == "ref"), "ref")
    )
    .drop(["role", "signal"], axis=1)
)

In [None]:
new_index = pd.MultiIndex.from_frame(new_index_)
df = df.set_axis(new_index, axis=1)

In [None]:
column_index_df = (
    df.columns.to_frame(index=False)
    .set_index(["samplecode", "wine", "state"])
    .loc[lambda df: ~df.index.duplicated(keep="first"), :]
)
display(column_index_df)

In [None]:
class RelPlotDFBuilder:
    def __init__(self, df):
        """
        Handle the assignment and duplication process and logic to form a relplot df index
        in the following fashion:

        | sample  | row  | col  | role  |
        |---------|------|------|-------|
        | sample1 | row1 | col1 | query |
        | sample1 | row1 | col1 | ref   |
        | sample1 | row1 | col2 | align |
        | sample1 | row1 | col2 | ref   |
        | sample1 | row1 | col3 | query |
        | sample1 | row1 | col3 | align |

        do for each row then concat horizontally.

        Expects a tidy df of column index with levels (for example):

        |    | samplecode      | wine                                  | state   |
        |---:|:----------------|:--------------------------------------|:--------|
        |  0 | 154             | 2020 leeuwin estate shiraz art series | aligned |
        |  1 | 154             | 2020 leeuwin estate shiraz art series | query   |
        |  2 | 176             | 2021 john duval wines shiraz concilio | ref     |
        |  3 | 176             | 2021 john duval wines shiraz concilio | ref     |
        |  4 | 177             | 2021 torbreck shiraz the struie 1     | aligned |
        |  5 | 177             | 2021 torbreck shiraz the struie 1     | query   |
        |  6 | torbreck-struie | 2021 torbreck shiraz the struie 2     | aligned |
        |  7 | torbreck-struie | 2021 torbreck shiraz the struie 2     | query   |

        It is generalized enough to handle any names for the levels, and any values
        (as long as the pattern is consistant), however the 'samplecode' and 'state'
        levels must be in the same order, seperated by 1 level.
        """

        self.df = df

        # create a df out of the df multiindex, drop any duplicates.
        self.column_index_df = (
            df.columns.to_frame(index=False)
            .set_index(["samplecode", "wine", "state"])
            .loc[lambda df: ~df.index.duplicated(keep="first"), :]
        )

        # get the samplecodes as an iterable
        samples = self.column_index_df.index.get_level_values("samplecode").unique()

        # for each sample form a df of 'query' series and 'ref' series, then concat
        # them together
        col1 = pd.concat(
            [self.build_col(sample, "176", "query", "ref", 1) for sample in samples]
        ).assign(row=lambda df: df.groupby("wine").ngroup() + 1)

        # for each sample form a df of 'aligned', 'ref', then concat together
        col2 = pd.concat(
            [self.build_col(sample, "176", "aligned", "ref", 2) for sample in samples]
        ).assign(row=lambda df: df.groupby("wine").ngroup() + 1)

        # for each sample form a df of 'query', 'aligned' for the same sample, then concat
        col3 = pd.concat(
            [
                self.build_col3(sample, "176", "query", "aligned", 3)
                for sample in samples
            ]
        ).assign(row=lambda df: df.groupby("wine").ngroup() + 1)

        # combine all the col dfs
        self.index_df = pd.concat([col1, col2, col3])

        self.test_index_df()

        self.join_df = self.join_df_index_df()

        self.test_join_df()

    def build_col(self, samplecode_1, samplecode_2, state_val_1, state_val_2, colnum):
        """
        combine query and reference for overlaying in col1
        samplecode_1 is the base sample, samplecode_2 is the overlay, or comparison.
        state_val_1 and state_val_2 correspond to the respective samplecode.

        Used for column 1 and column 2.
        """

        if samplecode_1 == samplecode_2:
            sample = self.column_index_df.loc[idx[samplecode_2, :, state_val_2], :]

        else:
            # get sample row
            sample = self.column_index_df.loc[idx[samplecode_1, :, state_val_1], :]

        # get the reference row, reindex it so its 'wine' (row) is s1
        if samplecode_1 == samplecode_2:
            ref = self.column_index_df.loc[idx[samplecode_2, :, state_val_2], :]

        else:
            wine = sample.index.get_level_values("wine")
            ref = self.column_index_df.loc[idx[samplecode_2, :, state_val_2], :].pipe(
                lambda df: df.set_axis(
                    df.index.remove_unused_levels().set_levels(
                        level=["wine"], levels=[wine]
                    )
                )
            )

        # assign row and column identifier to reference and sample rows
        col1 = pd.concat([sample, ref]).assign(col=colnum)

        return col1

    def build_col3(self, samplecode, ref_samplecode, state_val_1, state_val_2, colnum):
        """
        combined query and aligned. Refer to build_col for parameter descriptions
        Cant use 'build_col' because the reference sample doesnt have the same state
        values as the other samples.

        Maybe we can modify how the reference sample is handled. Maybe seperate prior
        to initializing the concatenations in __init__
        """

        if samplecode == ref_samplecode:
            col3 = self.column_index_df.loc[[samplecode]].assign(col=3)

        else:
            col3 = self.column_index_df.loc[
                idx[samplecode, :, [state_val_1, state_val_2]], :
            ].assign(col=colnum)

        return col3

    def test_index_df(self):
        """
        test whether the output ready index_df matches the expected content and structure
        """
        # the expected output of RelPlotDFBuilder.index_df. orient = 'tight' retains multiindex

        left = pd.DataFrame.from_dict(
            {
                "index": [
                    ("154", "2020 leeuwin estate shiraz art series", "query"),
                    ("176", "2020 leeuwin estate shiraz art series", "ref"),
                    ("176", "2021 john duval wines shiraz concilio (ref)", "ref"),
                    ("176", "2021 john duval wines shiraz concilio (ref)", "ref"),
                    ("177", "2021 torbreck shiraz the struie 1", "query"),
                    ("176", "2021 torbreck shiraz the struie 1", "ref"),
                    ("torbreck-struie", "2021 torbreck shiraz the struie 2", "query"),
                    ("176", "2021 torbreck shiraz the struie 2", "ref"),
                    ("154", "2020 leeuwin estate shiraz art series", "aligned"),
                    ("176", "2020 leeuwin estate shiraz art series", "ref"),
                    ("176", "2021 john duval wines shiraz concilio (ref)", "ref"),
                    ("176", "2021 john duval wines shiraz concilio (ref)", "ref"),
                    ("177", "2021 torbreck shiraz the struie 1", "aligned"),
                    ("176", "2021 torbreck shiraz the struie 1", "ref"),
                    ("torbreck-struie", "2021 torbreck shiraz the struie 2", "aligned"),
                    ("176", "2021 torbreck shiraz the struie 2", "ref"),
                    ("154", "2020 leeuwin estate shiraz art series", "query"),
                    ("154", "2020 leeuwin estate shiraz art series", "aligned"),
                    ("176", "2021 john duval wines shiraz concilio (ref)", "ref"),
                    ("177", "2021 torbreck shiraz the struie 1", "query"),
                    ("177", "2021 torbreck shiraz the struie 1", "aligned"),
                    ("torbreck-struie", "2021 torbreck shiraz the struie 2", "query"),
                    ("torbreck-struie", "2021 torbreck shiraz the struie 2", "aligned"),
                ],
                "columns": ["col", "row"],
                "data": [
                    [1, 1],
                    [1, 1],
                    [1, 2],
                    [1, 2],
                    [1, 3],
                    [1, 3],
                    [1, 4],
                    [1, 4],
                    [2, 1],
                    [2, 1],
                    [2, 2],
                    [2, 2],
                    [2, 3],
                    [2, 3],
                    [2, 4],
                    [2, 4],
                    [3, 1],
                    [3, 1],
                    [3, 2],
                    [3, 3],
                    [3, 3],
                    [3, 4],
                    [3, 4],
                ],
                "index_names": ["samplecode", "wine", "state"],
                "column_names": [None],
            },
            orient="tight",
        )

        pd.testing.assert_frame_equal(left=left, right=self.index_df)

    def join_df_index_df(self):
        """
        massage df and index_df to left join onto index_df
        """

        pdf = (
            self.df.melt(ignore_index=False, value_name="mAU")
            .reset_index()
            .set_index(["samplecode", "state"])
            .drop("wine", axis=1)
        )

        pindex_df = self.index_df.reset_index().set_index(["samplecode", "state"])
        join = pindex_df.join(pdf, how="left").dropna()

        return join

    def test_join_df(self) -> None:
        """
        Take the source df and join df and check whether the join is as expected. Specifically
        make sure that the relationship between the labels and series values has been maintained.

        Calculates the mean value of each series in both the base df and join_df then join
        the two on the mean column. Then checks for any mismatch by checking for NAs in
        result.
        """

        # calculate base df mean rounded to 10 digits and modify index to the 'mean' column
        df_means = (
            self.df.mean().to_frame("mean").round(10).reset_index().set_index("mean")
        )

        # calculate join df mean rounded to 10 digits and modify index to the 'mean' column
        post_join_means = (
            self.join_df.groupby(["samplecode", "wine", "state", "row", "col"])["mAU"]
            .apply(lambda df: df.mean().round(10))
            .to_frame(name="mean")
            .reorder_levels(["row", "col", "samplecode", "wine", "state"])
            .reset_index()
            .set_index("mean")
        )

        # join base df and join df on 'mean'
        mean_join = (
            post_join_means.join(
                df_means.loc[lambda df: ~(df.index.duplicated(keep="first"))],
                how="left",
                rsuffix="right",
                validate="many_to_one",
            )
            .reset_index()
            .set_index(["row", "col", "samplecode", "wine", "state"])
            .sort_index()
        )

        # test whether any NA in df, indicating a failed join. If assertion fails, outputs
        # rows with NA - use to identify mismatching join keys.
        assert not mean_join.isna().any().any(), (
            "NAs in join, failed. Found in the"
            f" following\n{mean_join[mean_join.isna().any(axis=1)]}"
        )


relplotdata = RelPlotDFBuilder(df).join_df

In [None]:
relplotdata = (
    relplotdata.assign(mins=lambda df: df.mins.dt.total_seconds() / 60)
    .assign(
        col_label=lambda df: df.col.replace(
            {
                1: "query and ref",
                2: "aligned and ref",
                3: "query and aligned",
            }
        )
    )
    .assign(
        col_label=lambda df: pd.Categorical(
            values=df.col_label,
            categories=[
                "query and ref",
                "aligned and ref",
                "query and aligned",
            ],
            ordered=True,
        )
    )
    .reset_index()
    .set_index(["row", "col", "col_label", "samplecode", "state", "wine"])
)

In [None]:
# now display it..

relplot = sns.relplot(
    relplotdata,
    col="col_label",
    row="wine",
    x="mins",
    y="mAU",
    hue="state",
    kind="line",
    legend="full",
    facet_kws=dict(margin_titles=True, subplot_kws=dict(alpha=0.95)),
    errorbar=None,
    palette=sns.color_palette(n_colors=3, palette="colorblind"),
).set_titles(
    col_template="{col_name}",
    row_template="{row_name}",
)
relplot.fig.suptitle(
    "Comparison of Query and Reference Aligned Before and After DTW", fontsize=16
)
relplot.fig.subplots_adjust(top=0.93)

As we can see, the alignment is nominally successful for the samples within the set, remarkably even for sample 177, which was the most divergent.


## Conclusion

We have visually shown the effectiveness of DTW with a Sakoe-Chiba global constraint and prior baseline correction with AsLS for aligning CUPRAC-detected Shiraz wine chromatograms. Further investigation will require that I identify appropriate alignment and distance metrics rather than relying on a visual check. Specifically, we've shown that peaks can be matched; however, it is not evident whether a mismatch is occurring. Suspiciously, no peak remains unmatched in any sample.
