---
title: "Signal EDA by Wine Category"
format: html
---


Back: [Thesis](src/wine_analysis_hplc_uv/notebooks/thesis.qmd) Subheading Profile Descriptions by Category


This document describes the signal profiles of each wine category by detection method. The intent is to build an intuition for how models will behave when exposed to these categories, and to see how similar the categories are to one another. The hypothesis is that within each selected category there will be significant correlation between samples, but also sufficient variance to uniquely identify each sample. A further working hypothesis is that DTW with a Sakoe-Chiba band of window size 10 will enable alignment without extraneous mutation.
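To make the Sakoe-Chiba constraint concrete, here is a minimal NumPy sketch of band-limited DTW. It is an illustration under stated assumptions, not the implementation behind `SignalProcessor.dynamic_time_warping`: the function name `dtw_sakoe_chiba`, the absolute-difference local cost and the synthetic Gaussian signals are all hypothetical.

```python
import numpy as np


def dtw_sakoe_chiba(query, ref, window=10):
    """Return (total cost, warping path) for DTW restricted to |i - j| <= window."""
    n, m = len(query), len(ref)
    # the band must be at least as wide as the length difference to reach (n, m)
    window = max(window, abs(n - m))
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        # only cells inside the band are ever filled, the rest stay at infinity
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = abs(query[i - 1] - ref[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # backtrack from the last cell to recover the warping path
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: cost[s])
        path.append((i - 1, j - 1))
    return cost[n, m], path[::-1]


# synthetic example: the query is the reference peak shifted by 5 samples
x = np.arange(100)
ref = np.exp(-0.5 * ((x - 50) / 3.0) ** 2)
query = np.exp(-0.5 * ((x - 55) / 3.0) ** 2)

unaligned_cost = np.abs(query - ref).sum()
banded_cost, path = dtw_sakoe_chiba(query, ref, window=10)
no_warp_cost, _ = dtw_sakoe_chiba(query, ref, window=0)
```

With `window=0` the path is forced onto the diagonal, so the cost equals the plain point-by-point difference. Widening the band lets the path absorb the shift, but never by more than `window` samples, which is what limits the distortion of the warped signal.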

We will start with the varietal category.


To do this, we need to connect all the separate pipes that produced the dataset used to test DTW. As per @bos_2020 [1679], the preprocessing stages are:

1. denoise and smooth
2. baseline correction
3. retention time alignment
4. peak deconvolution and resolution enhancement
5. data compression

Our signals are sufficiently quiet and smooth to skip the first stage, we are not yet interested in stage 4, and stage 5 is achieved through resampling. So the pipeline we need is:

1. get data
2. resampling
3. baseline correction

All relevant methods are in `SignalProcessor` in `mindex_signal_processing`, but no full pipeline method has been established yet. Let's build one with a single sample as the test subject.
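As a rough picture of what such a pipeline does to one sample, the sketch below chains resampling and baseline subtraction with `pipe`. It does not use `SignalProcessor`: the function names, the 2 second grid, the 51 point window and the rolling-minimum baseline are all assumptions chosen for illustration, and the sample is synthetic.

```python
import numpy as np
import pandas as pd


def resample_signal(s: pd.Series, freq: str = "2s") -> pd.Series:
    """Put the signal on a regular time grid by averaging within each bin."""
    return s.resample(freq).mean().interpolate()


def subtract_baseline(s: pd.Series, window: int = 51) -> pd.Series:
    """Subtract a crude baseline: a rolling minimum smoothed by a rolling mean."""
    baseline = (
        s.rolling(window, center=True, min_periods=1)
        .min()
        .rolling(window, center=True, min_periods=1)
        .mean()
    )
    return s - baseline


def pipeline(s: pd.Series, freq: str = "2s", window: int = 51) -> pd.Series:
    """Resample, then baseline-correct, a single sample."""
    return s.pipe(resample_signal, freq=freq).pipe(subtract_baseline, window=window)


# synthetic single sample: linear drift plus one Gaussian peak at 10 minutes
secs = np.arange(3000) * 0.4
raw = pd.Series(
    5 + 0.01 * secs + 100 * np.exp(-0.5 * ((secs - 600) / 5) ** 2),
    index=pd.to_timedelta(secs, unit="s"),
    name="signal",
)

processed = pipeline(raw)
```

The rolling window has to be much wider than the widest peak, otherwise the baseline estimate climbs into the peak and removes real signal.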


In [None]:
# set up environment

%reload_ext autoreload
%autoreload 2

import pandas as pd
from wine_analysis_hplc_uv import definitions
import seaborn as sns
import seaborn.objects as so
from wine_analysis_hplc_uv.old_signal_processing.signal_processor import (
    SignalProcessor,
)

sns.set_theme(rc={"figure.dpi": 100})
import matplotlib.pyplot as plt
from wine_analysis_hplc_uv.notebooks import eda_by_category_methods

plotter = eda_by_category_methods.Plotting()

scipro = SignalProcessor()

data = pd.read_parquet(definitions.RAW_PARQ_PATH)
data.head()

In [None]:
data = data.pipe(scipro.propipe)
data.head()

In [None]:
data.plot()
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

In [None]:
a = data.pipe(scipro.most_correlated, "blinesub")

In [None]:
data = data.pipe(scipro.dynamic_time_warping, "blinesub")
data

In [None]:
relplot_df = (
    data.loc[:, pd.IndexSlice[:, :, :, ["blinesub", "aligned"]]]
    .melt(ignore_index=False)
    .reset_index()
    .assign(
        winelabel=lambda df: df.role + "_" + df.samplecode + "_" + df.wine,
        mins=lambda df: df.mins.dt.total_seconds() / 60,
    )
    .set_index("mins")
    .sort_index()
    .loc[0.0:20.0, :]
)

rp = sns.relplot(
    data=relplot_df,
    x="mins",
    y="value",
    hue="subsignal",
    col="winelabel",
    col_wrap=2,
    kind="line",
    errorbar=None,
    height=3,
    aspect=2,
)
fig, ax = plt.subplots(1)

hm = plotter.alignment_heatmap(data, ax=ax, signal_label="blinesub")

It's not actually clear whether modifying the window size has any effect on the alignment. We need a mathematical description of the alignment. Let's start with peak detection.


## Peak Detection


In [None]:
# import scipy.signal
from scipy import signal

In [None]:
# get the peak indices for each sample, join the resulting series with the data and add
# a sparse column containing values where peaks are detected by boolean masking
peaks = (
    data
    # go to long form df for groupby operations
    .melt(ignore_index=False)
    # groupby 'samplecode' and select 'value' series
    .groupby("samplecode")["value"]
    # find peaks based on the given parameters, the peak index array is the zeroth
    # element of the returned tuple
    .apply(
        lambda value: pd.Series(
            signal.find_peaks(
                value,
            )[0]
        )
    )
    # return to a frame
    .to_frame(name="idx")
    # move 'samplecode' from index to column
    .reset_index("samplecode")
    # add a column 'ispeak' that contains True values to be used to identify the peak
    # elements after the join
    .assign(ispeak=True)
    # set the index as 'samplecode', 'idx' to prepare for join with the df
    .set_index(["samplecode", "idx"])
)
display(peaks)

How do I index the original frame with the found peaks to get the peak values, and preferably assign a boolean peak column alongside the signal column?

One method could be to massage the df into the same shape as the peak df, i.e. long form, indexed by group. Once that's achieved, a `where` call should let me mark the values corresponding to each index and samplecode.


In [None]:
(
    data
    # convert the timedelta index to float for easier plotting
    .pipe(lambda df: df.set_axis(axis=0, labels=df.index.total_seconds() / 60))
    # slice to the aligned signal from 0 to 16 minutes
    .loc[0:16, pd.IndexSlice[:, :, :, "aligned"]]
    # go long form for groupby operations
    .melt(ignore_index=False)
    # add a incremental index to match the find_peaks results
    .assign(idx=lambda df: df.groupby("samplecode").cumcount())
    # set the index as samplecode and the newly formed idx
    .reset_index()
    .set_index(["samplecode", "idx"])
    # left join based on samplecode and idx
    .join(peaks, on=["samplecode", "idx"])
    # any rows who did not have a corresponding peak element are NaN, now filled with False
    .assign(ispeak=lambda df: df.ispeak.fillna(False))
    # add a 'peak' column that is equal to the peak value using 'ispeak' as a mask on 'value'
    .assign(peak=lambda df: df["value"].loc[df.ispeak])
    # go to default index
    .reset_index()
    # add an overlay plot of the signals and their peaks
    .pipe(
        lambda df: so.Plot(df, x="mins", color="samplecode")
        .layout()
        .add(so.Line(), y="value")
        .add(so.Dot(), y="peak")
    )
)

From @scipy_findpeaks_2023: "a peak or local maximum is defined as any sample whose two direct neighbours have a smaller amplitude." It notes that for noisy signals the peak locations can be off, because noise may shift the position of local maxima. In these cases it recommends smoothing the signal before searching for peaks, or using other peak finding methods such as `find_peaks_cwt`.

Considering the results above, it would be a good idea to experiment with smoothing.
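The definition quoted above is easy to check on a toy array. In this small sketch the values are made up: a single raised sample on the flank of the second peak is enough to register as an extra peak when `find_peaks` is called without constraints.

```python
import numpy as np
from scipy import signal

# two clean peaks, at positions 2 and 7
clean = np.array([0.0, 1.0, 3.0, 1.0, 0.0, 2.0, 3.0, 5.0, 2.0, 0.0])
clean_idx, _ = signal.find_peaks(clean)

# the same trace with a small ripple on the rising flank of the second peak
noisy = np.array([0.0, 1.0, 3.0, 1.0, 0.0, 2.2, 2.0, 5.0, 2.0, 0.0])
noisy_idx, _ = signal.find_peaks(noisy)

print(clean_idx)  # [2 7]
print(noisy_idx)  # [2 5 7]
```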

In [None]:
data

In [None]:
# the above prototyped code has been wrapped in a class and placed in 'eda_by_category_methods'

processor = eda_by_category_methods.Processing()


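def find_peaks_1(df, kwargs=None) -> pd.Series:
    # find_peaks returns (indices, properties); keep the positional peak indices
    a = signal.find_peaks(df.signal, **(kwargs or {}))[0]
    # look up the signal values at the peak positions, keeping the original index
    peaks = df.iloc[a].signal
    return peaks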


data = data.loc[lambda df: df.signal_label == "blinecorr"].assign(
    peaks=lambda df: df.groupby("samplecode", group_keys=False).apply(find_peaks_1)
)


data.pipe(processor.find_peaks)

Based on the documentation of [chromatographR](https://ethanbass.github.io/chromatographR/articles/chromatographR.html#pre-processing-data), we could use some smoothing, especially on 176, which is coincidentally our selected reference. But what is smoothing? Also, they use parametric time warping or variable penalty dynamic time warping for alignment. They then use 'complete-linkage hierarchical clustering' to link peaks across samples.

Unfortunately there does not seem to be a Python library that implements variable penalty dynamic time warping, so let's focus on smoothing for now.

As per previous studies, the first thing to try is a Savitzky-Golay filter, which is implemented by SciPy. @cuadros-rodríguez_2021 used a 5 point window and a second order polynomial. The SciPy implementation (`scipy.signal.savgol_filter`) is a 1D filter that requires the data, a window length and a polynomial order (`polyorder`), and accepts a number of optional parameters. It returns an array.
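A minimal sketch of the filter with the 5 point window and second order polynomial mentioned above. The trace is synthetic (one Gaussian peak plus seeded Gaussian noise), so the numbers only illustrate the call and are not results from the wine dataset.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)

# synthetic chromatogram: one Gaussian peak at 8 minutes plus white noise
x = np.linspace(0, 20, 2001)
clean = 50 * np.exp(-0.5 * ((x - 8) / 0.15) ** 2)
noisy = clean + rng.normal(scale=1.0, size=x.size)

# 5 point window, second order polynomial
smoothed = signal.savgol_filter(noisy, window_length=5, polyorder=2)

# residual noise before and after smoothing
print(np.std(noisy - clean), np.std(smoothed - clean))
```

The window must be odd-sized here and larger than `polyorder`. A wider window removes more noise but starts to flatten narrow peaks, which is the trade-off to watch with closely eluting compounds.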

In [None]:
sns.set(font_scale=1)

data = pd.read_parquet(definitions.RAW_PARQ_PATH)
f = processor.process(data)
display(f)

As we can see, smoothing has no effect. The smoothing necessary to remove those detected peaks would result in an unacceptable loss of signal information, so it is better to use constraints in the peak detection algorithm. Also, remaining in simple long form with no MultiIndex massively reduces reshaping overhead and makes UDFs much simpler to define.

Now let's add kwargs for the peak finder. Added.
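The sketch below shows what constraining the detector looks like with `scipy.signal.find_peaks`. The trace is synthetic and seeded. `height=6` matches the value passed to the peak finder in this notebook, while `prominence=5` is an assumption added for illustration.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)

# synthetic trace: two Gaussian peaks (at 5 and 12 minutes) plus white noise
x = np.linspace(0, 20, 2001)
trace = (
    40 * np.exp(-0.5 * ((x - 5) / 0.15) ** 2)
    + 15 * np.exp(-0.5 * ((x - 12) / 0.2) ** 2)
    + rng.normal(scale=0.3, size=x.size)
)

# without constraints every noise ripple is a local maximum
unconstrained, _ = signal.find_peaks(trace)

# height drops low-amplitude maxima, prominence drops ripples riding on a real peak
constrained, props = signal.find_peaks(trace, height=6, prominence=5)

print(len(unconstrained), len(constrained))
print(x[constrained])
```

Height alone is not enough on a noisy trace, because ripples near the apex of a tall peak also exceed the threshold. Prominence measures how far a maximum stands out from its surroundings, so it removes those without touching the signal itself.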

In [None]:
display(data)
pro_data = processor.process(data, find_peak_kwargs=dict(height=6))

pro_data

In [None]:
so.Plot(pro_data.loc[lambda df: df.mins < 21], x="mins", color="samplecode").add(
    so.Line(), y="signal"
).add(so.Dot(), y="peaks")

For testing purposes, we're only interested in the top 10 peaks.

In [None]:
# get top 10 peaks per sample
pro_data = pro_data.assign(
    select_peaks=lambda df: df.groupby("samplecode", group_keys=False)[
        "peaks"
    ].nlargest(10)
)

# plot the peaks on top of the curves
(
    pro_data.loc[lambda df: df.mins < 21]
    .pipe(
        lambda df: (
            df
            # display() renders the plot and returns None, so df passes through
            if display(
                so.Plot(df, x="mins", color="samplecode")
                .add(so.Line(), y="signal")
                .add(so.Dot(), y="select_peaks")
            )
            else df
        )
    )
    # peak table
    .pipe(
        lambda df: (
            df
            if display(
                df.loc[:, ["samplecode", "wine", "mins", "select_peaks"]]
                .dropna()
                .assign(n_peak=lambda df: df.groupby("samplecode").cumcount())
                .pivot(
                    columns=["samplecode", "wine"],
                    index=["n_peak"],
                    values=["mins", "select_peaks"],
                )
                .reorder_levels([1, 2, 0], axis=1)
                .sort_index(axis=1)
            )
            else df
        )
    )
)

Now align, get the top 10 peaks again, and compare them.

In [None]:
pro_data.head()

In [None]:
# apply dtw

# find reference
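def find_reference(df):
    # the reference is the sample with the highest mean correlation to all others
    mean_corr = df.corr().mean()
    return mean_corr.loc[lambda s: s == s.max()].index


# named differently from the function so the function is not shadowed
reference = find_reference(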
    pro_data.pivot_table(
        columns=["samplecode", "wine"], index=["mins"], values="signal"
    )
)

display(reference)

In [None]:
# align

ref_signal = pro_data.loc[
    pro_data.samplecode == reference.get_level_values("samplecode")[0]
].signal.reset_index(drop=True)

(
    pro_data.set_index(["mins"])
    .groupby("samplecode", group_keys=False)["signal"]
    .apply(scipro.align_query_to_ref, ref_signal)
)