---
title: "Signal EDA by Wine Category"
format: html
---

Back: [Thesis](src/wine_analysis_hplc_uv/notebooks/thesis.qmd) Subheading Profile Descriptions by Category


This document describes the signal profiles of each wine category by detection method. The intent is to build an intuition for how models will behave when exposed to these categories, and to see how similar the profiles are. The hypothesis is that within each selected category the signals will be significantly correlated, yet vary enough to uniquely identify each sample. A second working hypothesis is that DTW with a Sakoe-Chiba band of window size 10 will align the signals without distorting them unnecessarily.
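
To make the role of the band concrete, here is a minimal pure-Python sketch of DTW under a Sakoe-Chiba constraint. It is an illustration only, not the `dtwalign` implementation used in this project: the function name `dtw_sakoe_chiba`, the absolute-difference local cost, and the toy signals are all assumptions made for the example.

```python
import math


def dtw_sakoe_chiba(x, y, window):
    """Return the DTW distance between x and y under a Sakoe-Chiba band.

    Cell (i, j) of the cost matrix is only reachable when |i - j| <= window.
    """
    n, m = len(x), len(y)
    # The band must be at least as wide as the length difference,
    # otherwise the end point (n, m) is unreachable.
    window = max(window, abs(n - m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        lo = max(1, i - window)
        hi = min(m, i + window)
        for j in range(lo, hi + 1):
            d = abs(x[i - 1] - y[j - 1])  # local cost (assumed: absolute difference)
            cost[i][j] = d + min(
                cost[i - 1][j],      # step along x only
                cost[i][j - 1],      # step along y only
                cost[i - 1][j - 1],  # step along both
            )
    return cost[n][m]


# Two toy "chromatograms": the same peak, shifted by three samples.
reference = [0, 0, 0, 1, 4, 1, 0, 0, 0, 0, 0, 0]
query = [0, 0, 0, 0, 0, 0, 1, 4, 1, 0, 0, 0]

wide = dtw_sakoe_chiba(reference, query, window=10)
narrow = dtw_sakoe_chiba(reference, query, window=1)
```

In this toy case the peak is shifted by three samples, so a band of 10 leaves room to absorb the shift while a band of 1 does not, and `narrow` comes out larger than `wide`. The window size is therefore a trade-off: too narrow and real retention time shifts cannot be corrected, too wide and the warp is free to distort peak shapes.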

We will start with the varietal category.

To do this, we need to connect the separate processing steps that produced the dataset used to test DTW. As per @bos_2020 [1679], the stages are:

1. denoise and smooth
2. baseline correction
3. retention time alignment
4. peak deconvolution and resolution enhancement
5. data compression

Our signals are sufficiently quiet and smooth to skip stage 1, we are not yet interested in stage 4, and stage 5 is achieved through resampling. So, we need:

1. get data
2. resampling
3. baseline correction
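
The three steps above can be sketched end to end on a toy sample. This is only an illustration of the order of operations: `resample`, `linear_baseline` and `pipeline` are hypothetical stand-ins written for this example, not the `SignalProcessor` methods, and the straight-line baseline is a deliberately crude substitute for a real baseline estimator.

```python
from bisect import bisect_right


def resample(times, values, step):
    """Linearly interpolate (times, values) onto a uniform grid of spacing `step`."""
    n_points = int((times[-1] - times[0]) / step) + 1
    grid = [times[0] + k * step for k in range(n_points)]
    out = []
    for t in grid:
        # Index of the left neighbour of t, clamped so i + 1 stays in range.
        i = min(bisect_right(times, t) - 1, len(times) - 2)
        frac = (t - times[i]) / (times[i + 1] - times[i])
        out.append(values[i] + frac * (values[i + 1] - values[i]))
    return grid, out


def linear_baseline(values):
    """Stand-in baseline: a straight line joining the first and last points."""
    n = len(values)
    slope = (values[-1] - values[0]) / (n - 1)
    return [values[0] + slope * k for k in range(n)]


def pipeline(times, values, step):
    """get data -> resample -> baseline correction, returning every subsignal."""
    grid, signal = resample(times, values, step)
    bline = linear_baseline(signal)
    blinesub = [s - b for s, b in zip(signal, bline)]
    return grid, signal, bline, blinesub


# Hypothetical raw sample: irregular time axis, one peak on a drifting baseline.
raw_mins = [0.0, 0.9, 2.1, 3.0, 4.2, 5.0, 6.0]
raw_signal = [1.0, 1.2, 1.4, 5.6, 1.8, 2.0, 2.2]

mins, signal, bline, blinesub = pipeline(raw_mins, raw_signal, step=1.0)
```

Returning the resampled signal, the baseline and the baseline-subtracted signal together keeps each intermediate available for inspection, so any one of them can be plotted against the others when checking a sample.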

All relevant methods are in `mindex_signal_processing.SignalProcessor`, but no full pipeline method has been established yet. Let's build one with a single sample as the test subject.

In [None]:
# set up environment

%reload_ext autoreload
%autoreload 2

from dtwalign import dtw
import pandas as pd
from wine_analysis_hplc_uv import definitions
import seaborn as sns
from wine_analysis_hplc_uv.notebooks.dtw_methods import DTWNotebookMethods
from wine_analysis_hplc_uv.signal_processing.mindex_signal_processing import (
    SignalProcessor,
)

scipro = SignalProcessor()

df = pd.read_parquet(definitions.RAW_PARQ_PATH)
df.head()

In [None]:
df = scipro.standardize_time(df)
df

Now for the y-axis correction.

In [None]:
df = scipro.zero_y_axis(df)
df

In [None]:
df = scipro.baseline_correction(df)
df

I have named the raw signal 'signal' because it contains all of the information in the derived 'subsignals', such as the baseline and the true signal. The idea is that the raw signal is a superset of the other signal sets: the noise set, the true signal set, and so on.

In [None]:
# Melt the frame into long form: one 'value' column for the numerical values and
# one 'subsignal' column for the signal labels. Make 'subsignal' an ordered
# categorical column with order 'signal', 'bline', 'blinesub'.

df = (
    df
    .melt(ignore_index=False, var_name="subsignal")
    .assign(
        subsignal=lambda d: pd.Categorical(
            d["subsignal"], categories=["signal", "bline", "blinesub"], ordered=True
        )
    )
)
df

In [None]:
df = (
    df
    .reset_index()
    .pivot(columns=['samplecode','wine','subsignal'], index='mins', values='value')
    .sort_index(axis=1)
)
df
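
The melt and pivot steps above can be checked on a small frame. The sketch below assumes a layout with `samplecode`, `wine` and `mins` as index levels and one column per subsignal. That layout, the sample codes, the wine names and the values are all invented for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy frame: two samples, three time points each.
toy = pd.DataFrame(
    {
        "samplecode": ["s1"] * 3 + ["s2"] * 3,
        "wine": ["shiraz"] * 3 + ["pinot noir"] * 3,
        "mins": [0.0, 0.5, 1.0] * 2,
        "signal": [1.0, 5.0, 1.5, 2.0, 7.0, 2.5],
        "bline": [1.0, 1.2, 1.5, 2.0, 2.2, 2.5],
    }
).set_index(["samplecode", "wine", "mins"])
toy["blinesub"] = toy["signal"] - toy["bline"]

order = ["signal", "bline", "blinesub"]

# Long form: one row per (sample, time point, subsignal).
long = toy.melt(ignore_index=False, var_name="subsignal").assign(
    subsignal=lambda d: pd.Categorical(d["subsignal"], categories=order, ordered=True)
)

# Wide form: time on the rows, (samplecode, wine, subsignal) on the columns.
wide = (
    long.reset_index()
    .pivot(columns=["samplecode", "wine", "subsignal"], index="mins", values="value")
    .sort_index(axis=1)
)
```

Because `subsignal` is an ordered categorical, sorting the columns keeps each sample's subsignals in the order 'signal', 'bline', 'blinesub' rather than alphabetical order, which keeps the raw signal next to its derived components when the frame is inspected or plotted.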