This notebook documents my test of computing a correlation matrix aggregation in DuckDB.

In [None]:
%load_ext autoreload
%autoreload 2
from wine_analysis_hplc_uv import definitions
import seaborn as sns
import seaborn.objects as so
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
data = pd.read_parquet(definitions.RAW3DDSET)
data.head()

In [None]:
# clean frame
data = (
    data
    # remove superfluous column
    .drop("detection", axis=1)
    # set index as label columns
    .set_index(["id", "samplecode", "color", "varietal", "wine"])
    # shift each run's mins so it starts at 0 and all runs share the same time axis
    .assign(mins=lambda df: df.groupby(["id"])["mins"].transform(lambda x: x - x.min()))
    # sort frame by id then mins
    .sort_values(["id", "mins"])
)
data
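
A minimal sketch of the per-run time offset correction used above, on a toy frame. The ids and time values here are made up for illustration and are not from the dataset. Subtracting each group's minimum makes every run start at zero.

```python
import pandas as pd

# Toy frame: two hypothetical runs whose time axes start at different offsets.
toy = pd.DataFrame(
    {
        "id": ["a", "a", "a", "b", "b", "b"],
        "mins": [0.5, 1.5, 2.5, 2.0, 3.0, 4.0],
    }
)

# Subtract each run's earliest time point so that every run starts at 0.
toy = toy.assign(
    mins=lambda df: df.groupby("id")["mins"].transform(lambda x: x - x.min())
)
print(toy)
```

After the transform both runs span the same 0 to 2 minute range, which is what lets the chromatograms be overlaid on one axis.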

In [None]:
# Plot the 256 nm signal as an overlay: select the "256" column, then build the seaborn objects plot

p = (
    data.set_index(["mins"], append=True)
    .loc[:, ["256"]]
    .reset_index()
    .pipe(
        lambda df: so.Plot(df, x="mins", y="256").add(
            so.Line(), color="samplecode", legend=False
        )
    )
)
p

The dataset looks good. I am concerned that the imbalance between the maxima region and the rest of the dataset will create problems, but that is for another time. The next step is to investigate whether we can drop some wavelengths. Let's plot the surface:

In [None]:
# select sample 85, melt on wavelength

mdata = (
    data.reset_index()
    .loc[lambda df: (df.samplecode == "85") & (df.mins < 27)]
    .melt(
        id_vars=["id", "samplecode", "color", "varietal", "wine", "mins"],
        value_name="mau",
        var_name="wavelength",
    )
)
mdata
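
As a rough illustration of the wide-to-long reshape done by `melt` above, here is a sketch on a toy frame. The column names "254" and "256" and all values are assumptions chosen for the example.

```python
import pandas as pd

# Toy wide frame: one row per time point, one column per (hypothetical) wavelength.
wide = pd.DataFrame(
    {
        "mins": [0.0, 0.1],
        "254": [1.0, 2.0],
        "256": [3.0, 4.0],
    }
)

# Melt to long form: one row per (time point, wavelength) pair.
long = wide.melt(id_vars=["mins"], var_name="wavelength", value_name="mau")
print(long)
```

The long form has one `wavelength` label column and one `mau` value column, which is the shape that `so.Plot(..., color="wavelength")` expects.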

In [None]:
# overlay plot of wavelengths for sample 85 before 27 mins

data_27mins = mdata
data_27mins.pipe(
    lambda df: so.Plot(df, x="mins", y="mau", color="wavelength").add(
        so.Line(), legend=False
    )
)

In [None]:
# what about baseline subtraction..
from wine_analysis_hplc_uv.notebooks import eda_by_category_methods

dtwprocess = eda_by_category_methods.DTWProcessing()

bcdata = dtwprocess.blinecorr(data_27mins, "wavelength", "mau", bcorr_label="bcorr")

bcdata.head()

In [None]:
# plot baseline subtracted sample 85 wavelength overlay

bcdata.pipe(
    lambda df: so.Plot(df, x="mins", y="bcorr", color="wavelength").add(
        so.Line(), legend=False
    )
)

In [None]:
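# zero out negative baseline-corrected values, then add a log10 column
# note: Series.where keeps values where the condition is True, so the condition must be x > 0
bcdata = (
    bcdata
    .assign(bcorr=lambda df: df.bcorr.where(df.bcorr > 0, 0))
    .assign(
        # take log10 of strictly positive values only; zeros become NaN here
        logbcorr=lambda df: np.log10(df.bcorr.where(df.bcorr > 0)),
        wavelength=lambda df: df.wavelength.astype(int),
    )
    # fill the NaNs left by the masked zeros
    .assign(logbcorr=lambda df: df.logbcorr.fillna(0))
)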
bcdata
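
A minimal sketch of how `Series.where` behaves, on made-up values, since the direction of the condition decides which values survive. It also shows one way to take a log10 without producing `-inf` for zeros. This is an illustration under those assumptions, not necessarily how the real data should be treated.

```python
import numpy as np
import pandas as pd

s = pd.Series([-2.0, 0.0, 10.0, 1000.0])

# Series.where keeps values where the condition is True and replaces the rest.
kept_negative = s.where(s < 0, 0)  # only the negative values survive
clipped = s.where(s > 0, 0)        # negative values are replaced by 0

# Mask non-positive values to NaN before log10, then fill the NaNs with 0.
logged = np.log10(clipped.where(clipped > 0)).fillna(0)

print(pd.DataFrame({"raw": s, "clipped": clipped, "log10": logged}))
```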

In [None]:
1e5

In [None]:
%matplotlib inline
from matplotlib.colors import LogNorm

fig, ax = plt.subplots()
# filled contour of the surface: wavelength on x, time on y, log-scaled colour
c = ax.tricontourf(
    bcdata.wavelength, bcdata.mins, bcdata.bcorr, norm=LogNorm(vmax=1e5, clip=True)
)

# artists, labels= c.legend_elements()
# plt.legend(artists, labels,bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
cb = fig.colorbar(c)
# relabel the colorbar ticks with two significant figures
cb.set_ticks(cb.get_ticks(), labels=["%.2g" % t for t in cb.get_ticks()])
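
To illustrate what a `LogNorm` does to the colour scale, here is a small sketch with made-up values. The `vmin=1` is an assumption chosen for the example; the cell above leaves `vmin` to be inferred from the data.

```python
import numpy as np
from matplotlib.colors import LogNorm

# Hypothetical limits: five decades between vmin and vmax.
norm = LogNorm(vmin=1, vmax=1e5, clip=True)

# Made-up intensities, including one above vmax.
values = np.array([1.0, 100.0, 1e5, 1e7])

# Each value maps to its fractional position between vmin and vmax in log space;
# with clip=True, values outside the limits are clipped to the limits first.
scaled = np.asarray(norm(values))
print(scaled)
```

Equal steps in the scaled output correspond to equal ratios in the input, which keeps the low-intensity regions visible next to the much larger maxima.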


It's not clear whether I should drop any wavelengths at this point. Let's continue to MCR-ALS and see whether we need to drop any later on.