## Introduction

This notebook demonstrates a pipeline that produces a processed red wine CUPRAC detection dataset at 450 nm. The output is a parquet file ready to feed to sklearn.

In [None]:
# TODO: get a dataset of all CUPRAC reds and run them through the pipeline (will need to adapt the downsampling into a class-like pipeline)

%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
from wine_analysis_hplc_uv import definitions
import seaborn as sns
import matplotlib.pyplot as plt
from wine_analysis_hplc_uv.notebooks.dtw_methods import DTWNotebookMethods, AlignPipe, RelPlotDFBuilder
from wine_analysis_hplc_uv.old_signal_processing.mindex_signal_processing import (
    SignalProcessor,
)

scipro = SignalProcessor()

nb_mtds = DTWNotebookMethods()
idx = pd.IndexSlice
import duckdb as db
from wine_analysis_hplc_uv.etl.build_library.db_methods import get_data, pivot_wine_data

con = db.connect(definitions.DB_PATH)

# request the red wine CUPRAC data detected at 450 nm, then pivot it
get_data.get_wine_data(con=con, detection=('cuprac',), color=('red',), wavelength=(450,))

rw_data = pivot_wine_data.pivot_wine_data(con)

rw_data

Within the dataset there are:

In [None]:
# get number of unique samples from unique values of samplecode

len(rw_data.columns.get_level_values('samplecode').unique())

unique samples

In [None]:
# select 'mins' and 'value' for each sample and standardize the time axis across the dataset

d = rw_data.loc[:,idx[:,:,['mins','value']]].pipe(scipro.standardize_time)
d.head()

In [None]:
# A number of samples will need to be eliminated. Melt to long format, convert the
# time axis to float minutes, and produce an overlay plot to observe the bulk data shape.

rw_data = (
    d
    .melt(ignore_index=False)
    .reset_index()
    .assign(mins=lambda df: df.mins.dt.total_seconds() / 60)
)
rw_data.pipe(sns.lineplot, x='mins',y='value',hue='samplecode')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
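
The reshaping step above can be sketched on a small synthetic frame. This is only an illustration: the frame layout (a timedelta index named `mins` and the column levels `samplecode`, `wine`, `vars`) is assumed from how the data is used in this notebook, and the sample codes and wine names are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame: timedelta index named 'mins', one column per sample.
index = pd.to_timedelta([0, 30, 60, 90], unit="s").rename("mins")
columns = pd.MultiIndex.from_tuples(
    [("a1", "wine a", "value"), ("b2", "wine b", "value")],
    names=["samplecode", "wine", "vars"],
)
wide = pd.DataFrame(np.arange(8.0).reshape(4, 2), index=index, columns=columns)

# Melt to long format (one row per sample per time point), keeping the time index,
# then convert the timedelta column to float minutes for plotting.
long = (
    wide.melt(ignore_index=False)
    .reset_index()
    .assign(mins=lambda df: df["mins"].dt.total_seconds() / 60)
)

print(long.head())
```

Each column level becomes an ordinary column in the long frame, which is what allows `hue='samplecode'` in the overlay plot.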

Datasets that do not follow the general shape need to be removed. The general shape is defined as an average height greater than 25 between 5 and 10 mins.
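
As a rough sketch of that selection rule, the snippet below applies it to two synthetic signals. The signal shapes and the sample codes `good` and `flat` are hypothetical and chosen only so that one signal passes and the other fails; the window and the threshold of 25 are taken from the text.

```python
import numpy as np
import pandas as pd

mins = np.arange(0, 20, 0.5)

# Hypothetical signals: 'good' has a broad hump around 7.5 min, 'flat' stays near baseline.
long = pd.concat(
    [
        pd.DataFrame({"samplecode": "good", "mins": mins,
                      "value": 100 * np.exp(-((mins - 7.5) ** 2) / 8)}),
        pd.DataFrame({"samplecode": "flat", "mins": mins,
                      "value": np.full_like(mins, 2.0)}),
    ],
    ignore_index=True,
)

# Mean height of each sample within the 5-10 minute window.
window = long[long["mins"].between(5.0, 10.0)]
mean_height = window.groupby("samplecode")["value"].mean()

# Samples below the threshold are rejected, the rest are kept.
rejected = mean_height[mean_height < 25].index.tolist()
kept = long[~long["samplecode"].isin(rejected)]

print(rejected)
```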

In [None]:
"""
Display the samples whose average height is less than 25 between 5 and 10 minutes.
"""

(rw_data
 .set_index('mins')
 .groupby(['samplecode']).filter(lambda x: x.loc[5.0:10.0,'value'].mean()<25)
 .pipe(lambda df: df if display(df.pipe(sns.lineplot, x='mins',y='value',hue='samplecode')) is None else df)
 );
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

As in previous studies, the chromatograms displayed above will be removed from the sample set, as their profiles are deemed erroneous.

|    | samplecode   |                           wine                          |
|---:|:-------------|:--------------------------------------------------------|
|  0 | 164          | 2015 yangarra estate old vine grenache                  |
|  1 | 163          | 2015 yangarra estate shiraz mclaren vale                |
|  2 | 128          | 2019 mount pleasant wines mount henry shiraz pinot noir |
|  3 | 165          | 2020 izway shiraz bruce                                 |
|  4 | ca0301       | 2021 chris ringland shiraz                              |
|  5 | 161          | 2021 le juice fleurie fleurie gamay                     |
|  6 | ca0101       | 2021 yering station pinot noir                          |

As we can see, these are not usable signals.

In [None]:
"""Use the samplecodes of the samples displayed above to remove them from the dataset. String
literals are used to keep a record of the samples identified as faulty in case something
changes in the pipeline down the track.

Display another overlay of the dataset to verify that all questionable signals are removed.
"""

samplecodes = ['164', '163', '128', '165', 'ca0301', '161', 'ca0101']

rw_data = (rw_data
 .set_index('mins')
 .groupby(['samplecode']).filter(lambda x: not x.samplecode.isin(samplecodes).any())
 .pipe(lambda df: df if display(df.pipe(sns.lineplot, x='mins',y='value',hue='samplecode')) is None else df)
 )
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
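
The same removal can also be written as a plain boolean mask, which may be easier to read than a group-wise filter. This is a minimal sketch on made-up rows; `x1` is a hypothetical sample code that is not on the faulty list.

```python
import pandas as pd

# Hypothetical long-format data: two faulty samples and one good one.
long = pd.DataFrame(
    {
        "samplecode": ["164", "164", "ca0301", "x1", "x1"],
        "mins": [0.0, 0.5, 0.0, 0.0, 0.5],
        "value": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

faulty = ["164", "ca0301"]

# Keep only the rows whose samplecode is not on the faulty list.
cleaned = long[~long["samplecode"].isin(faulty)]

print(cleaned)
```

Here `~` negates an element-wise boolean Series, so every row is tested individually and no grouping is needed.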

In [None]:
# pivot back to wide format, convert the index back to timedelta, resample to 2 s intervals, run the DTW alignment pipe, slice to 0-20 mins, then plot

pro_data = (rw_data
 .pivot(columns=['samplecode','wine','vars'], values='value')
 .pipe(lambda df: df.set_axis(pd.TimedeltaIndex(df.index, unit='m')))
 .pipe(lambda df: df.resample("2S").interpolate())
 .pipe(lambda df : AlignPipe().align_pipe(df, display=False))
 .loc[pd.to_timedelta(0,unit='m'):pd.to_timedelta(20, unit='m')]
)

pro_data.plot(legend=False)
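
The resampling and slicing steps can be illustrated in isolation on a synthetic signal. This sketch leaves out the DTW alignment entirely, since `AlignPipe` is specific to this project; the 6 s sampling interval and the ramp-shaped signal are assumptions made only so the result is easy to check.

```python
import numpy as np
import pandas as pd

# Hypothetical signal: one reading every 6 s for 25 min, with the value equal to
# the elapsed time in minutes so the interpolated points are easy to verify.
seconds = np.arange(0, 25 * 60 + 1, 6)
signal = pd.DataFrame(
    {"value": seconds / 60},
    index=pd.to_timedelta(seconds, unit="s"),
)

# Resample onto a regular 2 s grid and fill the new rows by linear interpolation.
resampled = signal.resample(pd.Timedelta(seconds=2)).interpolate()

# Label-based slicing on a TimedeltaIndex includes both endpoints.
sliced = resampled.loc[pd.Timedelta(minutes=0):pd.Timedelta(minutes=20)]

print(sliced.head())
```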


In [None]:
# Currently the alignment pipe labels the unaligned signal 'query' and the aligned signal 'aligned'.
# The reference does not require alignment, so there is no need to distinguish between aligned and
# unaligned versions of it, but to retain symmetry two signal columns were kept, both labelled 'ref'.
# Unfortunately that causes an error when writing to parquet. The code below removes the duplicate
# 'ref' column.

pro_data = pro_data.loc[:,~pro_data.columns.duplicated()]
display(f"num duplicate columns: {pro_data.columns.duplicated(keep='first').sum()}")
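
The de-duplication can be sketched on a toy frame as follows. The column layout here is a simplified assumption with two levels and made-up sample codes `s0` and `s1`, not the exact structure that the alignment pipe returns.

```python
import pandas as pd

# Hypothetical frame in which the reference sample carries two identically labelled columns.
columns = pd.MultiIndex.from_tuples(
    [("s1", "query"), ("s1", "aligned"), ("s0", "ref"), ("s0", "ref")],
    names=["samplecode", "vars"],
)
df = pd.DataFrame([[1.0, 1.1, 2.0, 2.0], [3.0, 3.1, 4.0, 4.0]], columns=columns)

# duplicated() flags every repeat of a column label after its first occurrence.
n_before = int(df.columns.duplicated().sum())

# Keep only the first occurrence of each label.
deduped = df.loc[:, ~df.columns.duplicated()]
n_after = int(deduped.columns.duplicated().sum())

print(n_before, n_after)
```

Counting the duplicates before the removal as well as after gives a record of how many columns were dropped.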

In [None]:
# store the processed data in a parquet file
pro_data.to_parquet(definitions.RW_CUP_450_PROCESSED)

In [None]:
pd.read_parquet(definitions.RW_CUP_450_PROCESSED)

As demonstrated above, the dataset has been successfully written to and read back from a parquet file at the path stored in the constant `definitions.RW_CUP_450_PROCESSED`.