# Creating a Downsampled Test Dataset

This notebook creates a parquet file of downsampled, CUPRAC-detected shiraz signals at 450 nm. The data read from the database will be corrected for time offset and precision, then placed on a universal time index. Duplicate labels in the 'wine' column level will be made unique by appending an ascending integer.


In [None]:
%load_ext autoreload
%autoreload 2

from wine_analysis_hplc_uv import definitions
from wine_analysis_hplc_uv.signal_processing.mindex_signal_processing import SignalProcessor
import pandas as pd
import numpy as np
scipro = SignalProcessor()
df = pd.read_parquet(definitions.RAW_PARQ_PATH)
df

## Downsampling

Apply the methods developed in [time_axis_characterisation_and_normalization](notebooks/time_axis_characterisation_and_normalization.ipynb) and [downsampling_signals](notebooks/downsampling_signals.ipynb) to move the dataset to a global time index, then downsample to 2 seconds per observation.

In [None]:
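df = (
    df.pipe(scipro.standardize_time)  # move every signal onto the global time index
    .resample("2s")  # lowercase "s": the uppercase "S" alias is deprecated since pandas 2.2
    .interpolate()  # fill any bins that have no observation
)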
df
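
To make the downsampling step concrete, here is a minimal sketch on a toy signal. It assumes the standardized frame is indexed by a `TimedeltaIndex` named `mins`; the 0.4 s sampling interval and the `absorbance` column are placeholders rather than values from the real dataset, and `scipro.standardize_time` is not reproduced.

```python
import numpy as np
import pandas as pd

# Toy signal: one observation every 400 ms from 0 s to 10 s (assumed spacing).
mins = pd.to_timedelta(np.arange(0, 10_001, 400), unit="ms")
mins.name = "mins"
signal = pd.DataFrame({"absorbance": np.sin(np.arange(len(mins)) / 4)}, index=mins)

# Move to a regular 2 s grid, then interpolate any grid points left empty.
downsampled = signal.resample("2s").interpolate()

print(len(signal), "->", len(downsampled), "observations")
```

In this toy case every 2 s grid point coincides with an original observation, so the result is a plain subset of the input and `interpolate` has nothing to fill.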

## Subsetting

As discussed in [downsampling_signals](notebooks/downsampling_signals.ipynb), the area of interest is 0 - 20 mins. I will first reduce this test dataset to that range:

In [None]:
df = df.loc[: pd.to_timedelta(20, unit="min")]
df
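
One detail of the slice above is that label-based `.loc` slicing includes the endpoint, so an observation at exactly 20 min is kept. A small sketch on a toy frame (the 5 min spacing and the `absorbance` column are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Toy time index from 0 to 30 min in 5 min steps.
index = pd.timedelta_range(start="0min", end="30min", freq="5min", name="mins")
signal = pd.DataFrame({"absorbance": np.arange(len(index), dtype=float)}, index=index)

cutoff = pd.to_timedelta(20, unit="min")
subset = signal.loc[:cutoff]  # label-based slice: both ends are inclusive

print(subset.index.max())
```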

## Making 'wine' Unique

While not strictly necessary, it is useful for visualisation to have one column that is both the primary key and the label. Currently there are two identically labelled 'torbreck the struie' wines, which means I either use the samplecode column, which is not relatable, or make the wine column unique. I have gone with the latter.

In [None]:
df = scipro.unique_wine_label(df)
df
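
The internals of `scipro.unique_wine_label` are not shown here. As a rough illustration of the idea only, the sketch below appends an ascending integer to repeated labels in a plain Series. The helper `make_labels_unique`, the numbering from 1, and the second wine name are all hypothetical, and the real method operates on a column level rather than a Series.

```python
import pandas as pd


def make_labels_unique(labels: pd.Series) -> pd.Series:
    """Append an ascending integer to every label that occurs more than once."""
    counts = labels.map(labels.value_counts())     # occurrences of each label
    order = labels.groupby(labels).cumcount() + 1  # 1, 2, ... within each label
    suffixed = labels + " " + order.astype(str)
    # Keep labels that are already unique; use the suffixed form for the rest.
    return labels.where(counts == 1, suffixed)


# "placeholder shiraz" is an invented label for illustration.
wines = pd.Series(["torbreck the struie", "placeholder shiraz", "torbreck the struie"])
unique_wines = make_labels_unique(wines)
print(unique_wines.tolist())
```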

## Outcome

Below is a plot of the processed and downsampled dataset:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(1)

(
    df
    # .reset_index()
    .melt(ignore_index=False)
    .pipe(lambda df: df if display(df) else df) # display df
    .pipe(lambda df: sns.lineplot(df, ax=ax,x='mins',y='value',hue='wine'))
);
plt.tight_layout()
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.) 
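
The plotting cell relies on `melt(ignore_index=False)` to turn the wide frame into long form while keeping the time index. The sketch below shows the reshaping on a toy frame; the two-level column layout (`samplecode`, `wine`) and all values are assumptions for illustration.

```python
import pandas as pd

# Toy wide frame: one column per sample, labelled by (samplecode, wine).
columns = pd.MultiIndex.from_tuples(
    [("s01", "wine a"), ("s02", "wine b")], names=["samplecode", "wine"]
)
index = pd.timedelta_range(start="0s", periods=3, freq="2s", name="mins")
wide = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], index=index, columns=columns
)

# Each column level becomes a regular column and the values go into "value";
# ignore_index=False keeps "mins" as the index instead of resetting it.
tidy = wide.melt(ignore_index=False)
print(tidy)
```

The long frame has one row per (time, sample) pair, which is the shape `sns.lineplot` expects when `hue` is used to separate the wines.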

## Write to File

A comparison of read times for parquet and CSV found that parquet reads were twice as fast, so the dataset is written to parquet.

In [None]:
# from wine_analysis_hplc_uv import definitions
# df.to_parquet(
#     definitions.XPRO_DOWNSAMPLED_PARQ_PATH
# )

In [None]:
# a = pd.read_parquet(
#     definitions.XPRO_DOWNSAMPLED_PARQ_PATH
# )
# a