# Characteristics of Chromatographic Dataset Time Series and Downsampling For Signal Alignment and Compression

This notebook covers my exploratory analysis of the characteristics of the time axis of the chromatographic dataset, and what decisions can be made on data treatment during preprocessing. The ultimate goal is to identify a suitable method of compression to speed up pipeline development iterations. 

The dataset under study is my wine data, specifically CUPRAC-detected Shiraz at 450 nm.

Decisions reached during this analysis will be recorded here and in the project [README](../README.md#eda-decisions). Methods developed during this analysis will be formalized [here](../src/wine_analysis_hplc_uv/signal_processing/mindex_signal_processing.py) in a class that will provide methods of preprocessing the time axis for use in pipelines.

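### `Timedelta`

2023-08-29 10:16:11 - move this somewhere else at some point.

A note on `Timedelta`.

A pandas `Timedelta` only exposes the component attributes `days`, `seconds`, `microseconds`, and `nanoseconds`; there is no `minutes` attribute. Note that `.seconds` is only the whole-seconds component (it excludes days and any sub-second part), so to express the time axis in minutes the total duration should be converted instead, e.g. `.total_seconds() / 60` on a `Timedelta`, or `.dt.total_seconds() / 60` on a timedelta column such as `mins`.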
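A minimal sketch of the difference between the component attributes and the total duration. The value `td` is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# hypothetical duration: 1.5 minutes plus 400 milliseconds
td = pd.Timedelta(minutes=1.5, milliseconds=400)

# component attributes: whole seconds and the sub-second remainder are stored separately
whole_seconds = td.seconds        # seconds component only
sub_second_us = td.microseconds   # microseconds component only

# total duration expressed in minutes, including the sub-second part
total_minutes = td.total_seconds() / 60

print(whole_seconds, sub_second_us, total_minutes)
```

Dividing `.seconds` by 60 would silently drop the 400 ms remainder, which matters here because the observation interval is itself sub-second.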

In [None]:
# setup

from wine_analysis_hplc_uv import definitions
from wine_analysis_hplc_uv.db_methods import get_data, pivot_wine_data
import pandas as pd
import duckdb as db
import matplotlib.pyplot as plt
import seaborn as sns

pd.options.mode.copy_on_write = True
pd.options.display.width = None
pd.options.display.max_colwidth = 50
pd.options.display.max_rows = 20
pd.options.display.max_columns = 15
pd.options.display.colheader_justify = "left"

In [None]:
# get data

def fetch_dataset(con):
    get_data.get_wine_data(con, detection=('cuprac',), wavelength=(450,), varietal=('shiraz',))
    df = pivot_wine_data.pivot_wine_data(con)
    return df

with db.connect(definitions.DB_PATH) as con:
    df = fetch_dataset(con)

# an expression inside the `with` block is not rendered, so preview the frame afterwards
df.head()

## Sample 154

The sample used to explore the fundamentals of the time series is sample 154, the 2020 Leeuwin Estate Art Series Shiraz from Margaret River, Western Australia. A Shiraz was selected because Shiraz wines, at least anecdotally, have the highest peak intensity and signal complexity, so patterns in the data should be easy to detect.

In [None]:
# get 154
df154 = (df
         .loc[:,pd.IndexSlice['154', :,['mins','value']]]
         .stack(['samplecode','wine'])
         .reset_index()
         .set_index(["mins",'samplecode','wine'])
         .unstack(['samplecode','wine'])
         .reorder_levels(['samplecode','wine','vars'], axis=1)
         .pipe(lambda df: df.set_index(pd.to_timedelta(df.index, unit='minutes')))
         .pipe(lambda df: df.set_index(df.index.round('ms'))) # round to millisecond precision ('L' is a deprecated alias); refer to README.md/decisions 1
)
display(df154.head())
display(df154.columns)
display(df154.shape)
ax = (
    df154
    .stack(['samplecode','wine'])
    .reset_index()
    .assign(secs=lambda df: df['mins'].dt.total_seconds())
    .plot.line(x='mins',y='value')

)
ax


## Observing frequency

Using resampling methods requires a way of gauging the observation frequency of each dataset, and how regular that frequency is. Observation frequency is defined here as the number of observations per second, $\frac{n \text{ obs}}{m \text{ seconds}}$.

In [None]:
(
    df154
    .index
    .to_frame()
    .rename_axis('mins_index')
    .diff(1)
    .rename({'mins':'diff'}, axis=1)
    .reset_index()
    .pipe(lambda df: df if display(df['diff'].value_counts()) else df)
    .agg({'diff': 'mode'})
    .assign(diff=lambda df: df['diff'].dt.total_seconds()*1000)
    .assign(freq=lambda df: (1/df['diff'])*1000)
    .rename({'diff':'diff (ms)', 'freq': 'freq (Hz)'}, axis=1)
)

So we can see that the observation frequency is one observation per 400 milliseconds, or 2.5 Hz, and that, at least for this sample, the frequency is consistent. Thus no extraneous resampling is necessary beyond compression.
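The same check can be sketched on a small synthetic time axis. The index below is hypothetical (ten observations spaced 400 ms apart) and only stands in for the real `df154` index.

```python
import pandas as pd

# hypothetical regular time axis: 10 observations, 400 ms apart
time_axis = pd.timedelta_range(start="0s", periods=10, freq="400ms")

# interval between consecutive observations
intervals = time_axis.to_series().diff().dropna()

# most common interval, in seconds, and the corresponding frequency in Hz
modal_interval_s = intervals.mode().iloc[0].total_seconds()
freq_hz = 1 / modal_interval_s

# the axis is regular if every interval is identical
is_regular = intervals.nunique() == 1

print(modal_interval_s, freq_hz, is_regular)
```

Counting the distinct intervals is what shows regularity; the mode alone would hide a small number of irregular gaps.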

## Time Precision

An unfortunate side-effect of floating-point data types [@_d] is that, depending on the numerical data type, a given observation is stored with more digits than the actual precision of the instrument. As one of my goals is to align all of my time series to one universal time axis, decimal digits beyond an identified level of precision can be treated as noise and discarded without further thought. Thus I need a method of identifying an appropriate level of precision. Agilent is not forthcoming with the rating of their DAD, so an internal analysis is required. In [determining_time_precision](./determining_time_precision.ipynb) I observed the effect of changing the time scale on the granularity of the data, and coarsened the time scale until I found that millisecond resolution was the coarsest I could use without producing duplicate time values. A round-about way of approaching the problem, but an effective one.
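A rough sketch of that idea, not the linked notebook's actual code: round the time axis to progressively coarser resolutions and check when duplicate time values first appear. The time axis, the noise magnitude, and the helper `has_duplicates_at` are all hypothetical.

```python
import numpy as np
import pandas as pd

# hypothetical time axis in minutes: 400 ms steps plus tiny floating-point-like noise
rng = np.random.default_rng(0)
mins = np.arange(20) * (0.4 / 60) + rng.uniform(-1e-9, 1e-9, 20)
time_axis = pd.to_timedelta(mins, unit="min")


def has_duplicates_at(index, resolution):
    """Return True if rounding the index to `resolution` produces duplicate values."""
    return index.round(resolution).has_duplicates


# try progressively coarser resolutions
for resolution in ["us", "ms", "s"]:
    print(resolution, has_duplicates_at(time_axis, resolution))
```

On this synthetic axis, rounding to whole seconds collapses neighbouring observations while millisecond rounding does not; the real dataset needs the fuller analysis in the linked notebook.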

## Discovery of Dataset Time Axis Offset

2023-08-24 12:48:27

I have discovered that each signal in the dataset carries a scalar time offset equal to the time value of its first observation. Subtracting the first observation's time value from a signal's time axis makes it begin at zero, and as all signals are observed at the same frequency, they are then all aligned in the time dimension. The experiment is detailed [here](notebooks/determining_time_axis_offset.ipynb).
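A minimal sketch of the offset correction, assuming each signal is a Series indexed by a timedelta axis. The helper name `zero_time_axis` and the two toy signals are hypothetical.

```python
import pandas as pd


def zero_time_axis(signal):
    """Shift a signal's time axis so that its first observation sits at zero."""
    out = signal.copy()
    out.index = out.index - out.index[0]
    return out


# two hypothetical signals with the same 400 ms spacing but different start offsets
a = pd.Series([1.0, 2.0, 3.0], index=pd.to_timedelta([5, 405, 805], unit="ms"))
b = pd.Series([4.0, 5.0, 6.0], index=pd.to_timedelta([12, 412, 812], unit="ms"))

a0 = zero_time_axis(a)
b0 = zero_time_axis(b)

# after the shift both signals share an identical time axis
print(a0.index.equals(b0.index))
```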

## Signal Resampling

- Resampling is used to adjust a time series to a higher or lower frequency.
- Pandas has `.resample` and `.asfreq`.
- `.resample` is aggregation, `.asfreq` is selection [@vanderplas_2016].

`.resample`
- requires a datetime-like index, or a specified datetime-like column [@pandas2022]
- can upsample or downsample
- has a similar API to `DataFrameGroupBy`, in that you call `.resample` on the object, then call an aggregation function, i.e. `df.resample('M').mean()` [@mckinney_2022]
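The aggregation versus selection distinction can be illustrated on a toy series. The series and its date anchor are hypothetical; a `DatetimeIndex` is used here purely for illustration.

```python
import pandas as pd

# hypothetical series: values 0..9 observed every 400 ms
idx = pd.date_range("2023-01-01", periods=10, freq="400ms")
s = pd.Series(range(10), index=idx, dtype=float)

# aggregation: every observation in each 2 second bin contributes to the mean
aggregated = s.resample("2s").mean()

# selection: only the observation at each 2 second mark is kept
selected = s.asfreq("2s")

print(aggregated.tolist())
print(selected.tolist())
```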


### downsample

- Downsampling is a form of data compression that reduces the observation frequency of a signal. Typically it is achieved by selecting every nth observation from the original signal. The primary motivation for downsampling is to increase computation speed by reducing the dataset size to a minimum while preserving as much information as possible [@nielsen_2019].
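Selecting every nth observation can be sketched with positional slicing. The signal and the choice of `n = 5` are hypothetical.

```python
import pandas as pd

# hypothetical signal: 10 observations, 400 ms apart
time_axis = pd.timedelta_range(start="0s", periods=10, freq="400ms")
signal = pd.Series(range(10), index=time_axis)

# keep every nth observation; n = 5 takes 2.5 Hz down to 0.5 Hz
n = 5
every_nth = signal.iloc[::n]

print(every_nth)
```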
  
### upsample

- Upsampling is a method of adding observations to a time series to increase its frequency, through interpolation based on the original dataset. Nielsen [@nielsen_2019] gives the example of standardizing a dataset consisting of time series with different frequencies. To compare the series they would all need to be at the same frequency. You could either downsample everything to the lowest frequency series, which may lose too much information, or interpolate the lower frequency series to fill in the missing observations. Another example is an irregularly sampled time series: most calculations require a regular time axis, which can be achieved through interpolation of the lower frequency regions of the series.
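A minimal sketch of upsampling by linear interpolation, assuming a regular target time axis. The low-frequency series and the target spacing are hypothetical.

```python
import pandas as pd

# hypothetical low-frequency series: one observation every 2 seconds
low_idx = pd.timedelta_range(start="0s", periods=4, freq="2s")
low = pd.Series([0.0, 2.0, 4.0, 6.0], index=low_idx)

# target time axis at twice the frequency, spanning the same range
high_idx = pd.timedelta_range(start=low.index[0], end=low.index[-1], freq="1s")

# reindexing introduces NaN at the new time points; linear interpolation fills them
upsampled = low.reindex(high_idx).interpolate(method="linear")

print(upsampled.tolist())
```

The `"linear"` method treats the points as equally spaced, which is only appropriate here because the target axis is regular.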


## Downsampling of Chromatographic Time Series

2023-08-29 10:30:17

I have created a proof of concept of dataset compression through downsampling from 2.5 Hz to 0.5 Hz without noticeable loss of signal information [here](./downsampling_signals.ipynb). This showed that downsampling with mean aggregation reduced dataset size by 80%.
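The size reduction follows directly from the ratio of the two frequencies, as this sketch on a synthetic signal shows. It is not the linked notebook's code, and the sine curve is only a stand-in for a chromatogram.

```python
import numpy as np
import pandas as pd

# hypothetical signal: 50 observations at 2.5 Hz (400 ms spacing)
time_axis = pd.timedelta_range(start="0s", periods=50, freq="400ms")
signal = pd.Series(np.sin(np.linspace(0, np.pi, 50)), index=time_axis)

# downsample to 0.5 Hz by averaging each 2 second bin (5 observations per bin)
downsampled = signal.resample("2s").mean()

# fraction of observations removed
size_reduction = 1 - len(downsampled) / len(signal)

print(len(signal), len(downsampled), size_reduction)
```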

## Conclusion

The time axes of my samples have been thoroughly characterised and fundamental unification methods have been developed. What's more, a significant dataset compression method has been shown to work without noticeable loss of information, and at the very least it will provide a framework within which to optimize should it be found that information loss has indeed occurred.