# Chapter - Signal Preprocessing


This chapter details my efforts to standardise a chromatographic dataset of time × intensity signals.


## Meta

Decisions reached during this analysis will be recorded here and in the project [README](../README.md#eda-decisions). Methods developed during this analysis will be formalized [here](../src/wine_analysis_hplc_uv/signal_processing/mindex_signal_processing.py) in a class that will provide methods of preprocessing the time axis for use in pipelines.

In [None]:
from wine_analysis_hplc_uv import definitions

## X-Axis Characterization and Standardisation

We were able to normalize all inter-sample time axes and reduce them to one universal time axis across the dataset by rounding to a millisecond scale and correcting each for its scalar offset from time zero. More information can be found [here](notebooks/time_axis_characterisation_and_normalization.ipynb). The normalization methods are `.adjust_timescale` and `.correct_offset` in [mindex_signal_processing](src/wine_analysis_hplc_uv/signal_processing/mindex_signal_processing.py).
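As a rough illustration of the idea (not the code behind `.adjust_timescale` or `.correct_offset`), the sketch below shifts each time axis so it starts at zero and then rounds it to whole milliseconds. The function name `normalize_time_axis`, the assumption that time is stored in minutes, and the order of the two steps are all mine.

```python
import numpy as np


def normalize_time_axis(t_min):
    """Hypothetical helper: shift a time axis (assumed to be in minutes)
    to start at zero, then round it to integer milliseconds."""
    t = np.asarray(t_min, dtype=float)
    t = t - t[0]                                  # remove the scalar offset
    return np.rint(t * 60_000).astype(np.int64)   # minutes -> whole milliseconds


# Two synthetic samples at a nominal 2.5 Hz (one point every 0.4 s),
# with different start offsets and a little timing jitter.
step_min = 0.4 / 60
sample_a = 0.0012 + np.arange(5) * step_min
sample_b = 0.0031 + np.arange(5) * step_min + np.array([0, 2e-7, -1e-7, 1e-7, 0])

axis_a = normalize_time_axis(sample_a)
axis_b = normalize_time_axis(sample_b)
print(axis_a)
print(np.array_equal(axis_a, axis_b))
```

After both steps the two samples share an identical integer axis, which is what allows a single universal time index to be used for every signal.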

## Downsampling of Chromatographic Time Series

I have created a proof of concept of dataset compression [here](notebooks/downsampling_signals.ipynb) by downsampling from 2.5 Hz to 0.5 Hz without noticeable loss of signal information. This showed that downsampling with mean aggregation reduced dataset size by 90% without noticeable change in signal shape. What's more, I found that raw signals required an initial resampling to their current frequency to smooth out local irregularities, that 0 - 20 mins is the most interesting (and easiest to work with) range, and that Euclidean distance is an appropriate and intuitive measure of change during downsampling.
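A minimal sketch of mean-aggregated downsampling on a synthetic signal, assuming a pandas `Series` indexed by a `TimedeltaIndex`. The Gaussian peak, the variable names, and the choice to interpolate the downsampled signal back onto the original grid before measuring Euclidean distance are illustrative assumptions, not the notebook's actual procedure.

```python
import numpy as np
import pandas as pd

# Synthetic chromatogram: one Gaussian peak sampled at 2.5 Hz (every 400 ms).
index = pd.timedelta_range(start="0s", periods=3000, freq="400ms")
secs = np.asarray(index.total_seconds())
signal = pd.Series(np.exp(-0.5 * ((secs - 600.0) / 30.0) ** 2), index=index)

# Downsample to 0.5 Hz: each 2 s bin is replaced by the mean of its 5 samples.
down = signal.resample("2s").mean()

# To compare the two signals, interpolate the downsampled one back onto the
# original grid. Bins are labelled by their left edge, so shift each label
# by 0.8 s to the centre of the 5 samples it averages (0, 0.4, ..., 1.6 s).
down_secs = np.asarray(down.index.total_seconds()) + 0.8
reconstructed = np.interp(secs, down_secs, down.to_numpy())

# Euclidean distance between original and reconstructed signals,
# reported relative to the norm of the original.
distance = np.linalg.norm(signal.to_numpy() - reconstructed)
relative = distance / np.linalg.norm(signal.to_numpy())
print(len(signal), len(down), relative)
```

A small relative distance indicates that the peak shape survives the aggregation; the same measure can be tracked while varying the target frequency.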

## Creating Downsampled Test Dataset

Following the findings of [downsampling_signals](notebooks/downsampling_signals.ipynb), I created a test dataset of 4 Shiraz samples in Parquet file format [here](notebooks/creating_downsampled_testset.ipynb), after finding that reading and writing Parquet was twice as fast as CSV and that Parquet preserved the multiindexed nature of the dataframe.


In [None]:
definitions.TEST_PARQ_PATH

## Y-Axis Correction and Standardisation

Utilizing the test dataset created [here](notebooks/creating_downsampled_testset.ipynb), I have proceeded to develop a number of y-axis correction methods. As of 2023-09-04 13:39:25 this includes subtraction of y[0], min-max scaling, and subsetting to within 0-20 mins. A 'processed' parquet file can be found at:


In [None]:
definitions.PRO_PARQ_PATH
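The three y-axis steps named above can be sketched as small independent functions. The function names and the toy data are hypothetical and time is assumed to be in minutes; the project's own implementation may differ. Note that min-max scaling is unaffected by a constant shift, so applying it after subtracting `y[0]` gives the same result as applying it alone.

```python
import numpy as np


def subtract_first(y):
    """Shift the signal so that its first value is zero."""
    y = np.asarray(y, dtype=float)
    return y - y[0]


def min_max_scale(y):
    """Rescale the signal to the range [0, 1]."""
    y = np.asarray(y, dtype=float)
    span = y.max() - y.min()
    if span == 0:
        return np.zeros_like(y)   # flat signal: nothing to scale
    return (y - y.min()) / span


def subset_time(t_min, y, start=0.0, end=20.0):
    """Keep only the points whose time lies within [start, end] minutes."""
    t = np.asarray(t_min, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = (t >= start) & (t <= end)
    return t[keep], y[keep]


# Toy signal: 7 points spanning 0 to 30 minutes.
t = np.linspace(0, 30, 7)
y = np.array([2.0, 3.0, 7.0, 4.0, 2.5, 2.0, 2.0])

t_sub, y_sub = subset_time(t, y)
shifted = subtract_first(y)
scaled = min_max_scale(y)
print(t_sub, shifted, scaled)
```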

## Baseline Correction

General notes on baseline correction algorithms are currently (2023-09-04 13:48:23) being developed [here](notebooks/baseline_correction.ipynb).

A brief investigation of Backcor, a literature-recommended baseline correction algorithm for chromatograms, was undertaken [here](notebooks/backcor_on_sampleset.ipynb). Without manual tweaking of the parameters, it was found to perform poorly on the dataset, especially at the ends.

Following x- and y-axis preprocessing developed [here](notebooks/offset_and_scale_y_axis.ipynb) and [here] respectively, an investigation into the different baseline correction methods has been started [here].


## Observation of the Derivatives of Chromatographic Signals

It is well known that the first and second derivatives of a chromatographic signal can be used for peak detection. [This notebook](notebooks/derivatives_of_chromatograms.ipynb) is intended to investigate their behavior and see whether they can be useful in dashboard-level signal visualisation.
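To make the idea concrete, here is a minimal sketch of derivative-based peak detection on a synthetic, noise-free Gaussian peak, using finite differences from `np.gradient`. The signal, the variable names, and the apex criterion (first derivative changing sign from positive to non-positive where the second derivative is negative) are illustrative assumptions; real chromatograms would normally need smoothing first, since differentiation amplifies noise.

```python
import numpy as np

# Synthetic chromatogram: one Gaussian peak centred at 8 minutes.
t = np.linspace(0, 20, 2001)          # time in minutes
dt = t[1] - t[0]                      # uniform sample spacing
y = np.exp(-0.5 * ((t - 8.0) / 0.25) ** 2)

# Finite-difference estimates of the first and second derivatives.
dy = np.gradient(y, dt)
d2y = np.gradient(dy, dt)

# A peak apex lies where the first derivative goes from positive to
# non-positive while the second derivative is negative (concave down).
sign_change = (dy[:-1] > 0) & (dy[1:] <= 0)
apex_idx = np.flatnonzero(sign_change & (d2y[:-1] < 0))
apex_times = t[apex_idx]
print(apex_times)
```

The located apex falls within one sample of the true peak centre. The zero crossings of the second derivative on either side of the apex mark the inflection points, which can serve as a rough peak-width estimate.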
