# Chapter - Signal Preprocessing


This chapter details my efforts to standardise a chromatographic dataset of time × intensity signals.


In [None]:
from wine_analysis_hplc_uv import definitions

## Meta

Decisions reached during this analysis will be recorded here and in the project [README](../README.md#eda-decisions). Methods developed during this analysis will be formalized [here](../src/wine_analysis_hplc_uv/signal_processing/mindex_signal_processing.py) in a class that will provide methods of preprocessing the time axis for use in pipelines.

## X-Axis Characterization and Standardisation

We were able to normalize the time axes of all samples and reduce them to one universal time axis for the whole dataset by rounding to a millisecond scale and correcting for a scalar offset so that each axis starts at time zero. More information can be found at [time_axis_characterisation_and_normalization](./src/wine_analysis_hplc_uv/notebooks/time_axis_characterisation_and_normalization.ipynb). The normalization methods are `.adjust_timescale` and `.correct_offset` in [mindex_signal_processing](src/wine_analysis_hplc_uv/signal_processing/mindex_signal_processing.py).
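
A minimal sketch of the two normalization steps, not the project code itself. The function bodies are my assumption of what offset correction and rounding might look like; only the names `correct_offset` and `adjust_timescale` come from the text above, and the time values (assumed here to be in seconds) are made up for illustration.

```python
import pandas as pd


def correct_offset(time: pd.Series) -> pd.Series:
    """Subtract the first time value so the axis starts at zero (assumed behaviour)."""
    return time - time.iloc[0]


def adjust_timescale(time: pd.Series, decimals: int = 3) -> pd.Series:
    """Round the axis; 3 decimals of a second is millisecond resolution (assumed units)."""
    return time.round(decimals)


# Two hypothetical samples with the same nominal spacing but different
# start offsets and sub-millisecond jitter.
a = pd.Series([0.0031, 0.4031, 0.8032, 1.2031])
b = pd.Series([0.0052, 0.4051, 0.8052, 1.2052])

a_norm = adjust_timescale(correct_offset(a))
b_norm = adjust_timescale(correct_offset(b))

# After both steps the two samples share one time axis.
print(a_norm.equals(b_norm))
```

Once every sample maps onto the same axis, the time values can serve as a shared index for the whole dataset.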


## Downsampling of Chromatographic Time Series

I have created a proof of concept of dataset compression at [downsampling_signals](./src/wine_analysis_hplc_uv/notebooks/downsampling_signals.ipynb) through downsampling from 2.5Hz to 0.5Hz without noticeable loss of signal information. This showed that downsampling with mean aggregation reduced dataset size by 90% without noticeable change in signal shape. What's more, I found that raw signals required an initial resampling to their current frequency to smooth out local irregularities, that 0-20 min is the most interesting (and easiest to work with) range, and that Euclidean distance is an appropriate and intuitive measure of change during downsampling.
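
The following is an illustrative sketch of mean-aggregated downsampling and a Euclidean distance check, under my own assumptions: a synthetic single-peak signal on a `TimedeltaIndex` stands in for a real chromatogram, and the downsampled signal is linearly interpolated back onto the original axis before measuring the change. The notebook may compare signals differently.

```python
import numpy as np
import pandas as pd

# Synthetic 2.5 Hz signal (0.4 s spacing): one Gaussian peak over 20 minutes.
index = pd.timedelta_range(
    start=pd.Timedelta(0),
    end=pd.Timedelta(minutes=20),
    freq=pd.Timedelta(milliseconds=400),
)
secs = index.total_seconds().to_numpy()
signal = pd.Series(np.exp(-((secs - 600.0) ** 2) / (2 * 30.0**2)), index=index)

# Downsample to 0.5 Hz (2 s bins) using the mean of each bin.
down = signal.resample(pd.Timedelta(seconds=2)).mean()

# Put the downsampled signal back on the original axis so the two can be
# compared point by point, then measure the change as a Euclidean distance.
restored = down.reindex(signal.index).interpolate(method="linear")
distance = np.linalg.norm(signal.to_numpy() - restored.to_numpy())
relative_change = distance / np.linalg.norm(signal.to_numpy())

print(len(signal), len(down), relative_change)
```

Dividing the distance by the norm of the original signal gives a unitless figure that can be compared across samples of different intensity.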

## Creation of 'Raw' Dataset

To test the preprocessing, I need a version of the dataset in its state just prior to processing. Hence I have created a test dataset:

In [None]:
definitions.RAW_PARQ_PATH

## Creating Downsampled Test Dataset

Following the findings of [downsampling_signals](./src/wine_analysis_hplc_uv/notebooks/downsampling_signals.ipynb), I created a test dataset of 4 Shiraz samples in parquet file format at [creating_downsampled_testset](./src/wine_analysis_hplc_uv/notebooks/creating_downsampled_testset.ipynb). Parquet was chosen after it was found that reading and writing to parquet was twice as fast as csv, and that parquet preserved the multiindexed nature of the dataframe. In this process, the time axis is corrected and rounded, and the dataset is moved to a universal time index. The wine column level is also made unique by the addition of an ascending integer count. The filepath of the test dataset parquet file is here:

In [None]:
definitions.XPRO_DOWNSAMPLED_PARQ_PATH
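
As a rough illustration of making the wine column level unique, the sketch below appends an ascending integer to each wine label. The level names `samplecode` and `wine`, the sample codes, and the suffix format are assumptions for the example; the notebook may attach the count differently.

```python
import pandas as pd

# Hypothetical two-level columns: several samples share one wine name.
columns = pd.MultiIndex.from_tuples(
    [("s01", "shiraz"), ("s02", "shiraz"), ("s03", "shiraz")],
    names=["samplecode", "wine"],
)
df = pd.DataFrame([[0.0, 1.0, 2.0]], columns=columns)

# Append an ascending integer count to each wine label (assumed format).
wine = df.columns.get_level_values("wine")
numbered = [f"{name}_{count}" for count, name in enumerate(wine)]

# Rebuild the column index with the now-unique wine level.
df.columns = pd.MultiIndex.from_arrays(
    [df.columns.get_level_values("samplecode"), numbered],
    names=df.columns.names,
)

print(df.columns.get_level_values("wine").tolist())
```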

## Y-Axis Correction

Utilizing the test dataset created [here](./src/wine_analysis_hplc_uv/notebooks/creating_downsampled_testset.ipynb), I have begun developing y-axis correction methods. As of 2023-09-04 13:39:25 the only method is subtraction of y[0] from each signal so that y = 0 at time zero. A 'processed' parquet file can be found at:

In [None]:
definitions.XPRO_YPRO_DOWNSAMPLED_PARQ_PATH

Over time I expect more methods to be added to the above, including at least a smoothing routine.
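
The y[0] subtraction described in this section can be sketched in one line of pandas. The frame below is a made-up stand-in with one column per sample, not the test dataset.

```python
import pandas as pd

# Hypothetical signals: one column per sample, rows ordered by time.
df = pd.DataFrame(
    {"sample_a": [5.0, 7.0, 6.0], "sample_b": [-2.0, 0.0, 3.0]}
)

# Subtract each column's first value so every signal starts at y = 0.
zeroed = df - df.iloc[0]

print(zeroed)
```

Because `df.iloc[0]` is a Series indexed by column name, the subtraction is aligned per column, so each sample is shifted by its own starting value.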

## Baseline Correction

General notes on baseline correction algorithms are currently (2023-09-04 13:48:23) being developed in [baseline_correction](./src/wine_analysis_hplc_uv/notebooks/baseline_correction.ipynb).


A brief investigation of Backcor, a literature-recommended baseline correction algorithm for chromatograms, was undertaken in [backcor_on_sampleset](./src/wine_analysis_hplc_uv/notebooks/backcor_on_sampleset.ipynb). Without manually tweaking the parameters it was found to perform poorly on the dataset, especially at the ends.

Following x and y axis preprocessing developed [here](././src/wine_analysis_hplc_uv/notebooks/time_axis_characterisation_and_normalization.ipynb) and [here](././src/wine_analysis_hplc_uv/notebooks/offset_and_scale_y_axis.ipynb) respectively, an investigation into the different baseline correction methods has been started in [developing_baseline_subtraction](././src/wine_analysis_hplc_uv/notebooks/developing_baseline_subtraction.ipynb).

## Signal Smoothing

A perfunctory investigation into the application of Savitzky-Golay smoothing has been started in [investigating_signal_smoothing](././src/wine_analysis_hplc_uv/notebooks/investigating_signal_smoothing.ipynb), but as of 2023-09-07 14:48:51 hasn't gotten very far.
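
For orientation, a minimal sketch of Savitzky-Golay smoothing with `scipy.signal.savgol_filter` on a synthetic noisy peak. The window length and polynomial order are arbitrary illustrative choices, not values tuned for this dataset.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)

# Synthetic chromatogram-like signal: one Gaussian peak plus white noise.
t = np.linspace(0.0, 20.0, 601)
clean = np.exp(-((t - 10.0) ** 2) / (2 * 1.0**2))
noisy = clean + rng.normal(scale=0.05, size=t.size)

# Fit a cubic polynomial in a sliding 21-point window (illustrative values).
smoothed = savgol_filter(noisy, window_length=21, polyorder=3)

# Compare the error against the known clean signal before and after smoothing.
rmse_noisy = np.sqrt(np.mean((noisy - clean) ** 2))
rmse_smoothed = np.sqrt(np.mean((smoothed - clean) ** 2))
print(rmse_noisy, rmse_smoothed)
```

The window must be wider than the polynomial order and narrow relative to the peaks, otherwise the filter starts to flatten peak apexes.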

## Multiple Signal Alignment via DTW

2023-09-07 14:56:11

The holy grail of this project is to apply [Dynamic Time Warping]() to a chromatographic dataset in order to align all peaks sample-by-sample. Efforts began in late April 2023 but faltered because the pipeline became unmaintainable. Five months later we're finally at a stage where we can begin again.

The earlier sections of this chapter addressed basic signal preprocessing. Now, we need to produce a method of identifying the 'most similar' signal to all other signals in the set, then align on that signal.

### Identifying a Representative Sample

A method of identifying a representative sample has been established in [identifying_most_similar_signal](././src/wine_analysis_hplc_uv/notebooks/identifying_most_similar_signal.ipynb) and added to SignalProcessor as `most_correlated`. It uses the pandas built-in method `.corr` to construct a correlation matrix, then returns the samplecode idx object of the sample with the highest mean correlation.
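
A sketch of the idea as a free function, assuming a dataframe with one column per sample. This is not the `SignalProcessor` method itself; the signature and the synthetic signals are hypothetical.

```python
import numpy as np
import pandas as pd


def most_correlated(df: pd.DataFrame):
    """Return the column label with the highest mean correlation to all columns.

    Hypothetical stand-in for the SignalProcessor method of the same name.
    """
    corr = df.corr()  # pairwise Pearson correlation between columns
    # The diagonal (self-correlation of 1) adds the same constant to every
    # column mean, so it does not affect which column ranks highest.
    return corr.mean().idxmax()


# Three synthetic signals whose peaks sit at slightly different positions;
# the middle one overlaps most with the other two.
x = np.arange(101, dtype=float)
df = pd.DataFrame(
    {
        "s1": np.exp(-((x - 40.0) ** 2) / (2 * 10.0**2)),
        "s2": np.exp(-((x - 50.0) ** 2) / (2 * 10.0**2)),
        "s3": np.exp(-((x - 60.0) ** 2) / (2 * 10.0**2)),
    }
)

representative = most_correlated(df)
print(representative)
```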

### DTW

A method of applying DTW utilising `most_correlated` will be developed in [Dynamic Time Warping](././src/wine_analysis_hplc_uv/notebooks/dynamic_time_warping.ipynb).
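
Until that notebook exists, the textbook dynamic-programming form of DTW can be sketched as below. This is a plain, unconstrained implementation written for illustration, with a hypothetical function name and toy signals; it is not the method the project will necessarily adopt, and a dedicated library would likely be preferable on full-length chromatograms.

```python
import numpy as np


def dtw_path(reference, query):
    """Classic DTW: return the total alignment cost and the warping path.

    The path is a list of (reference_index, query_index) pairs.
    """
    reference = np.asarray(reference, dtype=float)
    query = np.asarray(query, dtype=float)
    n, m = len(reference), len(query)

    # cost[i, j] is the cheapest alignment of reference[:i] with query[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(reference[i - 1] - query[j - 1])
            cost[i, j] = local + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]
            )

    # Walk back from the end, always stepping to the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(candidates, key=lambda ij: cost[ij])
        path.append((i - 1, j - 1))
    path.reverse()
    return cost[n, m], path


# Toy example: the query is the reference peak shifted one step earlier.
reference = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0])
query = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0])

distance, path = dtw_path(reference, query)
pointwise = np.abs(reference - query).sum()
print(distance, pointwise, path)
```

In the intended workflow the reference would be the sample returned by `most_correlated`, with every other sample aligned against it in turn.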

## Observation of the Derivatives of Chromatographic Signals

It is well known that the first and second derivatives of a chromatographic signal can be used for peak detection. [This notebook](./src/wine_analysis_hplc_uv/notebooks/derivatives_of_chromatograms.ipynb) is intended to investigate their behavior and see whether they can be useful in dashboard-level signal visualisation.
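
To make the peak detection idea concrete, here is a small sketch on a synthetic noise-free peak: the apex is where the first derivative crosses from positive to non-positive while the second derivative is negative. Real signals would need smoothing first, since numerical differentiation amplifies noise.

```python
import numpy as np

# Synthetic signal: a single Gaussian peak centred at t = 5.
t = np.linspace(0.0, 10.0, 1001)
y = np.exp(-((t - 5.0) ** 2) / (2 * 0.5**2))

# Numerical first and second derivatives on the uniform grid.
dt = t[1] - t[0]
d1 = np.gradient(y, dt)
d2 = np.gradient(d1, dt)

# An apex lies where the slope changes from positive to non-positive
# and the curvature is negative.
slope_sign_change = (d1[:-1] > 0) & (d1[1:] <= 0)
apex_idx = np.flatnonzero(slope_sign_change & (d2[:-1] < 0))

print(t[apex_idx])
```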
