# Compressing Chromatographic Signals through Downsampling

It is of interest to compress my dataset in order to speed up development iterations: progress is proportional to the number of development iterations, and the time between iterations is proportional to the size of the dataset. We want to reduce the time between iterations, ergo reduce the size of the dataset wherever possible.

One method of signal compression is downsampling with an aggregate function. A high fidelity detector is necessary during acquisition, as without it peaks may only be partially detected, or not detected at all. During data processing, however, we don't need every datapoint, just the most important information from a given interval. Thus multiple datapoints can be summarized through an aggregate function, such as the average over an interval. This results in a compressed signal while preserving the overall shape of the dataset. A compromise has to be reached between compression and granularity. One negative side effect is that true peak heights can be lost when averaging, and the overall intensity tends toward the total intensity mean as the number of observations decreases.
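The loss of peak height described above can be illustrated with a small sketch. The signal here is synthetic (a single made-up Gaussian peak, not the wine data), and `downsample_mean` is a hypothetical helper written for this illustration only:

```python
import numpy as np

# Synthetic chromatogram: one narrow Gaussian peak on a flat baseline.
# All values are invented for illustration.
t = np.arange(6000) * 0.4  # time in seconds at 2.5 Hz
signal = 100.0 * np.exp(-0.5 * ((t - 1200.0) / 1.5) ** 2)


def downsample_mean(y, factor):
    """Average consecutive blocks of `factor` points (any trailing remainder is dropped)."""
    n = (len(y) // factor) * factor
    return y[:n].reshape(-1, factor).mean(axis=1)


print("original", len(signal), round(signal.max(), 2), round(signal.mean(), 4))
for factor in (5, 25, 125):
    reduced = downsample_mean(signal, factor)
    # The maximum shrinks as the bins widen, while the overall mean is unchanged.
    print(factor, len(reduced), round(reduced.max(), 2), round(reduced.mean(), 4))
```

As the bin width grows, the reported peak maximum drifts toward the mean of the whole signal, which is the trade-off between compression and granularity mentioned above.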

In [None]:
from wine_analysis_hplc_uv import definitions
import duckdb as db
import pandas as pd
from wine_analysis_hplc_uv.db_methods import get_data, pivot_wine_data

con = db.connect(definitions.DB_PATH)
get_data.get_wine_data(
    con, wavelength=("450",), detection=("cuprac",), varietal=("shiraz",)
)
df = pivot_wine_data.pivot_wine_data(con)
df

In [None]:
df = (
    df.stack(["samplecode", "wine"])
    .drop(["id", "detection"], axis=1)
    .reset_index("i")
    .assign(mins=lambda df: pd.to_timedelta(df["mins"], unit="min"))
    #  .unstack(['samplecode','wine'])
    #  .reorder_levels(['samplecode','wine','vars'], axis=1)
    #  .sort_index(axis=1)
)
df

Pandas handles resampling of datasets through the `resample` API, which relies on a datetime-like column (or index) to bin the rows and then applies aggregation operations to the remaining columns.

To preserve non-numeric columns during a resample, it is necessary to provide an aggregation rule for them. Typically `'first'` will preserve the 'left' value of the binned period for that column; if it's the right side that needs preserving, use `'last'`.
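A minimal sketch of the per-column aggregation rules discussed above, using a small invented frame (the `wine` label and the values are placeholders, not rows from the database):

```python
import pandas as pd

# Toy frame: 8 observations spaced 0.4 s apart, with one non-numeric column.
toy = pd.DataFrame(
    {
        "mins": pd.to_timedelta([0, 400, 800, 1200, 1600, 2000, 2400, 2800], unit="ms"),
        "wine": ["example_wine"] * 8,  # placeholder label
        "i": range(8),
        "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    }
)

# Bin on the timedelta column and give every column its own rule:
# 'first' keeps the left-most entry of each bin, 'last' the right-most.
resampled = toy.resample("2s", on="mins").agg(
    {"wine": "first", "i": "last", "value": "mean"}
)
print(resampled)
```

Columns left out of the dictionary are dropped from the result, so every column that should survive needs a rule.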

The resample frequency is controlled by the `rule` argument, which takes an offset string or object ([see docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html)). These strings are called 'offset aliases' and are defined in the [Time series / date functionality](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases) page. They describe the base unit of a time series and can be prefixed with a numeric multiplier. For example, `'min'` = `'1min'` is minutely and will downsample the signal to one observation per minute, while `'0.5min'` will downsample it to one observation every half minute.
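One way to check how pandas interprets a rule string is to parse it with `to_offset` and look at the resulting period. This is a sketch for inspection only; the list of rules is chosen here for illustration:

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

# Parse a few rule strings and express each as a fixed time span.
periods = {}
for rule in ("min", "1min", "0.5min", "30s", "400ms"):
    offset = to_offset(rule)
    periods[rule] = pd.Timedelta(offset.nanos, unit="ns")
    print(f"{rule!r:>9} -> {periods[rule]}")
```

Rules that resolve to the same period, such as `'min'` and `'1min'`, produce identical bins when passed to `resample`.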

For example, to reduce the 40 minute, 6000 observation signal we start with (6000 obs / 40 min = 150 observations per minute) by 20%, we would target 120 observations per minute (4800 observations in total), which is one observation every 0.5 seconds.

Note: dividing the observations per minute by 60 gives the sampling frequency in Hz, so 150 / 60 = 2.5 Hz. The reciprocal is the sampling period in seconds, ergo 1 / 2.5 = 0.4 seconds = 400 milliseconds.
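The arithmetic above can be written out as a short calculation. The variable names are chosen here for illustration, and turning the target period into a millisecond rule string is one possible convention rather than the only one:

```python
n_obs = 6000      # observations in the original signal
runtime_min = 40  # run length in minutes

obs_per_min = n_obs / runtime_min  # 150 observations per minute
freq_hz = obs_per_min / 60         # 2.5 Hz
period_s = 1 / freq_hz             # 0.4 s, i.e. 400 ms

# A 20% reduction in the number of observations.
target_obs_per_min = obs_per_min * (1 - 0.20)           # about 120 per minute
target_n_obs = round(target_obs_per_min * runtime_min)  # 4800
target_period_ms = round(60_000 / target_obs_per_min)   # milliseconds per observation
rule = f"{target_period_ms}ms"                          # offset string for `resample`

print(freq_hz, period_s, target_n_obs, rule)
```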

For example, we could downsample with 2 second bins, which groups 5 of the original 0.4 second observations into each bin. This bin width was selected because the peak maxima last for approximately 6 observations:


In [None]:
(
    df.groupby(["samplecode", "wine"])
    .pipe(lambda g: g.get_group(list(g.groups.keys())[0]))
    .reset_index()
    .pipe(lambda df: df if df.plot(x="mins", y="value", title="o freq") else df) # display original signal
    .pipe(
        lambda df: df if display(df.loc[df["value"] >= df["value"].max() * 0.9]) else df
    ) # display observations within 10% of peak maxima
    .pipe(lambda df: df if display(df.info()) else df) # display information about original
    .pipe(
        lambda df: df.resample("2s", on="mins", group_keys=True).agg(
            dict(
                samplecode="first",
                wine="first",
                i="first",
                value="mean",
            )
        ) # resample signal to 1 observation every 2 seconds, from 1 every 0.4 seconds
    )
    .reset_index()
    .pipe(
        lambda df: df if display(df.loc[df["value"] >= df["value"].max() * 0.9]) else df
    ) # display observations within 10% of peak maxima after resampling
    .pipe(lambda df: df if display(df.info()) else df) # display info about resampled signal
    .pipe(lambda df: df if df.plot(x="mins", y="value", title="downsampled 2s") else df) # plot resampled signal
)

So we can see here that the signal has been downsampled by a factor of 5, from a sampling frequency of 2.5 Hz to a frequency of 0.5 Hz, or 1 observation every 2 seconds. This results in a dataframe memory size reduction from 234.5 KB to 47 KB, a difference of 187.5 KB, or 79.96%.
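A comparable measurement can be sketched on synthetic data with `DataFrame.memory_usage`. The frame below is random noise with the same length and sampling period as the signal above, so the byte counts will differ from the figures reported for the real data; `memory_bytes` is a hypothetical helper:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 6000 observations, one every 400 ms.
original = pd.DataFrame(
    {
        "mins": pd.to_timedelta(np.arange(6000) * 400, unit="ms"),
        "value": np.random.default_rng(0).normal(size=6000),
    }
)
resampled = original.resample("2s", on="mins").agg({"value": "mean"}).reset_index()


def memory_bytes(frame):
    """Total memory of a frame in bytes, including its index."""
    return int(frame.memory_usage(index=True, deep=True).sum())


before, after = memory_bytes(original), memory_bytes(resampled)
reduction_pct = 100 * (before - after) / before
print(f"{len(original)} rows, {before} B -> {len(resampled)} rows, {after} B")
print(f"reduction: {reduction_pct:.2f}%")
```

With 5 observations per bin the row count drops to a fifth, so the reduction should land close to 80%, consistent with the figure above.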
