# Compressing Chromatographic Signals through Downsampling

It is of interest to compress my dataset in order to increase the number of development iterations: progress is proportional to the number of iterations, and the time between iterations is proportional to the size of the dataset. We want to reduce the time between iterations, ergo reduce the size of the dataset wherever possible.

One method of signal compression is downsampling, a form of data compression that reduces the frequency of observation of a signal [@nielsen_2019]. A high-fidelity detector is necessary during acquisition, as without it peaks may only be partially detected, or not detected at all. During data processing, however, we don't need every datapoint, just the most important information from a given interval. Multiple datapoints can thus be summarized through an aggregate function, such as the average over an interval. This results in a compressed signal that preserves the overall shape of the dataset. A compromise has to be reached between compression and granularity. One negative side effect is that true peak heights can be lost when averaging, and the overall intensity tends toward the mean intensity of the whole signal as the number of observations decreases.
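
As a minimal sketch of the trade-off described above, the following uses a synthetic Gaussian peak (an assumption, not the wine data) sampled every 0.4 seconds and averages it into 4-second bins. The series length drops tenfold, and the peak height is attenuated by the averaging.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a chromatographic peak (hypothetical data):
# a Gaussian centred at 60 s, sampled every 400 ms for 2 minutes.
offsets_ms = np.arange(300) * 400
seconds = offsets_ms / 1000
signal = pd.Series(
    np.exp(-0.5 * ((seconds - 60.0) / 2.0) ** 2),
    index=pd.to_timedelta(offsets_ms, unit="ms"),
    name="value",
)

# Downsample by averaging every 4-second bin (10 observations per bin).
downsampled = signal.resample(pd.Timedelta(seconds=4)).mean()

print(len(signal), "->", len(downsampled), "observations")
print("original peak height:   ", signal.max())
print("downsampled peak height:", downsampled.max())
```

The narrower the peak relative to the bin width, the more of its height is lost, which is why the bin width has to be chosen against the peak width.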

In [None]:
from wine_analysis_hplc_uv import definitions
import duckdb as db
import pandas as pd
import numpy as np
from wine_analysis_hplc_uv.db_methods import get_data, pivot_wine_data
import matplotlib.pyplot as plt
import seaborn as sns

con = db.connect(definitions.DB_PATH)
get_data.get_wine_data(
    con, wavelength=("450",), detection=("cuprac",), varietal=("shiraz",)
)
df = pivot_wine_data.pivot_wine_data(con)
df

## Convert mins to timedelta

First the time axis needs to be of timedelta datatype.

In [None]:
df = (
    df.loc[:, "154"]
    .stack(["wine"])
    .drop(["id", "detection"], axis=1)
    .reset_index()
    .drop("i", axis=1)
    .assign(mins=lambda df: pd.to_timedelta(df["mins"], unit="min").round("L"))
    .assign(mins=lambda df: df.mins - df.mins.iloc[0])
    .set_index(["wine", "mins"])
    .unstack("wine")
)
df

## Pandas Resample API

Pandas handles resampling of datasets through the `resample` API, which relies on a datetime-like index (or column) to apply binning and aggregation operations to the remaining columns.

The resample frequency is controlled by the `rule` argument, which takes an offset string or object ([see docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html)). These strings are called 'offset aliases' and are defined in the [Time series / date functionality](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases) page. They describe a base unit of a time series, and can be prefixed with a multiplier in the string (whether arithmetic beyond multiplication is accepted still needs confirming). For example, `'min'` is equivalent to `'1min'` and will downsample the signal to one observation per minute, while `'0.5min'` will downsample it to one observation every half minute.
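
A small sketch of how the `rule` argument behaves, using a hypothetical one-observation-per-second series rather than the wine data. It compares the `'1min'` and `'0.5min'` strings against an explicit `pd.Timedelta` object, which `resample` also accepts as a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical 2-minute signal with one observation per second.
index = pd.to_timedelta(np.arange(120), unit="s")
signal = pd.Series(np.arange(120, dtype=float), index=index)

# One observation per minute.
per_minute = signal.resample("1min").mean()

# A fractional multiplier inside the rule string: one observation per 30 s.
per_half_minute = signal.resample("0.5min").mean()

# The same bin width passed as an object instead of a string.
explicit = signal.resample(pd.Timedelta(seconds=30)).mean()

print(per_minute)
print(per_half_minute)
```

Passing a `pd.Timedelta` avoids depending on the spelling of the offset aliases, which has changed between pandas versions.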

## Calculating Sampling Ratios

For example, the signal we start with spans 40 minutes and contains 6000 observations, i.e. 6000 obs / 40 min = 150 observations per minute. Reducing it by 20% would result in 120 observations per minute (4800 observations in total).

Alternatively, we could reduce the dataset by a ratio of 1:6, which has been selected because the peak maximum lasts for approximately 6 observations.
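
The arithmetic above can be sketched in a few lines of plain Python. The variable names are illustrative only; the input figures (6000 observations, 40 minutes, 20%, 1:6) are the ones quoted in the text.

```python
n_obs = 6000        # observations in the full signal
duration_min = 40   # run time in minutes

obs_per_min = n_obs / duration_min   # observations per minute
period_s = 60 / obs_per_min          # seconds between observations

# A 20% reduction keeps 80% of the observations.
kept_per_min = obs_per_min * 0.8
kept_total = kept_per_min * duration_min

# A 1:6 reduction keeps one observation for every six originals.
ratio = 6
ratio_total = n_obs / ratio
ratio_period_s = period_s * ratio

print(f"{obs_per_min:.0f} obs/min, {period_s:.1f} s per observation")
print(f"20% reduction: {kept_per_min:.0f} obs/min, {kept_total:.0f} observations")
print(f"1:6 reduction: {ratio_total:.0f} observations, {ratio_period_s:.1f} s per observation")
```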

## Relationship Between Observations per Second and Sampling Frequency

Note: converting minutes to seconds and then calculating the number of observations per second gives the sampling frequency in Hz; 150 observations per minute is 2.5 Hz. The reciprocal of the frequency is the sampling period in seconds, ergo 1/2.5 = 0.4 seconds = 400 milliseconds per observation. Observe below:
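
As a worked check of this arithmetic, using only the figures already quoted (6000 observations over 40 minutes), with $f$ denoting the sampling frequency and $T$ the sampling period:

$$
f = \frac{6000\ \text{obs}}{40 \times 60\ \text{s}} = 2.5\ \text{Hz}, \qquad T = \frac{1}{f} = \frac{1}{2.5\ \text{Hz}} = 0.4\ \text{s} = 400\ \text{ms}
$$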


In [None]:
# calculate the mean sampling frequency for sample 157 by calculating the reciprocal of the mean difference in time elements

mean_sampling_frequency = (
    1 / df.reset_index("mins").mins.dt.total_seconds().diff().mean()
)
mean_sampling_frequency

Next, it is important to regularize the sampling frequency. Due to systemic error, the interval between observations can vary. These variations can be rectified by calculating the mean sampling frequency, resampling the dataset to that frequency, then interpolating the missing data:

In [None]:
# resample to the mean sampling frequency then interpolate the missing data, then plot

df = df.resample(f"{1/np.round(mean_sampling_frequency,3)}S").interpolate(
    method="linear"
)
df.plot()
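
To illustrate the regularization step in isolation, here is a sketch on a hypothetical jittered time axis (nominally 400 ms apart, with up to 20 ms of random error). Note the assumption: it aggregates each bin with the mean before interpolating the empty bins, which differs slightly from calling `interpolate` directly on the resampler as in the cell above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical irregular time axis: nominally 400 ms apart, +/- 20 ms jitter.
n = 50
offsets_ms = np.arange(n) * 400 + rng.integers(-20, 21, size=n)
offsets_ms[0] = 0
signal = pd.Series(
    np.sin(np.arange(n) / 5.0),
    index=pd.to_timedelta(offsets_ms, unit="ms"),
)

# Mean spacing between observations, rounded to the nearest millisecond.
mean_period = signal.index.to_series().diff().mean().round("ms")

# Bin onto a regular grid at the mean period, then fill any empty bins.
regular = signal.resample(mean_period).mean().interpolate(method="linear")

print("mean period:", mean_period)
print("distinct spacings after regularizing:",
      regular.index.to_series().diff().dropna().nunique())
```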

### Subset to Interval of Interest

As per previous discussion, the region of interest is between 0 and 20 minutes, thus we will subset the signal:

In [None]:
# subset to 0 - 20 mins, plot result
df = df.loc[: pd.to_timedelta(20, unit="minutes")]
df.plot()
plt.tight_layout()
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

This is a generally much more interesting region of the signal, and it provides a vaguely symmetric baseline.

## Identifying an Appropriate Target Frequency

Generally speaking, the choice of target frequency is arbitrary, subject to the criterion that we reduce memory size while preserving the overall shape of the dataset. At a bare minimum, a peak consists of three data points: a start point, a maximum, and an end point. Observing how the number of data points making up the tallest peak changes will therefore indicate the lower limit of sampling frequency for this dataset. Specifically, I will set it as the lowest frequency at which 3 data points are present within the top 10% of the intensity of the peak maximum.
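
The criterion above can be sketched as a small search over candidate frequencies. The data here is a synthetic Gaussian peak, and both the helper `points_near_maximum` and the list of candidate bin widths are hypothetical; bins are aggregated with the mean, which is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical Gaussian peak (sigma = 6 s) centred at 120 s, sampled every 400 ms.
offsets_ms = np.arange(600) * 400
seconds = offsets_ms / 1000
signal = pd.Series(
    np.exp(-0.5 * ((seconds - 120.0) / 6.0) ** 2),
    index=pd.to_timedelta(offsets_ms, unit="ms"),
)


def points_near_maximum(series: pd.Series, fraction: float = 0.9) -> int:
    """Count observations at or above `fraction` of the series maximum."""
    return int((series >= series.max() * fraction).sum())


# Count the points in the top 10% of the peak for each candidate bin width.
counts = {}
for seconds_per_obs in (1, 2, 4, 8, 16):
    resampled = signal.resample(pd.Timedelta(seconds=seconds_per_obs)).mean()
    counts[seconds_per_obs] = points_near_maximum(resampled)

# Bin widths that still leave at least three points near the maximum.
acceptable = [s for s, c in counts.items() if c >= 3]

print("original:", points_near_maximum(signal))
print("per bin width (s):", counts)
print("acceptable bin widths (s):", acceptable)
```

The widest acceptable bin depends entirely on the peak width, so the result for this synthetic peak says nothing about the wine data.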

The properties of the original signal are as follows:

In [None]:
# info() prints its summary directly and returns None, so display() is not needed
df.info()

In [None]:
# display a statistical description of df, display the points within 90% of the peak maxima

display(
    df.stack(["wine"])
    .pipe(lambda df: df.loc[df["value"] >= df["value"].max() * 0.9])
    .pipe(
        lambda df: df
        if df.reset_index()
        .assign(mins=lambda df: df.mins.dt.total_seconds() / 60)
        .plot.scatter(x="mins", y="value", title="points within 90% of maxima")
        else df
    )
    .pipe(lambda df: df if display(df.describe()) else df)  # display df
)

Downsampling to a frequency of 0.5 Hz, or 1 observation every 2 seconds results in the following:

In [None]:
# resample signal to 1 observation every 2 seconds, from 1 every 0.4

df = df.assign(
    value_2S=lambda df: df.groupby(["wine"], axis=1)
    .resample("2S")
    .interpolate(method="linear")
)

In [None]:
df.value_2S.describe()

In [None]:
"""
display interval of peak maxima within 10% of peak after resampling to 2S intervals,
both statistics and a plot
"""

(
    df.stack(["wine"]).pipe(
        lambda df: df.loc[df.value_2S >= df.value_2S.max() * 0.9, "value_2S"]
        .pipe(
            lambda df: df
            if df.reset_index()
            .assign(mins=lambda df: df.mins.dt.total_seconds() / 60)
            .plot.scatter(x="mins", y="value_2S")
            else df
        )
        .pipe(lambda df: df if display(df.describe()) else df)  # display df
    )
)

That meets the defined criteria, but how does it compare to the original signal?

In [None]:
# display the original signal and downsampled signal as row-wise subplots

fig, axs = plt.subplots(2, 1)
ax = df["value"].plot.line(ax=axs[0]).legend(labels=["original"])
ax = df["value_2S"].interpolate().plot.line(ax=axs[1]).legend(labels=["downsampled"])
plt.suptitle("original and downsampled signals")
plt.show()

In [None]:
_ = {
    "original_signal": (df.value.memory_usage() / 1024).sum(),
    "2S_downsample": df.value_2S.dropna().memory_usage(index=True) / 1024,
}

(
    pd.DataFrame(_, index=["memory_usage (KB)"])
    .assign(
        diff=lambda df: df["original_signal"] - df["2S_downsample"],
    )
    .assign(perc_diff=lambda df: df["diff"] / df.original_signal * 100)
)

So we can see that downsampling to 2S decreases the series memory size by 91.6%, after removing NA rows. 

## Summarizing Change Due to Downsampling

As there is no visible change, I need to be able to summarize how the signal has changed, or at least the magnitude of the difference between the original and downsampled signals.


## Mapping Downsampled Signal To Original

NOTE: 2023-09-06 13:21:14 this has actually been solved by assigning the downsampled series back to the original frame rather than creating a new index. This section is kept for posterity, but is now superfluous.

To measure the Euclidean distance between two signals, I need to perform vector arithmetic. One of the fundamental rules of vector arithmetic is that the vectors must be of the same length, so the shorter vector needs to be mapped onto the longer one before a comparison can be made. One method of doing this could be to upsample the downsampled signal back to the original sampling period of 0.4 seconds per observation.
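
A minimal sketch of that idea on hypothetical data: a series is downsampled with a bin mean (an assumption), mapped back onto the original index by reindexing and linear interpolation, and the Euclidean distance is then computed with `np.linalg.norm`, used here as a stand-in for any Euclidean distance function.

```python
import numpy as np
import pandas as pd

# Hypothetical signal: 60 observations, one every 2 seconds.
index = pd.to_timedelta(np.arange(0, 120, 2), unit="s")
original = pd.Series(np.sin(np.arange(60) / 6.0), index=index)

# Downsample to one observation per 30 seconds.
coarse = original.resample(pd.Timedelta(seconds=30)).mean()

# Map the coarse series back onto the original index. Linear interpolation
# fills the gaps between coarse points; with pandas' default forward
# direction, the tail after the last coarse point repeats the last value.
restored = coarse.reindex(original.index).interpolate(method="linear")

# Both vectors now have the same length, so the distance is defined.
distance = np.linalg.norm(original.to_numpy() - restored.to_numpy())

print(len(original), len(coarse), len(restored))
print("Euclidean distance:", distance)
```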

To develop this method we will use a signal downsampled to one observation per 30 seconds, in order to have a much more obvious difference between the source and destination signals.

### Comparing a Downsampled Series to the Original

Below is a comparison of the original time series and the same series downsampled to 30 seconds per observation then upsampled back to 2 seconds per observation via linear interpolation:

In [None]:
value_2S_df = df.value_2S.dropna()
upsampled = df.value_2S.resample("30S").interpolate().resample("2S").interpolate()
display(value_2S_df)
display(upsampled)

In [None]:
fig2, axs2 = plt.subplots(2, 1)

value_2S_df.plot(ax=axs2[0]).legend(labels=["2 Second sampling f"])
upsampled.plot(ax=axs2[1]).legend(labels=["30 second sampling f"])
plt.suptitle("Comparison of 2 and 30 second sampling frequencies")
display(pd.concat([value_2S_df.describe(), upsampled.describe()], axis=1))

In [None]:
compare = (
    pd.concat(
        [
            value_2S_df.reset_index()["mins"].agg(min="min", max="max", count="count"),
            upsampled.reset_index()["mins"].agg(min="min", max="max", count="count"),
        ],
        names=["a", "b"],
        axis=1,
    )
    .set_axis(["2S", "30S"], axis=1)
    .rename_axis("freq", axis=1)
    .rename_axis("agg")
)
compare["diff"] = compare["2S"] - compare["30S"]
compare

So there appears to be a discrepancy in the resulting array lengths due to the positioning of the bins. The lower the frequency I downsample to, the shorter the resulting resampled time series is. At default settings:

In [None]:
freqs = ["2S", "4S", "8S", "16S", "32S"]
length = []
for freq in freqs:
    length.append(
        value_2S_df.resample(freq).interpolate().resample("2S").interpolate().shape[0]
    )

ax = plt.scatter(freqs, length)
plt.xlabel("sampling frequency")
plt.ylabel("length")
plt.suptitle("sampling freq. v. resulting array length")

So that's a problem.

So 14 time points are missing on upsampling. Which ones?

In [None]:
value_2S_df.index.difference(upsampled.index)

So it's literally the last 30 seconds that have been shaved off. When does that happen? Could it be during the first downsampling?

In [None]:
(
    value_2S_df.resample("30S")
    .interpolate()
    .reset_index()
    .agg(
        count=("mins", lambda x: x.count()),
        first=("mins", lambda x: x.iloc[0]),
        last=("mins", lambda x: x.iloc[-1]),
    )
)

It happens during the downsampling, and it's to do with the binning: the observations in the last bin are compressed into a single data point labelled at the left edge of the bin. I think that's an unavoidable consequence of downsampling. So we will instead need to manually add the remaining data points, trim the original signal, or forward fill (`ffill`).
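
The truncation can be reproduced on a small hypothetical series, which makes the mechanism easier to see. This sketch downsamples a 2-second series to 30-second bins with a mean (an assumption), upsamples it back, lists the time points that went missing, and applies the forward-fill option.

```python
import numpy as np
import pandas as pd

# Hypothetical series: one observation every 2 s, from 0 s to 118 s.
index = pd.to_timedelta(np.arange(0, 120, 2), unit="s")
series = pd.Series(np.arange(60, dtype=float), index=index)

# Downsample: each 30 s bin is labelled by its left edge (0, 30, 60, 90 s).
coarse = series.resample(pd.Timedelta(seconds=30)).mean()

# Upsample again: the new index can only span 0 s to 90 s.
roundtrip = coarse.resample(pd.Timedelta(seconds=2)).interpolate()

# Time points present in the original but lost in the round trip.
missing = series.index.difference(roundtrip.index)

# One repair option: reindex onto the original index and forward fill.
repaired = roundtrip.reindex(series.index).ffill()

print("last coarse label:", coarse.index[-1])
print("missing points:", len(missing), "from", missing.min(), "to", missing.max())
print(len(series), len(roundtrip), len(repaired))
```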

`pd.DataFrame.shift()` can be used to move a time series by periods, for example a shift of 15 periods will get the dataframe to 20 minutes. Forming the union of the shifted and original series will produce a time series index that starts at 0 and ends at 20. Reindexing the downsampled series then filling the missing values will result in a downsampled series that has the same number of elements as the original.

In [None]:
value_2S_df

In [None]:
def downsample_upsample(df: pd.DataFrame, new_freq: str) -> pd.DataFrame:
    """
    Downsample then upsample back to the original frequency, shift to match original
    timeframe, then fill so that the two time series can be compared.

    Takes a dataframe with a time series index and a offset alias new frequency. Outputs
    the new sampled dataframe.
    """
    # Time object (in this project typically Second) of input dataframe timeseries index
    ofreq = df.index.freq
    # downsampled-upsampled time series dataframe.
    # downsampled to 'new_freq' then upsampled again to match original dataframe length
    downupdf = df.resample(new_freq).interpolate().resample(ofreq.freqstr).interpolate()
    # difference between input dataframe.index and downupdf.index
    diff = df.index.difference(downupdf.index)
    # extend new dataframe index as union of index and difference, reindex new dataframe
    # and fill NA data from extension with last value
    out_df = downupdf.reindex(downupdf.index.union(diff)).ffill()

    return out_df


downsample_upsample(df.value, "30S")

Now, let's create a calibration curve showing how the Euclidean distance changes with increasing downsampling.

In [None]:
freqs = ["2S", "4S", "8S", "16S", "32S", "64S"]

downsamples = dict()
for freq in freqs:
    ndf = downsample_upsample(df.value, freq)
    downsamples[freq] = ndf

downsample_df = pd.concat(downsamples, axis=1, names=["freq"])
downsample_df.droplevel("wine", axis=1)

In [None]:
from scipy.spatial.distance import euclidean

dist_df = (
    downsample_df.droplevel("wine", axis=1)
    .stack(["freq"])
    .groupby("freq")
    .agg(
        edist=lambda x: euclidean(df.value.values.flatten(), x.values.flatten()),
    )
    .sort_index()
    .pipe(lambda df: df.set_axis(pd.Categorical(df.index, categories=freqs)))
    .sort_index()
    .rename_axis("freq_str")
    .rename_axis("labels", axis=1)
)
dist_df

In [None]:
dist_df["freq_val"] = dist_df.index.str.replace("S", "").astype("int")
dist_df = dist_df.sort_values("freq_val")
dist_df

In [None]:
from sklearn.linear_model import LinearRegression

fig, ax = plt.subplots(1)
x = dist_df.freq_val.values.reshape(-1, 1)
y = dist_df.edist.values.reshape(-1, 1)
linreg = LinearRegression(fit_intercept=False).fit(x, y=y)
dist_df["fit"] = linreg.predict(x)

(
    dist_df.reset_index()[
        #  .set_index('freq_val')
        ["edist", "fit"]
    ].plot(ax=ax)
)
display(dist_df)

(dist_df.reset_index().plot.scatter(x="freq_str", y="edist", ax=ax))
plt.suptitle("downsampling freq. v. euclidean distance")
plt.tight_layout()

With an $R^2$ of:

In [None]:
linreg.score(x, y)

So we can see that the Euclidean distance grows in a vaguely exponential fashion as the level of downsampling increases.

Below is a stacked display of increasing downsampling rates:

In [None]:
freqs

In [None]:
freq_cats = pd.Categorical(
    downsample_df.columns.get_level_values("freq"), categories=freqs, ordered=True
)
freq_cats

In [None]:
"""
One method of setting freq to categorical is to go to long format, modify it there
then going back to tidy. Can see that it worked by reversing the order as defined.
"""

downsample_df = (
    downsample_df.stack(["wine", "freq"])
    .reset_index(name="value")
    .assign(freq=lambda df: pd.Categorical(df.freq, categories=freqs, ordered=True))
    .set_index(["wine", "freq", "mins"])
    .rename_axis(axis=1, mapper="v")
    .unstack(["wine", "freq"])
    .reorder_levels(["wine", "freq", "v"], axis=1)
    .sort_index(level="freq", axis=1, ascending=False)
)
downsample_df

In [None]:
# scale the value axis to be within 0 and 1

from sklearn.preprocessing import MinMaxScaler

downsample_df = downsample_df.apply(
    lambda x: MinMaxScaler().fit_transform(x.values.reshape(-1, 1)).flatten()
)

downsample_df

In [None]:
"""
add a scalar offset to each freq column so that they will be spread out on a line plot.
This is done by first copying downsample_df and then iterate through every freq column 
and add an incrementing scalar value relative to the column name idx.
"""

offset_downsample_df = downsample_df.copy(deep=True).sort_index()

for i, col in enumerate(downsample_df.columns.get_level_values("freq")):
    offset_downsample_df[
        lambda df: (df.columns.get_level_values("wine").values[0], col)
    ] = downsample_df[
        lambda df: (df.columns.get_level_values("wine").values[0], col)
    ] + 0.6 * (i + 1)


offset_downsample_df

In [None]:
"""
display the offset line plots of increasingly downsampled signal. Dont display
x and y axes because the offset causes the y axis to be meaningless, x axis is in
wrong units.
"""

(
    offset_downsample_df.stack(["wine", "freq"]).pipe(
        lambda df: sns.lineplot(df, x="mins", y="value", hue="freq", alpha=0.9).set(
            xticklabels=[], yticklabels=[], ylabel=""
        )
    )
)
plt.suptitle("increasingly downsampled signal from 2S to 64S")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

This provides an intuitive understanding of the relationship between downsampling frequency and Euclidean distance. It means that Euclidean distance is an acceptable (if not truly linear) method of gauging the difference between a signal and its downsampled form. It is a bulk summary though, and does not provide any information about local changes. We need to go further and find methods of measuring local changes such as smoothness, local amplitude, etc.

## Conclusion

This study set out to develop a method of compressing a signal through downsampling, and to gauge its positive and negative effects. During this investigation I found that while the mean frequency may be 2.5 Hz, there are local variations that require an initial resampling to the mean frequency to smooth out. I also noted that after t = 20 mins the signal holds no interest, so subsetting the signal to that point is a good choice; what's more, it provides a relatively symmetrical baseline for later corrections. After investigating peak resolution, I identified an appropriate downsampling frequency of 2 seconds per observation, and found that there was no visual difference to the signal while saving more than 90% of memory. Finally, I showed that Euclidean distance is an appropriate measure of the bulk magnitude of change between a source and a downsampled signal, with a weakly exponential positive trend for decreasing frequency.