# Characterizing and Normalizing Dataset Time Axis

This notebook covers our efforts to normalize the time axes of the CUPRAC dataset in order to move toward a universal time index. This is needed as multivariate statistical models such as XGBoost require that the same feature (peak) is in the same column (time) for each sample. We will do this by treating the sample signals as time series.

In [None]:
# setup

from wine_analysis_hplc_uv import definitions
from wine_analysis_hplc_uv.etl.build_library.db_methods import get_data, pivot_wine_data
import pandas as pd
import duckdb as db
import matplotlib.pyplot as plt
import seaborn as sns
import polars as pl

pd.options.mode.copy_on_write = True
pd.options.display.width = None
pd.options.display.max_colwidth = 50
pd.options.display.max_rows = 20
pd.options.display.max_columns = 15
pd.options.display.colheader_justify = "left"

In [None]:
# get data


def fetch_dataset(con):
    query = """--sql
    SELECT
        *
    FROM
        chromatogram_spectra_long cs 
    LEFT JOIN
        sample_metadata sm
    USING
        (id)
    WHERE
        detection='cuprac'
    AND 
        wavelength=450
    AND 
        varietal='shiraz'
    ORDER BY
        mins DESC
    """
    # get_data.get_wine_data(con, detection=('cuprac',), wavelength=(450,), varietal=('shiraz',))
    # df = pivot_wine_data.pivot_wine_data(con)
    return con.sql(query).pl()


con = db.connect(definitions.DB_PATH)
df = fetch_dataset(con)
df.head()

## Sample 154

The sample used to explore the fundamentals of the time series is sample 154 - 2020 leeuwin estate shiraz art series, an Australian Shiraz from Margaret River, Western Australia. A Shiraz was selected because, at least anecdotally, Shiraz wines have the highest peak intensity and signal complexity, meaning that patterns in the data should be easy to detect.

In [None]:
# get 154


def fetch_154(con):
    query = """--sql
            SELECT
                wine, mins, absorbance
            FROM
                chromatogram_spectra_long cs 
            LEFT JOIN
                sample_metadata sm
            USING
                (id)
            WHERE
                samplecode='154'
            AND 
                wavelength=450
            ORDER BY
                mins ASC
            """
    return con.sql(query).pl()


df_154 = fetch_154(con)
viz = df_154.plot(x="mins", y="absorbance", title="154")
display(viz)

## Measuring Sampling Frequency 

Resampling requires a way of gauging the observation frequency of each dataset, and how regular that frequency is. Sampling frequency is defined here as the number of observations per second, $\frac{n \space \text{obs}}{m \space \text{seconds}}$. I expect the sampling frequency to equal 2.5 Hz.

In [None]:
def calculate_sampling_frequency(df: pl.DataFrame):
    """
    Calculate the average sampling frequency for an input chromatogram
    """
    # Frequency in hertz is 1 / (time difference in seconds).
    # lag(x) returns the value of x from the previous row.
    # lead(x) returns the value of x from the next row.
    # To find the sampling frequency, take the difference between each time point and the one before it.
    mean_hz = db.sql(
        """--sql
        -- First get the time dimension, expressed in seconds
        CREATE OR REPLACE TEMP TABLE hz_tbl
        AS (
            SELECT
                mins*60.0 AS seconds
            FROM
                df
            ORDER BY seconds ASC
        );
        -- lag the seconds column by one row so that each row contains time[i] and time[i-1]
        CREATE OR REPLACE TEMP TABLE hz_tbl
        AS 
        (
        SELECT
            seconds,
            lag(seconds) OVER (ORDER BY seconds) as lag_seconds
        FROM
            hz_tbl
        );
        -- calculate the difference between consecutive time points (stored as forward_diff)
        CREATE OR REPLACE TEMP TABLE hz_tbl
        AS
        (
        SELECT
            seconds,
            lag_seconds,
            (seconds - lag_seconds) as forward_diff
        FROM
            hz_tbl
        );
        -- calculate the frequency of the difference, expressed in hertz
        CREATE OR REPLACE TEMP TABLE hz_tbl
        AS (
            SELECT
                seconds,
                lag_seconds,
                forward_diff,
                (1 / forward_diff) as hz,
            FROM
                hz_tbl
        );
        -- express the average seconds difference and average hertz.
        SELECT mean(forward_diff) as mean_diff, mean(hz) as mean_hertz FROM hz_tbl;
        """
    ).pl()

    display(mean_hz)


calculate_sampling_frequency(df=df_154)

So we can see that the sampling frequency is one observation per 400 milliseconds, or 2.5 Hz, and that, at least for this sample, the frequency is consistent. Thus, no extraneous resampling is necessary beyond compression.
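As a cross-check on the SQL above, the same calculation can be sketched in NumPy. This is a minimal sketch on a synthetic time axis, not the database query: the array `mins`, the number of observations and the 0.4 second period are assumptions chosen to mimic a 2.5 Hz signal.

```python
import numpy as np

# Synthetic stand-in for the 'mins' column: 1000 observations, one every 0.4 s.
# (Assumed values, not read from the database.)
n_obs = 1000
mins = np.arange(n_obs) * 0.4 / 60.0

seconds = mins * 60.0
# Difference between each observation and the one before it.
diffs = np.diff(seconds)

mean_diff = diffs.mean()
mean_hz = (1.0 / diffs).mean()
# Largest deviation from the mean difference; near zero means regular sampling.
max_deviation = np.abs(diffs - mean_diff).max()

print(f"mean diff: {mean_diff:.3f} s, mean frequency: {mean_hz:.3f} Hz")
print(f"max deviation from mean diff: {max_deviation:.2e} s")
```

Reporting the maximum deviation alongside the mean is a design choice: a mean of 0.4 seconds alone would not reveal an irregular axis whose differences merely average out.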

## Determining Maximum Time Precision

An unfortunate side-effect of floating-point data types [@_d] is that, for a given observation of an experimental variable and depending on the numerical data type, more digits are stored in memory than the actual precision of the instrument supports. As one of my goals is to align all of my time series to one universal time axis, decimal digits beyond an identified level of precision can be treated as noise and discarded. Thus I need a method of identifying an appropriate level of precision. Agilent is not forthcoming with the rating of their DAD, so an internal analysis is required. In [determining_time_precision](./determining_time_precision.ipynb) I observed what effect changing the time scale had on the granularity of the data, and coarsened the time scale until I identified that the millisecond scale was the coarsest I could use without producing duplicates. A roundabout way of approaching the problem, but an effective one.

There is a question of how precise the time points of my observations are. For example, sample 154:
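To illustrate the floating-point effect described above, here is a minimal sketch in plain Python. The value of 0.415 seconds is a hypothetical observation time chosen for illustration, not one taken from the dataset.

```python
# Hypothetical observation time with millisecond precision (an assumption).
seconds = 0.415
# The database stores time in minutes, so divide by 60.
mins = seconds / 60

# Count the digits after the decimal point in the float's printed form.
n_decimal_digits = len(repr(mins).split(".")[1])
print(mins, n_decimal_digits)

# Rounding at millisecond precision recovers the value that was measured.
recovered_seconds = round(mins * 60, 3)
print(recovered_seconds)
```

The division produces a value that cannot be represented exactly in binary, so the printed form carries many more digits than the three that were measured. Rounding at the identified precision discards them.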

In [None]:
df_154.head()

The second time point of this sample is:

In [None]:
def observe_num_sigfigs(df: pl.DataFrame) -> None:
    obs = df.item(1, "mins")
    display(obs)
    display(f"n sigfigs: {len(str(obs).split('.')[1])}")


observe_num_sigfigs(df=df_154)

Unfortunately, even the 'raw' data in my database sometimes has 18 digits after the decimal point, which cannot reflect the instrument's precision and must be a symptom of float datatypes in Python. To settle this once and for all, I could either decide on the coarsest time scale that retains unique values in the time column, or check a .UV file.

In [None]:
def observe_sig_figs_in_raw_file():
    import rainbow as rb
    import os

    filepath = os.path.join(definitions.LIB_DIR, "cuprac", "131.D")
    obs = rb.read(filepath).get_file("DAD1.UV").xlabels[0]
    display(obs)
    display(
        f"n sigfigs: {len(str(obs).split('.')[1])}",
    )


observe_sig_figs_in_raw_file()

Well, I have been vindicated, as rainbow also returns 18 digits. Thus the second approach is required: identify an appropriate level of granularity by testing several time scales and seeing when duplicate values appear. Observe the millisecond ('L') and second ('S') scales (refer to [offset alias](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases) for the symbology):

In [None]:
def observe_minimum_precision_without_duplicates(pl_df: pl.DataFrame) -> None:
    df = (
        pl_df.to_pandas()
        .set_index("mins")
        .pipe(lambda x: x.set_index(pd.to_timedelta(x.index, unit="minutes")))
    )
    display(
        "num duplicates at millisecond scale:",
        len(df.index[df.index.round(freq="L").duplicated()]),
    )
    display(
        "num duplicates at second scale:",
        len(df.index[df.index.round(freq="S").duplicated()]),
    )


observe_minimum_precision_without_duplicates(pl_df=df_154)

It appears that no duplicates are detected at the millisecond scale ('L'). At the second scale ('S'), however, over half the observation points are now duplicates. Thus we will continue at the millisecond scale.
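The duplicate test can be reproduced without the database. This is a minimal sketch on a synthetic 2.5 Hz axis: the start offset of 0.015 seconds and the 500 observations are assumptions, and `count_duplicates` is a hypothetical helper. The scales are spelled `ms` and `s` here, which I take to be equivalent to 'L' and 'S'.

```python
import numpy as np
import pandas as pd

# Synthetic time axis in minutes: one observation every 0.4 s, starting at an
# assumed offset of 0.015 s. Stands in for the 'mins' column of one sample.
mins = (0.015 + np.arange(500) * 0.4) / 60.0
index = pd.to_timedelta(mins, unit="min")


def count_duplicates(idx: pd.TimedeltaIndex, freq: str) -> int:
    """Count time points that collide with an earlier one after rounding to freq."""
    return int(idx.round(freq).duplicated().sum())


n_dup_ms = count_duplicates(index, "ms")
n_dup_s = count_duplicates(index, "s")

print("duplicates at millisecond scale:", n_dup_ms)
print("duplicates at second scale:", n_dup_s)
```

At 2.5 Hz there are five observations in every two seconds, so rounding to whole seconds must map several observations onto the same value, while a 400 millisecond spacing leaves every millisecond-rounded value distinct.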

To reiterate, the time axis should be converted to a `TimedeltaIndex` and then rounded to "L", or millisecond scale, in the following manner:

First set the float `mins` column as the index, then convert it to timedelta and round:

In [None]:
def time_to_millisecond_timedelta(df: pd.DataFrame) -> pd.DataFrame:
    """
    convert the 'mins' index to a `TimedeltaIndex` and round to millisecond precision
    """
    df = df.pipe(lambda x: x.set_axis(pd.to_timedelta(x.index, unit="minutes"))).pipe(
        lambda x: x.set_axis(x.index.round("L"))
    )

    display(df.index.dtype)
    return df


time_to_millisecond_timedelta(df=df_154.to_pandas().set_index("mins"))

This conversion and rounding has been added to [SignalProcessor](src/wine_analysis_hplc_uv/signal_processing/mindex_signal_processing.py) as `.adjust_timescale()`.

## Identifying and Removing Scalar Offset

While determining the time precision, I noticed that once adjusted to a millisecond scale, it was clear that there was a scalar offset in sample 154 of 15 milliseconds at element zero. This is odd because we would expect the first observation to start at time zero.

The first question to ask is whether the offset is constant; if it is not, resampling may be required.

In [None]:
def observe_offset(df: pl.DataFrame) -> None:
    """
    Observe the average observation time offset.

    TODO: observe whether there is a consistent difference. This would be exemplified by
      deviation from a monotonically increasing time series. The expected series would
      start at zero and increase by the frequency every observation. In this case, 0.4 seconds.

      To calculate this, one could first insert the column, then subtract the actual from the
      ideal.
    """
    db.sql(
        """--sql
        -- define a custom rounding macro to handle floating point error.
        CREATE OR REPLACE TEMP MACRO round_(a) AS round(a, 10);
        -- create an incrementing series
        CREATE OR REPLACE SEQUENCE serial START 1;
        -- workspace table containing the time in seconds. Round to avoid floating point errors
        CREATE OR REPLACE TEMP TABLE compare_seconds
        AS (
          SELECT
            round_(mins * 60) as seconds_actual,
            round_((row_number() OVER ()-1)*0.4)::DOUBLE AS seconds_ideal,
          FROM
            df
        );
        CREATE OR REPLACE TEMP TABLE
          compare_seconds
          AS (
          SELECT
            seconds_actual,
            seconds_ideal,
            round_(seconds_actual - seconds_ideal) as diff
          FROM
            compare_seconds
        );

        CREATE OR REPLACE TEMP TABLE
          average_time_offset
        AS (
          SELECT
            mode(diff)
          FROM
            compare_seconds
        );
        SELECT * FROM average_time_offset;
        """
    ).show()


observe_offset(df=df_154)

As we can see, there is a constant difference of 0.15 seconds from the ideal.
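The comparison against an ideal grid can also be sketched in NumPy. The synthetic axis below is an assumption: a regular 0.4 second grid shifted by 0.15 seconds, mirroring the result reported above rather than reading sample 154 from the database.

```python
import numpy as np

period = 0.4   # seconds between observations (2.5 Hz)
offset = 0.15  # assumed constant start offset, in seconds

# Synthetic 'actual' observation times and the ideal grid starting at zero.
seconds_actual = offset + np.arange(200) * period
seconds_ideal = np.arange(len(seconds_actual)) * period

# Round to 10 decimal places to suppress floating-point error, as the SQL macro does.
diff = np.round(seconds_actual - seconds_ideal, 10)

# The mode is the most frequent difference; a single unique value means a constant offset.
values, counts = np.unique(diff, return_counts=True)
mode_diff = values[counts.argmax()]
is_constant = len(values) == 1

print("mode of difference:", mode_diff, "constant:", is_constant)
```

Checking the number of unique differences as well as the mode addresses the TODO in the docstring: the mode alone would hide a minority of observations that deviate from the grid.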

Is it the same for every sample?

In [None]:
def observe_time_difference():
    with open(
        "/Users/jonathan/mres_thesis/wine_analysis_hplc_uv/src/wine_analysis_hplc_uv/notebooks/lib_eda/sample_standardisation/get_time_frequency_all_samples.sql",
        "r",
    ) as f:
        query = f.read()
        con.sql(query).show()


observe_time_difference()

In [None]:
# possibly removing this
def observe_frequency_all_samples(con: db.DuckDBPyConnection):
    """
    Observe the time frequency across all samples in the database
    """

    # TODO: get the time column for all samples. This requires selecting by sample and 1 wavelength

    from pathlib import Path

    with open(Path(Path.cwd() / "get_time_frequency_all_samples.sql"), "r") as f:
        query = f.read()

        con.sql(query)
        con.sql("SELECT * FROM wine_seconds").show(max_width=200)
        con.sql("SELECT * FROM average_mode_over_samples;").show()
        display(
            con.sql("SELECT * FROM wine_hertz_agg")
            .pl()
            .plot.hist(y="mode_diff_hertz", title="distribution of ")
        )
        con.sql("FROM mode_counts").show()


observe_frequency_all_samples(con=con)
# df = fetch_all_samples(con)
# adf.head()
con.sql("SELECT COUNT(DISTINCT sample_num) FROM wine_seconds").show()

As we can see, there is a consistent time difference of 400 milliseconds between consecutive observations, which matches the expected frequency of 2.5 Hz.

Now what about the offset of the first observation: what is the trend?

In [None]:
(
    adf.stack(["samplecode", "wine"])
    .groupby(["samplecode", "wine"])["mins"]
    .first()
    .plot(style=".", title="first time value per sample", ylabel="time (mins)")
)
plt.tick_params(axis="x", bottom=False, labelbottom=False)
plt.tight_layout()

We can see without further analysis that the first time values are randomly spread across samples, so no single correction applies to all of them. Instead, we can be confident in subtracting each sample's own first value from its time column, aligning observation zero with time zero:

In [None]:
adf = (
    adf.stack(["samplecode", "wine"])
    .assign(
        mins=lambda df: df.groupby(["samplecode", "wine"])["mins"].transform(
            lambda x: x - x.iloc[0]
        )
    )  # adjust time axis by initial value so they all start at 0
    .unstack(["samplecode", "wine"])
    .reorder_levels(["samplecode", "wine", "vars"], axis=1)
    .sort_index(level=0, axis=1, sort_remaining=True)
    .pipe(lambda df: df if display(df.head()) else df)
)

In [None]:
(
    adf.stack(["samplecode", "wine"])
    .groupby(["samplecode", "wine"])["mins"]
    .first()
    .plot(
        style=".",
        title="first time value per sample, first value subtracted",
        ylabel="time (mins)",
    )
)
plt.tick_params(axis="x", bottom=False, labelbottom=False)
plt.tight_layout()

Ok, that's convincing enough for me. As of 2023-08-23 22:47:37 I am going to assume the full dataset follows the same pattern. In summary: all data time axes have a varying offset equal to the value of the first measurement. Subtracting the first value from the axis will align the data so that the first measurement is zero. The caveat is that the observation frequency must be the same for all samples.

A method for correcting the offset has been created [here](src/wine_analysis_hplc_uv/signal_processing/mindex_signal_processing.py) under `.correct_offset`.
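The first-value subtraction can be sketched on a small long-form frame. This is an illustration only, not the `SignalProcessor` method: the frame `long_df`, its sample codes and offsets, and the helper `subtract_first_value` are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical long-form data: two samples on the same 0.4 s grid (in minutes),
# each starting at a different assumed offset.
grid = np.arange(5) * 0.4 / 60.0
long_df = pd.DataFrame(
    {
        "samplecode": ["a"] * 5 + ["b"] * 5,
        "mins": np.concatenate([grid + 0.0025, grid + 0.0070]),
    }
)


def subtract_first_value(
    df: pd.DataFrame, group_col: str = "samplecode", time_col: str = "mins"
) -> pd.DataFrame:
    """Subtract each group's first time value so that every sample starts at zero."""
    first = df.groupby(group_col)[time_col].transform("first")
    return df.assign(**{time_col: df[time_col] - first})


corrected = subtract_first_value(long_df)
first_values = corrected.groupby("samplecode")["mins"].first()
print(first_values)
```

Using `transform("first")` broadcasts each sample's first value back onto its own rows, so the subtraction is done per sample without reshaping the frame.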

## Creation of Universal Time Axis

Now that the offset has been corrected and the time values rounded to milliseconds, the time columns look regular from sample to sample, and I suspect that we can now use one universal time column as an index. To determine whether this is true, we should compare all time columns and look for outliers. I will investigate this by treating each time element as a column and calculating, for each row in that column, its deviation from the column mean (the numerator of a z-score).


In [None]:
"""
Select 'mins' column, convert to float (seconds), transpose, forward and backfill
missing values to prepare for outlier detection
"""

adfT = (
    adf.stack(["samplecode", "wine"])["mins"]
    .unstack(["samplecode", "wine"])
    .apply(lambda x: x.dt.total_seconds())
    .T.ffill()
    .bfill()
)

In [None]:
"""
Calculate the difference from the column mean, rounded to 3 decimal places (because floats),
find the elements that are not equal to zero, then count them
"""

import numpy as np

mask = (adfT.apply(lambda x: np.round(x - x.mean(), 3)) != 0).sum().sum()
mask

Subtracting the mean of each column from its elements should act as an outlier detector, as we expect the values in a column either to be all equal or not. Using the condition `!= 0` results in a boolean frame, and calling `.sum().sum()` counts the total number of elements that are not equal to zero. As we can see, that number is zero, thus all the time columns are now equal, and we can use a universal time column rather than a per-sample column.
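The whole check can be sketched end to end in NumPy. The four samples, their start offsets and the 50 observations below are assumptions made for illustration. The layout follows the transposed frame, with samples as rows and time elements as columns.

```python
import numpy as np

period_s = 0.4  # seconds between observations (2.5 Hz)
n_obs = 50
# Assumed start offsets in seconds, one per hypothetical sample.
offsets_s = np.array([0.015, 0.15, 0.203, 0.391])

# Rows are samples, columns are time elements.
times = offsets_s[:, None] + np.arange(n_obs) * period_s

# Subtract each sample's first value, then round to millisecond precision.
aligned = np.round(times - times[:, [0]], 3)

# Deviation of each element from the mean of its time element across samples.
deviation = np.round(aligned - aligned.mean(axis=0), 3)
n_outliers = int((deviation != 0).sum())

print("elements deviating from the column mean:", n_outliers)
```

A count of zero means every sample shares the same rounded time values, so any one row of `aligned` could serve as the universal time axis in this synthetic case.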

## Conclusion

Observing both sample 154 and the overall dataset has enabled me to investigate treating the signals as time series, normalization methods, and moving towards a universal time axis. Specifically, we found that `pd.Timedelta` was an appropriate time series datatype. First, we observed that sample 154 had a sampling frequency of 2.5 Hz, or one observation per 400 milliseconds, and later found this to be consistent across the whole CUPRAC dataset. Initially the time axes of each sample looked irreconcilable, but after rounding to a millisecond scale and subtracting a scalar offset, we showed that there is in fact a universal time scale that can be used for all samples. The only caveat is that the samples need to be recorded at the same frequency. That being said, simple resampling would rectify those differences. Finally, methods to move to the universal time axis (index) have been created in `mindex_signal_processing`.