# Identifying the Most Similar Signal in a Dataset

To align the chromatographic signals in a dataset for modelling, I need a general method of identifying the signal that is most similar to all the others. The idea is that if all signals in the set exist in a sample space, the most similar signal acts as a landmark onto which the other signals are aligned.


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from wine_analysis_hplc_uv import definitions
from scipy.spatial.distance import euclidean

In [None]:
df = pd.read_parquet(definitions.XPRO_YPRO_DOWNSAMPLED_PARQ_PATH)
# reshape to long format (one row per wine per time point) and overlay every signal
long_df = df.melt(ignore_index=False).reset_index()
sns.lineplot(data=long_df, x="mins", y="value", hue="wine")
plt.tight_layout()

As we can see, the signals are out of alignment, especially '2021 torbreck shiraz the struie 2'. I will most likely need to leave it out of the dataset to get an adequate result.


The original approach was to call `.corr` on the frame, which correlates its columns pairwise, to produce a correlation matrix:


In [None]:
# calculate the Pearson correlation coefficient between every pair of wines (df.corr() works column-wise)

corr_df = df.corr()
corr_df

Then calculate the column-wise mean and sort in descending order:
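
Written out, the selection rule amounts to the following, where $r_{ij}$ is the Pearson correlation between signals $i$ and $j$ and $n$ is the number of signals. The symbols are my own notation, not something defined elsewhere in this notebook:

$$
\bar{r}_j = \frac{1}{n} \sum_{i=1}^{n} r_{ij}, \qquad j^{*} = \arg\max_{j} \bar{r}_j
$$

Because $r_{jj} = 1$ for every signal, the diagonal adds the same $1/n$ to each $\bar{r}_j$ and does not change the ranking.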


In [None]:
# return the mean correlation values for each wine sorted in descending order

mean_corr_series = corr_df.mean().sort_values(ascending=False)
mean_corr_series

Somewhat ironically, 'struie 1' is the most correlated in this dataset, and as expected, 'struie 2' is the least.


In [None]:
# return the wine row with highest correlation value

most_corr_wine = mean_corr_series.loc[mean_corr_series == mean_corr_series.max()]
most_corr_wine
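
To sanity-check the idea away from the real data, here is a minimal sketch on synthetic signals. The helper `pick_landmark`, the frame `toy`, and the Gaussian peaks are all made up for illustration. Each signal is the same peak shifted by a different amount, so the unshifted one should come out as the landmark.

```python
import numpy as np
import pandas as pd


def pick_landmark(wide: pd.DataFrame) -> str:
    """Return the column label with the highest mean Pearson correlation."""
    return wide.corr().mean().idxmax()


# Synthetic stand-in for the chromatograms: one Gaussian peak per signal,
# each shifted by a different amount along the time axis.
mins = np.linspace(0, 10, 201)
shifts = {"a": -1.0, "b": -0.5, "c": 0.0, "d": 0.5, "e": 1.0}
toy = pd.DataFrame(
    {name: np.exp(-((mins - 5 - s) ** 2) / 0.5) for name, s in shifts.items()},
    index=pd.Index(mins, name="mins"),
)

# "c" sits in the middle of the shifts, so it is closest on average to the rest
landmark = pick_landmark(toy)
print(landmark)
```

`idxmax` returns a single label, whereas the boolean mask used above keeps every label that ties for the maximum.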

Now let's wrap all that up into a function and add it to SignalProcessor.


In [None]:
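def most_correlated(df: pd.DataFrame) -> pd.Index:
    """
    Take a df with one column per sample, identify the sample with the highest
    mean correlation to the rest of the set and return its samplecode as an
    Index (more than one label if several samples tie for the maximum).
    """
    samplecode_idx = (
        df.corr()
        .mean()
        .sort_values(ascending=False)
        # keep only the entries equal to the maximum mean correlation
        .loc[lambda s: s == s.max()]
        .index.get_level_values("samplecode")
    )
    return samplecode_idx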


most_corr_sample = df.pipe(most_correlated)
most_corr_sample

And test it with a rudimentary dynamic time warping (DTW) calculation:


In [None]:
# perform dtw on all the samples in the df to align with 'most_corr_sample' then display a path alignment plot for the sampleset

from dtw import dtw

fig, ax = plt.subplots(1)
(
    df.stack(["samplecode", "wine"])
    .reorder_levels(["samplecode", "wine", "mins"])
    .groupby("wine")
    .apply(lambda x: dtw(x=x, y=df.loc[:, most_corr_sample]).index1)
    .explode()
    .reset_index(name="dtw")
    .assign(i=lambda df: df.groupby(["wine"]).cumcount())
    .set_index(["wine", "i"])
    .pipe(lambda df: sns.lineplot(df, x="i", y="dtw", hue="wine"))
)

As we can see from the path plot above, each of the wines has been aligned onto the most correlated wine.
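
For anyone unsure what a warping path is, here is a minimal NumPy sketch of textbook DTW. It is not the `dtw` package's implementation: the function `dtw_path`, the absolute-difference cost, and the unweighted diagonal/vertical/horizontal steps are my own simplifying assumptions, and the package's default step pattern may weight steps differently. The path is a list of `(i, j)` pairs matching index `i` of the query to index `j` of the reference.

```python
import numpy as np


def dtw_path(x, y):
    """Textbook DTW: return (total cost, warping path) for two 1-D sequences."""
    n, m = len(x), len(y)
    # cost[i, j] = cheapest cumulative cost of aligning x[:i] with y[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])

    # backtrack from the end, always moving to the cheapest predecessor
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(candidates, key=lambda s: cost[s])
        path.append((i - 1, j - 1))
    path.reverse()
    return float(cost[n, m]), path


# the same triangular peak, arriving one step later in the query than the reference
query = [0, 0, 1, 2, 1, 0, 0]
reference = [0, 1, 2, 1, 0, 0, 0]
distance, path = dtw_path(query, reference)
print(distance)
print(path)
```

A path lying on the diagonal means no warping was needed. Departures from the diagonal show where the query had to be stretched or compressed to match the reference.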
