---
title: outlier detection
description: outlier detection through mahalanobis clustering
project: eda
status: open
conclusion: ""
cdt: 2024-08-15T00:00:00
---


# Outlier Detection

1. get the data
2. reshape the data
3. PCA
4. viz PCA, first 2.
5. Mahalanobis clustering analysis / outlier detection.

Troubleshoot throughout.

## Get the Data

In [None]:
import duckdb as db
import polars as pl
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Markdown


def get_data():
    """
    Get the data out of the db.
    """

    db_path = "/Users/jonathan/mres_thesis/wine_analysis_hplc_uv/wines.db"

    with db.connect(db_path) as con:
        data_query = """--sql
            CREATE OR REPLACE TEMP TABLE raw_shiraz AS (
            SELECT
                *
            FROM
                pbl.sample_metadata
            WHERE
                detection='raw'
            AND
              varietal='shiraz'
            ORDER BY
                sample_num
            );
            SELECT
                *
            FROM
                pbl.chromatogram_spectra_long as cs
            JOIN
            raw_shiraz
            USING
                (id)
            WHERE
                cs.mins < 30
            ORDER BY
                sample_num, idx
                ;
            """

        get_sm_query = """--sql
        select * from raw_shiraz;
        """

        data = con.sql(data_query).pl()
        sm = con.sql(get_sm_query).pl()

        return data, sm


long_data, sm = get_data()
display(Markdown("## Sample Metadata"), sm)
display(Markdown("## Long Data"), long_data.head(), long_data.tail())


In [None]:
# ensure that all samples are retrieved after the join

assert sm.select("sample_num").n_unique() == long_data.select("sample_num").n_unique()


In [None]:
# check that the id column "sample_num" is a unique identifier

sample_num_grps = long_data.group_by("sample_num")
with pl.Config() as cfg:
    cfg.set_tbl_rows(11)
    display(sample_num_grps.len())

    # simple test: a sample more than 50% longer than the mean length suggests duplicated rows

    outlier_lengths = sample_num_grps.len().filter(
        pl.col("len") > pl.col("len").mean().mul(1.5)
    )
    assert (
        outlier_lengths.is_empty()
    ), f"outlier sample signal length detected: {outlier_lengths}"


In [None]:
# inspect data.
display(long_data.describe())


In [None]:
# display the data

import seaborn as sns

rp = sns.relplot(
    data=long_data.filter(pl.col("wavelength").eq(256)),
    x="mins",
    y="absorbance",
    col="sample_num",
    col_wrap=6,
    kind="line",
    height=3,
)
title = "HPLC-DAD Shiraz @ 256nm"
rp.figure.subplots_adjust(top=0.8)
plt.suptitle(title)


As we can see, sample 75 is obviously an outlier and easily considered a failed run. We will leave it in to see whether the outlier detection behaves as expected.

## Mahalanobis Covariance Matrix

See https://scikit-learn.org/stable/auto_examples/covariance/plot_mahalanobis_distances.html for an example of how to do it in scikit-learn.

From what I've gathered skimming the web, the general approach is to unfold the three-way data so that each sample becomes a single vector whose components are the absorbances at each wavelength/time pair, as per [@brereton_appliedchemometricsscientists_2007, p. 215, sec. "6.8.1 Unfolding"]. This has the downside of both massively increasing the redundancy of the data and severing the connection between the time and wavelength information.
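A minimal sketch of the unfolding step, assuming the data is already held as a dense 3-way numpy array of shape (samples, times, wavelengths). The array `cube` and its dimensions are made up for illustration and are not my actual data.

```python
import numpy as np

# Hypothetical 3-way array: 5 samples x 100 time points x 8 wavelengths.
rng = np.random.default_rng(0)
n_samples, n_times, n_wavelengths = 5, 100, 8
cube = rng.random((n_samples, n_times, n_wavelengths))

# Unfold: each row is one sample, each column one (time, wavelength) pair.
unfolded = cube.reshape(n_samples, n_times * n_wavelengths)

# With row-major reshaping, column j maps back to
# time index j // n_wavelengths and wavelength index j % n_wavelengths.
t, w = 3, 5
assert unfolded[2, t * n_wavelengths + w] == cube[2, t, w]

# Folding back recovers the original array exactly.
refolded = unfolded.reshape(n_samples, n_times, n_wavelengths)
```

The reshape itself loses nothing, but any method applied to `unfolded` treats the 800 columns as unrelated variables, which is the disconnect between time and wavelength noted above.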

That being said, I can't find any good examples of tensor covariance calculations.
numpy does offer covariance calculation (`np.cov`), though it is documented for 1-D or 2-D input, so three-way data would still need to be unfolded first.
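As a rough sketch of how the PCA and Mahalanobis steps might fit together, the following uses plain numpy on synthetic data. Everything here is an assumption for illustration: the Gaussian "chromatogram", the simulated failed run in the last row, the choice of two components, and the 97.5% cutoff. It uses the ordinary sample covariance, which a large outlier can inflate; the scikit-learn example linked above uses a robust estimator (`MinCovDet`) to address that.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for unfolded data: 20 samples x 200 features.
n_samples, n_features = 20, 200
profile = np.exp(-0.5 * ((np.arange(n_features) - 100) / 10.0) ** 2)
amplitudes = 1.0 + 0.05 * rng.standard_normal(n_samples)
X = amplitudes[:, None] * profile[None, :]
X += 0.01 * rng.standard_normal(X.shape)
X[-1] = 0.01 * rng.standard_normal(n_features)  # simulated failed run (noise only)

# PCA via SVD of the mean-centred data; keep the first two score columns.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * S[:2]

# Squared Mahalanobis distance of each sample from the centre of the scores.
cov = np.cov(scores, rowvar=False)
cov_inv = np.linalg.inv(cov)
centred = scores - scores.mean(axis=0)
d2 = np.einsum("ij,jk,ik->i", centred, cov_inv, centred)

# For 2 dimensions the chi-square quantile has the closed form -2 * ln(1 - p).
threshold = -2.0 * np.log(1.0 - 0.975)
flagged = d2 > threshold
```

Working in the low-dimensional score space sidesteps the problem that the covariance of the full unfolded matrix is singular when there are far more features than samples.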

In [None]:
wide_data = (
    long_data.select("sample_num", "mins", "absorbance", "wavelength")
    .sort(["sample_num", "mins", "wavelength"])
    .pivot(
        on="wavelength",
        index=["sample_num", "mins"],
        values="absorbance",
        maintain_order=True,
    )
    .drop("mins")
)
wide_data


## PARAFAC2

According to @bro_parafac2partii_1999, PARAFAC2 can decompose multiway data into three matrices containing each compound's elution profile, spectral profile and concentration. This should be the area of focus. The research is over 25 years old, so it should be mature enough to get results. Validation can be done simply by summing the elution profiles at a given wavelength (or all wavelengths) and observing the difference from the original.
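The validation idea can be sketched without fitting anything. The block below assumes a decomposition has already produced, for one sample, an elution profile matrix, a spectral profile matrix and a concentration vector. Here those factors are simply invented (`elution`, `spectra`, `conc` are hypothetical names), and the "measured" signal is the model plus noise, so this only illustrates the reconstruct-and-compare step, not PARAFAC2 itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_times, n_wavelengths, n_compounds = 120, 30, 3
times = np.arange(n_times)

# Hypothetical factors for ONE sample, standing in for a fitted decomposition.
centres = np.array([30.0, 60.0, 90.0])
elution = np.exp(-0.5 * ((times[:, None] - centres[None, :]) / 4.0) ** 2)  # (times, compounds)
spectra = rng.random((n_wavelengths, n_compounds))  # (wavelengths, compounds)
conc = np.array([1.0, 0.5, 2.0])  # one concentration per compound

# Stand-in for the measured chromatogram: bilinear model plus noise.
signal = (elution * conc) @ spectra.T
measured = signal + 0.01 * rng.standard_normal(signal.shape)

# Validation over all wavelengths: rebuild the signal and compare.
reconstructed = (elution * conc) @ spectra.T
residual = measured - reconstructed
rel_error = np.linalg.norm(residual) / np.linalg.norm(measured)

# Validation at a single wavelength: sum the scaled elution profiles.
w = 10
trace_w = (elution * conc * spectra[w]).sum(axis=1)
assert np.allclose(trace_w, reconstructed[:, w])
```

A small `rel_error` would suggest the chosen number of components explains the signal; a structured (non-noise-like) residual would suggest a missing component.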

The pipeline should first be built on a toy dataset (one sample), then expanded to my data. The first difficulty will be finding a toy dataset; actually, I can probably just follow the tensorly tutorial. The tutorial will be done [here]().

## Folding

As usual, working out the form the data is expected to take is difficult, as different schools use different terms. Tensorly looks promising, as it provides many of the algorithms I want to use.

From a quick look on the web, it looks like tensorflow has the most convenient API for dataframe -> tensor conversion.
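For comparison, the same long-to-tensor fold can be sketched in plain numpy, assuming the long-format columns have been pulled out as arrays (e.g. via `.to_numpy()` on each column). The tiny arrays below are invented for illustration; with real data, time stamps that differ between runs would leave NaNs in the result, so an integer time index may be a safer key than `mins`.

```python
import numpy as np

# Hypothetical long-format columns: one row per (sample, time, wavelength) reading.
sample_num = np.array([1, 1, 1, 1, 2, 2, 2, 2])
mins = np.array([0.0, 0.0, 0.5, 0.5, 0.0, 0.0, 0.5, 0.5])
wavelength = np.array([256, 280, 256, 280, 256, 280, 256, 280])
absorbance = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Map each label to its position along the corresponding tensor axis.
samples, s_idx = np.unique(sample_num, return_inverse=True)
times, t_idx = np.unique(mins, return_inverse=True)
waves, w_idx = np.unique(wavelength, return_inverse=True)

# Fill a (samples, times, wavelengths) tensor; missing readings stay NaN.
tensor = np.full((samples.size, times.size, waves.size), np.nan)
tensor[s_idx, t_idx, w_idx] = absorbance

print(tensor.shape)
```

Keeping `samples`, `times` and `waves` alongside the tensor preserves the axis labels that a bare array would otherwise lose.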

## Cluster Analysis

According to @miller_statisticschemometricsanalytical_2018 [p. 243, sec. "cluster analysis"], we need to perform hierarchical cluster analysis or k-means. scikit-learn provides all the required algorithms.

### k-Means

We'll do k-means first as it's conceptually more straightforward.
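A minimal sketch of k-means with scikit-learn, run on two synthetic, well-separated groups of 2-D points standing in for PCA scores. The data, the choice of two clusters and the random seeds are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two hypothetical groups of 2-D points (e.g. the first two PCA scores).
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(15, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(15, 2))
scores = np.vstack([group_a, group_b])

# Fit k-means with k=2; n_init restarts guard against a poor initialisation.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scores)
labels = km.labels_            # cluster assignment per sample
centres = km.cluster_centers_  # one centroid per cluster
```

The number of clusters has to be chosen up front, which is the main practical difference from the hierarchical approach.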
