# Investigating Samples Labeled as Outliers

2023-11-13 10:18:05

After replacing the Raw 3D dataset parquet file with a SQL query, a number of samples are excluded that formerly were not. This has resulted in a much worse performing model for the same parameters. The first step is to check whether the samples should have been excluded, as I don't have a clear study of them.

The excluded samples are 115, 98, a0301, 99 and 72:

In [None]:
# setup
%reload_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
from wine_analysis_hplc_uv import definitions
from wine_analysis_hplc_uv.notebooks.xgboost_modeling import dataextract
import seaborn.objects as so

In [None]:
de = dataextract.DataExtractor(definitions.DB_PATH)
de.create_subset_table(wavelengths=256, samplecode=("115", "98", "a0301", "99", "72"))
data = de.get_tbl_as_df()
de.con_.close()
data.head()

In [None]:
so.Plot(data, x="mins", y="nm_256").add(so.Line()).facet(
    col="code_wine", wrap=2
).layout(size=(15, 20)).share(x=False, y=False)

So they look OK, to be honest. 98 looks fine, 72 is definitely two separate runs, a0301 looks fine, 115 looks fine, and 99 has a messed-up gradient, but nothing that can't be rectified with baseline subtraction. What else is going on?

Let's look at 72 first. To differentiate the runs we're going to need the acquisition dates.

## Sample 72

In [None]:
import duckdb as db

con = db.connect(definitions.DB_PATH)
con.sql(
    """
        DESCRIBE c_chemstation_metadata
        """
)

In [None]:
sample_72 = data.query("samplecode=='72'")

sample_72.acq_date.value_counts()

As above, there are two recordings of that sample. The signals are below:

In [None]:
so.Plot(sample_72, x="mins", y="nm_256", color="acq_date").add(so.Line())

Makes for an easy fix. The sample recorded at 2023-02-22 16:10:39 is a half run, and needs to be removed. Are the ids different?

In [None]:
sample_72.id.value_counts()

Thankfully, yes. I guess we now get to add an `exclude_id` condition to the SQL query in `dataextract`.

In [None]:
data = data.query("id != '6d8a370a-9f40-460d-acba-99fd4c287ad8'")
so.Plot(data, x="mins", y="nm_256").add(so.Line()).facet(
    col="code_wine", wrap=2
).layout(size=(15, 20)).share(x=False, y=False)
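
The `exclude_id` condition could be sketched roughly as below. This is only an illustration: `build_exclusion_query` and the table name `subset_tbl` are hypothetical and not part of `dataextract`. The ids are passed as query parameters rather than pasted into the SQL string.

```python
def build_exclusion_query(table, exclude_ids):
    """Return (sql, params) selecting every row of `table` except the given ids.

    Hypothetical helper: `table` is assumed to be a trusted, internal table name
    with an `id` column, as in the dataframe above.
    """
    exclude_ids = list(exclude_ids)
    if not exclude_ids:
        # nothing to exclude, so no WHERE clause is needed
        return f"SELECT * FROM {table}", []
    # one "?" placeholder per id, filled in by the database driver
    placeholders = ", ".join("?" for _ in exclude_ids)
    sql = f"SELECT * FROM {table} WHERE id NOT IN ({placeholders})"
    return sql, exclude_ids


# the half run of sample 72 identified above
sql, params = build_exclusion_query(
    "subset_tbl", ["6d8a370a-9f40-460d-acba-99fd4c287ad8"]
)
print(sql)
```

With a DuckDB connection this could then be run as `con.execute(sql, params).df()`.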

## Sample 99 and a0301

Next is to observe the dataset sizes. Anything with significantly fewer than 7800 observations indicates a problem.

In [None]:
data.groupby(["id", "code_wine"]).size()

How long do they run for with that number of observations?

In [None]:
data.query("samplecode in ('99','a0301')").groupby("samplecode").mins.max()

So 99 and a0301 both have 6600 observations and run to 44 mins. That's fine: once we subset the dataset to 30 mins they'll line up with the rest. Remove them from the exclusion list.
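
A minimal sketch of why the 30 min subset makes the shorter runs line up, using synthetic time axes rather than the real data. It assumes the 0.4 s sampling interval found for the other samples further down, under which 6600 observations is 44 mins and 7800 is 52 mins. `truncate_runs` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def truncate_runs(df, max_mins=30.0):
    """Keep only the observations at or before `max_mins` in every run."""
    return df.loc[df["mins"] <= max_mins]


# synthetic stand-ins: a full-length run and a shorter one, both at 0.4 s steps
step_mins = 0.4 / 60
long_run = pd.DataFrame({"id": "long", "mins": np.arange(7800) * step_mins})
short_run = pd.DataFrame({"id": "short", "mins": np.arange(6600) * step_mins})
runs = pd.concat([long_run, short_run], ignore_index=True)

# after truncation both runs have the same number of observations
counts = truncate_runs(runs).groupby("id").size()
print(counts)
```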

## Sample 98

Next is 98 with double the observations of everything else. Another repetition?

As before, if there are two acq_dates, there are two recordings of the same sample:

In [None]:
sample_98 = data.query("samplecode=='98'")
sample_98

In [None]:
sample_98.acq_date.value_counts()

It's not that, so then what? What is the frequency of observation?

In [None]:
data = (
    data.set_index(
        [
            "acq_date",
            "detection",
            "color",
            "varietal",
            "id",
            "code_wine",
            "samplecode",
            "wine",
            "mins",
        ]
    )
    .sort_index()
    .reset_index()
)
data

What is the average interval between observations across the dataset, excluding sample 98?

In [None]:
data.shape

In [None]:
data

To get the average sampling interval in seconds, we just need to take the difference in time between consecutive observations and average it.

In [None]:
# clean up the time axes, as usual.

data = data.assign(
    mins=lambda x: x.groupby("id")["mins"].transform(lambda x: x - x.iloc[0])
)
data.head()

In [None]:
# convert mins to seconds
data = data.assign(seconds=lambda df: df.mins * 60)

# display the average sampling interval (seconds between consecutive observations)
display(data.groupby("code_wine")["seconds"].aggregate(lambda s: s.diff().mean()))

# move nm_256 to the last column
data = data.pipe(lambda df: df.drop("nm_256", axis=1).assign(nm_256=df.nm_256))

As we can see in the output above, 98 has an average sampling interval of 0.2 s, half that of the other samples (0.4 s), resulting in twice as many observations. A simple resampling will rectify this problem.

In [None]:
# resample 98 to a 0.4 s interval ("400ms" avoids the deprecated "S" alias)
data.query("samplecode=='98'").assign(
    mins=lambda df: pd.to_timedelta(df.mins, unit="m")
).resample(on="mins", rule="400ms").mean(numeric_only=True).plot(x="seconds", y="nm_256")

To make this permanent it should be set up as a cleaner function, but I won't bother unless this sample turns out to be relevant.
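
If it does turn out to be relevant, such a cleaner function might look something like the sketch below. `resample_signal` is a hypothetical name, and the rounding of the time axis to whole milliseconds is my own assumption, added to guard against float jitter at the bin edges. It is demonstrated on synthetic data only.

```python
import numpy as np
import pandas as pd


def resample_signal(df, rule="400ms", time_col="mins", signal_col="nm_256"):
    """Downsample one run to a fixed interval by averaging within each bin."""
    # float minutes -> timedelta, rounded to ms so values such as 0.3999999 s
    # do not fall on the wrong side of a bin edge
    td = pd.to_timedelta(df[time_col], unit="min").dt.round("ms")
    out = df.assign(td=td).resample(rule, on="td")[signal_col].mean().reset_index()
    # convert the bin labels back to float minutes
    out[time_col] = out["td"].dt.total_seconds() / 60
    return out.drop(columns="td")


# synthetic run sampled every 0.2 s, like sample 98
demo = pd.DataFrame(
    {"mins": np.arange(100) * 0.2 / 60, "nm_256": np.arange(100, dtype=float)}
)
resampled = resample_signal(demo)
print(len(demo), len(resampled))
```

Averaging within each bin is one choice among several; taking every second point or interpolating onto the common time grid would also bring 98 in line with the other samples.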

In [None]:
data.loc[:, ["code_wine", "id"]].drop_duplicates()

## Sample 115

The last one is 115.

In [None]:
sample_115 = data.query("samplecode=='115'")

sample_115.plot(x="mins", y="nm_256")

Again, it looks fine to me.

In [None]:
sample_115.mins.plot()
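
The visual check of the time axis can be backed by a programmatic one. A small sketch, where `time_axis_report` is a hypothetical helper shown here on toy values rather than on `sample_115.mins`:

```python
import pandas as pd


def time_axis_report(mins):
    """Summarise whether a time axis is ordered and evenly stepped."""
    mins = pd.Series(mins).reset_index(drop=True)
    steps = mins.diff().dropna()
    return {
        # is_monotonic_increasing allows ties, so duplicates are reported separately
        "monotonic": bool(mins.is_monotonic_increasing),
        "has_duplicates": bool(mins.duplicated().any()),
        "min_step": float(steps.min()),
        "max_step": float(steps.max()),
    }


good = time_axis_report([0.0, 0.5, 1.0, 1.5])
bad = time_axis_report([0.0, 1.0, 0.5, 1.5])
print(good)
print(bad)
```

A large gap between `min_step` and `max_step` would point to dropped or irregularly spaced observations even when the axis is monotonic.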

The signal looks fine and the time axis is monotonically increasing. So why the hell was it labeled as an outlier?