---
title: "Handling Absorbance Outliers"
cdt: 2024-09-07T00:09:10
description: "Handling the absorbance outliers first observed in [Table Descriptions](./key_table_descriptions.ipynb)"
project: total_dataset_EDA
conclusion: "negative absorbances are a natural feature of the CUPRAC dataset and are not inherently an indicator of outliers if the 'cuprac' subpopulation is included"
---

Following the observation of negative absorbances in [Table Descriptions](key_table_descriptions.ipynb), this notebook examines them and eliminates the affected samples if necessary.

In [None]:
%reload_ext autoreload
%autoreload 2

import duckdb as db
import polars as pl
from pca_analysis.experiments.constants import db_path
import seaborn as sns
import matplotlib.pyplot as plt

con = db.connect(db_path)
con.sql("FROM dataset_eda.nm_254").columns


# Samples with Negative Absorbances

Negative absorbance is not in itself a major concern: without baseline adjustment, the signal may oscillate around the zero line. But in the study mentioned, absorbances in the negative thousands were seen. We will therefore first count the negative readings in each sample and observe the distribution, fully expecting to have to fine-tune the filter. Now we know, again *a priori*, that different wavelengths have different ab


In [None]:

con.sql("""--sql
CREATE TABLE IF NOT EXISTS neg_abs_counts AS
(
WITH negs AS (
    SELECT
            sample_num,
            count(absorbance) as neg_count,
        FROM
            dataset_eda.nm_254
        WHERE
            absorbance < 0
        GROUP BY
            sample_num
),
total_abs AS (
    SELECT
        sample_num,
        count(absorbance) as counts,
    FROM
        dataset_eda.nm_254
    GROUP BY
        sample_num
    )
SELECT
        sample_num,
        neg_count,
        counts,
        neg_count / counts as neg_prop
FROM
        total_abs
JOIN
        negs
USING
        (sample_num)
ORDER BY
        neg_count DESC
);
SELECT * FROM neg_abs_counts
""").pl()

display(con.sql("SELECT * FROM neg_abs_counts LIMIT 10").pl())



Now, this only covers the 254 nm wavelength, but regardless, we never expect a significant negative absorbance, because that is a sign of transmission, which will only occur if bubbles or particulates are within the elution {need to add citation}. Ergo, negatives are bad. Surprisingly, more than 90% of the readings in some samples appear to be negative.
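
As a minimal sketch of the per-sample negative proportion computed above, here is the same calculation in plain Python. The `toy_signals` values, the `negative_proportion` helper and the 0.5 threshold are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
# Toy data: sample_num -> absorbance readings (hypothetical values).
toy_signals = {
    1: [0.5, 1.2, -0.1, 3.0],
    2: [-5.0, -4.0, -6.0, 0.2],
    3: [0.1, 0.2, 0.3, 0.4],
}


def negative_proportion(values):
    """Return the fraction of readings strictly below zero."""
    if not values:
        return 0.0
    return sum(v < 0 for v in values) / len(values)


# Proportion of negative readings per sample, including samples with none.
neg_props = {s: negative_proportion(v) for s, v in toy_signals.items()}

# Flag samples whose negative proportion exceeds an assumed threshold.
flagged = sorted(s for s, p in neg_props.items() if p > 0.5)
```

One difference from the query above is worth keeping in mind: the SQL uses an inner `JOIN` against `negs`, so `neg_abs_counts` only contains samples with at least one negative reading, whereas this sketch keeps samples with none at a proportion of 0.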


In [None]:
sns.histplot(con.sql("""--sql
SELECT * FROM neg_abs_counts
""").pl(), x='neg_count', binwidth=200)

plt.title('counts of samples with negative absorbance')


From this view, it appears that all samples have significant negative values! But what if we add a cutoff of, say, 10% below zero?
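
One possible reading of such a cutoff is sketched below. It assumes that "10% below zero" means a reading lower than minus 10% of the sample's own maximum absorbance; that interpretation, the `count_below_relative_cutoff` helper and the toy values are all assumptions rather than anything defined in this notebook.

```python
def count_below_relative_cutoff(values, fraction=0.10):
    """Count readings below -fraction * max(values).

    ASSUMPTION: the cutoff is scaled by the sample's own maximum absorbance.
    """
    if not values:
        return 0
    peak = max(values)
    if peak <= 0:
        # No positive signal to scale by: fall back to counting all negatives.
        return sum(v < 0 for v in values)
    cutoff = -fraction * peak
    return sum(v < cutoff for v in values)


# Hypothetical chromatogram: maximum is 100, so the cutoff is -10.
toy = [100.0, 50.0, -2.0, -9.9, -10.1, -40.0]
n_significant = count_below_relative_cutoff(toy)
```

Scaling by the sample maximum keeps small oscillations around the zero line out of the count, while deep excursions still register.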

In [None]:
con.sql("""--sql
CREATE OR REPLACE TEMP TABLE samples_with_negs
    AS (SELECT * FROM dataset_eda.nm_254 WHERE absorbance < 0
);
SELECT * FROM samples_with_negs LIMIT 10
""").pl()


In [None]:
# top 10 negative counts

con.sql("""--sql
CREATE TEMP TABLE IF NOT EXISTS top_10_neg_counts AS 
        (SELECT
    sample_num,
    count(idx) as neg_count
FROM
    samples_with_negs
GROUP BY
    sample_num
ORDER BY
    neg_count DESC
LIMIT 10);
SELECT
    cast(sample_num AS VARCHAR) AS sample_num,
    idx,
    absorbance
FROM
    dataset_eda.nm_254
JOIN
    top_10_neg_counts
USING
    (sample_num)
ORDER BY
    sample_num, idx
""").pl().pipe(sns.lineplot, x='idx', y='absorbance', hue='sample_num')

plt.title(label="chromatogram @ 254 nm of top 10 samples \n with significant negative absorbance")


This can't be right. What do their total absorbances look like? Let's choose 113.

In [None]:
con.sql("""--sql
SELECT
        cast(sample_num as varchar) as sample_num,idx, absorbance FROM dataset_eda.nm_254 ORDER BY sample_num, idx
""").pl().plot(x='idx',y='absorbance',by='sample_num', title='all samples @ nm 254', width = 1000)


These results may be an artifact of the data organisation. Samples 152 and 163 are the worst. What does sample 152 look like?


## Sample 152

In [None]:
s152 = con.sql("""--sql
SELECT * FROM dataset_eda.nm_254 WHERE sample_num = 152 ORDER BY idx
""").pl()
s152.plot(x='idx',y='absorbance',by='sample_num', title='sample 152 @ 254nm')



So at 254 nm it looks like that. What about across all wavelengths?


In [None]:
con.sql("""--sql
(SELECT
    COUNT(distinct wavelength) as wavelength
    --*,
    -- ntile() OVER (order by wavelength, idx) as wavelength_bin
FROM
    pbl.chromatogram_spectra_long
JOIN
    pbl.sample_metadata
USING
    (id)
WHERE
    sample_num = 152
-- ORDER BY
   -- idx        
);
SELECT
    *
FROM
    s152
ORDER BY
    wavelength
    -- idx
""").pl()
# .plot(x='idx',y='absorbance', by='wavelength_bin', legend=False, title='sample 152 @ all wavelengths')


So sample 152 has wavelengths ranging from


In [None]:
represent_wavelengths_152 = con.sql("""--sql
SELECT
    min(wavelength) as min_wavelength,
    max(wavelength) as max_wavelength,
    count(distinct wavelength) as count,
    max_wavelength - min_wavelength as range_wavelength,
    [
        min_wavelength,
        approx_quantile(wavelength, 0.2),
        approx_quantile(wavelength, 0.4),
        approx_quantile(wavelength, 0.6),
        approx_quantile(wavelength, 0.8),
        max_wavelength,
        ] as representatives,
FROM
    pbl.chromatogram_spectra_long
JOIN
    pbl.sample_metadata
USING
    (id)
WHERE
    sample_num = 152
""").pl()
represent_wavelengths_152
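
The idea behind the representative wavelengths can be sketched without the database: pick the wavelengths sitting at evenly spaced quantile positions of the sorted distinct values. The `representative_wavelengths` helper and the toy wavelength grid are hypothetical. This version also takes exact positions in the distinct values, which is not the same computation as DuckDB's `approx_quantile` over all rows.

```python
def representative_wavelengths(wavelengths, quantiles=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Pick the distinct wavelengths at the given quantile positions."""
    distinct = sorted(set(wavelengths))
    last = len(distinct) - 1
    # Map each quantile to the nearest index in the sorted distinct values.
    return [distinct[round(q * last)] for q in quantiles]


# Hypothetical wavelength grid: 200 nm to 300 nm in 10 nm steps,
# repeated as it would be in a long-format table.
toy_wavelengths = list(range(200, 301, 10)) * 3
picks = representative_wavelengths(toy_wavelengths)
```

Because every pick is an element of the input, each one is guaranteed to match rows when used in a `wavelength in (...)` filter.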


In [None]:
con.sql("""--sql
SELECT
    *
FROM
    pbl.chromatogram_spectra_long
JOIN
    pbl.sample_metadata
USING
    (id)
WHERE
    sample_num = 152
AND
    wavelength in (SELECT UNNEST(representatives) FROM represent_wavelengths_152)
ORDER BY
    idx
""").pl().plot(x='idx',y='absorbance',by='wavelength', title='sample 152 @ 0.2 quantile wavelengths')


As we can see, sample 152 is a mess. The data is irredeemable and should be expunged.


In [None]:
# excluded samples

con.sql("""--sql
SELECT * FROM dataset_eda.excluded_samples
""").pl()


So that's 152. What about the rest?

## Sample 163

As with sample 152, let's observe the representative wavelengths.

In [None]:
con.sql("""--sql
WITH
    representative_wavelengths AS (
    SELECT
        [
        min(wavelength),
        approx_quantile(wavelength, 0.2),
        approx_quantile(wavelength, 0.4),
        approx_quantile(wavelength, 0.6),
        approx_quantile(wavelength, 0.8),
        max(wavelength),
        ] as representatives
    FROM
        pbl.chromatogram_spectra_long
    JOIN
        pbl.sample_metadata
    USING
        (id)
    WHERE
        sample_num = 163
        )
SELECT
    *
FROM
    pbl.chromatogram_spectra_long
JOIN
    pbl.sample_metadata
USING
    (id)
WHERE
    sample_num = 163
AND
    wavelength in (SELECT UNNEST(representatives) FROM representative_wavelengths)
""").pl().plot(x='idx',y='absorbance',by='wavelength', title='sample 163 @ 0.2 quantile wavelengths')


Are these CUPRAC samples? Hence the gulf?

In [None]:
con.sql("""--sql
SELECT
    distinct(detection)
FROM
    top_10_neg_counts
JOIN
    pbl.sample_metadata
USING
        (sample_num)
""").pl()


So all of the "negative" samples are CUPRAC samples.

# Conclusion

So it has become apparent that 254 nm is NOT a representative wavelength across both the raw and CUPRAC datasets, almost by definition: the CUPRAC reagent was selected because its wavelength range is isolated from the endogenous chromophores in organic material. To verify this, we should observe absorbance as a function of wavelength for a select number of peaks in select samples.
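
A minimal sketch of that follow-up check, in plain Python, is given here. It assumes long-format records of `(idx, wavelength, absorbance)`, locates the tallest point at a reference wavelength, and reads off the absorbance at every wavelength for that time index. The `spectrum_at_peak` helper and the toy records are hypothetical, and taking the single maximum stands in for real peak detection.

```python
# Toy long-format records: (idx, wavelength, absorbance). Hypothetical values.
records = [
    (0, 254, 1.0), (1, 254, 8.0), (2, 254, 2.0),
    (0, 450, -3.0), (1, 450, 20.0), (2, 450, -1.0),
]


def spectrum_at_peak(records, ref_wavelength):
    """Return (peak_idx, [(wavelength, absorbance), ...]) at the tallest point
    of the chromatogram recorded at ref_wavelength."""
    at_ref = [(idx, a) for idx, wl, a in records if wl == ref_wavelength]
    if not at_ref:
        raise ValueError(f"no records at wavelength {ref_wavelength}")
    # Time index of the maximum absorbance at the reference wavelength.
    peak_idx = max(at_ref, key=lambda r: r[1])[0]
    # Absorbance at every wavelength for that time index, sorted by wavelength.
    spectrum = sorted((wl, a) for idx, wl, a in records if idx == peak_idx)
    return peak_idx, spectrum


peak_idx, spectrum = spectrum_at_peak(records, ref_wavelength=254)
```

Plotting `spectrum` for a few peaks in a raw sample and a CUPRAC sample would show directly whether 254 nm sits inside or outside each sample's absorbing range.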