---
cdt: 2024-09-09T13:41:13
title: Description of Raw Dataset Signal Profile over Samples
project: raw_dataset_EDA
description: an observation of the features of the raw dataset
conclusion: samplewise, the chromatographic profile varies significantly. There are key peak regions at (idx) (0, 600) and ~1200, with minor clustering throughout. The baseline rises and falls across the profile, returning to origin at ~4200, with a methanol purge feature at ~6300. The wavelength profile drops steeply after 190 nm and has the same shape for all bin ranges, thus no wavelength range is specific to any bin region.
---

In [None]:
%reload_ext autoreload
%autoreload 2

import duckdb as db
import polars as pl
from pca_analysis.experiments.constants import db_path

con = db.connect(db_path, read_only=True)


This document contains a description of the raw dataset.


# Chromatographic Profile @ 256 nm


In [None]:
import holoviews as hv

included_samples = \
con.sql(
"""--sql
SELECT
    *
FROM
    pbl.sample_metadata
ANTI JOIN
    (select sample_num from dataset_eda.excluded_samples)
USING
    (sample_num)
"""
).pl()

raw_sample_selection = con.sql(
"""--sql
SELECT
    *
FROM
    included_samples
WHERE
    detection = 'raw'
USING
    SAMPLE reservoir(15 rows) repeatable (42)
"""
).pl()

raw_signals_256 = con.sql(
"""--sql
WITH
    raw_signals_256 AS (
            SELECT
                *
            FROM
                pbl.chromatogram_spectra_long as cs
            JOIN
                raw_sample_selection
            USING
                (id)
            WHERE
                cs.wavelength = 256
            ORDER BY
                sample_num, cs.idx
    )
SELECT
    *
FROM
    raw_signals_256
"""
).pl()

sample_plot_256 = \
(
    raw_signals_256
        .plot(x='idx',y='absorbance',by='sample_num', title='absorbance vs. time for samples @ 256 nm', legend=False, height =500, width =1000)
)
sample_plot_256


Every analyte in every sample has eluted by idx ~4100. We can therefore place bin edges at idx 110, 630, 1200, 1800, 3000 and 4200.
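
The binning rule implied by these edges can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation (which uses a SQL `CASE` expression); the names `BIN_EDGES` and `assign_bin` are hypothetical. It assumes that an idx equal to a shared edge belongs to the earlier bin, which is what a first-match `CASE` over inclusive `BETWEEN` ranges yields.

```python
from bisect import bisect_left

# Bin edges (idx units) chosen by eye from the 256 nm profile.
BIN_EDGES = [110, 630, 1200, 1800, 3000, 4200]


def assign_bin(idx: int) -> int:
    """Return the bin number (0 to 6) for a chromatogram index."""
    if idx < BIN_EDGES[0]:
        return 0
    # An idx equal to an upper edge stays in the lower bin,
    # so bin 1 covers 110..630, bin 2 covers 631..1200, and so on.
    return 1 + bisect_left(BIN_EDGES[1:], idx)


for i in (0, 110, 630, 631, 1200, 4200, 4201):
    print(i, assign_bin(i))
```

Making the edge handling explicit like this is mainly useful for checking that no idx falls between bins and that each edge is counted once.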

# Binned Sample Aggregations Across Wavelengths


In [None]:
# binning

max_idx = con.sql(
"""--sql
SELECT
    max(idx)
FROM
    pbl.chromatogram_spectra_long
JOIN
    included_samples
USING
    (id)
WHERE
    detection = 'raw'
"""
).pl().item()

# pl.arange is end-exclusive, so add 1 to include max_idx itself
idx = pl.DataFrame(pl.arange(0, max_idx + 1, 1, eager=True).rename('idx'))

raw_binned_idx = con.sql(
"""--sql
SELECT
    idx,
    CASE
    WHEN idx < 110 THEN 0
    WHEN idx BETWEEN 110 AND 630 THEN 1
    WHEN idx BETWEEN 630 AND 1200 THEN 2
    WHEN idx BETWEEN 1200 and 1800 THEN  3
    WHEN idx BETWEEN 1800 and 3000 THEN 4
    WHEN idx BETWEEN 3000 and 4200 THEN 5
    WHEN idx > 4200 THEN 6
    END
    as bin
    FROM
        idx
""").pl()

raw_samples_binned = (con.sql(
"""--sql
WITH
    raw_samples_binned AS (
        SELECT
            *
        FROM
            raw_signals_256
        JOIN
            raw_binned_idx
        USING
            (idx)
        ORDER BY
            sample_num, idx
    )
SELECT
    *
FROM
    raw_samples_binned

""")
)

bins = con.sql(
"""--sql
SELECT
    max(idx) as bin_end,
FROM
    (
        SELECT
            bin,
            idx
        FROM
            raw_samples_binned
        JOIN
            (
                SELECT
                    first(sample_num) AS sample_num
                FROM
                    raw_samples_binned
            )
        USING
            (sample_num)
    )
GROUP BY
    bin
HAVING
-- skip last bin because we're only plotting the ends
bin < 6
""").pl().pipe(hv.VLines).opts(line_dash='dashed', line_width=1)

bins * sample_plot_256


Now that representative bins have been selected, we can observe the signal across wavelengths. In the interest of efficiency, we coarsen the wavelength sampling to 20 nm between observations.
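
As a quick check of which wavelengths such a grid contains, the sketch below rebuilds it with the standard-library `range`, assuming the 190 to 620 nm bounds and 20 nm step used in the cell that follows. The constant names are hypothetical. Like `pl.arange`, `range` excludes its end value, so 620 nm itself is not part of the selection.

```python
# Hypothetical stand-in for the wavelength grid; range() is end-exclusive.
START_NM, STOP_NM, STEP_NM = 190, 620, 20

wavelengths = list(range(START_NM, STOP_NM, STEP_NM))

print(len(wavelengths))                 # number of wavelengths retained
print(wavelengths[0], wavelengths[-1])  # first and last wavelength in nm
```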


In [None]:
wavelength_selection = pl.DataFrame(
    pl.arange(190, 620, 20, eager=True).rename('wavelength')
    )
wavelength_selection


In [None]:
# get the granular expression of the samples

raw_signals_granular = con.sql(
"""--sql
SELECT
    *,
    ntile(50) OVER (ORDER BY wavelength) as wavelength_group
FROM
    pbl.chromatogram_spectra_long
JOIN
    raw_sample_selection
USING
    (id)
WHERE
    wavelength in (SELECT (wavelength) FROM wavelength_selection)
ORDER BY
    sample_num, wavelength,idx
"""
).pl()


In [None]:
# get the granular expression of the samples. For each wavelength, aggregate into n groups 

con.sql(
"""--sql
WITH
    grouped_wavelength AS (
        SELECT
            -- columns needed by the outer aggregation
            sample_num,
            wavelength,
            absorbance,
            -- break the wavelengths into groups for later aggregation
            ntile(50) OVER (ORDER BY wavelength) as wavelength_group
        FROM
            -- the spectrochromato data
            pbl.chromatogram_spectra_long
        JOIN
            -- the sample nums in the sample
            raw_sample_selection
        USING
            (id)
        WHERE
            wavelength in (SELECT (wavelength) FROM wavelength_selection)
        ORDER BY
            sample_num, wavelength
    )
SELECT
    -- select only grouped columns and aggregates
    sample_num,
    wavelength,
    mean(absorbance) as absorbance
FROM
    grouped_wavelength
GROUP BY
    sample_num, wavelength
ORDER BY
    sample_num, wavelength
LIMIT 10
"""
).pl()


In [None]:
# add the bins


In [None]:
raw_bin_aggs = con.sql(
"""--sql
SELECT
    bin,
    wavelength,
    mean(absorbance) as absorbance
FROM
    raw_signals_granular
JOIN
    raw_binned_idx
USING
    (idx)
GROUP BY
    bin, wavelength
ORDER BY
    bin, wavelength
""").pl()

raw_bin_aggs.plot(x='wavelength',y='absorbance',by='bin', title='raw sample absorbance v. wavelength over time bins aggregated across samples')


As we can see, the spectral profile is consistent across the time bins. This in itself is not evidence that 256 nm is the optimal observation wavelength: taken alone, it would suggest that ~200 nm is optimal. We do however know *a priori* that those wavelengths are dominated by the solvent background, and that ~256 nm contains the most prominent signals.
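
The visual impression that the spectral shape is the same in every bin could be quantified by correlating each bin's spectrum against a reference bin. The sketch below is one way to do that under stated assumptions: the `pearson` helper is hypothetical, and the `spectra` values are made-up illustrative numbers, not results from this dataset.

```python
from math import sqrt
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up mean absorbance per wavelength for three time bins (illustration only).
spectra: Dict[int, List[float]] = {
    0: [900.0, 300.0, 120.0, 60.0, 20.0],
    1: [1800.0, 610.0, 250.0, 118.0, 41.0],
    2: [450.0, 155.0, 58.0, 31.0, 9.0],
}

# Correlate every bin against bin 0; values near 1 mean the same spectral shape.
reference = spectra[0]
similarity = {b: pearson(reference, s) for b, s in spectra.items()}
for b, r in similarity.items():
    print(f"bin {b}: r = {r:.3f}")
```

Correlation ignores overall scale, so bins with very different absolute absorbance still score close to 1 when their shapes match.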

### Results

The profiles at 256 nm can be described as possessing two major clusters within the first 600 indices, and a third cluster by 1252, depending on the sample. Following this, the baseline reaches a maximum at ~1800, with minor peak clusters throughout, and some samples possess a significant cluster at ~3200 and ~3800. By ~4200 the baseline has returned to origin for all samples observed. There is a notable rise at 6200 which can presumably be attributed to the 100% methanol purge.
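
The point at which the baseline "returns to origin" was read from the plot by eye. A simple way to estimate it programmatically is sketched below; the `return_to_baseline` helper, the tolerance, and the toy signal are all hypothetical and only illustrate the idea.

```python
from typing import List, Optional


def return_to_baseline(signal: List[float], tol: float) -> Optional[int]:
    """Return the first index from which every later value stays within
    `tol` of the first value, or None if the signal never settles."""
    origin = signal[0]
    settled_from = None
    for i, value in enumerate(signal):
        if abs(value - origin) <= tol:
            # Start (or continue) a run of near-origin values.
            if settled_from is None:
                settled_from = i
        else:
            # Any excursion resets the run.
            settled_from = None
    return settled_from


# Toy signal: rises, falls, then stays near its starting value.
toy = [0.0, 0.1, 5.0, 8.0, 3.0, 0.2, 0.1, 0.0]
print(return_to_baseline(toy, tol=0.5))
```

For real chromatograms the signal would need to be truncated before the methanol purge, since that late rise would otherwise count as an excursion from the origin.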


### Conclusions

We observed the chromatographic profile at 256 nm across samples, and the sample-aggregated spectral profile across wavelengths within manually defined time bins.

The chromatographic profile varies significantly by sample but can be summarised as possessing two major peak clusters between idx 0 and 600, followed by a third at ~1200. Clusters occur throughout the remainder of the elution profile, with some samples possessing key clusters at ~3200 and ~3800. The baseline rises slowly after the first cluster, reaching a maximum around 1800 before returning to origin at ~4200. There is a feature attributable to the methanol purge at ~6300. See the description of the experimental design for more information.

Regarding the aggregated spectral profile, we found that it is consistent across the selected time bins, meaning that no wavelength is superior for a given time range. It is however difficult to determine an optimal observation wavelength for all samples and times from aggregated spectral profiles alone, as no distinct maximum occurs.
