---
cdt: 2024-09-08T09:09:10
title: "'cuprac' Wavelength Profiles"
project: cuprac_dataset_EDA
description: "an observation of the distribution of the wavelength profiles of the 'cuprac' dataset"
conclusion: "There are key clusters centered around 360 and 1000, with minor clusters following. Baseline rises and falls, returning to origin at ~4000. There is a steep trough across all samples at ~4700 which may be the elution of undissolved CUPRAC reagent. There are two wavelength maxima - 310 and 450 nm. 310 may be underivatised analytes, 450 can be attributed to derivatised analytes. Use 450 nm for CUPRAC analysis as per literature."
---


We want to observe trends in the 'cuprac' dataset to find the representative wavelength. The absorbance profile changes over both time and wavelength. Peaks cluster in time because analytes of similar chemistry elute at similar times. We can also assume that these analytes have similar chromophores and therefore similar profiles along the wavelength axis. So if we are looking for general trends, it is safe to aggregate regions in time into bins. We thus reduce the 3D surface to a series of absorbance-versus-wavelength lines, one per bin, preferably fewer than five. This aggregation has the advantage of not being overly affected by convolution: for equal levels of convolution, all aggregations will be equally tainted. Selection of the start and end points of those bins should be based on the 256 and 450 nm profiles in raw and CUPRAC, respectively. The samples used will be those with the highest average absorbance at those wavelengths.
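The sample selection rule described above (highest average absorbance at one wavelength) can be sketched in plain Python. This is only an illustration under assumptions: `top_samples_by_mean`, the sample labels and the toy values are hypothetical and are not taken from the dataset or from the queries in this notebook.

```python
from collections import defaultdict
from statistics import fmean


def top_samples_by_mean(rows, n):
    """Return the n sample labels with the highest mean absorbance.

    rows: iterable of (sample_label, absorbance) pairs, all at one wavelength.
    """
    by_sample = defaultdict(list)
    for sample_label, absorbance in rows:
        by_sample[sample_label].append(absorbance)
    means = {label: fmean(values) for label, values in by_sample.items()}
    # Sort labels by mean absorbance, highest first, and keep the first n.
    return sorted(means, key=means.get, reverse=True)[:n]


# Made-up illustration values, not taken from the dataset.
toy_rows = [("a", 0.1), ("a", 0.3), ("b", 0.9), ("b", 0.7), ("c", 0.4)]
selected = top_samples_by_mean(toy_rows, n=2)
```

In the notebook itself the same ranking would more naturally be a `GROUP BY sample_num` with an `ORDER BY` on the mean absorbance.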


To begin binning we will observe an overlay of chromatograms @ 450 nm:

In [None]:
cuprac_samples_450 = con.sql("""--sql
SELECT
    *
FROM
    pbl.chromatogram_spectra_long
JOIN
    included_samples
USING
    (id)
WHERE
    wavelength = 450
    AND sample_num IN (138, 126, 124, 105, 125, 164)
ORDER BY
    idx
""").pl()

cuprac_samples_450.plot(x='idx', y='absorbance', title = "CUPRAC Samples @ 450 nm", by='sample_num')



As we can observe in the above plot, the CUPRAC separation is centered around the 1000 idx mark, with peak clusters at 360 to 770 and minor clusters after the main cluster until ~2800, with the baseline returning to origin at ~4000. There is a consistent, severe trough at ~4730, presumably representing the removal of the CUPRAC reagent.

Now what is the behavior across the wavelengths? To find out, we bin the time axis. The right-hand interval endpoints (idx) shall be as follows:

1. 125
2. 320
3. 750
4. 2000
5. 3000
6. 4000

This binning is not entirely fair, as the 1000 - 3000 region is heavily affected by baseline shift, but I have no interest in performing signal processing at this time.

In [None]:
# Create the bin table. The best way might be to generate a new idx rather than sourcing the idx from the long table.

# The most efficient way to organise this is a table containing the index and the bins, joined after the sample selection.


idx = pl.DataFrame(pl.arange(0, 7800, 1, eager=True).rename('idx'))

cuprac_binned_idx = con.sql(
"""--sql
-- CASE returns the first matching branch, so each branch only needs its upper bound.
-- This avoids the overlapping endpoints that inclusive BETWEEN ranges would share.
SELECT
    idx,
    CASE
        WHEN idx < 125 THEN 0
        WHEN idx <= 320 THEN 1
        WHEN idx <= 750 THEN 2
        WHEN idx <= 2000 THEN 3
        WHEN idx <= 3000 THEN 4
        WHEN idx <= 4000 THEN 5
        ELSE 6
    END AS bin
FROM
    idx
""").pl()

cuprac_binned_idx
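As a cross-check on the bin boundaries, the same labelling can be sketched with the standard-library `bisect` module. This is a hedged illustration, not part of the pipeline: `RIGHT_EDGES` and `assign_bin` are hypothetical names, and the first edge is written as 124 on the assumption that idx is an integer, so that `idx < 125` and `idx <= 124` are equivalent.

```python
from bisect import bisect_left

# Inclusive right edge of bins 0..5.
# Bin 0 is idx < 125, i.e. idx <= 124 for integer idx (assumption).
RIGHT_EDGES = [124, 320, 750, 2000, 3000, 4000]


def assign_bin(idx):
    """Return the bin label for an integer idx; anything past the last edge is bin 6."""
    # bisect_left counts how many edges are strictly smaller than idx,
    # which is exactly the index of the first bin whose right edge is >= idx.
    return bisect_left(RIGHT_EDGES, idx)


# Label the values on either side of a few boundaries.
boundary_checks = {i: assign_bin(i) for i in (0, 124, 125, 320, 321, 4000, 4001)}
```

Checking the values on either side of each edge makes it explicit which bin owns a shared endpoint such as 320.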


In [None]:
binned_cuprac = \
con.sql(
"""--sql
SELECT
    *,
FROM
    pbl.chromatogram_spectra_long
JOIN
    (SELECT DISTINCT id, sample_num FROM cuprac_samples_450)
USING
    (id)
JOIN
    cuprac_binned_idx
USING
    (idx)
WHERE
    wavelength = 450
ORDER BY
    sample_num, idx
"""
).pl()


In [None]:
import holoviews as hv

cuprac_chms_450 = binned_cuprac.plot(x='idx', y='absorbance', by='sample_num', title='CUPRAC samples @ 450 nm')

# dashed vertical lines at the right end of each bin (the open-ended last bin is skipped)
bin_vline = \
con.sql("""--sql
SELECT
    max(idx) as right_end
FROM
    binned_cuprac
GROUP BY
    bin
HAVING
    bin < 6
ORDER BY
    right_end
""").pl().pipe(hv.VLines).opts(line_dash='dashed', line_width=1)

bin_vline * cuprac_chms_450



In [None]:
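# binned_cuprac is restricted to 450 nm, so aggregate from the long table
# to get the mean absorbance per bin at every wavelength.
binned_cuprac_aggs = \
con.sql(
"""--sql
SELECT
    bin,
    wavelength,
    mean(absorbance) as abs
FROM
    pbl.chromatogram_spectra_long
JOIN
    (SELECT DISTINCT id FROM cuprac_samples_450)
USING
    (id)
JOIN
    cuprac_binned_idx
USING
    (idx)
GROUP BY
    bin, wavelength
ORDER BY
    bin, wavelength
""").pl()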

binned_cuprac_aggs
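The aggregation above collapses the time axis within each bin, leaving one mean absorbance value per (bin, wavelength) pair. A minimal plain-Python sketch of that reduction follows; `mean_absorbance_by_bin` and the toy rows are hypothetical and only illustrate the grouping logic, not the real data.

```python
from collections import defaultdict


def mean_absorbance_by_bin(rows):
    """Average absorbance per (bin, wavelength).

    rows: iterable of (bin_label, wavelength, absorbance) tuples,
          one per time point, sample and wavelength.
    Returns a dict {(bin_label, wavelength): mean absorbance}, sorted by key.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for bin_label, wavelength, absorbance in rows:
        key = (bin_label, wavelength)
        totals[key] += absorbance
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in sorted(totals)}


# Made-up illustration values, not taken from the dataset.
toy_rows = [
    (0, 244, 1.0), (0, 244, 3.0), (0, 450, 2.0),
    (1, 244, 10.0), (1, 450, 20.0), (1, 450, 40.0),
]
toy_aggs = mean_absorbance_by_bin(toy_rows)
```

Each bin then contributes a single absorbance-versus-wavelength line, which is what the results plot overlays.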




### Results

In [None]:
binned_cuprac_aggs.plot(x='wavelength', y='abs', by='bin', legend='bottom', title='absorbance vs. wavelength for time bins', height=500, width=1000)
 

#### General Profile Description

First the general CUPRAC profile across a range of samples was observed. Generally speaking, the separation is centered around the 1000th idx, with peak clusters at 360 to 770 prior to the maximum and a long range from the maximum to ~2800 containing minor peak clusters. The baseline gently returns to origin by the ~4000th idx. Following this is a severe drop at ~4730, which may be the end of the CUPRAC reagent elution.

#### Time Bin Aggregations over Wavelength 

There appears to be a consistent maximum at 244 nm that increases in intensity along the time domain, an inversely proportional trough at 310 nm, then another, expected, maximum at 450 nm. The last maximum is not directly proportional to time: bin 3 possesses the highest average, while bins 0, 5 and 6 possess almost zero average absorbance at that point. This is expected, as those bins contain no peaks. The 244 nm maximum does not follow the peak profile in this way, so it is presumably either uneluted analyte or solvent.
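The maxima and trough described above were read off the plot by eye. If we wanted to locate them programmatically, a simple neighbour comparison on one bin's mean spectrum would be a starting point. The sketch below is an assumption-laden illustration: `local_extrema` is a hypothetical helper, the toy curve is invented to have the same qualitative shape, and real spectra would likely need smoothing first.

```python
def local_extrema(x, y):
    """Return (maxima, minima): the x values at interior local extrema of y.

    A point is a local maximum (minimum) if it is strictly greater
    (smaller) than both of its immediate neighbours.
    """
    maxima, minima = [], []
    for i in range(1, len(y) - 1):
        if y[i] > y[i - 1] and y[i] > y[i + 1]:
            maxima.append(x[i])
        elif y[i] < y[i - 1] and y[i] < y[i + 1]:
            minima.append(x[i])
    return maxima, minima


# Made-up curve with the same qualitative shape, not measured values.
toy_wavelengths = [200, 244, 310, 450, 600]
toy_absorbance = [0.1, 0.8, -0.5, 0.6, 0.0]
toy_maxima, toy_minima = local_extrema(toy_wavelengths, toy_absorbance)
```

Running this per bin would give a table of extremum positions that could be compared across bins instead of relying on the overlay alone.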

### Conclusion
 
The general profile over a sampling of wines at 450 nm, and the time-bin aggregations over all wavelengths, were observed, described and interpreted. The general profile consists of two key clusters containing a minor and a major maximum at idx 360 and 1000 respectively, with minor clusters following until baseline return at ~4000. Another key feature is a steep trough at ~4700, which is attributed to the end of the CUPRAC reagent.

The absorbance-by-wavelength profile of 7 manually defined time aggregations across a sampling was observed. It was determined that there are three key extrema: two maxima, at 244 nm and 450 nm, and one minimum, at 310 nm. The first maximum increases then sharply decreases over time, and is presumably correlated with underivatised sample. The second maximum increases and more gently decreases over time, corresponding to the peak profile of the chromatogram, with its highest value in bin 3. The trough at 310 nm is extreme, and is presumably transmission from undissolved CUPRAC reagent. Note that some bins contain a much higher baseline than others, and once that is accounted for the results may be quite different. Furthermore, we require an explanation of the CUPRAC profile in terms of the solvent and CUPRAC gradients to fully understand the mechanism.