---
cdt: 2024-09-11T14:01:36
title: "'raw' Signal EDA"
description: "EDA on the signals of the 'raw' dataset, primarily to describe the distribution of times, wavelengths and absorbance, and find any outliers."
conclusion: ""
status: "open"
---

To export to xarray we need to unify the time and wavelength modes, which requires a description of both. Removing any outliers first will make this task simpler.

In [None]:
# environment

%reload_ext autoreload
%autoreload 2

import duckdb as db
import polars as pl
from pca_analysis.experiments.constants import db_path
import seaborn as sns
import matplotlib.pyplot as plt
pl.Config.set_fmt_str_lengths(9999)
pl.Config.set_tbl_rows(9999)
con = db.connect(db_path)


# Setup

To make this simpler, we should create a 'raw'-specific join table.


In [None]:
con.sql(
"""--sql
SELECT
    schema_name,
    table_name
FROM
    duckdb_tables
WHERE
    schema_name = 'joins'
"""
).pl()


In [None]:
con.sql(
"""--sql
SELECT
    *
FROM
    joins.chm_st_ct
LIMIT 10
"""
).pl()


In [None]:
con.sql(
"""--sql
CREATE OR REPLACE TABLE raw.raw_join_tbl AS
    SELECT
        jt.*
    FROM
        joins.chm_st_ct as jt
    JOIN
        clean.st as st
    ON
        jt.pk_st = st.pk
    WHERE
        detection = 'raw';
SELECT
    *
FROM
    raw.raw_join_tbl
"""
).pl().shape


Also, 'raw.cs_long' is missing the 'pk_chm' primary key. Easiest to recreate it here.


In [None]:
con.sql(
"""--sql
ALTER TABLE
    raw.cs_long
ADD COLUMN IF NOT EXISTS
    pk_chm INTEGER;
UPDATE
    raw.cs_long AS tar
SET
    pk_chm = (
        SELECT
            sou.pk_chm
        FROM
            pbl.chromatogram_spectra_long as sou
        JOIN
            raw.raw_join_tbl as jt
        ON
            jt.pk_chm = sou.pk_chm
        WHERE
            sou.id = tar.id
        );
SELECT
    *
FROM
    raw.cs_long
LIMIT
    5
"""
).pl()


To confirm, use the key to join to the metadata and check that the ids match across every pairing.

In [None]:
con.sql(
"""--sql
SELECT
    bool_and(pk = pk_chm) as if_true_pass,
    bool_and(cs.id = chm.id) as id_if_true_pass,
FROM
    raw.cs_long as cs
JOIN
    clean.chm as chm
ON
    chm.pk = cs.pk_chm
"""
).pl()


If both columns are true, the key was assigned correctly.
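
The same check can be sketched outside the database. This is a minimal pure-Python illustration on made-up rows, not the notebook's tables; the helper name `keys_match` and the toy ids are assumptions for the example.

```python
def keys_match(cs_rows, chm_ids):
    """Return True if every (pk_chm, id) row resolves to the same id in the metadata.

    cs_rows: iterable of (pk_chm, id) pairs, a toy stand-in for raw.cs_long.
    chm_ids: dict mapping pk -> id, a toy stand-in for clean.chm.
    """
    return all(
        pk_chm in chm_ids and chm_ids[pk_chm] == row_id
        for pk_chm, row_id in cs_rows
    )


chm_ids = {1: "a", 2: "b"}
print(keys_match([(1, "a"), (1, "a"), (2, "b")], chm_ids))  # every key resolves
print(keys_match([(1, "a"), (None, "b")], chm_ids))         # one row never received a key
```

Unlike an inner join, which drops rows whose `pk_chm` is NULL before `bool_and` sees them, this sketch treats an unresolved key as a failure. A separate count of NULL `pk_chm` rows would cover the same case in SQL.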

# Description of 'raw.cs_long'

In [None]:
con.sql(
"""--sql
select
    count(*) as rows
FROM
    raw.cs_long
"""
).pl()


Including 'excluded' samples, there are over 85 million rows in 'raw.cs_long'.

In [None]:
con.sql(
"""--sql
SELECT
    table_schema,
    table_name,
    column_name,
FROM
    information_schema.columns
WHERE
    table_schema = 'raw'
AND
    table_name = 'cs_long'
"""
).pl()


There are 8 columns: 'idx', 'id', 'mins', 'wavelength', 'absorbance', 'sample_num', 'detection', 'pk_chm'.

# Preprocessing

The time offset will have to be corrected again in this table. Easiest just to do it directly.

In [None]:
con.sql(
"""--sql
ALTER TABLE raw.cs_long ADD COLUMN IF NOT EXISTS mins_corrected DOUBLE;
ALTER TABLE raw.cs_long ADD COLUMN IF NOT EXISTS secs_corrected DOUBLE;
-- need to create an intermediate table
CREATE OR REPLACE TEMP TABLE time_corrected AS (
    WITH
        mins_corrected_ AS (
            SELECT
                pk_chm,
                wavelength,
                idx,
                -- partition-wide minimum: the offset is a single value per (pk_chm, wavelength)
                mins - min(mins) OVER (PARTITION BY pk_chm, wavelength) AS mins_corrected
            FROM
                raw.cs_long
        ),
        other_units_corrected AS (
            SELECT
                pk_chm,
                wavelength,
                idx,
                mins_corrected,
                mins_corrected * 60 as secs_corrected,
            FROM
                mins_corrected_
        )
    SELECT
        *
    FROM
        other_units_corrected
);
UPDATE raw.cs_long as tar
    SET mins_corrected = (
        SELECT
            mins_corrected
        FROM
            time_corrected as sou
        WHERE
            sou.pk_chm = tar.pk_chm
        AND
            sou.wavelength = tar.wavelength
        AND
            sou.idx = tar.idx
    );
UPDATE raw.cs_long as tar
    SET secs_corrected = (
        SELECT
            secs_corrected
        FROM
            time_corrected as sou
        WHERE
            sou.pk_chm = tar.pk_chm
        AND
            sou.wavelength = tar.wavelength
        AND 
            sou.idx = tar.idx
    );
SELECT
    *
FROM
    raw.cs_long
LIMIT
    10;
"""
).pl()


In [None]:
con.sql(
"""--sql
SELECT
    pk_chm,
    absorbance,
    idx,
    mins_corrected as mins,
FROM
    raw.cs_long
WHERE
    wavelength = 256
ORDER BY
    pk_chm,
    idx
"""
).pl().plot.line(x='idx',y='mins', by='pk_chm', title='idx vs. mins_corrected')


If the line is straight then 'mins_corrected' is correct. The data should now be ready to analyse.
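
The correction and the straight-line check can also be expressed numerically. Below is a minimal pure-Python sketch on made-up rows; the helper names `correct_offset` and `is_linear`, the tolerance, and the toy sampling interval are all assumptions for illustration.

```python
from collections import defaultdict


def correct_offset(rows):
    """Subtract the per-(pk_chm, wavelength) minimum of 'mins' from each row.

    rows: list of dicts with keys pk_chm, wavelength, idx, mins
    (a toy stand-in for raw.cs_long).
    """
    group_min = defaultdict(lambda: float("inf"))
    for r in rows:
        key = (r["pk_chm"], r["wavelength"])
        group_min[key] = min(group_min[key], r["mins"])
    return [
        {**r, "mins_corrected": r["mins"] - group_min[(r["pk_chm"], r["wavelength"])]}
        for r in rows
    ]


def is_linear(values, tol=1e-9):
    """True if consecutive differences are constant within `tol`."""
    steps = [b - a for a, b in zip(values, values[1:])]
    return all(abs(s - steps[0]) <= tol for s in steps)


# Toy data: one chromatogram starting at 0.5 min, sampled every 0.25 min.
rows = [
    {"pk_chm": 1, "wavelength": 256, "idx": i, "mins": 0.5 + 0.25 * i}
    for i in range(5)
]
corrected = correct_offset(rows)
times = [r["mins_corrected"] for r in sorted(corrected, key=lambda r: r["idx"])]
print(times[0], is_linear(times))
```

A constant step between consecutive time points is the numeric equivalent of the straight line in the plot. Real data would likely need a looser tolerance than the one assumed here.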

# Mode Descriptions


## Time


In [None]:
con.sql(
"""--sql
SELECT * FROM dataset_eda.excluded_samples
"""
).pl()


In [None]:
con.sql(
"""--sql
SELECT
    schema_name
FROM
    duckdb_schemas
"""
).pl()


In [None]:
con.sql(
"""--sql
CREATE OR REPLACE TEMP TABLE included_cs AS (
WITH
    included AS (
        SELECT
            *
        FROM
            raw.raw_join_tbl as jt
        ANTI JOIN
            dataset_eda.excluded_samples as exc
        ON
            jt.sample_num = exc.sample_num
    )
SELECT
    cs.*
FROM
    raw.cs_long cs
JOIN
    included inc
ON
    cs.pk_chm = inc.pk_chm
ORDER BY
    cs.pk_chm,
    cs.wavelength,
    cs.idx
);
SELECT
    count( distinct pk_chm) = 96 as individuals 
FROM
    included_cs
"""
).pl()


In [None]:
con.sql(
"""--sql
SELECT
    *
FROM
    included_cs 
LIMIT
    3
"""
).pl()


In [None]:
con.sql(
"""--sql
CREATE OR REPLACE TABLE raw.time_aggs AS (
WITH 
    nm_256 AS
        (
        SELECT
            *
        FROM
            included_cs
        WHERE
            wavelength = 256
        ),
    time_aggs AS (
        SELECT
            pk_chm,
            min(mins_corrected) as min,
            max(mins_corrected) as max,
            count(mins_corrected) as count,
        FROM
            nm_256
        GROUP BY
            pk_chm
    )
SELECT
    *
FROM
    time_aggs
ORDER BY
    pk_chm
);
SELECT
    *
FROM
    raw.time_aggs
LIMIT 10
"""
).pl()


In [None]:
con.sql(
"""--sql
SELECT
    count(distinct count)
from
    raw.time_aggs
"""
).pl()


There are 4 distinct observation counts, that is, 4 different numbers of observations per sample.
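
Beyond the number of distinct counts, it can help to see how many samples share each count. A minimal sketch with `collections.Counter`; the `(pk_chm, count)` pairs below are invented for illustration and do not come from 'raw.time_aggs'.

```python
from collections import Counter

# Hypothetical (pk_chm, observation count) pairs standing in for raw.time_aggs.
time_aggs = [(1, 7800), (2, 7800), (3, 7801), (4, 15600), (5, 7801)]

# Number of samples sharing each observation count.
samples_per_count = Counter(count for _, count in time_aggs)
n_distinct = len(samples_per_count)
print(n_distinct, sorted(samples_per_count.items()))
```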


Do they correspond to a method?


In [None]:
con.sql(
"""--sql
CREATE OR REPLACE TEMP TABLE acq_method_aggs AS (
    WITH
        acq_method_aggs AS (
        SELECT
            chm.acq_method,
            aggs.pk_chm,
            aggs.min,
            aggs.max,
            aggs.count,
        FROM
            raw.time_aggs as aggs
        JOIN
            clean.chm as chm
        ON
            chm.pk = aggs.pk_chm
        )
    SELECT
        *
    FROM
        acq_method_aggs
    );
SELECT
    *
FROM
    acq_method_aggs
LIMIT
    10
"""
).pl()


How many samples per method?


In [None]:
con.sql(
"""--sql
SELECT
    acq_method,
    count(pk_chm) as count
FROM
    acq_method_aggs
GROUP BY
    acq_method
"""
).pl()


Only three methods?

In [None]:
con.sql(
"""--sql
SELECT
    distinct acq_method
FROM
    acq_method_aggs
"""
).pl()


Presumably one was unique to the excluded samples?


In [None]:
con.sql(
"""--sql
-- display methods in the excluded samples not present in 'acq_method_aggs'
SELECT
    acq_method
FROM
    clean.chm chm
JOIN
    dataset_eda.excluded_samples exc
ON
    chm.sample_num = exc.sample_num
EXCEPT
SELECT
    acq_method
FROM
    acq_method_aggs
"""
).pl()


Halo. The Halo method shouldn't be in the database at all.
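
The `EXCEPT` query above is a set difference. As a minimal sketch, the same logic in Python; all method names other than "halo" are placeholders invented for the example.

```python
# Hypothetical method names; only "halo" mirrors the finding above.
excluded_methods = {"method_a", "method_b", "halo"}
included_methods = {"method_a", "method_b", "method_c"}

# Like SQL EXCEPT: keep what appears on the left but not on the right,
# with duplicates collapsed because both sides are sets.
only_in_excluded = excluded_methods - included_methods
print(only_in_excluded)
```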


Which sample uses it?

In [None]:
con.sql(
"""--sql
SELECT
    chm.sample_num,
    jt.pk_chm,
    jt.pk_st,
    acq_date,
    seq_name,
    st.notes,
FROM
    clean.chm chm
JOIN
    joins.chm_st_ct jt
ON
    jt.pk_chm = chm.pk
JOIN
    clean.st st
ON
    st.pk = jt.pk_st

WHERE
    acq_method LIKE '%halo%'
"""
).pl()


One sample, sample 87. It was run on an incorrect method. It's already in the excluded list for being run at 5Hz as well. Wild.

In [None]:
con.sql(
"""--sql
SELECT
    *
FROM
    dataset_eda.excluded_samples
WHERE
    sample_num = 87
"""
).pl()


Now, how many samples are there per remaining method?

First, create a 'raw'-specific chm table containing only the included samples.


In [None]:
con.sql(
"""--sql
-- create a chm table specific for raw excluding the excluded samples.
CREATE OR REPLACE TABLE raw.chm AS (
SELECT
    *
FROM
    clean.chm as chm
ANTI JOIN
    dataset_eda.excluded_samples as exc
ON
    chm.sample_num = exc.sample_num
JOIN
    joins.chm_st_ct as jt
ON
    jt.pk_chm = chm.pk
JOIN
    clean.st st
ON
    st.pk = jt.pk_st
WHERE
    detection = 'raw'
);
SELECT
    *
FROM
    raw.chm
LIMIT 3
"""
).pl()


In [None]:
con.sql(
"""--sql
SELECT COUNT(*) FROM raw.chm
"""
).pl()


In [None]:
con.sql(
"""--sql
SELECT
    distinct acq_method
FROM
    raw.chm
"""
).pl()


Without the excluded samples, 3 methods remain in the raw dataset.


In [None]:
con.sql(
"""--sql
CREATE OR REPLACE TABLE raw.methods AS (
WITH
    timestamped AS (
        SELECT
            acq_method,
            CAST(acq_date AS TIMESTAMP) as acq_date
        FROM
            raw.chm
        ),
    numbered AS (
        -- NOTE: not referenced by the final query
        SELECT
            acq_method,
            acq_date,
        FROM
            timestamped
        ),
    method_aggs AS (
        SELECT
            acq_method,
            min(acq_date) as earliest,
            max(acq_date) as latest,
            latest - earliest as length_of_use
        FROM
            timestamped
        GROUP BY
            acq_method        
        )
SELECT
    dense_rank() OVER (ORDER BY earliest) as method_idx,
    acq_method,
    earliest,
    latest,
    length_of_use
FROM
    method_aggs
ORDER BY
    method_idx
    );
SELECT
    *,
FROM
    raw.methods
ORDER BY
    earliest
"""
).df()


As we can see, some methods were in use for longer than others. Method 2, avantor 100 x 4.6mm C18 H2O:MeOH at 2.5%, was used the longest, with 55 days of use. The corresponding 2.1 and 2.5 44 mins methods were used for less time, at 7 and 8 days respectively.
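
The per-method span calculation can be sketched with the standard library. The method names and dates below are invented for illustration, and `enumerate` only matches `dense_rank()` under the assumption that no two methods share the same earliest date.

```python
from datetime import datetime

# Hypothetical (acq_method, acq_date) pairs standing in for raw.chm.
runs = [
    ("method_a", "2023-02-01T09:00:00"),
    ("method_a", "2023-02-08T09:00:00"),
    ("method_b", "2023-03-01T09:00:00"),
    ("method_b", "2023-04-25T09:00:00"),
]

# Track the earliest and latest acquisition per method.
spans = {}
for method, stamp in runs:
    when = datetime.fromisoformat(stamp)
    earliest, latest = spans.get(method, (when, when))
    spans[method] = (min(earliest, when), max(latest, when))

# Order methods by first use and number them from 1.
ordered = sorted(spans.items(), key=lambda kv: kv[1][0])
summary = [
    (rank, method, (latest - earliest).days)
    for rank, (method, (earliest, latest)) in enumerate(ordered, start=1)
]
print(summary)
```

Note that length of use measures the time between the first and last run on a method, not how many samples were run on it.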

In [None]:
con.sql(
"""--sql
WITH raw_samples_per_method AS (
    SELECT
        acq_method,
        count(pk) as sample_count
    FROM
        raw.chm as chm
    GROUP BY
        acq_method
    )
SELECT
    methods.method_idx,
    methods.acq_method,
    aggs.sample_count,
    methods.earliest,
    methods.latest,
    methods.length_of_use,
FROM
    raw_samples_per_method as aggs
LEFT JOIN
    raw.methods as methods
USING
    (acq_method)
ORDER BY
    method_idx
"""
).df()


As we can see, method 1 has 26 samples, method 2 has 64 samples, method 3 has 6 samples, and method 5 has 71 samples.

What sets the methods apart? While the gradient is a concerning difference, there is not much I can do to observe it at the moment without checking the data files. I should probably just do that.

The problem is that the methods, while keeping the same name throughout, may have been modified. The only way to tell is to check each sample's gradient directly. This information can be accessed manually in acq.txt as a text-based table, which is not very convenient for parsing. But it does contain the method name and gradient profile, so presumably the "macaml" does too.

If we wanted to get the information, we could try parsing the macaml again.
