---
title: Correcting Offset
cdt: 2024-09-06T15:55:14
description: "Correct the time offset and express as secs for the 254 nm data."
project: dataset_EDA
execution_order: '001'
---

# Summary

- Correcting offset
  - The vast majority of samples start at a nonzero time: the zeroth value of the time column ('mins') is an offset. It can be corrected by subtracting that zeroth value element-wise from the 'mins' column of each sample.


# Conclusion
 
The 'mins_corrected' and 'secs_corrected' columns are written to the table 'dataset_eda.nm_254'.

The time offset is the zeroth value. We can correct for it by subtracting it element-wise from the 'mins' column.
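
As a minimal sketch of that correction outside the database, the snippet below applies the same subtraction to a handful of in-memory rows using only the standard library. The function name `correct_offset`, the short ids "a" and "b", and the values for sample "b" are hypothetical and chosen for illustration; the notebook itself does this in DuckDB.

```python
from itertools import groupby
from operator import itemgetter


def correct_offset(rows):
    """Subtract each sample's zeroth 'mins' value and add a seconds column.

    `rows` is an iterable of dicts with keys 'id', 'idx' and 'mins'
    (a hypothetical in-memory stand-in for the database table).
    """
    # Sort so that each sample is contiguous and ordered by acquisition index.
    ordered = sorted(rows, key=itemgetter("id", "idx"))
    corrected = []
    for _, group in groupby(ordered, key=itemgetter("id")):
        group = list(group)
        offset = group[0]["mins"]  # zeroth element of this sample
        for row in group:
            mins_corrected = row["mins"] - offset
            corrected.append(
                {
                    **row,
                    "mins_corrected": mins_corrected,
                    "secs_corrected": mins_corrected * 60,
                }
            )
    return corrected


rows = [
    {"id": "a", "idx": 0, "mins": 0.004367},
    {"id": "a", "idx": 1, "mins": 0.011033},
    {"id": "a", "idx": 2, "mins": 0.0177},
    # Hypothetical second sample with a different offset.
    {"id": "b", "idx": 0, "mins": 0.0010},
    {"id": "b", "idx": 1, "mins": 0.007667},
]
result = correct_offset(rows)
for row in result:
    print(row["id"], row["idx"], round(row["mins_corrected"], 6), round(row["secs_corrected"], 2))
```

Every sample starts at zero after the correction, whatever its original offset was.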

In [1]:
import duckdb as db
import polars as pl
from pca_analysis.experiments.constants import db_path

con = db.connect(db_path, read_only=True)




In [2]:
con.sql(
    """--sql
    select
        id,
        min(mins) as zeroth_mins
    from
        dataset_eda.nm_254
    group by
        id
    order by
        id, zeroth_mins
    """
).pl().plot.scatter(x='id',y='zeroth_mins', title="zeroth 'mins' against 'id'")


Without going into it too deeply, the plot shows that the offset varies from sample to sample in an essentially random way, so it is corrected per sample:


In [6]:
nm_254 = con.sql(
    """--sql
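    SELECT
        -- subtract each sample's zeroth 'mins' value (the offset)
        mins - first(mins) OVER (PARTITION BY id ORDER BY id, idx) as mins_corrected,
        -- express the corrected time in seconds
        mins_corrected * 60 as secs_corrected,
        *,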
    FROM
        pbl.chromatogram_spectra_long
    WHERE
        wavelength = 254
    ORDER BY
        id,
        idx
    """
)

nm_254.pl()


| mins_corrected (f64) | secs_corrected (f64) | idx (i64) | id (str) | mins (f64) | wavelength (i32) | absorbance (f64) |
| --- | --- | --- | --- | --- | --- | --- |
| 0.0 | 0.0 | 0 | "037f76ff-8c25-4e43-b5bd-4530e1…" | 0.004367 | 254 | -0.011832 |
| 0.006667 | 0.4 | 1 | "037f76ff-8c25-4e43-b5bd-4530e1…" | 0.011033 | 254 | -0.012062 |
| 0.013333 | 0.8 | 2 | "037f76ff-8c25-4e43-b5bd-4530e1…" | 0.0177 | 254 | -0.012055 |
| 0.02 | 1.2 | 3 | "037f76ff-8c25-4e43-b5bd-4530e1…" | 0.024367 | 254 | -0.011511 |
| 0.026667 | 1.6 | 4 | "037f76ff-8c25-4e43-b5bd-4530e1…" | 0.031033 | 254 | -0.010625 |
| … | … | … | … | … | … | … |
| 39.966667 | 2398.0 | 5995 | "ff228654-60ac-409b-b5d2-646cfd…" | 39.9677 | 254 | 33.507116 |
| 39.973333 | 2398.4 | 5996 | "ff228654-60ac-409b-b5d2-646cfd…" | 39.974367 | 254 | 33.180915 |
| 39.98 | 2398.8 | 5997 | "ff228654-60ac-409b-b5d2-646cfd…" | 39.981033 | 254 | 32.565385 |
| 39.986667 | 2399.2 | 5998 | "ff228654-60ac-409b-b5d2-646cfd…" | 39.9877 | 254 | 31.754076 |


To confirm that 'mins' has been corrected properly, compare the first and the minimum corrected value within each sample. Any row where the two differ is reported as a failure.

In [7]:
con.sql(
"""--sql
with min_mins AS (
    SELECT
        -- running minimum and first corrected value within each sample, in acquisition order
        min(mins_corrected) OVER (PARTITION BY id ORDER BY id, idx) min_mins_corrected,
        first(mins_corrected) OVER (PARTITION BY id ORDER BY id, idx) min_first_corrected,
    FROM
        nm_254
),
test_min_mins AS (
    SELECT
        *,
        -- a row passes only when the first value is also the minimum
        CASE WHEN min_mins_corrected = min_first_corrected THEN 'pass' ELSE 'fail' END AS test
    FROM
        min_mins
)
SELECT
    *
FROM
    test_min_mins
WHERE
    test = 'fail'
"""
).pl()


| min_mins_corrected (f64) | min_first_corrected (f64) | test (str) |
| --- | --- | --- |


The query returns no failing rows: in every sample, the first corrected 'mins' value is also the minimum, which in this case is zero.
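
The same check can be sketched in plain Python, which makes the pass/fail rule explicit. This version is an illustration under stated assumptions, not the notebook's method: the function name `failing_samples`, the `tol` parameter, and the example rows are hypothetical, and the rule is slightly stricter than the SQL above because it also requires the first value to be zero.

```python
from collections import defaultdict


def failing_samples(rows, tol=0.0):
    """Return ids whose first corrected time (by idx) is not the minimum or not zero."""
    by_id = defaultdict(list)
    for row in rows:
        by_id[row["id"]].append(row)

    failures = []
    for sample_id, group in by_id.items():
        group.sort(key=lambda r: r["idx"])  # acquisition order
        first = group[0]["mins_corrected"]
        smallest = min(r["mins_corrected"] for r in group)
        if first != smallest or abs(first) > tol:
            failures.append(sample_id)
    return sorted(failures)


rows = [
    {"id": "a", "idx": 0, "mins_corrected": 0.0},
    {"id": "a", "idx": 1, "mins_corrected": 0.006667},
    # Hypothetical bad sample: the offset was never removed.
    {"id": "b", "idx": 0, "mins_corrected": 0.004367},
    {"id": "b", "idx": 1, "mins_corrected": 0.011033},
]
print(failing_samples(rows))  # ['b']
```

An empty list corresponds to the empty result returned by the SQL test.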

# Add corrected columns to nm_254

As the corrected 'mins' column passes the test above, we can safely add the 'mins_corrected' and 'secs_corrected' columns to 'nm_254'.


In [8]:
con.sql("""--sql
DESCRIBE dataset_eda.nm_254
""").pl()


| column_name (str) | column_type (str) | null (str) | key (str) | default (str) | extra (str) |
| --- | --- | --- | --- | --- | --- |
| "sample_num" | "BIGINT" | "YES" | null | null | null |
| "idx" | "BIGINT" | "YES" | null | null | null |
| "id" | "VARCHAR" | "YES" | null | null | null |
| "mins" | "DOUBLE" | "YES" | null | null | null |
| "wavelength" | "INTEGER" | "YES" | null | null | null |
| "absorbance" | "DOUBLE" | "YES" | null | null | null |
| "mins_corrected" | "DOUBLE" | "YES" | null | null | null |
| "secs_corrected" | "DOUBLE" | "YES" | null | null | null |


In [43]:
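# close the read-only connection so the database can be reopened for writing
con.close()

with db.connect(db_path) as con:
    display(con.sql("""--sql
    CREATE OR REPLACE TABLE dataset_eda.nm_254 AS
    SELECT
        *,
        -- zeroth 'mins' value of each sample, i.e. the offset
        first(mins) OVER (PARTITION BY id ORDER BY id, idx) as first_min,
        mins - first_min as mins_corrected,
        mins_corrected * 60 as secs_corrected,
        -- reciprocal of elapsed seconds (no value where secs_corrected is 0)
        1/secs_corrected as hz,
    FROM
        dataset_eda.nm_254
    ORDER BY
        id, idx;
    SELECT * FROM dataset_eda.nm_254 LIMIT 10;
    """).pl().head())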


| sample_num (i64) | idx (i64) | id (str) | mins (f64) | wavelength (i32) | absorbance (f64) | first_min (f64) | mins_corrected (f64) | secs_corrected (f64) | hz (f64) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 76 | 0 | "037f76ff-8c25-4e43-b5bd-4530e1…" | 0.004367 | 254 | -0.011832 | 0.004367 | 0.0 | 0.0 | null |
| 76 | 1 | "037f76ff-8c25-4e43-b5bd-4530e1…" | 0.011033 | 254 | -0.012062 | 0.004367 | 0.006667 | 0.4 | 2.5 |
| 76 | 2 | "037f76ff-8c25-4e43-b5bd-4530e1…" | 0.0177 | 254 | -0.012055 | 0.004367 | 0.013333 | 0.8 | 1.25 |
| 76 | 3 | "037f76ff-8c25-4e43-b5bd-4530e1…" | 0.024367 | 254 | -0.011511 | 0.004367 | 0.02 | 1.2 | 0.833333 |
| 76 | 4 | "037f76ff-8c25-4e43-b5bd-4530e1…" | 0.031033 | 254 | -0.010625 | 0.004367 | 0.026667 | 1.6 | 0.625 |


Then verify the corrected columns within the database:

In [11]:
con.sql(
"""--sql
with min_mins AS (
    SELECT
        min(mins_corrected) OVER (PARTITION BY id ORDER BY id, idx) min_mins_corrected,
        first(mins_corrected) OVER (PARTITION BY id ORDER BY id, idx) min_first_corrected,
    FROM
        dataset_eda.nm_254
),
test_min_mins AS (
    SELECT
        *,
        CASE WHEN min_mins_corrected = min_first_corrected THEN 'pass' ELSE 'fail' END AS test
    FROM
        min_mins
)
SELECT
    *
FROM
    test_min_mins
WHERE
    test = 'fail'
"""
).pl()


| min_mins_corrected (f64) | min_first_corrected (f64) | test (str) |
| --- | --- | --- |


The test still passes on the written table. Note that this notebook must be executed after [dataset_EDA](/Users/jonathan/mres_thesis/pca_analysis/pca_analysis/experiments/notebooks/experiments/dataset_description_wavelength_time.ipynb).
