---
cdt: '2024-09-14T00:14:55'
title: 'Division of raw samples by gradient'
project: 'raw_dataset_EDA'
description: 'look at division of raw samples by solvent gradient'
status: 'closed'
conclusion: "2 gradients were identified within the raw dataset: 2.5 ul/min with 98 samples and 3.17 ul/min with 6 samples. The 3.17 gradient samples were moved to 'database_etl/data/raw_uv_3.17_grad'."
dependency: '[Normalising bin_pump](./normalising_ipynb)'
---

We need to determine how many distinct elution gradients exist, which samples fall into each, and then label them.


In [11]:
# environment

%reload_ext autoreload
%autoreload 2

import duckdb as db
import polars as pl
from database_etl.definitions import DB_PATH

pl.Config.set_fmt_str_lengths(9999)
pl.Config.set_tbl_rows(9999)
con = db.connect(DB_PATH)

con.sql(
    """--sql
show tables
"""
).pl()


name
str
"""bin_pump_mech_params"""
"""chm"""
"""counts_per_gradient"""
"""ct"""
"""excluded"""
"""gradients"""
"""image_stats"""
"""inc_chm"""
"""inc_img_stats"""
"""raw_gradients_3_17"""


# Calculate Gradient by Sample

Now that the data is organised correctly, we can calculate each sample's gradient as the rate of change of channel 'b' percent, from its starting value to the point where channel 'b' reaches 100%.
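As a worked illustration of that arithmetic, here is a minimal pure-Python sketch, separate from the notebook's DuckDB pipeline. It assumes each sample's channel 'b' timetable starts at time 0, so that dividing the percent change by the time of the second point gives the slope. The function name and the example points are hypothetical, chosen only to reproduce gradients of 2.5 and 3.167; they are not rows read from the database.

```python
def gradient_from_points(points):
    """Slope of channel 'b' between its first two timetable points.

    points: iterable of (time, percent) tuples for a single sample.
    The percent change between the first and second points is divided by
    the time of the second point, which assumes the first point is at time 0.
    """
    ordered = sorted(points)  # order by time, like the ranking step in SQL
    if len(ordered) < 2:
        raise ValueError("need at least two points to compute a gradient")
    (_, first_percent), (second_time, second_percent) = ordered[0], ordered[1]
    if second_time == 0:
        raise ValueError("second point must have a non-zero time")
    return (second_percent - first_percent) / second_time


# Hypothetical timetables: 5% -> 100% over 38 or 30 time units.
slow = [(0.0, 5.0), (38.0, 100.0), (40.0, 100.0), (52.0, 5.0)]
fast = [(0.0, 5.0), (30.0, 100.0)]

print(gradient_from_points(slow))            # 2.5
print(round(gradient_from_points(fast), 3))  # 3.167
```

Only the first two points matter here, so any later hold or re-equilibration steps in the timetable do not affect the result.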

In [12]:
try:
    con.sql(
        """--sql
    from solvprop_over_time limit 5
    """
    ).pl().pipe(display)
except Exception as e:
    con.close()
    del con
    raise e


pk,time,channel,percent
i32,f64,str,f64
1,38.0,"""a""",0.0
2,40.0,"""a""",0.0
3,52.0,"""b""",5.0
5,40.0,"""b""",100.0
5,52.0,"""b""",5.0


In [13]:
con.sql(
    """--sql
show tables
"""
).pl()


name
str
"""bin_pump_mech_params"""
"""chm"""
"""counts_per_gradient"""
"""ct"""
"""excluded"""
"""gradients"""
"""image_stats"""
"""inc_chm"""
"""inc_img_stats"""
"""raw_gradients_3_17"""


In [14]:
def calculate_gradient(con: db.DuckDBPyConnection) -> None:
    """Create the `gradients` view (one gradient per sample `pk`) from `solvprop_over_time`."""
    if not con.sql(
        """--sql
    select
        *
    from
        solvprop_over_time
    """
    ).fetchall():
        raise ValueError("solvprop_over_time is empty")
    con.sql(
        """--sql
    CREATE OR REPLACE view gradients AS (
        select
            pk,
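            -- slope: percent change between the first two channel 'b' points, divided by the time of the second point
            percent_diff/time as gradient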
        from (
            select
                pk,
                idx,
                time,
                percent,
                percent - lag(percent) OVER (PARTITION BY pk ORDER BY idx) as percent_diff,
                --lag(percent) as percent_shift,
            from  (
                select
                pk,
                dense_rank() OVER (partition by pk order by time) as idx,
                time,
                percent,
                from
                    solvprop_over_time
                where
                    channel = 'b'
            )
        )
        where
            idx = 2
        ORDER BY
            pk,
            time
    );
    select
        *
    from
        gradients
    limit 5 
    """
    ).pl().pipe(display)

    if not con.sql(
        """--sql
    select
        *
    from
        gradients
    """
    ).fetchall():
        raise ValueError("gradients is empty")


try:
    calculate_gradient(con=con)
except Exception as e:
    con.close()
    del con
    raise e


pk,gradient
i32,f64
1,2.5
2,2.5
3,2.5
5,2.5
6,2.5


In [15]:
con.sql(
    """--sql
create or replace view counts_per_gradient as (
    select
        gradient,
        count(*) as count
    from
        gradients
    group by gradient
    order by
        gradient
    )
"""
)

con.sql(
    """--sql
select
    *
from
    counts_per_gradient
"""
).pl().pipe(display)

con.close()
del con


gradient,count
f64,i64
2.5,95


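The analysis identified two gradient groups, 3.167 and 2.5. Thankfully, the 2.5 group contains 98 samples versus only 6 in the 3.167 group. (The output shown above lists only the 2.5 group, presumably because the notebook was last run after the 3.167 samples had been moved out.)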

Now, which samples are in the 3.167 group?
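One way to answer this is to bucket the sample keys by their rounded gradient and read off the members of the smaller bucket. The sketch below does this in plain Python over `(pk, gradient)` pairs, the shape of the `gradients` view. The helper name and the `pk` values in the example are hypothetical, not the actual samples.

```python
from collections import defaultdict


def group_by_gradient(rows, ndigits=2):
    """Map each rounded gradient to the list of sample pks that share it.

    rows: iterable of (pk, gradient) pairs.
    Rounding avoids splitting one group over tiny floating point differences.
    """
    groups = defaultdict(list)
    for pk, gradient in rows:
        groups[round(gradient, ndigits)].append(pk)
    return dict(groups)


# Hypothetical rows: pks 101 and 102 stand in for samples on the steeper gradient.
rows = [(1, 2.5), (2, 2.5), (101, 95 / 30), (102, 95 / 30)]
groups = group_by_gradient(rows)

print(groups[2.5])   # [1, 2]
print(groups[3.17])  # [101, 102]
```

The resulting pks can then be joined back to the sample metadata to recover acquisition dates and project names.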

The 3.167 group consists of samples from 04-13 and 04-21, including the first run of the 'wine deg' project. That is all we need to know for now. They should be added to the excluded samples. These samples have been moved to the directory 'database_etl/data/raw_uv_3.17_grad' and can be revisited at a later time.
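For reference, a move like this can be scripted rather than done by hand. The following is a minimal sketch using only the standard library. The function name, the `.D` sample names, and the temporary directories are all hypothetical; it does not reproduce how the files were actually moved.

```python
import shutil
import tempfile
from pathlib import Path


def move_samples(sample_names, src_dir, dst_dir):
    """Move the named files or directories from src_dir into dst_dir.

    Returns (moved, missing) so that absent samples are reported, not ignored.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    moved, missing = [], []
    for name in sample_names:
        src = src_dir / name
        if not src.exists():
            missing.append(name)
            continue
        shutil.move(str(src), str(dst_dir / name))
        moved.append(name)
    return moved, missing


# Demo on throwaway temporary directories (hypothetical sample names).
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw_uv"
    raw.mkdir()
    for name in ("sample_a.D", "sample_b.D"):
        (raw / name).mkdir()
    target = Path(tmp) / "raw_uv_3.17_grad"

    moved, missing = move_samples(["sample_a.D", "not_there.D"], raw, target)
    remaining = sorted(p.name for p in raw.iterdir())
    relocated = sorted(p.name for p in target.iterdir())

print(moved, missing, remaining, relocated)
```

Returning the missing names makes it easy to check that every sample in the 3.167 group was actually found before the move.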

# Conclusion

Out of the 104 raw samples, 6 were identified as having been run on a different gradient, 3.17 ul/min, compared to the majority at 2.5 ul/min. They were moved out of the raw data library to the directory 'database_etl/data/raw_uv_3.17_grad'.