---
cdt: 2024-09-15T14:57:44
title: Normalising `bin_pump`
description: "the `bin_pump` schema tables are a mess and need normalisation in order to extract useful information such as the gradients"
conclusion: "`solvcomps` and `timetables` were rearranged into `pump_mech_params`, `solvents` and `solvprop_over_time`, with the latter containing the gradient elution program. Useless or superfluous columns were removed"
status: closed
project: database_architecture
dependency: '[Extracting Binary Pump Tables](./extracting_binary_pump_tables.ipynb)'
---

In [None]:
# environment

%reload_ext autoreload
%autoreload 2

import duckdb as db
import polars as pl
from database_etl.definitions import DB_PATH

pl.Config.set_fmt_str_lengths(9999)
pl.Config.set_tbl_rows(9999)
con = db.connect(DB_PATH)

# Describing `bin_pump`

`bin_pump` consists of three tables: `id`, `solvcomps`, and `timetables`. `id` contains a primary key `id` and a unique column `num`, which is the intended foreign key for `timetables` and `solvcomps`. However, the corresponding `num` columns are not physically constrained because they are stored in wide form with duplicate rows. One goal of this exercise is to reshape the tables so that the constraints can be correctly enforced.
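
The constraint problem can be reproduced on toy data. The sketch below uses the standard-library `sqlite3` module as a stand-in for DuckDB so that it runs without the project database; the table `toy`, the helper `try_insert`, and the values are made up for illustration.

```python
import sqlite3


def try_insert(rows):
    """Insert (num, channel) rows into a table whose `num` is a primary key."""
    toy_con = sqlite3.connect(":memory:")
    toy_con.execute("create table toy (num integer primary key, channel text)")
    try:
        toy_con.executemany("insert into toy values (?, ?)", rows)
        return "ok"
    except sqlite3.IntegrityError:
        return "constraint violated"
    finally:
        toy_con.close()


# one row per run: the key constraint holds
one_row_per_run = try_insert([(1, "a"), (2, "a")])
# several rows per run: `num` repeats, so the key constraint fails
repeated_rows = try_insert([(1, "a"), (1, "b")])
```

As long as a key value appears on more than one row, the column cannot be declared a primary key, which is why the tables are reshaped first.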

In [None]:
con.sql(
    """--sql
show tables
"""
).pl()

In [None]:
con.sql(
    """--sql
SELECT 'solvcomps' as table_name, * FROM (DESCRIBE solvcomps)
UNION
SELECT 'timetables' as table_name, * FROM (DESCRIBE timetables)
"""
).pl()

OK, so at least `id` has the primary key constraint.

I assume this is because I stored them as wide tables?

As we can see, in all cases solvent A is water, and solvent B is methanol.

In [None]:
con.sql(
    """--sql
SELECT
    *
FROM
    timetables
LIMIT 3
"""
).pl()

# Normalising `bin_pump`


## Timetables


`a` and `b` hold a different value in each row, one per channel, whereas `flow` and `pressure` apply to both channels at a given time. Thus `flow` and `pressure` should be moved to their own table.

In [None]:
con.sql(
    """--sql
describe timetables
"""
).pl()

### Table: `pump_mech_params`


Now the mechanical parameters `flow` and `pressure` can be moved to their own table, `pump_mech_params` (named `bin_pump_mech_params` in the database).

In [None]:
con.sql(
    """--sql
select
    *
from
    timetables
order by pk, idx
limit
    5
"""
).pl()

In [None]:
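def create_pump_mech_params(
    con: db.DuckDBPyConnection, overwrite: bool = False
) -> None:
    """Create `bin_pump_mech_params` with one row of flow and pressure per run.

    Raises a `ConstraintException` if any `pk` in `timetables` has more than
    one distinct (flow, pressure) pair. Without `overwrite`, an existing
    table is left untouched and the `create table` fails.
    """
    if overwrite:
        con.execute(
            """--sql
            drop table if exists bin_pump_mech_params;
            """
        )
    con.execute(
        """--sql
    create table bin_pump_mech_params (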
    pk INTEGER primary key references chm(pk),
    flow FLOAT,
    pressure FLOAT,
    );
    insert into bin_pump_mech_params
        select
            distinct pk,
            flow,
            pressure
        from
            timetables
        order by
            pk;
    """
    )


try:
    create_pump_mech_params(con=con, overwrite=True)
except db.ConstraintException as e:
    con.close()
    del con
    raise e

con.sql(
    """--sql
select
    *
from
    bin_pump_mech_params
limit 5
"""
).pl()
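
The insert above only succeeds if `flow` and `pressure` take a single value per run, since `pk` is the primary key of the new table. A minimal sketch of that check on made-up rows, using the standard-library `sqlite3` module in place of DuckDB (the helper name `runs_with_varying_params` is hypothetical):

```python
import sqlite3


def runs_with_varying_params(rows):
    """Return the pks whose (flow, pressure) pair is not constant across rows."""
    toy_con = sqlite3.connect(":memory:")
    toy_con.execute(
        "create table timetables (pk integer, idx integer, flow real, pressure real)"
    )
    toy_con.executemany("insert into timetables values (?, ?, ?, ?)", rows)
    result = toy_con.execute(
        """
        select pk
        from (select distinct pk, flow, pressure from timetables)
        group by pk
        having count(*) > 1
        order by pk
        """
    ).fetchall()
    toy_con.close()
    return [pk for (pk,) in result]


toy_rows = [
    (1, 0, 1.0, 400.0),
    (1, 1, 1.0, 400.0),  # run 1: constant, safe to collapse to one row
    (2, 0, 1.0, 400.0),
    (2, 1, 0.5, 400.0),  # run 2: flow changes, would violate the primary key
]
varying = runs_with_varying_params(toy_rows)
```

An empty result means every run collapses to exactly one row; any `pk` returned would trigger the `ConstraintException` handled above.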

Then we can create another table specific to the channels.

### Table: `channels`


In [None]:
con.sql(
    """--sql
select * from timetables limit 5
"""
).pl()

In [None]:
def create_channels(
    con: db.DuckDBPyConnection,
) -> None:
    con.sql(
        """--sql
    CREATE OR REPLACE TABLE
    channels (
        pk integer references chm(pk),
        time double not null,
        channel varchar not null,
        percent double not null,
        primary key (pk, time, channel)
    );
    insert into channels
        unpivot (
            select
                pk,
                time,
                a,
                b,
            from
                timetables
                )
        on
            a, b
        into
            name
                channel
            value
                percent
        order by
            pk,
            time,
            channel
"""
    )


create_channels(con=con)

con.sql(
    """--sql
select
    *
from
    channels
limit 5
"""
).pl()

Thus the columns in `timetables` have been better organised.
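
The reshaping performed by `unpivot` can be illustrated without a database. A minimal pure-Python sketch on made-up rows (the function name `unpivot_channels` is hypothetical; it mirrors the DuckDB statement rather than replacing it):

```python
def unpivot_channels(rows):
    """Turn wide rows (pk, time, a, b) into long rows (pk, time, channel, percent)."""
    long_rows = []
    for pk, time, a, b in rows:
        long_rows.append((pk, time, "a", a))
        long_rows.append((pk, time, "b", b))
    # same ordering as the query: pk, time, channel
    return sorted(long_rows, key=lambda r: (r[0], r[1], r[2]))


wide = [(1, 5.0, 90.0, 10.0), (1, 10.0, 50.0, 50.0)]
long = unpivot_channels(wide)
# collect the (pk, time, channel) triples to check that they can act as a key
keys = [(pk, time, channel) for pk, time, channel, _ in long]
```

Each wide row yields one long row per channel, so the triple `(pk, time, channel)` is unique and can serve as the composite primary key.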

## Solvcomps

Now, can we do the same to `solvcomps`?

Well, ch2, or channel 2, is not used, and I don't believe it ever is, so we can drop `ch2_solv`, `selected`, and `name_2`. If that is the case, then `used` and `selected` are also redundant.


In [None]:
con.sql(
    """--sql
SELECT
    *
FROM
    solvcomps
LIMIT
    5
"""
).pl()

In [None]:
con.sql(
    """--sql
SELECT
    'selected' as col,
    unique_vals
FROM (
    SELECT
        distinct selected as unique_vals
    FROM
        solvcomps
        )
UNION
SELECT
    'used' as col,
    unique_vals
FROM (
    SELECT
        distinct used as unique_vals
    FROM
        solvcomps
)
UNION
SELECT
    'name_1' as col,
    unique_vals
FROM (
    SELECT
        distinct name_1 as unique_vals
    FROM
        solvcomps
        )
UNION
SELECT
    'name_2' as col,
    unique_vals
FROM (
    SELECT
        distinct name_2 as unique_vals
    FROM
        solvcomps
        )

"""
).pl()

So we can remove those columns without worry.


That leaves two kinds of information, `ch1_solv` and `percent`, of which `ch1_solv` is repeated redundantly.

Moving `ch1_solv` into its own table would be advisable.

### Table: `solvents`


In [None]:
def create_solvents(con: db.DuckDBPyConnection) -> None:
    con.execute(
        """--sql
    CREATE OR REPLACE TABLE solvents (
        pk integer primary key references chm(pk),
        a varchar,
        b varchar,
    );
    insert into solvents
        pivot
            (select
                pk,
                lower(channel) as channel,
                ch1_solv
            from
                solvcomps
                )
        on
            channel
        using
            first(ch1_solv)
        order by
            pk
    """
    )


create_solvents(con=con)

con.execute(
    """--sql
SELECT
    *
FROM
    solvents
LIMIT
    5
"""
).pl()
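
DuckDB's `pivot` performs this reshaping directly. As a portable illustration of the same idea, the sketch below uses conditional aggregation with the standard-library `sqlite3` module on made-up rows; the toy table only borrows the column names of `solvcomps`.

```python
import sqlite3

toy_con = sqlite3.connect(":memory:")
toy_con.execute("create table solvcomps (pk integer, channel text, ch1_solv text)")
toy_con.executemany(
    "insert into solvcomps values (?, ?, ?)",
    [(1, "A", "water"), (1, "B", "methanol"), (2, "A", "water"), (2, "B", "methanol")],
)
# one row per pk, one column per channel
solvents = toy_con.execute(
    """
    select
        pk,
        max(case when lower(channel) = 'a' then ch1_solv end) as a,
        max(case when lower(channel) = 'b' then ch1_solv end) as b
    from solvcomps
    group by pk
    order by pk
    """
).fetchall()
toy_con.close()
```

Because each `pk` now appears on exactly one row, the primary key constraint on `solvents` can hold.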

In [None]:
con.sql(
    """--sql
show tables
"""
).pl()

## Combining `channels` and `zero_percents` into `solvprop_over_time`


Finally, the initial proportions of A and B, stored in `solvcomps`, should be added to the `channels` table as its zero-time values. Together they form the complete timetable.
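
A minimal sketch of this step on made-up values: each run's initial composition becomes a row at `time = 0`, which is then combined with the timetable rows. The function name `with_zero_time` is hypothetical.

```python
def with_zero_time(channels, initial):
    """Combine timetable rows with initial compositions stamped at time 0.

    channels: iterable of (pk, time, channel, percent)
    initial:  iterable of (pk, channel, percent), the initial composition
    """
    zero_rows = {
        (pk, 0.0, channel.lower(), percent) for pk, channel, percent in initial
    }
    # a set union drops exact duplicates, like SQL UNION
    combined = set(channels) | zero_rows
    return sorted(combined)


channels = [(1, 5.0, "a", 50.0), (1, 5.0, "b", 50.0)]
initial = [(1, "A", 95.0), (1, "B", 5.0)]
program = with_zero_time(channels, initial)
```

If the timetable already held a zero-time row with a different percentage, the combined rows would no longer be unique on `(pk, time, channel)`, so the primary key doubles as a consistency check.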


In [None]:
def load_solvprop_over_time(
    con: db.DuckDBPyConnection, overwrite: bool = False
) -> None:
    if overwrite:
        con.execute(
            """--sql
        drop table if exists solvprop_over_time;
        """
        )
    con.execute(
        """--sql
        CREATE TABLE solvprop_over_time (
            pk integer references chm(pk),
            time double,
            channel varchar,
            percent double,
            primary key (pk, time, channel)
        );

        INSERT INTO solvprop_over_time (
            with solvcomp_subset as (
            select
                pk,
                CAST(0 as double) as time,
                lower(channel) as channel,
                percent
            FROM
                solvcomps
            ORDER BY
                pk,
                time,
                channel
            )
            select
                pk,
                time,
                channel,
                percent
            from channels
            UNION
                select
                    pk,
                    time,
                    channel,
                    percent
                from
                    solvcomp_subset
            )
        """
    )

    if not con.execute(
        """--sql
    select
        *
    from
        solvprop_over_time
    """
    ).fetchall():
        raise ValueError("after insert query, solvprop_over_time is empty")


def create_solvprop_over_time(
    con: db.DuckDBPyConnection, overwrite: bool = False
) -> None:
    load_solvprop_over_time(con=con, overwrite=overwrite)

    con.sql(
        """--sql
    DROP TABLE if exists channels;
    """
    )


create_solvprop_over_time(con=con, overwrite=True)

con.execute(
    """--sql
select * from solvprop_over_time limit 5
"""
).pl().pipe(display)

In [None]:
con.execute(
    """--sql
select
    *
from
    solvprop_over_time
order by
    pk,
    time,
    channel
limit 10
"""
).pl().pipe(display)

con.sql(
    """--sql
show tables
"""
).pl().pipe(display)
con.close()
del con
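
With the program in long form, the rate of change of a channel between consecutive time points follows directly. A minimal sketch, assuming the rows for one run and one channel have been reduced to `(time, percent)` pairs; `segment_slopes` is a hypothetical helper and the values are made up.

```python
def segment_slopes(points):
    """Return (t_start, t_end, slope in percent per unit time) for each segment."""
    ordered = sorted(points)
    slopes = []
    for (t0, p0), (t1, p1) in zip(ordered, ordered[1:]):
        if t1 == t0:
            continue  # skip duplicate time points to avoid division by zero
        slopes.append((t0, t1, (p1 - p0) / (t1 - t0)))
    return slopes


# made-up program for channel b: 5 % at t=0, 100 % at t=38, held until t=40
b_program = [(0.0, 5.0), (38.0, 100.0), (40.0, 100.0)]
slopes = segment_slopes(b_program)
```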

# Conclusion

We started with three tables:

1. `timetables`
2. `solvcomps`
3. `id`

The table `timetables` was decomposed into `pump_mech_params`, which contains the flow and pressure of each sample run, and `channels`, which contains the percentage of each solvent channel over time. `solvcomps` was decomposed into `solvents`, containing the label of the solvent in each channel, and `zero_percents`, the proportions of the solvents at the zero point. The latter was combined with `channels` to form `solvprop_over_time`, the complete solvent composition change over time for the elution. Thus the final tables are `pump_mech_params`, `solvents`, and `solvprop_over_time`.