---
cdt: 2024-09-10T16:44:02
title: "Creating Join Tables"
description: "Contains a discussion of, and the code required to create, join tables from 'clean.chm' to 'clean.st' and from 'clean.st' to 'clean.ct'."
project: database_architecture
conclusion: a join table 'joins.chm_st_ct' has been created, consisting of 'pk_chm', 'pk_st', 'pk_ct', and 'sample_num', corresponding to the primary keys of each table and to 'pbl.sample_metadata'. As some samples are missing CT rows, the 'pk_ct' column contains nulls.
---


In [None]:
%reload_ext autoreload
%autoreload 2
import duckdb as db
import polars as pl
from pca_analysis.experiments.constants import db_path
from great_tables import GT
pl.Config.set_tbl_rows(999).set_tbl_width_chars(2000).set_fmt_str_lengths(99999)

con = db.connect(db_path)


## Creation of Join Table CHM to ST

'join_samplecode' connects 'c_chemstation_metadata' and 'c_sample_tracker'. Note: 'join_samplecode' was added manually, so that connection is fragile without the code that created it. That code is somewhere in the 'wine_analysis_hplc_uv' project and will be added here at a later date. In the meantime, you are warned.

Update: it's stored in `wine_analysis_hplc_uv.etl.build_library.chemstation.ch_m_cleaner` and simply consists of some value replacement and formatting, so it shouldn't be difficult to implement here. 'ch_samplecode' is the original samplecode as entered in the Chemstation data files; 'join_samplecode' is the cleaned version of that samplecode.
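
Until the real cleaner is ported, the general shape of "value replacement and formatting" might look like the sketch below. Everything in it is hypothetical: the function name `clean_samplecode`, the normalisation steps (strip and lowercase), and the example replacement mapping are placeholders, not the rules used in `ch_m_cleaner`.

```python
def clean_samplecode(raw: str, replacements: dict[str, str]) -> str:
    """Return a cleaned samplecode (hypothetical rules, not those of ch_m_cleaner)."""
    # Formatting step: trim surrounding whitespace and normalise case.
    code = raw.strip().lower()
    # Value replacement step: map known bad entries to their corrected form,
    # leaving any code without a mapping unchanged.
    return replacements.get(code, code)


# Made-up mapping from a mistyped samplecode to its corrected form.
example_replacements = {"a-01": "a01"}

cleaned = clean_samplecode("  A-01 ", example_replacements)
print(cleaned)
```

Keeping the replacements in a plain mapping would make the manual fixes explicit and reviewable, which addresses the fragility noted above.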

In [None]:
join_tbl = con.sql(
"""--sql
CREATE schema IF NOT EXISTS joins;
CREATE OR REPLACE TABLE joins.chm_st_ct AS (
SELECT
    chm.sample_num as sample_num,
    chm.pk as pk_chm,
    st.pk as pk_st,
    ct.pk as pk_ct
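FROM
    clean.chm as chm
-- inner join: chm rows without a matching st samplecode are dropped
JOIN
    clean.st as st
ON
    chm.st_samplecode = st.samplecode
-- left join: samples without a ct match are kept, with a NULL pk_ct
LEFT JOIN
    clean.ct as ct
ON
    st.ct_wine_name = concat(ct.vintage, ' ',ct.name)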
);
SELECT * FROM joins.chm_st_ct ORDER BY sample_num;
"""
).pl()

join_tbl.head()
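
To see why 'pk_ct' can end up null, here is a minimal sketch of the same join pattern on made-up rows. It is an illustration under assumptions, not the project's data: the toy tables only mimic the columns used in the query above, and the standard-library `sqlite3` module stands in for DuckDB to keep the sketch dependency-free, with `||` standing in for `concat`.

```python
import sqlite3

# Toy stand-ins for clean.chm, clean.st and clean.ct; every row is made up.
con_demo = sqlite3.connect(":memory:")
con_demo.executescript(
    """
    CREATE TABLE chm (pk INTEGER, sample_num INTEGER, st_samplecode TEXT);
    CREATE TABLE st (pk INTEGER, samplecode TEXT, ct_wine_name TEXT);
    CREATE TABLE ct (pk INTEGER, vintage TEXT, name TEXT);
    INSERT INTO chm VALUES (10, 1, 'a01'), (11, 2, 'b02');
    INSERT INTO st VALUES
        (20, 'a01', '2020 example red'),
        (21, 'b02', '2019 example white');
    -- Only the first wine has a ct row.
    INSERT INTO ct VALUES (30, '2020', 'example red');
    """
)

# Same structure as the query above: inner join to st, left join to ct.
demo_rows = con_demo.execute(
    """
    SELECT chm.sample_num, chm.pk AS pk_chm, st.pk AS pk_st, ct.pk AS pk_ct
    FROM chm
    JOIN st ON chm.st_samplecode = st.samplecode
    LEFT JOIN ct ON st.ct_wine_name = ct.vintage || ' ' || ct.name
    ORDER BY chm.sample_num
    """
).fetchall()

print(demo_rows)
# The second sample has no ct match, so its pk_ct comes back as None (NULL).
```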


In [None]:
con.sql(
"""--sql
SELECT
    *
FROM
    join_tbl
WHERE
    pk_ct IS NULL
"""
).pl()


Looks good to me. We can add a readme table to describe this in the schema.

In [None]:
con.sql(
"""--sql
CREATE OR REPLACE TABLE joins.readme (readme VARCHAR);
INSERT INTO joins.readme
    VALUES ('Use joins.chm_st_ct to go between the three tables based on individuals as defined by chm.id.\nEach individual has a corresponding st entry, but currently (2024-09-11) may not have a ct entry.\nThis is symbolised by a null value in the pk_ct column.');
INSERT INTO joins.readme
    VALUES ('joins.chm_st_ct also has sample_num for joins to pbl.sample_metadata');
SELECT * FROM joins.readme
"""
).pl()
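
As a rough sketch of how the readme's advice plays out, the example below uses a bridge table shaped like 'joins.chm_st_ct' to pull columns from two of the source tables. The rows are invented, and `sqlite3` again stands in for DuckDB so the sketch runs without the project database. The column names 'acq_date' and 'name' are taken from this notebook, but their types here are assumptions.

```python
import sqlite3

# Toy bridge table plus two of the tables it links; all values are made up.
con_demo = sqlite3.connect(":memory:")
con_demo.executescript(
    """
    CREATE TABLE chm (pk INTEGER, acq_date TEXT);
    CREATE TABLE ct (pk INTEGER, name TEXT);
    CREATE TABLE chm_st_ct (
        sample_num INTEGER, pk_chm INTEGER, pk_st INTEGER, pk_ct INTEGER
    );
    INSERT INTO chm VALUES (10, '2023-01-01'), (11, '2023-01-02');
    INSERT INTO ct VALUES (30, 'example red');
    INSERT INTO chm_st_ct VALUES (1, 10, 20, 30), (2, 11, 21, NULL);
    """
)

# Start from the bridge table and join outwards on the stored primary keys.
# The LEFT JOIN to ct keeps samples whose pk_ct is NULL.
bridge_rows = con_demo.execute(
    """
    SELECT j.sample_num, chm.acq_date, ct.name
    FROM chm_st_ct AS j
    JOIN chm ON chm.pk = j.pk_chm
    LEFT JOIN ct ON ct.pk = j.pk_ct
    ORDER BY j.sample_num
    """
).fetchall()

print(bridge_rows)
```

Using an inner join to 'ct' instead would silently drop every sample that lacks a CT entry, so the left join is the safer default while 'pk_ct' still contains nulls.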


# Conclusion

A join table 'joins.chm_st_ct' has been created, consisting of 'pk_chm', 'pk_st', and 'pk_ct', corresponding to the primary keys of each table. As some samples are missing CT rows, the 'pk_ct' column contains nulls. A 'sample_num' column derived from the order of 'acq_date' in 'clean.chm' has been added as well, to provide a link to 'pbl.sample_metadata'.
