---
cdt: 2024-09-10T16:27:00
title: Creating the 'clean' Schema and Primary Keys
description: Introduces the 'clean' schema and provides the code necessary for its creation from the source tables. A defining characteristic is the presence of 'pk' primary keys in each table
project: dataset_EDA
---

In [2]:
%reload_ext autoreload
%autoreload 2
import duckdb as db
import polars as pl
from pca_analysis.experiments.constants import db_path
from great_tables import GT
pl.Config.set_tbl_rows(999).set_tbl_width_chars(2000).set_fmt_str_lengths(99999)

con = db.connect(db_path)


[Medallion Architecture](https://www.databricks.com/glossary/medallion-architecture) describes a tiered database structure that runs from raw, messy input data to cleaned, organised and human-readable outputs. From this point of view, our data is still very much in the lower stages. To begin organising the data more cohesively, each table needs a contextless primary key. Adding a primary key column ('pk') to each of the 'c_chemstation_metadata', 'c_sample_tracker' and 'c_cellar_tracker' tables begins this second-stage cleaning process, and the resulting tables are added to a newly created 'clean' schema. Keeping them in a separate schema has two advantages: (a) it avoids crowding the default namespace, and (b) it protects the source data tables from incorrect edits.

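To produce the primary keys we identify a column, or combination of columns, whose values are unique in the table, then apply `dense_rank()` over it (written `rank_dense()` in the queries below, a DuckDB alias) to produce an integer key that increases by one per row in sort order. The key is only unique if the ranked columns are, so uniqueness is checked first. This was done for each table; see below for details.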


# Create 'clean' Schema

Schemas need to be explicitly created before tables can be added to them.

In [4]:
con.sql(
"""--sql
CREATE SCHEMA IF NOT EXISTS clean;
""")


# Adding Primary Keys and Creating 'clean.' Tables



## Sample Tracker


### Observing 'samplecode' as a Primary Key


Is 'samplecode' the primary key of the sample tracker? If it is, the number of distinct values equals the number of rows in the table, so the query below, which lists values that occur more than once, should return no rows.

In [None]:
con.sql(
"""--sql
SELECT
    samplecode,
    count(*) as count
FROM
    c_sample_tracker
GROUP BY
    samplecode
HAVING
    count > 1
"""
).pl()


samplecode,count
str,i64


Yes: the query returns no rows, so 'samplecode' is unique. However, let's add a new 'pk' column based on it. This can be achieved by densely ranking the 'samplecode' column:

In [None]:
con.sql(
"""--sql
SELECT
    samplecode,
    rank_dense() OVER (order by samplecode) as pk
FROM
    c_sample_tracker
LIMIT 3
"""
)


Next, the table is recreated in the 'clean' schema. The destination table needs an explicit column list, which can be read off the intermediate table 'st':


In [None]:
con.sql(
"""--sql
SELECT * FROM st LIMIT 1
"""
).pl().columns


['detection',
 'sampler',
 'samplecode',
 'vintage',
 'name',
 'open_date',
 'sampled_date',
 'added_to_cellartracker',
 'notes',
 'size',
 'ct_wine_name',
 'pk']

In [None]:
con.sql(
"""--sql
-- create an intermediate table with the primary key to avoid editing the source
CREATE OR REPLACE TEMP TABLE st AS
    SELECT
        *,
        rank_dense() OVER (order by samplecode) as pk
    FROM
        c_sample_tracker;

-- create the destination table with the primary key column constraint
CREATE OR REPLACE TABLE clean.st (
    detection VARCHAR,
    sampler VARCHAR,
    samplecode VARCHAR,
    vintage VARCHAR,
    name VARCHAR,
    open_date VARCHAR,
    sampled_date VARCHAR,
    added_to_cellartracker VARCHAR,
    notes VARCHAR,
    size VARCHAR,
    ct_wine_name VARCHAR,
    pk INTEGER PRIMARY KEY,
);

-- populate the new table
INSERT INTO clean.st
SELECT
    *
FROM
    st;

-- cleanup of the intermediate table
DROP TABLE st;
SELECT
    *
FROM
    clean.st
LIMIT 3
"""
).pl()


detection,sampler,samplecode,vintage,name,open_date,sampled_date,added_to_cellartracker,notes,size,ct_wine_name,pk
str,str,str,str,str,str,str,str,str,str,str,i32
"""raw""","""jonathan""","""00""","""2016""","""zema estate 'family selection' cabernet sauvignon""",,,"""y""","""freezer storage. sampled at 21:20 20230122, stored for 10h before transport to lab.""","""750""","""2016 zema estate cabernet sauvignon family selection""",1
"""raw""","""jonathan""","""01""","""2022""","""william downie 'cathedral' pinot noir""",,,"""y""","""ambient 2 weeks. sampled 20230111.""","""750""","""2022 william downie cathedral""",2
"""raw""","""jonathan""","""02""","""2021""","""babo chianti""",,,"""y""","""ambient 2 weeks. sampled 20230111.""","""750""","""2021 babo chianti""",3


The `INSERT` above would have raised an error had any 'pk' value been duplicated, so its success shows that the primary key constraint on the sample tracker holds.

## Chemstation Metadata

The primary key of the 'c_chemstation_metadata' table can be created from the 'id' column, which identifies each of the table's 175 rows. First confirm that 'id' contains no duplicates, then create the 'pk' column and write the result to a new 'clean' table.


In [None]:
if not con.sql(
"""--sql
SELECT
    id,
    count(*) as count
FROM
    c_chemstation_metadata
GROUP BY
    id
HAVING
    count > 1
"""
).pl().is_empty():
    raise ValueError("duplicate detected in table")


In [3]:
con.sql(
"""--sql
-- create an intermediate table to avoid editing the source data and add the primary key
CREATE OR REPLACE TEMP TABLE chm AS
SELECT
    rank_dense() OVER (order by id) as pk,
    id,
    path,
    acq_date,
    acq_method,
    unit,
    signal,
    vendor,
    inj_vol,
    seq_name,
    seq_desc,
    vialnum,
    originalfilepath,
    "desc",
    join_samplecode as st_samplecode,
FROM
    c_chemstation_metadata;

-- create the destination table with the primary key constraint
CREATE OR REPLACE TABLE clean.chm (
    pk INTEGER PRIMARY KEY,
    id VARCHAR UNIQUE,
    path VARCHAR,
    acq_date VARCHAR UNIQUE,
    acq_method VARCHAR,
    unit VARCHAR,
    signal VARCHAR,
    vendor VARCHAR,
    inj_vol VARCHAR,
    seq_name VARCHAR,
    seq_desc VARCHAR,
    vialnum VARCHAR,
    originalfilepath VARCHAR,
    st_samplecode VARCHAR,
    "desc" VARCHAR,
);

-- populate the destination table
INSERT INTO clean.chm
    SELECT
        pk,
        id,
        path,
        acq_date,
        acq_method,
        unit,
        signal,
        vendor,
        inj_vol,
        seq_name,
        seq_desc,
        vialnum,
        originalfilepath,
        st_samplecode,
        "desc"
    FROM
        chm;

-- cleanup
DROP TABLE chm;

SELECT * FROM clean.chm LIMIT 3
"""
).pl()


pk,id,path,acq_date,acq_method,unit,signal,vendor,inj_vol,seq_name,seq_desc,vialnum,originalfilepath,st_samplecode,desc
i32,str,str,str,str,str,str,str,str,str,str,str,str,str,str
1,"""037f76ff-8c25-4e43-b5bd-4530e12cc5a6""","""/users/jonathan/uni/0_jono_data/mres_data_library/raw_uv/085.d""","""2023-04-04 19:11:36""","""avantor100x4_6c18-h2o-meoh-2_5.m""","""mau""","""dad1i, dad: spectrum""","""agilent""","""10.00""","""2023-04-04_wines_2023-04-04_12-01-53""",,"""vial 9""","""c:\chem32\1\data\0_jono_data\2023-04-04_wines_2023-04-04_12-01-53""","""85""",
2,"""03d2138b-5aad-4f81-af22-698e97ab28dc""","""/users/jonathan/uni/0_jono_data/mres_data_library/cuprac_wines_2023-05-22 2023-05-22 10-24-28/008-2701.d""","""2023-05-23 04:23:53""","""0_cuprac_3_16_40-mins-4min100%hold.m""","""mau""","""dad1i, dad: spectrum""","""agilent""","""5.00""","""cuprac_wines_2023-05-22 2023-05-22 10-24-28""","""cuprac version of the ambient daily repeat runs.""","""vial 8""","""c:\chem32\3\data\cuprac_wines_2023-05-22 2023-05-22 10-24-28""","""ca0101""",
3,"""06902f86-0024-418d-b449-79843f96bf09""","""/users/jonathan/uni/0_jono_data/mres_data_library/raw_uv/096.d""","""2023-04-13 12:00:29""","""avantor100x4_6c18-h2o-meoh-2_5_44-mins.m""","""mau""","""dad1i, dad: spectrum""","""agilent""","""10.00""","""2023-04-13_wines_2023-04-13_11-59-01""",,"""vial 1""","""c:\chem32\1\data\0_jono_data\2023-04-13_wines_2023-04-13_11-59-01""","""96""",


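The destination table is populated column by column rather than with `SELECT *` because the intermediate table lists 'desc' before 'st_samplecode' while 'clean.chm' lists them in the opposite order; since both are VARCHAR, a positional insert would swap them silently. The chemstation metadata is now written to the 'clean' schema. Connecting it to the sample tracker will need a join table.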


### Clean CHM Description


In [4]:
con.sql(
"""--sql
DESCRIBE clean.chm
"""
).pl()


column_name,column_type,null,key,default,extra
str,str,str,str,str,str
"""pk""","""INTEGER""","""NO""","""PRI""",,
"""id""","""VARCHAR""","""YES""","""UNI""",,
"""path""","""VARCHAR""","""YES""",,,
"""acq_date""","""VARCHAR""","""YES""","""UNI""",,
"""acq_method""","""VARCHAR""","""YES""",,,
"""unit""","""VARCHAR""","""YES""",,,
"""signal""","""VARCHAR""","""YES""",,,
"""vendor""","""VARCHAR""","""YES""",,,
"""inj_vol""","""VARCHAR""","""YES""",,,
"""seq_name""","""VARCHAR""","""YES""",,,



### Constraints Table


In [None]:
con.sql(
"""--sql
SELECT
    schema_name,
    table_name,
    constraint_column_names,
    constraint_type
FROM
    duckdb_constraints()
WHERE
    schema_name = 'clean'
AND
    table_name = 'chm'
"""
).pl()


Next, 'clean.chm' needs a means of connecting to the sample tracker table.

In [None]:
join_tbl = con.sql(
"""--sql
SELECT
    st.ct_wine_name as pk_st_to_ct,
    jt.pk_chm_to_st,
    jt.pk_id
FROM
    c_sample_tracker as st
JOIN
    join_tbl as jt
ON
    jt.pk_chm_to_st = st.samplecode
"""
).pl()

join_tbl.head()


In [None]:
con.sql(
"""--sql
SELECT
    *
FROM
    
"""
).pl()


In [None]:
con.sql(
"""--sql
select table_schema, table_name, column_name from information_schema.columns WHERE table_name = 'c_sample_tracker'
"""
).pl()['column_name'].to_list()


In [None]:
con.sql(
"""--sql
SELECT
    wine
FROM
    c_cellar_tracker
"""
).pl()


In [None]:
con.sql(
"""--sql
SELECT
    ct_wine_name
FROM
    c_sample_tracker
"""
).pl()


'c_chemstation_metadata' has 175 rows but 174 distinct 'join_samplecode' values. This is the key to join with 'c_sample_tracker'. Is there a duplicate?

In [None]:
con.sql(
"""--sql
select
    count(distinct join_samplecode)
from
    c_chemstation_metadata
"""
).pl()


In [None]:
con.sql(
"""--sql
select
    *
from
    c_cellar_tracker as ct
JOIN
    c_sample_tracker as st
ON
    ct.wine = st.ct_wine_name
"""
).pl()


In [None]:
con.sql(
"""--sql
SELECT
    
"""
).pl()


In [None]:
con.sql(
"""--sql
SELECT
    vintage,
    count(vintage) as count
FROM
    pbl.sample_metadata
JOIN
    c_a
    
"""
).pl()


## Cellar Tracker

To form a join table between ST and CT it is best to create a new primary key on cellar tracker (CT).


### CT Duplicate Row


CT has one duplicated row, for the wine 'Mumm Tasmania Brut Prestige', which has no vintage.

In [None]:
con.sql(
"""--sql
SELECT
    vintage,
    name,
    count(*) as duplicate_count
FROM
    c_cellar_tracker
GROUP BY
    name, vintage
HAVING
    count(*) > 1
"""
).pl()


vintage,name,duplicate_count
str,str,i64
,"""mumm tasmania brut prestige""",2


This can be remedied by selecting distinct rows only in the query.

In [None]:
con.sql(
"""--sql
-- create an intermediate table including the primary key to avoid editing the source table.
CREATE OR REPLACE TEMP TABLE ct AS
    SELECT
        size,
        vintage,
        name,
        locale,
        country,
        region,
        subregion,
        appellation,
        producer,
        type,
        color,
        category,
        varietal,
        wine,
        rank_dense() OVER (order by vintage, name) as pk,
    FROM
        (
            SELECT
                -- remove the duplicated Mumm NV row; DISTINCT compares entire rows, so this relies on both copies being identical in every column
                DISTINCT concat(vintage, name),
                size,
                vintage,
                name,
                locale,
                country,
                region,
                subregion,
                appellation,
                producer,
                type,
                color,
                category,
                varietal,
                wine,
            FROM
                c_cellar_tracker
            );

-- create the destination table
CREATE OR REPLACE TABLE clean.ct (
    size VARCHAR,
    vintage VARCHAR,
    name VARCHAR,
    locale VARCHAR,
    country VARCHAR,
    region VARCHAR,
    subregion VARCHAR,
    appellation VARCHAR,
    producer VARCHAR,
    type VARCHAR,
    color VARCHAR,
    category VARCHAR,
    varietal VARCHAR,
    wine VARCHAR,
    pk INTEGER PRIMARY KEY
);

-- populate the destination table
INSERT INTO clean.ct (
SELECT
    size,
    vintage,
    name,
    locale,
    country,
    region,
    subregion,
    appellation,
    producer,
    type,
    color,
    category,
    varietal,
    wine,
    pk
FROM
    ct
);

-- cleanup
DROP TABLE ct;

SELECT * FROM clean.ct LIMIT 3
"""
).pl()


size,vintage,name,locale,country,region,subregion,appellation,producer,type,color,category,varietal,wine,pk
str,str,str,str,str,str,str,str,str,str,str,str,str,str,i32
"""750ml""","""2008""","""torbreck descendant""","""australia, south australia, barossa, barossa valley""","""australia""","""south australia""","""barossa""","""barossa valley""","""torbreck""","""red""","""red""","""dry""","""shiraz blend""","""2008 torbreck descendant""",1
"""750ml""","""2009""","""st hugo cabernet sauvignon coonawarra""","""australia, south australia, limestone coast, coonawarra""","""australia""","""south australia""","""limestone coast""","""coonawarra""","""st hugo""","""red""","""red""","""dry""","""cabernet sauvignon""","""2009 st hugo cabernet sauvignon coonawarra""",2
"""750ml""","""2013""","""woodlands cabernet merlot""","""australia, western australia, south west australia, margaret river""","""australia""","""western australia""","""south west australia""","""margaret river""","""woodlands""","""red""","""red""","""dry""","""red bordeaux blend""","""2013 woodlands cabernet merlot""",3


The 'clean.ct' table now exists with a primary key constraint on 'pk'.


# Conclusion

In conclusion, the following tables were created in the 'clean' schema: (1) 'clean.st', (2) 'clean.chm' and (3) 'clean.ct', corresponding to the 'c_sample_tracker', 'c_chemstation_metadata' and 'c_cellar_tracker' tables respectively. The primary key of (1) was created from the 'samplecode' column, that of (2) from the 'id' column, and that of (3) from the combination of the 'vintage' and 'name' columns.
