---
project: dataset_EDA
title: Raw Dataset Description
cdt: 2024-09-09T23:44:30
description: a descriptive analysis of the raw dataset
---


# Sample Metadata

We define the metadata as all categorical information pertaining to the character of the data: the wine name, geographic origin, variety, producer, vintage, etc.


In [1]:
%reload_ext autoreload
%autoreload 2
import duckdb as db
import polars as pl
from pca_analysis.experiments.constants import db_path
from great_tables import GT
pl.Config.set_tbl_rows(999).set_tbl_width_chars(2000).set_fmt_str_lengths(99999)

con = db.connect(db_path, read_only=True)




- ~~how many samples~~
- ~~what colors~~
- ~~what varieties~~
- what vintages


In [2]:
sm = con.sql(
"""--sql
SELECT
    *
FROM
    pbl.sample_metadata
ANTI JOIN
    (SELECT sample_num FROM dataset_eda.excluded_samples)
USING
    (sample_num)
WHERE
    detection = 'raw'
"""
).pl()
sm


detection,acq_date,wine,color,varietal,samplecode,id,sample_num
str,str,str,str,str,str,str,i64
"""raw""","""2023-02-15 15:21:28""","""2022 william downie cathedral""","""red""","""pinot noir""","""05""","""4fe49506-74e4-473b-b7a8-23500c…",1
"""raw""","""2023-02-15 16:15:09""","""2021 babo chianti""","""red""","""sangiovese""","""06""","""e56c4dcd-2847-4d34-b457-743be1…",2
"""raw""","""2023-02-15 17:08:54""","""2020 boutinot uva non grata""","""red""","""gamay""","""07""","""5eb3135c-33a2-404b-8042-e23cae…",3
"""raw""","""2023-02-15 18:02:36""","""2021 matias riccitelli malbec …","""red""","""malbec""","""08""","""8cfa23c8-ffa6-4c27-be70-0251d4…",4
"""raw""","""2023-02-15 18:56:22""","""2018 crawford river cabernets""","""red""","""red bordeaux blend""","""09""","""38601b0b-5338-4154-9f04-cf85b8…",5
…,…,…,…,…,…,…,…
"""raw""","""2023-05-09 01:26:57""","""2015 clos du marquis""","""red""","""red bordeaux blend""","""110""","""7311903b-1d99-4dd9-b41d-8e5755…",98
"""raw""","""2023-05-09 02:20:25""","""2022 clonakilla shiraz oriada""","""red""","""shiraz""","""111""","""ca0b15df-e1e0-48fa-a28c-9c4b01…",99
"""raw""","""2023-05-09 05:00:53""","""2019 bodega catena zapata malb…","""red""","""malbec""","""114""","""839df657-69a3-4935-a30b-3bd17e…",102
"""raw""","""2023-05-09 05:54:23""","""2022 bleeding heart shiraz""","""red""","""shiraz""","""115""","""d4d65f88-a5db-4ac9-9dca-520fcc…",103


## Columns


In [4]:
sm.columns


['detection',
 'acq_date',
 'wine',
 'color',
 'varietal',
 'samplecode',
 'id',
 'sample_num']

The sample metadata table contains the following columns:

- 'detection': the detection method of the sample signal.
- 'acq_date': the date and time the sample was observed.
- 'wine': the name of the wine from which the sample was taken.
- 'color': the color of the wine.
- 'varietal': the grape variety the wine was made from.
- 'samplecode': the unique identifier assigned to the sample at the time of observation.
- 'id': a unique identifier hash generated by the Agilent Chemstation software, used to join the metadata to the signal table.
- 'sample_num': a human-readable unique identifier, increasing monotonically from 1 in ascending order of 'acq_date'.

## Sample Count

In [5]:
con.sql(
"""--sql
SELECT
    count( distinct sample_num) as sample_num
FROM
    sm
"""
).pl()


sample_num
i64
96


The raw dataset contains 96 samples.

## Color


In [6]:
con.sql(
"""--sql
SELECT
    color,
    count(sample_num)
FROM
    sm
GROUP BY color
ORDER BY
    color
"""
).pl()


color,count(sample_num)
str,i64
"""orange""",1
"""red""",68
"""rosé""",3
"""white""",24


There are 4 unique colors: 'orange', 'red', 'rosé', and 'white'. The 96 samples break down as follows: 'orange' = 1, 'red' = 68, 'rosé' = 3, 'white' = 24.


## Variety

### Samples per Varietal

In [7]:
varietal_counts = con.sql(
"""--sql
SELECT
    varietal,
    count(sample_num) as count
FROM
    sm
GROUP BY
    varietal
ORDER BY
    varietal
"""
).pl()

varietal_counts.describe()


statistic,varietal,count
str,str,f64
"""count""","""33""",33.0
"""null_count""","""0""",0.0
"""mean""",,2.909091
"""std""",,2.673523
"""min""","""cabernet sauvignon""",1.0
"""25%""",,1.0
"""50%""",,2.0
"""75%""",,3.0
"""max""","""white blend""",11.0


There are 33 varieties within the 'raw' dataset.


In [27]:
con.sql(
"""--sql
WITH
    binned AS(
        SELECT
            varietal,
            count,
            CASE
                WHEN
                    count = 1
                THEN
                    {'desc':'one', 'bin_rank': 0}
                WHEN
                    count BETWEEN 2 AND 5
                THEN
                    {'desc':'between 2 and 5', 'bin_rank':1}
                WHEN
                    count BETWEEN 5 AND 10
                THEN
                    {'desc':'between 5 and 10', 'bin_rank':2}
                WHEN
                    count > 10
                THEN
                    {'desc':'greater than 10', 'bin_rank':3}
                END AS bin
                    
        FROM
            varietal_counts
        ORDER BY
            bin ASC
        ),
    agg AS (
        SELECT
            bin,
            count(varietal) as var_per_bin,
        FROM
            binned
        GROUP BY
            bin
        ORDER BY
            bin
        ),
    unpacked AS (
        SELECT
            bin.*,
            var_per_bin
        FROM
            agg
        )
SELECT
    bin_rank + 1 as bin_rank,
    "desc",
    var_per_bin
    
FROM
    unpacked
ORDER BY
    bin_rank
"""
).pl().pipe(GT)


bin_rank,desc,var_per_bin
1,one,13
2,between 2 and 5,16
3,between 5 and 10,2
4,greater than 10,2


The number of samples per variety ranges from 1 to 11. Because the `CASE` branches are evaluated in order, a count of 5 is captured by the second branch, so the four bins are effectively: 1: $n = 1$, 2: $2 \le n \le 5$, 3: $5 < n \le 10$, and 4: $n > 10$ (in practice, 11 samples). Bin 1 contains 13 varietals, bin 2 contains 16, and bins 3 and 4 contain 2 each. The full tabulation can be found in the [appendix](#varietal-counts).
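
The bin boundaries can be made explicit in a few lines of plain Python. This is only a minimal sketch, not part of the analysis pipeline: `bin_label` is a hypothetical helper, and the counts are the eight values reported in the most-represented varietals table of this notebook. The branches are checked in order, mirroring the SQL `CASE`, which is why a count of 5 lands in the second bin.

```python
from collections import Counter


def bin_label(n: int) -> str:
    """Return the bin description for a varietal sample count (hypothetical helper)."""
    if n < 1:
        raise ValueError("a varietal must have at least one sample")
    if n == 1:
        return "one"
    if n <= 5:
        return "between 2 and 5"
    if n <= 10:
        # label kept from the SQL; a count of 5 is already binned above
        return "between 5 and 10"
    return "greater than 10"


# counts of the eight most represented varietals reported in this notebook
top_counts = {
    "pinot noir": 11,
    "shiraz": 11,
    "chardonnay": 7,
    "red bordeaux blend": 6,
    "gamay": 5,
    "malbec": 5,
    "nebbiolo": 5,
    "riesling": 5,
}

varietals_per_bin = Counter(bin_label(n) for n in top_counts.values())
print(dict(varietals_per_bin))
```

Only the eight top varietals are used here, so the per-bin totals are not expected to match the full table of 33 varietals.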

### Most Represented Varietals


In [28]:
varietal_counts.sort('count', descending=True).head(8).pipe(GT)


varietal,count
pinot noir,11
shiraz,11
chardonnay,7
red bordeaux blend,6
gamay,5
malbec,5
nebbiolo,5
riesling,5


The most represented varieties are Pinot Noir (11), Shiraz (11), Chardonnay (7), Red Bordeaux Blends (6), Gamay (5), Malbec (5), Nebbiolo (5) and Riesling (5).

## Vintage

In [46]:
con.sql(
"""--sql
select table_schema, table_name, column_name from information_schema.columns WHERE table_name = 'c_cellar_tracker'
"""
).pl()['column_name'].to_list()


['size',
 'vintage',
 'name',
 'locale',
 'country',
 'region',
 'subregion',
 'appellation',
 'producer',
 'type',
 'color',
 'category',
 'varietal',
 'wine']

# Creating a Join Table

As there is a lot of potentially useful metadata, it is efficient to store it in context-specific tables. A centralised join table containing the primary keys of each individual sample will tie these tables together. It will hold the id, the chemstation metadata key, the sample tracker key, and the cellar tracker key.

This will require the creation of another set of keys. Prefixing the primary key columns with 'pk_' will make it clear which columns are keys.

Every individual sample has an 'id', so that column is the natural place to start.
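
As a rough illustration of the idea, the sketch below builds a tiny centralised join table with Python's built-in `sqlite3` module, used purely as a stand-in for DuckDB so that the example runs without the project database. The table `sample_keys`, the key values and the 'id-a'/'id-b' identifiers are all made up for the example; only the 'pk_' prefix convention and the two wine names come from this notebook.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    -- context-specific tables, each with its own integer primary key
    CREATE TABLE chm (pk INTEGER PRIMARY KEY, id TEXT UNIQUE);
    CREATE TABLE st (pk INTEGER PRIMARY KEY, samplecode TEXT UNIQUE);
    CREATE TABLE ct (pk INTEGER PRIMARY KEY, wine TEXT);

    -- central join table: one row per sample, holding nothing but keys
    CREATE TABLE sample_keys (
        pk_chm INTEGER PRIMARY KEY REFERENCES chm (pk),
        pk_st INTEGER REFERENCES st (pk),
        pk_ct INTEGER REFERENCES ct (pk)
    );

    INSERT INTO chm VALUES (1, 'id-a'), (2, 'id-b');
    INSERT INTO st VALUES (10, '06'), (11, '07');
    INSERT INTO ct VALUES
        (100, '2021 babo chianti'),
        (101, '2020 boutinot uva non grata');
    INSERT INTO sample_keys VALUES (1, 10, 100), (2, 11, 101);
    """
)

# every context table is reached through the join table, never by name matching
rows = con.execute(
    """
    SELECT chm.id, st.samplecode, ct.wine
    FROM sample_keys AS k
    JOIN chm ON chm.pk = k.pk_chm
    JOIN st ON st.pk = k.pk_st
    JOIN ct ON ct.pk = k.pk_ct
    ORDER BY chm.pk
    """
).fetchall()
print(rows)
```

Keeping only keys in the join table means the descriptive columns live in exactly one place each, at the cost of an extra join per lookup.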

## Creation of Join Table CHM to ST

'join_samplecode' connects 'c_chemstation_metadata' and 'c_sample_tracker'. Note: 'join_samplecode' was manually added, and thus that connection is fragile without the code that added it. It is somewhere in the 'wine_analysis_hplc_uv' project, and will be added here at a later date. In the meantime, you are warned.


In [None]:
join_tbl = con.sql(
"""--sql
CREATE schema IF NOT EXISTS joins;
CREATE OR REPLACE TABLE joins.mta_st AS (
SELECT
    mta.pk as pk_mta,
    st.pk as pk_st,
FROM
    c_chemstation_metadata as mta
JOIN
    clean.st as st
ON
    mta.join_samplecode = st.samplecode
);
SELECT * FROM joins.mta_st
"""
)


The join of mta and st then becomes:

In [None]:
con.sql(
"""--sql
SELECT
    * EXCLUDE pk_st
FROM
    clean.st AS st
JOIN
    joins.mta_st AS jtbl
ON
    jtbl.pk_st = st.pk
JOIN
    c_chemstation_metadata as mta
ON
    jtbl.pk_mta = mta.pk
"""
).pl().head()


## Join Table ST to CT

The connection between ST and CT is based on the wine name. This is tenuous, however, and a good example of why a foreign key would be useful. For now, we'll proceed with the creation of a join table.
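
One way to see how tenuous a name-based join is: list the sample tracker names that have no exact match in cellar tracker. The following is a minimal sketch using Python's built-in `sqlite3` as a stand-in for DuckDB. The rows are illustrative only: the blank name mimics the `" "` entry visible in the 'ct_wine_name' output of this notebook, while the misspelled name and the 'x1' sample code are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE ct (wine TEXT);
    CREATE TABLE st (samplecode TEXT, ct_wine_name TEXT);

    INSERT INTO ct VALUES
        ('2021 babo chianti'),
        ('2018 crawford river cabernets');
    INSERT INTO st VALUES
        ('06', '2021 babo chianti'),
        ('09', '2018 crawford river cabernet'),  -- invented typo: missing final 's'
        ('x1', ' ');                             -- blank name
    """
)

# LEFT JOIN plus an IS NULL filter keeps only the rows without a partner
unmatched = con.execute(
    """
    SELECT st.samplecode, st.ct_wine_name
    FROM st
    LEFT JOIN ct ON ct.wine = st.ct_wine_name
    WHERE ct.wine IS NULL
    ORDER BY st.samplecode
    """
).fetchall()
print(unmatched)
```

A single character of difference is enough to drop a row silently from an inner join, which is the failure mode an integer key avoids.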

### CT Primary Key Creation

To form a join table between ST and CT it is best to create a new primary key on cellar tracker (CT).


#### CT Duplicate Row


CT has one duplicate row, wine 'Mumm Tasmania Brut Prestige'.

In [None]:
con.sql(
"""--sql
SELECT
    vintage,
    name,
    count(*) as duplicate_count
FROM
    c_cellar_tracker
GROUP BY
    name, vintage
HAVING
    count(*) > 1
"""
).pl()


vintage,name,duplicate_count
str,str,i64
,"""mumm tasmania brut prestige""",2


In [288]:
con.sql(
"""--sql
FROM c_cellar_tracker
"""
).pl().columns


['size',
 'vintage',
 'name',
 'locale',
 'country',
 'region',
 'subregion',
 'appellation',
 'producer',
 'type',
 'color',
 'category',
 'varietal',
 'wine']

In [388]:
con.sql(
"""--sql
CREATE OR REPLACE TEMP TABLE ct AS
    SELECT
        size,
        vintage,
        name,
        locale,
        country,
        region,
        subregion,
        appellation,
        producer,
        type,
        color,
        category,
        varietal,
        wine,
        rank_dense() OVER (order by vintage, name) as pk,
    FROM
        (
            SELECT
                -- intended to drop the duplicated Mumm NV row; DISTINCT applies to the
                -- whole select list, so only rows identical in every column are removed
                DISTINCT concat(vintage, name),
                size,
                vintage,
                name,
                locale,
                country,
                region,
                subregion,
                appellation,
                producer,
                type,
                color,
                category,
                varietal,
                wine,
            FROM
                c_cellar_tracker
            );
SELECT
    *
FROM
    ct
LIMIT 5 
"""
).pl()


size,vintage,name,locale,country,region,subregion,appellation,producer,type,color,category,varietal,wine,pk
str,str,str,str,str,str,str,str,str,str,str,str,str,str,i64
"""750ml""","""2008""","""torbreck descendant""","""australia, south australia, barossa, barossa valley""","""australia""","""south australia""","""barossa""","""barossa valley""","""torbreck""","""red""","""red""","""dry""","""shiraz blend""","""2008 torbreck descendant""",1
"""750ml""","""2009""","""st hugo cabernet sauvignon coonawarra""","""australia, south australia, limestone coast, coonawarra""","""australia""","""south australia""","""limestone coast""","""coonawarra""","""st hugo""","""red""","""red""","""dry""","""cabernet sauvignon""","""2009 st hugo cabernet sauvignon coonawarra""",2
"""750ml""","""2013""","""woodlands cabernet merlot""","""australia, western australia, south west australia, margaret river""","""australia""","""western australia""","""south west australia""","""margaret river""","""woodlands""","""red""","""red""","""dry""","""red bordeaux blend""","""2013 woodlands cabernet merlot""",3
"""750ml""","""2014""","""perrier-jouët champagne belle epoque""","""france, champagne""","""france""","""champagne""","""unknown""","""champagne""","""perrier-jouët""","""white - sparkling""","""white""","""sparkling""","""champagne blend""","""2014 perrier-jouët champagne belle epoque""",4
"""750ml""","""2014""","""shaw and smith shiraz""","""australia, south australia, mount lofty ranges, adelaide hills""","""australia""","""south australia""","""mount lofty ranges""","""adelaide hills""","""shaw and smith""","""red""","""red""","""dry""","""shiraz""","""2014 shaw and smith shiraz""",5


In [389]:
con.sql(
"""--sql
SELECT COUNT(*) FROM ct
"""
).pl()


count_star()
i64
148


Now create the new table with the primary key constraint.


In [391]:
con.sql(
"""--sql
CREATE OR REPLACE TABLE clean.ct
(
    size VARCHAR,
    vintage VARCHAR,
    name VARCHAR,
    locale VARCHAR,
    country VARCHAR,
    region VARCHAR,
    subregion VARCHAR,
    appellation VARCHAR,
    producer VARCHAR,
    type VARCHAR,
    color VARCHAR,
    category VARCHAR,
    varietal VARCHAR,
    wine VARCHAR,
    pk INTEGER PRIMARY KEY
);
INSERT INTO clean.ct (
SELECT
    size,
    vintage,
    name,
    locale,
    country,
    region,
    subregion,
    appellation,
    producer,
    type,
    color,
    category,
    varietal,
    wine,
    pk
FROM
    ct
);
SELECT * FROM clean.ct LIMIT 5
"""
).pl()


size,vintage,name,locale,country,region,subregion,appellation,producer,type,color,category,varietal,wine,pk
str,str,str,str,str,str,str,str,str,str,str,str,str,str,i32
"""750ml""","""2008""","""torbreck descendant""","""australia, south australia, barossa, barossa valley""","""australia""","""south australia""","""barossa""","""barossa valley""","""torbreck""","""red""","""red""","""dry""","""shiraz blend""","""2008 torbreck descendant""",1
"""750ml""","""2009""","""st hugo cabernet sauvignon coonawarra""","""australia, south australia, limestone coast, coonawarra""","""australia""","""south australia""","""limestone coast""","""coonawarra""","""st hugo""","""red""","""red""","""dry""","""cabernet sauvignon""","""2009 st hugo cabernet sauvignon coonawarra""",2
"""750ml""","""2013""","""woodlands cabernet merlot""","""australia, western australia, south west australia, margaret river""","""australia""","""western australia""","""south west australia""","""margaret river""","""woodlands""","""red""","""red""","""dry""","""red bordeaux blend""","""2013 woodlands cabernet merlot""",3
"""750ml""","""2014""","""perrier-jouët champagne belle epoque""","""france, champagne""","""france""","""champagne""","""unknown""","""champagne""","""perrier-jouët""","""white - sparkling""","""white""","""sparkling""","""champagne blend""","""2014 perrier-jouët champagne belle epoque""",4
"""750ml""","""2014""","""shaw and smith shiraz""","""australia, south australia, mount lofty ranges, adelaide hills""","""australia""","""south australia""","""mount lofty ranges""","""adelaide hills""","""shaw and smith""","""red""","""red""","""dry""","""shiraz""","""2014 shaw and smith shiraz""",5


# Adding Primary Key Constraints to Tables

To ensure that the keys are in fact primary keys, we should recreate all of the tables with those columns declared as primary keys. Unfortunately, the DuckDB version used here offers no way of adding a constraint outside of the table creation query, so the tables need to be remade. At the same time we will create a 'clean' schema to hold the new tables. Later on the old tables can be deleted, but that will invalidate a lot of prior code.

In [None]:
con.close()
con = db.connect(db_path)
con.sql(
"""--sql
CREATE SCHEMA IF NOT EXISTS clean;
""")


## Adding Primary Keys



### Sample Tracker


Is 'samplecode' the primary key of sample tracker? If it is, the number of distinct values should equal the number of rows in the table.

In [None]:
con.sql(
"""--sql
SELECT
    count( distinct samplecode)
FROM
    c_sample_tracker
WHERE
    samplecode IS NOT NULL
"""
).pl()


count(DISTINCT samplecode)
i64
190
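
The distinct count alone does not answer the question; it has to be set against the total row count, and NULLs need a separate look because `count(DISTINCT ...)` skips them. A minimal sketch of that comparison follows, using Python's built-in `sqlite3` as a stand-in for DuckDB. The `tracker` table and its five rows are made up for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE tracker (samplecode TEXT);
    -- one repeated value and one NULL, so the column is not a valid key
    INSERT INTO tracker VALUES ('01'), ('02'), ('03'), ('03'), (NULL);
    """
)

n_rows, n_distinct, n_null = con.execute(
    """
    SELECT
        count(*),
        count(DISTINCT samplecode),
        sum(samplecode IS NULL)
    FROM tracker
    """
).fetchone()

# a candidate key has no NULLs and as many distinct values as rows
is_candidate_key = n_null == 0 and n_rows == n_distinct
print(n_rows, n_distinct, n_null, is_candidate_key)
```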


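There are 190 distinct sample codes; provided this equals the table's row count, 'samplecode' is a valid key. Nevertheless, let's add a new integer 'pk' column based on it. This can be achieved by densely ranking the 'samplecode' column: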

In [None]:
con.sql(
"""--sql
ALTER TABLE
    c_sample_tracker
ADD COLUMN
    pk INTEGER DEFAULT NULL;
"""
)
con.sql(
"""--sql
CREATE OR REPLACE TABLE samplecode_pk AS
SELECT
    samplecode,
    rank_dense() OVER (order by samplecode) as pk
FROM
    c_sample_tracker
"""
)

con.sql(
"""--sql
UPDATE c_sample_tracker as st
SET pk = (
    SELECT
        pk
    FROM
        samplecode_pk
    WHERE
        samplecode_pk.samplecode = st.samplecode
)
"""
)

con.sql(
"""--sql
SELECT COUNT(distinct pk) FROM c_sample_tracker
"""
).pl()
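
The add-column, rank and update sequence above could arguably be collapsed into a single `CREATE TABLE ... AS SELECT`. The sketch below shows that shape with Python's built-in `sqlite3` as a stand-in for DuckDB; it assumes SQLite 3.25 or newer for window function support, and the `tracker` and `tracker_keyed` tables are hypothetical. SQLite spells the function `dense_rank()`, whereas the cells in this notebook use DuckDB's `rank_dense()`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE tracker (samplecode TEXT, name TEXT);
    INSERT INTO tracker VALUES ('07', 'c'), ('05', 'a'), ('06', 'b');

    -- derive the surrogate key while copying the table
    CREATE TABLE tracker_keyed AS
    SELECT
        *,
        dense_rank() OVER (ORDER BY samplecode) AS pk
    FROM tracker;
    """
)

keyed = con.execute(
    "SELECT samplecode, pk FROM tracker_keyed ORDER BY pk"
).fetchall()
print(keyed)
```

Dense ranking assigns equal ranks to equal values, so the resulting 'pk' is only unique if 'samplecode' is.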



The table now has to be recreated. First, a look at its current contents:


In [None]:
con.sql(
"""--sql
SELECT
    *
FROM
    c_sample_tracker
"""
).pl().head()


Next, list its columns so that the 'clean' schema table can be declared:


In [None]:
con.sql(
"""--sql
SELECT * FROM c_sample_tracker LIMIT 1
"""
).pl().columns


In [None]:
con.sql(
"""--sql
CREATE TABLE clean.st (
    detection VARCHAR,
    sampler VARCHAR,
    samplecode VARCHAR,
    vintage VARCHAR,
    name VARCHAR,
    open_date VARCHAR,
    sampled_date VARCHAR,
    added_to_cellartracker VARCHAR,
    notes VARCHAR,
    size VARCHAR,
    ct_wine_name VARCHAR,
    pk INTEGER PRIMARY KEY,
)
"""
)

con.sql(
"""--sql
INSERT INTO clean.st
SELECT
    *
FROM
    c_sample_tracker
"""
)

con.sql(
"""--sql
SELECT * FROM clean.st
"""
).pl().head()

con.sql(
"""--sql
    DROP TABLE samplecode_pk
"""
)





Now check whether the primary key constraint works:

In [None]:
try:
    con.sql(
    """--sql
    INSERT INTO clean.st BY NAME (SELECT 91 AS pk)
    """
    ).pl()
except db.ConstraintException as e:
    print(e)


Constraint Error: Duplicate key "pk: 91" violates primary key constraint. If this is an unexpected constraint violation please double check with the known index limitations section in our documentation (https://duckdb.org/docs/sql/indexes).


If the insert above raises a constraint error, as shown, then the primary key constraint on sample tracker is working.
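
The same probe can be wrapped in a small function that reports the outcome and cleans up after itself. This is a sketch under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in for DuckDB, so the exception caught is `sqlite3.IntegrityError` where the DuckDB cell catches `db.ConstraintException`. `violates_pk` is a hypothetical helper and the seeded row is made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE st (pk INTEGER PRIMARY KEY, samplecode TEXT)")
con.execute("INSERT INTO st VALUES (91, '115')")
con.commit()  # make the seed row permanent before probing


def violates_pk(pk: int) -> bool:
    """Try to insert a row with the given key; report whether the constraint fired."""
    try:
        con.execute("INSERT INTO st (pk) VALUES (?)", (pk,))
    except sqlite3.IntegrityError:
        return True
    con.rollback()  # the probe row went in, so undo it
    return False


print(violates_pk(91), violates_pk(92))
n_rows = con.execute("SELECT count(*) FROM st").fetchone()[0]
```

Rolling back the successful probe keeps the table unchanged, so the check can be repeated safely.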

### Chemstation Metadata

The primary key of the 'c_chemstation_metadata' table is the 'id'.


In [223]:
con.sql(
"""--sql
SELECT
    count(distinct id)
FROM
    c_chemstation_metadata
"""
).pl()


count(DISTINCT id)
i64
175


There are 175 distinct 'id' values.


Create the 'pk' column:


In [227]:


con.sql(
"""--sql
ALTER TABLE
    c_chemstation_metadata
ADD COLUMN
    pk INTEGER DEFAULT NULL;
"""
)

# create a temporary pk table
con.sql(
"""--sql
CREATE OR REPLACE TABLE id_pk AS
SELECT
    id,
    rank_dense() OVER (order by id) as pk
FROM
    c_chemstation_metadata
"""
)

# add the pk column to the metadata table
con.sql(
"""--sql
UPDATE c_chemstation_metadata as mta
SET pk = (
    SELECT
        pk
    FROM
        id_pk
    WHERE
        id_pk.id = mta.id
)
"""
)

con.sql(
"""--sql
SELECT COUNT(distinct pk) FROM c_chemstation_metadata
"""
).pl()

con.sql(
"""--sql
DROP TABLE id_pk
"""
)



AttributeError: 'NoneType' object has no attribute 'pl'

In [231]:
con.sql(
"""--sql
SELECT COUNT(DISTINCT pk) FROM c_chemstation_metadata
"""
).pl()


count(DISTINCT pk)
i64
175


In [228]:
con.sql(
"""--sql
    SELECT * FROM c_chemstation_metadata
    LIMIT 10
"""
).pl()


path,ch_samplecode,acq_date,acq_method,unit,signal,vendor,inj_vol,seq_name,seq_desc,vialnum,originalfilepath,id,desc,join_samplecode,pk
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,i32
"""/users/jonathan/uni/0_jono_dat…","""114""","""2023-05-09 05:00:53""","""avantor100x4_6c18-h2o-meoh-2_5…","""mau""","""dad1i, dad: spectrum""","""agilent""","""10.00""","""2023-05-08_uv_vis_wines 2023-0…","""the autonomous ambient wine sa…","""vial 14""","""c:\chem32\3\data\0_jono_data\0…","""839df657-69a3-4935-a30b-3bd17e…","""2019 catena malbec vallle de u…","""114""",105
"""/users/jonathan/uni/0_jono_dat…","""50""","""2023-03-15 15:09:51""","""avantor100x4_6c18-h2o-meoh-2_5…","""mau""","""dad1i, dad: spectrum""","""agilent""","""10.00""","""wines_2023-03-15_11-33-51""",,"""vial 5""","""c:\chem32\1\data\0_jono_data\w…","""36ac3c9f-aea7-4a31-a8fa-740002…",,"""50""",38
"""/users/jonathan/uni/0_jono_dat…","""98""","""2023-04-13 13:32:12""","""halo150x4_6c18-h2o-meoh-2_1.m""","""mau""","""dad1i, dad: spectrum""","""agilent""","""10.00""","""2023-04-13_wines_2023-04-13_11…",,"""vial 3""","""c:\chem32\1\data\0_jono_data\2…","""6bf0e36f-819a-4303-9386-76d206…",,"""98""",76
"""/users/jonathan/uni/0_jono_dat…","""92""","""2023-04-05 00:32:28""","""avantor100x4_6c18-h2o-meoh-2_5…","""mau""","""dad1i, dad: spectrum""","""agilent""","""10.00""","""2023-04-04_wines_2023-04-04_12…",,"""vial 15""","""c:\chem32\1\data\0_jono_data\2…","""c1e3411f-1d9e-4780-aace-a4650b…",,"""92""",146
"""/users/jonathan/uni/0_jono_dat…","""96""","""2023-04-13 12:00:29""","""avantor100x4_6c18-h2o-meoh-2_5…","""mau""","""dad1i, dad: spectrum""","""agilent""","""10.00""","""2023-04-13_wines_2023-04-13_11…",,"""vial 1""","""c:\chem32\1\data\0_jono_data\2…","""06902f86-0024-418d-b449-79843f…",,"""96""",3
"""/users/jonathan/uni/0_jono_dat…","""0101""","""2023-02-15 19:50:05""","""avantor100x4_6c18-h2o-meoh-2_1…","""mau""","""dad1i, dad: spectrum""","""agilent""","""10.00""","""2023-02-15_wines_2023-02-15_15…",,"""vial 6""","""c:\chem32\1\data\0_jono_data\2…","""a69d4665-d7b3-4706-be76-4d8e39…",,"""10""",126
"""/users/jonathan/uni/0_jono_dat…","""110""","""2023-05-09 01:26:57""","""avantor100x4_6c18-h2o-meoh-2_5…","""mau""","""dad1i, dad: spectrum""","""agilent""","""10.00""","""2023-05-08_uv_vis_wines 2023-0…","""the autonomous ambient wine sa…","""vial 7""","""c:\chem32\3\data\0_jono_data\0…","""7311903b-1d99-4dd9-b41d-8e5755…","""2015 clos du marquis""","""110""",85
"""/users/jonathan/uni/0_jono_dat…","""54""","""2023-03-15 18:44:27""","""avantor100x4_6c18-h2o-meoh-2_5…","""mau""","""dad1i, dad: spectrum""","""agilent""","""10.00""","""wines_2023-03-15_11-33-51""",,"""vial 9""","""c:\chem32\1\data\0_jono_data\w…","""461c5c4f-36cc-4552-a7ab-47bf19…",,"""54""",51
"""/users/jonathan/uni/0_jono_dat…","""42""","""2023-03-15 04:47:25""","""avantor100x4_6c18-h2o-meoh-2_5…","""mau""","""dad1i, dad: spectrum""","""agilent""","""10.00""","""2023-03-14_wines_2023-03-14_19…",,"""vial 11""","""c:\chem32\1\data\2023-03-14_wi…","""f4f1ff8c-05f2-42ab-a3ab-3d1a34…",,"""42""",171
"""/users/jonathan/uni/0_jono_dat…","""86""","""2023-04-04 20:05:08""","""avantor100x4_6c18-h2o-meoh-2_5…","""mau""","""dad1i, dad: spectrum""","""agilent""","""10.00""","""2023-04-04_wines_2023-04-04_12…",,"""vial 10""","""c:\chem32\1\data\0_jono_data\2…","""6330ab34-a340-4bc5-bb1a-a68929…",,"""86""",70


In [271]:
con.sql(
"""--sql
DROP TABLE IF EXISTS clean.chm
"""
)

In [272]:
con.sql(
"""--sql
CREATE OR REPLACE TABLE clean.chm (
    pk INTEGER PRIMARY KEY,
    id VARCHAR UNIQUE,
    path VARCHAR,
    acq_date VARCHAR UNIQUE,
    acq_method VARCHAR,
    unit VARCHAR,
    signal VARCHAR,
    vendor VARCHAR,
    inj_vol VARCHAR,
    seq_name VARCHAR,
    seq_desc VARCHAR,
    vialnum VARCHAR,
    originalfilepath VARCHAR,
    description VARCHAR,
);
"""
)


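Now populate the table by writing the chemstation metadata to the clean schema. The columns are listed explicitly because their order in 'clean.chm' differs from the source table and 'desc' is renamed to 'description'. 'ch_samplecode' and 'join_samplecode' are left out, so the link to the sample tracker will need a join table.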


In [274]:
con.sql(
"""--sql
INSERT INTO clean.chm
    SELECT
        pk,
        id,
        path,
        acq_date,
        acq_method,
        unit,
        signal,
        vendor,
        inj_vol,
        seq_name,
        seq_desc,
        vialnum,
        originalfilepath,
        "desc"
    FROM
        c_chemstation_metadata
"""
)


In [276]:
con.sql(
"""--sql
DESCRIBE clean.chm
"""
).pl()


column_name,column_type,null,key,default,extra
str,str,str,str,str,str
"""pk""","""INTEGER""","""NO""","""PRI""",,
"""id""","""VARCHAR""","""YES""","""UNI""",,
"""path""","""VARCHAR""","""YES""",,,
"""acq_date""","""VARCHAR""","""YES""","""UNI""",,
"""acq_method""","""VARCHAR""","""YES""",,,
…,…,…,…,…,…
"""seq_name""","""VARCHAR""","""YES""",,,
"""seq_desc""","""VARCHAR""","""YES""",,,
"""vialnum""","""VARCHAR""","""YES""",,,
"""originalfilepath""","""VARCHAR""","""YES""",,,


In [280]:
con.sql(
"""--sql
SELECT
    schema_name,
    table_name,
    constraint_column_names,
    constraint_type
FROM
    duckdb_constraints()
WHERE
    schema_name = 'clean'
AND
    table_name = 'chm'
"""
).pl()


schema_name,table_name,constraint_column_names,constraint_type
str,str,list[str],str
"""clean""","""chm""","[""pk""]","""PRIMARY KEY"""
"""clean""","""chm""","[""id""]","""UNIQUE"""
"""clean""","""chm""","[""acq_date""]","""UNIQUE"""
"""clean""","""chm""","[""pk""]","""NOT NULL"""


Next, the table needs a means of connecting to the sample tracker table:

In [81]:
join_tbl = con.sql(
"""--sql
SELECT
    st.ct_wine_name as pk_st_to_ct,
    jt.pk_chm_to_st,
    jt.pk_id
FROM
    c_sample_tracker as st
JOIN
    join_tbl as jt
ON
    jt.pk_chm_to_st = st.samplecode
"""
).pl()

join_tbl.head()


pk_st_to_ct,pk_chm_to_st,pk_id
str,str,str
"""2021 babo chianti""","""06""","""e56c4dcd-2847-4d34-b457-743be1…"
"""2020 boutinot uva non grata""","""07""","""5eb3135c-33a2-404b-8042-e23cae…"
"""2021 matias riccitelli malbec …","""08""","""8cfa23c8-ffa6-4c27-be70-0251d4…"
"""2018 crawford river cabernets""","""09""","""38601b0b-5338-4154-9f04-cf85b8…"
"""2021 john duval wines shiraz c…","""10""","""a69d4665-d7b3-4706-be76-4d8e39…"


In [48]:
con.sql(
"""--sql
select table_schema, table_name, column_name from information_schema.columns WHERE table_name = 'c_sample_tracker'
"""
).pl()['column_name'].to_list()


['detection',
 'sampler',
 'samplecode',
 'vintage',
 'name',
 'open_date',
 'sampled_date',
 'added_to_cellartracker',
 'notes',
 'size',
 'ct_wine_name']

In [51]:
con.sql(
"""--sql
SELECT
    wine
FROM
    c_cellar_tracker
"""
).pl()


wine
str
"""2020 agricola punica montessu …"
"""2022 alkina grenache kin"""
"""2016 anna maria abbona barolo …"
"""2019 domaine des ardoisières a…"
"""2021 babo chianti"""
…
"""2020 yangarra estate roussanne…"
"""2015 yangarra estate shiraz mc…"
"""2020 yangarra estate shiraz mc…"
"""2021 yering station pinot noir"""


In [53]:
con.sql(
"""--sql
SELECT
    ct_wine_name
FROM
    c_sample_tracker
"""
).pl()


ct_wine_name
str
"""2016 zema estate cabernet sauv…"
"""2022 william downie cathedral"""
"""2021 babo chianti"""
"""2021 joshua cooper cabernet sa…"
"""2022 william downie cathedral"""
…
"""2020 leeuwin estate cabernet s…"
""" """
"""2021 terraviva cerasuolo dabru…"
"""2020 le macchiole bolgheri ros…"


'c_chemstation_metadata' has 175 rows but 174 distinct 'join_samplecode' values. This is the key to join with 'c_sample_tracker'. Is there a duplicate?

In [60]:
con.sql(
"""--sql
select
    count(distinct join_samplecode)
from
    c_chemstation_metadata
"""
).pl()


count(DISTINCT join_samplecode)
i64
174
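
A distinct count one lower than the row count can come from a repeated value, but equally from a NULL, since `count(DISTINCT ...)` ignores NULLs. The sketch below separates the two cases. It uses Python's built-in `sqlite3` as a stand-in for DuckDB, and the four rows are made up; it does not show which case applies to 'c_chemstation_metadata'.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE chm (id TEXT, join_samplecode TEXT);
    INSERT INTO chm VALUES ('a', '05'), ('b', '06'), ('c', '06'), ('d', NULL);
    """
)

# values that occur more than once
duplicates = con.execute(
    """
    SELECT join_samplecode, count(*) AS n
    FROM chm
    WHERE join_samplecode IS NOT NULL
    GROUP BY join_samplecode
    HAVING count(*) > 1
    ORDER BY join_samplecode
    """
).fetchall()

# rows with no key at all
null_rows = con.execute(
    "SELECT count(*) FROM chm WHERE join_samplecode IS NULL"
).fetchone()[0]

print(duplicates, null_rows)
```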


In [54]:
con.sql(
"""--sql
select
    *
from
    c_cellar_tracker as ct
JOIN
    c_sample_tracker as st
ON
    ct.wine = st.ct_wine_name
"""
).pl()


size,vintage,name,locale,country,region,subregion,appellation,producer,type,color,category,varietal,wine,detection,sampler,samplecode,vintage_1,name_1,open_date,sampled_date,added_to_cellartracker,notes,size_1,ct_wine_name
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
"""750ml""","""2021""","""babo chianti""","""italy, tuscany, chianti""","""italy""","""tuscany""","""chianti""","""chianti""","""babo""","""red""","""red""","""dry""","""sangiovese""","""2021 babo chianti""","""raw""","""jonathan""","""02""","""2021""","""babo chianti""",,,"""y""","""ambient 2 weeks. sampled 20230…","""750""","""2021 babo chianti"""
"""750ml""","""2021""","""babo chianti""","""italy, tuscany, chianti""","""italy""","""tuscany""","""chianti""","""chianti""","""babo""","""red""","""red""","""dry""","""sangiovese""","""2021 babo chianti""","""raw""","""jonathan""","""06""","""2021""","""babo chianti""","""2023-02-04""",,"""y""",,"""750""","""2021 babo chianti"""
"""750ml""","""2020""","""boutinot uva non grata""","""france, vin de france""","""france""","""france""","""unknown""","""vin de france""","""boutinot""","""red""","""red""","""dry""","""gamay""","""2020 boutinot uva non grata""","""raw""","""jonathan""","""07""","""2020""","""uva non grata gamay""","""2023-02-04""",,"""y""",,"""750""","""2020 boutinot uva non grata"""
"""750ml""","""2021""","""matias riccitelli malbec hey m…","""argentina, mendoza, lujan de c…","""argentina""","""mendoza""","""lujan de cuyo""","""unknown""","""matias riccitelli""","""red""","""red""","""dry""","""malbec""","""2021 matias riccitelli malbec …","""raw""","""jonathan""","""08""","""2021""","""hey malbec""","""2023-02-04""",,"""y""",,"""750""","""2021 matias riccitelli malbec …"
"""750ml""","""2018""","""crawford river cabernets""","""australia, victoria, western v…","""australia""","""victoria""","""western victoria""","""henty""","""crawford river""","""red""","""red""","""dry""","""red bordeaux blend""","""2018 crawford river cabernets""","""raw""","""jonathan""","""09""","""2018""","""crawford river cabernets""","""2023-02-01""",,"""y""",,"""750""","""2018 crawford river cabernets"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""750ml""","""2021""","""st hugo grenache shiraz mataro""","""australia, south australia, ba…","""australia""","""south australia""","""barossa""","""barossa valley""","""st hugo""","""red""","""red""","""dry""","""red rhone blend""","""2021 st hugo grenache shiraz m…","""cuprac""","""jonathan""","""169""","""2021""","""st hugo gsm""","""2023-05-26""","""2023-05-30""","""y""",,,"""2021 st hugo grenache shiraz m…"
"""750ml""","""2019""","""kuenhof (peter pliger) rieslin…","""italy, trentino-alto adige, al…","""italy""","""trentino-alto adige""","""alto adige""","""valle isarco / eisacktaler""","""kuenhof (peter pliger)""","""white""","""white""","""dry""","""riesling""","""2019 kuenhof (peter pliger) ri…","""raw""","""jonathan""","""22""","""2019""","""kuenhof riesling""","""2023-02-10""",,"""y""",,"""750""","""2019 kuenhof (peter pliger) ri…"
"""750ml""","""2020""","""hochkirch pinot noir maximus""","""australia, victoria, western v…","""australia""","""victoria""","""western victoria""","""henty""","""hochkirch""","""red""","""red""","""dry""","""pinot noir""","""2020 hochkirch pinot noir maxi…","""raw""","""jonathan""","""70""","""2020""","""hochkirch pinot noir""","""2023-03-23""","""2023-03-23""","""y""",,"""750""","""2020 hochkirch pinot noir maxi…"
"""750ml""","""2022""","""chalmers vermentino""","""australia, victoria, central v…","""australia""","""victoria""","""central victoria""","""heathcote""","""chalmers""","""white""","""white""","""dry""","""vermentino""","""2022 chalmers vermentino""","""cuprac""","""colin""","""160""","""2022""","""chalmers vermentino""",,,"""y""",,,"""2022 chalmers vermentino"""


In [29]:
con.sql(
"""--sql
SELECT
    vintage,
    count(vintage) as count
FROM
    pbl.sample_metadata
JOIN
    c_a
    
"""
).pl()


BinderException: Binder Error: Referenced column "vintage" not found in FROM clause!
Candidate bindings: "sm.wine"

# Signal


# Appendix

## Varietal Counts


In [18]:
from great_tables import GT, md, html

varietal_counts_tbl = con.sql(
"""--sql
WITH 
    agg AS (
        SELECT
            varietal,
            count(varietal) as count,
        FROM
            sm
        GROUP BY 
            varietal
        order by
            varietal
    ),
    tiled AS (
        SELECT
            ntile(2) OVER (ORDER BY varietal) as col,  
            *
        FROM
            agg
        ),
    row_nummed AS (
        SELECT
            row_number() OVER (PARTITION BY col ORDER BY varietal) as row_num,
            *
        FROM
            tiled
        ),
    col_1 AS (
        SELECT
            *,
        FROM
            row_nummed
        WHERE
            col = 1
        ),
    col_2 AS (
        SELECT
            *,
        FROM
            row_nummed
        WHERE
            col = 2
        ),
    joined AS (
        SELECT
            *
        FROM
            col_1 as col1
        -- LEFT JOIN keeps the last row of the longer first column when the split is uneven
        LEFT JOIN
            col_2 as col2
        USING
            (row_num)
        ORDER BY
            row_num
            
        )
SELECT
    * EXCLUDE (col, row_num, col_1)
FROM
    joined
"""
).pl()
varietal_counts_tbl.pipe(GT).cols_label(
    varietal_1=html('varietal'),
    count_1=html('count')
)


varietal,count,varietal.1,count.1
cabernet sauvignon,2,pinot grigio,1
cabernet-shiraz blend,1,pinot noir,11
carricante,2,red blend,2
catarratto,1,red bordeaux blend,6
chardonnay,7,red rhone blend,2
chardonnay blend,1,riesling,5
chenin blanc,2,rosé blend,1
gamay,5,roussanne,1
garganega,1,sangiovese,3
grenache,2,sangiovese blend,1
