---
project: cleaning_data_tables
title: cleaning sampletracker wines
cdt: 2024-09-17T12:55:31
status: closed
description: "clean up of the entered sampletracker wine names with those in the ct to enable joins on those columns"
conclusion: "have matched 175/190 wines with entered rows in ct. Those missing are either unidentifiable or not present in the ct database. Recommendation is to add them to an excluded list until such a time as it is worth manually adding their metadata."
---


Unfortunately, when I was entering the samples into the tracker, I did not have a clear data structure in place. To acquire the metadata, I planned on joining the entered wine names with those present in the cellartracker database. However, fuzzy joining is not a sound foundation for downstream joins, so it was deemed necessary to replace the original names with their verified matches. This notebook produces that result. To do this, we need to get both tables, fuzzy join on the names after cleaning, inspect the results, and replace where appropriate.

# Get the Tables


In [2]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import duckdb as db
import polars as pl
from database_etl.definitions import DATA_DIR
from database_etl.etl.sql import ct

pl.Config.set_fmt_str_lengths(999).set_tbl_rows(15)


overwrite_sample_tracker = False
overwrite_cellar_tracker = False
raw_ct_path = DATA_DIR / "dirty_cellar_tracker.csv"
dirty_st_path = str(
    DATA_DIR / "original_sample_tracker" / "original_dirty_sample_tracker.csv"
)
download_new_file = False

con = db.connect()


The plan is to move the wines with missing metadata to another table, and to create a sampletracker table without the missing data rows. First, create a sequence that supplies the primary keys for `st`.

In [3]:
con.sql(
    """--sql
create sequence pk_st_seq start 1;
"""
)


## Create `st`

In [4]:
con.sql(
    f"""--sql
drop table if exists excluded;
drop table if exists matches cascade;
drop table if exists st cascade;
create or replace table st (
    pk integer primary key,
    detection varchar not null,
    wine_key varchar,
    wine varchar,
    vintage integer,
    sampler varchar,
    samplecode varchar not null unique,
    open_date varchar,
    sampled_date varchar,
    added_to_cellartracker bool,
    notes varchar,
    size float,
);
insert into st
    with
        st_loading as (
            select
                nextval('pk_st_seq') as pk,
                detection,
                -- replace null vintages with 9999 so that string slicing operations downstream work
                cast(case when vintage is null or vintage = 'null' then '9999' else vintage end as integer) as vintage,
                trim(lower(sampler)) as sampler,
                trim(lower(samplecode)) as samplecode,
                -- trim, lowercase, strip accents and remove straight quotes from the wine name
                replace(
                    replace(
                        strip_accents(trim(lower(name))), '"', ''
                        ), '''', ''
                    ) as wine,
                open_date,
                sampled_date,
                case when added_to_cellartracker = 'y' then true else false end as added_to_cellartracker,
                replace(
                    replace(
                        strip_accents(trim(lower(notes))), '"', ''
                        ), '''', ''
                    ) as notes,
                size,
            from
                read_csv('{dirty_st_path}')
        ),
        st_wine_key as (
        select
            pk,
            detection,
            concat(cast(vintage as integer), ' ', trim(lower(wine))) as wine_key,
            wine,
            vintage,
            sampler,
            samplecode,
            open_date,
            sampled_date,
            added_to_cellartracker,
            notes,
            size
        from st_loading
        )
select
    pk,
    detection,
    wine_key,
    wine,
    vintage,
    sampler,
    samplecode,
    open_date,
    sampled_date,
    added_to_cellartracker,
    notes,
    size
from
    st_wine_key;
"""
)

con.sql(
    """--sql
select
    *
from
    st
limit 5
"""
).pl()


pk,detection,wine_key,wine,vintage,sampler,samplecode,open_date,sampled_date,added_to_cellartracker,notes,size
i32,str,str,str,i32,str,str,str,str,bool,str,f32
1,"""raw""","""2016 zema estate family selection cabernet sauvignon""","""zema estate family selection cabernet sauvignon""",2016,"""jonathan""","""00""",,,True,"""freezer storage. sampled at 21:20 20230122, stored for 10h before transport to lab.""",750.0
2,"""raw""","""2022 william downie cathedral pinot noir""","""william downie cathedral pinot noir""",2022,"""jonathan""","""01""",,,True,"""ambient 2 weeks. sampled 20230111.""",750.0
3,"""raw""","""2021 babo chianti""","""babo chianti""",2021,"""jonathan""","""02""",,,True,"""ambient 2 weeks. sampled 20230111.""",750.0
4,"""raw""","""2021 joshua cooper shays flat cabernet sauvignon""","""joshua cooper shays flat cabernet sauvignon""",2021,"""jonathan""","""03""","""2023-02-04""",,True,,750.0
5,"""raw""","""2022 william downie cathedral pinot noir""","""william downie cathedral pinot noir""",2022,"""jonathan""","""05""","""2023-02-04""",,True,,750.0
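
The cleaning applied to `name` in the SQL above (trim, lowercase, strip accents, drop straight quotes) and the construction of `wine_key` can be mirrored in plain Python, which is handy for spot-checking a single name. This is a minimal sketch and not part of the notebook: `normalise_wine_name` and `build_wine_key` are hypothetical helpers, and `unicodedata` is assumed to be a close stand-in for DuckDB's `strip_accents`.

```python
import unicodedata
from typing import Optional


def normalise_wine_name(name: str) -> str:
    """Trim, lowercase, strip accents and drop straight quotes."""
    decomposed = unicodedata.normalize("NFKD", name.strip().lower())
    # drop the combining marks left behind by the decomposition (the accents)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.replace('"', "").replace("'", "")


def build_wine_key(vintage: Optional[str], name: str) -> str:
    """Prefix the cleaned name with the vintage, using 9999 when it is missing."""
    year = 9999 if vintage in (None, "null") else int(vintage)
    return f"{year} {normalise_wine_name(name)}"


print(build_wine_key("2021", "  Babo Chianti "))
print(build_wine_key(None, "Mystery"))
```

Like the SQL, the sketch only removes straight quotes, so typographic quotes in a name survive the cleaning.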


## Create CT

In [5]:
ct.get_clean_ct(un="OctaneOolong", pw="S74rg4z3r1", con=con, output="db")

if (
    not con.sql(
        """--sql
select * from (show tables)
"""
    )
    .df()["name"]
    .eq("ct")
    .any()
):
    raise ValueError("Execute load_ct.ipynb first")
con.sql(
    """--sql
select
    *
from
    ct
limit 3
"""
).pl()


wine_key,size,vintage,wine,locale,country,region,subregion,appellation,producer,type,color,category,varietal
str,str,i32,str,str,str,str,str,str,str,str,str,str,str
"""2020 agricola punica montessu isola dei nuraghi igt""","""750ml""",2020,"""agricola punica montessu isola dei nuraghi igt""","""italy, sardinia, isola dei nuraghi igt""","""italy""","""sardinia""","""unknown""","""isola dei nuraghi igt""","""agricola punica""","""red""","""red""","""dry""","""carignan blend"""
"""2022 alkina grenache kin""","""750ml""",2022,"""alkina grenache kin""","""australia, south australia, barossa""","""australia""","""south australia""","""barossa""","""barossa""","""alkina""","""red""","""red""","""dry""","""grenache"""
"""2017 anna maria abbona barolo""","""750ml""",2017,"""anna maria abbona barolo""","""italy, piedmont, langhe, barolo""","""italy""","""piedmont""","""langhe""","""barolo""","""anna maria abbona""","""red""","""red""","""dry""","""nebbiolo"""


In [6]:
from fuzzywuzzy import fuzz, process


def build_string_lists(con: db.DuckDBPyConnection) -> tuple[list[str], list[str]]:
    left_strings: list[str] = [
        x[0] for x in con.sql("select lower(wine_key) as wine from st").fetchall()
    ]
    right_strings: list[str] = [
        x[0] for x in con.sql("select lower(wine_key) as wine from ct").fetchall()
    ]

    if not all(isinstance(x, str) for x in left_strings):
        raise TypeError("expected str")
    if not all(isinstance(x, str) for x in right_strings):
        raise TypeError("expected str")

    return left_strings, right_strings


left_strings, right_strings = build_string_lists(con=con)


def match_strings(
    left_strings: list[str], right_strings: list[str]
) -> tuple[list[str], list[int]]:
    matches = []
    scores = []
    for ls in left_strings:
        result = process.extractOne(
            query=ls, choices=right_strings, scorer=fuzz.token_set_ratio
        )
        if result:
            if len(result) == 2:
                match, score = result
                matches.append(match)
                scores.append(score)
    return matches, scores


def construct_match_df(
    left_strings: list[str], matches: list[str], scores: list[int], pk
) -> pd.DataFrame:
    # left_strings is an explicit argument so the function does not depend on a notebook global
    match_df = pd.DataFrame(
        {
            "pk": pk,
            "left_string": left_strings,
            "match": matches,
            "score": scores,
        }
    )
    return match_df


def get_st_pk(con=con):
    return con.sql("select pk from st").df()["pk"]


def match_st_ct_wine_keys(con=con) -> pd.DataFrame:
    pk = get_st_pk(con=con)
    left_strings, right_strings = build_string_lists(con=con)
    matches, scores = match_strings(
        left_strings=left_strings, right_strings=right_strings
    )

    return construct_match_df(
        left_strings=left_strings, matches=matches, scores=scores, pk=pk
    )


match_df = match_st_ct_wine_keys(con=con)
match_df


Unnamed: 0,pk,left_string,match,score
0,1,2016 zema estate family selection cabernet sau...,2016 zema estate cabernet sauvignon family sel...,100
1,2,2022 william downie cathedral pinot noir,2022 william downie cathedral,100
2,3,2021 babo chianti,2021 babo chianti,100
3,4,2021 joshua cooper shays flat cabernet sauvignon,2021 joshua cooper cabernet sauvignon landsbor...,100
4,5,2022 william downie cathedral pinot noir,2022 william downie cathedral,100
...,...,...,...,...
185,186,2017 andrea oberto barolo,2017 andrea oberto barolo del comune di la morra,100
186,187,2021 john duval wines shiraz concilio,2021 john duval wines shiraz concilio,100
187,188,2021 torbreck shiraz the struie,2021 torbreck shiraz the struie,100
188,189,2020 orlando grenache cellar 13 grenache,2020 orlando grenache cellar 13,100
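
Several pairs above score 100 although the strings differ, for example `2022 william downie cathedral pinot noir` against `2022 william downie cathedral`. That is expected with a token-set scorer: when every token of one string is contained in the other, the shared tokens alone give a perfect comparison. The sketch below illustrates the idea with the standard library's `difflib`. It is a simplified stand-in, not fuzzywuzzy's actual implementation, and `token_set_score` is a hypothetical name.

```python
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> int:
    """Similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_score(left: str, right: str) -> int:
    """Compare the shared tokens against each side's full token set."""
    left_tokens, right_tokens = set(left.split()), set(right.split())
    shared = " ".join(sorted(left_tokens & right_tokens))
    left_full = f"{shared} {' '.join(sorted(left_tokens - right_tokens))}".strip()
    right_full = f"{shared} {' '.join(sorted(right_tokens - left_tokens))}".strip()
    # the best of the three pairwise comparisons is the score
    return max(
        _ratio(shared, left_full),
        _ratio(shared, right_full),
        _ratio(left_full, right_full),
    )


print(
    token_set_score(
        "2022 william downie cathedral pinot noir",
        "2022 william downie cathedral",
    )
)
```

A score of 100 therefore means "same tokens, or one side is a subset of the other" rather than "identical strings", which is why the matches still need to be inspected by eye.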


In [7]:
con.sql(
    """--sql
create or replace table matches (
    pk integer primary key references st(pk),
    left_string varchar not null,
    match varchar not null,
    score integer not null,
    verified bool default false,
    );
"""
)

con.sql(
    """--sql
insert into matches
    select
        pk,
        left_string,
        match,
        score,
        false as verified
    from
        match_df
    order by
        score desc
    on conflict do nothing
        ;
"""
)

con.sql(
    """--sql
select
    *
from
    matches
order by
    pk
limit 5
"""
).pl()


pk,left_string,match,score,verified
i32,str,str,i32,bool
1,"""2016 zema estate family selection cabernet sauvignon""","""2016 zema estate cabernet sauvignon family selection""",100,False
2,"""2022 william downie cathedral pinot noir""","""2022 william downie cathedral""",100,False
3,"""2021 babo chianti""","""2021 babo chianti""",100,False
4,"""2021 joshua cooper shays flat cabernet sauvignon""","""2021 joshua cooper cabernet sauvignon landsborough shays flat""",100,False
5,"""2022 william downie cathedral pinot noir""","""2022 william downie cathedral""",100,False


In [8]:
con.sql(
    """--sql
select
    (select count(*) from matches) as total_count,
"""
).pl()


total_count
i64
190


In [9]:
con.sql(
    """--sql
from histogram(matches, score)
"""
).df().style.set_properties(text_align="right")


Unnamed: 0,bin,count,bar
0,x <= 10,3,█▍
1,10 < x <= 20,0,
2,20 < x <= 30,0,
3,30 < x <= 40,2,▉
4,40 < x <= 50,3,█▍
5,50 < x <= 60,2,▉
6,60 < x <= 70,4,█▉
7,70 < x <= 80,3,█▍
8,80 < x <= 90,6,██▊
9,90 < x <= 100,167,████████████████████████████████████████████████████████████████████████████████


As we can see, the majority are above 90: 167 of the 190 matches fall in the top bin.

In [10]:
con.execute(
    """--sql
select
    score,
    count(score)*100/(select count(*) from matches) as count_perc
from
    matches
where
    score > 90
group by
    score
"""
).pl()


score,count_perc
i32,f64
91,0.526316
93,1.052632
95,1.052632
96,0.526316
97,0.526316
100,84.210526


Now let's get rid of the 100 scores.


Now there's nothing for it but to go through each match, bracket by bracket.

The 90s:

In [11]:
con.sql(
    """--sql
select
    *
from
    matches
where
    score > 90
"""
).pl()


pk,left_string,match,score,verified
i32,str,str,i32,bool
1,"""2016 zema estate family selection cabernet sauvignon""","""2016 zema estate cabernet sauvignon family selection""",100,false
2,"""2022 william downie cathedral pinot noir""","""2022 william downie cathedral""",100,false
3,"""2021 babo chianti""","""2021 babo chianti""",100,false
4,"""2021 joshua cooper shays flat cabernet sauvignon""","""2021 joshua cooper cabernet sauvignon landsborough shays flat""",100,false
5,"""2022 william downie cathedral pinot noir""","""2022 william downie cathedral""",100,false
6,"""2021 babo chianti""","""2021 babo chianti""",100,false
8,"""2021 hey malbec""","""2021 matias riccitelli malbec hey malbec!""",100,false
9,"""2018 crawford river cabernets""","""2018 crawford river cabernets""",100,false
…,…,…,…,…
71,"""2020 cascina delle rose dolcetto d’alba ‘a elizabeth’""","""2020 cascina delle rose dolcetto dalba a elizabeth""",97,false


The 90s look good.

In [12]:
con.sql(
    """--sql
update matches
    set
        verified = true
    where
        score > 75;
select * from matches where verified = false;
"""
).pl()


pk,left_string,match,score,verified
i32,str,str,i32,bool
108,"""2019 rr (?)""","""2019 domaine des ardoisieres argile rouge""",73,False
157,"""1001 merivale white semillon sauvignon blanc""","""2022 greywacke sauvignon blanc""",67,False
162,"""1001 allegra pinot grigio""","""2021 farina pinot grigio delle venezie""",67,False
138,"""2021 cantina orsogana""","""2021 domenica chardonnay""",62,False
168,"""2021 piagre gampania bianco""","""2021 girolamo russo etna bianco nerina""",62,False
166,"""1001 tottis vino bianco""","""2021 nino barraco fior di bianco""",58,False
154,"""1001 tottis vino rosso""","""2019 giovanni rosso etna bianco""",53,False
112,"""9999 leflaive macon-verze blanc le monte""","""2022 vinden estate the vinden headcase pokolbin blanc""",49,False
18,"""9999 empty id, missing wine""","""2021 lethbridge wines pinot noir""",48,False
190,"""9999 mystery barret rady""","""2013 woodlands cabernet merlot""",44,False


It appears that any match with a score of 75 or below is incorrect. These will be added to the `excluded` table and, as they are low-interest samples, will be excluded from downstream analyses.
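
The threshold rule can be stated compactly: a match is kept when its score is strictly above 75, otherwise it is set aside. A minimal sketch in plain Python, using three rows copied from the outputs in this notebook; `Match` and `split_by_threshold` are hypothetical names, not part of the notebook.

```python
from typing import NamedTuple


class Match(NamedTuple):
    pk: int
    left_string: str
    match: str
    score: int


def split_by_threshold(rows: list, threshold: int = 75) -> tuple:
    """Return (verified, unverified) using a strict 'score > threshold' rule."""
    verified = [r for r in rows if r.score > threshold]
    unverified = [r for r in rows if r.score <= threshold]
    return verified, unverified


rows = [
    Match(3, "2021 babo chianti", "2021 babo chianti", 100),
    Match(59, "2021 billaud dor bourgogne blanc", "2021 samuel billaud bourgogne dor", 90),
    Match(108, "2019 rr (?)", "2019 domaine des ardoisieres argile rouge", 73),
]
verified, unverified = split_by_threshold(rows)
print([r.pk for r in verified], [r.pk for r in unverified])
```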

Alright, so in the end we have:

In [13]:
con.sql(
    """--sql
select
    *
from
    matches
where
    verified = true
"""
).pl()


pk,left_string,match,score,verified
i32,str,str,i32,bool
1,"""2016 zema estate family selection cabernet sauvignon""","""2016 zema estate cabernet sauvignon family selection""",100,true
2,"""2022 william downie cathedral pinot noir""","""2022 william downie cathedral""",100,true
3,"""2021 babo chianti""","""2021 babo chianti""",100,true
4,"""2021 joshua cooper shays flat cabernet sauvignon""","""2021 joshua cooper cabernet sauvignon landsborough shays flat""",100,true
5,"""2022 william downie cathedral pinot noir""","""2022 william downie cathedral""",100,true
6,"""2021 babo chianti""","""2021 babo chianti""",100,true
8,"""2021 hey malbec""","""2021 matias riccitelli malbec hey malbec!""",100,true
9,"""2018 crawford river cabernets""","""2018 crawford river cabernets""",100,true
…,…,…,…,…
59,"""2021 billaud dor bourgogne blanc""","""2021 samuel billaud bourgogne dor""",90,true


In [14]:
con.sql(
    """--sql
select
    *
from
    matches
where
    verified = false
"""
).pl()


pk,left_string,match,score,verified
i32,str,str,i32,bool
108,"""2019 rr (?)""","""2019 domaine des ardoisieres argile rouge""",73,False
157,"""1001 merivale white semillon sauvignon blanc""","""2022 greywacke sauvignon blanc""",67,False
162,"""1001 allegra pinot grigio""","""2021 farina pinot grigio delle venezie""",67,False
138,"""2021 cantina orsogana""","""2021 domenica chardonnay""",62,False
168,"""2021 piagre gampania bianco""","""2021 girolamo russo etna bianco nerina""",62,False
166,"""1001 tottis vino bianco""","""2021 nino barraco fior di bianco""",58,False
154,"""1001 tottis vino rosso""","""2019 giovanni rosso etna bianco""",53,False
112,"""9999 leflaive macon-verze blanc le monte""","""2022 vinden estate the vinden headcase pokolbin blanc""",49,False
18,"""9999 empty id, missing wine""","""2021 lethbridge wines pinot noir""",48,False
190,"""9999 mystery barret rady""","""2013 woodlands cabernet merlot""",44,False


As we can see, out of 190 samples, 175 have verified wine name matches and 15 do not. The unverified samples are moved to the `excluded` table.

In [15]:
con.sql(
    """--sql
create or replace table excluded (
    pk integer primary key references st(pk),
    left_string varchar not null,
    match varchar not null,
    score varchar not null,
    reason varchar not null,
    );
insert into excluded
    select
        pk,
        left_string,
        match,
        score,
        'missing cellatracker entry' as reason,
    from
        matches
    where
        verified = false
        ;
select
    *
from
    excluded
"""
).pl()


pk,left_string,match,score,reason
i32,str,str,str,str
108,"""2019 rr (?)""","""2019 domaine des ardoisieres argile rouge""","""73""","""missing cellatracker entry"""
157,"""1001 merivale white semillon sauvignon blanc""","""2022 greywacke sauvignon blanc""","""67""","""missing cellatracker entry"""
162,"""1001 allegra pinot grigio""","""2021 farina pinot grigio delle venezie""","""67""","""missing cellatracker entry"""
138,"""2021 cantina orsogana""","""2021 domenica chardonnay""","""62""","""missing cellatracker entry"""
168,"""2021 piagre gampania bianco""","""2021 girolamo russo etna bianco nerina""","""62""","""missing cellatracker entry"""
166,"""1001 tottis vino bianco""","""2021 nino barraco fior di bianco""","""58""","""missing cellatracker entry"""
154,"""1001 tottis vino rosso""","""2019 giovanni rosso etna bianco""","""53""","""missing cellatracker entry"""
112,"""9999 leflaive macon-verze blanc le monte""","""2022 vinden estate the vinden headcase pokolbin blanc""","""49""","""missing cellatracker entry"""
18,"""9999 empty id, missing wine""","""2021 lethbridge wines pinot noir""","""48""","""missing cellatracker entry"""
190,"""9999 mystery barret rady""","""2013 woodlands cabernet merlot""","""44""","""missing cellatracker entry"""


Looks good. Now to replace the sample tracker wine key with the cellar tracker wine key for the verified samples.
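
As a cross-check on the indexing used in the update that follows: DuckDB string slices are 1-based, so `new_wine_key[6:]` skips the four-digit vintage and the space after it. The equivalent Python slice is `[5:]`. A small sketch with a key taken from the matches table; `split_wine_key` is a hypothetical helper.

```python
def split_wine_key(wine_key: str) -> tuple[int, str]:
    """Split a 'YYYY name' key into (vintage, name)."""
    # Python slices are 0-based: characters 0-3 hold the vintage, 4 is the space
    return int(wine_key[:4]), wine_key[5:]


print(split_wine_key("2022 william downie cathedral"))
```

This only works because every key starts with a four-digit vintage, which is why missing vintages are filled with 9999 when `st` is loaded.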

In [16]:
con.sql(
    """--sql
alter table st add column if not exists new_wine_key varchar;
update st
    set new_wine_key = match
    from
        matches
    where
        wine_key = left_string
    and
        verified = true
    and
        matches.pk = st.pk;
alter table st add column if not exists new_wine varchar;
update st orig
    set new_wine = new.new_wine_key[6:]
    from
        st new
    where
        new.pk = orig.pk
    ;
select
    wine_key,
    new_wine_key,
    wine,
    new_wine,
from
    st
limit 5
"""
).pl()


wine_key,new_wine_key,wine,new_wine
str,str,str,str
"""2016 zema estate family selection cabernet sauvignon""","""2016 zema estate cabernet sauvignon family selection""","""zema estate family selection cabernet sauvignon""","""zema estate cabernet sauvignon family selection"""
"""2022 william downie cathedral pinot noir""","""2022 william downie cathedral""","""william downie cathedral pinot noir""","""william downie cathedral"""
"""2021 babo chianti""","""2021 babo chianti""","""babo chianti""","""babo chianti"""
"""2021 joshua cooper shays flat cabernet sauvignon""","""2021 joshua cooper cabernet sauvignon landsborough shays flat""","""joshua cooper shays flat cabernet sauvignon""","""joshua cooper cabernet sauvignon landsborough shays flat"""
"""2022 william downie cathedral pinot noir""","""2022 william downie cathedral""","""william downie cathedral pinot noir""","""william downie cathedral"""


Finally, as we've matched on `vintage` + `wine`, we should verify that the matched vintage strings equal the `st.vintage` field:

In [17]:
con.sql(
    """--sql
select
    bool_and(cast(new_wine_key[1:4] as integer) = vintage) all_vintages_equal,
from
    st
"""
).pl()


all_vintages_equal
bool
True
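
Note that the check returns `True` even though rows without a verified match have no `new_wine_key`: `bool_and`, like other SQL aggregates, skips NULL inputs, so only rows with a replacement key are compared. A plain-Python sketch of the same check over `(new_wine_key, vintage)` pairs; `all_vintages_equal` is a hypothetical helper.

```python
from typing import Optional


def all_vintages_equal(rows: list[tuple[Optional[str], int]]) -> bool:
    """True when every non-null key starts with its row's vintage."""
    return all(
        int(key[:4]) == vintage
        for key, vintage in rows
        if key is not None  # mirror SQL: NULL keys are ignored
    )


rows = [
    ("2016 zema estate cabernet sauvignon family selection", 2016),
    ("2021 babo chianti", 2021),
    (None, 9999),  # no verified match, so no replacement key
]
print(all_vintages_equal(rows))
```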


OK, looks good to me. The fields can now be fully replaced.

In [18]:
con.sql(
    """--sql
select
    wine_key,
    new_wine_key
from
    st
"""
).pl()


wine_key,new_wine_key
str,str
"""2016 zema estate family selection cabernet sauvignon""","""2016 zema estate cabernet sauvignon family selection"""
"""2022 william downie cathedral pinot noir""","""2022 william downie cathedral"""
"""2021 babo chianti""","""2021 babo chianti"""
"""2021 joshua cooper shays flat cabernet sauvignon""","""2021 joshua cooper cabernet sauvignon landsborough shays flat"""
"""2022 william downie cathedral pinot noir""","""2022 william downie cathedral"""
"""2021 babo chianti""","""2021 babo chianti"""
"""2020 uva non grata gamay""","""2020 boutinot uva non grata"""
"""2021 hey malbec""","""2021 matias riccitelli malbec hey malbec!"""
…,…
"""2022 das juice pet nat""","""2022 das juice pet nat rose"""


In [19]:
con.sql(
    """--sql
select
    *
from
    st
where
    new_wine_key is null
"""
).pl()


pk,detection,wine_key,wine,vintage,sampler,samplecode,open_date,sampled_date,added_to_cellartracker,notes,size,new_wine_key,new_wine
i32,str,str,str,i32,str,str,str,str,bool,str,f32,str,str
14,"""raw""","""9999 """,,9999,"""jonathan""","""14""",,,True,,750.0,,
18,"""raw""","""9999 empty id, missing wine""","""empty id, missing wine""",9999,"""jonathan""","""18""",,,True,,750.0,,
27,"""raw""","""9999 mystery""","""mystery""",9999,"""jonathan""","""27""","""2023-02-10""",,True,,750.0,,
77,"""raw""","""9999 """,,9999,"""jonathan""","""77""",,,False,,750.0,,
78,"""raw""","""9999 """,,9999,"""jonathan""","""78""",,,False,,750.0,,
108,"""raw""","""2019 rr (?)""","""rr (?)""",2019,"""jonathan""","""109""","""2023-04-21""","""2023-05-08""",False,,750.0,,
111,"""raw""","""9999 clembush""","""clembush""",9999,"""jonathan""","""112""","""2023-05-05""","""2023-05-08""",False,,750.0,,
112,"""raw""","""9999 leflaive macon-verze blanc le monte""","""leflaive macon-verze blanc le monte""",9999,"""jonathan""","""113""","""2023-04-20""","""2023-05-08""",False,,750.0,,
138,"""cuprac""","""2021 cantina orsogana""","""cantina orsogana""",2021,"""colin""","""127""","""2023-05-13""","""2023-05-19""",False,"""2023-06-04 11:05:26 - will have to check signal to id what wine this is""",,,
154,"""cuprac""","""1001 tottis vino rosso""","""tottis vino rosso""",1001,"""davy""","""143""",,,False,"""sangiovese blend italy""",,,


In [20]:
con.sql(
    """--sql
select
    (select count(*) from st) as total_count,
    (select count(*) from st where new_wine_key is null) as null_count,
    (select count(*) from st where new_wine_key is not null) as not_null_count;
"""
).pl()


total_count,null_count,not_null_count
i64,i64,i64
190,15,175


In [21]:
con.sql(
    """--sql
update st as orig
    set wine_key = (
    select
        coalesce(new_wine_key, wine_key)
    from
        st as new
    where
        orig.pk = new.pk
        );
update st as orig
    set wine = (
    select
        wine_key[6:]
    from
        st as new
    where
        orig.pk = new.pk
    );
select
    wine_key,
    wine,
    vintage,
from
    st
"""
).pl()


wine_key,wine,vintage
str,str,i32
"""2016 zema estate cabernet sauvignon family selection""","""zema estate cabernet sauvignon family selection""",2016
"""2022 william downie cathedral""","""william downie cathedral""",2022
"""2021 babo chianti""","""babo chianti""",2021
"""2021 joshua cooper cabernet sauvignon landsborough shays flat""","""joshua cooper cabernet sauvignon landsborough shays flat""",2021
"""2022 william downie cathedral""","""william downie cathedral""",2022
"""2021 babo chianti""","""babo chianti""",2021
"""2020 boutinot uva non grata""","""boutinot uva non grata""",2020
"""2021 matias riccitelli malbec hey malbec!""","""matias riccitelli malbec hey malbec!""",2021
…,…,…
"""2022 das juice pet nat rose""","""das juice pet nat rose""",2022


In [22]:
con.sql(
    """--sql
select * from st limit 3
"""
).pl()


pk,detection,wine_key,wine,vintage,sampler,samplecode,open_date,sampled_date,added_to_cellartracker,notes,size,new_wine_key,new_wine
i32,str,str,str,i32,str,str,str,str,bool,str,f32,str,str
1,"""raw""","""2016 zema estate cabernet sauvignon family selection""","""zema estate cabernet sauvignon family selection""",2016,"""jonathan""","""00""",,,True,"""freezer storage. sampled at 21:20 20230122, stored for 10h before transport to lab.""",750.0,"""2016 zema estate cabernet sauvignon family selection""","""zema estate cabernet sauvignon family selection"""
2,"""raw""","""2022 william downie cathedral""","""william downie cathedral""",2022,"""jonathan""","""01""",,,True,"""ambient 2 weeks. sampled 20230111.""",750.0,"""2022 william downie cathedral""","""william downie cathedral"""
3,"""raw""","""2021 babo chianti""","""babo chianti""",2021,"""jonathan""","""02""",,,True,"""ambient 2 weeks. sampled 20230111.""",750.0,"""2021 babo chianti""","""babo chianti"""


In [23]:
con.sql(
    """--sql
select
    pk, wine_key, added_to_cellartracker
from
    st
where
    added_to_cellartracker = false
"""
).pl()


pk,wine_key,added_to_cellartracker
i32,str,bool
77,"""9999 """,False
78,"""9999 """,False
108,"""2019 rr (?)""",False
111,"""9999 clembush""",False
112,"""9999 leflaive macon-verze blanc le monte""",False
138,"""2021 cantina orsogana""",False
154,"""1001 tottis vino rosso""",False
157,"""1001 merivale white semillon sauvignon blanc""",False
162,"""1001 allegra pinot grigio""",False
168,"""2021 piagre gampania bianco""",False


In [24]:
con.sql(
    """--sql
select
    *
from
    excluded
"""
).pl()


pk,left_string,match,score,reason
i32,str,str,str,str
108,"""2019 rr (?)""","""2019 domaine des ardoisieres argile rouge""","""73""","""missing cellatracker entry"""
157,"""1001 merivale white semillon sauvignon blanc""","""2022 greywacke sauvignon blanc""","""67""","""missing cellatracker entry"""
162,"""1001 allegra pinot grigio""","""2021 farina pinot grigio delle venezie""","""67""","""missing cellatracker entry"""
138,"""2021 cantina orsogana""","""2021 domenica chardonnay""","""62""","""missing cellatracker entry"""
168,"""2021 piagre gampania bianco""","""2021 girolamo russo etna bianco nerina""","""62""","""missing cellatracker entry"""
166,"""1001 tottis vino bianco""","""2021 nino barraco fior di bianco""","""58""","""missing cellatracker entry"""
154,"""1001 tottis vino rosso""","""2019 giovanni rosso etna bianco""","""53""","""missing cellatracker entry"""
112,"""9999 leflaive macon-verze blanc le monte""","""2022 vinden estate the vinden headcase pokolbin blanc""","""49""","""missing cellatracker entry"""
18,"""9999 empty id, missing wine""","""2021 lethbridge wines pinot noir""","""48""","""missing cellatracker entry"""
190,"""9999 mystery barret rady""","""2013 woodlands cabernet merlot""","""44""","""missing cellatracker entry"""


Not every excluded sample is flagged as not "added to cellartracker". This indicates that some samples are recorded as added to cellartracker although no match could be made. So which samples are present in the subset added to cellartracker, but also excluded?

## Correct 158 'added_to_cellartracker'


In [25]:
con.sql(
    """--sql
select
    pk, wine_key, match, score
from
    excluded
join
    st
using
    (pk)
where
    st.added_to_cellartracker = true
"""
).pl()


pk,wine_key,match,score
i32,str,str,str
14,"""9999 ""","""2019 deviation road loftia""","""7"""
18,"""9999 empty id, missing wine""","""2021 lethbridge wines pinot noir""","""48"""
27,"""9999 mystery""","""2022 yangarra estate rose""","""32"""
166,"""1001 tottis vino bianco""","""2021 nino barraco fior di bianco""","""58"""


Of these, `1001 tottis vino bianco` is the only one with a usable name. Is it present in `ct`?

In [26]:
con.sql(
    """--sql
select
    *
from
    ct
where
    wine like '%tott%'
"""
).pl()


wine_key,size,vintage,wine,locale,country,region,subregion,appellation,producer,type,color,category,varietal
str,str,i32,str,str,str,str,str,str,str,str,str,str,str


No rows are returned, so the `added_to_cellartracker` flag is incorrect. Time to correct it.

In [27]:
con.sql(
    """--sql
select * from excluded
"""
).pl()


pk,left_string,match,score,reason
i32,str,str,str,str
108,"""2019 rr (?)""","""2019 domaine des ardoisieres argile rouge""","""73""","""missing cellatracker entry"""
157,"""1001 merivale white semillon sauvignon blanc""","""2022 greywacke sauvignon blanc""","""67""","""missing cellatracker entry"""
162,"""1001 allegra pinot grigio""","""2021 farina pinot grigio delle venezie""","""67""","""missing cellatracker entry"""
138,"""2021 cantina orsogana""","""2021 domenica chardonnay""","""62""","""missing cellatracker entry"""
168,"""2021 piagre gampania bianco""","""2021 girolamo russo etna bianco nerina""","""62""","""missing cellatracker entry"""
166,"""1001 tottis vino bianco""","""2021 nino barraco fior di bianco""","""58""","""missing cellatracker entry"""
154,"""1001 tottis vino rosso""","""2019 giovanni rosso etna bianco""","""53""","""missing cellatracker entry"""
112,"""9999 leflaive macon-verze blanc le monte""","""2022 vinden estate the vinden headcase pokolbin blanc""","""49""","""missing cellatracker entry"""
18,"""9999 empty id, missing wine""","""2021 lethbridge wines pinot noir""","""48""","""missing cellatracker entry"""
190,"""9999 mystery barret rady""","""2013 woodlands cabernet merlot""","""44""","""missing cellatracker entry"""


In [28]:
con.sql(
    """--sql
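-- caution: in the output below pk 158 is '2020 leeuwin estate cabernet sauvignon prelude';
-- tottis vino bianco is pk 166 with samplecode '155'
update st
    set added_to_cellartracker = false
    where
        pk = 158;
select
    pk, wine_key, added_to_cellartracker
from
    st
where
    added_to_cellartracker = false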
"""
).pl()


pk,wine_key,added_to_cellartracker
i32,str,bool
77,"""9999 """,False
78,"""9999 """,False
108,"""2019 rr (?)""",False
111,"""9999 clembush""",False
112,"""9999 leflaive macon-verze blanc le monte""",False
138,"""2021 cantina orsogana""",False
154,"""1001 tottis vino rosso""",False
157,"""1001 merivale white semillon sauvignon blanc""",False
158,"""2020 leeuwin estate cabernet sauvignon prelude""",False
162,"""1001 allegra pinot grigio""",False


## Correct 14, 18, 27, 155 'added_to_cellartracker'


The samples with the sample codes named above are incorrectly flagged as added to cellartracker, but they aren't. Correct that field.


In [30]:
def diff_st_ct_but_added_to_cellartracker_true(
    con: db.DuckDBPyConnection,
) -> pl.DataFrame:
    """
    anti join ct and st on wine, vintage where 'added_to_cellartracker' = true
    """
    return con.sql(
        """--sql
        select
            *
        from
            st
        anti join
            ct
        on
            st.vintage = ct.vintage
        and
            st.wine = ct.wine
        where
            st.added_to_cellartracker = true
        """
    ).pl()


diff_st_ct_but_added_to_cellartracker_true(con=con)


pk,detection,wine_key,wine,vintage,sampler,samplecode,open_date,sampled_date,added_to_cellartracker,notes,size,new_wine_key,new_wine
i32,str,str,str,i32,str,str,str,str,bool,str,f32,str,str
14,"""raw""","""9999 ""","""""",9999,"""jonathan""","""14""",,,True,,750.0,,
18,"""raw""","""9999 empty id, missing wine""","""empty id, missing wine""",9999,"""jonathan""","""18""",,,True,,750.0,,
27,"""raw""","""9999 mystery""","""mystery""",9999,"""jonathan""","""27""","""2023-02-10""",,True,,750.0,,
166,"""cuprac""","""1001 tottis vino bianco""","""tottis vino bianco""",1001,"""davy""","""155""",,,True,"""tottis vino bianco pinot grigio friuli, it""",,,
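
The `anti join` above keeps the `st` rows that have no partner in `ct`. In engines without that syntax, the same result can be expressed with `not exists`. The sketch below uses the standard library's `sqlite3` purely as a stand-in for DuckDB, with two `st` rows and one `ct` row copied from the outputs in this notebook. The reduced table layouts and the name `demo_con` are assumptions for illustration.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.executescript(
    """
    create table st (pk integer, wine text, vintage integer, added_to_cellartracker integer);
    create table ct (wine text, vintage integer);
    insert into st values
        (3, 'babo chianti', 2021, 1),
        (166, 'tottis vino bianco', 1001, 1);
    insert into ct values ('babo chianti', 2021);
    """
)

# keep st rows flagged as added that have no (wine, vintage) partner in ct
unmatched = demo_con.execute(
    """
    select pk, wine
    from st
    where added_to_cellartracker = 1
      and not exists (
          select 1
          from ct
          where ct.wine = st.wine
            and ct.vintage = st.vintage
      )
    """
).fetchall()
demo_con.close()
print(unmatched)
```

Only the row without a `ct` partner comes back, which is the set of rows whose `added_to_cellartracker` flag needs correcting.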


In [33]:
con.sql(
    """--sql
update st
    set added_to_cellartracker = false
    where
        samplecode in ('14','18','27','155')
"""
)

assert diff_st_ct_but_added_to_cellartracker_true(con=con).is_empty()


## Output ST


In [39]:
from database_etl.definitions import DATA_DIR

sampletracker_out_path = DATA_DIR / "dirty_sample_tracker_names_corrected.parquet"


def output_name_corrected_st(
    con: db.DuckDBPyConnection, outpath: str, overwrite: bool = True
) -> None:
    # preview the table that is about to be written
    con.sql(
        """--sql
    select
        *
    from
        st
    limit 3
    """
    ).pl().pipe(display)

    # only write, and report the write, when overwrite is set
    if overwrite:
        con.sql(
            f"""--sql
        copy (
            select
                detection,
                sampler,
                samplecode,
                vintage,
                wine,
                open_date,
                sampled_date,
                added_to_cellartracker,
                notes,
                size,
            from
            st
        ) to '{outpath}' (FORMAT PARQUET)
        """
        )
        print(f"st written to {outpath}")


output_name_corrected_st(con=con, outpath=str(sampletracker_out_path))


pk,detection,wine_key,wine,vintage,sampler,samplecode,open_date,sampled_date,added_to_cellartracker,notes,size,new_wine_key,new_wine
i32,str,str,str,i32,str,str,str,str,bool,str,f32,str,str
1,"""raw""","""2016 zema estate cabernet sauvignon family selection""","""zema estate cabernet sauvignon family selection""",2016,"""jonathan""","""00""",,,True,"""freezer storage. sampled at 21:20 20230122, stored for 10h before transport to lab.""",750.0,"""2016 zema estate cabernet sauvignon family selection""","""zema estate cabernet sauvignon family selection"""
2,"""raw""","""2022 william downie cathedral""","""william downie cathedral""",2022,"""jonathan""","""01""",,,True,"""ambient 2 weeks. sampled 20230111.""",750.0,"""2022 william downie cathedral""","""william downie cathedral"""
3,"""raw""","""2021 babo chianti""","""babo chianti""",2021,"""jonathan""","""02""",,,True,"""ambient 2 weeks. sampled 20230111.""",750.0,"""2021 babo chianti""","""babo chianti"""


st written to /Users/jonathan/mres_thesis/database_etl/database_etl/data/dirty_sample_tracker_names_corrected.parquet


And the cleaned cellartracker:

## Output CT


In [None]:
con.sql(
    """--sql
select
    *
from
    ct
limit 3
"""
).pl().pipe(display)

if overwrite_cellar_tracker:
    con.sql(
        """--sql
        copy (
            select
                wine_key,
                size,
                vintage,
                wine,
                locale,
                country,
                region,
                subregion,
                appellation,
                producer,
                type,
                color,
                category,
                varietal
            from
                ct
        ) to '/Users/jonathan/mres_thesis/database_etl/data/clean_cellar_tracker.parquet' (format parquet)
        """
    )


# Results

In the end we have ended up with a cleaned sample tracker table that is able to join to the metadata in cellar tracker:

In [None]:
con.sql(
    """--sql
select
    'total st count' as table,
    (select count(*) from st) as count
union
select
    'inner join st to ct' as table,
    count(*) as count
from
    st
inner join
    ct
on
    st.wine = ct.wine
and
    st.vintage = ct.vintage
union
select
    'entries in excluded table' as table,
    (select count(*) from excluded) as count
"""
).pl().pipe(display)
con.close()
del con


Out of 190 entries, 175 have corresponding cellartracker metadata, and 15 are missing entries.

The sample tracker names have been cleaned up, and the original has been updated, enabling joins between ct and st. I have elected to keep the original locally in this dir. Rather than using the code above as a basis, I will keep it isolated and recreate the tables in a core database, including the `excluded` table, which will then be based on an anti join between st and ct on the vintage + wine.