---
project: dataset_EDA
title: Raw Dataset Description
cdt: 2024-09-09T23:44:30
description: a descriptive analysis of the raw dataset
---


# Sample Metadata

We define the metadata as all categorical information describing a sample: the wine name, geographic origin, variety, producer, vintage, etc.


In [1]:
%reload_ext autoreload
%autoreload 2
import duckdb as db
import polars as pl
from pca_analysis.experiments.constants import db_path
from great_tables import GT
pl.Config.set_tbl_rows(999).set_tbl_width_chars(2000).set_fmt_str_lengths(99999)

con = db.connect(db_path, read_only=True)


- ~~how many samples~~
- ~~what colors~~
- ~~what varieties~~
- what vintages


In [2]:
sm = con.sql(
"""--sql
SELECT
    *
FROM
    pbl.sample_metadata
ANTI JOIN
    (SELECT sample_num FROM dataset_eda.excluded_samples)
USING
    (sample_num)
WHERE
    detection = 'raw'
"""
).pl()
sm


## Columns


In [4]:
sm.columns


The sample metadata table contains the following columns:

- 'detection': the detection method of the sample signal
- 'acq_date': the date the sample was observed
- 'wine': the name of the wine from which the sample was taken
- 'color': the color of the wine
- 'varietal': the grape variety the wine was made from
- 'samplecode': the unique identifier assigned to the sample at the time of observation
- 'id': a unique identifier hash generated by the Agilent ChemStation software, used to join the metadata to the signal table
- 'sample_num': a human-readable unique identifier, increasing monotonically from 1 in ascending order of 'acq_date'
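
The ordering described for 'sample_num' can be checked directly. The following is a minimal sketch using only the standard library and hypothetical rows: the `rows` list and the `numbered_by_date` helper are illustrative and not part of the project, and it assumes 'acq_date' compares like a Python `date`.

```python
from datetime import date

# Hypothetical (sample_num, acq_date) pairs; NOT values from the real table.
rows = [
    (1, date(2023, 1, 5)),
    (2, date(2023, 1, 5)),
    (4, date(2023, 2, 1)),  # gaps are tolerated, e.g. after excluding samples
]


def numbered_by_date(rows):
    """Return True if sample_num is unique and acq_date never decreases with it."""
    ordered = sorted(rows, key=lambda row: row[0])  # sort by sample_num
    nums = [num for num, _ in ordered]
    dates = [acq for _, acq in ordered]
    unique = len(set(nums)) == len(nums)
    non_decreasing = all(a <= b for a, b in zip(dates, dates[1:]))
    return unique and non_decreasing


print(numbered_by_date(rows))
```

The check deliberately does not require contiguous numbering, since a filtered table may skip values of 'sample_num'.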

## Sample Count

In [5]:
con.sql(
"""--sql
SELECT
    count(DISTINCT sample_num) AS sample_count
FROM
    sm
"""
).pl()


The raw dataset contains 96 samples.

## Color


In [6]:
con.sql(
"""--sql
SELECT
    color,
    count(sample_num)
FROM
    sm
GROUP BY color
ORDER BY
    color
"""
).pl()


There are 4 unique colors: 'orange', 'red', 'rosé', and 'white'. The 96 samples break down as follows: 'orange' = 1, 'red' = 68, 'rosé' = 3, 'white' = 24.
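
As a quick consistency check, the per-color counts should sum to the 96 samples reported earlier. The sketch below uses only the counts stated in the text; the variable names are illustrative, and the percentage shares are derived here rather than taken from the notebook output.

```python
# Per-color sample counts as reported in the text above.
color_counts = {"orange": 1, "red": 68, "rosé": 3, "white": 24}

# The counts should account for every sample in the raw dataset.
total = sum(color_counts.values())

# Share of each color as a percentage of the total, rounded to one decimal.
shares = {color: round(100 * n / total, 1) for color, n in color_counts.items()}

for color, share in shares.items():
    print(f"{color}: {color_counts[color]} samples ({share}%)")
print(f"total: {total}")
```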


## Variety

### Samples per Varietal

In [7]:
varietal_counts = con.sql(
"""--sql
SELECT
    varietal,
    count(sample_num) as count
FROM
    sm
GROUP BY
    varietal
ORDER BY
    varietal
"""
).pl()

varietal_counts.describe()


There are 33 varieties within the 'raw' dataset.


In [27]:
con.sql(
"""--sql
WITH
    binned AS(
        SELECT
            varietal,
            count,
            CASE
                WHEN
                    count = 1
                THEN
                    {'desc':'one', 'bin_rank': 0}
                WHEN
                    count BETWEEN 2 AND 5
                THEN
                    {'desc':'between 2 and 5', 'bin_rank':1}
                WHEN
                    -- BETWEEN is inclusive; a count of 5 already matches the previous branch
                    count BETWEEN 6 AND 10
                THEN
                    {'desc':'between 6 and 10', 'bin_rank':2}
                WHEN
                    count > 10
                THEN
                    {'desc':'greater than 10', 'bin_rank':3}
                END AS bin
                    
        FROM
            varietal_counts
        ORDER BY
            bin ASC
        ),
    agg AS (
        SELECT
            bin,
            count(varietal) as var_per_bin,
        FROM
            binned
        GROUP BY
            bin
        ORDER BY
            bin
        ),
    unpacked AS (
        SELECT
            bin.*,
            var_per_bin
        FROM
            agg
        )
SELECT
    bin_rank + 1 as bin_rank,
    "desc",
    var_per_bin
    
FROM
    unpacked
ORDER BY
    bin_rank
"""
).pl().pipe(GT)


Varietal representation within the dataset is uneven, ranging from 1 to 11 samples per varietal. The varietals were grouped into four bins by sample count $n$: bin 1: $n = 1$; bin 2: $2 \le n \le 5$; bin 3: $6 \le n \le 10$; and bin 4: $n > 10$ (in practice, 11 samples). Bin 1 contains 13 varietals, bin 2 contains 16, and bins 3 and 4 contain 2 each. The full tabulation can be found in the [appendix](#varietal-counts).
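
The binning rule can be sketched outside the database. The example below mirrors the bin edges as the SQL `CASE` expression evaluates them (the first matching branch wins and `BETWEEN` is inclusive, so a count of 5 lands in bin 2). It uses only the standard library; `toy_counts` and `bin_rank` are hypothetical names with made-up values, not the real per-varietal counts.

```python
from collections import Counter

# Hypothetical per-varietal sample counts; NOT the real dataset values.
toy_counts = {"A": 1, "B": 2, "C": 5, "D": 6, "E": 10, "F": 11}


def bin_rank(n: int) -> int:
    """Return the 1-based bin for a per-varietal sample count n."""
    if n < 1:
        raise ValueError("a varietal must have at least one sample")
    if n == 1:
        return 1
    if n <= 5:  # 2 <= n <= 5
        return 2
    if n <= 10:  # 6 <= n <= 10
        return 3
    return 4  # n > 10


# Number of varietals falling into each bin.
varietals_per_bin = Counter(bin_rank(n) for n in toy_counts.values())

for rank in sorted(varietals_per_bin):
    print(rank, varietals_per_bin[rank])
```

Writing the edges as non-overlapping ranges makes the boundary cases (5 and 10) explicit, so the result does not depend on branch order.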

### Most Represented Varietals


In [28]:
varietal_counts.sort('count', descending=True).head(8).pipe(GT)


The most represented varieties are Pinot Noir (11), Shiraz (11), Chardonnay (7), Red Bordeaux Blends (6), Gamay (5), Malbec (5), Nebbiolo (5), and Riesling (5).

## Vintage

In [46]:
con.sql(
"""--sql
SELECT
    table_schema,
    table_name,
    column_name
FROM
    information_schema.columns
WHERE
    table_name = 'c_cellar_tracker'
"""
).pl()['column_name'].to_list()


# Signal


# Appendix

## Varietal Counts


In [18]:
from great_tables import GT, md, html

varietal_counts_tbl = con.sql(
"""--sql
WITH 
    agg AS (
        SELECT
            varietal,
            count(varietal) as count,
        FROM
            sm
        GROUP BY 
            varietal
        order by
            varietal
    ),
    tiled AS (
        SELECT
            ntile(2) OVER (ORDER BY varietal) as col,  
            *
        FROM
            agg
        ),
    row_nummed AS (
        SELECT
            row_number() OVER (PARTITION BY col ORDER BY varietal) as row_num,
            *
        FROM
            tiled
        ),
    col_1 AS (
        SELECT
            *,
        FROM
            row_nummed
        WHERE
            col = 1
        ),
    col_2 AS (
        SELECT
            *,
        FROM
            row_nummed
        WHERE
            col = 2
        ),
    joined AS (
        SELECT
            *
        FROM
            col_1 as col1
        -- LEFT JOIN keeps the last row of column 1 when the number of varietals is odd
        LEFT JOIN
            col_2 as col2
        USING
            (row_num)
        ORDER BY
            row_num
            
        )
SELECT
    * EXCLUDE (col, row_num, col_1)
FROM
    joined
"""
).pl()
varietal_counts_tbl.pipe(GT).cols_label(
    varietal_1=html('varietal'),
    count_1=html('count')
)
