---
title: Table Descriptions
cdt: 2024-09-05T13:27:05
description: "A description of the key tables: 'chromatogram_spectra_long' and 'sample_metadata'"
notes:
    - The document is arranged so that the summary and conclusions come first and contain hyperlinks to their relevant backing sections. This allows for quick scanning without needing to run the code.
---

The following is a brief description of the 'sample_metadata' and 'chromatogram_spectra_long' tables, including the number of columns and rows, the column names and the primary key. In this case we know the primary key is 'id' in each table. That will, however, need to be confirmed.

# Summary

There are two key tables: 'sample_metadata' and 'chromatogram_spectra_long'. They hold, respectively, the most important sample metadata and the chromatospectral image of each sample.

[sample_metadata](#sample-metadata) possesses 175 rows and 8 columns: 'detection', 'acq_date', 'wine', 'color', 'varietal', 'samplecode', 'id', and 'sample_num'. Both 'id' and 'sample_num' are unique and can serve as primary keys: 'id' is a UUID derived from a hash, and 'sample_num' is a monotonically ascending key ordered by sample acquisition date.

[chromatogram_spectra_long](#chromatogram-spectra-long) consists of 173,048,494 rows and the following 5 columns: 'idx', 'id', 'mins', 'wavelength', and 'absorbance'. 'id' identifies the sample and links each row to 'sample_metadata'; with 175 unique values across all rows, it repeats and is not unique here. 'idx'/'mins' ranges from 0 to 15599, 'wavelength' ranges from 190 to 600 in steps of 2 with 205 unique values, and 'absorbance' ranges from ~-3596 to ~2918. The wide range of 'idx'/'mins' indicates an outlier, as *a priori* we know that the samples usually run around 4800 observations. Furthermore, absorbances should not venture far below 0, indicating more outliers.

# Conclusions

It appears that chromatogram_spectra_long contains either duplicated samples OR samples observed at ~2x the frequency of the *a priori* known 2.5 Hz rate.

To continue, we should investigate the lengths and frequency of each sample and the absorbance ranges.


# Sample Metadata



In [None]:
import duckdb as db
import polars as pl

db_path = "/Users/jonathan/mres_thesis/wine_analysis_hplc_uv/wines.db"

con = db.connect(db_path)


In [None]:
con.sql(
    """
    SELECT
        table_schema,
        table_name,
        column_name
    FROM
        information_schema.columns
    WHERE
        table_schema = 'pbl'
    AND
        table_name = 'sample_metadata'
    """
    ).pl()


In [None]:
con.sql(
"""
SELECT
    CASE WHEN (COUNT(*) / COUNT(DISTINCT sample_num)) = 1 THEN 'yes' ELSE 'no' END as is_sample_num_unique
FROM
    pbl.sample_metadata
"""
).pl()


In [None]:
con.sql(
    """
    SELECT
        CASE WHEN (COUNT(*) / COUNT(DISTINCT id)) = 1 THEN 'yes' ELSE 'no' END as is_id_unique
    FROM
        pbl.sample_metadata
    """
).pl()


In [None]:
con.sql(
    """
    select
        schema_name,
        table_name,
        estimated_size as rows,
        column_count as columns
    from
        duckdb_tables
    where
        schema_name = 'pbl'
    AND
        table_name = 'sample_metadata'
    """
    ).pl()


## Summary

In summary, the sample metadata table consists of the following eight columns: "detection", "acq_date", "wine", "color", "varietal", "samplecode", "id", "sample_num". There are 175 rows, each with a unique "sample_num" and corresponding unique "id".


# Chromatogram Spectra Long

The 'chromatogram_spectra_long' table contains the chromatospectral images of each sample in long form, one column per dimension and one for the values.

In [None]:
con.sql(
"""
SELECT
    table_schema,
    table_name,
    column_name
FROM
    information_schema.columns
WHERE
    table_schema = 'pbl'
AND
    table_name = 'chromatogram_spectra_long'
"""
).pl()


There are 5 columns: "idx", "id", "mins", "wavelength" and "absorbance".


In [None]:
con.sql(
"""
select
    schema_name,
    table_name,
    estimated_size as rows,
    column_count as columns
from
    duckdb_tables
where
    schema_name = 'pbl'
AND
    table_name = 'chromatogram_spectra_long'
"""
).pl()


There are 173,048,494 rows and 5 columns.


In [None]:
# what is 'idx' vs. 'mins'?

con.sql(
    """
    SUMMARIZE pbl.chromatogram_spectra_long
    """
).pl()


As we've now [learnt](./random_notes.md#duckdb-summarize-approx-is-approx), the SUMMARIZE keyword produces an *approximate* unique count. It is still useful for the other statistics however. In lieu of a better way of finding this information, I will manually observe the distinct, or unique values:

In [None]:
con.sql(
    """
    SELECT
        COUNT(distinct idx) as idx,
        COUNT(distinct id) as id,
        COUNT(distinct mins) as mins,
        COUNT(distinct wavelength) as wavelength,
        COUNT(distinct absorbance) as absorbance
    FROM
        pbl.chromatogram_spectra_long
    """
).pl()


As we can see by comparing the two outputs, almost all of the estimates were wrong. I wonder what it's doing.

"idx" ranges from 0 to 15599 (indicating a double up). ~~There are 174 unique id's, compared to the 175 in the sample_metadata table~~ There are 175 unique "id", matching those of "sample_metadata". "wavelength" ranges from 190 to 600 nm with 205 unique values. Interestingly, "absorbance" ranges from ~-3596 to ~2918. Presumably we've got some dud samples in the mix.


## Summary

In summary, chromatogram_spectra_long consists of five columns: 'idx', 'id', 'mins', 'wavelength', and 'absorbance'. The "idx" ranges from 0 to 15599, "id" contains ~~174~~ 175 unique values, the "mins" range from ~0 to ~52 mins, "wavelength" ranges from 190 to 600 nm with 205 unique values, and absorbance from ~-3596 to ~2918. The idx maximum indicates that there is a double up, or that some samples were observed at a much higher frequency than others. The presence of strongly negative absorbance indicates that there are some erroneous samples.
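As a rough sanity check on these figures, the arithmetic can be written out. The sketch below uses only numbers quoted in this document (the runtime is approximate, and the 2 nm wavelength step comes from the top-level summary), and it assumes that a single run could account for the full "idx" range.

```python
# Values quoted in this document; the runtime is approximate ("~52 mins").
n_obs = 15_599 + 1   # idx runs from 0 to 15599
runtime_mins = 52
nominal_hz = 2.5     # a priori sampling rate

expected_obs = runtime_mins * 60 * nominal_hz  # observations expected at 2.5 Hz
implied_hz = n_obs / (runtime_mins * 60)       # rate if one run produced every idx
ratio = n_obs / expected_obs                   # observed vs. expected length

# A complete 2 nm grid from 190 to 600 nm inclusive.
full_grid = len(range(190, 601, 2))

print(expected_obs, implied_hz, ratio, full_grid)
# → 7800.0 5.0 2.0 206
```

Under these assumptions the idx maximum fits either reading: twice the expected number of observations, or a rate of about 5 Hz. A complete 2 nm grid would hold 206 wavelengths, one more than the 205 reported, which may be worth checking.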