---
cdt: 2024-09-10T11:02:40
title: Excluding Raw Samples Lacking Metadata
project: database_architecture
description: "Some 'raw' samples are missing their metadata. This notebook explores why, and how to address it."
conclusion: "Six samples in the 'raw' dataset lack metadata. As they have been judged to lack value, they are excluded for now by adding their 'sample_num' to the table 'dataset_eda.excluded_samples'."
---


In [None]:
%reload_ext autoreload
%autoreload 2

import duckdb as db
import polars as pl
from pca_analysis.experiments.constants import db_path

con = db.connect(db_path, read_only=True)



In [None]:
null_varietals = con.sql(
"""--sql
SELECT
    *
FROM
    pbl.sample_metadata
WHERE
    varietal IS NULL
AND
    detection = 'raw'
"""
).pl()
null_varietals


Six samples are missing 'wine', 'color', and 'varietal' values. Why?


In [None]:
con.sql(
"""--sql
SELECT
    *
FROM
    c_sample_tracker
WHERE
    samplecode IN (SELECT samplecode FROM null_varietals)
"""
).pl()


I assume it's because the wines are not in the cellar tracker table. Frankly, I don't consider these wines valuable because their categories are under-represented. They can be reintegrated later if necessary.
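
The assumption above could be checked with an anti-join. The following is only a minimal sketch: it uses the standard-library `sqlite3` module with in-memory toy tables rather than the project's DuckDB database, and the table names `samples` and `cellar`, the join key `wine_key`, and all rows are hypothetical. The SQL is plain enough that it should carry over to DuckDB once the real table and column names are substituted.

```python
import sqlite3

# Toy stand-ins for the real tables. All names and rows are hypothetical.
demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE samples (sample_num INTEGER, samplecode TEXT, wine_key TEXT);
    CREATE TABLE cellar (wine_key TEXT, varietal TEXT);
    INSERT INTO samples VALUES (1, 'a1', 'w1'), (2, 'b2', 'w2'), (3, 'c3', 'w3');
    INSERT INTO cellar VALUES ('w1', 'shiraz'), ('w3', 'pinot noir');
    """
)

# Anti-join: keep samples whose wine has no matching row in the cellar table.
untracked = demo.execute(
    """
    SELECT s.sample_num, s.samplecode
    FROM samples AS s
    LEFT JOIN cellar AS c ON s.wine_key = c.wine_key
    WHERE c.wine_key IS NULL
    ORDER BY s.sample_num
    """
).fetchall()
print(untracked)
```

If every sample with missing metadata shows up in such a result, the assumption holds; any sample that does not would need a different explanation.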


In [None]:
missing_metadata_excluded_samples = con.sql(
"""--sql
WITH
  commented AS (
    SELECT
      sample_num,
      'missing metadata' AS comment,
      'excluding_raw_samples_without_metadata.ipynb' AS proof
    FROM
      null_varietals
    )
SELECT
  *
FROM
  commented
"""
).pl()

missing_metadata_excluded_samples


We will thus add them to the excluded list.
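
One thing to watch: a plain `INSERT ... SELECT` adds the same rows again each time the cell is re-run. A possible guard is to insert only rows whose 'sample_num' is not already present. The sketch below illustrates the idea with the standard-library `sqlite3` module and in-memory toy tables, not the project database. The column layout (`sample_num`, `comment`, `proof`) is assumed from the query above, and the `to_exclude` table, the sample numbers, and the `add_exclusions` helper are hypothetical.

```python
import sqlite3

# Toy tables. The column layout is an assumption based on the query above.
demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE excluded_samples (sample_num INTEGER, comment TEXT, proof TEXT);
    CREATE TABLE to_exclude (sample_num INTEGER, comment TEXT, proof TEXT);
    INSERT INTO to_exclude VALUES
        (101, 'missing metadata', 'demo.ipynb'),
        (102, 'missing metadata', 'demo.ipynb');
    """
)


def add_exclusions(con):
    """Insert only rows whose sample_num is not already excluded.

    Returns the row count of excluded_samples after the insert.
    """
    with con:  # commits on success, rolls back on error
        con.execute(
            """
            INSERT INTO excluded_samples (sample_num, comment, proof)
            SELECT sample_num, comment, proof
            FROM to_exclude
            WHERE sample_num NOT IN (SELECT sample_num FROM excluded_samples)
            """
        )
    return con.execute("SELECT COUNT(*) FROM excluded_samples").fetchone()[0]


first = add_exclusions(demo)   # inserts both rows
second = add_exclusions(demo)  # re-run inserts nothing new
print(first, second)
```

Naming the target columns explicitly, instead of relying on `SELECT *`, also protects against a silent mismatch if the column order of the two tables ever differs.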

In [None]:
#%%script #uncomment to execute this cell

con.close()
con = db.connect(db_path)

if input("warning: this will add values to table 'dataset_eda.excluded_samples'. press 'y' to continue:") == 'y':
    con.sql(
    """--sql
    INSERT INTO
        dataset_eda.excluded_samples
    SELECT
        *
    FROM
        missing_metadata_excluded_samples
    """)

    excluded_samples = con.sql(
    """--sql
    SELECT
        *
    FROM
        dataset_eda.excluded_samples
    """
    ).pl()
    print("added values to 'dataset_eda.excluded_samples'")
    display(excluded_samples)
else:
    print('did not execute SQL')
