---
title: Sample Set nm 254 Creation
description: "creation of the `dataset_eda` schema and nm = 254 sample for the project"
cdt: '2024-09-06T14:32:38'
project: "total_dataset_EDA"
execution_order: "000"
conclusion: Created the schema 'dataset_eda' to contain the work of the 'dataset_EDA' project. The table 'dataset_eda.nm_254' was created from the rows of 'pbl.chromatogram_spectra_long' at 254 nm, and includes the 'sample_num' primary key.
---

# Wavelength Subset Selection

Based on *a priori* knowledge, 254 nm is one of the wavelength maxima, and every sample has a detection at that channel, which makes it the best single-wavelength representative of the dataset. We therefore create a subset at that wavelength, stored as the table `nm_254` under the schema `dataset_eda` within the 'wine.db' database. We also add the "sample_num" column from "sample_metadata" for convenience.

In [None]:
%reload_ext autoreload
%autoreload 2

import duckdb as db
import polars as pl
from pca_analysis.experiments.constants import db_path

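# Open read-only inside a context manager so the connection is closed before
# the next cell reopens the database in read-write mode; DuckDB refuses to open
# the same file with a different configuration while this one is still open.
with db.connect(db_path, read_only=True) as con:
    tables = con.sql("SHOW").pl().head()
tables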


In [None]:
with db.connect(db_path) as con:
    con.sql(
    """--sql
    CREATE SCHEMA IF NOT EXISTS dataset_eda;
    CREATE OR REPLACE TABLE
        dataset_eda.nm_254 AS (
    SELECT
        mta.sample_num,
        cs.*
    FROM
        pbl.chromatogram_spectra_long as cs
    JOIN
        pbl.sample_metadata as mta
    USING
        (id)
    WHERE
        wavelength = 254
    ORDER BY
        id, idx);
    SELECT * FROM dataset_eda.nm_254
    """
    ).pl().describe()
