---
cdt: 2024-09-09T16:17:32
title: Separation of CUPRAC and Raw Signal Tables
description: The CUPRAC and raw data share one table, which is too large to query efficiently. They need to be separated
project: database_architecture
---

As we have found throughout the project, querying the combined `chromatogram_spectra_long` table is slow. CUPRAC and raw are fundamentally different datasets, so keeping them in one table has no benefit that outweighs the cost. They will therefore be separated into two tables.

There will be associated tables downstream (peak tables, possibly baseline-corrected signals, etc.), so we should make a schema for each dataset: `raw` and `cuprac`.


In [None]:
%reload_ext autoreload
%autoreload 2

import duckdb as db
import polars as pl
from pca_analysis.experiments.constants import db_path

con = db.connect(db_path)


In [None]:
con.sql(
"""--sql
CREATE SCHEMA IF NOT EXISTS raw;
CREATE SCHEMA IF NOT EXISTS cuprac;
"""
)


Create the CUPRAC table first.

In [None]:
con.sql(
"""--sql
CREATE OR REPLACE TABLE cuprac.cs_long AS (
    WITH 
        labeled_data AS (
            SELECT
                *
            FROM
                pbl.chromatogram_spectra_long
            JOIN
                (SELECT id, sample_num, detection FROM pbl.sample_metadata) AS mta
            USING
                (id)
        ),
        cuprac_data AS (
            SELECT
                *
            FROM
                labeled_data
            WHERE
                detection = 'cuprac'
        )
    SELECT
        *
    FROM
        cuprac_data
);
SELECT
    *
FROM
    cuprac.cs_long
ORDER BY
    sample_num, wavelength, idx
"""
).pl()


Now do the same for raw.

In [None]:
con.sql(
"""--sql
CREATE OR REPLACE TABLE raw.cs_long AS (
    WITH 
        labeled_data AS (
            SELECT
                *
            FROM
                pbl.chromatogram_spectra_long
            JOIN
                (SELECT id, sample_num, detection FROM pbl.sample_metadata) AS mta
            USING
                (id)
        ),
        raw_data AS (
            SELECT
                *
            FROM
                labeled_data
            WHERE
                detection = 'raw'
        )
    SELECT
        *
    FROM
        raw_data
);
SELECT
    *
FROM
    raw.cs_long
ORDER BY
    sample_num, wavelength, idx
"""
).pl()
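
The two cells above differ only in the target schema and the `detection` value, so the statement could be generated rather than copied. The following is a minimal sketch, not the project's actual code: `build_split_sql` is a hypothetical helper, it flattens the CTEs into a single filtered join, and it assumes the schema name always equals the `detection` label.

```python
def build_split_sql(
    detection: str,
    source: str = "pbl.chromatogram_spectra_long",
    metadata: str = "pbl.sample_metadata",
) -> str:
    """Return SQL that materialises `<detection>.cs_long` from the combined table.

    Hypothetical helper: assumes the schema is named after the detection label.
    """
    # The label is interpolated into SQL as an identifier, so reject anything
    # that is not a plain name (guards against typos and injection).
    if not detection.isidentifier():
        raise ValueError(f"invalid detection label: {detection!r}")

    return f"""
CREATE SCHEMA IF NOT EXISTS {detection};
CREATE OR REPLACE TABLE {detection}.cs_long AS (
    SELECT
        *
    FROM
        {source}
    JOIN
        (SELECT id, sample_num, detection FROM {metadata}) AS mta
    USING
        (id)
    WHERE
        mta.detection = '{detection}'
);
"""
```

With the notebook's connection, each table could then be built with `con.sql(build_split_sql("cuprac"))` and `con.sql(build_split_sql("raw"))`.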


Done.