---
cdt: 2024-09-04T15:56:31
title: "Migrating From DuckDB to NetCDF"
description: "A migration from the DuckDB database to a new NetCDF file. Includes meta and chromatospectral data."
---

In [None]:
import duckdb as db
import xarray as xr
from pathlib import Path
import pandas as pd
import polars as pl


To do:

- [x] create output dir
- [x] export cs data
- [ ] create xarray format from dataarray
- [ ] export metadata
- [ ] join cs data to metadata

In [None]:
db_path = "/Users/jonathan/mres_thesis/wine_analysis_hplc_uv/wines.db"
netcdf_path = "/Users/jonathan/mres_thesis/netcdf"
csv_outpath = "/Users/jonathan/mres_thesis/netcdf/csvs/cs.csv"


In [None]:
con = db.connect(db_path)

metadata_colnames = con.sql("select table_schema, table_name, column_name from information_schema.columns where table_name ='sample_metadata' AND table_schema='pbl'").df()['column_name'].to_list()

query = \
f"""
COPY
    (
    SELECT
            *
    FROM
        chromatogram_spectra
    JOIN
        (
            select
                *
            from
                pbl.sample_metadata
        )
    USING
         (id)
    ORDER BY
        id, mins
    )
TO
    '{csv_outpath}'
(FORMAT CSV);
"""
if not Path(csv_outpath).exists():
    con.sql(query)


In [None]:
try:
    cs
except NameError:
    cs = pd.DataFrame()

if cs.empty:
    cs = pl.read_csv(csv_outpath, schema_overrides={'samplecode':str}).to_pandas()


In [None]:
con.sql("show")


In [None]:
size_query = \
"""
select
    count(*)
from
    chromatogram_spectra
join
    pbl.sample_metadata
using
    (id)
where
    id = '0aeed887-d8e9-4886-baac-f519c4f44715'
limit 10
"""

con.sql(size_query).show()


## Chunkwise DataFrame to Dataset

In [None]:
# dims are the names of the dimensions, the axes
# coordinates are the tick values of the dimensions
# vars are the values that exist within the dims, labelled by the coordinates.

import numpy as np

cs.columns = [col.replace("nm_","") for col in cs.columns]

cs[metadata_colnames].drop_duplicates()

cs_idxed = cs.set_index(['id','mins'])

grpby_id = cs.groupby('id')

ds_list = []

curr_id = None

group_sizes = []
mean_grp_size = None

for i, (k, v) in enumerate(grpby_id):

    # check each group size against the running mean of the groups seen so far;
    # if it's larger than that mean, raise an alarm
    size = v.shape[0]
    if i > 0 and size > mean_grp_size:
        raise RuntimeError(f"outlier size detected: {k}. {mean_grp_size=}, {size=}")
    group_sizes.append(size)
    mean_grp_size = np.mean(group_sizes)
        
    metadata_dict = v[metadata_colnames].drop_duplicates().to_dict(orient='list')
    id_vals = v['id'].values
    min_vals = v["mins"].values
    min_vals = np.round(min_vals - min_vals[0], 6)
    wavelength_vals = v.drop(["id","mins"]+metadata_colnames,axis=1).columns.astype(int)

    data = v.drop(["id","mins"]+metadata_colnames, axis=1).values
    
    ds_list.append(xr.Dataset(
        data_vars = {
            "abs":(('mins','nm'),data), 
        },
        coords = {
            'mins': min_vals,
            'nm': wavelength_vals,
            'id': k,
            **metadata_dict
        }
    ))

display(ds_list)


Add the remaining metadata as coords. Do this by adding the keys to the dim tuple and unpacking them into the coords dict.
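
One possible reading of this, sketched on toy data: pass each metadata field as a `(dim, values)` tuple so that it is tied to the `id` dimension rather than becoming a dimension of its own. The `wine` and `vintage` fields and all values below are hypothetical, and the sketch assumes one metadata value per sample.

```python
import numpy as np
import xarray as xr

# Hypothetical per-sample metadata: one value per field.
metadata = {"wine": "shiraz", "vintage": 2020}

mins = np.array([0.0, 0.5, 1.0])
nm = np.array([190, 192])
data = np.zeros((1, mins.size, nm.size))  # shape: (id, mins, nm)

toy = xr.Dataset(
    data_vars={"abs": (("id", "mins", "nm"), data)},
    coords={
        "id": ["sample-a"],
        "mins": mins,
        "nm": nm,
        # Each metadata key becomes a non-index coordinate along 'id'.
        **{key: ("id", [val]) for key, val in metadata.items()},
    },
)

# The metadata travels with the sample and can drive selection.
reds = toy.where(toy["wine"] == "shiraz", drop=True)
```

Because the metadata coordinates only span `id`, they do not add axes to `abs`, which keeps its `(id, mins, nm)` shape.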

In [None]:
def chunks(lst, n_chunks):
    """Yield n_chunks consecutive, non-overlapping slices of lst.

    If len(lst) is not divisible by n_chunks, the first
    len(lst) % n_chunks slices get one extra element.
    """
    length = len(lst)
    if not 0 < n_chunks <= length:
        raise ValueError(f"cannot split {length} items into {n_chunks} chunks")

    base, extra = divmod(length, n_chunks)

    i_0 = 0
    for idx in range(n_chunks):
        # the first `extra` chunks absorb the remainder, one element each
        i_n = i_0 + base + (1 if idx < extra else 0)
        print(f"iteration: {idx}\ti_0: {i_0}\ti_n: {i_n}")
        yield lst[i_0:i_n]
        i_0 = i_n

chunked_ds = [xr.concat(chunk, dim='id') for chunk in chunks(ds_list,35)]
chunked_ds


In [None]:
ds_2 = [xr.concat(chunk, dim='id') for chunk in chunks(chunked_ds,3)]
ds_2


Alright, going to have to trim each mode down to the mean (wavelength and mins primarily) if I want to fit it all into one Dataset.

The quicker thing right now would be to prepare the raw reds as a dataset and continue the decompositions.
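
A rough sketch of what that trimming could look like, on toy data. It cuts every per-sample Dataset to the shortest `mins` length (rather than the mean, so that no sample needs padding) and then overwrites the time axis with a shared one, so that `xr.concat` has nothing to fill with NaN. This assumes all samples share a sampling interval and start at zero, which the real data may not; `make_sample` is a hypothetical stand-in for one entry of `ds_list`.

```python
import numpy as np
import xarray as xr


def make_sample(sample_id, n_mins):
    """Hypothetical stand-in for one entry of ds_list."""
    mins = np.round(np.arange(n_mins) * 0.5, 6)
    nm = np.array([190, 192])
    return xr.Dataset(
        data_vars={"abs": (("mins", "nm"), np.ones((n_mins, nm.size)))},
        coords={"mins": mins, "nm": nm, "id": sample_id},
    )


samples = [make_sample("a", 5), make_sample("b", 4), make_sample("c", 6)]

# Shortest time axis across all samples.
n_common = min(s.sizes["mins"] for s in samples)

# Positional trim, so it works even if the tick values differ slightly.
trimmed = [s.isel(mins=slice(0, n_common)) for s in samples]

# Force an identical time axis (assumption: equal sampling interval).
shared_mins = trimmed[0]["mins"].values
trimmed = [s.assign_coords(mins=shared_mins) for s in trimmed]

combined = xr.concat(trimmed, dim="id")
```

The same positional trim could be applied to `nm` if the wavelength ranges differ between samples.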

In [None]:
chunked_ds[19]


In [None]:
ds_2[1]


In [None]:
ds = xr.concat(ds_2, dim='id')
ds


In [None]:
ds.isel(id=0)[['mins','abs']].to_dataarray()


In [None]:
# from <https://earth-env-data-science.github.io/lectures/xarray/xarray.html#datasets>, coords are constant values such as nm, mins.
# they designate the space.
# variables change. That would be for example absorbance.

# break ds_list into 5 and concatenate each then concatenate the remainder


OK, so it's clear that xarray wants all the dimensions to be the same across the samples, bar the joining dimension ("id" in this case). We're going to need to perform some EDA on the SQL-stored data to identify ways of bringing all the data onto the same dimensions, and whether there should be more than one dataset. The main pain point will be data observed at different frequencies. See [Dataset EDA](dataset_description_wavelength_time.ipynb) for more.
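
As a possible starting point for that EDA, here is a sketch of a per-sample summary in pandas: row count, run length, and median sampling interval for each `id`. Samples whose summaries differ cannot share a `mins` axis without trimming or resampling. The toy frame stands in for `cs`, its values are made up, and the helper names `run_length` and `median_step` are hypothetical.

```python
import pandas as pd

# Toy long-format frame standing in for cs (values are made up).
toy_cs = pd.DataFrame(
    {
        "id": ["a"] * 4 + ["b"] * 3,
        "mins": [0.0, 0.5, 1.0, 1.5, 0.0, 1.0, 2.0],
    }
)


def run_length(mins):
    """Total observed time span of one sample."""
    return mins.max() - mins.min()


def median_step(mins):
    """Typical interval between consecutive observations."""
    return mins.diff().median()


grid_summary = (
    toy_cs.sort_values(["id", "mins"])
    .groupby("id")["mins"]
    .agg(n_rows="size", run_length=run_length, median_step=median_step)
)

# Each distinct (n_rows, median_step) pair is a candidate for its own Dataset.
n_grids = len(grid_summary[["n_rows", "median_step"]].drop_duplicates())
```

Grouping samples by these summary values would show how many distinct time grids exist, and therefore how many separate datasets might be needed.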