---
title: Create XArray Dataset
description: "Create the raw xarray dataset (nc) file through `database_etl`"
project: parafac2
conclusion: "While running database_etl to produce the expected dataset was successful, we found that xarray datasets weren't queryable across variables as expected. Efforts to recreate the img data within duckdb were fruitless, as expected. We did find that judicious use of double quotes and string formatting let us construct proper tidy tables with integer wavelength column labels, but unidentified problems (probably memory) made ingesting the data this way impractical: it would have taken 32 minutes if inserted row by row. The recommendation is to construct an intermediary between SQL queries and either the xarray dataset or the stored parquet files, probably the latter, as it removes one complication."
status: closed
cdt: 2024-09-25T16:17:38
---

In [1]:
%reload_ext autoreload
%autoreload 2

from database_etl import etl_pipeline_raw, get_data
import polars as pl

from pca_analysis.definitions import (
    RAW_LIB_DIR,
    DIRTY_ST,
    CT_UN,
    CT_PW,
    DB_PATH_UV,
    NC_RAW,
)

import duckdb as db
import xarray as xr

con = db.connect(DB_PATH_UV)

overwrite = False

if overwrite:
    etl_pipeline_raw(
        data_dir=RAW_LIB_DIR,
        dirty_st_path=DIRTY_ST,
        ct_pw=CT_PW,
        ct_un=CT_UN,
        con=con,
        overwrite=False,
        run_extraction=False,
        excluded_samples=[
            {
                "samplecode": "2021-debortoli-cabernet-merlot_avantor",
                "reason": "aborted run",
            }
        ],
    )

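    # NOTE: `dset` is never assigned on this branch, so unless it survives from
    # an earlier kernel run this line raises a NameError when overwrite=True.
    dset.to_netcdf(NC_RAW)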
else:
    dset = xr.open_dataset(NC_RAW)

dset = dset.assign_coords({"wavelength": dset["wavelength"].astype(int)})
dset


And as a demonstration, the red shiraz.

In [2]:
shiraz = dset.sel(color="red", varietal="shiraz", wavelength=256)


In [16]:
shiraz = dset.sel(id=["e56c4dcd-2847-4d34-b457-743be10b0608"])
shiraz


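It appears that you can't subset an xarray Dataset this way: `.sel` only selects along indexed coordinates such as `id`, not by the values of variables such as `color` or `varietal`.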
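
A minimal sketch of one possible workaround, run on a toy dataset rather than the real `NC_RAW` file. It assumes `color` and `varietal` are stored as variables along the `id` dimension (an assumption about the dataset's layout; the ids and values below are made up). A boolean mask built from those variables picks out the matching ids, which `.sel` then accepts.

```python
import numpy as np
import xarray as xr

# Toy stand-in for the real dataset: 2 samples, 3 time points, 2 wavelengths.
# Variable names mirror the notebook; ids and values are invented.
toy = xr.Dataset(
    data_vars={
        "img": (("id", "mins", "wavelength"), np.arange(12.0).reshape(2, 3, 2)),
        "color": ("id", ["red", "white"]),
        "varietal": ("id", ["shiraz", "riesling"]),
    },
    coords={
        "id": ["sample-a", "sample-b"],
        "mins": [0.0, 0.5, 1.0],
        "wavelength": [254, 256],
    },
)

# Boolean mask over the `id` dimension, built from the metadata variables.
mask = (toy["color"] == "red") & (toy["varietal"] == "shiraz")

# Turn the mask into a list of ids, then select on the indexed coordinates.
matching_ids = toy["id"].values[mask.values]
red_shiraz = toy.sel(id=matching_ids, wavelength=256)
print(red_shiraz["img"].values)
```

Whether this scales to the full dataset is untested here; it only shows that the metadata can drive a selection by `id`.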

Considering that the data is already set up in the database, I think it would be better to go back to SQL first.

In [4]:
df = (
    dset.img.sel(id="e56c4dcd-2847-4d34-b457-743be10b0608").to_dataframe().reset_index()
)
id = df["id"][0]
df_ = df.drop("id", axis=1)
tidy_df = df_.set_index("mins").pivot(columns="wavelength", values="img")

db.sql(
    """--sql
    select "190" from tidy_df
    """
)


┌──────────────────────┐
│         190          │
│        double        │
├──────────────────────┤
│   0.5998983979225159 │
│   0.5727335810661316 │
│   0.5095899105072021 │
│  0.40775537490844727 │
│  0.27498602867126465 │
│  0.12671947479248047 │
│ -0.03466010093688965 │
│  -0.2740249037742615 │
│  -0.7547661662101746 │
│  -1.7169862985610962 │
│            ·         │
│            ·         │
│            ·         │
│   -6.062276661396027 │
│  -6.0672760009765625 │
│   -6.065897643566132 │
│   -6.056979298591614 │
│   -6.040334701538086 │
│   -6.015762686729431 │
│   -5.982726812362671 │
│   -5.942240357398987 │
│   -5.896382033824921 │
│   -5.847550928592682 │
├──────────────────────┤
│ 7800 rows (20 shown) │
└──────────────────────┘

In [5]:
from database_etl.etl.sql.to_xr.sql_to_xr import get_imgs_as_dict

result = get_imgs_as_dict(con=con, m=7800)
list(result.values())[0]


mins \ wavelength,190,192,194,196,198,200,202,204,206,208,...,382,384,386,388,390,392,394,396,398,400
0.000000,0.112936,0.038534,-0.016585,-0.048637,-0.070743,-0.099465,-0.129871,-0.150792,-0.165880,-0.160955,...,-0.019178,-0.017583,-0.012487,-0.010602,-0.009045,-0.010647,-0.013314,-0.010669,-0.014238,-0.021860
0.006667,0.134930,0.053741,-0.008173,-0.045381,-0.069350,-0.098266,-0.129119,-0.150137,-0.165217,-0.160791,...,-0.019558,-0.018388,-0.013761,-0.011183,-0.008777,-0.011638,-0.014618,-0.010245,-0.014208,-0.022694
0.013333,0.126094,0.048272,-0.011943,-0.048190,-0.071011,-0.098616,-0.129074,-0.149839,-0.164889,-0.160657,...,-0.019692,-0.018485,-0.014514,-0.011533,-0.007629,-0.011154,-0.014625,-0.009246,-0.013657,-0.022396
0.020000,0.077926,0.016332,-0.031210,-0.058755,-0.076205,-0.100598,-0.129819,-0.150122,-0.164911,-0.160605,...,-0.019774,-0.017934,-0.014529,-0.011265,-0.006087,-0.009641,-0.013724,-0.008561,-0.012629,-0.020660
0.026667,-0.004746,-0.038467,-0.063561,-0.075921,-0.084385,-0.103958,-0.131361,-0.151120,-0.165358,-0.160702,...,-0.020012,-0.017218,-0.014298,-0.010595,-0.004783,-0.008330,-0.012800,-0.008143,-0.010863,-0.017732
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51.966669,-3.871486,-2.874628,-1.985751,-1.287848,-0.844210,-0.649311,-0.640854,-0.722945,-0.811286,-0.826523,...,-0.144877,-0.117011,-0.086382,-0.078127,-0.066817,-0.061929,-0.077136,-0.072785,-0.080913,-0.122674
51.973336,-3.865503,-2.870120,-1.983717,-1.286566,-0.844121,-0.650182,-0.641249,-0.722989,-0.811771,-0.826806,...,-0.146128,-0.117682,-0.086360,-0.077680,-0.066534,-0.062086,-0.075966,-0.071473,-0.080764,-0.123009
51.980003,-3.857814,-2.864927,-1.981355,-1.285434,-0.844076,-0.651039,-0.641450,-0.722408,-0.811890,-0.827424,...,-0.146292,-0.117615,-0.086144,-0.077218,-0.065856,-0.061788,-0.074580,-0.069760,-0.080794,-0.124313
51.986669,-3.849551,-2.859369,-1.978360,-1.284130,-0.843771,-0.651531,-0.641607,-0.721850,-0.812255,-0.828549,...,-0.145331,-0.117339,-0.086077,-0.076637,-0.064492,-0.061139,-0.073917,-0.068776,-0.081003,-0.125490


In [6]:
wavelengths = list(result.values())[0].columns


In [8]:

# tidy_imgs = (
#     pl.from_pandas(img.assign(**{"id": id}).reset_index()) for id, img in result.items()
# )


UsageError: Line magic function `%skip` not found.


We can insert, but then we can't have primary keys.

In [36]:

# del dset
# del df
# del df_
# del result


In [None]:
# def create_t2(con, wavelengths):
#     wavelength_col_decs = ",".join(
#         [f'"{x}" float' for x in wavelengths if x not in ["mins", "id"]]
#     )

#     con.sql(
#         f"""--sql
#     drop table if exists t2;
#     create or replace table t2 (
#         id varchar references chm(id),
#         mins float,
#         {wavelength_col_decs},
#         primary key (id, mins)
#         );
#     """
#     )


# def insert_img(con, img, wavelengths):
#     for x in img.partition_by("mins"):
#         # display(x)
#         con.sql(
#             f"""--sql
#         insert into t2
#             select
#                 id,
#                 mins,
#                 {",".join([f'"{x}"' for x in wavelengths])}
#             from
#                 x
#         """
#         )


# def insert_imgs(con, imgs, wavelengths):
#     for idx, img in enumerate(imgs):
#         print(idx, img["id"][0])
#         insert_img(con=con, img=img, wavelengths=wavelengths)


# def create_tidy_img_tbl(con, imgs, wavelengths) -> None:
#     create_t2(con=con, wavelengths=wavelengths)  # was a recursive call to itself
#     insert_imgs(con, imgs, wavelengths=wavelengths)


In [None]:
# wavelengths = [x for x in wavelengths if x not in ["id", "mins"]]
# shortened_wavelengths = wavelengths[:5]
# print(shortened_wavelengths)
# create_t2(con=con, wavelengths=shortened_wavelengths)
# insert_imgs(con, tidy_imgs, wavelengths=shortened_wavelengths)

# con.sql(
#     """--sql
# select
#     *
# from
#     t2
# limit 10
# """
# ).pl()


OK, as promising as this is, it's obviously making something upset. It's just not worth the effort to make this work.


# Conclusion

While it did look possible to create a tidy image table, inserting the data proved impossible in practice, with the kernel dying more often than not. Evidently the mechanism wasn't intended for this volume of insertion.

We could try reading the parquet files directly, or inserting one row at a time.

OK look, we could play around with it like this all week, but it's evident that duckdb doesn't like my data. At the end of the day, we're not looking to query the raw data anyway, and furthermore 99% of the wavelengths don't contain any useful information.

Create a rough module to translate the results of a query into file paths, then return the selected data as a list or generator. Load it into a tensor and fire away. Anything else is a WASTE OF TIME. Alternatively, query the dset directly, selecting by ID.
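
A minimal sketch of what such a module might look like. It assumes one file per sample named `<id>.parquet` under a single root directory, which is a guess at the layout, not the actual one. The function names `ids_to_paths` and `iter_samples` are hypothetical, and the loader is passed in so that something like `pl.read_parquet` can be plugged in.

```python
from pathlib import Path
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def ids_to_paths(ids: Iterable[str], root: Path, suffix: str = ".parquet") -> List[Path]:
    """Map sample ids (e.g. the id column of a SQL result) to file paths."""
    return [Path(root) / f"{sample_id}{suffix}" for sample_id in ids]


def iter_samples(
    ids: Iterable[str], root: Path, load: Callable[[Path], T]
) -> Iterator[Tuple[str, T]]:
    """Lazily yield (id, data) pairs, loading one file at a time."""
    ids = list(ids)
    for sample_id, path in zip(ids, ids_to_paths(ids, root)):
        if not path.exists():
            raise FileNotFoundError(f"no file for id {sample_id!r}: {path}")
        yield sample_id, load(path)
```

The ids would come from a metadata query against the database, and the yielded arrays could then be stacked into a tensor. Keeping it a generator means only one image is in memory at a time until the caller decides otherwise.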
