---
cdt: 2024-09-19T21:55:13
title: Load CS Image File Statistics
description: creation of a CS stats table - dimensions, hertz, maxima, etc.
project: database_etl
conclusion:
status: open
---

In [None]:
import duckdb as db

%xmode minimal
from pathlib import Path
import polars as pl
from database_etl.definitions import DB_PATH, RAW_DATA_LIB

pl.Config.set_tbl_rows(99).set_fmt_str_lengths(999)
con = db.connect(DB_PATH)

overwrite: bool = True

We want to iterate through all the .D dirs and measure statistics from all the files. First we need to locate the data parquet file of each sample.


In [None]:
image_files = list(RAW_DATA_LIB.glob("*.D/extract_*/data.parquet"))
len(image_files)

Looking good. Now iterate over them. Do we use duckdb or polars for EDA? Let's test it: iterate over the files and calculate the sum of each column.

# Test EDA

## Polars

In [None]:
pl_cumsums = {}

for f in image_files:
    key = f.parts[-3]
    pl_cumsums[key] = pl.read_parquet(f).select(pl.exclude("id").sum())
list(pl_cumsums.values())[0:3]

## Duckdb

In [None]:
db_cumsums = {}

for f in image_files:
    key = f.parts[-3]
    # con.sql builds a lazy relation; the query only runs once the result is materialized
    result = con.sql(f"select sum(columns(* exclude id)) from read_parquet('{str(f)}')")
    db_cumsums[key] = result
len(db_cumsums)

## Result

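For that operation, duckdb wins at 0.1 second. Note that `con.sql` returns a lazy relation, so that timing may not include the actual scan of the files. Either way, duckdb really isn't designed for that style of data and is a pain to work with. Stick with polars to calculate the aggregates, then load them into duckdb.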
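
The comparison above was timed by hand. A minimal sketch of a repeatable timing helper follows, assuming each loop is wrapped in a zero-argument function; `time_call` is a hypothetical name, not something from this project, and the workload shown is a stand-in. When timing the duckdb loop, the relation would probably need to be materialized inside the function (for example with `.pl()`) for the numbers to be comparable.

```python
import time
from typing import Callable


def time_call(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` calls of `fn`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload; in the notebook this would be the polars or duckdb loop.
elapsed = time_call(lambda: sum(range(100_000)))
print(f"best of 3: {elapsed:.4f} s")
```

Taking the best of a few runs reduces the effect of file-system caching on the first read.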

# Calculate Aggregates

We want to obtain the time ranges, hertz, wavelength ranges of all samples.
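
The hertz value is the least obvious of these, so here is a small pure-Python sketch of the same arithmetic, assuming the `time` column is in minutes: each step is converted to seconds, inverted, and the reciprocals are averaged. `sampling_rate_hz` and the sample time axis are made up for illustration.

```python
from statistics import mean


def sampling_rate_hz(times_min: list[float]) -> float:
    """Mean of 1 / (step in seconds), rounded to 2 decimals."""
    # Consecutive differences in minutes, converted to seconds.
    steps_s = [(b - a) * 60 for a, b in zip(times_min, times_min[1:])]
    # Average the per-step rates, as diff().mul(60).pow(-1).mean() does.
    return round(mean(1 / s for s in steps_s), 2)


# Hypothetical time axis: one point every 0.4 s, expressed in minutes.
times_min = [i * 0.4 / 60 for i in range(6)]
rate = sampling_rate_hz(times_min)
print(rate)
```

With a 0.4 s step this gives 2.5 Hz. Averaging the reciprocals rather than inverting the mean step gives slightly more weight to unusually short steps, which only matters if the sampling is irregular.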

Now compute them for each file and load the result into the database.

In [None]:
def image_stats_from_files(image_files: list[Path]) -> pl.DataFrame:
    image_stats_ = []

    for f in list(image_files):
        df = pl.read_parquet(f)

        # nm: the first two columns are assumed to be "id" and "time"; the rest are wavelengths
        nm_min = df.columns[2]
        nm_max = df.columns[-1]
        nm_count = len(df.columns[2:])

        # time
        mins_min = df.select("time").min().item()
        mins_max = df.select("time").max().item()
        mins_count = df.select("time").count().item()

        # abs @ 256
        abs_min = df.select("256").min().item()
        abs_max = df.select("256").max().item()
        abs_argmin = df.select("256").to_series().arg_min()
        abs_argmax = df.select("256").to_series().arg_max()

        # hertz: time is in minutes, so diff * 60 is the step in seconds and its reciprocal is Hz
        hertz = df.select(pl.col("time").diff().mul(60).pow(-1).mean().round(2)).item()

        image_stats_.append(
            pl.DataFrame(
                {
                    "id": df.select("id")[0].item(),
                    "nm_min": nm_min,
                    "nm_max": nm_max,
                    "nm_count": nm_count,
                    "mins_min": mins_min,
                    "mins_max": mins_max,
                    "mins_count": mins_count,
                    "abs_min": abs_min,
                    "abs_max": abs_max,
                    "abs_argmin": abs_argmin,
                    "abs_argmax": abs_argmax,
                    "hertz": hertz,
                    "path": str(f),
                }
            )
        )
    image_stats_df = pl.concat(image_stats_)

    return image_stats_df


def load_image_stats_to_db(con: db.DuckDBPyConnection, image_stats_df: pl.DataFrame):
    image_stats_df = image_stats_df  # referenced by name in the SQL below; the self-assignment keeps the lsp from flagging it as unused
    con.sql(
        """--sql
    create or replace table image_stats (
        pk integer primary key,
        nm_min integer not null,
        nm_max integer not null,
        nm_count integer not null,
        mins_min float not null,
        mins_max float not null,
        mins_count integer not null,
        abs_min float not null,
        abs_max float not null,
        abs_argmin integer not null,
        abs_argmax integer not null,
        hertz float not null,
        path varchar not null unique,
    );
    insert into image_stats
        select
            chm.pk,
            img.nm_min,
            img.nm_max,
            img.nm_count,
            img.mins_min,
            img.mins_max,
            img.mins_count,
            img.abs_min,
            img.abs_max,
            img.abs_argmin,
            img.abs_argmax,
            img.hertz,
            img.path
        from
            image_stats_df img
        join
            chm
        using
            (id)
        order by
            chm.pk;
    """
    )


def load_image_stats(image_files: list[Path], con: db.DuckDBPyConnection) -> None:
    image_stats_df = image_stats_from_files(image_files=image_files)
    load_image_stats_to_db(con=con, image_stats_df=image_stats_df)


if overwrite:
    load_image_stats(image_files=image_files, con=con)
    con.sql(
        """--sql
    select * from image_stats limit 5
    """
    ).pl().pipe(display)

Looks great. Now find the outliers for each column.

# Finding Outliers

In [None]:
con.sql(
    """--sql
select
    count( distinct COLUMNS(* exclude pk))
from
    image_stats
"""
).pl()

Ok, so the wavelengths are all the same, but there are 6 distinct maximum times (`mins_max`) and 2 distinct hertz values.
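
Distinct counts say that a column varies, but not which samples differ. One possible way to pin them down, sketched in plain Python, is to treat the most common value as the norm and list the rows that deviate from it. `off_mode_rows` is a hypothetical helper and the `(pk, hertz)` pairs are invented for illustration; they are not results from this database.

```python
from collections import Counter


def off_mode_rows(rows: list[tuple[int, float]]) -> list[tuple[int, float]]:
    """Return the (pk, value) pairs whose value differs from the most common value."""
    # Most frequent value across all rows is taken as the expected one.
    mode_value, _ = Counter(value for _, value in rows).most_common(1)[0]
    return [(pk, value) for pk, value in rows if value != mode_value]


# Made-up (pk, hertz) pairs for illustration only.
rows = [(1, 2.5), (2, 2.5), (3, 1.25), (4, 2.5)]
outliers = off_mode_rows(rows)
print(outliers)
```

The same idea could be expressed in SQL by grouping on the column and joining back on the rows outside the largest group.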

In [None]:
con.sql(
    """--sql
select
    distinct(hertz)
from
    image_stats
"""
).pl()

Ok, as expected.

In [None]:
con.close()
del con