## Comparing distributed ways of processing HDF5 files

It would be extremely helpful to build a single dataframe instead of looping through every file each time. The obstacle is memory: the files are too large to hold in memory all at once. In this notebook, I explore ways of doing this in a distributed, lazy manner. The real question is whether ``dask`` or ``polars`` is the better fit.

#### Useful imports

In [None]:
import os
import sys
import warnings
sys.path.append("..")  # add project root


import h5py

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from tqdm import tqdm

from src.data_utils import *
from src.constants import *

In [None]:
import tables
import dask
import dask.array as da
import polars as pol

In [None]:
# Filter out NaturalNameWarning, which is raised when a name is not a valid Python identifier and so cannot be accessed with dot notation.
warnings.filterwarnings('ignore', category=tables.NaturalNameWarning)

In [None]:
pd.options.display.float_format = '{:10,.2f}'.format

In [None]:
np.random.seed(420)

In [None]:
sns.set_theme(context="talk")

In [None]:
# path constants
data_dir = "/home/mr2238/project_pi_np442/mr2238/accelerate/data"
img_dir = "/home/mr2238/project_pi_np442/mr2238/accelerate/imgs/overview"
labels_path = os.path.join(data_dir, "labels")
raw_data_path = os.path.join(data_dir, "raw_data")

In [None]:
global_data_path = "/home/mr2238/project_pi_np442/mr2238/accelerate/data/processed/all_data.hdf5"

In [None]:
# list files
h5py_files = [f for f in os.listdir(raw_data_path) if f.endswith(".icmh5")]
print(f"Number of h5py files: {len(h5py_files)}")
print(f"Example file: {h5py_files[0]}")

In [None]:
example_file = os.path.join(raw_data_path, h5py_files[0])
print(example_file)

#### Opening HDF files through ``tables``

I tried this, and ``tables`` was very awkward to use with .icmh5 files (it caused kernel crashes). However, I realized I do not need it for ``dask``, which can wrap ``h5py`` datasets directly.

In [None]:
# will open with h5py instead
global_f = h5py.File(global_data_path, mode="r")

#### Loading Large Dataframe through ``dask``

In [None]:
from dask.distributed import LocalCluster
cluster = LocalCluster()          # Fully-featured local Dask cluster
client = cluster.get_client()

In [None]:
client

In [None]:
global_f['1002']

In [None]:
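# see if I can build a single lazy array of all abp values
# chunks=-1 keeps each patient's dataset as one chunk (the intent of the oversized 1e12 chunk length)
abp = da.concatenate([da.from_array(global_f[f"{pt}/raw/waves/abp"], chunks=-1) for pt in global_f.keys()])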

In [None]:
abp

In [None]:
mean = da.mean(abp)
mean.compute()
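
The cells above rely on ``dask`` reducing the concatenated array chunk by chunk. Below is a minimal sketch of that pattern, using in-memory NumPy arrays as stand-ins for the per-patient HDF5 datasets. The sizes, values, and ``chunk_len`` are made up for illustration and are not taken from the real data.

```python
import numpy as np
import dask.array as da

# Hypothetical stand-ins for per-patient datasets of different lengths.
rng = np.random.default_rng(0)
signals = [rng.normal(80.0, 10.0, size=n) for n in (1_000, 2_500, 4_000)]

# Explicit integer chunk length: each chunk is one unit of lazy work.
chunk_len = 1_000
lazy = da.concatenate([da.from_array(s, chunks=(chunk_len,)) for s in signals])

# Nothing is reduced until compute() is called.
lazy_mean = float(lazy.mean().compute())

# Eager reference, only feasible here because the toy data fits in memory.
eager_mean = float(np.concatenate(signals).mean())
print(lazy.shape, lazy.numblocks, lazy_mean, eager_mean)
```

Smaller chunks lower peak memory per task but add scheduling overhead, so the chunk length is a trade-off rather than a fixed rule.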

In [None]:
all_vars = ['hr', 'rso2l', 'rso2r', 'abp', 'spo2', 'icp', 'deoxhg_r',
       'sthg_index_r', 'sthg_index_l', 'oxhg_l', 'deoxhg_l', 'oxhg_r',
       'scthg_l', 'scthg_r']

In [None]:
batch_n = 10  # number of arrays used to estimate the histogram range
for v in tqdm(all_vars):
    # hr, rso2l, rso2r and spo2 are stored under numerics; everything else under waves
    if v in ['hr', 'rso2l', 'rso2r', 'spo2']:
        key_string = f"numerics/{v}"
    else:
        key_string = f"waves/{v}"
    # skip patients that lack this variable or are flagged with broken numerics
    arr_list = [da.from_array(global_f[f"{pt}/raw/{key_string}"]) for pt in global_f.keys() if key_string in global_f[f"{pt}/raw"] and not global_f[f"{pt}/processed"].attrs["broken_numeric"]]
    v_arr = da.concatenate(arr_list)
    # estimate the range from the first batch_n arrays only; values outside it are not counted
    sample = da.concatenate(arr_list[:batch_n])
    mini, maxi = dask.compute(sample.min(), sample.max())
    hist, bins = da.histogram(v_arr, bins=50, range=[mini, maxi])
    hist = hist.compute()

This takes longer than the batched approach in ``data_engineering.ipynb``, so there is no reason to change the approach there. It may come in handy if I ever need to run other complex operations on the whole arrays, but the parallelized process appears good enough, and there is no time to optimize the ``dask`` approach.
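
For reference, a batched histogram only needs shared bin edges, because per-batch counts over the same edges add up. Here is a minimal sketch in plain NumPy, assuming a batched approach works roughly this way. The actual code in ``data_engineering.ipynb`` is not reproduced here, and the arrays below are made-up stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for per-patient arrays of one variable.
batches = [rng.normal(80.0, 10.0, size=n) for n in (500, 1_200, 800)]

# Pass 1: global range, touching one array at a time.
lo = min(float(b.min()) for b in batches)
hi = max(float(b.max()) for b in batches)

# Pass 2: accumulate counts over the same 50 bins for every batch.
edges = np.linspace(lo, hi, 51)
counts = np.zeros(50, dtype=np.int64)
for b in batches:
    counts += np.histogram(b, bins=edges)[0]
```

Only one batch has to be in memory at any time, which is the property that matters for the large files here.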

In [None]:
global_f.close()