# Testing different storage methods 
---
Check whether there is a better way to read and write the data, possibly in a different format.
The goals are to
1) speed up read/write time
2) find a way to store metadata in the file without having to save it as columns
3) save disk space (compression?)


Try/investigate:
- pytables
- sqlite
- parquet
- apache arrow
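
As a starting point for the sqlite option, here is a minimal sketch of goal 2 (metadata stored in the file rather than as columns), using only Python's built-in `sqlite3` and `json` modules. The table names `data` and `metadata`, the column names `x` and `y`, and both helper names are assumptions made for illustration. Values are JSON-encoded so that strings and numbers survive the round trip.

```python
import json
import sqlite3


def write_with_metadata_sqlite(conn, rows, metadata):
    """Store numeric rows in one table and key/value metadata in another."""
    conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, x REAL, y REAL)")
    conn.executemany("INSERT INTO data (id, x, y) VALUES (?, ?, ?)", rows)
    conn.execute("CREATE TABLE metadata (key TEXT PRIMARY KEY, value TEXT)")
    conn.executemany(
        "INSERT INTO metadata (key, value) VALUES (?, ?)",
        [(key, json.dumps(value)) for key, value in metadata.items()],
    )
    conn.commit()


def read_metadata_sqlite(conn):
    """Read the metadata table back into a dictionary without touching the data."""
    rows = conn.execute("SELECT key, value FROM metadata").fetchall()
    return {key: json.loads(value) for key, value in rows}


# In-memory database so that the sketch leaves no file behind
conn = sqlite3.connect(":memory:")
rows = [(0, 0.1, 1.5), (1, 0.2, 2.5), (2, 0.3, 3.5)]
metadata = {"Setting 1": "foo", "Setting 3": 100, "Setting 15": 3.4235423}
write_with_metadata_sqlite(conn, rows, metadata)

metadata_out = read_metadata_sqlite(conn)
n_rows = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
print(metadata_out, n_rows)
```

Whether this is competitive with HDF5 for a wide numerical table has not been tested here; the sketch only shows that the metadata can live next to the data in one file.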

In [1]:
import os 
import numpy as np
import pandas as pd 



In [2]:
# Test: saving data plus metadata to the same .hdf file 
# Some dummy data
N = 1_000_000
data = np.random.normal(size=(N, 100))
df_data = pd.DataFrame(data=data, index=list(range(data.shape[0])))

# Metadata 
metadata = {
    "Setting 1": "foo",
    "Setting 2": "bar",
    "Setting 3": 100,
    "Setting 4": "foo",
    "Setting 5": "bar",
    "Setting 6": 100,
    "Setting 7": "positive",
    "Setting 8": "m",
    "Setting 9": -99999,
    "Setting 10": "foo",
    "Setting 11": "bar",
    "Setting 12": 100,
    "Setting 13": "foo",
    "Setting 14": "bar",
    "Setting 15": 3.4235423,
}
df_metadata = pd.DataFrame(metadata.items(), columns=["Setting", "Value"])

df_all = df_data.copy()
for key in metadata.keys():
    df_all[key] = metadata[key]

## Writing to file

In [3]:
%%timeit
fname_old = "hdf_test_file_old.hd5"
if os.path.exists(fname_old):
    os.remove(fname_old)  # portable, and silent if the file does not exist yet

# Method 1: merge DataFrames, save whole thing to single file 
df_all.to_hdf(fname_old, mode="w", key="data")

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->axis0] [items->None]

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->block0_items] [items->None]



2.13 s ± 203 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
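
The warnings above mention `inferred_type->mixed-integer` for the column labels, and the metadata values themselves are strings. The sketch below, on a small stand-in DataFrame (`df_small` and the two settings are illustrative, not the data used above), suggests where both come from: adding metadata as columns mixes integer and string column labels, and turns part of the table into non-numeric columns.

```python
import numpy as np
import pandas as pd

# Small stand-in for df_data: integer column labels, float values
df_small = pd.DataFrame(np.random.normal(size=(5, 3)))
print(df_small.columns.inferred_type)

# Add metadata as columns, as is done for df_all
df_small_all = df_small.copy()
for key, value in {"Setting 1": "foo", "Setting 3": 100}.items():
    df_small_all[key] = value

# The column labels are now a mix of integers and strings
labels_type = df_small_all.columns.inferred_type
print(labels_type)

# Columns holding strings can no longer be stored as plain numeric blocks
non_numeric_cols = [
    c for c in df_small_all.columns
    if not pd.api.types.is_numeric_dtype(df_small_all[c])
]
print(non_numeric_cols)
```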


In [4]:
%%timeit
fname_new = "hdf_test_file_new.hd5"
if os.path.exists(fname_new):
    os.remove(fname_new)  # portable, and silent if the file does not exist yet

# Save to file 
df_data.to_hdf(fname_new, mode="w", key="data")
df_metadata.to_hdf(fname_new, mode="r+", key="metadata")

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->block0_values] [items->Index(['Setting', 'Value'], dtype='object')]



192 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [5]:
%%timeit
fname_new_attrs = "hdf_test_file_new_attrs.hd5"
if os.path.exists(fname_new_attrs):
    os.remove(fname_new_attrs)  # portable, and silent if the file does not exist yet

# Save to file 
store = pd.HDFStore(fname_new_attrs)
store.put("data", df_data)
store.get_storer("data").attrs.metadata = metadata
store.close()

210 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Reading from file 

In [6]:
%%timeit
fname_old = "hdf_test_file_old.hd5"
# Load the object-type DataFrame containing the columns 
df_all_out = pd.read_hdf(fname_old, key="data")

1.99 s ± 322 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
%%timeit
fname_new = "hdf_test_file_new.hd5"
# Load the numerical-type DataFrame containing the columns and the metadata DataFrame separately 
df_out = pd.read_hdf(fname_new, key="data")
df_metadata_out = pd.read_hdf(fname_new, key="metadata")
df_out.metadata = df_metadata_out.to_dict()



1.06 s ± 7.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)




In [8]:
%%timeit
fname_new_attrs = "hdf_test_file_new_attrs.hd5"
with pd.HDFStore(fname_new_attrs) as store:
    df_data_out_attrs = store["data"]
    metadata_out_attrs = store.get_storer("data").attrs.metadata

1.07 s ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Findings 
---
**Writing to disk**:
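- Both new methods (a separate metadata DataFrame, or metadata stored in `.attrs`) are roughly 10 times faster (~0.2 s vs. ~2.1 s). Saving an object-type DataFrame is very slow.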

**Reading from disk**:
- The new methods are also faster, by a factor of ~2 (~1.1 s vs. ~2.0 s), even when adding the metadata dictionary back to the DataFrame as an attribute, i.e.  
```
df_all.metadata = metadata
```
**Disk space**:
- new method: 772 MB
- old method: 845 MB

**Takeaway**: 
- Try using the `.attrs` of the storer returned by `HDFStore.get_storer()`. With this, we can store metadata directly alongside the data on disk, which is much better than assuming the metadata (e.g. settings) at runtime and then manually adding it back in. We can still add the metadata as columns at runtime, as we have been doing up until now.
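
To put numbers on "much faster", the ratios below are computed directly from the `%%timeit` means and file sizes reported above. The variable names are only for illustration.

```python
# Mean wall-clock times (seconds) copied from the %%timeit outputs above
write_s = {"old": 2.13, "new": 0.192, "new_attrs": 0.210}
read_s = {"old": 1.99, "new": 1.06, "new_attrs": 1.07}
# File sizes (MB) from the "Disk space" finding
size_mb = {"old": 845, "new": 772}

write_speedup = write_s["old"] / write_s["new"]
read_speedup = read_s["old"] / read_s["new"]
disk_saving_pct = 100 * (size_mb["old"] - size_mb["new"]) / size_mb["old"]

print(f"write speed-up: {write_speedup:.1f}x")
print(f"read speed-up:  {read_speedup:.1f}x")
print(f"disk saving:    {disk_saving_pct:.1f}%")
```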

In [9]:
%%timeit
# Baseline for the next cell: read the data and the metadata together
fname_new_attrs = "hdf_test_file_new_attrs.hd5"
with pd.HDFStore(fname_new_attrs) as store:
    df_data_out_attrs = store["data"]
    metadata_out_attrs = store.get_storer("data").attrs.metadata

1.08 s ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [10]:
%%timeit
# Just reading the metadata
fname_new_attrs = "hdf_test_file_new_attrs.hd5"
with pd.HDFStore(fname_new_attrs) as store:
    metadata_out_attrs = store.get_storer("data").attrs.metadata

5.29 ms ± 446 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
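
Since reading only the metadata takes about 5 ms against about 1 s for the full table, it may be worth wrapping the pattern in a pair of small helpers. The following is a sketch, not the spaxelsleuth implementation: the names `write_with_metadata` and `read_metadata` and the default `key="data"` are assumptions, and it requires PyTables, like the cells above.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_with_metadata(fname, df, metadata, key="data"):
    """Write df to an HDF5 file and attach metadata to the same node."""
    with pd.HDFStore(fname, mode="w") as store:
        store.put(key, df)
        store.get_storer(key).attrs.metadata = metadata


def read_metadata(fname, key="data"):
    """Return only the metadata dictionary, without loading the DataFrame."""
    with pd.HDFStore(fname, mode="r") as store:
        return store.get_storer(key).attrs.metadata


# Round trip on a small dummy DataFrame in a temporary directory
metadata = {"Setting 1": "foo", "Setting 3": 100, "Setting 15": 3.4235423}
df_demo = pd.DataFrame(np.random.normal(size=(10, 3)))
with tempfile.TemporaryDirectory() as tmpdir:
    fname = os.path.join(tmpdir, "hdf_demo.hd5")
    write_with_metadata(fname, df_demo, metadata)
    metadata_out = read_metadata(fname)
    df_demo_out = pd.read_hdf(fname, key="data")

print(metadata_out)
```

Opening the store with `mode="r"` in `read_metadata` avoids accidentally creating or modifying a file when only the settings are needed.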


## What columns in the main SAMI DataFrame could be transferred to a metadata dictionary?
---
i.e., what columns contain only 1 unique value?

In [11]:
from spaxelsleuth import load_user_config
try:
    load_user_config("/Users/u5708159/Desktop/spaxelsleuth_test/.myconfig.json")
except FileNotFoundError:
    load_user_config("/home/u5708159/.spaxelsleuthconfig.json")
from spaxelsleuth.loaddata.sami import make_sami_df, load_sami_df

Loading default config file
Loading user config file


In [12]:
DEBUG = False
nthreads = 4

In [13]:
make_sami_df(bin_type="default", 
            ncomponents="recom", 
            eline_SNR_min=5, 
            eline_ANR_min=3, 
            correct_extinction=True,
            metallicity_diagnostics=["R23_KK04"],
            nthreads=nthreads,
            debug=DEBUG)

sami.py (1254) make_sami_df(): INFO: input parameters: bin_type=default, ncomponents=recom, debug=False, eline_SNR_min=5, eline_ANR_min=3, correct_extinction=True
sami.py (1277) make_sami_df(): INFO: saving to file sami_default_recom-comp_extcorr_minSNR=5_minANR=3.hd5...
sami.py (1324) make_sami_df(): INFO: beginning pool...
sami.py (975) _process_gals(): INFO: finished processing 572402 (3)
sami.py (975) _process_gals(): INFO: finished processing 209807 (2)
sami.py (975) _process_gals(): INFO: finished processing 106717 (1)
sami.py (975) _process_gals(): INFO: finished processing 7139 (0)
addcolumns.py (12) add_columns(): INFO: adding columns to the DataFrame...
addcolumns.py (38) add_columns(): INFO: setting & aplying data quality and S/N cuts...
addcolumns.py (77) add_columns(): INFO: correcting emission line fluxes (but not EWs) for extinction...
addcolumns.py (109) add_columns(): INFO: computing emission line ratios and BPT categories...
addcolumns.py (115) add_columns(): INFO: co

In [14]:
# Load the DataFrame
df = load_sami_df(ncomponents="recom",
                bin_type="default",
                eline_SNR_min=5,
                eline_ANR_min=3,
                correct_extinction=True,
                debug=False)


sami.py (1507) load_sami_df(): INFO: Loading DataFrame from file /Users/u5708159/Desktop/spaxelsleuth_test/output/sami_default_recom-comp_extcorr_minSNR=5_minANR=3.hd5 [last modified 2023-09-16 10:53:52.805085]...
sami.py (1535) load_sami_df(): INFO: finished!


In [15]:
for c in df.columns:
    if len(df[c].unique()) == 1:
        print(c)

Galaxy centre x0_px (projected, arcsec)
Galaxy centre y0_px (projected, arcsec)
Bin size (pixels)
Bin size (square arcsec)
Median SNR (B, full field)
Median SNR (R, full field)
Median SNR (B, 1R_e)
Median SNR (R, 1R_e)
Median SNR (B, 1.5R_e)
Median SNR (R, 1.5R_e)
Median SNR (B, 2R_e)
Median SNR (R, 2R_e)
Bad class #
Cluster member
r/R_200
v/sigma_cluster
Good?
Missing flux flag - HALPHA (component 1)
Missing flux flag - HALPHA (component 2)
Missing flux flag - HALPHA (component 3)
Missing flux flag - HALPHA (total)
Low amplitude flag - HBETA (total)
Low amplitude flag - NII6583 (total)
Low amplitude flag - OI6300 (total)
Low amplitude flag - OII3726+OII3729 (total)
Low amplitude flag - OIII5007 (total)
Low amplitude flag - SII6716 (total)
Low amplitude flag - SII6731 (total)
Missing flux flag - SII6731 (total)
Extinction correction applied
log SFR (component 2)
log SFR error (lower) (component 2)
log SFR error (upper) (component 2)
log SFR (component 3)
log SFR error (lower) (componen

In [16]:
metadata_cols = [
    "Galaxy centre x0_px (projected, arcsec)",
    "Galaxy centre y0_px (projected, arcsec)",
    "Bad class #",
    "Good?",
    "correct_extinction",
    "eline_SNR_min",
    "sigma_gas_SNR_min",
    "eline_ANR_min",
    "line_flux_SNR_cut",
    "missing_fluxes_cut",
    "line_amplitude_SNR_cut",
    "flux_fraction_cut",
    "vgrad_cut",
    "sigma_gas_SNR_cut",
    "stekin_cut",
    "survey",
    "as_per_px",
    "N_x",
    "N_y",
    "x0_px",
    "y0_px",
    "ncomponents",
    "bin_type",
    "__use_lzifu_fits",
    "__lzifu_ncomponents",
    "debug",
    "flux units",
    "continuum units",
]
metadata_dict = {c: df[c].unique()[0] for c in metadata_cols}
df_trimmed = df.drop(columns=metadata_cols)  # drop() already returns a new DataFrame
print(f"Before: df.shape = {df.shape}")
print(f"After: df_trimmed.shape = {df_trimmed.shape}")

Before: df.shape = (3409, 413)
After: df_trimmed.shape = (3409, 384)
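
The split above uses a hand-picked list. A possible way to automate it is sketched below, assuming a hypothetical helper `split_constant_columns` that is not part of spaxelsleuth. It skips all-NaN columns by default, because those are constant only by accident of the sample (e.g. an emission line component that was never fitted), and it converts NumPy scalars to plain Python values so that the dictionary is easier to serialise.

```python
import numpy as np
import pandas as pd


def split_constant_columns(df, keep_all_nan=True):
    """Move single-valued columns of df into a metadata dictionary."""
    constant_cols = []
    for c in df.columns:
        if keep_all_nan and df[c].isna().all():
            continue  # leave all-NaN columns in the table
        if df[c].nunique(dropna=False) == 1:
            constant_cols.append(c)

    metadata = {}
    for c in constant_cols:
        value = df[c].iloc[0]
        # np.int64(50) -> 50, np.float64(0.5) -> 0.5; strings pass through
        metadata[c] = value.item() if isinstance(value, np.generic) else value
    return df.drop(columns=constant_cols), metadata


# Toy DataFrame; the column names are borrowed from the lists above
df_demo = pd.DataFrame({
    "HALPHA (total)": [1.0, 2.0, 3.0],
    "survey": ["sami"] * 3,
    "N_x": [50] * 3,
    "log SFR (component 3)": [np.nan] * 3,
})
df_demo_trimmed, metadata_demo = split_constant_columns(df_demo)
print(list(df_demo_trimmed.columns))
print(metadata_demo)
```

A manual list may still be preferable in practice, since a column that is constant in one sample can vary in another.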


In [17]:
import sys 
# Convert bytes to MB using 1024 * 1024 bytes per MB
print(f"Memory saved by dropping metadata columns: {(sys.getsizeof(df) - sys.getsizeof(df_trimmed)) / 1024 / 1024:.2f} MB")

Memory saved by dropping metadata columns: 1.78 MB
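
To see which columns account for the saving, `DataFrame.memory_usage(deep=True)` gives a per-column breakdown in bytes. The sketch below uses a toy DataFrame; `memory_mib` and `df_demo` are illustrative names, not part of the notebook's data.

```python
import numpy as np
import pandas as pd


def memory_mib(df):
    """Total memory footprint of a DataFrame in MiB, including string data."""
    return df.memory_usage(deep=True).sum() / 1024**2


# Toy DataFrame with one data column and two constant "metadata" columns
df_demo = pd.DataFrame({"flux": np.random.normal(size=10_000)})
df_demo["survey"] = "sami"
df_demo["N_x"] = 50

# Per-column footprint in bytes (the index is reported as its own entry)
per_column_bytes = df_demo.memory_usage(deep=True)
print(per_column_bytes)

saved_mib = memory_mib(df_demo) - memory_mib(df_demo.drop(columns=["survey", "N_x"]))
print(f"Memory saved by dropping metadata columns: {saved_mib:.3f} MiB")
```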


In [18]:
from spaxelsleuth.config import settings

In [19]:
settings["sami"]

{'as_per_px': 0.5,
 'N_x': 50,
 'N_y': 50,
 'x0_px': 24.5,
 'y0_px': 24.5,
 'sigma_inst_kms': 29.6,
 'eline_list': ['HALPHA',
  'HBETA',
  'NII6583',
  'OI6300',
  'OII3726+OII3729',
  'OIII5007',
  'SII6716',
  'SII6731'],
 'bin_types': ['default', 'adaptive', 'sectors'],
 'data_cube_path': '/Users/u5708159/Desktop/spaxelsleuth_test/sami/dr3',
 'input_path': '/Users/u5708159/Desktop/spaxelsleuth_test/sami/dr3',
 'output_path': '/Users/u5708159/Desktop/spaxelsleuth_test/output',
 'lzifu_products_path': 'sami/lzifu/products/'}