# Validating the fsspec-reference-maker zarr view against the original HDF (netCDF4) file for a DAS dataset
## Creation notes
Original HDF dataset from: https://ds.iris.edu/pub/dropoff/buffer/PH5/das_example.h5
Reference json file created via script: das_h5_to_ref.py (in this repo)
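
For orientation, a reference JSON file maps zarr keys either to inline content or to a `[url, offset, length]` byte range inside the original file. The sketch below is a simplified, stdlib-only illustration of that lookup, not the fsspec implementation. The temporary file, the key `Acoustic/0.0`, and the helper `read_ref` are hypothetical stand-ins, not taken from das_h5_to_ref.py.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the HDF file: a header, one 8-byte chunk, a trailer.
chunk_bytes = bytes(range(8))
with tempfile.NamedTemporaryFile(delete=False, suffix=".h5") as tmp:
    tmp.write(b"HEADER" + chunk_bytes + b"TRAILER")
    data_path = tmp.name

# Simplified reference set: each key maps to an inline string
# or to [url, offset, length] pointing into the original file.
references = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "Acoustic/0.0": [data_path, len(b"HEADER"), len(chunk_bytes)],
    },
}


def read_ref(refs, key):
    """Return the bytes a reference entry points to (hypothetical helper)."""
    entry = refs[key]
    if isinstance(entry, str):
        # Inline content is stored directly in the JSON.
        return entry.encode()
    url, offset, length = entry
    with open(url, "rb") as f:
        f.seek(offset)
        return f.read(length)


group_meta = json.loads(read_ref(references["refs"], ".zgroup"))
chunk = read_ref(references["refs"], "Acoustic/0.0")
os.remove(data_path)
print(group_meta, chunk == chunk_bytes)
```

Because the chunk bytes are read straight from the original file, any difference found later between the two datasets should come from metadata or decoding, not from the stored values.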

## Methodology
Load both datasets via xarray: one from the JSON reference file using the zarr
engine and one from the original netCDF4 file. They are compared in the following ways:
1. High-level comparison of the whole Dataset using xarray's equality operator '=='
2. Comparing the individual DataArray "Acoustic"
3. Comparing Dataset attrs (for this example file there are none)
4. Comparing the attrs of the individual DataArray "Acoustic"

## Future work
1. Add further comparisons, such as plotting an event from both
   datasets to demonstrate how to extract data and units and plot them.
2. Replicate this in other notebooks for other formats such as PH5, MTH, ...

In [87]:
import os
import numpy as np
import xarray as xr
import fsspec

In [88]:
FOLDER = '/mnt/hgfs/ssd_tmp/'
h5_filename = os.path.join(FOLDER, 'das_example.h5')

reference_file = 'results_20210809/das_h5/das_example_ref_fs.json'

In [89]:
uri = f'file://{reference_file}'

fs = fsspec.filesystem('reference', fo=uri, remote_protocol="file")
m = fs.get_mapper("")
# ds_zarr = xr.open_dataset(m, engine="zarr") # This caused data array to come in as float32
# Per https://stackoverflow.com/questions/68460507/xarray-loading-int-data-as-float
ds_zarr = xr.open_dataset(m, engine="zarr", mask_and_scale=False)
ds_zarr

Output: xarray warning about falling back to non-consolidated zarr metadata. The options it lists are:
1. Consolidating metadata in this existing store with zarr.consolidate_metadata().
2. Explicitly setting consolidated=False, to avoid trying to read consolidate metadata, or
3. Explicitly setting consolidated=True, to raise an error in this case instead of falling back to try reading non-consolidated metadata.
  ds_zarr = xr.open_dataset(m, engine="zarr", mask_and_scale=False)
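
The float32 behaviour noted in the comments above can be reproduced without any files. The sketch below applies xarray's CF decoding to a small in-memory dataset. The int16 values and the fill value are invented for illustration, and it assumes the zarr store carries a `_FillValue` attribute, which is what triggers masking.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the "Acoustic" variable: int16 data with a _FillValue attribute.
raw = xr.Dataset(
    {
        "Acoustic": (
            ("time",),
            np.array([1, 2, -32768], dtype="int16"),
            {"_FillValue": np.int16(-32768)},
        )
    }
)

# Default decoding masks fill values with NaN, which forces a float dtype.
decoded = xr.decode_cf(raw)
# With mask_and_scale=False the integers are left untouched.
undecoded = xr.decode_cf(raw, mask_and_scale=False)

print(decoded.Acoustic.dtype, undecoded.Acoustic.dtype)
print("_FillValue" in undecoded.Acoustic.attrs)
```

With `mask_and_scale=False` the `_FillValue` entry stays in `attrs`, which is a plausible explanation for the one extra attribute seen on the zarr side later in this notebook.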


In [90]:
ds_hdf = xr.open_dataset(h5_filename, engine='netcdf4')
# The "mask_and_scale" option has no effect on hdf files
# ds_hdf = xr.open_dataset(h5_filename, engine='netcdf4', mask_and_scale=False)
ds_hdf

In [91]:
is_equal = ds_zarr == ds_hdf
is_equal_np = is_equal.to_array().to_numpy()
f'dataset_zarr == dataset_hdf {np.all(is_equal_np)}'

'dataset_zarr == dataset_hdf True'
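
A caveat on the '==' approach: elementwise equality treats NaN as unequal to itself, so it only works cleanly here because both datasets hold integers. The NumPy sketch below, on made-up values, shows the difference between a plain elementwise check and a NaN-aware one.

```python
import numpy as np

# Made-up float data containing a NaN in the same position on both sides.
a = np.array([1.0, np.nan, 3.0])
b = a.copy()

# Elementwise '==' reports the NaN position as different.
elementwise_all = bool(np.all(a == b))
# array_equal with equal_nan=True treats matching NaNs as equal.
nan_aware = bool(np.array_equal(a, b, equal_nan=True))

print(elementwise_all, nan_aware)
```

On the xarray side, `Dataset.equals` applies the same NaN-aware rule, while `Dataset.identical` additionally compares names and attributes.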

In [92]:
is_da_equal = ds_zarr.Acoustic == ds_hdf.Acoustic
is_da_equal_np = is_da_equal.to_numpy()
is_da_equal_np = np.squeeze(is_da_equal_np)
f'All DataArray elements are the same: {np.all(is_da_equal_np)}'

'All DataArray elements are the same: True'

In [93]:
n_true = np.count_nonzero(is_da_equal_np)
total = is_da_equal_np.size
f'{n_true} are the same out of {total} ({100*n_true/total:.5f}%)'

'23760000 are the same out of 23760000 (100.00000%)'
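
If the percentage were ever below 100, the same boolean mask could be used to locate the differing samples. A minimal sketch on a small made-up array, with one mismatch introduced deliberately:

```python
import numpy as np

# Made-up arrays standing in for the two "Acoustic" arrays.
expected = np.arange(12, dtype="int16").reshape(3, 4)
actual = expected.copy()
actual[1, 2] = -1  # introduce a single mismatch

mask = expected == actual
n_true = int(np.count_nonzero(mask))
# Each row of mismatch_idx is the index of one differing element.
mismatch_idx = np.argwhere(~mask)

print(f'{n_true} are the same out of {mask.size}')
print(mismatch_idx)
```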

In [94]:
attrs_zarr = ds_zarr.attrs
attrs_hdf = ds_hdf.attrs
f'No top level attributes. len zarr attrs: {len(attrs_zarr)}, \
len of hdf attrs: {len(attrs_hdf)}'

'No top level attributes. len zarr attrs: 0, len of hdf attrs: 0'

In [95]:
darray_attrs_zarr = ds_zarr.Acoustic.attrs
darray_attrs_hdf = ds_hdf.Acoustic.attrs
f'Acoustic attributes. len zarr attrs: {len(darray_attrs_zarr)}, \
len of hdf attrs: {len(darray_attrs_hdf)}'

'Acoustic attributes. len zarr attrs: 84, len of hdf attrs: 83'

In [96]:
diff_keys = set(darray_attrs_zarr.keys()) - set(darray_attrs_hdf.keys())
f'The extra key in zarr set is: {diff_keys}'

"The extra key in zarr set is: {'_FillValue'}"

In [97]:
darray_attrs_zarr_fixed = darray_attrs_zarr.copy()
del darray_attrs_zarr_fixed['_FillValue']

f'After removing extra key from zarr attrs they are both equal: \
{darray_attrs_zarr_fixed == darray_attrs_hdf}'

'After removing extra key from zarr attrs they are both equal: True'
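
The last three cells can be folded into one helper that reports keys present on only one side as well as keys whose values differ. This is a sketch with a hypothetical function name and made-up attribute dictionaries. It assumes scalar or string attribute values, since array-valued attributes would need `np.array_equal` instead of `!=`.

```python
def diff_attrs(left, right):
    """Compare two attribute dicts (hypothetical helper).

    Returns keys only in left, keys only in right, and shared keys
    whose values differ, each as a sorted list.
    """
    left_keys, right_keys = set(left), set(right)
    only_left = sorted(left_keys - right_keys)
    only_right = sorted(right_keys - left_keys)
    changed = sorted(k for k in left_keys & right_keys if left[k] != right[k])
    return only_left, only_right, changed


# Made-up attributes; only the '_FillValue' key mirrors the notebook's finding.
zarr_attrs = {"units": "counts", "rate": 1000, "_FillValue": 0}
hdf_attrs = {"units": "counts", "rate": 1000}

only_zarr, only_hdf, changed = diff_attrs(zarr_attrs, hdf_attrs)
print(only_zarr, only_hdf, changed)
```

Checking both directions guards against the case where the HDF side has a key the zarr side lacks, which a one-way set difference would miss.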