# Testing HDF5 diff tools

### Types of change

- Each type of change is associated with a metadata field/property of HDF5 Objects
    - Metadata/properties can be generic (apply to all objects) or specific (only apply to a particular object type)
- The nomenclature mostly follows that of `h5py`

#### Between generic h5 Objects

- `type_h5`: the two Objects have different HDF5 types
    - e.g. `obj_a.type_h5 = Dataset`, `obj_b.type_h5 = Group`
    - Called "htype" in `ndiff`
- `attributes`: the attributes of each of the two Objects are different
    - Orthogonal to `type_h5`, i.e. two Objects can have all 4 combinations of same/different `type_h5` and `attributes`

#### Between Datasets

- `ndim`: the two Datasets have a different number of dimensions (axes)
    - Called "rank" in `h5diff`
- `shape`: the two Datasets have different dimension sizes
    - `shape` is a tuple of integers where `len(shape)` = `ndim`
    - Therefore, two Datasets can have the same `ndim`, but different `shape`
- `dtype`: the two Datasets have different data types
    - `dtype` usually corresponds to a numpy dtype (but not always)
    - Differences in `dtype` are orthogonal to differences in structure (`ndim`, `shape`)
- `value`: the content of the two Datasets is different
    - For `value` to be fully comparable (i.e. elementwise delta), `ndim`, `shape`, and `dtype` must be equal
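
The Dataset-level checks above can be sketched as a small helper. This is only an illustration, not code from `h5diff` or `ndiff`: the function names are hypothetical, and it assumes the objects expose `ndim`, `shape` and `dtype` attributes, as `h5py.Dataset` and numpy arrays do. Stand-in objects are used so the sketch runs without an HDF5 file; the `int64` dtype is an assumption based on the `native long` reported by `h5ls`.

```python
from types import SimpleNamespace

# Properties compared, in the order they are listed above.
DATASET_PROPERTIES = ("ndim", "shape", "dtype")


def diff_dataset_properties(a, b, names=DATASET_PROPERTIES):
    """Return {name: (value_a, value_b)} for each property that differs."""
    changes = {}
    for name in names:
        value_a, value_b = getattr(a, name), getattr(b, name)
        if value_a != value_b:
            changes[name] = (value_a, value_b)
    return changes


def values_fully_comparable(a, b):
    """True when ndim, shape and dtype all match (elementwise delta possible)."""
    return not diff_dataset_properties(a, b)


# Stand-ins mirroring /data_raw/exp0: shape (6,) in A.h5, (2, 3) in B.h5.
exp0_a = SimpleNamespace(ndim=1, shape=(6,), dtype="int64")
exp0_b = SimpleNamespace(ndim=2, shape=(2, 3), dtype="int64")

print(diff_dataset_properties(exp0_a, exp0_b))
print(values_fully_comparable(exp0_a, exp0_b))
```

Here `ndim` and `shape` are reported as changed while `dtype` is not, which matches the idea that `dtype` differences are orthogonal to structural ones.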
    
    
#### Between Groups

Note: in `h5py`, `File` objects are also `Group`s.

- `num_objs`: the two Groups contain a different number of Objects
    - Compare only direct children (i.e. no recursion) for simplicity
    - `h5py.Group.id.get_num_objs()` (also what `len(group)` returns) only counts the total number
    - Possible to extend this to have separate counts by `type_h5`, e.g. `num_objs: {Dataset: 3, Group: 2, total: 5}`
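
A per-type count of direct children could look roughly like the sketch below. The helper name `count_children` is hypothetical. It only assumes the Group behaves like a mapping with a `values()` method, which holds for `h5py.Group`; stand-in classes named `Dataset` and `Group` are used here so the example runs without an HDF5 file.

```python
from collections import Counter


def count_children(group):
    """Count the direct children of a Group-like mapping by type name.

    No recursion: only the objects returned by group.values() are counted.
    """
    counts = Counter(type(child).__name__ for child in group.values())
    counts["total"] = sum(counts.values())
    return dict(counts)


# Stand-in classes; with h5py the children would be h5py.Dataset and
# h5py.Group instances, whose type names are also "Dataset" and "Group".
class Dataset:
    pass


class Group(dict):
    pass


data_raw = Group(exp0=Dataset(), exp1=Group(), exp2=Dataset())
print(count_children(data_raw))
```

Two Groups would then differ in `num_objs` whenever the resulting dictionaries are not equal, which also catches the case where the total is unchanged but a Dataset was replaced by a Group.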
    
#### Between Files

- `filename`: the two Files have different filesystem paths

#### Between Attributes

- Attributes in HDF5 are mappings between text keys and values
- Values can be of any supported dtype
- Effectively, Attributes can be considered a flat Group (i.e. no sub-groups) with one or more Datasets
- Comparison between Attributes has similar semantics to comparison between Groups
    - Keys can be: only in `A`, only in `B`, in both and values are equal, in both and values are different
    - Types of change between Attribute values are the same as for Datasets
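
The four key categories can be computed with plain set operations. The sketch below is an assumption-laden illustration: `diff_attributes` is a hypothetical name, the inputs only need to behave like mappings (plain dictionaries here, `obj.attrs` with `h5py`), and the attribute values are made up rather than read from the test files.

```python
def diff_attributes(attrs_a, attrs_b):
    """Classify attribute keys into the four cases listed above."""
    keys_a, keys_b = set(attrs_a), set(attrs_b)
    result = {
        "only_in_a": sorted(keys_a - keys_b),
        "only_in_b": sorted(keys_b - keys_a),
        "equal": [],
        "different": [],
    }
    for key in sorted(keys_a & keys_b):
        # Assumes scalar values; array-valued attributes would need an
        # elementwise check (e.g. numpy.array_equal) instead of ==.
        same = attrs_a[key] == attrs_b[key]
        result["equal" if same else "different"].append(key)
    return result


# Made-up values, loosely modelled on the attributes of /data_raw/exp3.
attrs_a = {"temperature": 20.0, "uid": "run-a"}
attrs_b = {
    "temperature": 21.5,
    "uid": "run-a",
    "temperature_comment": "recalibrated",
}

print(diff_attributes(attrs_a, attrs_b))
```

Comparing values with `==` treats each attribute atomically, which suits string values; a Dataset-style comparison (`ndim`, `shape`, `dtype`, `value`) could be substituted for array-valued attributes.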
    

## Files used for testing

- `A.h5`, created from scratch in `create-A.ipynb`
- `B.h5`, modified from a copy of `A.h5` in `create-B.ipynb`

In [16]:
h5ls --verbose --full --recursive {A,B}.h5

Opened "A.h5" with sec2 driver.
A.h5//                   Group
    Location:  1:96
    Links:     1
A.h5//analysis           Group
    Location:  1:2536
    Links:     1
A.h5//data_clean         Group
    Location:  1:1832
    Links:     1
A.h5//data_raw           Group
    Location:  1:800
    Links:     1
A.h5//data_raw/exp0      Dataset {6/6}
    Attribute: temperature scalar
        Type:      native double
        Data:  23.2
    Attribute: uid scalar
        Type:      14-byte null-padded ASCII string
        Data:  "2019-07-01_001"
    Location:  1:3240
    Links:     1
    Storage:   48 logical bytes, 48 allocated bytes, 100.00% utilization
    Type:      native long
A.h5//data_raw/exp1      Dataset {8/8}
    Attribute: temperature scalar
        Type:      native double
        Data:  35.9
    Attribute: uid scalar
        Type:      14-byte null-padded ASCII string
        Data:  "2019-07-03_001"
    Location:  1:5984
    Links:     1
    Storage:   8 logical bytes, 8 allocat

---
A basic way to detect changes is to use `h5ls` to extract metadata about the two files as text, and use `diff` to compare the two:

In [18]:
diff <(h5ls --verbose --full --recursive A.h5) <(h5ls --verbose --full --recursive B.h5)

1c1
< Opened "A.h5" with sec2 driver.
---
> Opened "B.h5" with sec2 driver.
10a11,26
> /data_clean/exp0         Dataset {2/2, 3/3}
>     Attribute: threshold scalar
>         Type:      native long
>         Data:  4
>     Location:  1:13368
>     Links:     1
>     Storage:   48 logical bytes, 48 allocated bytes, 100.00% utilization
>     Type:      native double
> /data_clean/exp2         Dataset {7/7}
>     Attribute: threshold scalar
>         Type:      native long
>         Data:  20
>     Location:  1:12768
>     Links:     1
>     Storage:   56 logical bytes, 56 allocated bytes, 100.00% utilization
>     Type:      native double
14c30
< /data_raw/exp0           Dataset {6/6}
---
> /data_raw/exp0           Dataset {2/2, 3/3}
21c37
<     Location:  1:3240
---
>     Location:  1:7712
25c41,44
< /data_raw/exp1           Dataset {8/8}
---
> /data_raw/exp1           Group
>     Location:  1:8776
>     Links:     1
> /data_raw/exp1/run0      Dataset {8/8}
32c51,65
<     Location:  1:5

: 1

## `h5diff`

In [1]:
ls -lh *.h5

-rw-r--r-- 1 ludo ludo 7.6K Jul 16 14:55 A.h5
-rw-r--r-- 1 ludo ludo  14K Jul 16 17:39 B.h5


In [2]:
h5diff -v A.h5 B.h5


file1     file2
---------------------------------------
    x      x    /              
    x      x    /analysis      
    x      x    /data_clean    
           x    /data_clean/exp0
           x    /data_clean/exp2
    x      x    /data_raw      
    x      x    /data_raw/exp0 
    x      x    /data_raw/exp1 
           x    /data_raw/exp1/run0
           x    /data_raw/exp1/run1
    x      x    /data_raw/exp2 
    x      x    /data_raw/exp3 
    x           /data_raw/exp4 
    x      x    /data_raw/exp5 
    x      x    /data_raw/exp6 
           x    /data_raw/experiment_4

group  : </> and </>
0 differences found
group  : </analysis> and </analysis>
0 differences found
group  : </data_clean> and </data_clean>
0 differences found
group  : </data_raw> and </data_raw>
0 differences found
dataset: </data_raw/exp0> and </data_raw/exp0>
Not comparable: </data_raw/exp0> has rank 1, dimensions [6], max dimensions [6]
and </data_raw/exp0> has rank 2, dimensions [2x3], max dimensions [2x3

: 1

In [3]:
h5diff -c A.h5 B.h5

Not comparable: </data_raw/exp0> has rank 1, dimensions [6], max dimensions [6]
and </data_raw/exp0> has rank 2, dimensions [2x3], max dimensions [2x3]
Not comparable: </data_raw/exp1> is of type H5G_DATASET and </data_raw/exp1> is of type H5G_GROUP
Not comparable: </data_raw/exp2> is of class H5T_INTEGER and </data_raw/exp2> is of class H5T_FLOAT
attribute: <temperature of </data_raw/exp3>> and <temperature of </data_raw/exp3>>
1 differences found
Not comparable: </data_raw/exp5> has rank 1, dimensions [5], max dimensions [5]
and </data_raw/exp5> has rank 1, dimensions [10], max dimensions [10]
dataset: </data_raw/exp6> and </data_raw/exp6>
6 differences found


: 1

### Comparing files

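- `h5diff` correctly detects "heterogeneous" differences, i.e. `type_h5`, `ndim`, `shape`, `dtype`, but reports them as "Not comparable" instead of describing the change; the exit code of 1 is the status `h5diff` documents for "differences found" (errors use codes greater than 1)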
- The only "proper" type of difference is elementwise comparison
    - Applied to values of comparable datasets (same `ndim`, `shape`, and `dtype`), and values of attributes
    - This is because attribute values are implemented as smaller/simplified generic data types
    - In many circumstances (especially when values are strings) changes in attribute values should be considered atomically, i.e. without doing elementwise comparison
    - There are no additional delta metrics calculated from elementwise comparison (e.g. mean/std of delta)
    - Another possible comparison would be between datasets with the same `ndim` and `dtype`, but different `shape`, i.e. comparing datasets with similar structure to which elements were removed/added
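
As a rough idea of what such delta metrics could look like, the sketch below summarizes the elementwise delta of two comparable arrays with numpy. The function name and the choice of metrics are assumptions, the sign convention (`b - a`) is arbitrary, and the input values are made up: only the single difference at position 0 is modelled on the `exp4` / `experiment_4` comparison.

```python
import numpy as np


def delta_summary(values_a, values_b):
    """Summarize the elementwise delta between two same-shaped arrays."""
    a = np.asarray(values_a, dtype=float)
    b = np.asarray(values_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError(f"shapes differ: {a.shape} vs {b.shape}")
    delta = b - a
    return {
        "n_changed": int(np.count_nonzero(delta)),
        "mean": float(delta.mean()),
        "std": float(delta.std()),
        "max_abs": float(np.abs(delta).max()),
    }


# Made-up values: seven elements that differ only at position 0.
before = [53, 1, 2, 3, 4, 5, 6]
after = [0, 1, 2, 3, 4, 5, 6]
summary = delta_summary(before, after)
print(summary)
```

The shape check mirrors the requirement that `ndim` and `shape` must match before an elementwise delta is meaningful.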

### Comparing specific objects in different files

In [4]:
h5diff -v A.h5 B.h5 data_raw/exp4 data_raw/experiment_4

dataset: </data_raw/exp4> and </data_raw/experiment_4>
size:           [7]           [7]
position        exp4            experiment_4    difference          
------------------------------------------------------------
[ 0 ]          53              0               53             
1 differences found
attribute: <temperature of </data_raw/exp4>> and <temperature of </data_raw/experiment_4>>
0 differences found
attribute: <uid of </data_raw/exp4>> and <uid of </data_raw/experiment_4>>
0 differences found


: 1


- Works by supplying the name/path directly to identify comparable objects
- Does not support automatically defining correspondences between multiple pairs of objects that have different paths in `A` and `B`
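
One way to work around the missing correspondence support is to drive `h5diff` from a small script, one invocation per pair of paths. The sketch below is a hypothetical wrapper, not part of any existing tool: it relies only on the `h5diff [options] file1 file2 obj1 obj2` form used above, and the correspondence table has to be supplied by hand.

```python
import subprocess


def h5diff_commands(file_a, file_b, path_pairs, options=("-v",)):
    """Build one h5diff argument list per (path_in_a, path_in_b) pair."""
    return [
        ["h5diff", *options, file_a, file_b, path_a, path_b]
        for path_a, path_b in path_pairs
    ]


def run_all(commands):
    """Run each command; map its object pair to (exit status, output)."""
    results = {}
    for argv in commands:
        completed = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        results[(argv[-2], argv[-1])] = (completed.returncode, completed.stdout)
    return results


# Hand-written correspondence table: the rename seen in the test files.
pairs = [("data_raw/exp4", "data_raw/experiment_4")]
commands = h5diff_commands("A.h5", "B.h5", pairs)
print(commands[0])
```

Calling `run_all(commands)` requires `h5diff` on the `PATH` and the two files in the working directory, so it is left out of the example run.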

### Comparing objects with different paths in the same file

In [9]:
h5diff -vvv B.h5 B.h5 data_raw/exp2 data_clean/exp2

dataset: </data_raw/exp2> and </data_clean/exp2>
size:           [7]           [7]
position        exp2            exp2            difference          
------------------------------------------------------------
[ 1 ]          1               nan             -nan           
[ 3 ]          3               nan             -nan           
[ 4 ]          8               nan             -nan           
[ 5 ]          5               nan             -nan           
4 differences found


: 1

- `nan`s are reported as differences with a `-nan` delta (and a non-zero exit code), even though NaNs are supported by numpy and (one would expect) by HDF5 as well
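
For comparison, numpy makes it straightforward to separate positions where a NaN appears on only one side from positions where two finite values differ. The sketch below uses a hypothetical helper name and made-up input values; it is not how `h5diff` works internally.

```python
import numpy as np


def split_differences(values_a, values_b):
    """Separate NaN mismatches from changes between non-NaN values.

    Positions where both arrays hold NaN are treated as equal.
    """
    a = np.asarray(values_a, dtype=float)
    b = np.asarray(values_b, dtype=float)
    nan_a, nan_b = np.isnan(a), np.isnan(b)
    nan_mismatch = nan_a ^ nan_b  # NaN on exactly one side
    value_changed = ~nan_a & ~nan_b & (a != b)
    return {
        "nan_mismatch": np.flatnonzero(nan_mismatch).tolist(),
        "value_changed": np.flatnonzero(value_changed).tolist(),
    }


# Made-up values: NaN introduced at index 1, value changed at index 2,
# NaN on both sides at index 4.
raw = [0.0, 1.0, 2.0, 3.0, np.nan]
clean = [0.0, np.nan, 2.5, 3.0, np.nan]
print(split_differences(raw, clean))
```

Reporting the two categories separately avoids undefined deltas: a delta is only computed where both values are finite.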

## `ndiff`

- This is the script on which the current Dac-Man HDF5 plugin at `dacman.plugins.hdf5diff` is based

In [10]:
conda activate dev-dacman
which python

/opt/conda/envs/dev-dacman/bin/python


In [21]:
python -m ndiff A.h5 B.h5

args=['A.h5', 'B.h5']
Comparing 'A.h5' and 'B.h5'
------------------------------
Examining /
	analysis
	data_clean
	data_raw
------------------------------
Examining /analysis/
------------------------------
Examining /data_clean/
** Element 'exp0' only in 'B.h5' (DIFF_UNIQUE_B)**
** Element 'exp2' only in 'B.h5' (DIFF_UNIQUE_B)**
------------------------------
Examining /data_raw/
** Element 'exp4' only in 'A.h5' (DIFF_UNIQUE_A)**
** Element 'experiment_4' only in 'B.h5' (DIFF_UNIQUE_B)**
	exp0
	exp1
**  Different element types: 'dataset' and 'group' (DIFF_OBJECTS)
	exp2
	exp3
** Attribute 'temperature_comment' only in 'B.h5' (DIFF_UNIQ_ATTR_B)**
	exp5
	exp6


In [23]:
python -m ndiff A.h5 B.h5 data_raw/

args=['A.h5', 'B.h5', 'data_raw/']
Comparing 'A.h5' and 'B.h5'
------------------------------
Examining /
** Element 'exp4' only in 'A.h5' (DIFF_UNIQUE_A)**
** Element 'experiment_4' only in 'B.h5' (DIFF_UNIQUE_B)**
	exp0
	exp1
**  Different element types: 'dataset' and 'group' (DIFF_OBJECTS)
	exp2
	exp3
** Attribute 'temperature_comment' only in 'B.h5' (DIFF_UNIQ_ATTR_B)**
	exp5
	exp6


- The change in value for attribute `temperature` in `/data_raw/exp3` is not detected
- Attributes of Objects with different `type_h5` are not compared, even though in principle they could be
- `num_objs` in Groups (including Files) is not compared

### Specifying objects to compare

- Quick modification to `ndiff.py` to support 2, 3 or 4 CLI args
- Same semantics as `h5diff`

As it is, directly comparing Datasets is not supported; only Groups can be compared

In [33]:
python -m ndiff A.h5 B.h5 data_raw/exp4 data_raw/experiment_4

args=['A.h5', 'B.h5', 'data_raw/exp4', 'data_raw/experiment_4']
Comparing 'A.h5' and 'B.h5'
------------------------------
Examining /
Traceback (most recent call last):
  File "/opt/conda/envs/dev-dacman/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/envs/dev-dacman/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ludo/lbl/deduce/try-hdf5/ndiff.py", line 144, in <module>
    main()
  File "/home/ludo/lbl/deduce/try-hdf5/ndiff.py", line 140, in main
    compare(file1, file2, path_file1=path_file1, path_file2=path_file2)
  File "/home/ludo/lbl/deduce/try-hdf5/ndiff.py", line 124, in compare
    diff_groups(file1, f1[path_file1], file2, f2[path_file2], "/")
  File "/home/ludo/lbl/deduce/try-hdf5/ndiff.py", line 49, in diff_groups
    desc1 = evaluate_group(path, grp1)
  File "/home/ludo/lbl/deduce/try-hdf5/ndiff.py", line 36, in evaluate_group
    for k,v in grp.items():
AttributeError: 'Datas

: 1
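
The traceback comes from calling `.items()` on a Dataset. A possible way to support both cases is to check whether each object is Group-like before recursing. The sketch below is not the code of `ndiff.py`: all function names are hypothetical, dictionaries stand in for Groups, and simple namespace objects stand in for Datasets, so it runs without `h5py`.

```python
from types import SimpleNamespace


def is_group_like(obj):
    """Groups (and Files) expose a mapping interface; Datasets do not."""
    return hasattr(obj, "items")


def compare_leaves(leaf_a, leaf_b):
    """Placeholder leaf comparison: names of differing properties."""
    return [
        name
        for name in ("shape", "dtype")
        if getattr(leaf_a, name, None) != getattr(leaf_b, name, None)
    ]


def compare_objects(obj_a, obj_b, path="/"):
    """Yield (path, message) pairs, recursing only into Group-like objects."""
    group_a, group_b = is_group_like(obj_a), is_group_like(obj_b)
    if group_a != group_b:
        yield path, "different element types"
    elif not group_a:
        # Both are leaves (e.g. Datasets): compare them directly.
        for name in compare_leaves(obj_a, obj_b):
            yield path, f"different {name}"
    else:
        for key in sorted(set(obj_a) | set(obj_b)):
            child = path.rstrip("/") + "/" + key
            if key not in obj_b:
                yield child, "only in A"
            elif key not in obj_a:
                yield child, "only in B"
            else:
                yield from compare_objects(obj_a[key], obj_b[key], child)


def leaf(shape):
    """Stand-in for a Dataset (dtype is an arbitrary placeholder)."""
    return SimpleNamespace(shape=shape, dtype="int64")


tree_a = {"data_raw": {"exp0": leaf((6,)), "exp1": leaf((8,)), "exp4": leaf((7,))}}
tree_b = {
    "data_raw": {
        "exp0": leaf((2, 3)),
        "exp1": {"run0": leaf((8,))},
        "experiment_4": leaf((7,)),
    }
}

for item in compare_objects(tree_a, tree_b):
    print(item)
```

With this structure, passing two Dataset-like objects as the starting points (for example `compare_objects(leaf((7,)), leaf((7,)))`) goes straight to the leaf comparison instead of failing.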

In [34]:
python -m ndiff A.h5 B.h5 data_raw/

args=['A.h5', 'B.h5', 'data_raw/']
Comparing 'A.h5' and 'B.h5'
------------------------------
Examining /
** Element 'exp4' only in 'A.h5' (DIFF_UNIQUE_A)**
** Element 'experiment_4' only in 'B.h5' (DIFF_UNIQUE_B)**
	exp0
	exp1
**  Different element types: 'dataset' and 'group' (DIFF_OBJECTS)
	exp2
	exp3
** Attribute 'temperature_comment' only in 'B.h5' (DIFF_UNIQ_ATTR_B)**
	exp5
	exp6


In [32]:
python -m ndiff B.h5 B.h5 data_raw/ data_clean/

args=['B.h5', 'B.h5', 'data_raw/', 'data_clean/']
Comparing 'B.h5' and 'B.h5'
------------------------------
Examining /
** Element 'exp1' only in 'B.h5' (DIFF_UNIQUE_A)**
** Element 'exp3' only in 'B.h5' (DIFF_UNIQUE_A)**
** Element 'exp5' only in 'B.h5' (DIFF_UNIQUE_A)**
** Element 'exp6' only in 'B.h5' (DIFF_UNIQUE_A)**
** Element 'experiment_4' only in 'B.h5' (DIFF_UNIQUE_A)**
	exp0
** Attribute 'temperature' only in 'B.h5' (DIFF_UNIQ_ATTR_A)**
** Attribute 'uid' only in 'B.h5' (DIFF_UNIQ_ATTR_A)**
** Attribute 'threshold' only in 'B.h5' (DIFF_UNIQ_ATTR_B)**
	exp2
** Attribute 'temperature' only in 'B.h5' (DIFF_UNIQ_ATTR_A)**
** Attribute 'uid' only in 'B.h5' (DIFF_UNIQ_ATTR_A)**
** Attribute 'threshold' only in 'B.h5' (DIFF_UNIQ_ATTR_B)**
