# How is size of HDF5 files affected by the way data is assigned?

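Main outcome: If data is passed directly to `create_dataset`, the 
resulting file is twice as large as when the shape is defined first 
and the data is assigned afterwards. The cause is the dtype: the
two-step route stores 32-bit floats, the direct route 64-bit floats.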

In [None]:
import os
import h5py
import numpy as np

In [None]:
def print_filesize(i):
    s = os.path.getsize(fnm.format(i))
    print(fnm.format(i), s, "Bytes")

## Define shape but assign no data

This results in a very small file. So presumably, the file
"knows" it has no data.

In [None]:
shape = (10000, 10)
data = np.random.rand(*shape)

In [None]:
fnm = "hdf5size_{:d}.hdf5"

In [None]:
i = 1

with h5py.File(fnm.format(i), 'w') as f:

    f.create_dataset('data', shape=shape, fillvalue=np.nan)

print_filesize(i)

## Define shape and assign data

This results in a much larger file.

In [None]:
i = 2

with h5py.File(fnm.format(i), 'w') as f:

    f.create_dataset('data', shape=shape, fillvalue=np.nan)
    f['data'][:] = data

print_filesize(i)

## Define shape and partially assign data

This yields the same file size as filling the dataset completely.

In [None]:
i = 3

with h5py.File(fnm.format(i), 'w') as f:

    f.create_dataset('data', shape=shape, fillvalue=np.nan)
    f['data'][:100] = data[:100,:]

# Report the size only after the file has been closed and flushed.
print_filesize(i)
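
One way to look at this more directly is to ask HDF5 how much storage it has allocated for the dataset, before and after the partial write. The following is only a sketch: it assumes the default (non-chunked) layout, uses the low-level `dset.id.get_storage_size()` call, and writes to a temporary file with the made-up name `storage_check.hdf5` so that the files of this notebook are not touched.

```python
import os
import tempfile

import h5py
import numpy as np

shape = (10000, 10)
data = np.random.rand(*shape)

# Work in a temporary directory; "storage_check.hdf5" is a hypothetical name.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "storage_check.hdf5")

    with h5py.File(path, "w") as f:
        dset = f.create_dataset("data", shape=shape, fillvalue=np.nan)

        # Number of bytes the complete dataset occupies at its stored dtype.
        full_size = int(np.prod(shape)) * dset.dtype.itemsize

        # Storage allocated before anything has been written.
        allocated_before = dset.id.get_storage_size()

        # Write only the first 100 rows.
        dset[:100] = data[:100, :]

        # Storage allocated after the partial write.
        allocated_after = dset.id.get_storage_size()

print("allocated before write:       ", allocated_before, "Bytes")
print("allocated after partial write:", allocated_after, "Bytes")
print("size of the complete dataset: ", full_size, "Bytes")
```

If the allocation after the partial write already matches the size of the complete dataset, that would explain why files 2 and 3 have the same size.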

## Assign data during creation

This yields **twice the file size** compared with first defining 
the shape and then assigning the data.

In [None]:
i = 4

with h5py.File(fnm.format(i), 'w') as f:
    f.create_dataset('data', data=data, fillvalue=np.nan)

# Report the size only after the file has been closed and flushed.
print_filesize(i)

### Are the data identical?

In [None]:
res = {}

for i in [2, 4]: 

    with h5py.File(fnm.format(i), 'r') as f:
        res[str(i)] = f["data"][:]

Are they equal? --> NO

In [None]:
np.all(np.equal(res["2"], res["4"]))

Are they almost equal? --> YES

In [None]:
np.all(np.isclose(res["2"], res["4"]))

So what's the difference? --> The default dtype of `create_dataset` in 
h5py is 32-bit float, not 64-bit. Apparently, if the data is added in two 
steps, it is converted to 32-bit floats, whereas passing the data directly 
keeps the original dtype.
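
The factor of two and the "almost equal" result can both be reproduced with NumPy alone, without touching any file. This is a minimal sketch that assumes the conversion is an ordinary cast to `float32`; the seeded generator `rng` is introduced here only to make the example repeatable.

```python
import numpy as np

shape = (10000, 10)
rng = np.random.default_rng(0)  # seeded only for repeatability
data = rng.random(shape)        # float64, like np.random.rand

# Cast to 32-bit floats, which halves the bytes per value.
data32 = data.astype(np.float32)

print(data.dtype, data.nbytes, "Bytes")
print(data32.dtype, data32.nbytes, "Bytes")

# The cast rounds each value, so exact equality is lost ...
exactly_equal = bool(np.all(data32 == data))
# ... but the rounding error is far below the default tolerance of isclose.
almost_equal = bool(np.all(np.isclose(data32, data)))

print("exactly equal:", exactly_equal)
print("almost equal: ", almost_equal)
```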

In [None]:
for k, v in res.items():
    print(k, v.dtype)

In [None]:
print(data.dtype)
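
If the dtype is the only difference, passing `dtype` explicitly in the two-step variant should make both routes store the same values. The sketch below tries this in a temporary directory; the file names `two_step.hdf5` and `direct.hdf5` are made up for the example and are not part of the notebook above.

```python
import os
import tempfile

import h5py
import numpy as np

shape = (10000, 10)
data = np.random.rand(*shape)

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file names, used only for this comparison.
    path_two_step = os.path.join(tmpdir, "two_step.hdf5")
    path_direct = os.path.join(tmpdir, "direct.hdf5")

    # Two steps: define shape AND dtype first, then assign the data.
    with h5py.File(path_two_step, "w") as f:
        dset = f.create_dataset("data", shape=shape, dtype=data.dtype,
                                fillvalue=np.nan)
        dset[:] = data

    # One step: pass the data during creation.
    with h5py.File(path_direct, "w") as f:
        f.create_dataset("data", data=data, fillvalue=np.nan)

    # Read both datasets back and record the file sizes.
    with h5py.File(path_two_step, "r") as f:
        two_step = f["data"][:]
    with h5py.File(path_direct, "r") as f:
        direct = f["data"][:]

    size_two_step = os.path.getsize(path_two_step)
    size_direct = os.path.getsize(path_direct)

print("dtypes:", two_step.dtype, direct.dtype)
print("identical values:", np.array_equal(two_step, direct))
print("file sizes:", size_two_step, size_direct, "Bytes")
```

Conversely, passing `dtype=np.float32` together with `data=data` would be one way to get the smaller file in a single step, at the cost of the reduced precision seen above.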

## Clean up the files.

In [None]:
%rm hdf5size*.hdf5