## Core concepts
http://docs.h5py.org/en/latest/quick.html

An HDF5 file is a container for two kinds of objects:

- datasets, which are array-like collections of data
- groups, which are folder-like containers that hold datasets and other groups

The most fundamental thing to remember when using h5py is: groups work like dictionaries, and datasets work like NumPy arrays.
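To make the dictionary/array analogy concrete, here is a minimal sketch. The scratch file `demo.h5` and the `price` values are made up for illustration and are not part of the dataset used in this notebook.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file; the name and contents are illustrative only.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")

with h5py.File(path, "w") as f:
    grp = f.create_group("train")
    grp.create_dataset("price", data=np.array([10, 20, 30, 40]))

    # Group side: dictionary-style access.
    group_keys = list(f.keys())
    has_train = "train" in f

    # Dataset side: NumPy-style attributes and slicing.
    dset = f["train"]["price"]
    dset_shape = dset.shape
    first_two = dset[:2]            # slicing reads the data into a NumPy array
    total = int(dset[...].sum())    # dset[...] reads the whole dataset
```

Only the requested slice is read from disk, which is what makes this pattern usable on datasets too large to load at once.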


In [32]:
import h5py
import numpy as np

In [18]:
f = h5py.File('/home/admin-/Download/train.chunk.08')

print(type(f))
print(list(f.keys()))

<class 'h5py._hl.files.File'>
['train']


In [21]:
train_group = f['train']
print(type(train_group))
print(list(train_group.keys()))      # dict.keys()

<class 'h5py._hl.group.Group'>
['bcateid', 'brand', 'dcateid', 'img_feat', 'maker', 'mcateid', 'model', 'pid', 'price', 'product', 'scateid', 'updttm']


In [29]:
dset_pid = train_group['pid']     # dict['key']
print(type(dset_pid))

<class 'h5py._hl.dataset.Dataset'>


In [30]:
print(dset_pid.shape)       # np.array().shape
print(dset_pid.dtype)       # np.array().dtype

(1000000,)
|S12
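The `|S12` dtype means each element is a fixed-length byte string of up to 12 bytes. The sketch below shows one way to read a slice of such a dataset and decode it to `str`. The file name and the three `pid` values are invented for the example, and UTF-8 is an assumed encoding.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file with made-up product ids.
path = os.path.join(tempfile.mkdtemp(), "pid_demo.h5")

with h5py.File(path, "w") as f:
    ids = np.array([b"A0001", b"B0002", b"C0003"], dtype="S12")
    f.create_dataset("pid", data=ids)

    dset = f["pid"]
    is_s12 = dset.dtype == np.dtype("S12")

    # Read only the first two elements; the result is a NumPy array of bytes.
    chunk = dset[:2]

    # Decode to Python strings (UTF-8 is an assumption about the encoding).
    decoded = [value.decode("utf-8") for value in chunk]
```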


In [None]:
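# Assigning an array to a group key creates a new dataset (dict[key] = value).
# This requires the file to be open in a writable mode ('r+' or 'a').
train_group['seq'] = np.arange(dset_pid.shape[0])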

### Appendix: Creating a file

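| Mode | Meaning |
| --- | --- |
| r | Read-only, file must exist (default since h5py 3.0) |
| r+ | Read/write, file must exist |
| w | Create file, truncate if exists |
| w- or x | Create file, fail if exists |
| a | Read/write if exists, create otherwise (default before h5py 3.0) |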
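The modes above can be exercised with a short sketch. The file names are hypothetical, and the `except OSError` clauses rely on h5py reporting these file-level failures as `OSError` or one of its subclasses.

```python
import os
import tempfile

import h5py
import numpy as np

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "modes_demo.h5")  # hypothetical file name

# "w" creates the file (or truncates an existing one).
with h5py.File(path, "w") as f:
    f.create_dataset("x", data=np.arange(3))

# "w-" (alias "x") refuses to overwrite an existing file.
try:
    h5py.File(path, "w-")
    overwrite_blocked = False
except OSError:
    overwrite_blocked = True

# "a" opens the existing file read/write and keeps its contents.
with h5py.File(path, "a") as f:
    f.create_dataset("y", data=np.arange(5))
    keys_after_append = sorted(f.keys())

# "r" opens the file read-only.
with h5py.File(path, "r") as f:
    x_values = f["x"][...].tolist()

# "r+" requires the file to exist already.
missing = os.path.join(workdir, "does_not_exist.h5")
try:
    h5py.File(missing, "r+")
    missing_opened = True
except OSError:
    missing_opened = False
```

Passing the mode explicitly avoids depending on the default, which differs between h5py versions.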

In [13]:
f = h5py.File("mytestfile.hdf5", mode="w")

In [None]:
dset = f.create_dataset("mydataset", (100,), dtype='i')
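Once a dataset exists, it can be filled and read back with NumPy-style slicing. The following is a small sketch of that round trip. It uses a temporary file rather than `mytestfile.hdf5` so that it can be run in isolation.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch location so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "mytestfile_demo.hdf5")

with h5py.File(path, "w") as f:
    # Same call as above: 100 integers, no data supplied yet.
    dset = f.create_dataset("mydataset", (100,), dtype="i")

    # With no data supplied, reads return the fill value (0 by default).
    initial_sum = int(dset[...].sum())

    # Write the whole dataset, then read back a strided slice.
    dset[...] = np.arange(100)
    every_tenth = dset[0:100:10]
    last_value = int(dset[-1])
```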

## Groups and hierarchical organization

“HDF” stands for “Hierarchical Data Format”. Every object in an HDF5 file has a name, and the objects are arranged in a POSIX-style hierarchy with `/` separators:



In [38]:
train_group.name

'/train'

The “folders” in this system are called groups. The File object is itself a group, in this case the root group, named `/`:

In [41]:
f.name

'/'

Subgroups are created with the aptly named `create_group`. The file must be writable, so open it in append mode first (`'a'`: read/write if the file exists, create it otherwise):

In [40]:
f = h5py.File('mydataset.hdf5', 'a')
grp = f.create_group("subgroup")

In [42]:
dset2 = grp.create_dataset("another_dataset", (50,), dtype='f')
dset2.name

'/subgroup/another_dataset'
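Because names form a path hierarchy, an object can be reached either step by step or through its full path. The sketch below illustrates this on a temporary file (the file name is hypothetical) and uses the `parent` attribute to walk back up.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file mirroring the layout built above.
path = os.path.join(tempfile.mkdtemp(), "hierarchy_demo.hdf5")

with h5py.File(path, "w") as f:
    grp = f.create_group("subgroup")
    dset2 = grp.create_dataset("another_dataset", (50,), dtype="f")

    # Step-by-step lookup and full-path lookup reach the same object.
    via_steps = f["subgroup"]["another_dataset"].name
    via_path = f["/subgroup/another_dataset"].name

    # Every object knows the group that contains it.
    parent_name = dset2.parent.name
    root_name = f.name
```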

In [44]:
dset3 = f.create_dataset('subgroup2/dataset_three', (10,), dtype='i')

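RuntimeError: Unable to create link (name already exists)

Passing a full path to `create_dataset` creates any missing intermediate groups (here, `subgroup2`) automatically. The `RuntimeError` above is what a repeated run of the cell produces, presumably because `/subgroup2/dataset_three` already exists from the first run; `dset3` from that first run is still valid, as the next cell shows.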

In [45]:
dset3.name

'/subgroup2/dataset_three'

In [48]:
list(f.keys())

['subgroup', 'subgroup2']
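A cell that creates a dataset fails with "name already exists" when it is run twice. One way to make such a cell safe to re-run is a membership test or `require_dataset`, which returns the existing dataset when the shape and dtype are compatible. The sketch below also uses `visit` to list the full hierarchy rather than only the top-level keys. The file name is hypothetical.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file mirroring the layout built above.
path = os.path.join(tempfile.mkdtemp(), "rerun_demo.hdf5")

with h5py.File(path, "a") as f:
    f.create_group("subgroup").create_dataset("another_dataset", (50,), dtype="f")

    # The full path creates the intermediate group "subgroup2" automatically.
    f.create_dataset("subgroup2/dataset_three", (10,), dtype="i")

    # Membership tests accept paths, just like lookups do.
    already_there = "subgroup2/dataset_three" in f

    # require_dataset opens the dataset if it exists, or creates it otherwise,
    # so this line can be executed repeatedly without a "name already exists" error.
    dset3 = f.require_dataset("subgroup2/dataset_three", shape=(10,), dtype="i")
    dset3_name = dset3.name

    # keys() lists only direct members; visit() walks the whole tree.
    top_level = sorted(f.keys())
    all_names = []
    f.visit(all_names.append)
```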