# Tutorial

In [1]:
import os
import sys
import logging
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [2]:
import hurraypy as hurray
import numpy as np

In [3]:
hurray.__version__

'0.0.3'

First, get the `hurraypy` logger. To print all log messages to the console, uncomment the handler setup below:

In [4]:
logger = logging.getLogger('hurraypy')

# console = logging.StreamHandler()
# console.setLevel(logging.DEBUG)
# console.setFormatter(logging.Formatter('%(levelname)s --- %(message)s'))
# logger.addHandler(console)
# logger.setLevel(logging.DEBUG)

In [5]:
logger.handlers

[<logging.NullHandler at 0x7fc7045df780>]

In [6]:
hurray.log.log.debug("bla")
hurray.log.log.info("bla")
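
The commented-out setup above can be tried without a server. The following is a minimal sketch using only the standard `logging` module; the logger name `hurraypy.demo`, the helper `attach_console_handler` and the in-memory buffer are illustrative assumptions, not part of hurraypy. Note that `logging.StreamHandler()` without arguments writes to stderr, so the sketch passes `sys.stdout` explicitly when no other stream is given.

```python
import io
import logging
import sys


def attach_console_handler(logger_name, stream=None, level=logging.DEBUG):
    """Attach a stream handler to the named logger and return the handler."""
    logger = logging.getLogger(logger_name)
    # Default to stdout; StreamHandler() alone would use stderr.
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setLevel(level)
    handler.setFormatter(logging.Formatter('%(levelname)s --- %(message)s'))
    logger.addHandler(handler)
    logger.setLevel(level)
    return handler


# Hypothetical logger name, used so the real 'hurraypy' logger is left alone.
buffer = io.StringIO()
attach_console_handler('hurraypy.demo', stream=buffer)
logging.getLogger('hurraypy.demo').debug("bla")
captured = buffer.getvalue()
```

Writing to a buffer here only makes the result easy to inspect; in a notebook you would normally omit `stream` so that messages appear below the cell.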

## Connecting to a hurray server

Make sure you have a Hurray server running at `localhost:2222`:

```
$ hurray --host=localhost --port=2222 --logging=debug --debug=1
[I 170619 11:16:50 __main__:180] Listening on localhost:2222
[I 170619 11:16:50 process:132] Starting 8 processes
```

In [7]:
conn = hurray.connect('localhost', '2222')
print(conn)

<Connection (host=localhost, port=2222)>


## Working with files

Let's create a file `test.h5` (`overwrite=True` replaces the file if it already exists):

In [8]:
f = conn.create_file("test.h5", overwrite=True)

Note that Hurray objects (files, datasets, groups) display nicely in Jupyter notebooks.

In [9]:
f

Existing files can be opened like this, either directly or with a context manager:

In [10]:
f = conn.File("test.h5")
print(f)

with conn.File("test.h5") as f:
    print(f)

<File (db=test.h5, path=/)>
<File (db=test.h5, path=/)>


Deleting and renaming files is also possible:

In [11]:
f.delete()

Note that the object referenced by `f` becomes unusable after deleting the file.

Let's create another file and rename it to `test.h5`:

In [12]:
f2 = conn.create_file("test2.h5")

In [13]:
f2

In [14]:
f = f2.rename("test.h5")

In [15]:
f

Note that ``rename()`` is not "in place". We must (re-)assign its return value.
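
The same "returns a new object" pattern exists in the standard library, which may help build intuition. The sketch below uses `pathlib.Path.rename` on a local temporary file as an analogy only; it does not talk to a Hurray server, and the assumption that Hurray's `rename()` behaves comparably is based solely on the note above.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "test2.h5"
    old.write_bytes(b"")                      # create an empty placeholder file

    # rename() returns a new Path; 'old' keeps pointing at the old name.
    new = old.rename(Path(tmp) / "test.h5")

    old_exists = old.exists()
    new_exists = new.exists()
    new_name = new.name
```

After the rename, `old` still refers to a name that no longer exists, which is why the return value has to be kept.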

## Working with datasets

A file can contain two kinds of objects: *groups* and *datasets*. Essentially, groups work like Python dictionaries and datasets work like NumPy arrays.

Every group and dataset has a **name**. First, let's try to create a dataset:

In [16]:
f.create_dataset("mydata")

ValueError: Either 'data' or 'shape' must be specified

That didn't work. We must specify the dataset either by passing a NumPy array or by passing a shape and a datatype:

In [17]:
dst = f.create_dataset("mydata", shape=(400, 300), dtype=np.float64)

In [18]:
dst

A dataset has a ``shape`` and a ``dtype``, just like NumPy arrays:

In [19]:
dst.shape, dst.dtype

((400, 300), 'float64')

It also has a ``path``, which is the *name* of the dataset, prefixed by the names of containing groups. Our dataset is not contained in a group. It therefore appears under the root node `/` (actually, it **is** in a group: the file itself is the root group).

In [20]:
dst.path

'/mydata'
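
To make the path rule concrete, here is a small sketch that builds such paths with the standard `posixpath` module. The helper `object_path` is hypothetical, and the nested names are the ones used in the groups section of this tutorial.

```python
import posixpath


def object_path(*names):
    """Join group and dataset names into an absolute POSIX-style path."""
    return posixpath.join("/", *names)


top_level = object_path("mydata")
nested = object_path("mygroup", "subgroup", "randomdata")

# The last component is the object's name, the rest is its parent group.
parent, name = posixpath.split(nested)
```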

Let's check what data our dataset contains. NumPy-style indexing lets you read from and write to a dataset. A `[:]` index reads the whole dataset into memory. Apparently, our dataset has been initialized with zeros:

In [21]:
dst[:]

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

Let's overwrite this dataset with increasing floating point numbers:

In [22]:
arr = np.linspace(0, 1, num=dst.shape[0] * dst.shape[1]).reshape(dst.shape)
arr.shape == dst.shape

True

In [23]:
dst[:] = arr

In [24]:
dst[:]

array([[  0.00000000e+00,   8.33340278e-06,   1.66668056e-05, ...,
          2.47502063e-03,   2.48335403e-03,   2.49168743e-03],
       [  2.50002083e-03,   2.50835424e-03,   2.51668764e-03, ...,
          4.97504146e-03,   4.98337486e-03,   4.99170826e-03],
       [  5.00004167e-03,   5.00837507e-03,   5.01670847e-03, ...,
          7.47506229e-03,   7.48339569e-03,   7.49172910e-03],
       ..., 
       [  9.92508271e-01,   9.92516604e-01,   9.92524938e-01, ...,
          9.94983292e-01,   9.94991625e-01,   9.94999958e-01],
       [  9.95008292e-01,   9.95016625e-01,   9.95024959e-01, ...,
          9.97483312e-01,   9.97491646e-01,   9.97499979e-01],
       [  9.97508313e-01,   9.97516646e-01,   9.97524979e-01, ...,
          9.99983333e-01,   9.99991667e-01,   1.00000000e+00]])
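
The values above follow directly from how `np.linspace` spaces its points. The sketch below rebuilds the array locally (no server involved) and checks the spacing: with 120000 points between 0 and 1 inclusive, consecutive values differ by 1/119999, and each new row starts 300 steps after the previous one.

```python
import numpy as np

shape = (400, 300)
n = shape[0] * shape[1]            # 120000 values in total
arr = np.linspace(0, 1, num=n).reshape(shape)

step = 1 / (n - 1)                 # spacing between consecutive values
second_value = arr[0, 1]           # one step after 0
row_start = arr[1, 0]              # first value of the second row, 300 steps in
```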

Creating the dataset has increased the file size:

In [25]:
f
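
As a rough guide to how much data the dataset holds, the raw payload can be computed from its shape and dtype. This is only a lower-bound sketch: the actual file size reported by the server also depends on file-format overhead, which is not modeled here.

```python
import numpy as np

shape, dtype = (400, 300), np.float64
itemsize = np.dtype(dtype).itemsize            # bytes per element
raw_bytes = shape[0] * shape[1] * itemsize     # payload without file overhead

# Cross-check against an in-memory array of the same shape and dtype.
in_memory_bytes = np.zeros(shape, dtype=dtype).nbytes
```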

Slicing lets you read or write only a portion of a dataset. In the following example, only columns `50` to `54` of rows `10` and `11` are sent over the wire:

In [26]:
dst[10:12, 50:55]

array([[ 0.02541688,  0.02542521,  0.02543355,  0.02544188,  0.02545021],
       [ 0.0279169 ,  0.02792523,  0.02793357,  0.0279419 ,  0.02795023]])
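
Slices are half-open, as in NumPy: the stop index is excluded. The following sketch reproduces the same selection on a local NumPy array built like `arr` above, so it can be run without a server.

```python
import numpy as np

arr = np.linspace(0, 1, num=400 * 300).reshape(400, 300)
block = arr[10:12, 50:55]

# The row and column indices that the slices actually cover.
rows = list(range(10, 12))
cols = list(range(50, 55))
```

The result has shape `(2, 5)`: two rows and five columns.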

We can also overwrite the above cells using the same notation:

In [27]:
dst[10:12, 50:55] = 999
dst[9:13, 50:55]

array([[  2.29168576e-02,   2.29251910e-02,   2.29335244e-02,
          2.29418578e-02,   2.29501913e-02],
       [  9.99000000e+02,   9.99000000e+02,   9.99000000e+02,
          9.99000000e+02,   9.99000000e+02],
       [  9.99000000e+02,   9.99000000e+02,   9.99000000e+02,
          9.99000000e+02,   9.99000000e+02],
       [  3.04169201e-02,   3.04252535e-02,   3.04335869e-02,
          3.04419203e-02,   3.04502538e-02]])
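
Assigning a single number to a slice sets every selected cell to that value and leaves the neighbouring rows untouched. Here is the same operation on a local NumPy array, as a sketch of the semantics the dataset mirrors:

```python
import numpy as np

arr = np.linspace(0, 1, num=400 * 300).reshape(400, 300)
before = arr[9:13, 50:55].copy()   # keep a copy for comparison

arr[10:12, 50:55] = 999            # the scalar is broadcast to all 2 x 5 cells
after = arr[9:13, 50:55]

changed_cells = int(np.count_nonzero(arr == 999))
```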

Require ... TODO

In [28]:
dst = f.require_dataset("mydata", shape=(400, 300), dtype=np.float64, exact=True)

In [29]:
dst[9:13, 50:55]

array([[  2.29168576e-02,   2.29251910e-02,   2.29335244e-02,
          2.29418578e-02,   2.29501913e-02],
       [  9.99000000e+02,   9.99000000e+02,   9.99000000e+02,
          9.99000000e+02,   9.99000000e+02],
       [  9.99000000e+02,   9.99000000e+02,   9.99000000e+02,
          9.99000000e+02,   9.99000000e+02],
       [  3.04169201e-02,   3.04252535e-02,   3.04335869e-02,
          3.04419203e-02,   3.04502538e-02]])

This should result in an error because the dtypes do not match:

In [30]:
f.require_dataset("mydata", shape=(400, 300), dtype=np.int16, exact=True)

MessageError: (204, 'incompatible dtype and/or shape ', '')
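
To illustrate what such a compatibility check might look like, here is a hypothetical sketch modeled on the rules h5py documents for its own `require_dataset`: with `exact=True` the dtype must be identical, otherwise a safe cast is enough. The helper `is_compatible` is made up for this example, and whether the Hurray server applies the same rule for `exact=False` is an assumption.

```python
import numpy as np


def is_compatible(existing_shape, existing_dtype, shape, dtype, exact=False):
    """Hypothetical check: does a request match an existing dataset?"""
    if tuple(existing_shape) != tuple(shape):
        return False
    if exact:
        # Exact mode: the dtypes must be identical.
        return np.dtype(dtype) == np.dtype(existing_dtype)
    # Non-exact mode: accept any dtype that casts safely to the stored one.
    return bool(np.can_cast(np.dtype(dtype), np.dtype(existing_dtype)))


same = is_compatible((400, 300), np.float64, (400, 300), np.float64, exact=True)
int16_exact = is_compatible((400, 300), np.float64, (400, 300), np.int16, exact=True)
int16_loose = is_compatible((400, 300), np.float64, (400, 300), np.int16, exact=False)
wrong_shape = is_compatible((400, 300), np.float64, (300, 400), np.float64)
```

Under these assumed rules, the `np.int16` request fails only in exact mode, which matches the error shown above.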

## Working with groups

Datasets can be organised in groups (and subgroups). A group is like a folder and acts like a Python dictionary. Let's create a group named "mygroup":

In [31]:
f.create_group("mygroup")

Recall that every file object is also a group and therefore acts like a dictionary. Its ``keys()`` now lists our newly created group:

In [32]:
f.keys()

('mydata', 'mygroup')

Let's create a subgroup (note that groups follow POSIX filesystem conventions):

In [33]:
f.create_group("mygroup/subgroup")

In [34]:
subgrp = f["mygroup/subgroup"]
subgrp
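
The dictionary analogy extends to nested paths. The sketch below models groups as plain nested dictionaries and resolves a `/`-separated path step by step; `lookup` and the `root` layout are hypothetical stand-ins, not hurraypy objects.

```python
def lookup(root, path):
    """Resolve a '/'-separated path against nested dictionaries."""
    node = root
    for part in path.strip("/").split("/"):
        if part:                   # skip empty parts, e.g. for the path "/"
            node = node[part]
    return node


# Stand-in for the file layout at this point in the tutorial.
root = {"mydata": "dataset", "mygroup": {"subgroup": {}}}

subgrp = lookup(root, "mygroup/subgroup")
top = lookup(root, "/")
```

Looking up `"mygroup/subgroup"` is therefore equivalent to `root["mygroup"]["subgroup"]`.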

Now let's put a dataset in our subgroup:

In [35]:
data = np.random.random((600, 400))

In [36]:
dst = subgrp.create_dataset("randomdata", data=data)

In [37]:
dst

Every group has a ``tree()`` method that displays its subgroups and datasets as a tree.

In [38]:
f.tree()

If you're not in a notebook or IPython console, ``tree()`` will give you a text-based representation:

In [39]:
print(f.tree())

── /
    ├─ <Dataset (400, 300) float64 (db=/home/rg/hurray_data/test.h5, path=/mydata)>
    └─ mygroup
        └─ subgroup
            └─ <Dataset (600, 400) float64 (db=/home/rg/hurray_data/test.h5, path=/mygroup/subgroup/randomdata)>
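
For readers curious how such a text tree can be produced, here is a minimal sketch that renders nested dictionaries recursively. It is not hurraypy's implementation; `render_tree` and the `layout` dictionary are hypothetical, and the leaf labels are simplified.

```python
def render_tree(node, prefix=""):
    """Return the lines of a text tree for a nested dict; non-dicts are leaves."""
    lines = []
    items = list(node.items())
    for i, (name, child) in enumerate(items):
        last = i == len(items) - 1
        connector = "└─ " if last else "├─ "
        if isinstance(child, dict):
            lines.append(prefix + connector + name)
            # Children of the last entry need no vertical guide line.
            extension = "    " if last else "│   "
            lines.extend(render_tree(child, prefix + extension))
        else:
            lines.append(prefix + connector + f"{name} {child}")
    return lines


# Stand-in for the file layout built in this tutorial.
layout = {
    "mydata": "(400, 300) float64",
    "mygroup": {"subgroup": {"randomdata": "(600, 400) float64"}},
}
tree_lines = ["/"] + render_tree(layout)
text = "\n".join(tree_lines)
```

Calling `print(text)` shows the root `/` followed by one indented line per group or dataset.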


## Attributes

Every group and dataset can be assigned a number of key/value pairs, so-called *attributes*:

In [40]:
dst = f["mygroup/subgroup/randomdata"]
dst.attrs["unit"] = "celsius"
dst.attrs["max_value"] = 50

Objects that have attributes get a red "A":

In [41]:
dst

In [42]:
dst.attrs.keys()

('unit', 'max_value')

In [43]:
dst.attrs["unit"], dst.attrs["max_value"]

('celsius', 50)