# Storing Data: Files and HDF5

## Files in Python

#### Reading from a text file:

In [1]:
f = open('matrix.txt')
matrix = []
for line in f.readlines():
    row = [int(x) for x in line.split(',')]
    matrix.append(row)
f.close() # Files should always be closed!

In [2]:
matrix

[[1, 4, 15, 9], [0, 11, 7, 3], [2, 8, 12, 13], [14, 5, 10, 6]]

#### Writing to a file:

In [3]:
f = open('matrix_2.txt', 'w')
for row in matrix:
    line = ",".join([ str(x) for x in row ])
    f.write(line + '\n')
f.close()

#### Useful file modes:

 | Mode | Description                                                    |  
 |:----:| :------------------------------------------------------------- |
 | 'r'  | Open a file for **reading** (read only).                       |
 | 'w'  | Open a file for **writing** (current content will be deleted!).|
 | 'a'  | Open a file for **appending** (writing after current content). |
 | '+'  | **Update** (open file for both reading and writing).           |
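
As a minimal sketch of how the modes in the table differ, the cell below writes, appends to, and rewrites a scratch file. The file name `modes_demo.txt` and the temporary directory are placeholders chosen for illustration; they are not part of the matrix example.

```python
import os
import tempfile

# Work in a throwaway directory so no real file is touched (illustrative path).
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

# 'w' creates the file, or empties it if it already exists.
f = open(path, 'w')
f.write('1,2,3\n')
f.close()

# 'a' keeps the current content and writes after it.
f = open(path, 'a')
f.write('4,5,6\n')
f.close()

# 'r' is the default mode and opens the file read-only.
f = open(path, 'r')
after_append = f.read()
f.close()

# Opening with 'w' a second time discards everything written so far.
f = open(path, 'w')
f.write('7,8,9\n')
f.close()

f = open(path)
after_rewrite = f.read()
f.close()

print(repr(after_append))
print(repr(after_rewrite))
```

Comparing the two printed strings shows that `'a'` preserved the first row, while the second `'w'` replaced both rows.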



#### Updating a text file:

In [4]:
cat matrix_2.txt

1,4,15,9
0,11,7,3
2,8,12,13
14,5,10,6


In [5]:
f = open('matrix_2.txt', 'r+')
orig = f.read()
f.seek(0)
f.write('0,0,0,0\n')
f.write(orig)
f.write('\n1,1,1,1')
f.close()

In [6]:
%cat matrix_2.txt

0,0,0,0
1,4,15,9
0,11,7,3
2,8,12,13
14,5,10,6

1,1,1,1

#### The `with` statement (context manager)

In [7]:
matrix = []
with open('matrix.txt') as f:
    for line in f.readlines():
        row = [int(x) for x in line.split(',')]
        matrix.append(row)
matrix

[[1, 4, 15, 9], [0, 11, 7, 3], [2, 8, 12, 13], [14, 5, 10, 6]]

**This will automatically close the file when leaving the `with` block**
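
The file is also closed when an exception interrupts the block. The sketch below assumes a scratch file with one malformed row (the name `bad_matrix.txt` and its contents are made up for illustration) and checks `f.closed` after the parsing error.

```python
import os
import tempfile

# Scratch file whose second row cannot be parsed (illustrative data).
path = os.path.join(tempfile.mkdtemp(), 'bad_matrix.txt')
with open(path, 'w') as f:
    f.write('1,2,3\n4,x,6\n')

rows = []
try:
    with open(path) as f:
        for line in f:
            rows.append([int(x) for x in line.split(',')])
except ValueError:
    # int('x') fails on the second row, so the exception leaves the with block.
    pass

# The context manager has closed the file despite the error.
closed_after_error = f.closed
print(closed_after_error)
print(rows)
```

With a plain `open()` and `close()` pair, the same error would skip the `close()` call unless it were wrapped in `try`/`finally`.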

## HDF5 files (Hierarchical Data Format)


In [8]:
import os
import numpy as np
if os.path.isfile('ch10.h5'):
    os.remove('ch10.h5')

In [9]:
import tables as tb
f = tb.open_file('ch10.h5', 'a')

Create a group named `a_group` on the root node, with the title "My Group":

In [10]:
f.create_group('/', 'a_group', "My Group")
f.root.a_group

/a_group (Group) 'My Group'
  children := []

In PyTables, arrays have a fixed size and must be created with their data.
Tables have variable length, but every row must share the same data type (as in a NumPy structured array).
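
The "same data type" requirement can be seen in NumPy alone. The following sketch uses the same structured dtype as the table description in the next cell; the row values are invented for illustration. It shows that every row is coerced to the declared fields, including the 10-byte limit of `'S10'`.

```python
import numpy as np

# Same description as the knights table: an integer id and a 10-byte string.
dt = np.dtype([('id', int), ('name', 'S10')])

# Every row must fit this description; the row data here is made up.
rows = np.array([(7, 'Galahad the Pure'), (3, 'Robin')], dtype=dt)

field_names = dt.names
name_width = dt['name'].itemsize
print(field_names, name_width)

# Names longer than 10 bytes are silently truncated to fit the field.
print(rows['name'])
```

Choosing the string width up front therefore matters: a value that does not fit the field is cut off rather than rejected.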

In [11]:
# integer array
f.create_array('/a_group', 'arthur_count', [1, 2, 5, 3])

# tables need descriptions
dt = np.dtype([('id', int), ('name', 'S10')])
knights = np.array([(42, 'Lancelot'), (12, 'Bedivere')], dtype=dt)
f.create_table('/', 'knights', dt)
f.root.knights.append(knights)

The hierarchy now looks like:

```
/
|-- a_group/
|   |-- arthur_count
|
|-- knights
```

In [12]:
f.root.a_group.arthur_count[:]

[1, 2, 5, 3]

In [13]:
type(f.root.a_group.arthur_count[:])

list

In [14]:
type(f.root.a_group.arthur_count)

tables.array.Array

In [15]:
f.root.knights[1]

(12, b'Bedivere')

In [16]:
f.root.knights[:1] 

array([(42, b'Lancelot')], 
      dtype=[('id', '<i8'), ('name', 'S10')])

In [17]:
mask = (f.root.knights.cols.id[:] < 28)
f.root.knights[mask]

array([(12, b'Bedivere')], 
      dtype=[('id', '<i8'), ('name', 'S10')])

In [18]:
f.root.knights[([1, 0],)]

array([(12, b'Bedivere'), (42, b'Lancelot')], 
      dtype=[('id', '<i8'), ('name', 'S10')])

In [19]:
# don't forget to close the file
f.close()

## Hierarchy Layout

## In-Core and Out-of-Core operations

### In-Core operations

```python
a = np.array(...)
b = np.array(...)
c = 42 * a + 28 * b + 6
```

is equivalent to:

```python
temp1 = 42 * a
temp2 = 28 * b
temp3 = temp1 + temp2
c = temp3 + 6
```
Each temporary is a full-size array, so this can exhaust memory if the arrays are very large.

Alternatively it could be implemented element-wise as:

```python
c = np.empty(...)
for i in range(len(c)):
    c[i] = 42 * a[i] + 28 * b[i] + 6
```

This version needs less memory, but can be extremely slow if each element has to be read from disk one by one.


### Out-of-Core operations

A better strategy is a hybrid: load reasonably sized chunks of several elements into memory and perform the operations on all elements in a chunk before processing the next chunk.
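
The chunked strategy can be sketched with plain NumPy slicing. The helper name `chunked_eval` and the chunk size are assumptions made for illustration, and the inputs here are small in-memory arrays standing in for data that would normally live on disk.

```python
import numpy as np

def chunked_eval(a, b, chunk_size=4):
    """Compute 42*a + 28*b + 6 one chunk at a time (illustrative helper)."""
    c = np.empty_like(a)
    for start in range(0, len(a), chunk_size):
        stop = start + chunk_size
        # Temporaries created here hold at most chunk_size elements.
        c[start:stop] = 42 * a[start:stop] + 28 * b[start:stop] + 6
    return c

# Small stand-in arrays; real out-of-core inputs would be read from a file.
a = np.ones(10, dtype=np.float32)
b = np.full(10, 2, dtype=np.float32)
c = chunked_eval(a, b)
print(c)
```

Each pass still uses vectorized NumPy operations, so the loop runs once per chunk rather than once per element, while the temporaries never grow beyond the chunk size.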

In Python, the `numexpr` library performs chunked, element-wise computations on NumPy arrays. PyTables offers the `tb.Expr` class, which applies the same approach to arrays stored in an HDF5 file.

In [20]:
# clean-up
if os.path.isfile('ch10-1.h5'):
    os.remove('ch10-1.h5')

# open a new file
shape = (10, 10000)
f = tb.open_file('ch10-1.h5', "w")

# create the arrays 
a = f.create_carray(f.root, 'a', tb.Float32Atom(dflt=1.), shape)
b = f.create_carray(f.root, 'b', tb.Float32Atom(dflt=2.), shape)
c = f.create_carray(f.root, 'c', tb.Float32Atom(dflt=3.), shape)

# evaluate the expression, using the c array as the output
expr = tb.Expr("42*a + 28*b + 6")
expr.set_output(c)
expr.eval()

/c (CArray(10, 10000)) ''
  atom := Float32Atom(shape=(), dflt=3.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := (1, 10000)

In [21]:
# close the file
f.close()