# Storing Data: Files and HDF5

## Files in Python

There are many situations in which you may need to interact with a file on your hard drive:

- Your collaborator emails you raw data. You download the attachment and want
to look at the results.
- You want to email your collaborators some of your data, quickly.
- You need to use external code that takes an input or data file. You may need to
run the program thousands of times, so you automate the generation of input
files from data that you have in-memory in Python.
- An external program that you use writes out one (or more) result files, and you
want to read them and perform further analysis.
- You want to keep an intermediate calculation around for debugging or validation.

The first step is to open the file.

In Python, you save or load data through a special file handle object. The built-in
open() function returns such a file object. Its first argument is the path to the file,
given as a string. Suppose you have a file called data.txt in the current directory.
You could get a handle, f, to this file with the following:

```python
f = open('data.txt')
```

The open() call implicitly performs the following actions:

1. Makes sure that data.txt exists. In the default read mode, a missing file raises
   FileNotFoundError.
2. Creates a new handle to this file.
3. Sets the cursor position (pos) to the start of the file, pos = 0.

The call to open() does not read any part of the file into memory, write anything out
to the file, or close the file. Each of these actions must be done separately and
explicitly through file handle methods.
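
As a rough illustration of those handle methods, the sketch below writes a small scratch file and then calls tell(), read(), and close() explicitly. The temporary directory and the name demo.txt are assumptions made only for this example.

```python
import os
import tempfile

# Create a scratch file so the example does not depend on data.txt existing.
# The directory and the name demo.txt are hypothetical, chosen for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
w = open(path, 'w')
w.write('hello\nworld\n')
w.close()

f = open(path)          # opening the file reads nothing yet
start = f.tell()        # cursor position right after open()
contents = f.read()     # explicitly read the whole file into memory
f.close()               # explicitly release the handle
is_closed = f.closed    # the handle records that it has been closed
```

Nothing is read until read() is called, and the handle stays open until close() is called.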


Suppose that matrix.txt represents a 4×4 matrix of integers. Each line in the file represents
a row in the matrix. Column values are separated by commas. Ideally, we would
be able to read this into Python as a list of lists of integers, since that is the most
Pythonic way to represent a matrix of integers. This is not the most efficient representation
for a matrix—a NumPy array would be better—but it is fairly common to
read in data in a native Python format before continuing to another data structure.
The following snippet of code shows how to read in this matrix. To follow along, first
create a matrix.txt file in the current directory, with one row of the matrix per line
and the values separated by commas. Without this file, the cells below fail with a
FileNotFoundError, as the recorded outputs show.
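
The file contents themselves do not appear in this section. Judging from the matrix printed in the first exercise, they are presumably the four rows used in the sketch below, which creates matrix.txt from Python rather than from a text editor. Treat the values as an inference, not as the original data file.

```python
# Assumed contents of matrix.txt, inferred from the matrix printed in the
# first exercise of this lesson.
rows = [
    '1,4,15,9',
    '0,11,7,3',
    '2,8,12,13',
    '14,5,10,6',
]

# Mode 'w' creates matrix.txt in the current directory, or overwrites it
# if it already exists. No trailing newline is written after the last row.
out = open('matrix.txt', 'w')
out.write('\n'.join(rows))
out.close()
```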

In [1]:
f = open('matrix.txt')

FileNotFoundError: [Errno 2] No such file or directory: 'matrix.txt'

### Exercise: Convert Strings To Ints

1) Determine the functionality of the readlines() method.

2) Show that the lines from a file, opened in this way, are strings. 

3) Write a loop that converts the string versions of the values in the matrix
into integers.

In [None]:
matrix = []
for line in f.readlines():
    row = [int(x) for x in line.split(',')]
    matrix.append(row)
f.close()

In [2]:
matrix

[[1, 4, 15, 9], [0, 11, 7, 3], [2, 8, 12, 13], [14, 5, 10, 6]]
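
For the first two parts of the exercise, a minimal sketch suggests that readlines() returns a list with one string per line, trailing newline included. It uses a hypothetical two-row scratch file (two_rows.txt) rather than matrix.txt so that it runs on its own.

```python
import os
import tempfile

# Hypothetical scratch file with two comma-separated rows.
path = os.path.join(tempfile.mkdtemp(), 'two_rows.txt')
g = open(path, 'w')
g.write('1,4\n0,11\n')
g.close()

g = open(path)
lines = g.readlines()    # a list with one entry per line of the file
g.close()

line_types = [type(line) for line in lines]         # every entry is a str
first_row = [int(x) for x in lines[0].split(',')]   # int() ignores the '\n'
```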

### Exercise: Write to the file
1) Read the docs for the open function to learn about file modes.
2) Open the file in a different mode that allows writing.
3) Add a row of zeros to the top of our matrix and a row of ones to the end.

In [12]:
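f = open('matrix.txt', 'r+')   # 'r+' reads and writes without truncating
orig = f.read()                # read everything; the cursor is now at the end
f.seek(0)                      # move the cursor back to the start
f.write('0,0,0,0\n')           # new first row
f.write(orig.rstrip('\n'))     # old rows, minus any trailing newline
f.write('\n1,1,1,1')           # new last row
f.close()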

FileNotFoundError: [Errno 2] No such file or directory: 'matrix.txt'

In [13]:
matrix

[]

### Exercise: Context Manager (with)

Use a context manager to open the file instead.

In [14]:
matrix = []
with open('matrix.txt') as f:
    for line in f.readlines():
        row = [int(x) for x in line.split(',')]
        matrix.append(row)
matrix

FileNotFoundError: [Errno 2] No such file or directory: 'matrix.txt'
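
One reason to prefer the context manager is that the file is closed even when the body of the with block raises an exception. The sketch below illustrates this with a hypothetical scratch file (bad_matrix.txt) containing a value that cannot be converted to an integer.

```python
import os
import tempfile

# Hypothetical scratch file whose second row contains a non-integer value.
path = os.path.join(tempfile.mkdtemp(), 'bad_matrix.txt')
with open(path, 'w') as out:
    out.write('1,2\nthree,4\n')

handle = None
try:
    with open(path) as fh:
        handle = fh   # keep a reference so we can inspect it afterwards
        for line in fh:
            row = [int(x) for x in line.split(',')]   # fails on 'three'
except ValueError:
    pass

# The with block closed the file on the way out, despite the error.
closed_after_error = handle.closed
```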

## Big Ideas in HDF5

HDF5 (Hierarchical Data Format, version 5) is a binary format for storing data.

This section will address some of the following HDF5 constructs:
- Array
- CArray
- EArray
- VLArray
- Table

## File Manipulations
HDF5 files may be opened from Python via the PyTables interface. To use PyTables,
import the tables package. Just as numpy is commonly abbreviated to np, tables is
commonly abbreviated to tb:

```python
import tables as tb
f = tb.open_file('/path/to/file', 'a')
```

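Like plain-text files opened with open(), HDF5 files are opened in a mode: 'r'
(read-only), 'w' (write, replacing any existing file), 'a' (append, creating the file
if needed), or 'r+' (like 'a', but the file must already exist).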
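
For comparison, here is a short sketch of the plain-text modes using the built-in open(). PyTables accepts the same mode strings ('r', 'w', 'a', and 'r+') with broadly similar meanings. The scratch file name modes.txt is an assumption made for this example.

```python
import os
import tempfile

# Hypothetical scratch file used only to demonstrate the modes.
path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

with open(path, 'w') as fh:     # 'w': create, or truncate an existing file
    fh.write('first\n')
with open(path, 'a') as fh:     # 'a': keep the contents, write at the end
    fh.write('second\n')
with open(path, 'r+') as fh:    # 'r+': read and write; the file must exist
    fh.write('FIRST')           # overwrites in place, starting at position 0
with open(path, 'r') as fh:     # 'r': read-only, the default
    text = fh.read()
```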

In [18]:
# Before we do anything, remove any existing ch10.h5 so that we start from a fresh file.
import os
import numpy as np
if os.path.isfile('ch10.h5'):
    os.remove('ch10.h5')

In [16]:
import tables as tb
f = tb.open_file('ch10.h5', 'a')

In [17]:
f.create_group('/', 'a_group', "My Group")
f.root.a_group

/a_group (Group) 'My Group'
  children := []

In [9]:
# integer array
f.create_array('/a_group', 'arthur_count', [1, 2, 5, 3])

# tables need descriptions
dt = np.dtype([('id', int), ('name', 'S10')])
knights = np.array([(42, 'Lancelot'), (12, 'Bedivere')], dtype=dt)
f.create_table('/', 'knights', dt)
f.root.knights.append(knights)

In [10]:
f.root.a_group.arthur_count[:]

[1, 2, 5, 3]

In [11]:
type(f.root.a_group.arthur_count[:])

list

In [12]:
type(f.root.a_group.arthur_count)

tables.array.Array

In [13]:
f.root.knights[1]

(12, b'Bedivere')

In [14]:
f.root.knights[:1] 

array([(42, b'Lancelot')], 
      dtype=[('id', '<i8'), ('name', 'S10')])

In [15]:
mask = (f.root.knights.cols.id[:] < 28)
f.root.knights[mask]

array([(12, b'Bedivere')], 
      dtype=[('id', '<i8'), ('name', 'S10')])

In [16]:
f.root.knights[([1, 0],)]

array([(12, b'Bedivere'), (42, b'Lancelot')], 
      dtype=[('id', '<i8'), ('name', 'S10')])

In [17]:
# don't forget to close the file
f.close()

## Hierarchy Layout

In [19]:
# clean-up
if os.path.isfile('ch10-1.h5'):
    os.remove('ch10-1.h5')

# open a new file
shape = (10, 10000)
f = tb.open_file('ch10-1.h5', "w")

# create the arrays 
a = f.create_carray(f.root, 'a', tb.Float32Atom(dflt=1.), shape)
b = f.create_carray(f.root, 'b', tb.Float32Atom(dflt=2.), shape)
c = f.create_carray(f.root, 'c', tb.Float32Atom(dflt=3.), shape)

# evaluate the expression, writing the result into c; a and b are found by variable name
expr = tb.Expr("42*a + 28*b + 6")
expr.set_output(c)
expr.eval()

/c (CArray(10, 10000)) ''
  atom := Float32Atom(shape=(), dflt=3.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := (1, 10000)

In [19]:
# close the file
f.close()

## Files and HDF5 Wrap-up
At the end of this lesson you should be comfortable with the following tasks:
    
- Saving and loading plain-text files in Python
- Working with HDF5 tables, arrays, and groups
- Manipulating the hierarchy to enable more efficient data layouts

You should also be familiar with the following concepts:

- HDF5 tables are conceptually the same as NumPy structured arrays.
- All data reading and writing in HDF5 happens per-chunk.
- A contiguous dataset is a dataset with only one chunk.
- Computation can happen per-chunk, so only a small subset of the data ever has
  to be in memory at any given time.
- Going back and forth to and from disk can be very expensive.
- Querying allows you to efficiently read in only part of a table.
- Compressing datasets can speed up reading and writing, even though the processor
  does more work.
- HDF5 and other projects provide tools for inspecting HDF5 files from the command
  line or from a graphical interface.
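
To make the first concept concrete, the sketch below builds a NumPy structured array with the same dtype as the knights table from earlier in the lesson. It involves no HDF5 file at all. It only illustrates the row-and-named-column model that tables share with structured arrays.

```python
import numpy as np

# Same field layout as the knights table: an integer id and a 10-byte name.
dt = np.dtype([('id', int), ('name', 'S10')])
knights = np.array([(42, b'Lancelot'), (12, b'Bedivere')], dtype=dt)

ids = knights['id']                       # one column, as an integer array
low_ids = knights[knights['id'] < 28]     # a boolean mask selects whole rows
names = low_ids['name'].tolist()          # names of the selected rows
```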
