In [1]:
import numpy as np
import h5py
import pandas as pd

## Keras Generator
In our previous example we made a generic generator in Python. However, we aren't interested in using Fibonacci numbers to train ANNs. We need real data, which is usually stored on disk rather than generated on the fly. So next we are going to go through the process of turning a csv file into an hdf5 file and making a generator that pulls chunks from our new hdf5 file.

First, the dataset, which we will read in with pandas (because pandas makes it easy to look at). The input features are settings varied in a plane simulation that alter the magnetometer readings.

In [2]:
features = pd.read_csv('/opt/data/MagData/EE699_IN.txt')
features

Unnamed: 0,cosX,cosY,cosZ,flap angle(rad),rudder angle(rad)
0,-0.364360,-0.23860,0.900170,0.783300,-0.29378
1,-0.516630,-0.56340,0.644730,0.343350,0.59530
2,-0.037998,-0.99223,-0.118450,0.588880,-0.46642
3,0.629490,-0.18223,0.755340,-0.115410,-0.38319
4,-0.500480,0.11948,0.857470,-0.055621,0.52118
...,...,...,...,...,...
9995,-0.139920,0.91155,0.386650,0.226880,0.33034
9996,0.577470,0.73689,-0.351450,-0.191640,0.68059
9997,0.087447,-0.93206,-0.351580,0.181340,-0.50669
9998,0.206470,0.97796,0.030993,-0.669570,0.13727


Our output target data is the magnetic field reading of the magnetometer.

In [3]:
output = pd.read_csv('/opt/data/MagData/EE699_Out.txt')
output

Unnamed: 0,Magnetic Field (nT)
0,-112.69
1,844.88
2,1851.20
3,-1279.80
4,-246.01
...,...
9995,-703.47
9996,-871.67
9997,1929.20
9998,-932.63


Next we would usually split the data into train, valid, and test sets and preprocess it (normalization, etc.). For right now we are only interested in making our generator, so assume we do that here.

In [4]:
# we did some preprocessing
print(features.shape)
print(features.values.nbytes)

(10000, 5)
400000
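For reference, the split and normalization step we skipped could look roughly like the sketch below. This is not what was run on the magnetometer data: the random stand-in arrays, the 80/10/10 split, and all variable names are assumptions made for illustration.

```python
import numpy as np

# Stand-in data with the same shapes as the magnetometer dataset (hypothetical values).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(10000, 5))
y = rng.normal(0.0, 1000.0, size=(10000, 1))

# Shuffle once, then cut into 80% train, 10% valid, 10% test (split sizes are an assumption).
order = rng.permutation(len(X))
n_train = int(0.8 * len(X))
n_valid = int(0.1 * len(X))
train_idx = order[:n_train]
valid_idx = order[n_train:n_train + n_valid]
test_idx = order[n_train + n_valid:]

# Fit the normalization statistics on the training split only,
# then apply the same statistics to every split.
mean = X[train_idx].mean(axis=0)
std = X[train_idx].std(axis=0)
X_train = (X[train_idx] - mean) / std
X_valid = (X[valid_idx] - mean) / std
X_test = (X[test_idx] - mean) / std

print(X_train.shape, X_valid.shape, X_test.shape)
```

Computing the mean and standard deviation from the training split alone keeps information about the validation and test sets out of the preprocessing.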


## Make Dataset as HDF5
Now we will take our data and store it in an `.h5` file, which will make it much easier to work with on disk than the `.txt` files we read it from.

In [5]:
filename = "/opt/data/MagData/EE699_dat.h5"
with h5py.File(filename, 'w') as h5file:
    features_group = h5file.create_dataset(name='magdata/features', data=features)
    output_group = h5file.create_dataset(name='magdata/output', data=output)
    # h5file.create_dataset(name='test', data=np.arange(10))
    features_group.attrs['Info'] = "This is the input features for the magnetometer dataset"
    output_group.attrs['Info'] = "This is the output targets for the magnetometer dataset"
        

`.h5` files are organized into groups, which can have subgroups. You can think of it like having folders with data in them. For example, you can think of our data as sitting in a folder called `magdata` inside the file. You can also add attributes, or metadata, to groups and datasets. In this case I added a short description to each dataset, which is pretty simple. The attributes work like dictionaries, so they can hold many types and not just strings.
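To make the folder analogy and the dictionary-like attributes a bit more concrete, here is a small sketch that writes to a throwaway file. The temporary path, the tiny zero-filled array, and the `n_samples` and `scale` attribute names are made up for this example and are not part of the magnetometer file.

```python
import os
import tempfile

import h5py
import numpy as np

# Write to a throwaway file so the real dataset is untouched (the path is hypothetical).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")

with h5py.File(demo_path, "w") as h5file:
    # Intermediate groups in a path such as 'magdata/features' are created automatically.
    dset = h5file.create_dataset("magdata/features", data=np.zeros((4, 5)))
    # Attributes behave like a dictionary and accept more than strings.
    dset.attrs["Info"] = "input features"
    dset.attrs["n_samples"] = 4
    dset.attrs["scale"] = np.array([1.0, 2.0, 3.0])

with h5py.File(demo_path, "r") as h5file:
    group_names = list(h5file.keys())               # top-level "folders"
    dataset_names = list(h5file["magdata"].keys())  # contents of the magdata group
    attrs = dict(h5file["magdata/features"].attrs)  # copy the metadata out

print(group_names)
print(dataset_names)
print(sorted(attrs))
```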

So now, just for fun, let's get our data back out of the file. First we will delete the old data to prove we are really reading it back from the file.

In [6]:
del features
del output

try:
    print(features)
except NameError:
    print("features does not exist")

features does not exist


## Open our H5 file
So now we will open the file and list its keys, which is like listing the folders/datasets in the root directory. Next we will pull out the features.

In [7]:
with h5py.File(filename, 'r') as h5file:
    print(h5file.keys()) # shows the 'folders' or 'datasets' in the file
    magdata = h5file['magdata']
    features = magdata['features'][...]

<KeysViewHDF5 ['magdata']>


So now let's check that our features are intact.

In [8]:
features

array([[-0.36436 , -0.2386  ,  0.90017 ,  0.7833  , -0.29378 ],
       [-0.51663 , -0.5634  ,  0.64473 ,  0.34335 ,  0.5953  ],
       [-0.037998, -0.99223 , -0.11845 ,  0.58888 , -0.46642 ],
       ...,
       [ 0.087447, -0.93206 , -0.35158 ,  0.18134 , -0.50669 ],
       [ 0.20647 ,  0.97796 ,  0.030993, -0.66957 ,  0.13727 ],
       [ 0.24874 , -0.36039 ,  0.89903 , -0.45666 ,  0.71166 ]])

Surprise, we actually have an `ndarray` now and not a pandas `DataFrame`. When we gave the data to the `create_dataset` function, it converted our pandas `DataFrame` to a NumPy `ndarray`, so the column names are gone. We could have explicitly given an `ndarray`, but I wanted to emphasize the change with a surprise. If you really want to store the `DataFrame` object, pandas has built-in functions to do so (for example `DataFrame.to_hdf` and `pandas.read_hdf`, which rely on the PyTables package).
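One more detail about reading is worth a quick sketch, because the generator depends on it: indexing a dataset with `[...]` reads all of it into memory, while a slice reads only the requested rows. The throwaway file and the small `arange` array below are assumptions for illustration only.

```python
import os
import tempfile

import h5py
import numpy as np

# Throwaway file with a tiny dataset (hypothetical path and values).
demo_path = os.path.join(tempfile.mkdtemp(), "slices.h5")
with h5py.File(demo_path, "w") as h5file:
    h5file.create_dataset("magdata/features", data=np.arange(50.0).reshape(10, 5))

with h5py.File(demo_path, "r") as h5file:
    dset = h5file["magdata/features"]  # a handle to the dataset; no rows read yet
    dset_shape = dset.shape            # shape comes from metadata, not from loading the data
    first_rows = dset[0:3]             # reads only rows 0, 1 and 2 into an ndarray
    everything = dset[...]             # reads the whole dataset into an ndarray

print(dset_shape, first_rows.shape, everything.shape)
```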

## The Actual Generator
Anyway, we didn't come here to store and read data (even though that may be useful for future reference). We actually want a generator, so let's do that next.

In [9]:
def HDF5Generator(hdf5_path,
                  n_batches, 
                  hdf5XGroupName='train_set_x', 
                  hdf5YGroupName="train_set_y",
                  batch_size=100):
    """ Makes a generator function that pulls random batches of data from an hdf5 file

    :param hdf5_path: The path to the hdf5 file
    :param n_batches: number of batches in the file
    :param batch_size: the size of each batch
    :param hdf5XGroupName: the path to the dataset for the features, X, or input data
    :param hdf5YGroupName: the path to the dataset for the output targets

    :return: a generator that yields batches of data from an hdf5 file
    """
    trainIndices = list(range(n_batches))
    with h5py.File(name=hdf5_path, mode='r') as f:
        while True:
            np.random.shuffle(trainIndices)
            for batch_idx in trainIndices:
                startIdx = batch_idx * batch_size
                endIdx = (batch_idx + 1) * batch_size
                trainXbatch = f[hdf5XGroupName][startIdx:endIdx]
                trainYbatch = f[hdf5YGroupName][startIdx:endIdx]
                yield (trainXbatch, trainYbatch)

Whoa, OK, that's a lot of code, so let's break it up part by part. First the inputs.
We can read the nice docstring to determine what all our inputs are doing.

`hdf5_path` is the path to the hdf5 file that we saved earlier

`n_batches` is the number of batches in the file and `batch_size` is the size of each batch. Together they are used to slice the dataset into non-overlapping batches.
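The slicing arithmetic inside the generator is easy to check on its own. The sketch below repeats the same start and end calculation for our 10,000 samples; the helper name `batch_bounds` is made up for this example.

```python
n_samples = 10000
batch_size = 100
n_batches = n_samples // batch_size  # whole batches that fit in the dataset


def batch_bounds(batch_idx, batch_size):
    """Return the (start, end) row indices of one batch, end exclusive."""
    start = batch_idx * batch_size
    end = (batch_idx + 1) * batch_size
    return start, end


bounds = [batch_bounds(i, batch_size) for i in range(n_batches)]
print(n_batches, bounds[0], bounds[-1])
# prints: 100 (0, 100) (9900, 10000)
```

Each batch ends exactly where the next one starts, so no sample appears in two batches. If `n_samples` were not a multiple of `batch_size`, the leftover rows at the end would never be visited with this scheme.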

Actually, let's just call this and see what happens.

In [10]:
gen = HDF5Generator(filename, 
                    n_batches=100, 
                    hdf5XGroupName='magdata/features', 
                    hdf5YGroupName='magdata/output', 
                    batch_size=100)
gen

<generator object HDF5Generator at 0x7f2ccf02c120>

There we go, we have a generator called `gen`. What, you don't believe me? You say the output just says generator object, which is too generic to mean anything? Fine, let's call it.

In [11]:
(batch_input, batch_output) = next(gen)
batch_input[:10,:], batch_output[:10,:]

(array([[ 0.25252  , -0.93161  , -0.26142  ,  0.39357  , -0.43605  ],
        [-0.93147  , -0.35422  ,  0.082949 , -0.39371  , -0.68304  ],
        [-0.65945  , -0.421    ,  0.62281  , -0.16878  ,  0.23097  ],
        [-0.90437  ,  0.41962  , -0.077699 ,  0.1057   , -0.70672  ],
        [-0.30324  , -0.78729  ,  0.53687  ,  0.74315  ,  0.23942  ],
        [ 0.60058  ,  0.68253  , -0.41648  ,  0.6709   ,  0.22038  ],
        [ 0.31358  , -0.62331  ,  0.71635  ,  0.46601  ,  0.23368  ],
        [ 0.077513 , -0.99009  , -0.11708  , -0.46655  ,  0.079592 ],
        [ 0.8654   , -0.22319  ,  0.44863  ,  0.073477 , -0.46992  ],
        [-0.0084025, -0.83156  ,  0.55538  , -0.4282   ,  0.45822  ]]),
 array([[ 1566.6 ],
        [ 1991.2 ],
        [  896.24],
        [ 1222.8 ],
        [  968.02],
        [ -776.7 ],
        [ -337.93],
        [ 1789.2 ],
        [-1051.3 ],
        [  581.32]]))

In [12]:
print(batch_input.shape)
print(batch_output.shape)

(100, 5)
(100, 1)


See, I actually got a batch with 100 samples. What, you don't believe that I can do it again? You say the file is closed and another call will crash it? OMG, well, here we go again.

In [13]:
(batch_input2, batch_output2) = next(gen)
batch_input2[:10,:], batch_output2[:10,:]

(array([[ 0.22654  , -0.80547  ,  0.54763  ,  0.30697  ,  0.25748  ],
        [-0.3039   ,  0.29177  ,  0.90693  ,  0.5679   ,  0.11319  ],
        [ 0.24899  , -0.55194  ,  0.79584  ,  0.079616 , -0.56072  ],
        [-0.5547   ,  0.54985  ,  0.62447  ,  0.66148  , -0.66118  ],
        [-0.85404  , -0.46113  , -0.24079  , -0.68072  ,  0.0064763],
        [ 0.5899   , -0.79136  , -0.16053  ,  0.72774  ,  0.045258 ],
        [-0.17494  ,  0.73115  ,  0.6594   ,  0.57312  ,  0.39844  ],
        [ 0.54991  , -0.58115  ,  0.59989  , -0.70943  ,  0.37904  ],
        [-0.17393  , -0.90106  ,  0.3973   ,  0.25027  , -0.50694  ],
        [-0.13397  , -0.94666  , -0.29307  ,  0.63367  , -0.31338  ]]),
 array([[ 270.11],
        [-646.01],
        [-458.53],
        [-105.67],
        [2492.7 ],
        [ 734.65],
        [-790.1 ],
        [-583.42],
        [1191.  ],
        [2123.6 ]]))

See, the file is still open in the generator and I can make consecutive calls to it. Also check out that I have different data and I didn't just reprint the same batch.

In [14]:
np.allclose(batch_input, batch_input2)

False

In [15]:
np.allclose(batch_output, batch_output2)

False
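Why does the file stay open between calls? The generator is paused at its `yield`, which sits inside the `with` block, so the block has not exited yet. The sketch below uses a simplified generator without shuffling to show this, along with how `gen.close()` ends the generator and lets the `with` block close the file. The `handles` list is a hypothetical hook added only so we can inspect the file object from outside, and the throwaway file is made up for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Throwaway file with a tiny dataset (hypothetical path and values).
demo_path = os.path.join(tempfile.mkdtemp(), "lifetime.h5")
with h5py.File(demo_path, "w") as h5file:
    h5file.create_dataset("x", data=np.arange(20.0).reshape(10, 2))


def batch_generator(path, dataset_name, batch_size, handles):
    """Yield consecutive batches forever; `handles` exists only for inspection."""
    with h5py.File(path, "r") as f:
        handles.append(f)  # stash the file object so we can look at it from outside
        n_batches = f[dataset_name].shape[0] // batch_size
        while True:
            for batch_idx in range(n_batches):
                start = batch_idx * batch_size
                yield f[dataset_name][start:start + batch_size]


handles = []
gen = batch_generator(demo_path, "x", batch_size=5, handles=handles)

first = next(gen)                        # runs up to the first yield, inside the with block
open_while_suspended = bool(handles[0])  # an open h5py.File is truthy

gen.close()                              # raises GeneratorExit at the yield; the with block exits
open_after_close = bool(handles[0])      # a closed h5py.File is falsy

print(open_while_suspended, open_after_close)
```

Calling `gen.close()` when training is finished is a tidy way to release the file instead of waiting for the generator to be garbage collected.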

So to reiterate, this may be a little over the top for a dataset with only 10,000 samples whose features take up under half a megabyte in RAM. However, we might imagine a very large dataset, say 100GB, that we could not fit in RAM at once (especially on a cheap laptop) and would need to process one batch at a time.
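As a small illustration of processing one batch at a time, the sketch below computes the per-column mean of a dataset without ever holding more than one batch in memory. The file, its random contents, and the sizes are stand-ins; a real out-of-core dataset would just be much bigger.

```python
import os
import tempfile

import h5py
import numpy as np

# Stand-in for a large file (hypothetical path and random values).
demo_path = os.path.join(tempfile.mkdtemp(), "big.h5")
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 5))
with h5py.File(demo_path, "w") as h5file:
    h5file.create_dataset("magdata/features", data=data)

batch_size = 100
total = np.zeros(5)  # running sum of each column
count = 0            # number of rows seen so far

with h5py.File(demo_path, "r") as h5file:
    dset = h5file["magdata/features"]
    for start in range(0, dset.shape[0], batch_size):
        batch = dset[start:start + batch_size]  # only this slice is read from disk
        total += batch.sum(axis=0)
        count += batch.shape[0]

streamed_mean = total / count
print(streamed_mean.shape, count)
```

The same pattern works for any statistic that can be accumulated batch by batch, such as the mean used for normalization.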