## Demo on how to create an HDF5 dataset from raw images

### Contents

1. Prepare your data
2. Define all expected common parameters
3. Call generate_h5(...) to generate HDF5 dataset file
4. Call load_all_data(...) to extract the data as np.array objects into your python env
5. Check on the shape or content of the various np.array
6. Check to visualize a randomly chosen image
7. Sample code for Mini-batch access for bigger dataset
8. References

In [3]:
from random import shuffle
import os
import glob
import scipy
from scipy import ndimage
import numpy as np
import h5py
import matplotlib.pyplot as plt
from data_util import *

%matplotlib inline
plt.rcParams['figure.figsize'] = (5.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

%load_ext autoreload
%autoreload 2

#### 1. Prepare your data
Put all images that share a label into their own folder, and name the folder
after the target class. For example, for a binary classification task that
identifies cats vs. non-cats, put all cat photos into a folder named "cat" and
all non-cat photos into another folder named "non-cat".
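
Before generating the dataset, it can help to confirm that the folder layout matches the description above. The following is a minimal sketch using only the standard library; the helper name `count_images_per_class` and the list of file extensions are assumptions for illustration and are not part of data_util.py.

```python
import os


def count_images_per_class(data_path, extensions=(".jpg", ".jpeg", ".png")):
    """Return {class_folder_name: number_of_image_files} for each subfolder of data_path."""
    counts = {}
    for name in sorted(os.listdir(data_path)):
        folder = os.path.join(data_path, name)
        if not os.path.isdir(folder):
            continue  # skip stray files that sit directly in the root directory
        # Count files whose (case-insensitive) name ends with one of the extensions
        counts[name] = sum(
            1 for f in os.listdir(folder) if f.lower().endswith(extensions)
        )
    return counts
```

Calling `count_images_per_class(data_path)` on your root directory should list one entry per class folder; a missing or empty class is easier to spot here than after the HDF5 file has been written.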

#### 2. Define all expected common parameters
* __data_path__ is the root directory containing one folder of images per class (see above).
* __shuffle_data__ is a boolean that determines whether the data is shuffled before it is saved to the HDF5 file.
* __height, width__ are the target height and width. Your photos will generally have different dimensions, so they are all resized to these fixed values.
* __data_order__ is either 'tf' or 'th' (TensorFlow vs. Theano). Only 'tf' has been tested so far, so stick with that. The only difference I am aware of is that 'tf' uses "channel last" with shape (m, h, w, c), while 'th' uses "channel first" with shape (m, c, h, w).
* __outfile_path__ is the path of the output HDF5 file.
* __labels_to_classes_dictionary__ is a Python dictionary with numeric strings as keys and class names as values, e.g. {"0": "non-cat", "1": "cat"}.
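
To make the two values of __data_order__ concrete, the sketch below converts a batch between the "channel last" and "channel first" layouts with NumPy. The function names are hypothetical helpers written for this illustration; they do not come from data_util.py, which may handle the ordering differently.

```python
import numpy as np


def to_channels_first(batch_tf):
    """Convert a batch from 'tf' order (m, h, w, c) to 'th' order (m, c, h, w)."""
    return np.transpose(batch_tf, (0, 3, 1, 2))


def to_channels_last(batch_th):
    """Convert a batch from 'th' order (m, c, h, w) back to 'tf' order (m, h, w, c)."""
    return np.transpose(batch_th, (0, 2, 3, 1))


# Ten dummy 64x64 RGB images in 'tf' order
batch_tf = np.zeros((10, 64, 64, 3), dtype=np.uint8)
batch_th = to_channels_first(batch_tf)
print(batch_tf.shape, batch_th.shape)
```

Only the axis order changes; the pixel values are untouched, so converting there and back returns the original array.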

In [1]:
shuffle_data = True
data_order = 'tf'
#data_path = 'original'
data_path = 'cropped/train'

height = 64
width = 64

outfile_path = "train_cropped_coin_" + str(height) + "_" + str(width) + ".hdf5"

coin_dictionary = {"0": "UNK", "1": "5c", "2": "10c", "3": "25c", "4": "$1", "5": "$2"}

#### 3. Call generate_h5(...) to generate HDF5 dataset file
generate_h5 takes several arguments; see data_util.py for the details. The most important ones are explained in Part 2 above. The function saves an HDF5 dataset file named __outfile_path__.
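
For orientation, here is a minimal sketch of how a file of this kind could be written with h5py. It is not the code in data_util.py. The dataset names "train_set_x" and "train_set_y" are the ones read back in Part 7, while the function name `write_example_h5`, the dataset name "list_classes", and the (m, 1) label shape are assumptions made for this illustration.

```python
import h5py
import numpy as np


def write_example_h5(path, images, labels, classes):
    """Write images (m, h, w, c), labels (m, 1) and class names to an HDF5 file."""
    with h5py.File(path, "w") as f:
        # Images are stored as 8-bit integers to keep the file small
        f.create_dataset("train_set_x", data=np.asarray(images).astype(np.uint8))
        f.create_dataset("train_set_y", data=np.asarray(labels).astype(np.int64))
        # Class names are stored as fixed-length byte strings (hypothetical dataset name)
        f.create_dataset("list_classes", data=np.array(classes, dtype="S"))
```

The `with` block closes the file even if a write fails, which avoids leaving a half-written file handle open.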

In [4]:
generate_h5(data_path, labels_to_classes_dictionary=coin_dictionary, 
            outfile_path=outfile_path, resize_height=height, resize_width=width, 
            train_dev_test_ratio=[1.0, 0.0, 0.0])

Test data: 100/768


ValueError: 'arr' does not have a suitable array shape for any mode.

#### 4. Call load_all_data(...) to extract the data as np.array objects into your python env

In [None]:
train_set_x_orig, train_set_y_orig, dev_set_x_orig, dev_set_y_orig, test_set_x_orig, test_set_y_orig, classes = load_all_data(outfile_path)

#### 5. Check on the shape or content of the various np.array

In [None]:
print("X_train shape: " + str(train_set_x_orig.shape))
print("Y_train shape: " + str(train_set_y_orig.shape))
print("X_dev shape: " + str(dev_set_x_orig.shape))
print("Y_dev shape: " + str(dev_set_y_orig.shape))
print("X_test shape: " + str(test_set_x_orig.shape))
print("Y_test shape: " + str(test_set_y_orig.shape))

print("Classes are: " + str(classes))

#### 6. Check to visualize a randomly chosen image

In [None]:
# Randomly visualize some X

for i in range(5):
    index = np.random.randint(len(train_set_y_orig))
    plt.figure(i, figsize=(5, 5))
    # look up the class name in the label dictionary defined in Part 2
    plt.title(coin_dictionary[str(train_set_y_orig[index][0])])
    plt.imshow(train_set_x_orig[index])

#### 7. Sample code for Mini-batch access for bigger dataset

In [None]:
from math import ceil

hdf5_file = h5py.File(outfile_path, "r")

batch_size = 32
train_size = hdf5_file["train_set_x"].shape[0]
num_of_classes = len(classes)

# create list of batches
batches_list = list(range(int(ceil(float(train_size) / batch_size))))
shuffle(batches_list)

for n, i in enumerate(batches_list):
    i_s = i * batch_size  # index of the first image in this batch
    i_e = min((i + 1) * batch_size, train_size)  # index one past the last image in this batch
    
    # read batch images
    images = hdf5_file["train_set_x"][i_s:i_e, ...]

    # read labels and convert to one hot encoding
    labels = hdf5_file["train_set_y"][i_s:i_e]
    labels_one_hot = np.zeros((labels.shape[0], num_of_classes))
    # flatten labels so the indexing also works if they are stored with shape (m, 1)
    labels_one_hot[np.arange(labels.shape[0]), labels.ravel()] = 1
    
    print(n + 1, '/', len(batches_list))
    print(labels[0], labels_one_hot[0, :])
    plt.imshow(images[0])
    plt.show()
    
    if n == 5:  # break after 5 batches
        break

hdf5_file.close()
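
The loop above can also be wrapped in a reusable generator. The following is a sketch under stated assumptions: `iterate_minibatches` is a hypothetical helper (not part of data_util.py), labels are assumed to be integer class indices stored with shape (m,) or (m, 1), and `x` and `y` can be either NumPy arrays or open h5py datasets, since both support contiguous slicing.

```python
import numpy as np


def iterate_minibatches(x, y, batch_size, num_classes):
    """Yield (images, one_hot_labels) as contiguous slices, visiting batches in random order."""
    m = x.shape[0]
    num_batches = -(-m // batch_size)  # ceiling division
    for b in np.random.permutation(num_batches):
        start = int(b) * batch_size
        end = min(start + batch_size, m)  # exclusive end index
        images = x[start:end]
        # Flatten labels to shape (k,) so they can be used as column indices
        labels = np.asarray(y[start:end]).reshape(-1).astype(int)
        one_hot = np.zeros((labels.shape[0], num_classes))
        one_hot[np.arange(labels.shape[0]), labels] = 1
        yield images, one_hot
```

Only the order of the batches is shuffled, not the samples within a batch, so each read stays a single contiguous slice of the file. If the data was not shuffled when the file was written, consecutive samples in a batch may all share the same class.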

#### 8. References
* Some notation and format inspired by Deep Learning Specialization from Coursera/deeplearning.ai. 
* http://machinelearninguru.com/deep_learning/data_preparation/hdf5/hdf5.html
* https://docs.scipy.org/doc/