# Gap - Next Generation

This notebook begins the redesign of the Gap Computer Vision module so that it meets the 'bar' for a widely (globally) accepted open source framework for data engineering in computer vision.

## Technical Expectations

This section covers the anticipated "technical" expectations:

1. Continue with OOP style developer API for the targeted audience.
2. Maintain high-speed in-memory performance, as close as possible to that of hard-coded processing.
3. Reuse of machine learning ready data.

## Design Direction

Gap Next Generation will be designed from the bottom up (starting at storage and progressing up to the developer API), using the lessons learned from the top-down design of the first Gap. The design proceeds in this order:

1. On-Disk Storage
2. Writing to and Retrieving from Storage
3. Processing Images
4. Transformations / Augmentation
5. Feeding Neural Networks
6. Collection Management
7. Inspection / Debugging

## On-Disk Storage

The HDF5 file format will continue to be the basis for on-disk storage. Storage will consist of:
1. Preprocessed images are stored as numpy arrays in HDF5 datasets.
2. Image collections of the same classification are stored as an HDF5 group.
3. Metadata is stored as HDF5 attributes.

In [1]:
# Import HDF5 library
import h5py

### Matrix Storage

Preprocessed machine learning ready data for images are essentially multi-layer matrices. Storage of matrices is optimized in HDF5 for Python by storing them as numpy arrays.

Numpy arrays store their bytes in contiguous memory, and the Python numpy library is performance-optimized because it is written in C with Python bindings (for CPython).
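
As a quick illustration of the memory layout described above, the following sketch inspects a numpy array's contiguity flag and byte size. The array is synthetic and its dimensions (4 images of 50x50 pixels with 3 channels) are assumptions chosen only for the example.

```python
import numpy as np

# Synthetic stand-in for a batch of 4 RGB images of 50x50 pixels (assumed sizes).
images = np.zeros((4, 50, 50, 3), dtype=np.uint8)

# A freshly allocated array is one contiguous, row-major (C order) block of bytes.
is_contiguous = images.flags['C_CONTIGUOUS']

# Total bytes = number of elements * bytes per element.
total_bytes = images.nbytes

# Slicing with a step yields a view that is no longer contiguous;
# np.ascontiguousarray copies it back into a single block.
strided = images[:, ::2]
repacked = np.ascontiguousarray(strided)
```

Checking `flags['C_CONTIGUOUS']` before writing is one way to confirm that an array is still a single block of memory after slicing or transposing.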

To maximize efficiency, the following should be considered:

1. Storing individual images either as separate datasets or as one contiguous dataset. In the latter, each image (matrix) has to be the same size. Images (matrices) of different sizes are *not* supported.


2. The datatype of the storage should be indicative of the level of preprocessing:

        A. uint8 - raw 8-bit pixel data  
        B. uint16 - raw 16-bit pixel data  
        C. float16 - normalized pixel data for half-float hardware emulation.  
        D. float32 - normalized pixel data as standard floating point.  
        E. emulated uint12
    
*Note* - Some microscopy systems use 12-bit pixels. Numpy does not natively support a 12-bit type, so this would need to be emulated. A quick Google search on the topic turned up blogs addressing how to emulate 12-bit data in numpy.
    
    
3. The matrix shape should be indicative of the image data type and channels. Each image has to have the same shape and number of channels. Images (matrices) of different shapes are *not* supported.


       A. 1D - flattened  
       B. 2D - grayscale image  
       C. 3D - color image (RGB, RGBA, CMYK)
   
For a flattened image, the metadata would need to be accessed to know its unflattened shape. For multi-channel data, the metadata will need to be accessed to know the channel types.
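
The datatype and shape considerations above can be sketched with numpy alone. This is a minimal illustration on synthetic data, not the module's implementation: the image sizes, the `metadata` dictionary, and the choice of holding 12-bit values in a `uint16` container are all assumptions made for the example.

```python
import numpy as np

# Synthetic 8-bit RGB image (assumed 50x50); real data would come from disk.
raw = np.random.randint(0, 256, size=(50, 50, 3), dtype=np.uint8)

# Normalized pixel data: scale 0..255 into 0.0..1.0 as float32.
normalized = raw.astype(np.float32) / 255.0

# Half precision halves the storage at the cost of precision.
half = normalized.astype(np.float16)

# 12-bit pixels emulated by holding values 0..4095 in a uint16 container.
# (One possible emulation; it does not pack two pixels into three bytes.)
raw12 = np.random.randint(0, 4096, size=(50, 50), dtype=np.uint16)
normalized12 = raw12.astype(np.float32) / 4095.0

# Flattened (1D) storage loses the shape, so the shape must travel as metadata.
metadata = {'shape': raw.shape}  # hypothetical metadata record
flat = raw.reshape(-1)
restored = flat.reshape(metadata['shape'])
```

Dividing by 4095 rather than 65535 is what distinguishes the emulated 12-bit case from true 16-bit data, which is why the level of preprocessing and the original bit depth need to be recoverable from the stored datatype or metadata.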

### HDF5 Layout

The following is the HDF5 file layout for image collections that have a single classification per image.

```
hdf5 file  
        group  
            dataset - images have the same classification  
                attributes - image specific attributes
            attributes - attributes that apply to all images in the group
        group  
            dataset  
                attributes
            attributes  
        ...  

        attributes - global attributes that apply to all images (e.g., shape)
```
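
The layout above can be sketched with h5py as follows. This is an illustration under stated assumptions: the file is held in memory only, the labels and the dataset name `images` are hypothetical, and the image data is synthetic.

```python
import h5py
import numpy as np

# In-memory HDF5 file (nothing is written to disk); the file name is a placeholder.
with h5py.File('layout_sketch.h5', 'w', driver='core', backing_store=False) as hf:
    # Global attribute that applies to every image in the file.
    hf.attrs['shape'] = (50, 50, 3)

    for label in ('daisy', 'rose'):  # hypothetical labels
        grp = hf.create_group(label)
        grp.attrs['label'] = label  # group-level attribute
        data = np.zeros((2, 50, 50, 3), dtype=np.uint8)  # synthetic images
        dset = grp.create_dataset('images', data=data)
        dset.attrs['count'] = data.shape[0]  # dataset-level attribute

    # Walk the hierarchy and record each object's path and kind.
    layout = []

    def record(name, obj):
        kind = 'group' if isinstance(obj, h5py.Group) else 'dataset'
        layout.append((name, kind))

    hf.visititems(record)
```

After the walk, `layout` holds one `(path, kind)` pair per group and dataset, which is a convenient starting point for the inspection and debugging step of the design.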

*Question* How should multi-classification images be represented? For example, consider an image with several objects. We need to identify all the classes and the bounding box (segmentation) where each object (class) is located.

In [1]:
# Open an empty HDF5 file

hf = h5py.File('gap.h5','w')


#### Global Attributes

Metadata that applies to the entire dataset is stored as global (top-level) attributes in the HDF5 file. These include:

        1. Collection Name
        2. Author
        3. Source
        4. Description
        5. Date Created (ISO 8601 format)
        6. Count (i.e., Number of Images)
        7. Shape (i.e., Unflattened Shape of Images)
        8. Channels (e.g., 1 = R, 2 = G, 3 = B).
        
##### Key Name Rules

- The keys for the metadata are fully spelled out (i.e., not abbreviated).
- In keys that combine multiple words, the words are separated by commas.
- Keys that would be plural are spelled in the singular (e.g., `channel` rather than `channels`).

In [3]:
# Example of setting global attributes by hand
hf.attrs['name'] = 'Flowers'
hf.attrs['author'] = 'ABC Company'
hf.attrs['source'] = 'http://....'
hf.attrs['description'] = 'Images for classifying types of flowers'
hf.attrs['date'] = '2018-10-01'
hf.attrs['count'] = 50000
hf.attrs['shape'] = (100, 100, 3)
hf.attrs['channel'] = str([ 'C', 'M', 'Y', 'K'])
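
Because the channel list is stored as a string and the shape comes back as a numpy array, reading the attributes back takes a small conversion step. The sketch below round-trips three of the attributes through an in-memory file; the file name is a placeholder, and parsing with `ast.literal_eval` is one possible approach rather than a decided part of the design.

```python
import ast
import datetime

import h5py

# In-memory HDF5 file (nothing is written to disk); the file name is a placeholder.
with h5py.File('attrs_sketch.h5', 'w', driver='core', backing_store=False) as hf:
    # ISO 8601 date string, produced from a date object instead of typed by hand.
    hf.attrs['date'] = datetime.date(2018, 10, 1).isoformat()
    hf.attrs['shape'] = (100, 100, 3)
    hf.attrs['channel'] = str(['C', 'M', 'Y', 'K'])

    # Parse the ISO 8601 string back into a date object.
    date = datetime.date.fromisoformat(hf.attrs['date'])

    # The shape is returned as a numpy array; convert it to a tuple of ints.
    shape = tuple(int(n) for n in hf.attrs['shape'])

    # The channel list was stored as its string form; parse it back safely.
    channels = ast.literal_eval(hf.attrs['channel'])
```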

#### Group

A group is a collection of images which share the same classification (e.g., daisy).

In [4]:
# Example of creating groups by hand
grp_daisy     = hf.create_group("daisy")
grp_sunflower = hf.create_group("sunflower")
grp_rose      = hf.create_group("rose")

#### Group Attributes

Metadata that applies to an entire group of images that share the same classification (label) is stored in the corresponding group's attributes. These include:

        1. Label (classification)
        2. Number of images

In [5]:
# Example of setting attributes for a group by hand
grp_daisy.attrs['label'] = 'daisy'
grp_daisy.attrs['count'] = 663

#### Group Dataset

The machine learning ready data for an entire group of images that share the same classification (label) is stored in the corresponding group's dataset.

In [6]:
# import numpy for C-like in-memory arrays
import numpy as np

# import openCV for image processing
import cv2

In [7]:
import os

# Create a collection of daisy images
subdir = 'flower_photos/daisy'
files = os.listdir(subdir)
collection = []
for file in files:
    # for each image in the subfolder, read in using openCV
    # openCV will decompress the image into a raw bitmap of pixels (BGR channel order)
    image = cv2.imread(os.path.join(subdir, file))
    if image is None:
        # skip files that openCV cannot decode (e.g., non-image files)
        continue
    # resize so every image has the same shape, as a single dataset requires
    image = cv2.resize(image, (50, 50), interpolation=cv2.INTER_AREA)
    # append each in-memory image into a list
    collection.append(image)

# once the collection is assembled, convert the list to a multi-dimensional numpy array
daises = np.asarray(collection)

In [8]:
dset = grp_daisy.create_dataset("daisy", data=daises)
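
To check that a group dataset was stored as intended, it can be read back by path. The following sketch uses a synthetic array in place of the daisy images and an in-memory file, so the file name, image count, and pixel values are assumptions for illustration only.

```python
import h5py
import numpy as np

# Synthetic stand-in for the daisy collection: 10 images of 50x50 pixels, 3 channels.
daisies = np.random.randint(0, 256, size=(10, 50, 50, 3), dtype=np.uint8)

# In-memory HDF5 file (nothing is written to disk); the file name is a placeholder.
with h5py.File('readback_sketch.h5', 'w', driver='core', backing_store=False) as hf:
    grp = hf.create_group('daisy')
    dset = grp.create_dataset('daisy', data=daisies)

    # Derive the count from the data rather than typing it by hand.
    grp.attrs['count'] = dset.shape[0]

    # Datasets are addressed by path; shape and dtype are available
    # without loading the pixel data.
    stored = hf['daisy/daisy']
    stored_shape, stored_dtype = stored.shape, stored.dtype

    # Slicing returns only the requested images as a numpy array.
    first_three = stored[:3]
    count = int(grp.attrs['count'])
```

Setting `count` from `dset.shape[0]` keeps the group attribute consistent with the dataset, which matters if unreadable files were skipped while assembling the collection.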