# Building High Performance Data Pipelines with tf.Data

In this article we present some recipes on how to build a data input pipeline using Tensorflow, specially tf.data.
It is recommended to read the first part, which defines the overall data and model, and then jump to any topic that interest you the most.

This article uses the Stanford Dogs Dataset with ~20000 images and 120 classes [1].


[1] Stanford Dataset - Dog Breeds

[] https://colab.sandbox.google.com/gist/robieta/9463e86b5501541a441d431b9c4f1a1e/tf_world.ipynb

## Step ONE

Read images one by one from a Google Cloud Storage bucket (gs).

Note that this may not be very efficient as we have a lot of small files to read (random reads).


### Second Attempt

Consolidate images inside a TFRecord file. Use tf.examples protobuf to store all the bytes from images.

### Third Attempt

Consolidate images already pre processed, using Apache Beam and TFRecord IO transform.

### Forth Attempt

Use multiple GPUs to train our model.

### Fifth Attempt

Use Multiple workers to scale up the training.

### Other
 - Mixed Precision (optimize computation, but not IO)
 - Change Batch Size
 - Distributed Training (Strategy)
  - tf.distribute.MirroredStrategy
  - tf.distribute.MultiWorkerMirroredStrategy
  - tf.distribute.TPUStrategy
  

In [1]:
import tensorflow as tf

In [2]:
from numpy import zeros
import numpy as np
from datetime import datetime

In [3]:
# Enable XLA jit graph compilation
# Performance gains for fixed size images
tf.config.optimizer.set_jit(True)

In [4]:
SOURCE = 'gs://renatoleite-tf-datapipeline-poc/*/*'
RESOLUTION = (224,224)
NUM_TOTAL_IMAGES = 24000
IMG_SHAPE=(224,224,3)

AUTOTUNE = tf.data.experimental.AUTOTUNE

In [5]:
# Get labels from folders
path = 'gs://renatoleite-tf-datapipeline-poc/*'
folders_name = tf.io.gfile.glob(path)

labels = []
for folder in folders_name:
    labels.append(folder.split(sep='/')[-1])

In [6]:
# Generate a Label Map
label_map = {labels[i]:i for i in range(len(labels))}

In [7]:
# List all files in bucket
filepath = 'gs://renatoleite-tf-datapipeline-poc/*/*'
filepath = tf.io.gfile.glob(filepath)

In [8]:
# Function to One hot encode the inputs
def one_hot_encode(label_map, filepath):
    dataset = dict()
    
    for i in range(len(filepath)):
        encoding = zeros(len(label_map), dtype='uint8')
        encoding[label_map[filepath[i].split(sep='/')[-2]]] = 1
        
        dataset.update({filepath[i]:list(encoding)})
    
    return dataset

In [9]:
dataset = one_hot_encode(label_map, filepath)
dataset = [[k,v] for k,v in dataset.items()]

features = [i[0] for i in dataset]
labels = [i[1] for i in dataset]

In [10]:
# Create Dataset from Features and Labels
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

In [11]:
# Function to download bytes from Cloud Storage
def get_bytes_label(filepath, label):
    raw_bytes = tf.io.read_file(filepath)
    return raw_bytes, label

In [12]:
# Preprocess Image
def process_image(raw_bytes, label):
    image = tf.io.decode_jpeg(raw_bytes, channels=3)
    image = tf.image.resize(image, RESOLUTION)
    image = tf.image.convert_image_dtype(image, dtype=tf.float32)
    
    return image, label

In [13]:
def build_dataset(dataset, batch_size=32):
    dataset = dataset.shuffle(NUM_TOTAL_IMAGES)
    
    # Extraction: IO Intensive
    dataset = dataset.map(get_bytes_label, num_parallel_calls=AUTOTUNE)

    # Transformation: CPU Intensive
    dataset = dataset.map(process_image, num_parallel_calls=AUTOTUNE)
    dataset = dataset.repeat()
    dataset = dataset.batch(batch_size=batch_size)
    
    # Pipeline next iteration
    dataset = dataset.prefetch(buffer_size=AUTOTUNE)
    
    return dataset

In [14]:
# Start tracing execution
tf.summary.trace_on(profiler=True)

In [15]:
dataset = build_dataset(dataset)

In [16]:
# Define Model
base_model = tf.keras.applications.ResNet50V2(weights='imagenet', 
                                         input_shape=IMG_SHAPE,
                                         include_top=False)

In [17]:
base_model.trainable = False

In [18]:
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(len(label_map))
])

In [19]:
model.compile(optimizer=tf.keras.optimizers.RMSprop(lr=0.01),
             loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
             metrics=['accuracy'])

In [20]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
resnet50v2 (Model)           (None, 7, 7, 2048)        23564800  
_________________________________________________________________
flatten (Flatten)            (None, 100352)            0         
_________________________________________________________________
dense (Dense)                (None, 64)                6422592   
_________________________________________________________________
dense_1 (Dense)              (None, 120)               7800      
Total params: 29,995,192
Trainable params: 6,430,392
Non-trainable params: 23,564,800
_________________________________________________________________


In [21]:
log_dir = "logs/fit/" + datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

In [None]:
model.fit(dataset, epochs=2, callbacks=[tensorboard_callback], steps_per_epoch=644)

Train for 644 steps
Epoch 1/2
 37/644 [>.............................] - ETA: 13:37 - loss: 2.4998 - accuracy: 0.9520

In [None]:
def read_one_image(filepath):
    image = tf.io.read_file(filepath)
    image = tf.io.decode_jpeg(image)
    
    image = tf.image.convert_image_dtype(image, dtype=tf.float32)
    image = tf.image.resize(image, (224,224))
    image = tf.expand_dims(image, 0)
    
    return image

In [None]:
filepath = '/home/jupyter/dog2.jpg'
dog = read_one_image(filepath)

In [None]:
predict_dog = model(dog)

In [None]:
model.predict_classes(dog)

In [None]:
label_map

In [None]:
# Stop tracing execution
tf.summary.trace_export(name='Loading Data', profiler_outdir='/home/jupyter/logs/')

In [None]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

In [None]:
%tensorboard --logdir /home/jupyter/logs

In [None]:
print(label_map)