# Loading and Preprocessing Data with TensorFlow

In [1]:
import tensorflow as tf
import numpy as np
from tensorflow import keras

So far all our data has fit into memory, but deep learning systems are often trained on datasets that don't fit into RAM:

* TensorFlow makes this easy to handle with its Data API
* The Data API can read from text files (such as CSV) and from binary files, including those that use the TFRecord format

Reading huge datasets efficiently is not the only difficulty; the data often needs to be preprocessed and *normalized*. Moreover, it is not always numerical (it may be text, categorical, etc.). One way to deal with this is to write custom preprocessing layers, or to use the standard preprocessing layers provided by Keras.

This chapter looks at:

1. Data API
2. TFRecord Format
3. Creating custom preprocessing layers and using Keras versions
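4. TF Transform (lets you write a single preprocessing function that runs in batch mode over a large training set before training, and can then be exported so the deployed model preprocesses new instances on the fly)
5. TF Datasets (convenient functions to download many common datasets of all kinds, including large ones such as ImageNet)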

# The Data API

## Intro

The Data API revolves around the concept of a **dataset**. Usually you use datasets that gradually read data from disk, but for now let's start with a simple one created in memory:

In [2]:
X = tf.range(10)
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset

<TensorSliceDataset shapes: (), types: tf.int32>

The from_tensor_slices() function takes a tensor and creates a dataset whose elements are all the slices of X (along the first dimension). In other words, the dataset contains 10 items: tensors 0, 1, ..., 9.
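
To make the slicing along the first dimension more concrete, here is a minimal sketch using a 2-D tensor together with a 1-D tensor. The names `features` and `labels` are made up for illustration and are not part of the example above.

```python
import tensorflow as tf

# A 2-D tensor with 3 rows: slicing along the first dimension yields 3 items.
features = tf.constant([[1, 2], [3, 4], [5, 6]])
labels = tf.constant([0, 1, 0])

# Passing a tuple pairs up the slices: each item is (features_row, label).
pairs = tf.data.Dataset.from_tensor_slices((features, labels))

rows = [(x.numpy().tolist(), int(y.numpy())) for x, y in pairs]
print(rows)  # [([1, 2], 0), ([3, 4], 1), ([5, 6], 0)]
```

Each item is one row of `features` paired with the matching entry of `labels`, which is the usual shape for (input, target) training data.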

You can iterate over a dataset's items:

In [3]:
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)


## Chaining Transformations

You can apply all sorts of transformations to a dataset by calling its transformation methods. Each method returns a new dataset, so the calls can be chained:

In [4]:
dataset = dataset.repeat(3).batch(7)
for item in dataset:
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)


Here we first called the repeat() method on the original dataset, which returns a new dataset that repeats the original items 3 times. We then called batch() on that new dataset, which groups the items into batches of 7. Since 30 items do not divide evenly by 7, the final batch contains only 2 items. Note that these dataset methods do not modify datasets: they return new ones.
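
If the short final batch is a problem (for example, when every batch must have the same size), batch() accepts a drop_remainder argument. The following is a small sketch on the same 30 items; the variable names are illustrative only.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10).repeat(3)  # 30 items in total

# Default behaviour: four full batches plus one partial batch of 2 items.
sizes = [int(batch.shape[0]) for batch in dataset.batch(7)]

# drop_remainder=True discards the final partial batch.
full_sizes = [int(batch.shape[0])
              for batch in dataset.batch(7, drop_remainder=True)]

# batch() returned new datasets; the original still holds 30 unbatched items.
n_items = sum(1 for _ in dataset)

print(sizes)       # [7, 7, 7, 7, 2]
print(full_sizes)  # [7, 7, 7, 7]
print(n_items)     # 30
```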

You can use the map() method with a lambda function to transform the items in the dataset:

In [5]:
dataset = dataset.map(lambda x: x**2)
for item in dataset:
    print(item)

tf.Tensor([ 0  1  4  9 16 25 36], shape=(7,), dtype=int32)
tf.Tensor([49 64 81  0  1  4  9], shape=(7,), dtype=int32)
tf.Tensor([16 25 36 49 64 81  0], shape=(7,), dtype=int32)
tf.Tensor([ 1  4  9 16 25 36 49], shape=(7,), dtype=int32)
tf.Tensor([64 81], shape=(2,), dtype=int32)


You will often use map() to apply whatever preprocessing your data needs. This can be computationally expensive, so it is a good idea to spread the work across multiple threads by setting the num_parallel_calls argument:

In [6]:
dataset = dataset.map(lambda x: x**2, num_parallel_calls=2)
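
Rather than hard-coding the number of parallel calls, you can also let TensorFlow choose it dynamically. The sketch below assumes TF 2.4 or later, where the constant is exposed as tf.data.AUTOTUNE (older releases spell it tf.data.experimental.AUTOTUNE). The `preprocess` function is a hypothetical stand-in for a real, more expensive preprocessing step.

```python
import tensorflow as tf

def preprocess(x):
    # Hypothetical stand-in for an expensive preprocessing function.
    return tf.cast(x, tf.float32) / 10.0

dataset = tf.data.Dataset.range(10)

# Assumption: TF 2.4+. On older versions use tf.data.experimental.AUTOTUNE.
dataset = dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)

# By default the items still come out in their original order.
values = [float(v.numpy()) for v in dataset]
print(values)
```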

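While map() applies a transformation to each item, apply() applies a transformation to the dataset as a whole. Here it is used to unbatch the dataset. As the message printed below indicates, the experimental unbatch() function is deprecated in favor of the dataset's own unbatch() method: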

In [7]:
dataset = dataset.apply(tf.data.experimental.unbatch())

Instructions for updating:
Use `tf.data.Dataset.unbatch()`.
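
As a sketch of the difference between map() and apply(), the snippet below passes apply() a function that receives the whole dataset and returns a new one, and it uses the unbatch() method recommended by the message above. The helper name `square_then_unbatch` is hypothetical.

```python
import tensorflow as tf

def square_then_unbatch(ds):
    # Receives an entire dataset and returns a new, transformed dataset.
    return ds.map(lambda x: x ** 2).unbatch()

batched = tf.data.Dataset.range(6).batch(4)  # batches: [0 1 2 3] and [4 5]

# apply() calls the function once, on the dataset as a whole.
flat = batched.apply(square_then_unbatch)

flat_values = [int(v.numpy()) for v in flat]
print(flat_values)  # [0, 1, 4, 9, 16, 25]
```

Packaging several steps in one function like this makes it easy to reuse the same pipeline fragment on different datasets.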


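You can also filter the dataset with the filter() method. Note that by this point the items have been squared twice (once by each of the two map() calls above), so only 0 and 1 are smaller than 10: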

In [8]:
dataset = dataset.filter(lambda x: x<10)
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)


If you only want to look at a few items in the dataset, use the take() method:

In [9]:
for item in dataset.take(3):
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
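
The filter() and take() methods chain like any other transformation. A minimal sketch, using a fresh dataset and an illustrative predicate that keeps even numbers:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# The predicate must return a scalar boolean tensor for each item.
evens = dataset.filter(lambda x: tf.equal(x % 2, 0))

first_three = [int(v.numpy()) for v in evens.take(3)]

# Asking for more items than exist simply returns everything available.
all_evens = [int(v.numpy()) for v in evens.take(100)]

print(first_three)  # [0, 2, 4]
print(all_evens)    # [0, 2, 4, 6, 8]
```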


## Shuffling the Data

Since Gradient Descent works best when the training instances are independent and identically distributed, it is often a good idea to shuffle the data first. The shuffle() method does this by creating a new dataset whose items are drawn at random, through a buffer, from the source dataset.

In [11]:
dataset = tf.data.Dataset.range(10).repeat(3)
dataset = dataset.shuffle(buffer_size=5, seed=42).batch(7)
for item in dataset:
    print(item)

tf.Tensor([0 2 3 6 7 9 4], shape=(7,), dtype=int64)
tf.Tensor([5 0 1 1 8 6 5], shape=(7,), dtype=int64)
tf.Tensor([4 8 7 1 2 3 0], shape=(7,), dtype=int64)
tf.Tensor([5 4 2 7 8 9 9], shape=(7,), dtype=int64)
tf.Tensor([3 6], shape=(2,), dtype=int64)


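The way it works:

1. Fill a buffer with the first buffer_size items of the source dataset. The buffer can't be too small (otherwise the shuffling is only local and not very effective) or too large (it has to fit in RAM)
2. Whenever an item is requested, pull one out of the buffer at random and replace it with the next item from the source dataset
3. Continue until the source dataset has been read entirely, then keep pulling items out of the buffer at random until it is empty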
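
The buffer mechanism can be sketched in a few lines of plain Python. This is only an illustration of the idea under the description above, not TensorFlow's actual implementation, and it will not reproduce the exact output shown earlier. The function name `buffered_shuffle` is made up.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Step 1: fill the buffer with the first items of the source.
            buffer.append(item)
            continue
        # Step 2: emit a random buffered item, replace it with the new one.
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # Step 3: the source is exhausted, so drain the buffer at random.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))

shuffled = list(buffered_shuffle(range(30), buffer_size=5, seed=42))
unshuffled = list(buffered_shuffle(range(10), buffer_size=1))
print(shuffled)
print(unshuffled)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With a buffer of size 5, an item can never be emitted more than 4 positions earlier than where it sat in the source, which is why a small buffer only shuffles locally. With a buffer of size 1 the order is unchanged.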


Notes:

* You can see that with a buffer size of 5 the items are not shuffled very well. With a buffer size of 1, nothing would be shuffled at all.

* For a dataset that is much bigger than the buffer, this kind of shuffling is not very effective. In that case it is typically a good idea to shuffle the source data itself beforehand (for example, with the Linux shuf command for text files).

* You typically want a different order at each epoch; otherwise the model might pick up on spurious patterns that just happen to be present in the order of the data. By default, a shuffled dataset is reshuffled each time it is iterated.
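
The per-epoch behaviour is controlled by the reshuffle_each_iteration argument of shuffle(). The sketch below shuffles before repeating, so that each repetition acts as one epoch, and shows how to reuse the same order at every epoch if that is ever needed. The variable names are illustrative.

```python
import tensorflow as tf

base = tf.data.Dataset.range(10)

# Default (reshuffle_each_iteration=True): a new order at every epoch.
reshuffled = base.shuffle(buffer_size=10).repeat(2)

# reshuffle_each_iteration=False: the same shuffled order at every epoch.
fixed = base.shuffle(buffer_size=10, seed=42,
                     reshuffle_each_iteration=False).repeat(2)

fixed_items = [int(v.numpy()) for v in fixed]
epoch_1, epoch_2 = fixed_items[:10], fixed_items[10:]

print(epoch_1)
print(epoch_2)  # same order as epoch_1
```

Keeping the default is generally the better choice for training; a fixed order is mainly useful for debugging or reproducing a run.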