# Chapter 13: Loading and Preprocessing Data with TensorFlow

## Problem 1

Why would you want to use the Data API?

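When handling datasets that do not fit into memory. The Data API streams data from one or more files and can shuffle, transform, batch, and prefetch it while the model trains.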

## Problem 2

What are the benefits of splitting a large dataset into multiple files?

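Multiple files make it easier to shuffle data without having to load the entire dataset: the file order can be shuffled and records from several files interleaved, before a smaller shuffle buffer does the fine-grained shuffling. Several files can also be read in parallel.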
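
A minimal sketch of file-level shuffling followed by interleaved reading. The shard files (`shard_0.txt` and so on) and their contents are hypothetical and are created in a temporary directory only for illustration:

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical setup: write 3 small text shards with 4 lines each.
tmp_dir = tempfile.mkdtemp()
for shard in range(3):
    path = os.path.join(tmp_dir, f"shard_{shard}.txt")
    with open(path, "w") as f:
        for i in range(4):
            f.write(f"shard{shard}-line{i}\n")

# Shuffle at the file level, then interleave lines from several files.
file_pattern = os.path.join(tmp_dir, "shard_*.txt")
filepaths = tf.data.Dataset.list_files(file_pattern, shuffle=True, seed=42)
dataset = filepaths.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath),
    cycle_length=3,                       # read from 3 files at a time
    num_parallel_calls=tf.data.AUTOTUNE,  # read the files in parallel
)
# A small shuffle buffer finishes the job once the files are interleaved.
dataset = dataset.shuffle(buffer_size=8, seed=42)

lines = [line.numpy().decode("utf-8") for line in dataset]
```

Every line of every shard appears exactly once in `lines`, but in a mixed order, and no step required the whole dataset to be in memory.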

## Problem 3

During training, how can you tell that your input pipeline is the bottleneck? What can you do to fix it?

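By profiling GPU, CPU, and hard disk usage, for example with the TensorBoard profiler. If GPU utilization oscillates or stays low, it is an indication that data processing rather than model processing is the bottleneck. To fix it, prefetch batches, read and preprocess the data in parallel on multiple threads, keep the preprocessing light, and cache the dataset if it fits in memory.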
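
Common remedies for an input-bound pipeline are parallel mapping, caching, and prefetching. A minimal sketch, assuming synthetic data and a hypothetical `preprocess` function that stands in for a costly transformation:

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.random((100, 8), dtype=np.float32)
y = rng.integers(0, 2, size=100)

def preprocess(features, label):
    # Placeholder for a costly transformation.
    return features * 2.0, label

dataset = (
    tf.data.Dataset.from_tensor_slices((X, y))
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)  # use several CPU threads
    .cache()                     # only sensible if the data fits in memory
    .shuffle(100)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)  # prepare the next batch while the model trains
)

batch_shapes = [tuple(xb.shape.as_list()) for xb, yb in dataset]
```

Placing `cache()` after `map()` means the transformation runs only during the first epoch, while `shuffle()` after `cache()` still gives a different order on each epoch.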

## Problem 4

Can you save any binary data to a TFRecord file, or only serialized protocol buffers?

A TFRecord file is a sequence of binary records, so any binary data can be stored in it. In practice, most TFRecord files contain serialized protocol buffers.
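
A minimal sketch of writing arbitrary byte strings to a TFRecord file and reading them back. The file name `raw_bytes.tfrecord` is hypothetical:

```python
import os
import tempfile

import tensorflow as tf

path = os.path.join(tempfile.mkdtemp(), "raw_bytes.tfrecord")  # hypothetical name

# None of these records is a protocol buffer.
records = [b"plain bytes", b"\x00\x01\x02", "text as UTF-8".encode("utf-8")]
with tf.io.TFRecordWriter(path) as writer:
    for record in records:
        writer.write(record)

# Each element of the dataset is one record, returned as a byte string.
read_back = [item.numpy() for item in tf.data.TFRecordDataset(path)]
```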

## Problem 5

Why would you go through the hassle of converting all your data to the Example protobuf format? Why not use your own protobuf definition?

The Example protobuf format ships with TensorFlow, which also provides operations to parse it. This makes it easier to read and write files, and you do not have to compile your own protobuf definition.
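
A minimal sketch of building, serializing, and parsing an Example with the classes and parsing operations bundled with TensorFlow. The feature names `name` and `label` and their values are made up for illustration:

```python
import tensorflow as tf

# Hypothetical record with one bytes feature and one integer feature.
example = tf.train.Example(features=tf.train.Features(feature={
    "name": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"sneaker"])),
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[7])),
}))
serialized = example.SerializeToString()

# The description tells the parser the shape and type of each feature.
feature_description = {
    "name": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}
parsed = tf.io.parse_single_example(serialized, feature_description)
```

Because `tf.io.parse_single_example()` is a regular TensorFlow operation, it can be called inside a `tf.data` `map()` step without any extra compilation.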

## Problem 6

When using TFRecords, when would you want to activate compression? Why not do it systematically?

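When files need to be downloaded or transferred between machines, since compression reduces their size. If files are used on a single machine, it is preferable not to compress them, because decompressing them costs CPU time on every read.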
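
A minimal sketch comparing an uncompressed and a GZIP-compressed TFRecord file. The file names and the highly repetitive records are assumptions chosen so that the size difference is easy to see:

```python
import os
import tempfile

import tensorflow as tf

tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "plain.tfrecord")      # hypothetical names
gzip_path = os.path.join(tmp_dir, "compressed.tfrecord")

# Repetitive stand-in records, which compress well.
records = [b"the same record repeated " * 40 for _ in range(200)]

with tf.io.TFRecordWriter(plain_path) as writer:
    for record in records:
        writer.write(record)

options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter(gzip_path, options) as writer:
    for record in records:
        writer.write(record)

plain_size = os.path.getsize(plain_path)
gzip_size = os.path.getsize(gzip_path)

# The reader must be told which compression was used.
dataset = tf.data.TFRecordDataset(gzip_path, compression_type="GZIP")
n_read = sum(1 for _ in dataset)
```

How much space is saved depends entirely on the data. Already-compressed content such as JPEG images may gain very little.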

## Problem 7

Data can be preprocessed directly when writing the data files, or within the tf.data pipeline, or in preprocessing layers within your model, or using TF Transform. Can you list a few pros and cons of each option?

Preprocessing data before storing it:
* Advantage: training becomes faster because preprocessing is done only once rather than on every epoch. The processed data may also be much smaller than the original data.
* Disadvantage: it is less flexible for experimentation, since each preprocessing variant requires a new copy of the data, and preprocessing code needs to be maintained in addition to the model code.

Preprocessing in the tf.data pipeline:
* Advantage: testing different processing steps or parameters is much easier. Processing can run on multiple threads, and prefetching lets it overlap with training.
* Disadvantage: preprocessing is repeated on every epoch, unless the dataset is cached. The model expects preprocessed data, so processing code and model code both need to be maintained.

Preprocessing layers in the model:
* Advantage: processing and modeling logic are kept together, so the same preprocessing is applied in training and in production.
* Disadvantage: preprocessing is repeated on every epoch, and it runs on the GPU as part of the model rather than on multiple CPU threads.

TF Transform:
* Advantage: it combines the benefits above. The data is preprocessed once before training, and the same preprocessing logic is exported as TensorFlow operations that can be deployed with the model.
* Disadvantage: one needs to learn how to use the tool.
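
A minimal sketch contrasting two of these options with the same `Normalization` layer: applied inside a `tf.data` pipeline, and embedded as the first layer of a model. The data is synthetic and the tiny model is only a placeholder:

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in features (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.random((64, 4), dtype=np.float32) * 10.0

norm = tf.keras.layers.Normalization()
norm.adapt(X)  # learn the per-feature mean and variance

# Option A: standardize inside the tf.data pipeline (runs on CPU threads).
dataset = tf.data.Dataset.from_tensor_slices(X).batch(16)
dataset = dataset.map(lambda batch: norm(batch))
standardized = np.concatenate([batch.numpy() for batch in dataset])

# Option B: standardize inside the model, so raw features can be fed directly.
model = tf.keras.Sequential([norm, tf.keras.layers.Dense(1)])
predictions = model(X[:2])
```

With option B the exported model accepts raw inputs, which removes the need to reimplement the standardization wherever the model is served.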



## Problem 8

Name a few common techniques you can use to encode categorical features. What about text?

* One-Hot encoding, effects encoding, ordinal encoding, target encoding
* For text, options are bag of words, TF-IDF, n-gram counts, and trained dense embeddings. Embeddings are usually the more suitable choice.
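
A minimal sketch of two of these encodings for a categorical feature: one-hot vectors and a trainable embedding. The vocabulary and the inputs are made up, and `"ISLAND"` is deliberately left out of the vocabulary to show how unknown categories are handled:

```python
import tensorflow as tf

categories = ["INLAND", "NEAR BAY", "NEAR OCEAN"]  # hypothetical vocabulary

# Map each string to an integer index; index 0 is reserved for unknown values.
lookup = tf.keras.layers.StringLookup(vocabulary=categories, num_oov_indices=1)
indices = lookup(tf.constant(["NEAR BAY", "INLAND", "ISLAND"]))

# One-hot encoding: one column per index, including the unknown bucket.
one_hot = tf.one_hot(indices, depth=lookup.vocabulary_size())

# Embedding: a small dense vector per index, learned during training.
embedding = tf.keras.layers.Embedding(
    input_dim=lookup.vocabulary_size(), output_dim=2)
embedded = embedding(indices)
```

One-hot vectors grow with the number of categories, whereas the embedding size (`output_dim`) is a hyperparameter, which is why embeddings tend to be preferred for large vocabularies.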

## Problem 9

Load the Fashion MNIST dataset (introduced in Chapter 10); split it into a training set, a validation set, and a test set; shuffle the training set; and save each dataset to multiple TFRecord files. Each record should be a serialized Example protobuf with two features: the serialized image (use tf.io.serialize_tensor() to serialize each image), and the label. Then use tf.data to create an efficient dataset for each set. Finally, use a Keras model to train these datasets, including a preprocessing layer to standardize each input feature. Try to make the input pipeline as efficient as possible, using TensorBoard to visualize profiling data.

```python
import tensorflow as tf
from tensorflow import keras
```

```python
# Load Fashion MNIST and hold out the first 5,000 training images for validation.
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
n_valid = 5000
X_valid, X_train = X_train_full[:n_valid], X_train_full[n_valid:]
y_valid, y_train = y_train_full[:n_valid], y_train_full[n_valid:]
```

```python
# Wrap each split in a tf.data.Dataset; only the training set is shuffled.
train_set = tf.data.Dataset.from_tensor_slices((X_train, y_train)).shuffle(len(X_train))
valid_set = tf.data.Dataset.from_tensor_slices((X_valid, y_valid))
test_set = tf.data.Dataset.from_tensor_slices((X_test, y_test))
```
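
The remaining steps of the exercise (writing each set to several TFRecord files and reading them back efficiently) are not worked out above. One possible sketch follows; it is not a reference solution. The helpers `create_example`, `write_tfrecords`, `parse`, and `load_dataset`, the shard naming scheme, and the random stand-in images are all assumptions, used so that the snippet runs without downloading Fashion MNIST:

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

def create_example(image, label):
    # Serialize the image tensor to bytes and wrap both features in an Example.
    image_data = tf.io.serialize_tensor(image)
    return tf.train.Example(features=tf.train.Features(feature={
        "image": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[image_data.numpy()])),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[int(label)])),
    }))

def write_tfrecords(dataset, prefix, n_shards=4):
    # Distribute the records over n_shards files in round-robin order.
    paths = [f"{prefix}-{shard:02d}-of-{n_shards:02d}.tfrecord"
             for shard in range(n_shards)]
    writers = [tf.io.TFRecordWriter(path) for path in paths]
    for index, (image, label) in dataset.enumerate():
        example = create_example(image, label)
        writers[int(index) % n_shards].write(example.SerializeToString())
    for writer in writers:
        writer.close()
    return paths

def parse(serialized):
    feature_description = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    example = tf.io.parse_single_example(serialized, feature_description)
    image = tf.io.parse_tensor(example["image"], out_type=tf.uint8)
    image = tf.reshape(image, [28, 28])  # restore the static shape
    return image, example["label"]

def load_dataset(paths, batch_size=32, shuffle_buffer_size=None):
    dataset = tf.data.TFRecordDataset(paths, num_parallel_reads=tf.data.AUTOTUNE)
    dataset = dataset.map(parse, num_parallel_calls=tf.data.AUTOTUNE)
    if shuffle_buffer_size:
        dataset = dataset.shuffle(shuffle_buffer_size)
    return dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)

# Random stand-in for a small set of 28x28 uint8 images with 10 classes.
images = np.random.randint(0, 256, size=(20, 28, 28), dtype=np.uint8)
labels = np.random.randint(0, 10, size=20)
demo_set = tf.data.Dataset.from_tensor_slices((images, labels))

paths = write_tfrecords(demo_set, os.path.join(tempfile.mkdtemp(), "demo"))
loaded = load_dataset(paths, batch_size=8)
n_images = sum(int(xb.shape[0]) for xb, yb in loaded)
```

For the actual exercise, the same helpers could be called once per split, for example `write_tfrecords(train_set, "fashion_mnist.train")`, with a shuffle buffer passed only when loading the training set.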

## Problem 10

In this exercise you will download a dataset, split it, create a tf.data.Dataset to load it and preprocess it efficiently, then build and train a binary classification model containing an Embedding layer:

### a

### b

### c

### d

### e

### f

### g