# Notes on Chapter 13 of *Hands-On Machine Learning with Scikit-Learn, Keras, & TensorFlow*, 3rd edition, by Aurélien Géron

Reduce the number of log messages displayed by TensorFlow

In [1]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

In [2]:
import itertools
from pathlib import Path
import time

import keras
from keras import layers
from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns
import tensorflow as tf
import tensorflow.experimental.numpy as tnp

Datasets are iterable sequences of data items

In [27]:
X = {"a" : tf.reshape(tf.range(9.), (3,3)), "b": tf.range(100.,103)}

dataset = tf.data.Dataset.from_tensor_slices(X)
for x in dataset:
    print(x)

{'a': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([0., 1., 2.], dtype=float32)>, 'b': <tf.Tensor: shape=(), dtype=float32, numpy=100.0>}
{'a': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([3., 4., 5.], dtype=float32)>, 'b': <tf.Tensor: shape=(), dtype=float32, numpy=101.0>}
{'a': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([6., 7., 8.], dtype=float32)>, 'b': <tf.Tensor: shape=(), dtype=float32, numpy=102.0>}


Multiple transformations can be chained together on datasets

In [33]:
dataset2 = dataset.repeat(3).map(
        lambda x: {'c':x['a']+0.1+x['b'], 'd':2*x['b']}
    ).shuffle(buffer_size=5, seed=13).batch(4)
for x in dataset2:
    print(x)

{'c': <tf.Tensor: shape=(4, 3), dtype=float32, numpy=
array([[108.1, 109.1, 110.1],
       [104.1, 105.1, 106.1],
       [100.1, 101.1, 102.1],
       [104.1, 105.1, 106.1]], dtype=float32)>, 'd': <tf.Tensor: shape=(4,), dtype=float32, numpy=array([204., 202., 200., 202.], dtype=float32)>}
{'c': <tf.Tensor: shape=(4, 3), dtype=float32, numpy=
array([[100.1, 101.1, 102.1],
       [100.1, 101.1, 102.1],
       [108.1, 109.1, 110.1],
       [104.1, 105.1, 106.1]], dtype=float32)>, 'd': <tf.Tensor: shape=(4,), dtype=float32, numpy=array([200., 200., 204., 202.], dtype=float32)>}
{'c': <tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[108.1, 109.1, 110.1]], dtype=float32)>, 'd': <tf.Tensor: shape=(1,), dtype=float32, numpy=array([204.], dtype=float32)>}


## Exercises

### 13.1

When a dataset is large enough that loading and preprocessing it becomes a bottleneck, the tf.data API provides an efficient way to reduce that bottleneck through multithreading, prefetching, and streaming of the data.
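As a rough illustration of the features mentioned above, here is a minimal sketch of a pipeline that preprocesses items on multiple threads and prefetches batches. The `scale_pixels` function and the toy `Dataset.range` source are hypothetical stand-ins for real loading and preprocessing code.

```python
import tensorflow as tf


def scale_pixels(x):
    # Hypothetical stand-in for an expensive per-item transformation.
    return tf.cast(x, tf.float32) / 255.0


dataset = (
    tf.data.Dataset.range(1000)
    # Let tf.data choose how many items to preprocess in parallel.
    .map(scale_pixels, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    # Prepare upcoming batches while the current one is being consumed.
    .prefetch(tf.data.AUTOTUNE)
)

first_batch = next(iter(dataset))
print(first_batch.shape)
```

The items are produced lazily, so the whole dataset never has to fit in memory at once.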

### 13.2

Splitting the data into multiple files uses the filesystem as a high-level index into the dataset. This index makes it easy to jump to different disjoint chunks of the dataset, which simplifies tasks such as reading the data concurrently from multiple threads. The filesystem may not be the fastest way to index the data like this (e.g., a NoSQL database can be faster), but it is usually good enough, and the many tools available for examining and debugging the filesystem can make it the best solution overall.
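A minimal sketch of reading several files concurrently, assuming the data has been split into CSV shards. The shard names and contents are made up for illustration and written to a temporary directory.

```python
import tempfile
from pathlib import Path

import tensorflow as tf

# Write three small, hypothetical CSV shards to a temporary directory.
tmp_dir = Path(tempfile.mkdtemp())
for shard in range(3):
    rows = [f"{shard},{row}" for row in range(4)]
    (tmp_dir / f"shard_{shard}.csv").write_text("\n".join(rows) + "\n")

# List the shards in shuffled order, then read lines from all three at once.
file_paths = tf.data.Dataset.list_files(
    str(tmp_dir / "shard_*.csv"), shuffle=True, seed=42
)
lines_dataset = file_paths.interleave(
    lambda path: tf.data.TextLineDataset(path),
    cycle_length=3,
    num_parallel_calls=tf.data.AUTOTUNE,
)

all_lines = sorted(line.numpy().decode() for line in lines_dataset)
print(len(all_lines))
```

Because each shard is an independent file, the readers do not need to coordinate with one another.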

### 13.3

Profiling the code is the most reliable way of detecting bottlenecks. As with any bottleneck, one first needs to identify the limiting resource (e.g., CPU cycles, SSD bandwidth, etc.) and then reduce the demand on that resource (e.g., with a better algorithm or data format), increase the amount of the resource (e.g., by switching to a multithreaded setup or changing hardware), and/or shift the demand to another resource (e.g., prefetch the data with otherwise unused CPU cycles while the GPU is training, rather than loading it while the GPU sits idle).
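Before reaching for a full profiler, a crude first check is to time one pass over the input pipeline on its own, with no model attached. The sketch below does that for a sequential pipeline and for one with parallel mapping and prefetching; `time_one_pass` and `preprocess` are hypothetical helpers, and the timings will vary by machine.

```python
import time

import tensorflow as tf


def time_one_pass(dataset):
    """Iterate over the dataset once; return (item count, elapsed seconds)."""
    start = time.perf_counter()
    n_items = sum(1 for _ in dataset)
    return n_items, time.perf_counter() - start


def preprocess(x):
    # Hypothetical stand-in for real preprocessing work.
    return tf.math.sqrt(tf.cast(x, tf.float32))


sequential = tf.data.Dataset.range(2000).map(preprocess).batch(50)
tuned = (
    tf.data.Dataset.range(2000)
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(50)
    .prefetch(tf.data.AUTOTUNE)
)

sequential_count, sequential_seconds = time_one_pass(sequential)
tuned_count, tuned_seconds = time_one_pass(tuned)
print(f"sequential: {sequential_seconds:.3f}s, tuned: {tuned_seconds:.3f}s")
```

If a pass over the data alone takes about as long as a training epoch, the input pipeline is a likely bottleneck.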

### 13.4

A TFRecord file is a sequence of arbitrary binary records, so any data that can be serialized to bytes can be stored in it (within certain size limits).
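A small sketch of this flexibility: the records below are raw bytes, UTF-8 text, and a serialized tensor, all written to the same file. The file name and payloads are invented for the example.

```python
import tempfile
from pathlib import Path

import tensorflow as tf

record_path = str(Path(tempfile.mkdtemp()) / "demo.tfrecord")

# Three unrelated payloads: raw bytes, encoded text, and a serialized tensor.
payloads = [
    b"plain bytes",
    "accentu\u00e9".encode("utf-8"),
    tf.io.serialize_tensor(tf.constant([[1.0, 2.0], [3.0, 4.0]])).numpy(),
]

with tf.io.TFRecordWriter(record_path) as writer:
    for payload in payloads:
        writer.write(payload)

# Reading returns each record as an opaque byte string.
restored = [record.numpy() for record in tf.data.TFRecordDataset(record_path)]
matrix = tf.io.parse_tensor(restored[2], out_type=tf.float32)
print(matrix.numpy())
```

The reader has to know how each record was encoded, since the file itself stores only bytes.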

### 13.5

TensorFlow provides a number of useful predefined parsing routines for the
`Example` protobuf format, and replicating these routines (in a
TensorFlow-compatible and performant way) is often more work than just
adapting to the `Example` format.
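For reference, a minimal sketch of a round trip through the `Example` format using the built-in parser. The feature names (`name`, `label`, `pixels`) and their values are hypothetical.

```python
import tensorflow as tf

# Build and serialize one Example with three made-up features.
example = tf.train.Example(
    features=tf.train.Features(
        feature={
            "name": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[b"ankle boot"])
            ),
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[9])),
            "pixels": tf.train.Feature(
                float_list=tf.train.FloatList(value=[0.0, 0.5, 1.0])
            ),
        }
    )
)
serialized = example.SerializeToString()

# Describe the expected features, then let TensorFlow do the parsing.
feature_spec = {
    "name": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
    "pixels": tf.io.FixedLenFeature([3], tf.float32),
}
parsed = tf.io.parse_single_example(serialized, feature_spec)
print(parsed["label"].numpy(), parsed["pixels"].numpy())
```

The same `feature_spec` can be passed to `tf.io.parse_example` to parse a whole batch of serialized records at once.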

### 13.6

Compression makes sense when A) the data is compressible (e.g. not already in
a compressed format like jpg), and B) space and/or bandwidth are more limiting
than the CPU cycles required for compression/decompression.
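A quick way to check condition A for a given dataset is to write the same records with and without compression and compare file sizes. The sketch below does this with deliberately repetitive (and therefore highly compressible) made-up records.

```python
import os
import tempfile
from pathlib import Path

import tensorflow as tf

tmp_dir = Path(tempfile.mkdtemp())
raw_path = str(tmp_dir / "raw.tfrecord")
gzip_path = str(tmp_dir / "gzip.tfrecord")

# Highly repetitive records, chosen so that compression clearly helps.
records = [b"0" * 1000 for _ in range(100)]

with tf.io.TFRecordWriter(raw_path) as writer:
    for record in records:
        writer.write(record)

options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter(gzip_path, options) as writer:
    for record in records:
        writer.write(record)

# The reader must be told which compression was used.
restored = [
    record.numpy()
    for record in tf.data.TFRecordDataset(gzip_path, compression_type="GZIP")
]
raw_size = os.path.getsize(raw_path)
gzip_size = os.path.getsize(gzip_path)
print(raw_size, gzip_size)
```

With already-compressed payloads such as JPEG images, the size difference would be expected to be much smaller, while the CPU cost remains.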

### 13.7

#### preprocessing when writing the files
##### pros:
- preprocessing is performed once on data ingestion (most efficient)
##### cons:
- preprocessing code must be repeated in production/deployment of the model

#### preprocessing within tf.data pipeline
##### pros:
- allows more dynamic changes (e.g. shuffling) during training
- allows decompression during training
##### cons:
- preprocessing is repeated for every epoch of training

#### preprocessing in the model
##### pros:
- allows learning of details of preprocessing (e.g. embedding vectors)
- allows a unified approach to hyperparameter tuning
##### cons:
- complicates the model, embedding more assumptions about the data
- some types of preprocessing (e.g. shuffling) are hard to implement well
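To make the last two options concrete, here is a sketch that uses the same Keras `Normalization` layer once inside a tf.data pipeline and once as the first layer of a model. The feature values and the tiny model are hypothetical, and the sketch assumes Keras is running on the TensorFlow backend.

```python
import keras
import numpy as np
import tensorflow as tf

# Hypothetical feature matrix: three samples, two features.
features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]], dtype="float32")

# Learn the per-feature mean and variance from the data.
normalizer = keras.layers.Normalization()
normalizer.adapt(features)

# Option 1: normalize inside the tf.data pipeline.
pipeline = (
    tf.data.Dataset.from_tensor_slices(features)
    .batch(3)
    .map(lambda batch: normalizer(batch))
)
normalized_batch = next(iter(pipeline)).numpy()

# Option 2: make the same layer part of the model, so raw features can be
# fed to it directly at inference time.
model = keras.Sequential(
    [keras.Input(shape=(2,)), normalizer, keras.layers.Dense(1)]
)
predictions = model(features)
print(normalized_batch.mean(axis=0), predictions.shape)
```

In the second option the preprocessing travels with the saved model, which is one way to avoid repeating it in deployment code.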

### 13.8

For small cardinalities, categorical integer features are often encoded as one-hot vectors. They can also be encoded as a single scaled float, although this imposes an ordering on the categories that can make learning harder. For larger cardinalities, embedding vectors are often used.

Text is often encoded as a sequence of embedding vectors, one per word or token, or as a single embedding vector for a whole sentence or document.
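A minimal sketch of the one-hot and embedding encodings, assuming a tiny hypothetical vocabulary of three categories. The embedding weights are randomly initialized here; in practice they would be learned during training.

```python
import keras
import tensorflow as tf

categories = tf.constant(["shirt", "boot", "bag", "boot"])

# Map strings to integer indices. Index 0 is reserved for unknown
# categories, so bag=1, boot=2, shirt=3.
lookup = keras.layers.StringLookup(
    vocabulary=["bag", "boot", "shirt"], num_oov_indices=1
)
indices = lookup(categories)

# One-hot encoding: one column per index, including the unknown slot.
one_hot = tf.one_hot(indices, depth=lookup.vocabulary_size())

# Embedding: each index selects a trainable dense vector.
embedding = keras.layers.Embedding(
    input_dim=lookup.vocabulary_size(), output_dim=2
)
embedded = embedding(indices)
print(one_hot.shape, embedded.shape)
```

The one-hot width grows with the number of categories, while the embedding width (`output_dim`) is a free hyperparameter.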

### 13.9

In [3]:
(x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()
shuffled_index = tf.random.shuffle(tf.range(x_train.shape[0] + x_test.shape[0]), seed=42)
x = tf.gather(tf.concat([x_train, x_test], axis=0), shuffled_index)
y = tf.gather(tf.concat([y_train, y_test], axis=0), shuffled_index)
x_test, x_validation, x_train = x[:10000], x[10000:20000], x[20000:]
y_test, y_validation, y_train = y[:10000], y[10000:20000], y[20000:]

In [4]:
x_train.shape

TensorShape([50000, 28, 28])

In [5]:
?tf.random.shuffle

Signature: tf.random.shuffle(value, seed=None, name=None)
Docstring:
Randomly shuffles a tensor along its first dimension.

The tensor is shuffled along dimension 0, such that each `value[j]` is mapped
to one and only one `output[i]`. For example, a mapping that might occur for a
3x2 tensor is:

```python
[[1, 2],       [[5, 6],
 [3, 4],  ==>   [1, 2],
 [5, 6]]        [3, 4]]
```

Args:
  value: A Tensor to be shuffled.
  seed: A Python integer. Used to create a random seed for the distribution.
    See
    `tf.random.set_seed`
    for behavior.
  name: A name for the operation (optional).

Returns:
  A tensor of same shape and type as `value`, shuffled along its first
  dimension.
File:      /usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/random_ops.py

In [14]:
x.shape

TensorShape([70000, 28, 28])