# 13 Best practices for the real world

## 7.0.3 Getting the most out of your models

### 7.0.3.1 Advanced architecture patterns

#### BATCH NORMALIZATION

- Normalization
  - make different samples seen by a machine-learning model more similar to each other

    ```
    normalized_data = (data - np.mean(data, axis=...)) / np.std(data, axis=...)
    ```
- Batch normalization
  - a type of layer
  - normalize data even as the mean and variance change over time during training
  - works by internally maintaining an exponential moving average of the batch-wise mean and variance of the data seen during training
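
The moving-average mechanism described above can be sketched in plain NumPy. This is a simplified illustration, not the Keras layer: the class name `SimpleBatchNorm` is hypothetical, the learnable scale and offset are left out, and the `momentum` and `epsilon` values are illustrative assumptions.

```python
import numpy as np

class SimpleBatchNorm:
    """Toy batch normalization over the feature axis (no learnable scale/offset)."""

    def __init__(self, num_features, momentum=0.99, epsilon=1e-3):
        self.momentum = momentum
        self.epsilon = epsilon
        self.moving_mean = np.zeros(num_features)
        self.moving_var = np.ones(num_features)

    def __call__(self, batch, training):
        if training:
            # During training, normalize with the statistics of the current batch...
            mean = batch.mean(axis=0)
            var = batch.var(axis=0)
            # ...and update exponential moving averages of those statistics.
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # At inference time, reuse the averages accumulated during training.
            mean, var = self.moving_mean, self.moving_var
        return (batch - mean) / np.sqrt(var + self.epsilon)

rng = np.random.default_rng(0)
bn = SimpleBatchNorm(num_features=3)
for _ in range(500):
    # Synthetic data with mean 5 and standard deviation 2 (made up for the example).
    batch = rng.normal(loc=5.0, scale=2.0, size=(64, 3))
    normalized = bn(batch, training=True)

print(bn.moving_mean, bn.moving_var)
```

After enough training batches, the moving averages approach the statistics of the data (mean near 5, variance near 4 here), which is what lets the layer normalize single samples at inference time.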

#### DEPTHWISE SEPARABLE CONVOLUTION

- depthwise separable convolution layer
  - make your model lighter (fewer trainable weight parameters)
  - faster (fewer floating-point operations)
  - perform a few percentage points better on its task
  - performs a spatial convolution on each channel of its input, independently, before mixing output channels via a pointwise convolution
    - separating the learning of spatial features and the learning of channel-wise features
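
To make the "fewer trainable weight parameters" claim concrete, here is a small back-of-the-envelope count. It is a sketch under stated assumptions: bias terms are ignored, the depth multiplier is 1, and the helper names `conv_params` and `separable_conv_params` are made up for this example.

```python
def conv_params(kernel_size, in_channels, out_channels):
    # Regular convolution: every output channel has a kernel spanning all input channels.
    return kernel_size * kernel_size * in_channels * out_channels

def separable_conv_params(kernel_size, in_channels, out_channels):
    # Depthwise step: one spatial kernel per input channel.
    depthwise = kernel_size * kernel_size * in_channels
    # Pointwise step: a 1x1 convolution that mixes the channels.
    pointwise = in_channels * out_channels
    return depthwise + pointwise

regular = conv_params(3, 64, 64)
separable = separable_conv_params(3, 64, 64)
print(regular, separable)  # 36864 4672
```

For a 3x3 window with 64 input and 64 output channels, the separable version needs roughly eight times fewer weights.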

## 13.1 Getting the most out of your models

- hyperparameters
  - How many layers should you stack? 
  - How many units or filters should go in each layer? 
  - Should you use relu as activation, or a different function? 
  - Should you use BatchNormalization after a given layer? 
  - How much dropout should you use?
- process of optimizing hyperparameters
  1. Choose a set of hyperparameters (automatically).
  2. Build the corresponding model.
  3. Fit it to your training data, and measure performance on the validation data.
  4. Choose the next set of hyperparameters to try (automatically).
  5. Repeat.
  6. Eventually, measure performance on your test data.
- The key to this process is the algorithm that analyzes the relationship between validation performance and various hyperparameter values to choose the next set of hyperparameters to evaluate. Possible techniques:
  - Bayesian optimization
  - genetic algorithms
  - simple random search
- Updating hyperparameters
  - The hyperparameter space is typically made up of discrete decisions and thus isn’t continuous or differentiable. Hence, you typically can’t do gradient descent in hyperparameter space. Instead, you must rely on gradient-free optimization techniques, which naturally are far less efficient than gradient descent.
  - Computing the feedback signal of this optimization process (does this set of hyperparameters lead to a high-performing model on this task?) can be extremely expensive: it requires creating and training a new model from scratch on your dataset.
  - The feedback signal may be noisy: if a training run performs 0.2% better, is that because of a better model configuration, or because you got lucky with the initial weight values?
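
The tuning loop above can be sketched with the simplest of the listed techniques, random search. Everything here is illustrative: `toy_validation_score` is a hypothetical stand-in for "build the model, fit it, and measure validation performance", and its formula is invented so the example runs instantly.

```python
import random

def toy_validation_score(units, optimizer):
    # Hypothetical stand-in for building, training, and validating a real model.
    base = 0.90 + 0.001 * (units // 16)
    bonus = 0.005 if optimizer == "adam" else 0.0
    return base + bonus

def random_search(num_trials, seed=0):
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(num_trials):
        # Steps 1 and 4: choose a set of hyperparameters (here, uniformly at random).
        config = {
            "units": rng.choice([16, 32, 48, 64]),
            "optimizer": rng.choice(["rmsprop", "adam"]),
        }
        # Steps 2 and 3: build the model, fit it, and measure validation performance.
        score = toy_validation_score(**config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score

best_config, best_score = random_search(num_trials=20)
print(best_config, best_score)
```

Random search ignores the history of past trials. Bayesian optimization and genetic algorithms differ mainly in how they use that history to pick the next configuration.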

#### USING KERASTUNER

In [1]:
# installation
!pip install keras-tuner -q

- lets you replace hard-coded hyperparameter values with a range of possible choices (search space of the hyperparameter tuning process)

In [2]:
# Listing 13.1 A KerasTuner model-building function
from tensorflow import keras 
from tensorflow.keras import layers
 
def build_model(hp):
  units = hp.Int(name="units", min_value=16, max_value=64, step=16)
  model = keras.Sequential([
    layers.Dense(units, activation="relu"),
    layers.Dense(10, activation="softmax")
  ])
  optimizer = hp.Choice(name="optimizer", values=["rmsprop", "adam"])
  model.compile(
      optimizer=optimizer,
      loss="sparse_categorical_crossentropy",
      metrics=["accuracy"])
  return model 

In [3]:
# more modular and configurable approach
# Listing 13.2 A KerasTuner HyperModel

import keras_tuner as kt  # the old `kerastuner` module name is deprecated
 
class SimpleMLP(kt.HyperModel):
  def __init__(self, num_classes):
    self.num_classes = num_classes
  def build(self, hp):
    units = hp.Int(name="units", min_value=16, max_value=64, step=16)
    model = keras.Sequential([
      layers.Dense(units, activation="relu"),
      layers.Dense(self.num_classes, activation="softmax")
    ])
    optimizer = hp.Choice(name="optimizer", values=["rmsprop", "adam"])
    model.compile(
        optimizer=optimizer,
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"])
    return model
 
hypermodel = SimpleMLP(num_classes=10)



- tuner
  - Pick a set of hyperparameter values
  - Call the model-building function with these values to create a model
  - Train the model and record its metrics

In [4]:
tuner = kt.BayesianOptimization(
    build_model,
    objective="val_accuracy",
    max_trials=100,
    executions_per_trial=2,
    directory="mnist_kt_test",
    overwrite=True, 
)

In [5]:
tuner.search_space_summary()

Search space summary
Default search space size: 2
units (Int)
{'default': None, 'conditions': [], 'min_value': 16, 'max_value': 64, 'step': 16, 'sampling': None}
optimizer (Choice)
{'default': 'rmsprop', 'conditions': [], 'values': ['rmsprop', 'adam'], 'ordered': False}


##### Objective maximization and minimization

In [8]:
objective = kt.Objective(
    name="val_accuracy",
    direction="max")
tuner = kt.BayesianOptimization(
    build_model,
    objective=objective,
    max_trials=10,
    executions_per_trial=2,
    directory="mnist_kt_test",
    overwrite=True, 
)

In [9]:
# launch search
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape((-1, 28 * 28)).astype("float32") / 255
x_test = x_test.reshape((-1, 28 * 28)).astype("float32") / 255
x_train_full = x_train[:]
y_train_full = y_train[:]
num_val_samples = 10000
x_train, x_val = x_train[:-num_val_samples], x_train[-num_val_samples:]
y_train, y_val = y_train[:-num_val_samples], y_train[-num_val_samples:]
callbacks = [
  keras.callbacks.EarlyStopping(monitor="val_loss", patience=5),
]
tuner.search(
    x_train, y_train,
    batch_size=128, 
    epochs=100,
    validation_data=(x_val, y_val),
    callbacks=callbacks,
    verbose=2,
)

Trial 10 Complete [00h 00m 59s]
val_accuracy: 0.9750000238418579

Best val_accuracy So Far: 0.9750500023365021
Total elapsed time: 00h 12m 15s
INFO:tensorflow:Oracle triggered exit


In [10]:
# Listing 13.3 Querying the best hyperparameter configurations
top_n = 4
best_hps = tuner.get_best_hyperparameters(top_n)

In [12]:
def get_best_epoch(hp):
  model = build_model(hp)
  callbacks=[
    keras.callbacks.EarlyStopping(
        monitor="val_loss", mode="min", patience=10)
  ]
  history = model.fit(
      x_train, y_train,
      validation_data=(x_val, y_val),
      epochs=15,
      batch_size=128,
      callbacks=callbacks)
  val_loss_per_epoch = history.history["val_loss"]
  best_epoch = val_loss_per_epoch.index(min(val_loss_per_epoch)) + 1
  print(f"Best epoch: {best_epoch}")
  return best_epoch

In [16]:
# train on the full dataset (disabled here; remove the triple quotes to run it)
'''
def get_best_trained_model(hp):
  best_epoch = get_best_epoch(hp)
  model = build_model(hp)  # build a fresh model for this configuration
  model.fit(
      x_train_full, y_train_full,
      batch_size=128, 
      epochs=int(best_epoch * 1.2))  # train a bit longer, since there is more data
  return model
 
best_models = []

for hp in best_hps:
  model = get_best_trained_model(hp)
  best_models.append(model)
'''
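
As a small worked example of the bookkeeping used above (picking the epoch with the lowest validation loss, then training about 20% longer on the full dataset), here is the same arithmetic on made-up numbers. The loss values are invented for illustration only.

```python
# Made-up validation losses, one per epoch (epoch numbering starts at 1).
val_loss_per_epoch = [0.60, 0.48, 0.41, 0.37, 0.35, 0.34, 0.33, 0.32, 0.33, 0.35]

# Same selection rule as in get_best_epoch: index of the minimum, plus one.
best_epoch = val_loss_per_epoch.index(min(val_loss_per_epoch)) + 1

# Same rule as in get_best_trained_model: scale by 1.2 and truncate to an integer.
final_epochs = int(best_epoch * 1.2)

print(best_epoch, final_epochs)  # 8 9
```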

#### THE ART OF CRAFTING THE RIGHT SEARCH SPACE

#### THE FUTURE OF HYPERPARAMETER TUNING: AUTOMATED MACHINE LEARNING

### 13.1.2 Model ensembling

- Ensembling consists of pooling together the predictions of a set of different models to produce better predictions.
- Ensembling relies on the assumption that different well-performing models trained independently are likely to be good for different reasons
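
A minimal sketch of the idea, using invented predictions: three hypothetical binary classifiers each misclassify a different sample, so averaging their predicted probabilities fixes all three mistakes. The numbers are made up to show the mechanism and are not results from real models.

```python
import numpy as np

labels = np.array([1, 0, 1, 1, 0])

# Predicted probability of class 1 from three hypothetical models (one row per model).
preds = np.array([
    [0.9, 0.2, 0.4, 0.8, 0.3],  # wrong on the third sample
    [0.7, 0.6, 0.8, 0.9, 0.1],  # wrong on the second sample
    [0.8, 0.1, 0.7, 0.3, 0.4],  # wrong on the fourth sample
])

def accuracy(probabilities):
    return float(np.mean((probabilities > 0.5).astype(int) == labels))

individual = [accuracy(p) for p in preds]

# Uniform weights here; in practice the weights could be tuned on validation data.
weights = np.array([1.0, 1.0, 1.0]) / 3
ensemble = accuracy(weights @ preds)

print(individual, ensemble)  # [0.8, 0.8, 0.8] 1.0
```

The gain comes from the models making different errors. If all three were wrong on the same sample, averaging would not help.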

## 13.2 Scaling-up model training

### 13.2.1 Speeding up training on GPU with mixed precision

#### UNDERSTANDING FLOATING-POINT PRECISION

- Precision is to numbers what resolution is to images
- Levels of precision
  - Half precision, or float16, where numbers are stored on 16 bits
  - Single precision, or float32, where numbers are stored on 32 bits
  - Double precision, or float64, where numbers are stored on 64 bits

##### A note on floating-point encoding

```
{sign} * (2 ** ({exponent} - 127)) * 1.{mantissa}
```
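
The formula above, with its bias of 127, corresponds to the single-precision (float32) layout: 1 sign bit, 8 exponent bits, and 23 mantissa bits. As a sketch, the function below (a hypothetical helper named `decode_float32`) pulls those three fields out of a number and recombines them. It only handles normal numbers, not zero, subnormals, infinities, or NaN.

```python
import struct

def decode_float32(value):
    # Reinterpret the 32 bits of a single-precision float as an unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = -1 if bits >> 31 else 1
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits
    mantissa = bits & 0x7FFFFF       # 23 mantissa bits
    # sign * 2 ** (exponent - 127) * 1.mantissa
    return sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2 ** 23)

print(decode_float32(0.15625), decode_float32(-2.5))  # 0.15625 -2.5
```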

- The resolution of floating-point numbers is the smallest distance between two arbitrary numbers that you’ll be able to safely process.
  - In single precision, that’s around 1e-7.
  - In double precision, that’s around 1e-16.
  - In half precision, it’s only 1e-3.
- The idea behind mixed precision is to leverage 16-bit computations in places where precision isn’t an issue, and to work with 32-bit values in other places to maintain numerical stability.
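
One way to check these orders of magnitude is NumPy's `np.finfo`, which reports the machine epsilon of each dtype, meaning the gap between 1.0 and the next representable number. Treating epsilon as the "resolution" is an interpretation made for this sketch.

```python
import numpy as np

for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    # eps is the distance between 1.0 and the next larger representable value.
    print(f"{dtype.__name__}: {info.bits} bits, eps = {float(info.eps):.1e}")

# float16: 16 bits, eps = 9.8e-04
# float32: 32 bits, eps = 1.2e-07
# float64: 64 bits, eps = 2.2e-16
```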

##### Beware of dtype defaults

In [1]:
import tensorflow as tf
import numpy as np
np_array = np.zeros((2, 2))
tf_tensor = tf.convert_to_tensor(np_array)
tf_tensor.dtype

tf.float64

In [2]:
np_array = np.zeros((2, 2))
tf_tensor = tf.convert_to_tensor(np_array, dtype="float32")
tf_tensor.dtype

tf.float32

#### MIXED-PRECISION TRAINING IN PRACTICE

In [None]:
# Turn on mixed precision
from tensorflow import keras
keras.mixed_precision.set_global_policy("mixed_float16")

- Typically, most of the forward pass of the model will be done in float16 (with the exception of numerically unstable operations like softmax).
- The weights of the model will be stored and updated in float32.
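
A quick NumPy experiment suggests why the weights stay in float32: a small update can be rounded away entirely in float16. The update size `1e-4` is a hypothetical value chosen for illustration.

```python
import numpy as np

update = 1e-4  # hypothetical small weight update

w16 = np.float16(1.0) + np.float16(update)
w32 = np.float32(1.0) + np.float32(update)

print(w16 == np.float16(1.0))  # True: the update was rounded away in half precision
print(w32 == np.float32(1.0))  # False: single precision kept the update
```

Near 1.0, neighboring float16 values are about 1e-3 apart, so any update much smaller than that leaves the weight unchanged.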

### 13.2.2 Multi-GPU training

- two ways to distribute computation across multiple devices
  - data parallelism
    - a single model is replicated on multiple devices or multiple machines
    - Each of the model replicas processes different batches of data, and then they merge their results
  - model parallelism
    - different parts of a single model run on different devices, processing a single batch of data together at the same time

#### GETTING YOUR HANDS ON TWO OR MORE GPUS

- Acquire 2–4 GPUs, mount them on a single machine (it will require a beefy power supply), and install CUDA drivers, cuDNN, etc. For most people, this isn’t the best option.
- Rent a multi-GPU Virtual Machine (VM) on Google Cloud, Azure, or AWS. You’ll be able to use VM images with preinstalled drivers and software, and you’ll have very little setup overhead. This is likely the best option for anyone who isn’t training models 24/7.

#### SINGLE-HOST, MULTI-DEVICE SYNCHRONOUS TRAINING

In [None]:
# machine with multiple GPUs
'''
strategy = tf.distribute.MirroredStrategy()
print(f"Number of devices: {strategy.num_replicas_in_sync}") 
with strategy.scope():
  model = get_compiled_model()
  model.fit(
      train_dataset,
      epochs=100,
      validation_data=val_dataset,
      callbacks=callbacks)
'''

- MirroredStrategy (the steps below assume a machine with four GPUs)
  1. A batch of data (called the global batch) is drawn from the dataset.
  2. It gets split into four different sub-batches (called local batches), one per GPU. For instance, if the global batch has 512 samples, each of the four local batches will have 128 samples. Because you want local batches to be large enough to keep the GPU busy, the global batch size typically needs to be very large.
  3. Each of the four replicas processes one local batch, independently, on its own device: they run a forward pass, and then a backward pass. Each replica outputs a “weight delta” describing by how much to update each weight variable in the model, given the gradient of the model's loss on the local batch with respect to the previous weights.
  4. The weight deltas originating from local gradients are efficiently merged across the four replicas to obtain a global delta, which is applied to all replicas. Because this is done at the end of every step, the replicas always stay in sync: their weights are always equal.
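
The four steps can be imitated on a single CPU with NumPy. This is a toy simulation, not how `tf.distribute` is implemented: the "replicas" are list entries, the model is a linear regression invented for the example, and the merge is a plain average.

```python
import numpy as np

rng = np.random.default_rng(0)
num_replicas = 4
global_batch_size = 512

# Toy linear-regression problem: loss = mean((x @ w - y) ** 2).
x = rng.normal(size=(global_batch_size, 8))
y = rng.normal(size=(global_batch_size,))
w = np.zeros(8)  # every replica starts from the same weights

def gradient(x_batch, y_batch, weights):
    error = x_batch @ weights - y_batch
    return 2 * x_batch.T @ error / len(x_batch)

# Steps 1-2: split the global batch into equally sized local batches.
local_x = np.split(x, num_replicas)
local_y = np.split(y, num_replicas)

# Step 3: each replica computes a gradient on its own local batch.
local_grads = [gradient(xb, yb, w) for xb, yb in zip(local_x, local_y)]

# Step 4: merge the local results (here, by averaging) and apply one shared update.
global_grad = np.mean(local_grads, axis=0)
learning_rate = 0.1
w = w - learning_rate * global_grad

print(local_x[0].shape)  # (128, 8)
print(np.allclose(global_grad, gradient(x, y, np.zeros(8))))  # True
```

With equally sized local batches, the averaged gradient matches the gradient computed on the whole global batch, which is why every replica ends the step with identical weights.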

### 13.2.3 TPU training

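- TPUs (Tensor Processing Units) are application-specific integrated circuits (ASICs): chips designed by Google specifically for deep learning workloads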

#### USING A TPU VIA GOOGLE COLAB


In [1]:
# connect notebook to tpu
import tensorflow as tf
tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect()
print("Device:", tpu.master())

INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.
INFO:tensorflow:Initializing the TPU system: grpc://10.71.203.42:8470
INFO:tensorflow:Finished initializing TPU system.
Device: grpc://10.71.203.42:8470


In [2]:
# Listing 13.4 Building a model in a TPUStrategy scope
from tensorflow import keras 
from tensorflow.keras import layers
 
strategy = tf.distribute.TPUStrategy(tpu) 
print(f"Number of replicas: {strategy.num_replicas_in_sync}")
 
def build_model(input_size):
  inputs = keras.Input((input_size, input_size, 3))
  x = keras.applications.resnet.preprocess_input(inputs)
  x = keras.applications.resnet.ResNet50(weights=None, include_top=False, pooling="max")(x)
  outputs = layers.Dense(10, activation="softmax")(x)
  model = keras.Model(inputs, outputs)
  model.compile(
      optimizer="rmsprop",
      loss="sparse_categorical_crossentropy",
      metrics=["accuracy"])
  return model
 
with strategy.scope():
  model = build_model(input_size=32)

INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
Number of replicas: 8


- two options for data loading:
  - Train from data that lives in the memory of the VM (not on disk). If your data is in a NumPy array, this is what you’re already doing. 
  - Store the data in a Google Cloud Storage (GCS) bucket, and create a dataset that reads the data directly from the bucket, without downloading locally. The TPU runtime can read data from GCS. This is your only option for datasets that are too large to live entirely in memory

In [4]:
# train from NumPy arrays in memory
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
model.fit(x_train, y_train, batch_size=1024, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f9e6febc550>

##### Beware of I/O bottlenecks

- If your dataset is small enough, you should keep it in the memory of the VM. You can do so by calling dataset.cache() on your dataset. That way, the data will only be read from GCS once.
- If your dataset is too large to fit in memory, make sure to store it as TFRecord files—an efficient binary storage format that can be loaded very quickly

#### LEVERAGING STEP FUSING TO IMPROVE TPU UTILIZATION

- You need to train with very large batches to keep the TPU cores busy.
- When working with enormous batches, make sure to increase your optimizer learning rate accordingly.
- step fusing
  - keep reasonably sized batches while maintaining full TPU utilization
  - run multiple steps of training during each TPU execution step
  - do more work in between two round trips from the VM memory to the TPU
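
In Keras, step fusing is exposed through the `steps_per_execution` argument of `model.compile()`. The sketch below does not call Keras. It only counts how many host-to-device round trips an epoch would need, under the simplifying assumption that each execution costs exactly one round trip. The helper name `host_round_trips` and the dataset size are hypothetical.

```python
import math

def host_round_trips(num_steps, steps_per_execution=1):
    # Each execution hands work to the device once and runs several training steps there.
    return math.ceil(num_steps / steps_per_execution)

# Hypothetical epoch: 50,000 samples with a batch size of 1024.
steps_per_epoch = math.ceil(50_000 / 1024)

print(steps_per_epoch)                                           # 49
print(host_round_trips(steps_per_epoch))                         # 49
print(host_round_trips(steps_per_epoch, steps_per_execution=8))  # 7
```

Fusing 8 steps per execution cuts the round trips from 49 to 7 while the batch size, and therefore the optimization behavior, stays the same.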