Chapter 12: Custom Models and Training with TensorFlow

Chapter 12 exercises:

1. How would you describe TensorFlow in a short sentence? What are its main features? Can you name other popular deep learning libraries?

-> TensorFlow is a powerful library for numerical computation, particularly well suited and fine-tuned for large-scale machine learning.

-> TensorFlow's core is very similar to NumPy, but with GPU support, optimizations for speed and memory usage, computation graph analysis, portable graphs, reverse-mode autodiff, excellent optimizers such as RMSProp and Nadam, and powerful APIs including TensorFlow's Keras implementation (tf.keras), data loading and preprocessing ops (tf.data, tf.io, etc.), image processing ops (tf.image), signal processing ops (tf.signal), and more.

-> Other popular Deep Learning libraries include PyTorch, MXNet, Microsoft Cognitive Toolkit, Theano, Caffe2, and Chainer.

page 408: TensorFlow is a powerful library for numerical computation, particularly well suited and fine-tuned for large-scale machine learning (but you can use it for anything that requires heavy computations).

page 404: 
TensorFlow core feature summary:
 - core is very similar to NumPy, but with GPU support
 - computation graph analysis: extracts the computation graph from a Python function, optimizes it, and runs it efficiently
 - portable: computation graphs can be exported in a portable format, so you can train a TensorFlow model in one environment (e.g. Python on Linux) and run it in another (e.g. Java on Android)
 - autodiff: implements reverse-mode autodiff and provides excellent optimizers, such as RMSProp and Nadam

TensorFlow additional features built on the core features:
 - keras, data loading and preprocessing ops (tf.data, tf.io, etc), image processing (tf.image), signal processing ops (tf.signal), and more

__book answer__: TensorFlow is an open-source library for numerical computation, particularly well suited and fine-tuned for large-scale Machine Learning. Its core is similar to NumPy, but it also features GPU support, support for distributed computing, computation graph analysis and optimization capabilities (with a portable graph format that allows you to train a TensorFlow model in one environment and run it in another), an optimization API based on reverse-mode autodiff, and several powerful APIs such as tf.keras, tf.data, tf.image, tf.signal, and more. Other popular Deep Learning libraries include PyTorch, MXNet, Microsoft Cognitive Toolkit, Theano, Caffe2, and Chainer.
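
As a small illustration of the reverse-mode autodiff feature mentioned above, the sketch below differentiates a toy function with `tf.GradientTape`. The function `f` and the variable names `w1` and `w2` are illustrative choices, not part of the answer.

```python
import tensorflow as tf

def f(w1, w2):
    # Toy function: f(w1, w2) = 3*w1^2 + 2*w1*w2
    return 3 * w1 ** 2 + 2 * w1 * w2

w1, w2 = tf.Variable(5.), tf.Variable(3.)
with tf.GradientTape() as tape:
    z = f(w1, w2)  # operations involving variables are recorded by the tape

# Analytically: df/dw1 = 6*w1 + 2*w2 and df/dw2 = 2*w1
gradients = tape.gradient(z, [w1, w2])
print([g.numpy() for g in gradients])
```

A single backward pass through the recorded operations yields both partial derivatives, which is why reverse-mode autodiff suits functions with many inputs and a single output (such as a loss).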

2. Is TensorFlow a drop-in replacement for Numpy? What are the main differences between the two?

-> Although TensorFlow offers functionality similar to NumPy's, it is not a drop-in replacement for NumPy.

-> TensorFlow and NumPy differences include:
 - a tensor is similar to an ndarray, but it can also hold a scalar;
 - NumPy uses 64-bit precision by default, while TensorFlow uses 32-bit by default;
 - function names are not always the same (tf.reduce_mean(), tf.reduce_sum(), tf.reduce_max(), and tf.math.log() are generally equivalent to NumPy's np.mean(), np.sum(), np.max(), and np.log());
 - some functions do not behave in the same way, e.g. tf.transpose(t) creates a new tensor with its own copy of the transposed data, while NumPy's t.T is just a transposed view of the same data;
 - NumPy ndarrays are mutable, while tensors (e.g. those created with tf.constant()) are immutable. Note: tf.Variable() creates a mutable value.

page 407:
Tensors:
 - the TensorFlow API revolves around tensors, which flow from operation to operation
 - a tensor is similar to a NumPy ndarray; it is usually a multidimensional array, but it can also hold a scalar (a simple value such as 42)

page 408:
Tensor operations:
 - create a tensor with 'tf.constant()'
 - 'shape' and 'dtype' work on tensors
 - indexing works much like NumPy: e.g. t[:, 1:], t[..., 1, tf.newaxis]
 - math operations include: tf.add(), tf.multiply(), tf.square(), tf.exp(), tf.sqrt()
 - operations similar to NumPy's include: tf.reshape(), tf.squeeze(), tf.tile()
 - some tensor operations have different names from NumPy: tf.reduce_mean(), tf.reduce_sum(), tf.reduce_max(), and tf.math.log() are generally equivalent to np.mean(), np.sum(), np.max(), and np.log()
 - tf.transpose(t) works like NumPy's t.T, but it creates a new tensor with its own copy of the data

__book answer__: Although TensorFlow offers most of the functionalities provided by NumPy, it is not a drop-in replacement, for a few reasons. First, the names of the functions are not always the same (for example, `tf.reduce_sum()` versus `np.sum()`). Second, some functions do not behave in exactly the same way (for example, `tf.transpose()` creates a transposed copy of a tensor, while NumPy's `T` attribute creates a transposed view, without actually copying any data). Lastly, NumPy arrays are mutable, while TensorFlow tensors are not (but you can use a `tf.Variable` if you need a mutable object).
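
The three differences listed in the answer can be checked directly. This is a minimal sketch; the array `a` and the other variable names are arbitrary illustrations.

```python
import numpy as np
import tensorflow as tf

a = np.array([[1., 2., 3.], [4., 5., 6.]])
t = tf.constant(a)

# 1. Different function names for the same operation
sum_np = np.sum(a)
sum_tf = tf.reduce_sum(t).numpy()

# 2. NumPy's .T is a view on the same data; tf.transpose() returns a new tensor
transpose_is_view = np.shares_memory(a, a.T)
t_transposed = tf.transpose(t)

# 3. Tensors are immutable: item assignment raises a TypeError
try:
    t[0, 0] = 42.
    tensor_is_mutable = True
except TypeError:
    tensor_is_mutable = False

# A tf.Variable is mutable, but only through assign() and related methods
v = tf.Variable(a)
v[0, 0].assign(42.)

print(sum_np, sum_tf, transpose_is_view, tensor_is_mutable, v.numpy()[0, 0])
```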

3. Do you get the same result with tf.range(10) and tf.constant(np.arange(10))?

-> On my 64-bit Windows 10 laptop, both return a tensor containing the integers 0 to 9 with dtype int32 (as can be seen below). Although NumPy normally defaults to 64-bit integers, here it defaults to 32-bit integers on a 64-bit Windows 10 host, for the reason explained in the Stack Overflow answer below.

https://stackoverflow.com/questions/36278590/numpy-array-dtype-is-coming-as-int32-by-default-in-a-windows-10-64-bit-machine
In Microsoft C, even on a 64 bit system, the size of the long int data type is 32 bits. (See, for example, https://msdn.microsoft.com/en-us/library/9c3yd98k.aspx.) Numpy inherits the default size of an integer from the C compiler's long int.

__book answer__: Both `tf.range(10)` and `tf.constant(np.arange(10))` return a one-dimensional tensor containing the integers 0 to 9. However, the former uses 32-bit integers while the latter uses 64-bit integers. Indeed, TensorFlow defaults to 32 bits, while NumPy defaults to 64 bits.

In [5]:
import numpy as np
import tensorflow as tf

In [6]:
tf.range(10)

<tf.Tensor: shape=(10,), dtype=int32, numpy=array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])>

In [13]:
np.arange(10).dtype

dtype('int32')

In [11]:
tf.constant(np.arange(10))

<tf.Tensor: shape=(10,), dtype=int32, numpy=array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])>

In [18]:
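import numpy as np
import numpy.distutils.system_info as sysinfo  # provides platform_bits (note: numpy.distutils is deprecated)
import platform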

print(f'sysinfo.platform_bits: {sysinfo.platform_bits}') 
print(f'platform.architecture(): {platform.architecture()}')

sysinfo.platform_bits: 64
platform.architecture(): ('64bit', 'WindowsPE')
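
Because NumPy's default integer type can depend on the platform (as the output above shows), one way to get the same result everywhere is to request the dtype explicitly. A minimal sketch, assuming 64-bit integers are what you want:

```python
import numpy as np
import tensorflow as tf

t_range = tf.range(10, dtype=tf.int64)                 # override TensorFlow's int32 default
t_numpy = tf.constant(np.arange(10, dtype=np.int64))   # do not rely on NumPy's platform default
t_cast = tf.cast(tf.range(10), tf.int64)               # or cast an existing tensor

print(t_range.dtype, t_numpy.dtype, t_cast.dtype)
```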


4. Can you name six other data structures available in TensorFlow, beyond regular tensors?

-> Six other TensorFlow data structures: sparse tensors (tf.SparseTensor), tensor arrays (tf.TensorArray), ragged tensors (tf.RaggedTensor), string tensors (dtype tf.string), sets, and queues (tf.queue). Sets and strings are represented as regular tensors, with special functions to manipulate them (in tf.sets and tf.strings).


page 411:

>Sparse tensors (tf.SparseTensor)
 - efficiently represent tensors containing mostly zeros
 - the tf.sparse package contains operations for sparse tensors
>Tensor arrays (tf.TensorArray)
 - are lists of tensors. They have a fixed length by default, but can optionally be made extensible
>Ragged tensors (tf.RaggedTensor)
 - represent lists of tensors, all of the same rank (number of dimensions) and data type, but with varying sizes
 - the tf.ragged package contains operations for ragged tensors
>String tensors (tf.string)
 - are regular tensors of type tf.string
 - they represent byte strings, not Unicode strings
 - the tf.strings package contains operations for byte strings and Unicode strings (and for converting one into the other)
>Sets
 - are represented as regular tensors (or sparse tensors)
 - for example, tf.constant([[1, 2], [3, 4]]) represents the two sets {1, 2} and {3, 4}. More generally, each set is represented by a vector in the tensor's last axis
 - the tf.sets package contains operations for manipulating sets
>Queues
 - store tensors across multiple steps
 - TensorFlow offers various queues: FIFOQueue, PriorityQueue, RandomShuffleQueue, PaddingFIFOQueue
 - these classes are in the tf.queue package

__book answer__: Beyond regular tensors, TensorFlow offers several other data structures, including sparse tensors, tensor arrays, ragged tensors, queues, string tensors, and sets. The last two are actually represented as regular tensors, but TensorFlow provides special functions to manipulate them (in tf.strings and tf.sets).
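
A few of these data structures can be tried in a handful of lines. The values below are arbitrary illustrations; only the `tf.SparseTensor`, `tf.ragged`, `tf.strings`, and `tf.sets` calls matter.

```python
import tensorflow as tf

# Sparse tensor: only the non-zero entries are stored
sparse = tf.SparseTensor(indices=[[0, 1], [1, 0], [2, 3]],
                         values=[1., 2., 3.],
                         dense_shape=[3, 4])
dense = tf.sparse.to_dense(sparse)

# Ragged tensor: rows of the same rank and dtype but different lengths
ragged = tf.ragged.constant([[1, 2, 3], [4], [5, 6]])
row_lengths = ragged.row_lengths()

# String tensor: holds byte strings, so "café" takes 5 bytes in UTF-8 but 4 characters
text = tf.constant("café")
n_bytes = tf.strings.length(text)
n_chars = tf.strings.length(text, unit="UTF8_CHAR")

# Sets: regular tensors, one set per vector along the last axis
set_a = tf.constant([[1, 2, 3]])
set_b = tf.constant([[3, 4]])
union = tf.sparse.to_dense(tf.sets.union(set_a, set_b))

print(dense.numpy(), row_lengths.numpy(), n_bytes.numpy(), n_chars.numpy(), union.numpy())
```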

5. You can define a custom loss function by writing a function or by subclassing the `tf.keras.losses.Loss` class. When would you use each option?

-> In general, you would use a regular Python function. If the loss is calculated only from the labels, predictions, and (optionally) sample weights, a custom loss function is enough. <br>
-> If the custom loss function must support some hyperparameters (or any other state), you need to subclass `tf.keras.losses.Loss` and implement the `__init__()` and `call()` methods. If you want the loss function's hyperparameters to be saved along with the model, then you must also implement the `get_config()` method. <br>

page 424: <br>
Defining custom losses based on model internals:
- the previous custom losses and metrics were based on the labels and the predictions (and optionally sample weights)
- however, there are cases where you need to define a custom loss based on model internals: compute it from any part of the model you want, then pass the result to the 'add_loss()' method
- example case: a custom regression MLP composed of a stack of 5 hidden layers, plus an output layer.
  It also includes an auxiliary output on top of the upper hidden layer.
  The auxiliary output has a 'reconstruction loss' (the mean squared difference between the 'reconstruction' and the inputs).
  The reconstruction loss is added to the main loss.

__book answer__:  When you want to define a custom loss function, in general you can just implement it as a regular Python function. However, if your custom loss function must support some hyperparameters (or any other state), then you should subclass the `keras.losses.Loss` class and implement the `__init__()` and `call()` methods. If you want the loss function's hyperparameters to be saved along with the model, then you must also implement the `get_config()` method.
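
The two options can be sketched side by side with a Huber loss. This is a minimal illustration: the names `huber_fn` and `HuberLoss` and the `threshold` hyperparameter are choices made for this sketch.

```python
import tensorflow as tf

# Option 1: a plain function, sufficient when there is no hyperparameter to save
def huber_fn(y_true, y_pred):
    error = y_true - y_pred
    is_small_error = tf.abs(error) < 1.0  # threshold hard-coded to 1.0
    squared_loss = tf.square(error) / 2
    linear_loss = tf.abs(error) - 0.5
    return tf.where(is_small_error, squared_loss, linear_loss)

# Option 2: a Loss subclass, needed when the loss carries a hyperparameter (state)
class HuberLoss(tf.keras.losses.Loss):
    def __init__(self, threshold=1.0, **kwargs):
        self.threshold = threshold
        super().__init__(**kwargs)

    def call(self, y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) < self.threshold
        squared_loss = tf.square(error) / 2
        linear_loss = self.threshold * tf.abs(error) - self.threshold ** 2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)

    def get_config(self):
        # Lets Keras save the threshold along with the model
        base_config = super().get_config()
        return {**base_config, "threshold": self.threshold}

y_true = tf.constant([[0.], [0.]])
y_pred = tf.constant([[1.], [3.]])
loss_value = HuberLoss(threshold=2.0)(y_true, y_pred)
print(loss_value.numpy(), HuberLoss(threshold=2.0).get_config()["threshold"])
```

Either object can be passed as `loss=` to `compile()`; the subclass is the one whose `threshold` survives saving and reloading the model.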

6. Similarly, you can define a custom metric in a function or as a subclass of 'tf.keras.metrics.Metric'. When would you use each? 

-> In general, you would implement a custom metric as a regular Python function.<br>
-> However, if your custom metric needs to support some hyperparameters (or any other state), you need to subclass `tf.keras.metrics.Metric` and implement the `__init__()`, `update_state()`, and `result()` methods (plus `reset_states()` if resetting requires more than setting all variables to 0.0).<br>
-> Additionally, if the custom metric cannot simply be averaged over the batches and instead needs to be computed over the whole epoch (i.e. a streaming/stateful metric such as precision), then you also need to subclass `tf.keras.metrics.Metric` and implement the same methods. <br>
-> Finally, if you want the custom metric's hyperparameters to be saved along with the model, then you must also implement the `get_config()` method.

page 419:
Custom metric using a simple function vs. custom streaming (or stateful) metric
 - Keras automatically calls a simple metric function for each batch, and it keeps track of the mean during each epoch
 - the only benefit of our HuberMetric class (a subclass of 'tf.keras.metrics.Metric') is that the custom 'threshold' variable will be saved
 - some metrics, like precision, cannot simply be averaged over batches; in those cases, there is no other option than to implement a streaming metric

__book answer__: Much like custom loss functions, most metrics can be defined as regular Python functions. But if you want your custom metric to support some hyperparameters (or any other state), then you should subclass the `keras.metrics.Metric` class. Moreover, if computing the metric over a whole epoch is not equivalent to computing the mean metric over all batches in that epoch (e.g., as for the precision and recall metrics), then you should subclass the `keras.metrics.Metric` class and implement the `__init__()`, `update_state()`, and `result()` methods to keep track of a running metric during each epoch. You should also implement the `reset_states()` method unless all it needs to do is reset all variables to 0.0. If you want the state to be saved along with the model, then you should implement the `get_config()` method as well.
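
To see why precision needs a streaming metric, here is a minimal sketch of one. The class name `StreamingPrecision` and its `threshold` argument are hypothetical, and in practice you would use the built-in `tf.keras.metrics.Precision`. The metric accumulates counts across batches instead of averaging per-batch precisions.

```python
import tensorflow as tf

class StreamingPrecision(tf.keras.metrics.Metric):
    def __init__(self, threshold=0.5, **kwargs):
        super().__init__(**kwargs)
        self.threshold = threshold
        # State variables, accumulated over all batches of an epoch
        self.true_positives = self.add_weight(name="tp", initializer="zeros")
        self.predicted_positives = self.add_weight(name="pp", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        # sample_weight is accepted for API compatibility but ignored in this sketch
        y_true = tf.cast(y_true, tf.float32)
        predicted = tf.cast(tf.cast(y_pred, tf.float32) >= self.threshold, tf.float32)
        self.true_positives.assign_add(tf.reduce_sum(y_true * predicted))
        self.predicted_positives.assign_add(tf.reduce_sum(predicted))

    def result(self):
        return tf.math.divide_no_nan(self.true_positives, self.predicted_positives)

    def get_config(self):
        base_config = super().get_config()
        return {**base_config, "threshold": self.threshold}

precision = StreamingPrecision()
# Batch 1: 4 true positives out of 5 positive predictions (batch precision 0.8)
precision.update_state([0, 1, 1, 1, 0, 1, 0, 1], [1, 1, 0, 1, 0, 1, 0, 1])
# Batch 2: 0 true positives out of 3 positive predictions (batch precision 0.0)
precision.update_state([0, 1, 0, 0, 1, 0, 1, 1], [1, 0, 1, 1, 0, 0, 0, 0])
overall = precision.result()
print(overall.numpy())
```

The mean of the two batch precisions would be 0.4, whereas the precision over all 8 positive predictions is 4/8. That gap is what the streaming metric avoids.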

7. When should you create a custom layer versus a custom model?

-> Layers are reusable blocks (objects) inside your model. If you need custom functionality in these reusable blocks, implement it as a custom layer. <br>
-> The model is the object you will train. If the way the model is trained needs to be customized, then implement a custom model. For example, if you need a custom loss or metric based on the model's internals, you can compute it in a custom model.


page 424: <br>
Model Class
 - is a subclass of the 'Layer' class, so models can be used exactly like layers
 - models have some additional functionality compared to layers, including of course the compile(), fit(), evaluate(), and predict() methods, plus the 'get_layer()' and 'save()' methods
 - layers should subclass the 'Layer' class and models should subclass the 'Model' class

page 424: <br>
Defining custom losses based on model internals <br>
 - the previous custom losses and metrics were based on the labels and the predictions (and optionally sample weights)
 - however, there are cases where you need to define a custom loss based on model internals: compute it from any part of the model you want, then pass the result to the 'add_loss()' method
 - example case: a custom regression MLP composed of a stack of 5 hidden layers, plus an output layer.
   It also includes an auxiliary output on top of the upper hidden layer.
   The auxiliary output has a 'reconstruction loss' (the mean squared difference between the
   'reconstruction' and the inputs).
   The reconstruction loss is added to the main loss.<br>

__book answer__:  You should distinguish the internal components of your model (i.e., layers or reusable blocks of layers) from the model itself (i.e., the object you will train). The former should subclass the `keras.layers.Layer` class, while the latter should subclass the `keras.models.Model` class.
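
The distinction can be sketched with a reusable block implemented as a layer and the trainable object implemented as a model. The class names `ResidualBlock` and `ResidualRegressor`, and the layer sizes, are illustrative assumptions.

```python
import tensorflow as tf

class ResidualBlock(tf.keras.layers.Layer):
    """Reusable internal component: subclasses Layer."""
    def __init__(self, n_layers, n_neurons, **kwargs):
        super().__init__(**kwargs)
        self.hidden = [tf.keras.layers.Dense(n_neurons, activation="relu")
                       for _ in range(n_layers)]

    def call(self, inputs):
        Z = inputs
        for layer in self.hidden:
            Z = layer(Z)
        return inputs + Z  # skip connection: inputs must have n_neurons features

class ResidualRegressor(tf.keras.Model):
    """The object you train: subclasses Model, so it gets compile(), fit(), etc."""
    def __init__(self, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.hidden1 = tf.keras.layers.Dense(30, activation="relu")
        self.block1 = ResidualBlock(2, 30)
        self.out = tf.keras.layers.Dense(output_dim)

    def call(self, inputs):
        Z = self.hidden1(inputs)
        Z = self.block1(Z)
        return self.out(Z)

model = ResidualRegressor(output_dim=1)
y = model(tf.zeros([4, 8]))  # a batch of 4 instances with 8 features each
print(y.shape)
```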

8. What are some use cases that require writing your own custom training loop?

-> A custom training loop should only be used when the model's `fit()` method is not flexible enough for what you need to do during training. Custom training loops are complex and error-prone, so they should be avoided where possible. Note that there are ways to customize training without a custom training loop, including custom losses, custom regularizers, custom activation functions, and so on.

page 415: <br>
Custom Keras functionalities
 - Most Keras functionalities, such as losses, regularizers, constraints, initializers, metrics, activation functions, layers, and even full models, can be customized in the same way as the previous section showed for loss functions
 - most of the time, you will just need to write a simple function with the appropriate inputs and outputs
 - for example, when custom functions are passed to a Dense layer:
 - the activation function will be applied to the output of this Dense layer, and its result will be passed on to the next layer
 - the layer's weights will be initialized with the value returned by the initializer
 - at each training step, the weights will be passed to the regularization function to compute the regularization loss, which will be added to the main loss to get the final loss used for training
 - the constraint function will be called after each training step, and the layer's weights will be replaced by the constrained weights <br>
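
The four kinds of custom functions described in the list above can be sketched as follows. The function names (`my_softplus`, `my_glorot_initializer`, `my_l1_regularizer`, `my_positive_weights`) and the 0.01 regularization factor are illustrative assumptions.

```python
import tensorflow as tf

def my_softplus(z):
    # Custom activation: applied to the Dense layer's output
    return tf.math.log(1.0 + tf.exp(z))

def my_glorot_initializer(shape, dtype=tf.float32):
    # Custom initializer: returns the initial value of the weights
    stddev = tf.sqrt(2. / (shape[0] + shape[1]))
    return tf.random.normal(shape, stddev=stddev, dtype=dtype)

def my_l1_regularizer(weights):
    # Custom regularizer: returns a loss that gets added to the main loss
    return tf.reduce_sum(tf.abs(0.01 * weights))

def my_positive_weights(weights):
    # Custom constraint: returns the weights that replace the current ones
    return tf.where(weights < 0., tf.zeros_like(weights), weights)

layer = tf.keras.layers.Dense(1,
                              activation=my_softplus,
                              kernel_initializer=my_glorot_initializer,
                              kernel_regularizer=my_l1_regularizer,
                              kernel_constraint=my_positive_weights)

outputs = layer(tf.ones([2, 3]))   # builds the layer and runs a forward pass
regularization_losses = layer.losses
print(outputs.shape, len(regularization_losses))
```

Note that the constraint is not applied during this forward pass: it only takes effect after a training step, as described above.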

page 430: <br>
Custom loop training
 - used when the 'fit()' method may not be flexible enough for what you need to do
 - example: Wide & Deep uses two different optimizers: one for the wide path and the other for the deep path
 - Unless you really need the extra flexibility, you should use the 'fit()' method <br>

__book answer__: Writing your own custom training loop is fairly advanced, so you should only do it if you really need to. Keras provides several tools to customize training without having to write a custom training loop: callbacks, custom regularizers, custom constraints, custom losses, and so on. You should use these instead of writing a custom training loop whenever possible: writing a custom training loop is more error-prone, and it will be harder to reuse the custom code you write. However, in some cases writing a custom training loop is necessary⁠—for example, if you want to use different optimizers for different parts of your neural network, like in the [Wide & Deep paper](https://homl.info/widedeep). A custom training loop can also be useful when debugging, or when trying to understand exactly how training works.

9. Can custom Keras components contain arbitrary Python code, or must they be convertible to TF functions?

-> For performance, portability, and functionality reasons, Python code in Keras components should be convertible to TF functions, following the rules provided on pages 437 - 438 (and below). <br>
-> However, you can wrap arbitrary Python code in a `tf.py_function()` operation, but doing so will hinder performance. You can also tell Keras not to convert your Python functions to TF functions by setting `dynamic=True` when creating a custom layer or custom model, or by setting `run_eagerly=True` when calling the model's `compile()` method. Again, this will hinder performance.

page 434: <br>
tf.function() with Accelerated Linear Algebra (XLA) <br>
 - if you set jit_compile=True when calling tf.function(), then TensorFlow will use Accelerated Linear Algebra (XLA) to compile dedicated kernels for your graph, often fusing multiple operations
   - for example, XLA can compute tf.reduce_sum(a * b + c) in one step instead of 3 steps

custom functions with Keras models
 - when you write a custom loss function, a custom metric function, a custom layer, or any other custom function and use it with a Keras model, Keras automatically converts your function into a TF function (no need to use tf.function())
 - if you want Keras to use XLA, you just need to set jit_compile=True when calling the compile() method
 - you can tell Keras not to convert your Python functions to TF functions by setting `dynamic=True` when creating a custom layer or custom model, or by setting `run_eagerly=True` when calling the model's `compile()` method <br>


page 437 - 438: <br>
tf.function:
   - converting a Python function that performs TensorFlow operations into a TF function usually just requires decorating it with '@tf.function'

Converting to 'tf.function' rules:
- External libraries
   - calls to external libraries, including NumPy and the Python standard library, run only during tracing; they will not be part of the graph
   - examples: use tf.reduce_sum() instead of 'np.sum()', tf.sort() instead of the built-in 'sorted()' function, and so on

- random numbers
   - if you define a TF function f(x) that just returns 'np.random.rand()', a random number will be generated only when the function is traced, so f(tf.constant(2.)) and f(tf.constant(3.)) will return the same random number, but f(tf.constant([2., 3.])) will return a different number (because the input shape is different, which triggers a new trace)
   - to generate a new random number at each call, replace 'np.random.rand()' with 'tf.random.uniform([])'

- code side effects
   - if your non-TensorFlow code has side effects (e.g. logging, updating a Python counter), the side effects will occur only when the function is traced
- wrap python code in 'tf.py_function()'
   - you can wrap arbitrary Python code in a 'tf.py_function()' operation, but doing so will hinder performance

- Calling other Python Functions or TF functions
   - you can call other Python functions or TF functions, but they should follow the same rules, as TensorFlow will capture their operations in the computation graph
   - other functions do NOT need to be decorated

- TensorFlow variables
   - if the function creates a TensorFlow variable (or dataset, queue, etc.), it must do so upon the very first call, and only then
   - it is preferable to create variables outside of the TF function (e.g. in the build() method of a custom layer)
   - if you want to assign a new value to a variable, make sure you call its 'assign()' method instead of using the '=' operator

- Python Source code
   - the source code of your Python function should be available to TensorFlow
   - if the source code is unavailable, then the graph generation process will fail or have limited functionality

- Vectorized implementations
   - for performance reasons, you should prefer a vectorized implementation whenever you can, rather than using loops
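
The tracing rules about external libraries, random numbers, and side effects can be observed with a small experiment. This is a sketch: the function name `f` and the `trace_count` counter are made up for the illustration.

```python
import numpy as np
import tensorflow as tf

trace_count = 0

@tf.function
def f(x):
    global trace_count
    trace_count += 1  # Python side effect: runs only while the function is traced
    # np.random.rand() is evaluated once per trace and baked into the graph as a constant
    return tf.zeros_like(x) + np.random.rand()

a = f(tf.constant(2.))
b = f(tf.constant(3.))         # same dtype and shape: the existing graph is reused
traces_after_scalars = trace_count
c = f(tf.constant([2., 3.]))   # new input shape: triggers a new trace
traces_after_vector = trace_count

@tf.function
def g():
    return tf.random.uniform([])  # a TensorFlow op: evaluated at every call

print(a.numpy(), b.numpy(), traces_after_scalars, traces_after_vector)
print(g().numpy(), g().numpy())
```

The first two calls return the same number and increment the counter only once, while `g()` draws a fresh number at each call.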


__book answer__: Custom Keras components should be convertible to TF Functions, which means they should stick to TF operations as much as possible and respect all the rules listed in Chapter 12 (in the _TF Function Rules_ section). If you absolutely need to include arbitrary Python code in a custom component, you can either wrap it in a `tf.py_function()` operation (but this will reduce performance and limit your model's portability) or set `dynamic=True` when creating the custom layer or model (or set `run_eagerly=True` when calling the model's `compile()` method).
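
As a sketch of the `tf.py_function()` escape hatch mentioned above, the example below wraps a NumPy computation so that it can be called from a TF function. The helper names `numpy_median` and `median_op` are hypothetical. The wrapped code runs in Python at every call, which is why this costs performance and portability.

```python
import numpy as np
import tensorflow as tf

def numpy_median(x):
    # Arbitrary Python/NumPy code: x arrives as an eager tensor
    return np.median(x.numpy()).astype(np.float32)

@tf.function
def median_op(x):
    # The graph contains a single opaque op that calls back into Python
    y = tf.py_function(func=numpy_median, inp=[x], Tout=tf.float32)
    y.set_shape([])  # py_function cannot infer the output shape by itself
    return y

result = median_op(tf.constant([1., 5., 3.]))
print(result.numpy())
```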

10. What are the main rules to respect if you want a function to be convertible to a TF function?

-> see __TF Function Rules__ section on pages 437 - 438 (and summarized in answer to question 9). <br>

__book answer__: Please refer to Chapter 12 for the list of rules to respect when creating a TF Function (in the _TF Function Rules_ section).

11. When would you need to create a dynamic Keras model? How do you do that? Why not make all your models dynamic?

-> Creating a dynamic Keras model can be useful for debugging, as it will not compile any custom component to a TF function. It can also be useful if you want to include arbitrary Python code in your model (or in your training code), including calls to external libraries. <br>
-> To make a model dynamic, you must set `dynamic=True` when creating it (e.g. `model = MyModel(dynamic=True)`). Alternatively, you can set `run_eagerly=True` when calling the model's `compile()` method (e.g. `model.compile(loss=my_mse, optimizer="nadam", run_eagerly=True)`). <br>
-> Making a model dynamic prevents Keras from using any of TensorFlow's graph features, so it will slow down training and inference, and you will not have the possibility to export the computation graph, which will limit your model's portability.


page 434: <br>
custom functions with Keras models
 - when you write a custom loss function, a custom metric function, a custom layer, or any other custom function and use it with a Keras model, Keras automatically converts your function into a TF function (no need to use tf.function())
 - if you want Keras to use XLA, you just need to set jit_compile=True when calling the compile() method
 - you can tell Keras not to convert your Python functions to TF functions by setting 'dynamic=True' when creating a custom layer or custom model, or by setting 'run_eagerly=True' when calling the model's compile() method <br>

https://keras.io/2.15/api/layers/base_layer/<br>
Layer class <br>
tf_keras.layers.Layer(trainable=True, name=None, dtype=None, dynamic=False, **kwargs) <br>
The base Layer class __dynamic__ argument: <br>
dynamic: Set this to True if your layer should only be run eagerly, and should not be used to generate a static computation graph. This would be the case for a Tree-RNN or a recursive network, for example, or generally for any layer that manipulates tensors using Python control flow. If False, we assume that the layer can safely be used to generate a static computation graph.

Model training APIs <br>
compile method <br>
Model.compile( optimizer="rmsprop", loss=None, loss_weights=None, metrics=None, weighted_metrics=None, run_eagerly=False, steps_per_execution=1, jit_compile="auto", auto_scale_loss=True,) <br>

The Model `compile()` method's __run_eagerly__ argument: <br>
run_eagerly: Bool. If True, this model's forward pass will never be compiled. It is recommended to leave this as False when training (for best performance), and to set it to True when debugging.


__book answer__: Creating a dynamic Keras model can be useful for debugging, as it will not compile any custom component to a TF Function, and you can use any Python debugger to debug your code. It can also be useful if you want to include arbitrary Python code in your model (or in your training code), including calls to external libraries. To make a model dynamic, you must set `dynamic=True` when creating it. Alternatively, you can set `run_eagerly=True` when calling the model's `compile()` method. Making a model dynamic prevents Keras from using any of TensorFlow's graph features, so it will slow down training and inference, and you will not have the possibility to export the computation graph, which will limit your model's portability.
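
The effect of `run_eagerly=True` can be observed with a loss function that has a Python side effect. This is a minimal sketch on random data; the names `counting_mse` and `python_calls` are hypothetical. In eager mode the Python body of the loss runs at every training step, instead of only once during tracing.

```python
import numpy as np
import tensorflow as tf

python_calls = 0

def counting_mse(y_true, y_pred):
    global python_calls
    python_calls += 1  # arbitrary Python code: executed at each step in eager mode
    return tf.reduce_mean(tf.square(y_true - y_pred), axis=-1)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(loss=counting_mse, optimizer="sgd", run_eagerly=True)

X = np.random.rand(64, 3).astype(np.float32)
y = np.random.rand(64, 1).astype(np.float32)
model.fit(X, y, batch_size=16, epochs=1, verbose=0)  # 4 training steps

print(python_calls)  # at least one call per training step
```

Without `run_eagerly=True`, the counter would only increase while Keras traces the function, so it would not reflect the number of training steps.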

__12.__ Implement a custom layer that performs 'layer normalization' (we will use this type of layer in Chapter 15).

In [19]:
import numpy as np
import sklearn
import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
import matplotlib.pyplot as plt
from scipy.special import expit as sigmoid

plt.rc('font', size=14)
plt.rc('axes', labelsize=14, titlesize=14)
plt.rc('legend', fontsize=14)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)

In [2]:
from pathlib import Path

IMAGES_PATH = Path() / "images" / "deep"
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = IMAGES_PATH / f"{fig_id}.{fig_extension}"
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

__12 a.__
_Exercise: The `build()` method should define two trainable weights *α* and *β*, both of shape `input_shape[-1:]` and data type `tf.float32`. *α* should be initialized with 1s, and *β* with 0s._

__12 b.__
_Exercise: The `call()` method should compute the mean_ μ _and standard deviation_ σ _of each instance's features. For this, you can use `tf.nn.moments(inputs, axes=-1, keepdims=True)`, which returns the mean μ and the variance σ<sup>2</sup> of all instances (compute the square root of the variance to get the standard deviation). Then the function should compute and return *α*⊗(*X* - μ)/(σ + ε) + *β*, where ⊗ represents itemwise multiplication (`*`) and ε is a smoothing term (small constant to avoid division by zero, e.g., 0.001)._

In [26]:
class LayerNormalization(tf.keras.layers.Layer):
    def __init__(self, eps=0.001, **kwargs):
        super().__init__(**kwargs)
        self.eps = eps

    def build(self, batch_input_shape):
        self.alpha = self.add_weight(name="alpha", shape=batch_input_shape[-1:], initializer="ones")
        self.beta = self.add_weight(name="beta", shape=batch_input_shape[-1:], initializer="zeros")


    def call(self, X):
        mean, variance = tf.nn.moments(X, axes=-1, keepdims=True)
        return self.alpha * (X - mean) / (tf.math.sqrt(variance + self.eps)) + self.beta 

    def get_config(self):
        base_config = super().get_config()
        return {**base_config, "eps": self.eps}

Note that making ε a hyperparameter (eps) was not compulsory. Also note that it's preferable to compute tf.sqrt(variance + self.eps) rather than tf.sqrt(variance) + self.eps. Indeed, the derivative of sqrt(z) is undefined when z=0, so training will bomb whenever the variance vector has at least one component equal to 0. Adding ε within the square root guarantees that this will never happen.
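
The point about the square root can be checked numerically. This small sketch (variable names are arbitrary) computes the gradient of both formulations at a variance of exactly 0:

```python
import numpy as np
import tensorflow as tf

eps = 0.001
variance = tf.Variable(0.)  # worst case: an instance whose features are all equal

with tf.GradientTape(persistent=True) as tape:
    eps_outside = tf.sqrt(variance) + eps   # sqrt evaluated at exactly 0
    eps_inside = tf.sqrt(variance + eps)    # sqrt evaluated at eps > 0

grad_outside = tape.gradient(eps_outside, variance)
grad_inside = tape.gradient(eps_inside, variance)
del tape  # release the persistent tape

print(np.isinf(grad_outside.numpy()), grad_inside.numpy())
```

With ε outside the square root the gradient is infinite, which would propagate NaNs through training, whereas with ε inside it stays finite at 1 / (2·sqrt(ε)).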

__12 c.__
_Exercise: Ensure that your custom layer produces the same (or very nearly the same) output as the `tf.keras.layers.LayerNormalization` layer._

Load the California housing dataset to get some data to test the layers on:

In [24]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target.reshape(-1, 1), random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42)

Let's create one instance of each class, apply them to some data (e.g., the training set), and ensure that the difference is negligible.

In [27]:
X = X_train.astype(np.float32)

custom_layer_norm = LayerNormalization()
keras_layer_norm = tf.keras.layers.LayerNormalization()

tf.reduce_mean(tf.keras.losses.mean_absolute_error(
    keras_layer_norm(X), custom_layer_norm(X)))




<tf.Tensor: shape=(), dtype=float32, numpy=3.44626e-08>

Yep, that's close enough. To be extra sure, let's make alpha and beta completely random and compare again:

In [35]:
tf.keras.utils.set_random_seed(42)
random_alpha = np.random.rand(X.shape[-1])
random_beta = np.random.rand(X.shape[-1])

custom_layer_norm.set_weights([random_alpha, random_beta])
keras_layer_norm.set_weights([random_alpha, random_beta])

tf.reduce_mean(tf.keras.losses.mean_absolute_error(
    keras_layer_norm(X), custom_layer_norm(X)))

<tf.Tensor: shape=(), dtype=float32, numpy=1.6339667e-08>

Still a negligible difference! Our custom layer works fine.

__13.__ Train a model using a custom training loop to tackle the Fashion MNIST dataset.
_The Fashion MNIST dataset was introduced in Chapter 10._

__13 a.__
_Exercise: Display the epoch, iteration, mean training loss, and mean accuracy over each epoch (updated at each iteration), as well as the validation loss and accuracy at the end of each epoch._

In [38]:
import tensorflow as tf

fashion_mnist = tf.keras.datasets.fashion_mnist.load_data()
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist
X_train_full = X_train_full.astype(np.float32) / 255.
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test.astype(np.float32) / 255.

In [39]:
tf.keras.utils.set_random_seed(42)

In [40]:
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

Note: pydot was previously installed for lesson 10.

Message shown by the 'tf.keras.utils.plot_model()' call below before pydot was installed:

You must install pydot (pip install pydot) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) for plot_model to work.

To install:

!pip install pydot

restart kernel (Kernel -> Restart Kernel)

In '06_decision_trees.ipynb', graphviz was installed using:

conda install anaconda::graphviz



In [49]:
def random_batch(X, y, batch_size=32):
    idx = np.random.randint(len(X), size=batch_size)
    return X[idx], y[idx]

In [51]:
n_epochs = 5
batch_size = 32
n_steps = len(X_train) // batch_size
optimizer = tf.keras.optimizers.Nadam(learning_rate=0.01)
loss_fn = tf.keras.losses.sparse_categorical_crossentropy
mean_loss = tf.keras.metrics.Mean()
metrics = [tf.keras.metrics.SparseCategoricalAccuracy()]

In [53]:
from tqdm.notebook import trange
from collections import OrderedDict

In [54]:
with trange(1, n_epochs + 1, desc="All epochs") as epochs:
    for epoch in epochs:
        with trange(1, n_steps + 1, desc=f"Epoch {epoch}/{n_epochs}") as steps:
            for step in steps:
                X_batch, y_batch = random_batch(X_train, y_train)
                with tf.GradientTape() as tape:
                    y_pred = model(X_batch)
                    main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
                    loss = tf.add_n([main_loss] + model.losses)
                gradients = tape.gradient(loss, model.trainable_variables)
                optimizer.apply_gradients(zip(gradients, model.trainable_variables))
                for variable in model.variables:
                    if variable.constraint is not None:
                        variable.assign(variable.constraint(variable))
                status = OrderedDict()
                mean_loss(loss)
                status["loss"] = mean_loss.result().numpy()
                for metric in metrics:
                    metric(y_batch, y_pred)
                    status[metric.name] = metric.result().numpy()
                steps.set_postfix(status)
            y_pred = model(X_valid)
            status["val_loss"] = np.mean(loss_fn(y_valid, y_pred))
            status["val_accuracy"] = np.mean(tf.keras.metrics.sparse_categorical_accuracy(
                tf.constant(y_valid, dtype=np.float32), y_pred))
            steps.set_postfix(status)
        for metric in [mean_loss] + metrics:
            metric.reset_state()  # reset_states() is the deprecated name


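
For reference, the validation loss and accuracy computed at the end of each epoch reduce to simple array operations. Here is a small NumPy sketch on made-up probabilities (the arrays below are illustrative, not model outputs). It ignores the clipping that Keras applies to avoid `log(0)`:

```python
import numpy as np

# Made-up class probabilities for 3 instances and 3 classes
y_proba = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.3, 0.4, 0.3]])
y_true = np.array([0, 1, 2])  # sparse (integer) labels

# Sparse categorical cross-entropy: -log of the probability of the true class
per_instance_loss = -np.log(y_proba[np.arange(len(y_true)), y_true])
val_loss = per_instance_loss.mean()

# Sparse categorical accuracy: fraction of instances whose argmax is the label
val_accuracy = (y_proba.argmax(axis=1) == y_true).mean()
```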

__13 b.__
_Exercise: Try using a different optimizer with a different learning rate for the upper layers and the lower layers._

In [56]:
tf.keras.utils.set_random_seed(42)

In [57]:
lower_layers = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(100, activation="relu"),
])
upper_layers = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="softmax"),
])
model = tf.keras.Sequential([
    lower_layers, upper_layers
])

In [58]:
lower_optimizer = tf.keras.optimizers.SGD(learning_rate=1e-4)
upper_optimizer = tf.keras.optimizers.Nadam(learning_rate=1e-3)
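
The idea behind using two optimizers is that each group of variables gets its own update rule and learning rate. Here is a minimal NumPy sketch of that idea on a toy two-parameter model. It uses plain gradient descent for both groups as a simplification (the notebook uses SGD and Nadam), and all names and values are illustrative assumptions:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = 6.0 * x                      # target: the product of both weights should be 6

w_lower, w_upper = 1.0, 1.0      # one scalar "layer" each
lr_lower, lr_upper = 1e-4, 1e-3  # same learning rates as in the notebook

def mse(w_lower, w_upper):
    return np.mean((w_upper * w_lower * x - y) ** 2)

initial_loss = mse(w_lower, w_upper)
for step in range(100):
    error = w_upper * w_lower * x - y
    # Both gradients come from the same forward pass
    grad_lower = np.mean(2 * error * w_upper * x)
    grad_upper = np.mean(2 * error * w_lower * x)
    # Each group is updated with its own learning rate
    w_lower -= lr_lower * grad_lower
    w_upper -= lr_upper * grad_upper
final_loss = mse(w_lower, w_upper)
```

Because `lr_upper` is ten times larger than `lr_lower`, the upper weight moves much further from its initial value than the lower one over the same number of steps.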

In [59]:
n_epochs = 5
batch_size = 32
n_steps = len(X_train) // batch_size
loss_fn = tf.keras.losses.sparse_categorical_crossentropy
mean_loss = tf.keras.metrics.Mean()
metrics = [tf.keras.metrics.SparseCategoricalAccuracy()]

In [60]:
with trange(1, n_epochs + 1, desc="All epochs") as epochs:
    for epoch in epochs:
        with trange(1, n_steps + 1, desc=f"Epoch {epoch}/{n_epochs}") as steps:
            for step in steps:
                X_batch, y_batch = random_batch(X_train, y_train, batch_size)
                # persistent=True lets tape.gradient() be called once per optimizer below
                with tf.GradientTape(persistent=True) as tape:
                    y_pred = model(X_batch)
                    main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
                    loss = tf.add_n([main_loss] + model.losses)
                
                for layers, optimizer in ((lower_layers, lower_optimizer),
                                          (upper_layers, upper_optimizer)):
                    gradients = tape.gradient(loss, layers.trainable_variables)
                    optimizer.apply_gradients(zip(gradients, layers.trainable_variables))
                del tape
                
                for variable in model.variables:
                    if variable.constraint is not None:
                        variable.assign(variable.constraint(variable))
                
                status = OrderedDict()
                mean_loss(loss)
                status["loss"] = mean_loss.result().numpy()
                for metric in metrics:
                    metric(y_batch, y_pred)
                    status[metric.name] = metric.result().numpy()
                
                steps.set_postfix(status)
            
            y_pred = model(X_valid)
            status["val_loss"] = np.mean(loss_fn(y_valid, y_pred))
            status["val_accuracy"] = np.mean(tf.keras.metrics.sparse_categorical_accuracy(
                tf.constant(y_valid, dtype=np.float32), y_pred))
            
            steps.set_postfix(status)
        
        for metric in [mean_loss] + metrics:
            metric.reset_state()  # reset_states() is the deprecated name
