1. How would you describe TensorFlow in a short sentence? What are its
main features? Can you name other popular deep learning libraries?
>TensorFlow is an open-source library for numerical computation, particularly well suited and fine-tuned for large-scale Machine Learning. Its core is similar to NumPy, but it also features GPU support, support for distributed computing, computation graph analysis and optimization capabilities (with a portable graph format that allows you to train a TensorFlow model in one environment and run it in another), an optimization API based on reverse-mode autodiff, and several powerful APIs such as tf.keras, tf.data, tf.image, tf.signal, and more. Other popular Deep Learning libraries include PyTorch, MXNet, Microsoft Cognitive Toolkit, Theano, Caffe2, and Chainer.

2. Is TensorFlow a drop-in replacement for NumPy? What are the main
differences between the two?
> Although TensorFlow offers most of the functionalities provided by NumPy, it is not a drop-in replacement, for a few reasons. First, the names of the functions are not always the same (for example, tf.reduce_sum() versus np.sum()). Second, some functions do not behave in exactly the same way (for example, tf.transpose() creates a transposed copy of a tensor, while NumPy's T attribute creates a transposed view, without actually copying any data). Lastly, NumPy arrays are mutable, while TensorFlow tensors are not (but you can use a tf.Variable if you need a mutable object).
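
The differences listed above can be checked in a few lines. This is only a minimal sketch: the array values and variable names are arbitrary and chosen for illustration.

```python
import numpy as np
import tensorflow as tf

a = np.array([[1., 2., 3.], [4., 5., 6.]])
t = tf.constant(a)

# Same reduction, different function names.
np_total = np.sum(a)
tf_total = tf.reduce_sum(t)

# NumPy's T attribute is a view: writing through it changes the original array.
a_T = a.T
a_T[0, 0] = 100.
shares_memory = np.shares_memory(a, a_T)

# Tensors are immutable, so item assignment raises a TypeError.
try:
    t[0, 0] = 100.
    tensor_is_mutable = True
except TypeError:
    tensor_is_mutable = False

# A tf.Variable can be modified in place through assign().
v = tf.Variable(t)
v[0, 0].assign(100.)
```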

3. Do you get the same result with tf.range(10) and
tf.constant(np.arange(10))?
> Both tf.range(10) and tf.constant(np.arange(10)) return a one-dimensional tensor containing the integers 0 to 9. However, the former uses 32-bit integers while the latter uses 64-bit integers. Indeed, TensorFlow defaults to 32 bits, while NumPy defaults to 64 bits.
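
A quick way to verify this is to inspect the dtypes directly. The sketch below assumes a platform where NumPy's default integer type is 64 bits (which is the usual case, but not guaranteed everywhere), so the second dtype is compared against whatever NumPy actually produced.

```python
import numpy as np
import tensorflow as tf

t_range = tf.range(10)
t_numpy = tf.constant(np.arange(10))

# tf.range() defaults to 32-bit integers.
print(t_range.dtype)
# tf.constant() keeps NumPy's integer type (64 bits on most platforms).
print(t_numpy.dtype)

# The values are identical once the dtypes are aligned.
same_values = bool(tf.reduce_all(tf.cast(t_range, t_numpy.dtype) == t_numpy))
```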

4. Can you name six other data structures available in TensorFlow,
beyond regular tensors?
> Beyond regular tensors, TensorFlow offers several other data structures, including sparse tensors, tensor arrays, ragged tensors, queues, string tensors, and sets. The last two are actually represented as regular tensors, but TensorFlow provides special functions to manipulate them (in tf.strings and tf.sets).
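
Here is a brief, non-exhaustive sketch of some of these data structures (queues are left out for brevity). The values are arbitrary and only meant to show how each structure is created and read back.

```python
import tensorflow as tf

# Sparse tensor: only the non-zero entries are stored.
sparse = tf.SparseTensor(indices=[[0, 1], [1, 2]], values=[1., 2.],
                         dense_shape=[2, 3])
dense = tf.sparse.to_dense(sparse)

# Ragged tensor: rows can have different lengths.
ragged = tf.ragged.constant([[1, 2, 3], [4]])
row_lengths = ragged.row_lengths()

# String tensor: a regular tensor of dtype tf.string, handled with tf.strings.
words = tf.constant(["tensor", "flow"])
lengths = tf.strings.length(words)

# Sets: regular tensors, handled with tf.sets (the result is a sparse tensor).
set_a = tf.constant([[1, 2, 3]])
set_b = tf.constant([[3, 4, 5]])
union = tf.sparse.to_dense(tf.sets.union(set_a, set_b))

# Tensor array: a list of tensors, written and read by index.
array = tf.TensorArray(dtype=tf.float32, size=2)
array = array.write(0, tf.constant([1., 2.]))
array = array.write(1, tf.constant([3., 4.]))
stacked = array.stack()
```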

5. You can define a custom loss function by writing a function or by
subclassing the tf.keras.losses.Loss class. When would you use
each option?
> When you want to define a custom loss function, in general you can just implement it as a regular Python function. However, if your custom loss function must support some hyperparameters (or any other state), then you should subclass the keras.losses.Loss class and implement the \__init__() and call() methods. If you want the loss function's hyperparameters to be saved along with the model, then you must also implement the get_config() method.
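
As an illustration, here is a sketch of a loss with one hyperparameter, written as a subclass. The Huber-style formula, the class name `HuberLoss`, and the `threshold` argument are choices made for this example, not something the answer above prescribes.

```python
import tensorflow as tf

class HuberLoss(tf.keras.losses.Loss):
    def __init__(self, threshold=1.0, **kwargs):
        super().__init__(**kwargs)
        self.threshold = threshold  # hyperparameter stored on the instance

    def call(self, y_true, y_pred):
        error = y_true - y_pred
        is_small_error = tf.abs(error) < self.threshold
        squared_loss = tf.square(error) / 2
        linear_loss = self.threshold * tf.abs(error) - self.threshold ** 2 / 2
        return tf.where(is_small_error, squared_loss, linear_loss)

    def get_config(self):
        # Needed so the threshold is saved along with the model.
        base_config = super().get_config()
        return {**base_config, "threshold": self.threshold}

loss_fn = HuberLoss(threshold=2.0)
y_true = tf.constant([[0.], [0.]])
y_pred = tf.constant([[1.], [4.]])
loss_value = loss_fn(y_true, y_pred)  # mean of the per-instance losses
```

If the loss had no hyperparameter, a plain function with the signature `(y_true, y_pred)` would be enough.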

6. Similarly, you can define a custom metric in a function or as a subclass
of tf.keras.metrics.Metric. When would you use each option?
> Much like custom loss functions, most metrics can be defined as regular Python functions. But if you want your custom metric to support some hyperparameters (or any other state), then you should subclass the keras.metrics.Metric class. Moreover, if computing the metric over a whole epoch is not equivalent to computing the mean metric over all batches in that epoch (e.g., as for the precision and recall metrics), then you should subclass the keras.metrics.Metric class and implement the \__init__(), update_state(), and result() methods to keep track of a running metric during each epoch. You should also implement the reset_state() method unless all it needs to do is reset all variables to 0.0. If you want the state to be saved along with the model, then you should implement the get_config() method as well.
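
The sketch below shows why a running metric matters for precision. The class name `StreamingPrecision`, its `threshold` argument, and the two toy batches are assumptions made for this example.

```python
import tensorflow as tf

class StreamingPrecision(tf.keras.metrics.Metric):
    def __init__(self, threshold=0.5, **kwargs):
        super().__init__(**kwargs)
        self.threshold = threshold
        self.true_positives = self.add_weight(name="tp", initializer="zeros")
        self.predicted_positives = self.add_weight(name="pp", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        # sample_weight is ignored in this simplified sketch.
        y_true = tf.cast(y_true, tf.float32)
        predicted = tf.cast(tf.cast(y_pred, tf.float32) >= self.threshold,
                            tf.float32)
        self.true_positives.assign_add(tf.reduce_sum(predicted * y_true))
        self.predicted_positives.assign_add(tf.reduce_sum(predicted))

    def result(self):
        return tf.math.divide_no_nan(self.true_positives,
                                     self.predicted_positives)

    def get_config(self):
        base_config = super().get_config()
        return {**base_config, "threshold": self.threshold}

metric = StreamingPrecision()
# Batch 1: 2 true positives out of 3 positive predictions (precision 2/3).
metric.update_state(tf.constant([1, 0, 1, 1]), tf.constant([0.9, 0.8, 0.7, 0.1]))
# Batch 2: 1 true positive out of 1 positive prediction (precision 1).
metric.update_state(tf.constant([1, 0, 0, 0]), tf.constant([0.9, 0.1, 0.2, 0.3]))
overall_precision = float(metric.result())  # 3 / 4, computed over both batches
mean_of_batches = (2 / 3 + 1) / 2           # what a naive per-batch mean gives
```

The overall precision (0.75) differs from the mean of the per-batch precisions (about 0.83), which is why this kind of metric has to track its own state.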

7. When should you create a custom layer versus a custom model?
> You should distinguish the internal components of your model (i.e., layers or reusable blocks of layers) from the model itself (i.e., the object you will train). The former should subclass the keras.layers.Layer class, while the latter should subclass the keras.models.Model class.
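
A minimal sketch of this split follows. The class names `DenseBlock` and `Regressor`, the layer sizes, and the random data are all hypothetical and only serve to show which base class each piece uses.

```python
import tensorflow as tf

class DenseBlock(tf.keras.layers.Layer):
    """Reusable internal component: a small stack of Dense layers."""
    def __init__(self, n_layers, n_neurons, **kwargs):
        super().__init__(**kwargs)
        self.hidden = [tf.keras.layers.Dense(n_neurons, activation="relu")
                       for _ in range(n_layers)]

    def call(self, inputs):
        Z = inputs
        for layer in self.hidden:
            Z = layer(Z)
        return Z

class Regressor(tf.keras.Model):
    """The object you train: it gets compile(), fit(), evaluate(), and so on."""
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.block = DenseBlock(n_layers=2, n_neurons=16)
        self.output_layer = tf.keras.layers.Dense(1)

    def call(self, inputs):
        return self.output_layer(self.block(inputs))

model = Regressor()
model.compile(loss="mse", optimizer="sgd")
X = tf.random.normal([64, 8])
y = tf.random.normal([64, 1])
history = model.fit(X, y, epochs=1, verbose=0)
predictions = model(X)
```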

8. What are some use cases that require writing your own custom training
loop?
> Writing your own custom training loop is fairly advanced, so you should only do it if you really need to. Keras provides several tools to customize training without having to write a custom training loop: callbacks, custom regularizers, custom constraints, custom losses, and so on. You should use these instead of writing a custom training loop whenever possible: writing a custom training loop is more error-prone, and it will be harder to reuse the custom code you write. However, in some cases writing a custom training loop is necessary, for example if you want to use different optimizers for different parts of your neural network, like in the Wide & Deep paper. A custom training loop can also be useful when debugging, or when trying to understand exactly how training works.

9. Can custom Keras components contain arbitrary Python code, or must
they be convertible to TF functions?
> Custom Keras components should be convertible to TF Functions, which means they should stick to TF operations as much as possible and respect all the rules listed in Chapter 12 (in the TF Function Rules section). If you absolutely need to include arbitrary Python code in a custom component, you can either wrap it in a tf.py_function() operation (but this will reduce performance and limit your model's portability) or set dynamic=True when creating the custom layer or model (or set run_eagerly=True when calling the model's compile() method).
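
For instance, a NumPy call can be wrapped in tf.py_function() as sketched below. The helper `numpy_log1p` and the function `wrapped` are hypothetical names introduced for this example. In real code, a native operation such as tf.math.log1p() would be preferable.

```python
import numpy as np
import tensorflow as tf

def numpy_log1p(x):
    # Arbitrary Python/NumPy code: x arrives here as an eager tensor.
    return np.log1p(x.numpy())

@tf.function
def wrapped(x):
    # The Python function becomes a single opaque operation in the graph.
    y = tf.py_function(func=numpy_log1p, inp=[x], Tout=tf.float32)
    y.set_shape(x.shape)  # py_function loses static shape information
    return y

result = wrapped(tf.constant([0., 1., 2.]))
```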

10. What are the main rules to respect if you want a function to be
convertible to a TF function?
> The main rules are the following:

* If you call any external library, including NumPy or even the standard
    library, this call will run only during tracing; it will not be part of the
    graph. Indeed, a TensorFlow graph can only include TensorFlow
    constructs (tensors, operations, variables, datasets, and so on). So,
    make sure you use tf.reduce_sum() instead of np.sum(),
    tf.sort() instead of the built-in sorted() function, and so on
    (unless you really want the code to run only during tracing). This has a
    few additional implications:
    * If you define a TF function f(x) that just returns
        np.random.rand(), a random number will only be generated
        when the function is traced, so f(tf.constant(2.)) and
        f(tf.constant(3.)) will return the same random number, but
        f(tf.constant([2., 3.])) will return a different one. If you
        replace np.random.rand() with tf.random.uniform([]), then
        a new random number will be generated upon every call, since
        the operation will be part of the graph.
    * If your non-TensorFlow code has side effects (such as logging
        something or updating a Python counter), then you should not
        expect those side effects to occur every time you call the TF
        function, as they will only occur when the function is traced.
    * You can wrap arbitrary Python code in a tf.py_function()
        operation, but doing so will hinder performance, as TensorFlow
        will not be able to do any graph optimization on this code. It will
        also reduce portability, as the graph will only run on platforms
        where Python is available (and where the right libraries are
        installed).

* You can call other Python functions or TF functions, but they should
follow the same rules, as TensorFlow will capture their operations in
the computation graph. Note that these other functions do not need to
be decorated with @tf.function.

* If the function creates a TensorFlow variable (or any other stateful
TensorFlow object, such as a dataset or a queue), it must do so upon
the very first call, and only then, or else you will get an exception. It is
usually preferable to create variables outside of the TF function (e.g.,
in the build() method of a custom layer). If you want to assign a new
value to the variable, make sure you call its assign() method instead
of using the = operator.

* The source code of your Python function should be available to
TensorFlow. If the source code is unavailable (for example, if you
define your function in the Python shell, which does not give access to
the source code, or if you deploy only the compiled *.pyc Python files
to production), then the graph generation process will fail or have
limited functionality.

* TensorFlow will only capture for loops that iterate over a tensor or a
tf.data.Dataset (see Chapter 13). Therefore, make sure you use for
i in tf.range(x) rather than for i in range(x), or else the loop
will not be captured in the graph. Instead, it will run during tracing.
(This may be what you want if the for loop is meant to build the
graph; for example, to create each layer in a neural network.)

* As always, for performance reasons, you should prefer a vectorized
implementation whenever you can, rather than using loops.
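
The tracing behavior described in the first rule can be observed with a small experiment. In this sketch, `trace_log` is a hypothetical Python list used to count how many times the function body actually runs, and the NumPy value is converted to a Python float before being added to the tensor.

```python
import numpy as np
import tensorflow as tf

trace_log = []

@tf.function
def f(x):
    trace_log.append("traced")       # Python side effect: runs only during tracing
    noise = float(np.random.rand())  # NumPy value: frozen into the graph as a constant
    return x + noise

a = f(tf.constant(2.))
b = f(tf.constant(3.))               # same shape and dtype: the existing graph is reused
traces_after_scalars = len(trace_log)
c = f(tf.constant([2., 3.]))         # new input shape: the function is traced again
traces_after_vector = len(trace_log)

@tf.function
def g(x):
    # The random op is part of the graph, so it runs on every call.
    return x + tf.random.uniform([])

g_first = g(tf.constant(2.))
g_second = g(tf.constant(2.))
```

Here `a - 2` and `b - 3` are the same random number (up to float rounding), while `g_first` and `g_second` will almost surely differ.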


11. When would you need to create a dynamic Keras model? How do you
do that? Why not make all your models dynamic?

> Creating a dynamic Keras model can be useful for debugging, as it will not compile any custom component to a TF Function, and you can use any Python debugger to debug your code. It can also be useful if you want to include arbitrary Python code in your model (or in your training code), including calls to external libraries. To make a model dynamic, you must set dynamic=True when creating it. Alternatively, you can set run_eagerly=True when calling the model's compile() method. Making a model dynamic prevents Keras from using any of TensorFlow's graph features, so it will slow down training and inference, and you will not have the possibility to export the computation graph, which will limit your model's portability.


12. Implement a custom layer that performs layer normalization (we will
use this type of layer in Chapter 15):

### a.
_Exercise: The `build()` method should define two trainable weights *α* and *β*, both of shape `input_shape[-1:]` and data type `tf.float32`. *α* should be initialized with 1s, and *β* with 0s._

### b.
_Exercise: The `call()` method should compute the mean_ μ _and standard deviation_ σ _of each instance's features. For this, you can use `tf.nn.moments(inputs, axes=-1, keepdims=True)`, which returns the mean μ and the variance σ<sup>2</sup> of all instances (compute the square root of the variance to get the standard deviation). Then the function should compute and return *α*⊗(*X* - μ)/(σ + ε) + *β*, where ⊗ represents itemwise multiplication (`*`) and ε is a smoothing term (small constant to avoid division by zero, e.g., 0.001)._

In [1]:
import tensorflow as tf
class LayerNormalization(tf.keras.layers.Layer):
    def __init__(self,eps=0.001,**kwargs):
        super().__init__(**kwargs)
        self.eps = eps

    def build(self,batch_input_shape):
        self.alpha = self.add_weight(
            name="alpha",shape=batch_input_shape[-1:],
            initializer="ones")
        self.beta = self.add_weight(
            name="beta",shape=batch_input_shape[-1:],
            initializer="zeros")
    
    def call(self,X):
        mean, variance = tf.nn.moments(X,axes=-1,keepdims=True)
        return self.alpha * (X - mean) / (tf.sqrt(variance + self.eps)) + self.beta
    
    def get_config(self):
        base_config = super().get_config()
        return {**base_config, "eps":self.eps}


> Note that making ε a hyperparameter (eps) was not compulsory. Also note that it's preferable to compute tf.sqrt(variance + self.eps) rather than tf.sqrt(variance) + self.eps. Indeed, the derivative of sqrt(z) is undefined when z = 0, so training will bomb whenever the variance vector has at least one component equal to 0. Adding ε within the square root guarantees that this will never happen.
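
The effect on the gradients can be checked numerically. This is a small sketch with a made-up variance vector containing one zero component; the variable names are chosen for the example.

```python
import tensorflow as tf

eps = 0.001
variance = tf.Variable([0.0, 4.0])  # first component is exactly zero

with tf.GradientTape(persistent=True) as tape:
    outside = tf.sqrt(variance) + eps   # ε added after the square root
    inside = tf.sqrt(variance + eps)    # ε added within the square root

grad_outside = tape.gradient(outside, variance)
grad_inside = tape.gradient(inside, variance)
del tape

outside_is_finite = bool(tf.reduce_all(tf.math.is_finite(grad_outside)))
inside_is_finite = bool(tf.reduce_all(tf.math.is_finite(grad_inside)))
```

With ε outside the square root, the gradient for the zero component is not finite, whereas with ε inside it stays finite for both components.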

### c.
_Exercise: Ensure that your custom layer produces the same (or very nearly the same) output as the `tf.keras.layers.LayerNormalization` layer._

Let's create one instance of each class, apply them to some data (e.g., the training set), and ensure that the difference is negligible.

In [2]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()

X_train_full,X_test,y_train_full,y_test = train_test_split(
    housing.data, housing.target.reshape(-1,1),random_state=42
)

X_train,X_valid,y_train,y_valid = train_test_split(
    X_train_full,y_train_full,random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
X_test_scaled = scaler.transform(X_test)

In [3]:
import numpy as np

X = X_train.astype(np.float32)

custom_layer_norm = LayerNormalization()
keras_layer_norm = tf.keras.layers.LayerNormalization()

tf.reduce_mean(tf.keras.losses.MeanAbsoluteError()(
    keras_layer_norm(X), custom_layer_norm(X)
))

<tf.Tensor: shape=(), dtype=float32, numpy=2.9945699253630664e-08>

Yep, that's close enough. To be extra sure, let's make alpha and beta completely random and compare again:

In [4]:
tf.keras.utils.set_random_seed(42)

random_alpha = np.random.rand(X.shape[-1])
random_beta = np.random.rand(X.shape[-1])

custom_layer_norm.set_weights([random_alpha,random_beta])
keras_layer_norm.set_weights([random_alpha,random_beta])

tf.reduce_mean(tf.keras.losses.MeanAbsoluteError()(
    keras_layer_norm(X),custom_layer_norm(X)
))


<tf.Tensor: shape=(), dtype=float32, numpy=1.6172254646562578e-08>

Still a negligible difference! Our custom layer works fine.

13. Train a model using a custom training loop to tackle the Fashion
MNIST dataset:

### a.
_Exercise: Display the epoch, iteration, mean training loss, and mean accuracy over each epoch (updated at each iteration), as well as the validation loss and accuracy at the end of each epoch._

In [5]:
(X_train_full,y_train_full),(X_test,y_test) = tf.keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full.astype(np.float32) / 255.
X_valid,X_train = X_train_full[:5000],X_train_full[5000:]
y_valid,y_train = y_train_full[:5000],y_train_full[5000:]
X_test = X_test.astype(np.float32) / 255.

In [6]:
tf.keras.utils.set_random_seed(42)

In [8]:
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28,28]),
    tf.keras.layers.Dense(100,activation="relu"),
    tf.keras.layers.Dense(10,activation="softmax"),
])

In [9]:
n_epochs = 5
batch_size = 32
n_steps = len(X_train) // batch_size
optimizer = tf.keras.optimizers.Nadam(learning_rate=0.01)
loss_fn = tf.keras.losses.sparse_categorical_crossentropy
mean_loss = tf.keras.metrics.Mean()
metrics = [tf.keras.metrics.SparseCategoricalAccuracy()]

In [10]:
import numpy as np
def random_batch(X,y,batch_size=32):
    idx = np.random.randint(len(X), size=batch_size)
    return X[idx],y[idx]

In [11]:
from tqdm.notebook import trange
from collections import OrderedDict
with trange(1, n_epochs + 1, desc="All epochs") as epochs:
    for epoch in epochs:
        with trange(1, n_steps + 1, desc=f"Epoch {epoch}/{n_epochs}") as steps:
            for step in steps:
                X_batch, y_batch = random_batch(X_train, y_train, batch_size)
                # Forward pass, recorded on the tape
                with tf.GradientTape() as tape:
                    y_pred = model(X_batch)
                    main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
                    # Add the model's extra losses (e.g., regularization), if any
                    loss = tf.add_n([main_loss] + model.losses)
                # Backward pass and weight update
                gradients = tape.gradient(loss, model.trainable_variables)
                optimizer.apply_gradients(zip(gradients, model.trainable_variables))
                # Apply the weight constraints, if any
                for variable in model.variables:
                    if variable.constraint is not None:
                        variable.assign(variable.constraint(variable))
                # Update the running mean loss and metrics shown in the progress bar
                status = OrderedDict()
                mean_loss(loss)
                status["loss"] = mean_loss.result().numpy()
                for metric in metrics:
                    metric(y_batch, y_pred)
                    status[metric.name] = metric.result().numpy()
                steps.set_postfix(status)

            # End of epoch: evaluate on the validation set
            y_pred = model(X_valid)
            status["val_loss"] = np.mean(loss_fn(y_valid, y_pred))
            status["val_accuracy"] = np.mean(tf.keras.metrics.sparse_categorical_accuracy(
                tf.constant(y_valid, dtype=np.float32), y_pred))
            steps.set_postfix(status)

        # Reset the running metrics before the next epoch
        for metric in [mean_loss] + metrics:
            metric.reset_state()



### b.
_Exercise: Try using a different optimizer with a different learning rate for the upper layers and the lower layers._

In [12]:
tf.keras.utils.set_random_seed(42)

In [14]:
lower_layers = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28,28]),
    tf.keras.layers.Dense(100,activation="relu"),
])

upper_layers = tf.keras.Sequential([
    tf.keras.layers.Dense(10,activation="softmax"),
])

model = tf.keras.Sequential([
    lower_layers,upper_layers
])


In [15]:
lower_optimizer  = tf.keras.optimizers.SGD(learning_rate=1e-4)
upper_optimizer = tf.keras.optimizers.Nadam(learning_rate=1e-3)

In [16]:
n_epochs = 5
batch_size = 32
n_steps = len(X_train) // batch_size
loss_fn = tf.keras.losses.sparse_categorical_crossentropy
mean_loss = tf.keras.metrics.Mean()
metrics = [tf.keras.metrics.SparseCategoricalAccuracy()]

In [18]:
with trange(1, n_epochs + 1, desc="All epochs") as epochs:
    for epoch in epochs:
        with trange(1, n_steps + 1, desc=f"Epoch {epoch}/{n_epochs}") as steps:
            for step in steps:
                X_batch, y_batch = random_batch(X_train, y_train, batch_size)
                # The tape must be persistent since gradient() is called twice
                with tf.GradientTape(persistent=True) as tape:
                    y_pred = model(X_batch)
                    main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
                    loss = tf.add_n([main_loss] + model.losses)
                # Update each group of layers with its own optimizer
                for layers, optimizer in ((lower_layers, lower_optimizer),
                                          (upper_layers, upper_optimizer)):
                    gradients = tape.gradient(loss, layers.trainable_variables)
                    optimizer.apply_gradients(zip(gradients, layers.trainable_variables))
                del tape  # release the resources held by the persistent tape
                # Apply the weight constraints, if any
                for variable in model.variables:
                    if variable.constraint is not None:
                        variable.assign(variable.constraint(variable))

                # Update the running mean loss and metrics shown in the progress bar
                status = OrderedDict()
                mean_loss(loss)
                status["loss"] = mean_loss.result().numpy()
                for metric in metrics:
                    metric(y_batch, y_pred)
                    status[metric.name] = metric.result().numpy()
                steps.set_postfix(status)
            # End of epoch: evaluate on the validation set
            y_pred = model(X_valid)
            status["val_loss"] = np.mean(loss_fn(y_valid, y_pred))
            status["val_accuracy"] = np.mean(tf.keras.metrics.sparse_categorical_accuracy(
                tf.constant(y_valid, dtype=np.float32), y_pred))
            steps.set_postfix(status)

        # Reset the running metrics before the next epoch
        for metric in [mean_loss] + metrics:
            metric.reset_state()
