This is a companion notebook for the book [Deep Learning with Python, Second Edition](https://www.manning.com/books/deep-learning-with-python-second-edition?a_aid=keras&a_bid=76564dff). For readability, it only contains runnable code blocks and section titles, and omits everything else in the book: text paragraphs, figures, and pseudocode.

**If you want to be able to follow what's going on, I recommend reading the notebook side by side with your copy of the book.**

This notebook was generated for TensorFlow 2.6.

# The mathematical building blocks of neural networks

To provide sufficient context for introducing tensors and gradient descent, we’ll begin the
chapter with a practical example of a neural network. Then we’ll go over every new concept
that’s been introduced, point by point.

## The engine of neural networks: gradient-based optimization

In [None]:
output = relu(dot(input, W) + b)

In this expression, $W$ and $b$ are tensors that are attributes of the layer. They’re called the **weights** or **trainable parameters** of the layer (the kernel and bias attributes, respectively). These weights contain the information learned by the model from exposure to training data.
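As a rough illustration of the expression above, here is a minimal NumPy sketch. The shapes (a batch of 2 samples, 3 input features, 4 output units), the random values, and the helper name `relu` are assumptions made for this example; they are not taken from the book.

```python
import numpy as np

def relu(x):
    # Element-wise max(x, 0).
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)      # fixed seed, for reproducibility only
inputs = rng.random((2, 3))         # batch of 2 samples, 3 features each
W = rng.random((3, 4)) - 0.5        # kernel: (input features, output units)
b = np.zeros(4)                     # bias: one value per output unit

output = relu(np.dot(inputs, W) + b)
print(output.shape)  # (2, 4)
```

The kernel `W` and bias `b` are the only quantities that training would adjust; `inputs` is data.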

What comes next is to gradually adjust these weights, based on a feedback signal. This gradual adjustment, also called **training**, is basically the learning that machine learning is all about.

This happens within what’s called a **training loop**, which works as follows. Repeat these steps in a loop, until the loss seems sufficiently low:

1. Draw a batch of training samples `x` and corresponding targets `y_true`.
2. Run the model on `x` (a step called the forward pass) to obtain predictions `y_pred`.
3. Compute the loss of the model on the batch, a measure of the mismatch between `y_pred` and `y_true`.
4. Update all weights of the model in a way that slightly reduces the loss on this batch.

You’ll eventually end up with a model that has a very low loss on its training data: a low mismatch between predictions `y_pred` and expected targets `y_true`. The model has "learned" to map its inputs to correct targets. From afar, it may look like magic, but when you reduce it to elementary steps, it turns out to be simple.
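The four steps above can be sketched in plain Python on a toy problem. Everything in this example is an assumption chosen for illustration: the data comes from `y = 2x + 1`, the model is a single `w * x + b`, the loss is the mean squared error, and the gradients are derived by hand rather than computed automatically.

```python
import random

random.seed(0)  # illustrative seed

# Toy data generated from y = 2x + 1 (an assumption made for this sketch).
xs = [i / 10 for i in range(-20, 21)]
data = [(x, 2.0 * x + 1.0) for x in xs]

w, b = 0.0, 0.0          # the model's weights
learning_rate = 0.1
batch_size = 8
losses = []

for step in range(200):
    # 1. Draw a batch of training samples and targets.
    batch = random.sample(data, batch_size)
    # 2. Forward pass: compute predictions.
    preds = [w * x + b for x, _ in batch]
    # 3. Loss: mean squared error between predictions and targets.
    errors = [p - y for p, (_, y) in zip(preds, batch)]
    loss = sum(e * e for e in errors) / batch_size
    losses.append(loss)
    # 4. Update the weights in the direction that reduces the loss
    #    (gradients of the mean squared error, derived by hand).
    grad_w = sum(2 * e * x for e, (x, _) in zip(errors, batch)) / batch_size
    grad_b = sum(2 * e for e in errors) / batch_size
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w={w:.2f}, b={b:.2f}, final loss={losses[-1]:.4f}")
```

The learned `w` and `b` should end up close to the values used to generate the data, and the recorded losses should shrink over the course of the loop.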


### What's a derivative?

**Gradient descent is the optimization technique that powers modern neural networks**. Here’s the gist of it. All of the functions used in our models (such as dot or +) transform their input in a smooth and continuous way: if you look at $z = x + y$, for instance, a small change in $y$ only results in a small change in $z$, and if you know the direction of the change in $y$, you can infer the direction of the change in $z$. Mathematically, you’d say these functions are **differentiable**. If you chain together such functions, the bigger function you obtain is still differentiable. In particular, this applies to the function that maps the model’s coefficients to the loss of the model on a batch of data: a small change of the model’s coefficients results in a small, predictable change of the loss value. This enables you to use a mathematical operator called the **gradient** to describe how the loss varies as you move the model’s coefficients in different directions. If you compute this gradient, you can use it to move the coefficients (all at once in a single update, rather than one at a time) in a direction that decreases the loss.

![](https://drek4537l1klr.cloudfront.net/chollet2/v-7/Figures/function.png)
![](https://drek4537l1klr.cloudfront.net/chollet2/v-7/Figures/derivation.png)
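To make the idea of a local slope concrete, the following sketch estimates a derivative numerically. The function `f` and the helper `numerical_derivative` are hypothetical names chosen for this example.

```python
def f(x):
    # A smooth (differentiable) function chosen for illustration.
    return x ** 2 + 3 * x

def numerical_derivative(func, x, epsilon=1e-6):
    # Central difference: slope of func in a small neighborhood of x.
    return (func(x + epsilon) - func(x - epsilon)) / (2 * epsilon)

x = 2.0
slope = numerical_derivative(f, x)   # analytically, f'(x) = 2x + 3 = 7
print(round(slope, 4))

# A small step against the sign of the slope decreases f.
step = 0.01
assert f(x - step * slope) < f(x)
```

Knowing the slope at `x` tells you which way to move `x` to make `f(x)` smaller, which is the core idea behind gradient descent.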

### Derivative of a tensor operation: the gradient

The concept of derivation can be applied to any function, as long as the surface it describes is continuous and smooth. The derivative of a tensor operation (or tensor function) is called a **gradient**. Gradients are just the generalization of the concept of derivatives to functions that take tensors as inputs. Remember how, for a scalar function, the derivative represents the local slope of the curve of the function? In just the same way, the **gradient of a tensor function represents the local slope of the multidimensional surface described by the function**: it characterizes how the output of the function varies when its input parameters vary.

![](https://miro.medium.com/max/1200/1*7030GXGlVD-u9VyqVJdTyw.png)
![](https://blog.paperspace.com/content/images/2018/05/challenges-1.png)

![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1200%2F1*kbP1fkMFrcjwBO3NVP9OfQ.png&f=1&nofb=1)
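The sketch below illustrates the gradient of a function that takes a tensor as input, using NumPy and finite differences. The loss function, the shapes, and the helper `numerical_gradient` are assumptions made for this example, and perturbing one coefficient at a time is far too slow for real models; it is only meant to show that the gradient has the same shape as the input tensor.

```python
import numpy as np

def loss_fn(W, x, y_true):
    # Squared error of a linear model; maps the matrix W to a single scalar.
    y_pred = x @ W
    return float(np.sum((y_pred - y_true) ** 2))

def numerical_gradient(fn, W, epsilon=1e-6):
    # Perturb one coefficient at a time and measure the change in the output.
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += epsilon
        W_minus[idx] -= epsilon
        grad[idx] = (fn(W_plus) - fn(W_minus)) / (2 * epsilon)
    return grad

rng = np.random.default_rng(0)
x = rng.random((4, 3))        # 4 samples, 3 features (illustrative values)
y_true = rng.random((4, 2))   # 2 outputs per sample
W = rng.random((3, 2))

grad = numerical_gradient(lambda w: loss_fn(w, x, y_true), W)
analytic = 2 * x.T @ (x @ W - y_true)   # closed-form gradient for this loss

print(grad.shape)                        # (3, 2): same shape as W
assert np.allclose(grad, analytic, atol=1e-4)

# Moving W a small step against the gradient lowers the loss.
W_new = W - 0.01 * grad
assert loss_fn(W_new, x, y_true) < loss_fn(W, x, y_true)
```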

### Stochastic gradient descent

The training loop described earlier is called **mini-batch stochastic gradient descent** (mini-batch SGD). The term *stochastic* refers to the fact that each batch of data is drawn at random (*stochastic* is a scientific synonym of *random*). Figure 2.18 illustrates what happens in 1D, when the model has only one parameter and you have only one training sample.

![](https://drek4537l1klr.cloudfront.net/chollet2/v-7/Figures/ch02-sgd_explained_1.png)


As you can see, it’s important to pick a reasonable value for the `learning_rate` factor. If it’s too small, the descent down the curve will take many iterations, and it could get stuck in a local minimum. If `learning_rate` is too large, your updates may end up taking you to completely random locations on the curve.
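The effect of the step size can be seen on a one-parameter toy problem. This sketch minimizes `f(w) = w ** 2`, a function chosen for illustration; since it has a single minimum at 0, it shows slow convergence and divergence but not the local-minimum issue. The helper name `run_gradient_descent` and the three rates are assumptions.

```python
def run_gradient_descent(learning_rate, steps=20, w=1.0):
    # Minimize f(w) = w ** 2, whose derivative is 2 * w.
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

for lr in (0.001, 0.1, 1.1):
    print(f"learning_rate={lr}: w after 20 steps = {run_gradient_descent(lr):.4f}")
```

With the smallest rate, `w` has barely moved from its starting value of 1 after 20 steps; with the middle rate it is close to the minimum at 0; with the largest rate each update overshoots and `w` moves further away from 0.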

Note that a variant of the mini-batch SGD algorithm would be to draw a single sample and target at each iteration, rather than drawing a batch of data. This would be **true SGD** (as opposed to mini-batch SGD). Alternatively, going to the opposite extreme, you could run every step on all data available, which is called **batch gradient descent**. Each update would then be more accurate, but far more expensive. The efficient compromise between these two extremes is to use mini-batches of reasonable size.
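In code, the three variants differ only in how many samples are drawn per update. The following plain-Python sketch makes that explicit on an assumed toy dataset (`y = 3x` plus noise) with a one-parameter model; the function names and hyperparameters are hypothetical.

```python
import random

random.seed(0)
# Toy dataset drawn from y = 3x plus noise (illustrative assumption).
data = [(x, 3.0 * x + random.gauss(0, 0.5)) for x in [i / 50 for i in range(1, 101)]]

def gradient_on_batch(w, batch):
    # Gradient of the mean squared error of y_pred = w * x on one batch.
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(batch_size, steps=300, learning_rate=0.05):
    w = 0.0
    for _ in range(steps):
        batch = random.sample(data, batch_size)
        w -= learning_rate * gradient_on_batch(w, batch)
    return w

w_true_sgd = train(batch_size=1)            # "true" SGD: one sample per step
w_mini_batch = train(batch_size=16)         # mini-batch SGD
w_full_batch = train(batch_size=len(data))  # batch gradient descent
print(w_true_sgd, w_mini_batch, w_full_batch)
```

All three runs should land near the slope used to generate the data. The single-sample run is the cheapest per step but the noisiest, while the full-batch run touches every sample at every step.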


### Chaining derivatives: the Backpropagation algorithm

#### The chain rule

#### Automatic differentiation with computation graphs

**Computation graphs** have been an extremely successful abstraction in computer science because they enable us to treat computation as data: a computable expression is encoded as a machine-readable data structure that can be used as the input or output of another program. For instance, you could imagine a program that receives a computation graph and returns a new computation graph that implements a large-scale distributed version of the same computation—this would mean that you could distribute any computation without having to write the distribution logic yourself. Or imagine… a program that receives a computation graph and can automatically generate the derivative of the expression it represents. It’s much easier to do these things if your computation is expressed as an explicit graph data structure rather than, say, lines of ASCII characters in a .py file.
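To illustrate what treating computation as data can look like, here is a small sketch in which an expression is stored as nested tuples and a second function returns a new graph for its derivative. The tuple encoding and the function names are assumptions made for this example. Note that this is symbolic differentiation on a toy graph, not the reverse-mode algorithm TensorFlow uses.

```python
def evaluate(node, env):
    # Forward evaluation of an expression graph given variable values.
    if isinstance(node, (int, float)):
        return node
    if isinstance(node, str):
        return env[node]
    op, left, right = node
    a, b = evaluate(left, env), evaluate(right, env)
    return a + b if op == "add" else a * b

def derivative(node, var):
    # Returns a NEW graph representing d(node)/d(var).
    if isinstance(node, (int, float)):
        return 0.0
    if isinstance(node, str):
        return 1.0 if node == var else 0.0
    op, left, right = node
    if op == "add":   # sum rule
        return ("add", derivative(left, var), derivative(right, var))
    # Product rule for "mul".
    return ("add",
            ("mul", derivative(left, var), right),
            ("mul", left, derivative(right, var)))

# y = w * x + b, encoded as data rather than as Python source.
graph = ("add", ("mul", "w", "x"), "b")
env = {"w": 3.0, "x": 2.0, "b": 1.0}

print(evaluate(graph, env))                     # 7.0
print(evaluate(derivative(graph, "w"), env))    # 2.0
```

Because the expression is an ordinary data structure, `derivative` can inspect and transform it, which would be much harder if the computation only existed as lines of source code.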

A useful way to think about backpropagation is in terms of computation graphs. A computation graph is the data structure at the heart of TensorFlow and the deep learning revolution in general. It’s a directed acyclic graph of operations (in our case, tensor operations). For instance, this is the graph representation of our first model:

![](https://drek4537l1klr.cloudfront.net/chollet2/v-7/Figures/a_first_computation_graph.png)

The computation graph representation of our two-layer model.


To explain backpropagation clearly, let’s look at a really basic example of a computation graph. We’ll consider a simplified version of the graph above, where we only have one linear layer and where all variables are scalar. We’ll take two scalar variables $w$, $b$, a scalar input $x$, and apply some operations to them to combine into an output $y$. Finally, we’ll apply an absolute value error loss function: `loss_val = abs(y_true - y)`. Since we want to update $w$ and $b$ in a way that would minimize `loss_val`, we are interested in computing `grad(loss_val, b)` and `grad(loss_val, w)`.

![](https://drek4537l1klr.cloudfront.net/chollet2/v-7/Figures/basic_computation_graph.png)

Let’s set concrete values for the "input nodes" in the graph, that is to say, the input $x$, the target `y_true`, $w$, and $b$. We propagate these values to all nodes in the graph, from top to bottom, until we reach `loss_val`. This is the **forward pass**.

![](https://drek4537l1klr.cloudfront.net/chollet2/v-7/Figures/basic_computation_graph_with_values.png)

Now let’s "reverse" the graph: for each edge in the graph going from $a$ to $b$, we will create an opposite edge from $b$ to $a$, and ask "how much does $b$ vary when $a$ varies?" That is to say, what is `grad(b, a)`? We’ll annotate each inverted edge with this value. This backward graph represents the **backward pass**.

We have:
- `grad(loss_val, x2) = 1`, because as $x2$ varies by an amount epsilon, `loss_val = abs(4 - x2)` varies by the same amount.
- `grad(x2, x1) = 1`, because as $x1$ varies by an amount epsilon, $x2 = x1 + b = x1 + 1$ varies by the same amount.
- `grad(x2, b) = 1`, because as $b$ varies by an amount epsilon, $x2 = x1 + b = 6 + b$ varies by the same amount.
- `grad(x1, w) = 2`, because as $w$ varies by an amount epsilon, $x1 = x * w = 2 * w$ varies by `2 * epsilon`.

![](https://drek4537l1klr.cloudfront.net/chollet2/v-7/Figures/basic_computation_graph_backward.png)

What the chain rule says about this backward graph is that you can obtain the derivative of a node with respect to another node by **multiplying the derivatives for each edge along the path linking the two nodes**. 
For instance, 

`grad(loss_val, w) = grad(loss_val, x2) * grad(x2, x1) * grad(x1, w)`

![](https://drek4537l1klr.cloudfront.net/chollet2/v-7/Figures/path_in_backward_graph.png)
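The chain-rule product for the backward graph above can be worked through in plain Python. The concrete values (`x = 2`, `w = 3`, `b = 1`, `y_true = 4`) are the ones implied by the expressions in the text; the finite-difference check and the variable names are additions made for this sketch.

```python
# Values implied by the text: x = 2, w = 3, b = 1, y_true = 4.
x, w, b, y_true = 2.0, 3.0, 1.0, 4.0

def forward(w, b):
    x1 = x * w                 # 6
    x2 = x1 + b                # 7
    return abs(y_true - x2)    # loss_val = 3

# Local derivatives on each backward edge.
grad_loss_x2 = 1.0   # loss_val = abs(4 - x2) grows with x2 when x2 > 4
grad_x2_x1 = 1.0
grad_x2_b = 1.0
grad_x1_w = x        # 2

# Chain rule: multiply along the path from loss_val to each weight.
grad_loss_w = grad_loss_x2 * grad_x2_x1 * grad_x1_w   # 1 * 1 * 2 = 2
grad_loss_b = grad_loss_x2 * grad_x2_b                # 1 * 1 = 1

# Sanity check with finite differences.
eps = 1e-6
numeric_w = (forward(w + eps, b) - forward(w - eps, b)) / (2 * eps)
numeric_b = (forward(w, b + eps) - forward(w, b - eps)) / (2 * eps)
print(grad_loss_w, numeric_w)
print(grad_loss_b, numeric_b)
```

The products obtained by following the paths in the backward graph should agree with the numerical estimates up to floating-point error.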

#### The Gradient Tape in TensorFlow

The API through which you can leverage TensorFlow’s powerful automatic differentiation capabilities is the `GradientTape`. It’s a Python scope that will "record" the tensor operations that run inside it, in the form of a computation graph (sometimes called a "tape"). This graph can then be used to retrieve the gradient of any output with respect to any variable or set of variables (instances of the `tf.Variable` class). A `tf.Variable` is a specific kind of tensor meant to hold mutable state. For instance, the weights of a neural network are always `tf.Variable` instances.

In [24]:
import tensorflow as tf

x = tf.Variable(0.)
with tf.GradientTape() as tape:
    y = 2 * x + 3
    
grad_of_y_wrt_x = tape.gradient(y, x)

grad_of_y_wrt_x

<tf.Tensor: shape=(), dtype=float32, numpy=2.0>

In [29]:
x = tf.Variable(tf.random.uniform((2, 2)))
with tf.GradientTape() as tape:
    y = 2 * x + 3
    
grad_of_y_wrt_x = tape.gradient(y, x)

grad_of_y_wrt_x

<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[2., 2.],
       [2., 2.]], dtype=float32)>

In [26]:
W = tf.Variable(tf.random.uniform((2, 2)))
b = tf.Variable(tf.zeros((2,)))
x = tf.random.uniform((2, 2))
with tf.GradientTape() as tape:
    y = tf.matmul(x, W) + b
    
grad_of_y_wrt_W_and_b = tape.gradient(y, [W, b])

grad_of_y_wrt_W_and_b

[<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
 array([[0.46594226, 0.46594226],
        [0.98007166, 0.98007166]], dtype=float32)>,
 <tf.Tensor: shape=(2,), dtype=float32, numpy=array([2., 2.], dtype=float32)>]
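As a further sketch, the scalar graph from the backpropagation walkthrough earlier in this section (`x = 2`, `w = 3`, `b = 1`, `y_true = 4`) can be handed to `GradientTape` to check the hand-computed gradients. Using `tf.abs` for the absolute value error is an assumption about how one might write that loss in TensorFlow.

```python
import tensorflow as tf

x = tf.constant(2.0)        # input: a constant, not a trainable variable
y_true = tf.constant(4.0)   # target
w = tf.Variable(3.0)
b = tf.Variable(1.0)

with tf.GradientTape() as tape:
    y = x * w + b
    loss_val = tf.abs(y_true - y)

# Gradients of the loss with respect to both weights, in one call.
grad_w, grad_b = tape.gradient(loss_val, [w, b])
print(grad_w.numpy(), grad_b.numpy())
```

The tape should return 2 for `w` and 1 for `b`, matching the values obtained by multiplying derivatives along the paths of the backward graph.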

## Looking back at our first example

![](https://drek4537l1klr.cloudfront.net/chollet2/v-7/Figures/deep-learning-in-3-figures-3_alt.png)

In [0]:
from tensorflow.keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype("float32") / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype("float32") / 255

In [0]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(512, activation="relu"),
    layers.Dense(10, activation="softmax")
])

In [0]:
model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

In [0]:
model.fit(train_images, train_labels, epochs=5, batch_size=128)

### Reimplementing our first example from scratch in TensorFlow

#### A simple Dense class

In [1]:
import tensorflow as tf

class NaiveDense:
    def __init__(self, input_size, output_size, activation):
        self.activation = activation

        w_shape = (input_size, output_size)
        w_initial_value = tf.random.uniform(w_shape, minval=0, maxval=1e-1)
        self.W = tf.Variable(w_initial_value)

        b_shape = (output_size,)
        b_initial_value = tf.zeros(b_shape)
        self.b = tf.Variable(b_initial_value)

    def __call__(self, inputs):
        return self.activation(tf.matmul(inputs, self.W) + self.b)

    @property
    def weights(self):
        return [self.W, self.b]

#### A simple Sequential class

In [2]:
class NaiveSequential:
    def __init__(self, layers):
        self.layers = layers

    def __call__(self, inputs):
        x = inputs
        for layer in self.layers:
            x = layer(x)
        return x

    @property
    def weights(self):
        weights = []
        for layer in self.layers:
            weights += layer.weights
        return weights

In [3]:
model = NaiveSequential([
    NaiveDense(input_size=28 * 28, output_size=512, activation=tf.nn.relu),
    NaiveDense(input_size=512, output_size=10, activation=tf.nn.softmax)
])
assert len(model.weights) == 4

#### A batch generator

In [7]:
import math

class BatchGenerator:
    def __init__(self, images, labels, batch_size=128):
        assert len(images) == len(labels)
        self.index = 0
        self.images = images
        self.labels = labels
        self.batch_size = batch_size
        self.num_batches = math.ceil(len(images) / batch_size)

    def next(self):
        images = self.images[self.index : self.index + self.batch_size]
        labels = self.labels[self.index : self.index + self.batch_size]
        self.index += self.batch_size
        return images, labels
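The slicing behavior of the generator above can be checked on a tiny input. The stand-alone helper `iterate_batches` below is hypothetical; it only mirrors the same slicing logic so that the example runs on its own.

```python
import math

def iterate_batches(samples, batch_size):
    # Hypothetical helper mirroring BatchGenerator's slicing logic.
    num_batches = math.ceil(len(samples) / batch_size)
    for i in range(num_batches):
        yield samples[i * batch_size : (i + 1) * batch_size]

batches = list(iterate_batches(list(range(10)), batch_size=4))
print([len(batch) for batch in batches])   # [4, 4, 2]
```

Two things are worth noticing: the last batch is smaller when the dataset size is not a multiple of the batch size, and samples are visited in their original order, so no shuffling takes place.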

### Running one training step

In [8]:
def one_training_step(model, images_batch, labels_batch):
    with tf.GradientTape() as tape:
        predictions = model(images_batch)
        per_sample_losses = tf.keras.losses.sparse_categorical_crossentropy(
            labels_batch, predictions)
        average_loss = tf.reduce_mean(per_sample_losses)
    gradients = tape.gradient(average_loss, model.weights)
    update_weights(gradients, model.weights)
    return average_loss

In [9]:
learning_rate = 1e-3

def update_weights(gradients, weights):
    for g, w in zip(gradients, weights):
        w.assign_sub(g * learning_rate)

In [10]:
from tensorflow.keras import optimizers

optimizer = optimizers.SGD(learning_rate=1e-3)

def update_weights(gradients, weights):
    optimizer.apply_gradients(zip(gradients, weights))

### The full training loop

In [11]:
def fit(model, images, labels, epochs, batch_size=128):
    for epoch_counter in range(epochs):
        print(f"Epoch {epoch_counter}")
        batch_generator = BatchGenerator(images, labels, batch_size)
        for batch_counter in range(batch_generator.num_batches):
            images_batch, labels_batch = batch_generator.next()
            loss = one_training_step(model, images_batch, labels_batch)
            if batch_counter % 100 == 0:
                print(f"loss at batch {batch_counter}: {loss:.2f}")

In [12]:
from tensorflow.keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype("float32") / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype("float32") / 255

fit(model, train_images, train_labels, epochs=10, batch_size=128)

Epoch 0
loss at batch 0: 5.38
loss at batch 100: 2.29
loss at batch 200: 2.24
loss at batch 300: 2.10
loss at batch 400: 2.27
Epoch 1
loss at batch 0: 1.93
loss at batch 100: 1.93
loss at batch 200: 1.86
loss at batch 300: 1.72
loss at batch 400: 1.87
Epoch 2
loss at batch 0: 1.60
loss at batch 100: 1.63
loss at batch 200: 1.54
loss at batch 300: 1.44
loss at batch 400: 1.54
Epoch 3
loss at batch 0: 1.34
loss at batch 100: 1.38
loss at batch 200: 1.28
loss at batch 300: 1.22
loss at batch 400: 1.30
Epoch 4
loss at batch 0: 1.14
loss at batch 100: 1.19
loss at batch 200: 1.07
loss at batch 300: 1.05
loss at batch 400: 1.13
Epoch 5
loss at batch 0: 0.99
loss at batch 100: 1.05
loss at batch 200: 0.92
loss at batch 300: 0.93
loss at batch 400: 1.01
Epoch 6
loss at batch 0: 0.87
loss at batch 100: 0.94
loss at batch 200: 0.81
loss at batch 300: 0.84
loss at batch 400: 0.92
Epoch 7
loss at batch 0: 0.79
loss at batch 100: 0.85
loss at batch 200: 0.73
loss at batch 300: 0.77
loss at batch 40

### Evaluating the model

In [14]:
import numpy as np

predictions = model(test_images)
predictions = predictions.numpy()
predicted_labels = np.argmax(predictions, axis=1)
matches = predicted_labels == test_labels
print(f"accuracy: {matches.mean():.2f}")

accuracy: 0.81
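The accuracy computation can be traced by hand on a small made-up example. The prediction matrix and labels below are hypothetical values, not outputs of the model trained above.

```python
import numpy as np

# Hypothetical softmax outputs for 4 samples and 3 classes.
predictions = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
])
true_labels = np.array([1, 0, 2, 2])

predicted_labels = np.argmax(predictions, axis=1)   # [1, 0, 2, 1]
matches = predicted_labels == true_labels           # [True, True, True, False]
print(f"accuracy: {matches.mean():.2f}")            # accuracy: 0.75
```

Taking the mean of a boolean array counts `True` as 1 and `False` as 0, so it directly yields the fraction of correct predictions.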
