In [64]:
import numpy as np
import time

import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras import models, layers
from tensorflow.keras.models import Sequential
from tensorflow.keras import optimizers

from tensorflow.keras.layers import Dense

Much as any computer program can be ultimately reduced to a small set of binary operations on binary inputs (AND, OR, NOR, and so on), all transformations learned by deep neural networks can be reduced to a handful of tensor operations applied to tensors of numeric data. For instance, it’s possible to add tensors, multiply tensors, and so on.

A Keras layer instance looks like this:

In [2]:
Dense(512, activation='relu')

<tensorflow.python.keras.layers.core.Dense at 0x7f521d507e10>

This layer can be interpreted as a function, which takes as input a matrix and returns another matrix — a new representation for the input tensor. Specifically, the function is as follows (where W is a matrix and b is a vector, both attributes of the layer).

We have three tensor operations here: a dot product (dot) between the input tensor and a tensor named W; an addition (+) between the resulting matrix and a vector b; and, finally, a relu operation, where relu(x) is max(x, 0).

In [3]:
# output = relu(dot(input, W) + b)
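
As a rough illustration, the three operations can be written out with plain NumPy. This is only a sketch: the shapes, the random values, and the name `dense_forward` are assumptions made for the example, and the input is taken to be a batch of row vectors, so the product is computed as `dot(inputs, W)`.

```python
import numpy as np

def dense_forward(inputs, W, b):
    """Sketch of a dense layer: relu(dot(inputs, W) + b)."""
    z = np.dot(inputs, W)      # dot product: (batch, in) . (in, out) -> (batch, out)
    z = z + b                  # addition: b is broadcast across the batch axis
    return np.maximum(z, 0.0)  # relu: element-wise max(x, 0)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(4, 3))   # hypothetical batch of 4 samples, 3 features
W = rng.normal(size=(3, 2))        # hypothetical weights for 2 output units
b = np.zeros(2)

outputs = dense_forward(inputs, W, b)
print(outputs.shape)
```

The output has one row per input sample and one column per output unit, and no entry is negative because of the relu.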

### Element-wise operations

The **relu** operation and **addition** are element-wise operations: operations that are applied independently to each entry in the tensors being considered. This means these operations are highly amenable to massively parallel implementations.

If you want to write a naive Python implementation of an element-wise operation, you use a for loop, as in this naive implementation of an element-wise **relu** operation:

In [4]:
def naive_relu(x):
    assert len(x.shape) == 2
    x = x.copy()
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] = max(x[i, j], 0)
    return x

In [5]:
def naive_add(x, y):
    assert len(x.shape) == 2
    assert x.shape == y.shape
    x = x.copy()
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] += y[i, j]
    return x

On the same principle, you can do element-wise multiplication, subtraction, and so on.
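
For instance, element-wise multiplication might look like the sketch below. The function name `naive_multiply` and the sample arrays are made up for this example; the structure simply mirrors `naive_add`.

```python
import numpy as np

def naive_multiply(x, y):
    """Element-wise product of two rank-2 arrays of the same shape."""
    assert len(x.shape) == 2
    assert x.shape == y.shape
    x = x.copy()  # avoid overwriting the caller's array
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] *= y[i, j]
    return x

a = np.array([[1., 2.], [3., 4.]])
b = np.array([[10., 20.], [30., 40.]])
product = naive_multiply(a, b)
print(product)
```

The result should agree with NumPy's built-in `a * b`.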

In practice, when dealing with NumPy arrays, these operations are available as well-optimized built-in NumPy functions, which themselves delegate the heavy lifting to a Basic Linear Algebra Subprograms (BLAS) implementation if you have one installed. BLAS are low-level, highly parallel, efficient tensor-manipulation routines that are typically implemented in Fortran or C.

In NumPy, you can do the following element-wise operation, and it will be blazing fast:

In [6]:
# z = x + y
# z = np.maximum(z, 0)

Time the difference:


In [7]:
x = np.random.random((20, 100))
y = np.random.random((20, 100))

time_start = time.time()

for _ in range(1000):
  z = x + y
  z = np.maximum(z, 0)

duration = time.time() - time_start
print(f"Duration: {duration} sec")

Duration: 0.009510040283203125 sec


In [8]:
time_start = time.time()
for _ in range(1000):
  z = naive_add(x, y)
  z = naive_relu(z)

duration = time.time() - time_start
print(f"Duration: {duration} sec")

Duration: 2.2906622886657715 sec
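
A variant of this comparison, sketched with the standard library's `timeit` module, also checks that both approaches produce the same values. The helper `naive_add_relu` is a hypothetical fusion of the two naive functions above, and the inputs are shifted by 0.5 (an assumption of this example) so that the relu actually clips some entries. Absolute timings depend on the machine.

```python
import timeit
import numpy as np

def naive_add_relu(x, y):
    """Loop-based relu(x + y) for rank-2 arrays, mirroring the naive functions above."""
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = max(x[i, j] + y[i, j], 0)
    return out

x = np.random.random((20, 100)) - 0.5  # shifted so that some sums are negative
y = np.random.random((20, 100)) - 0.5

same_result = np.allclose(naive_add_relu(x, y), np.maximum(x + y, 0))
print("Same result:", same_result)

# timeit.timeit returns the total time in seconds for `number` calls
fast = timeit.timeit(lambda: np.maximum(x + y, 0), number=100)
slow = timeit.timeit(lambda: naive_add_relu(x, y), number=100)
print(f"vectorized: {fast:.4f} sec, naive: {slow:.4f} sec")
```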


### Broadcasting

When possible, and if there’s no ambiguity, the smaller tensor will be broadcast to match the shape of the larger tensor. Broadcasting consists of two steps:

1. Axes (called broadcast axes) are added to the smaller tensor to match the ndim of the larger tensor.

2. The smaller tensor is repeated alongside these new axes to match the full shape of the larger tensor.

Example - Consider X with shape (32, 10) and y with shape (10,). First, we add an empty first axis to y, whose shape becomes (1, 10). Then, we repeat y 32 times alongside this new axis, so that we end up with a tensor Y with shape (32, 10), where Y[i, :] == y for i in range(0, 32). At this point, we can proceed to add X and Y, because they have the same shape.
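
The two steps of this example can be spelled out explicitly in NumPy. The sketch below materializes Y only for illustration; when NumPy broadcasts on its own, it does not allocate the repeated array.

```python
import numpy as np

X = np.random.random((32, 10))
y = np.random.random((10,))

y_expanded = np.expand_dims(y, axis=0)   # step 1: add a broadcast axis, shape (1, 10)
Y = np.repeat(y_expanded, 32, axis=0)    # step 2: repeat 32 times along it, shape (32, 10)

explicit = X + Y     # same-shape addition
implicit = X + y     # NumPy broadcasts y without materializing Y
print(Y.shape, np.allclose(explicit, implicit))
```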

In [9]:
def naive_add_matrix_and_vector(x, y):
    assert len(x.shape) == 2
    assert len(y.shape) == 1
    assert x.shape[1] == y.shape[0]
    x = x.copy()
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] += y[j]
    return x

In [10]:
x = np.random.random((64, 3, 32, 10))
y = np.random.random((32, 10))
z = np.maximum(x, y)

### Tensor product
The tensor product, or dot product (not to be confused with an element-wise product, the * operator) is one of the most common, most useful tensor operations.

In NumPy, a tensor product is done using the np.dot function (because the mathematical notation for tensor product is usually a dot).

In [11]:
x = np.random.random((32,))
y = np.random.random((32,))
z = np.dot(x, y)

In [12]:
z

7.009199012317854

In [13]:
# naive implementation of the dot product of two vectors
def naive_vector_dot(x, y):
    assert len(x.shape) == 1
    assert len(y.shape) == 1
    assert x.shape[0] == y.shape[0]
    z = 0.
    for i in range(x.shape[0]):
        z += x[i] * y[i]
    return z

In [14]:
zz = naive_vector_dot(x, y)

In [15]:
zz

7.009199012317853

In [16]:
# naive implementation of the dot product of a matrix and a vector
def naive_matrix_vector_dot(x, y):
    assert len(x.shape) == 2
    assert len(y.shape) == 1
    assert x.shape[1] == y.shape[0]
    z = np.zeros(x.shape[0])
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            z[i] += x[i, j] * y[j]
    return z

As soon as one of the two tensors has an ndim greater than 1, dot is no longer symmetric, which is to say that dot(x, y) isn’t the same as dot(y, x).
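
A small check of this asymmetry, using made-up arrays: swapping the arguments of a matrix-vector product changes which axes are contracted (and here fails outright), and swapping two square matrices gives a different result.

```python
import numpy as np

x = np.arange(6.).reshape((2, 3))   # matrix of shape (2, 3)
y = np.arange(3.)                   # vector of shape (3,)
print(np.dot(x, y).shape)           # one value per row of x

try:
    np.dot(y, x)                    # (3,) . (2, 3): inner sizes 3 and 2 differ
    swapped_ok = True
except ValueError as err:
    swapped_ok = False
    print("dot(y, x) fails:", err)

a = np.array([[1., 2.], [3., 4.]])
b = np.array([[0., 1.], [1., 0.]])
ab = np.dot(a, b)   # swaps the columns of a
ba = np.dot(b, a)   # swaps the rows of a
print(np.array_equal(ab, ba))
```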

The most common application may be the dot product between two matrices. You can take the dot product of two matrices x and y (dot(x, y)) if and only if x.shape[1] == y.shape[0], for example an (m, n) matrix with an (n, m) matrix. The result is a matrix with shape (x.shape[0], y.shape[1]), where the coefficients are the vector products between the rows of x and the columns of y. Here’s the naive implementation:

In [17]:
def naive_matrix_dot(x, y):
    assert len(x.shape) == 2
    assert len(y.shape) == 2
    assert x.shape[1] == y.shape[0]
    z = np.zeros((x.shape[0], y.shape[1]))
    for i in range(x.shape[0]):
        for j in range(y.shape[1]):
            row_x = x[i, :]
            column_y = y[:, j]
            z[i, j] = naive_vector_dot(row_x, column_y)
    return z
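
The shape rule can be checked directly with `np.dot`. The shapes (2, 3) and (3, 4) below are arbitrary choices for this sketch.

```python
import numpy as np

x = np.random.random((2, 3))
y = np.random.random((3, 4))
z = np.dot(x, y)
print(z.shape)  # (x.shape[0], y.shape[1])

# Each coefficient is the dot product of a row of x with a column of y
entry = np.dot(x[1, :], y[:, 2])
print(np.isclose(z[1, 2], entry))

# Incompatible inner dimensions are rejected
try:
    np.dot(y, x)  # (3, 4) . (2, 3): 4 != 2
    raised = False
except ValueError:
    raised = True
print(raised)
```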

### Tensor reshaping

Reshaping a tensor means rearranging its rows and columns to match a target shape. Naturally, the reshaped tensor has the same total number of coefficients as the initial tensor. Reshaping is best understood via simple examples:

In [18]:
x = np.array([[0., 1.],
              [2., 3.],
              [4., 5.]])
print(x.shape)

(3, 2)


In [19]:
x = x.reshape((6, 1))
x

array([[0.],
       [1.],
       [2.],
       [3.],
       [4.],
       [5.]])

In [20]:
x = x.reshape((2, 3))
x

array([[0., 1., 2.],
       [3., 4., 5.]])

A special case of reshaping that’s commonly encountered is transposition. Transposing a matrix means exchanging its rows and its columns, so that x[i, :] becomes x[:, i]:

In [21]:
x = np.zeros((300, 20))
print(x.shape)

(300, 20)


In [22]:
x = np.transpose(x)
print(x.shape)

(20, 300)
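
Note that transposing is not the same as calling reshape with the transposed shape. In this sketch, which reuses the small matrix from the reshaping example, both results have shape (2, 3) but their contents differ.

```python
import numpy as np

x = np.array([[0., 1.],
              [2., 3.],
              [4., 5.]])

transposed = np.transpose(x)     # rows become columns
reshaped = x.reshape((2, 3))     # same shape, but elements keep their row-major order

print(transposed)
print(reshaped)
```

Here `transposed[0]` holds the first column of `x`, while `reshaped[0]` holds the first three elements of `x` read row by row.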


### Geometric interpretation of tensor operations

Because the contents of the tensors manipulated by tensor operations can be interpreted as coordinates of points in some geometric space, all tensor operations have a geometric interpretation. For instance, adding a vector to a point translates that point by the vector.
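
As a small illustration of the geometric view of addition, the sketch below treats the rows of an array as 2D points and adds the same vector to each of them. The points and the offset are hypothetical values chosen for the example.

```python
import numpy as np

# Hypothetical 2D points forming a triangle (one point per row)
points = np.array([[0., 0.],
                   [1., 0.],
                   [0., 1.]])
offset = np.array([2., 3.])   # hypothetical translation vector

moved = points + offset       # offset is broadcast to every row

# Translation preserves the shape: the edge between two points is unchanged
same_edges = np.allclose(points[1] - points[0], moved[1] - moved[0])
print(moved)
print(same_edges)
```

Every point moves by the same amount in the same direction, so the triangle keeps its shape and orientation.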

### The engine of neural networks: gradient-based optimization

Derivative of a tensor operation: the gradient

Stochastic gradient descent

Chaining derivatives: the Backpropagation algorithm

The chain rule

The Gradient Tape in TensorFlow - The API through which you can leverage TensorFlow’s powerful automatic differentiation capabilities is the GradientTape.



In [23]:
x = tf.Variable(0.)
with tf.GradientTape() as tape:
  y = 2 * x + 3
grad_of_y_wrt_x = tape.gradient(y, x)

In [27]:
grad_of_y_wrt_x

<tf.Tensor: shape=(), dtype=float32, numpy=2.0>
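
The value 2.0 can be cross-checked without TensorFlow by approximating the derivative numerically. This is a plain-Python sketch; the helper `central_difference` and the step size `eps` are assumptions of the example and not part of any TensorFlow API.

```python
def f(x):
    return 2 * x + 3

def central_difference(func, x, eps=1e-4):
    """Approximate d(func)/dx at x as (func(x + eps) - func(x - eps)) / (2 * eps)."""
    return (func(x + eps) - func(x - eps)) / (2 * eps)

estimate = central_difference(f, 0.0)
print(estimate)
```

The estimate should be very close to 2, the slope of the line, up to floating-point rounding.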

In [25]:
W = tf.Variable(tf.random.uniform((2, 2)))
b = tf.Variable(tf.zeros((2,)))
x = tf.random.uniform((2, 2))
with tf.GradientTape() as tape:
    y = tf.matmul(W, x) + b
grad_of_y_wrt_W_and_b = tape.gradient(y, [W, b])

In [26]:
grad_of_y_wrt_W_and_b

[<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
 array([[0.9923886, 0.8159368],
        [0.9923886, 0.8159368]], dtype=float32)>,
 <tf.Tensor: shape=(2,), dtype=float32, numpy=array([2., 2.], dtype=float32)>]
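
The shape of these results can be explained by hand. When the target passed to `tape.gradient` is not a scalar, TensorFlow differentiates the sum of its entries. For y = matmul(W, x) + b, that makes every row of the W gradient equal to the row sums of x, and each entry of the b gradient equal to 2, because each b[k] is added to two entries of y. The NumPy sketch below reproduces this with its own random values (so the numbers differ from the output above) and checks one entry by finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.uniform(size=(2, 2))
b = np.zeros(2)
x = rng.uniform(size=(2, 2))

def summed_output(W, b):
    return np.sum(W @ x + b)   # scalar: sum of all entries of y

# Analytic gradients of the summed output
grad_W = np.tile(x.sum(axis=1), (2, 1))   # every row equals the row sums of x
grad_b = np.full(2, 2.0)                  # each b[k] is added to 2 rows of y

# Finite-difference check on one entry of W
eps = 1e-6
W_shifted = W.copy()
W_shifted[0, 1] += eps
numeric = (summed_output(W_shifted, b) - summed_output(W, b)) / eps
print(grad_W, grad_b, numeric)
```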

In [30]:
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


In [39]:
model = models.Sequential([
  layers.Dense(512, activation='relu'),
  layers.Dense(10, activation='softmax')
])

In [40]:
model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])  # metrics is passed as a list

In [41]:
model.fit(train_images, train_labels, epochs=5, batch_size=128)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f51e1ce2e10>

Implementing from scratch in TensorFlow

Let’s implement a simple Python class NaiveDense that creates two TensorFlow variables W and b, and exposes a `__call__` method that applies the above transformation.

In [68]:
class NaiveDense:

    def __init__(self, input_size, output_size, activation):
        self.activation = activation

        w_shape = (input_size, output_size) # create a matrix W of shape "(input_size, output_size)", initialized with random values
        w_initial_value = tf.random.uniform(w_shape, minval=0, maxval=1e-1)
        self.W = tf.Variable(w_initial_value)

        b_shape = (output_size,)  # create a vector b of shape (output_size,), initialized with zeros
        b_initial_value = tf.zeros(b_shape)
        self.b = tf.Variable(b_initial_value)

    def __call__(self, inputs): # apply the forward pass
        return self.activation(tf.matmul(inputs, self.W) + self.b)

    @property
    def weights(self):  # convenience method for retrieving the layer weights
        return [self.W, self.b]

A simple Sequential class - create a NaiveSequential class to chain these layers. It wraps a list of layers and exposes a `__call__` method that simply calls the underlying layers on the inputs, in order. It also features a weights property to easily keep track of the layers' parameters.

In [69]:
class NaiveSequential:

    def __init__(self, layers):
        self.layers = layers

    def __call__(self, inputs):
        x = inputs
        for layer in self.layers:
            x = layer(x)
        return x

    @property
    def weights(self):
        weights = []
        for layer in self.layers:
            weights += layer.weights
        return weights

Using this NaiveDense class and this NaiveSequential class, we can create a mock Keras model:

In [70]:
model = NaiveSequential([
    NaiveDense(input_size=28 * 28, output_size=512, activation=tf.nn.relu),
    NaiveDense(input_size=512, output_size=10, activation=tf.nn.softmax)
])
assert len(model.weights) == 4

A batch generator

Next, we need a way to iterate over the MNIST data in mini-batches. This is easy:

In [71]:
class BatchGenerator:

    def __init__(self, images, labels, batch_size=128):
        self.index = 0
        self.images = images
        self.labels = labels
        self.batch_size = batch_size

    def next(self):
        images = self.images[self.index : self.index + self.batch_size]
        labels = self.labels[self.index : self.index + self.batch_size]
        self.index += self.batch_size
        return images, labels
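
A quick way to see how the generator behaves is to run it on a tiny made-up dataset. The class is repeated here only so the sketch runs on its own; the 10-sample arrays and the batch size of 4 are assumptions for illustration.

```python
import numpy as np

class BatchGenerator:
    # Same class as above, repeated so the example is self-contained
    def __init__(self, images, labels, batch_size=128):
        self.index = 0
        self.images = images
        self.labels = labels
        self.batch_size = batch_size

    def next(self):
        images = self.images[self.index : self.index + self.batch_size]
        labels = self.labels[self.index : self.index + self.batch_size]
        self.index += self.batch_size
        return images, labels

images = np.arange(10 * 4, dtype="float32").reshape((10, 4))  # 10 hypothetical samples
labels = np.arange(10)

gen = BatchGenerator(images, labels, batch_size=4)
shapes = []
for _ in range(3):
    batch_images, batch_labels = gen.next()
    shapes.append(batch_images.shape)
print(shapes)
```

Because slicing past the end of an array simply returns fewer rows, the third batch here contains only the 2 remaining samples.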

Running one training step

The most difficult part of the process is the “training step”: updating the weights of the model after running it on one batch of data. We need to:

1. Compute the predictions of the model for the images in the batch

2. Compute the loss value for these predictions given the actual labels

3. Compute the gradient of the loss with regard to the model’s weights

4. Move the weights by a small amount in the direction opposite to the gradient
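
The four steps above can be sketched in NumPy for a tiny linear model. This swaps in a much simpler setting than the MNIST classifier: a mean squared error loss whose gradient is derived by hand instead of by automatic differentiation. The data, the learning rate, and the name `training_step` are all assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(16, 3))            # hypothetical batch: 16 samples, 3 features
targets = inputs @ np.array([1., -2., 0.5])  # hypothetical targets from a known linear rule
w = np.zeros(3)                              # weights to learn
learning_rate = 0.1

def training_step(w):
    predictions = inputs @ w                        # 1. predictions for the batch
    errors = predictions - targets
    loss = np.mean(errors ** 2)                     # 2. loss (mean squared error)
    gradient = 2 * inputs.T @ errors / len(inputs)  # 3. gradient of the loss w.r.t. w
    return w - learning_rate * gradient, loss       # 4. step against the gradient

losses = []
for _ in range(20):
    w, loss = training_step(w)
    losses.append(loss)
print(losses[0], losses[-1])
```

Repeating the step should drive the loss down, which is the same behavior the full training loop relies on.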

To compute the gradient, we will use the TensorFlow GradientTape object:

In [78]:
learning_rate = 1e-3

def update_weights(gradients, weights):
    for g, w in zip(gradients, weights):
        w.assign_sub(g * learning_rate) # move each weight against its gradient; assign_sub is the equivalent of -= for TensorFlow variables

In [79]:
def one_training_step(model, images_batch, labels_batch):
    with tf.GradientTape() as tape:   # run the "forward pass" (compute the model's predictions under the GradientTape scope)
      predictions = model(images_batch)
      per_sample_losses = tf.keras.losses.sparse_categorical_crossentropy(
          labels_batch, predictions)
      average_loss = tf.reduce_mean(per_sample_losses)
    # compute the gradient of the loss with regard to the weights; `gradients` is a list
    # where each entry corresponds to a weight from the model.weights list
    gradients = tape.gradient(average_loss, model.weights)
    update_weights(gradients, model.weights)  # update the weights using the gradients
    return average_loss

In practice, you will almost never implement a weight update step like this by hand. Instead, you would use an Optimizer instance from Keras, like this:

In [80]:
optimizer = optimizers.SGD(learning_rate=1e-3)

def update_weights(gradients, weights):
    optimizer.apply_gradients(zip(gradients, weights))

The full training loop

An epoch of training simply consists of the repetition of the training step for each batch in the training data, and the full training loop is simply the repetition of one epoch:

In [81]:
def fit(model, images, labels, epochs, batch_size=128):
    for epoch_counter in range(epochs):
        print('Epoch %d' % epoch_counter)
        # pass batch_size through so the generator and the loop bound agree
        batch_generator = BatchGenerator(images, labels, batch_size)
        for batch_counter in range(len(images) // batch_size):
            images_batch, labels_batch = batch_generator.next()
            loss = one_training_step(model, images_batch, labels_batch)
            if batch_counter % 100 == 0:
                print('loss at batch %d: %.2f' % (batch_counter, loss))
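
The loop bound `len(images) // batch_size` determines how many training steps run per epoch. A short calculation for the MNIST training set with the default batch size shows that the integer division leaves a few samples unused in each epoch.

```python
import math

num_samples = 60000   # size of the MNIST training set used above
batch_size = 128

full_batches = num_samples // batch_size            # steps run by the loop above
leftover = num_samples - full_batches * batch_size  # samples never reached in an epoch
batches_with_remainder = math.ceil(num_samples / batch_size)

print(full_batches, leftover, batches_with_remainder)
```

This gives 468 full batches with 96 samples left over, which is why the progress messages printed every 100 batches stop at batch 400. Covering the remainder would take one extra, smaller batch per epoch.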

In [82]:
from tensorflow.keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255

fit(model, train_images, train_labels, epochs=10, batch_size=128)

Epoch 0
loss at batch 0: 6.83
loss at batch 100: 2.24
loss at batch 200: 2.21
loss at batch 300: 2.12
loss at batch 400: 2.22
Epoch 1
loss at batch 0: 1.93
loss at batch 100: 1.89
loss at batch 200: 1.84
loss at batch 300: 1.75
loss at batch 400: 1.84
Epoch 2
loss at batch 0: 1.61
loss at batch 100: 1.59
loss at batch 200: 1.52
loss at batch 300: 1.46
loss at batch 400: 1.53
Epoch 3
loss at batch 0: 1.33
loss at batch 100: 1.35
loss at batch 200: 1.26
loss at batch 300: 1.24
loss at batch 400: 1.29
Epoch 4
loss at batch 0: 1.12
loss at batch 100: 1.16
loss at batch 200: 1.05
loss at batch 300: 1.07
loss at batch 400: 1.12
Epoch 5
loss at batch 0: 0.97
loss at batch 100: 1.02
loss at batch 200: 0.90
loss at batch 300: 0.95
loss at batch 400: 1.00
Epoch 6
loss at batch 0: 0.86
loss at batch 100: 0.91
loss at batch 200: 0.80
loss at batch 300: 0.85
loss at batch 400: 0.91
Epoch 7
loss at batch 0: 0.77
loss at batch 100: 0.83
loss at batch 200: 0.72
loss at batch 300: 0.78
loss at batch 40

Evaluating the model

We can evaluate the model by taking the argmax of its predictions over the test images, and comparing it to the expected labels:

In [88]:
predictions = model(test_images)
predictions = predictions.numpy() # calling .numpy() on a TensorFlow tensor converts it to a NumPy array
predicted_labels = np.argmax(predictions, axis=1)
matches = predicted_labels == test_labels
print(f"Accuracy: {np.average(matches)}")

Accuracy: 0.8318
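
The argmax-and-compare logic is easier to follow on a handful of rows. The predictions and labels below are invented for illustration and are not model output.

```python
import numpy as np

# Hypothetical softmax outputs for 4 samples and 3 classes
predictions = np.array([[0.1, 0.7, 0.2],
                        [0.8, 0.1, 0.1],
                        [0.2, 0.3, 0.5],
                        [0.4, 0.5, 0.1]])
true_labels = np.array([1, 0, 2, 0])   # hypothetical ground truth

predicted_labels = np.argmax(predictions, axis=1)   # index of the largest score per row
matches = predicted_labels == true_labels           # boolean array, True where correct
accuracy = matches.mean()                           # True counts as 1, False as 0
print(predicted_labels, accuracy)
```

Three of the four rows are classified correctly, so the accuracy is 0.75.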
