# TensorFlow

We don't have to code up backpropagation for every possible function or neural network architecture that we want to fit. Many machine learning libraries make this task easy and computationally efficient. One of the most popular is [TensorFlow](https://www.tensorflow.org/). It was developed by Google Brain and is now open source under the Apache License 2.0.

(Other popular choices in 2022 are [PyTorch](https://pytorch.org/) and [JAX](https://jax.readthedocs.io/))

The workflow consists of building a computational graph where "operations" act on "tensors" that can be automatically differentiated. Starting from TensorFlow version 2, operations are executed "eagerly" by default, so one can work with tensors in much the same way as with NumPy arrays and typically does not have to worry about building the graph.

The TensorFlow website contains a much more [detailed introduction](https://www.tensorflow.org/guide/low_level_intro) if you want to learn more.

## Numpy-like syntax

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

Tensors can be created via `tf.constant` from Python lists or NumPy arrays. Similar to NumPy arrays, they have a `shape` and a `dtype`.

In [None]:
tf.constant([1, 2, 3], dtype=tf.float32)

In [None]:
tf.constant(np.array([1, 2, 3]), dtype=tf.float32)

In [None]:
tf.constant([[1, 2], [3, 4], [5, 6]])

In [None]:
tf.constant([[[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]],
             [[13, 14, 15, 16],
              [17, 18, 19, 20],
              [21, 22, 23, 24]]])
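The `shape` and `dtype` attributes mentioned above can also be read off directly. A minimal sketch (the name `x_demo` is only used for illustration here):

```python
import tensorflow as tf

# A 3x2 tensor built from a nested Python list of integers
x_demo = tf.constant([[1, 2], [3, 4], [5, 6]])

print(x_demo.shape)  # (3, 2)
print(x_demo.dtype)  # Python integers default to tf.int32

# The dtype can be set explicitly, as in the first cells above
x_float = tf.constant([[1, 2], [3, 4], [5, 6]], dtype=tf.float32)
print(x_float.dtype)
```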

There are also convenience functions, e.g. to create equidistant or random values and all sorts of mathematical functions that represent operations on tensors.

In [None]:
tf.random.uniform((10, 2))

In [None]:
t = tf.linspace(0., 2.*np.pi, 10)
t

In [None]:
2 * t

In [None]:
tf.sin(t)

Tensors can be plotted like NumPy arrays:

In [None]:
plt.plot(t, tf.sin(t))

Or explicitly converted via `.numpy()`:

In [None]:
t.numpy()

In [None]:
tf.sin(t).numpy()

## Auto differentiation
The real power comes from tracing operations, which allows automatic backpropagation to calculate gradients. This can be done using `tf.GradientTape`. By default, gradients are recorded only w.r.t. `tf.Variable` objects, not w.r.t. plain tensors (constants). A `tf.Variable` represents mutable state, which makes sense since in many cases we want to modify the values w.r.t. which we calculate gradients (e.g. when training a neural network).

In [None]:
t = tf.Variable(tf.linspace(0., 2.*np.pi, 100))
t

We can now calculate the derivative of the `sin` function w.r.t. `t` using `tf.GradientTape` in a context manager

In [None]:
with tf.GradientTape() as tape:
    f = tf.sin(t)
df = tape.gradient(f, t)

In [None]:
# Note: a tf.Variable always has to be converted explicitly via .numpy() for plotting
# (not necessary for Tensors/tf.constant)
plt.plot(t.numpy(), f, label="sin(t)")
plt.plot(t.numpy(), df, label="sin'(t)")
plt.legend()

To calculate gradients w.r.t. Tensors (`tf.constant`) instead of `tf.Variable`, use `tape.watch`:

In [None]:
t_const = tf.linspace(0., 2.*np.pi, 100)
with tf.GradientTape() as tape:
    tape.watch(t_const)
    f = tf.sin(t_const)
plt.plot(t_const, f, label="sin(t)")
plt.plot(t_const, tape.gradient(f, t_const), label="sin'(t)")
plt.legend()

The computation of the gradient can also be recorded and we can calculate the gradient of the gradient to get the second derivative.

In [None]:
with tf.GradientTape() as tape0:
    with tf.GradientTape() as tape1:
        f = tf.sin(t)
    df = tape1.gradient(f, t)
ddf = tape0.gradient(df, t)

The two gradient tapes are necessary since TensorFlow by default only allows one gradient to be calculated from a tape. If recording the gradient computation itself on the same tape is intended, one has to pass `persistent=True`, so the following works as well:

In [None]:
with tf.GradientTape(persistent=True) as tape:
    f = tf.sin(t)
    # this is inside the with block, so the gradient itself will also be recorded to the gradient tape
    df = tape.gradient(f, t)
# now we can calculate the gradient of the gradient
ddf_alternative = tape.gradient(df, t)

In [None]:
plt.plot(t.numpy(), f.numpy(), label="sin(t)")
plt.plot(t.numpy(), df.numpy(), label="sin'(t)")
plt.plot(t.numpy(), ddf.numpy(), label="sin''(t)")
plt.legend()
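Besides plotting, the derivatives can be compared numerically with the known analytic results, cos(t) and -sin(t). A minimal self-contained sketch of such a check (the names `t_chk`, `outer` and `inner` are illustrative, and the tolerance is an assumption suited to `float32`):

```python
import numpy as np
import tensorflow as tf

t_chk = tf.Variable(tf.linspace(0.0, 2.0 * np.pi, 100))

# Nested tapes: the inner tape yields the first derivative,
# the outer tape records that computation and yields the second.
with tf.GradientTape() as outer:
    with tf.GradientTape() as inner:
        f_chk = tf.sin(t_chk)
    df_chk = inner.gradient(f_chk, t_chk)
ddf_chk = outer.gradient(df_chk, t_chk)

# Compare with the analytic derivatives of sin(t)
np.testing.assert_allclose(df_chk.numpy(), np.cos(t_chk.numpy()), atol=1e-5)
np.testing.assert_allclose(ddf_chk.numpy(), -np.sin(t_chk.numpy()), atol=1e-5)
```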

## Manually build a NN in TensorFlow

Let's build a 1-hidden-layer NN, similar to what we did in [NNFromScratchNumpy.ipynb](NNFromScratchNumpy.ipynb) now with TensorFlow.

First, we define the weights and biases for the hidden layer and the output layer via `tf.Variable`. We initialize the weights randomly (normal distribution) and the biases to 0. We will again use the convention with column vectors.

In [None]:
# hidden layer parameters
W = tf.Variable(tf.random.normal((16, 2)), name="W")
b = tf.Variable(tf.zeros((16, 1)))
print(W)
print(b)

In [None]:
# output layer parameters
W2 = tf.Variable(tf.random.normal((1, 16)))
b2 = tf.Variable(tf.zeros((1, 1)))
print(W2)
print(b2)

Now, let's propagate some inputs through the neural network.

In [None]:
inp = tf.random.normal((10, 2, 1))
inp

The output of the hidden layer before the activation function:

In [None]:
z = tf.add(tf.matmul(W, inp), b)
z

In [None]:
a = tf.nn.relu(z)
a

In [None]:
z2 = tf.add(tf.matmul(W2, a), b2)
z2

We skip the sigmoid activation of the output layer here, since we will use a loss function that already applies the sigmoid transformation. This is numerically more stable.

But first, we need to define the labels. For this experiment, let's choose them randomly.

In [None]:
y = tf.Variable(np.random.randint(0, 2, size=inp.shape[0]).reshape(-1, 1, 1), dtype=tf.float32)
y

Now we compute the binary cross entropy with a function that takes the logits, i.e. the output values before the sigmoid, and applies the sigmoid transformation internally.

In [None]:
L = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=z2)
L
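To illustrate the remark on numerical stability, here is a small NumPy sketch. It compares a naive implementation (sigmoid first, then logarithm) with the rearranged expression `max(x, 0) - x * z + log(1 + exp(-|x|))` that the TensorFlow documentation gives for `tf.nn.sigmoid_cross_entropy_with_logits`. The function names `naive_bce` and `stable_bce` are made up for this example.

```python
import numpy as np

def naive_bce(logits, labels):
    # Apply the sigmoid first, then take logarithms
    p = 1.0 / (1.0 + np.exp(-logits))
    return -(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))

def stable_bce(logits, labels):
    # Same quantity, rearranged so that no log(0) can occur
    return (np.maximum(logits, 0.0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))

logits = np.array([-2.0, 0.5, 40.0])
labels = np.array([1.0, 0.0, 0.0])

# Silence the expected log(0) warning of the naive version
with np.errstate(divide="ignore", invalid="ignore", over="ignore"):
    naive = naive_bce(logits, labels)
stable = stable_bce(logits, labels)

print(naive)   # last entry is inf: sigmoid(40) rounds to exactly 1.0
print(stable)  # last entry stays finite (about 40)
```

For moderate logits both versions agree; they only differ once the sigmoid saturates.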

Now - **and this is the whole point of this tutorial** - to get the gradients w.r.t. all parameters, we can record the forward pass on a `tf.GradientTape` and ask the tape for the gradients of the loss w.r.t. each parameter.

In [None]:
def forward_NN(inp):
    z = tf.add(tf.matmul(W, inp), b)
    a = tf.nn.relu(z)
    z2 = tf.add(tf.matmul(W2, a), b2)
    return tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=z2)

In [None]:
with tf.GradientTape() as tape:
    L = forward_NN(inp)
parameters = dict(W=W, b=b, W2=W2, b2=b2)
# the gradient will have the same structure (dict, tuple) as the parameters
grad_NN = tape.gradient(L, parameters)
grad_NN
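Note that `L` is not a scalar here: it contains one loss value per training example. For a non-scalar target, `tape.gradient` returns the gradient of the sum of its elements, so each gradient has the shape of the corresponding parameter. A minimal sketch of this behaviour with a single scalar parameter (the names `w_demo` and `x_demo` are illustrative):

```python
import tensorflow as tf

w_demo = tf.Variable(2.0)
x_demo = tf.constant([1.0, 2.0, 3.0])

with tf.GradientTape() as tape:
    per_example = w_demo * x_demo  # non-scalar target, shape (3,)

# Gradient of sum(w * x) w.r.t. w, i.e. sum(x) = 6
grad_demo = tape.gradient(per_example, w_demo)
print(grad_demo)
```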

## Inspect the computation graph

Another way to trace computation graphs (other than `tf.GradientTape`) is by wrapping the operations into a `tf.function`. We won't go into detail here - have a look at the [`tf.function` tutorials](https://www.tensorflow.org/guide/intro_to_graphs) for further information.

For illustration we use it here to display the computation graph with the `tensorboard` extension for Jupyter notebooks:

In [None]:
%load_ext tensorboard

In [None]:
# from https://www.tensorflow.org/tensorboard/graphs

from datetime import datetime
logdir = f"logs/{datetime.now().strftime('%Y%m%d-%H%M%S')}"
writer = tf.summary.create_file_writer(logdir)

# Bracket the function call with
# tf.summary.trace_on() and tf.summary.trace_export().
tf.summary.trace_on(graph=True, profiler=True)
# Call only one tf.function when tracing.
L = tf.function(forward_NN)(inp)
with writer.as_default():
    tf.summary.trace_export(
          name="my_func_trace",
          step=0,
          profiler_outdir=logdir
    )

If you run this notebook remotely at CIP you need to forward the port for the tensorboard process (6006), e.g. via (in a terminal on your local machine)
```
ssh -L 6006:localhost:6006 <your-username>@<your-cip-host>
```

In [None]:
!hostname

In [None]:
%tensorboard --logdir $logdir

In [None]:
# kill the tensorboard process
!sleep 3
!killall tensorboard

## Compare to NNFromScratch

Let's see if we can reproduce that with the formulas we used in [NNFromScratchNumpy.ipynb](NNFromScratchNumpy.ipynb)

Here is a copy of the relevant functions:

In [None]:
def sigmoid(Z):
    return 1/(1+np.exp(-Z))

def relu(Z):
    return np.maximum(0,Z)

def sigmoid_derivative(Z):
    sig = sigmoid(Z)
    return sig * (1 - sig)

def relu_derivative(Z):
    dZ = (Z >= 0)
    return dZ

In [None]:
def single_layer_forward_propagation(A_prev, W_curr, b_curr, activation="relu"):
    Z_curr = np.matmul(W_curr, A_prev) + b_curr
    
    if activation == "relu":
        activation_func = relu
    elif activation == "sigmoid":
        activation_func = sigmoid
    else:
        raise Exception('Non-supported activation function')
        
    return activation_func(Z_curr), Z_curr

In [None]:
def single_layer_backward_propagation(dA_curr, W_curr, b_curr, Z_curr, A_prev, activation="relu"):
    
    if activation == "relu":
        derivative_activation_func = relu_derivative
    elif activation == "sigmoid":
        derivative_activation_func = sigmoid_derivative
    else:
        raise Exception('Non-supported activation function')
            
    dZ_curr = dA_curr * derivative_activation_func(Z_curr)
    dW_curr = np.matmul(
        dZ_curr,
        # need to transpose only the last 2 dimensions, 
        # since the first dimension is the training example index
        np.transpose(A_prev, (0, 2, 1))
    )
    db_curr = dZ_curr
    dA_prev = np.matmul(W_curr.T, dZ_curr)

    return dA_prev, dW_curr, db_curr

In [None]:
def get_loss_value(Y_hat, Y):
    return - np.mean(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))

In [None]:
def get_loss_derivative(Y_hat, Y):
    return - (np.divide(Y, Y_hat) - np.divide(1 - Y, 1 - Y_hat))
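When this loss derivative is multiplied by the sigmoid derivative in the output layer, the product simplifies to `sigmoid(z) - y`. This is the gradient w.r.t. the logits, and a loss function that works on logits can use it directly. A short NumPy sketch that checks this identity on a few hand-picked values (the variable names are chosen for this example only):

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

z_demo = np.array([-1.5, 0.0, 2.0])  # logits
y_demo = np.array([1.0, 0.0, 1.0])   # labels
y_hat = sigmoid(z_demo)

# Chain rule: dL/dz = dL/dy_hat * dy_hat/dz
dL_dyhat = -(y_demo / y_hat - (1 - y_demo) / (1 - y_hat))
dyhat_dz = y_hat * (1 - y_hat)
dL_dz = dL_dyhat * dyhat_dz

# The product collapses to the simple expression sigmoid(z) - y
print(np.allclose(dL_dz, y_hat - y_demo))
```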

First, let's calculate the forward pass. To do so, we store the initial values of the NN parameters, inputs and labels as NumPy arrays:

In [None]:
inp_val, W_val, b_val, W2_val, b2_val, y_val = inp.numpy(), W.numpy(), b.numpy(), W2.numpy(), b2.numpy(), y.numpy()

And then run our manual `numpy` forward propagation

In [None]:
a_val, z_val = single_layer_forward_propagation(inp_val, W_val, b_val)
print(a_val[0].ravel())
print(z_val[0].ravel())

compared to what `tf` gave us

In [None]:
print(a[0].numpy().ravel())
print(z[0].numpy().ravel())

Next layer:

In [None]:
a2_val, z2_val = single_layer_forward_propagation(
    a_val, W2_val, b2_val, activation="sigmoid"
)
print(a2_val.ravel())
print(z2_val.ravel())

For `tf`, we don't have `a2` because we used a definition of the loss function where the sigmoid activation is already included. But we have `z2`:

In [None]:
z2.numpy().ravel()

Great, we implemented the forward pass correctly, so now let's do the backward pass and check whether we get the same gradients as TensorFlow.

In [None]:
dL = get_loss_derivative(a2_val, y_val)
dL.ravel()

Propagate back into the output layer

In [None]:
da, dW2, db2 = single_layer_backward_propagation(
    dL, W2_val, b2_val, z2_val, a_val, activation="sigmoid"
)
print(np.sum(dW2, axis=0).ravel())
print(np.sum(db2, axis=0).ravel())

In [None]:
print(grad_NN["W2"].numpy().ravel())
print(grad_NN["b2"].numpy().ravel())

And from there into the hidden layer

In [None]:
dinp, dW, db = single_layer_backward_propagation(
    da, W_val, b_val, z_val, inp_val, activation="relu"
)
print(np.sum(dW, axis=0))
print(np.sum(db, axis=0).ravel())

In [None]:
print(grad_NN["W"].numpy())
print(grad_NN["b"].numpy().ravel())

Great! TensorFlow does the same thing we attempted to do before.
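Instead of comparing printed arrays by eye, the agreement can also be checked numerically. The following is a minimal self-contained sketch for a single sigmoid layer, not the full network from above. All names with the suffix `_chk` are illustrative, and the tolerances are an assumption suited to `float32`.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
# Batch of 10 column vectors and random binary labels
x_chk = rng.normal(size=(10, 2, 1)).astype(np.float32)
y_chk = rng.integers(0, 2, size=(10, 1, 1)).astype(np.float32)

W_chk = tf.Variable(rng.normal(size=(1, 2)).astype(np.float32))
b_chk = tf.Variable(tf.zeros((1, 1)))

with tf.GradientTape() as tape:
    z_chk = tf.matmul(W_chk, x_chk) + b_chk
    loss_chk = tf.nn.sigmoid_cross_entropy_with_logits(labels=y_chk, logits=z_chk)
grad_W, grad_b = tape.gradient(loss_chk, [W_chk, b_chk])

# Manual backward pass: dL/dz = sigmoid(z) - y, summed over the examples
dz = 1 / (1 + np.exp(-z_chk.numpy())) - y_chk
dW_manual = np.sum(np.matmul(dz, np.transpose(x_chk, (0, 2, 1))), axis=0)
db_manual = np.sum(dz, axis=0)

np.testing.assert_allclose(grad_W.numpy(), dW_manual, rtol=1e-4, atol=1e-5)
np.testing.assert_allclose(grad_b.numpy(), db_manual, rtol=1e-4, atol=1e-5)
```

`np.testing.assert_allclose` raises an error if the arrays differ beyond the given tolerances, so a silent run means the two results agree.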