# TensorFlow

We don't have to code up backpropagation for every possible function or neural network architecture that we want to fit. There are lots of libraries targeted at machine learning that make this task easy and computationally efficient. One of the most popular libraries is [TensorFlow](https://www.tensorflow.org/). It was developed by Google Brain and is now open source under the Apache License 2.0.

The workflow consists of building a computational graph in which "operations" act on "Tensors" and can be differentiated automatically. The Tensors themselves don't hold values; instead they are "initialized" or "fed" when the computation is actually run. We will see how that works in this tutorial. Note that this is the graph-based TensorFlow 1.x API (`tf.placeholder`, `tf.Session`); in TensorFlow 2.x these functions live under `tf.compat.v1`.
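To make the "build first, run later" idea concrete, here is a minimal pure-Python sketch. The classes `Placeholder`, `Add`, `Mul` and the function `run` are hypothetical names invented for this illustration. This is not how TensorFlow is implemented, and it leaves out automatic differentiation entirely; it only shows the pattern of describing a computation and feeding values afterwards.

```python
# Toy illustration of a deferred computation graph (hypothetical classes,
# not TensorFlow's API): nodes describe a computation, values arrive later.

class Placeholder:
    """A node without a value; the value is supplied when the graph is run."""
    def __init__(self, name):
        self.name = name


class Add:
    """A node describing the operation left + right."""
    def __init__(self, left, right):
        self.left, self.right = left, right


class Mul:
    """A node describing the operation left * right."""
    def __init__(self, left, right):
        self.left, self.right = left, right


def run(node, feed):
    """Evaluate `node`, looking up placeholder values in the `feed` dict."""
    if isinstance(node, Placeholder):
        return feed[node]
    if isinstance(node, Add):
        return run(node.left, feed) + run(node.right, feed)
    if isinstance(node, Mul):
        return run(node.left, feed) * run(node.right, feed)
    return node  # plain Python numbers act as constants


# Building the graph computes nothing yet.
x = Placeholder("x")
w = Placeholder("w")
graph = Add(Mul(w, x), 1.0)

# Only running it with fed values produces a number: 2.0 * 3.0 + 1.0
result = run(graph, {x: 3.0, w: 2.0})
print(result)
```

The same `graph` object can be run again with a different feed dictionary, which is the role the placeholders play in the TensorFlow code of this tutorial.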

The TensorFlow website contains a much more [detailed introduction](https://www.tensorflow.org/guide/low_level_intro) if you want to learn more.

## Manually build a NN in TensorFlow

Let's build a 1-hidden-layer NN, similar to what we did in [NNFromScratch.ipynb](NNFromScratch.ipynb), but now with TensorFlow.

In [1]:
import tensorflow as tf
import numpy as np

Placeholders are used for values that will be fed in later. A dimension of size `None` can have arbitrary size; in this case it is the training example index.

When we create Tensors, they are added to TensorFlow's current graph. To identify them later, in case we haven't assigned them to a Python variable, it is useful to give them a name. TensorFlow appends an index to the name if it is already taken.

The first Tensor will hold the input values that we will feed in later.

In [2]:
inp = tf.placeholder(tf.float32, (None, 2), name='input')
inp

<tf.Tensor 'input:0' shape=(?, 2) dtype=float32>

Next, we define the weights and biases for the hidden layer. We could use a placeholder as well and feed it the initial weights, but to illustrate a different concept, let's use a variable. We use the `tf.get_variable` method. Note that this method does not append an index to a name that is already taken; it throws an exception instead.

Variables have to be initialized. We could, for example, initialize the weights from a NumPy array of values that we specify ourselves:

In [3]:
W = tf.get_variable("W", dtype=tf.float32, initializer=np.random.randn(2, 16).astype(np.float32))

In [4]:
b = tf.get_variable("b", dtype=tf.float32, initializer=tf.zeros(16))

In [5]:
print(W)
print(b)

<tf.Variable 'W:0' shape=(2, 16) dtype=float32_ref>
<tf.Variable 'b:0' shape=(16,) dtype=float32_ref>


Now let's define the output of the first hidden layer:

In [6]:
z = tf.add(tf.matmul(inp, W), b, name="z")
z

<tf.Tensor 'z:0' shape=(?, 16) dtype=float32>

In [7]:
a = tf.nn.relu(z, name="a")
a

<tf.Tensor 'a:0' shape=(?, 16) dtype=float32>

And the weights and outputs of the output layer.

Let's try another method for variable initialization here, using TensorFlow's initializers:

In [8]:
W2 = tf.get_variable("W2", dtype=tf.float32, initializer=tf.random_normal_initializer()(shape=(16, 1)))
b2 = tf.get_variable("b2", dtype=tf.float32, initializer=tf.zeros(1))
print(W2)
print(b2)

<tf.Variable 'W2:0' shape=(16, 1) dtype=float32_ref>
<tf.Variable 'b2:0' shape=(1,) dtype=float32_ref>


In [9]:
z2 = tf.add(tf.matmul(a, W2), b2)
z2

<tf.Tensor 'Add:0' shape=(?, 1) dtype=float32>

We will skip the activation function, since we will use a loss function that already applies the sigmoid transformation. This is numerically more stable.

But first, we need to define a Tensor that will hold the labels:

In [10]:
y = tf.placeholder(tf.float32, (None,1))

Now we define the binary cross entropy. This version applies the sigmoid transformation itself, so it expects input values (logits) that have not had the sigmoid applied yet.

In [11]:
L = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=z2)
L

<tf.Tensor 'logistic_loss:0' shape=(?, 1) dtype=float32>
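The claim about numerical stability can be checked in plain NumPy. The sketch below compares the naive "sigmoid first, then log" formula with the rearranged form `max(x, 0) - x * z + log(1 + exp(-|x|))`, which is the formulation given in the TensorFlow documentation for `sigmoid_cross_entropy_with_logits`. The function names `naive_loss` and `stable_loss` are made up for this illustration, and the code does not call TensorFlow.

```python
import numpy as np


def naive_loss(logits, labels):
    """Apply the sigmoid first, then take the logs (can blow up to inf)."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -(labels * np.log(p) + (1 - labels) * np.log(1 - p))


def stable_loss(logits, labels):
    """Algebraically equivalent form that never takes the log of 0."""
    return (np.maximum(logits, 0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))


logits = np.array([-2.0, 0.5, 3.0])
labels = np.array([0.0, 1.0, 1.0])
# For moderate logits both versions agree.
print(np.allclose(naive_loss(logits, labels), stable_loss(logits, labels)))

# For an extreme logit the sigmoid rounds to exactly 1.0 in floating point,
# so the naive version ends up taking log(0).
big = np.array([40.0])
zero = np.array([0.0])
with np.errstate(divide="ignore", over="ignore"):
    naive_big = naive_loss(big, zero)
stable_big = stable_loss(big, zero)
print(naive_big, stable_big)
```

For a logit of 40 with label 0 the naive version returns infinity, while the rearranged version returns a finite loss of about 40.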

Now, **and this is the whole point of this tutorial**, to get the gradients w.r.t. all parameters we can simply create them as a `tf` operation, and we are done!

In [12]:
grad = tf.gradients(L, [W, b, W2, b2])
grad

[<tf.Tensor 'gradients/MatMul_grad/MatMul_1:0' shape=(2, 16) dtype=float32>,
 <tf.Tensor 'gradients/z_grad/Reshape_1:0' shape=(16,) dtype=float32>,
 <tf.Tensor 'gradients/MatMul_1_grad/MatMul_1:0' shape=(16, 1) dtype=float32>,
 <tf.Tensor 'gradients/Add_grad/Reshape_1:0' shape=(1,) dtype=float32>]
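Since `L` holds one loss value per example, the gradients returned here are those of the sum over its elements, which is why the manual check further down sums over the example axis. A common way to build trust in any gradient computation is to compare it with finite differences. The sketch below does this in plain NumPy for a smaller network of the same structure. It does not call TensorFlow, and all names (`total_loss`, `analytic_grad_W`, `numeric_grad_W`) and sizes are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                        # 5 examples, 2 features
y = rng.integers(0, 2, size=(5, 1)).astype(float)  # binary labels
W, b = rng.normal(size=(2, 4)), np.zeros(4)        # hidden layer
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)      # output layer


def total_loss(W, b, W2, b2):
    """Sigmoid cross entropy from logits, summed over all examples."""
    a = np.maximum(X @ W + b, 0)
    z2 = a @ W2 + b2
    return np.sum(np.maximum(z2, 0) - z2 * y + np.log1p(np.exp(-np.abs(z2))))


def analytic_grad_W(W, b, W2, b2):
    """Hand-written backpropagation for the hidden-layer weights."""
    z = X @ W + b
    a = np.maximum(z, 0)
    z2 = a @ W2 + b2
    dz2 = 1 / (1 + np.exp(-z2)) - y   # d(loss)/d(z2)
    dz = (dz2 @ W2.T) * (z > 0)       # back through the ReLU
    return X.T @ dz


def numeric_grad_W(W, b, W2, b2, eps=1e-6):
    """Central finite differences, one weight at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (total_loss(W_plus, b, W2, b2)
                     - total_loss(W_minus, b, W2, b2)) / (2 * eps)
    return grad


grad_analytic = analytic_grad_W(W, b, W2, b2)
grad_numeric = numeric_grad_W(W, b, W2, b2)
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))
```

Finite differences need two loss evaluations per parameter, so they are only practical as a check on small networks; the graph-based gradients scale much better.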

## Run the NN and calculate the gradient

Getting the output of the NN and the gradient w.r.t. the parameters is now simply a matter of feeding in the values and running the actual `tf` graph. Let's create the feed values with numpy:

In [13]:
inp_val = np.random.randn(10, 2)
inp_val

array([[ 0.2877745 , -0.45729127],
       [-0.05789639,  1.69953029],
       [ 1.52304573,  0.49659072],
       [-2.54455919,  1.03957212],
       [ 0.94309014,  0.82602736],
       [-1.39240318, -2.55734321],
       [ 0.23819814,  0.34620872],
       [ 0.82564916,  0.27619472],
       [ 1.3555375 , -0.08659935],
       [-0.25835985,  0.05114919]])

In [14]:
y_val = np.random.randint(0, 2, size=10).reshape(-1, 1)
y_val

array([[0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1]])

To run the graph, we create a `tf.Session` and call its `run` method with the Tensor we want to evaluate and a dictionary holding values for all placeholder Tensors, in our case the inputs and the targets. Before that, the values of the variables have to be initialized.

In [15]:
sess = tf.Session()

In [16]:
sess.run(tf.global_variables_initializer())

Now, let's just output the values of the gradients:

In [17]:
feed_dict = {inp : inp_val, y : y_val}
grad_val = sess.run(grad, feed_dict=feed_dict)
grad_val

[array([[ 2.8025308 , -3.071056  , -3.1624866 , -0.2209892 ,  0.6038144 ,
         -3.3342512 , -2.829316  ,  1.104261  , -0.40069878,  0.52425086,
          0.8893442 ,  0.03627827,  2.3906946 ,  1.065886  , -0.00704979,
          1.1612515 ],
        [-0.7958876 ,  0.38841072, -1.244313  ,  0.44525224,  0.19730797,
          0.42169818, -1.009977  , -0.17122361, -4.085935  , -0.84847534,
          2.8433928 , -0.05871465, -0.3023622 ,  3.0012853 , -0.01985054,
          0.37946135]], dtype=float32),
 array([-1.1988204 ,  0.40968737,  1.984129  ,  0.19213468, -0.35252774,
         0.4447983 ,  1.8328683 , -0.35094273,  2.689114  , -0.8303844 ,
        -1.1616665 , -0.05746276, -0.31892526, -0.9777993 ,  0.00646718,
        -0.6779789 ], dtype=float32),
 array([[ -2.0471401 ],
        [ -3.6832035 ],
        [ -3.1061494 ],
        [  0.07250378],
        [ -6.486447  ],
        [ -0.96950436],
        [-11.957878  ],
        [  0.5033768 ],
        [ -2.3783982 ],
        [ -0.6327585

And store them for later use

In [18]:
dW_tf, db_tf, dW2_tf, db2_tf = grad_val

Let's see if we can reproduce that with the formulas we used in [NNFromScratch.ipynb](NNFromScratch.ipynb).

Here is a copy of the relevant functions:

In [19]:
def sigmoid(Z):
    return 1/(1+np.exp(-Z))

def relu(Z):
    return np.maximum(0,Z)

def sigmoid_derivative(Z):
    sig = sigmoid(Z)
    return sig * (1 - sig)

def relu_derivative(Z):
    dZ = (Z >= 0)
    return dZ

In [20]:
def single_layer_forward_propagation(A_prev, W_curr, b_curr, activation="relu"):
    Z_curr = np.matmul(W_curr, A_prev) + b_curr

    # compare strings with ==, not with `is` (identity of string objects is not guaranteed)
    if activation == "relu":
        activation_func = relu
    elif activation == "sigmoid":
        activation_func = sigmoid
    else:
        raise ValueError('Non-supported activation function')

    return activation_func(Z_curr), Z_curr

In [21]:
def single_layer_backward_propagation(dA_curr, W_curr, b_curr, Z_curr, A_prev, activation="relu"):

    # compare strings with ==, not with `is` (identity of string objects is not guaranteed)
    if activation == "relu":
        derivative_activation_func = relu_derivative
    elif activation == "sigmoid":
        derivative_activation_func = sigmoid_derivative
    else:
        raise ValueError('Non-supported activation function')

    dZ_curr = dA_curr * derivative_activation_func(Z_curr)
    dW_curr = np.matmul(
        dZ_curr,
        # need to transpose only the last 2 dimensions,
        # since the first dimension is the training example index
        np.transpose(A_prev, (0, 2, 1))
    )
    db_curr = dZ_curr
    dA_prev = np.matmul(W_curr.T, dZ_curr)

    return dA_prev, dW_curr, db_curr
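The batched `np.matmul` with the `(0, 2, 1)` transpose used for `dW_curr` can look opaque, so here is a small standalone sketch of what it computes. The shapes mirror the hidden layer of this tutorial, but the random arrays and the variable names are stand-ins chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_in, n_out = 10, 2, 16

# Column-vector convention used above: one (features, 1) column per example.
A_prev = rng.normal(size=(n, n_in, 1))
dZ = rng.normal(size=(n, n_out, 1))

# np.matmul treats the leading axis as a batch axis, so this yields one
# (n_out, n_in) outer product per training example.
dW_per_example = np.matmul(dZ, np.transpose(A_prev, (0, 2, 1)))
print(dW_per_example.shape)

# Summing over the example axis and transposing gives the (n_in, n_out)
# layout that a row-vector formulation produces with a single matmul.
dW_summed = dW_per_example.sum(axis=0).T
dW_rows = A_prev[:, :, 0].T @ dZ[:, :, 0]
print(np.allclose(dW_summed, dW_rows))
```

This is why the gradients returned by the function still carry the training example index as their first dimension and have to be summed over `axis=0` before they can be compared with a gradient of the summed loss.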

In [22]:
def get_loss_value(Y_hat, Y):
    return - np.mean(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))

In [23]:
def get_loss_derivative(Y_hat, Y):
    return - (np.divide(Y, Y_hat) - np.divide(1 - Y, 1 - Y_hat))

First, let's calculate the forward pass. Unfortunately, the `tf` graph and the copied `numpy` functions use different conventions for `matmul` (the graph stores one example per row, while the copied functions expect column vectors), so apologies for all the confusing transposes and reshapes in the following section. If you have to do something like that, it is usually best to experiment and check that the output dimensions are correct after each step.
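As a small standalone illustration of the two layouts, the sketch below computes the same hidden-layer pre-activation once with one example per row and once with one column vector per example. The arrays are random stand-ins with the shapes used in this tutorial, not the actual values of the graph.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))    # one example per row, like `inp`
W = rng.normal(size=(2, 16))    # (inputs, units), like the tf variable `W`
b = rng.normal(size=16)

# Row convention (what the tf graph does): (10, 2) @ (2, 16) -> (10, 16)
z_rows = X @ W + b

# Column convention (what the copied NumPy functions expect):
# each example is a (2, 1) column and the weight matrix is (16, 2).
z_cols = np.matmul(W.T, X.reshape(-1, 2, 1)) + b.reshape(-1, 1)
print(z_cols.shape)

# Both hold the same numbers, just laid out differently.
print(np.allclose(z_rows, z_cols.reshape(10, 16)))
```

Printing the shapes after each step, as done here, is the quickest way to catch a missing transpose.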

To start, let's store the initialized values of the NN parameters in Python variables:

In [24]:
W_val, b_val, W2_val, b2_val = sess.run([W, b, W2, b2], feed_dict=feed_dict)

And then run our manual `numpy` forward propagation

In [25]:
a_val, z_val = single_layer_forward_propagation(inp_val.reshape(-1, 2, 1), W_val.T, b_val.reshape(-1, 1))
print(a_val[0].reshape(-1))
print(z_val[0].reshape(-1))

[0.         0.         0.         0.         0.13885445 0.
 0.         0.         0.50344369 0.65172233 0.23873353 0.90670911
 0.         0.60770566 0.86829198 0.09452994]
[-0.39402405 -1.46525636 -0.27215844 -0.03547932  0.13885445 -0.34297248
 -0.62451726 -0.54651447  0.50344369  0.65172233  0.23873353  0.90670911
 -0.27551562  0.60770566  0.86829198  0.09452994]


Compare this to what `tf` gives:

In [26]:
print(sess.run(a, feed_dict=feed_dict)[0])
print(sess.run(z, feed_dict=feed_dict)[0])

[-0.         -0.         -0.         -0.          0.13885446 -0.
 -0.         -0.          0.5034437   0.6517223   0.23873353  0.90670913
 -0.          0.60770565  0.868292    0.09452995]
[-0.39402404 -1.4652563  -0.27215844 -0.03547932  0.13885446 -0.3429725
 -0.62451726 -0.5465145   0.5034437   0.6517223   0.23873353  0.90670913
 -0.27551562  0.60770565  0.868292    0.09452995]


Next layer:

In [27]:
a2_val, z2_val = single_layer_forward_propagation(
    a_val.reshape(-1, 16, 1), W2_val.T, b2_val.reshape(-1, 1), activation="sigmoid"
)
print(a2_val.reshape(-1))
print(z2_val.reshape(-1))

[0.224269   0.08674905 0.21573556 0.00045921 0.17649171 0.00893797
 0.37715723 0.32941938 0.11585327 0.33824911]
[-1.24095959 -2.35399121 -1.29069288 -7.68554703 -1.54029973 -4.7084686
 -0.50163194 -0.71081228 -2.0322986  -0.67110654]


For `tf`, we don't have `a2` because we used a definition of the loss function where the sigmoid activation is already included. But we have `z2`:

In [28]:
print(sess.run(z2, feed_dict=feed_dict).reshape(-1))

[-1.2409596  -2.353991   -1.2906929  -7.685547   -1.5402997  -4.7084684
 -0.501632   -0.71081245 -2.0322988  -0.6711066 ]


Great, we implemented the forward pass correctly, so now let's do the backward pass and check whether we get the same gradients as TensorFlow.

In [29]:
dL = get_loss_derivative(a2_val, y_val.reshape(-1, 1, 1))
dL.reshape(-1)

array([ 1.28910666e+00,  1.09498928e+00,  1.27508012e+00, -2.17766034e+03,
        1.21431685e+00, -1.11882225e+02,  1.60554165e+00,  1.49124501e+00,
       -8.63160824e+00, -2.95640096e+00])

Propagate back into the output layer

In [30]:
da, dW2, db2 = single_layer_backward_propagation(
    dL, W2_val.T, b2_val.reshape(-1, 1), z2_val, a_val, activation="sigmoid"
)
print(np.sum(dW2, axis=0).reshape(-1))
print(np.sum(db2, axis=0).reshape(-1))

[ -2.04714007  -3.68320363  -3.10614896   0.07250378  -6.48644681
  -0.96950433 -11.95787755   0.50337676  -2.37839827  -0.63275843
  -0.80057187  -0.880386    -0.79289815  -1.10625534  -2.06411813
  -5.66798132]
[-2.12667851]


In [31]:
print(dW2_tf.reshape(-1))
print(db2_tf.reshape(-1))

[ -2.0471401   -3.6832035   -3.1061494    0.07250378  -6.486447
  -0.96950436 -11.957878     0.5033768   -2.3783982   -0.6327585
  -0.8005719   -0.8803862   -0.7928982   -1.1062554   -2.0641184
  -5.6679816 ]
[-2.1266787]
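The loss derivative printed earlier contains very large entries (around -2000), yet the final gradients are moderate. The reason is that multiplying the loss derivative by the sigmoid derivative collapses algebraically to `a2 - y`. The sketch below checks this in NumPy with the `z2` and `y` values copied from the outputs above; the variable names are chosen for this illustration only.

```python
import numpy as np

# z2 and y values copied from the outputs printed above
z2 = np.array([-1.24095959, -2.35399121, -1.29069288, -7.68554703, -1.54029973,
               -4.7084686, -0.50163194, -0.71081228, -2.0322986, -0.67110654])
y = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 1])

a2 = 1 / (1 + np.exp(-z2))

# Two-step version: loss derivative w.r.t. a2 times the sigmoid derivative
dL_da2 = -(y / a2 - (1 - y) / (1 - a2))
dz2_two_step = dL_da2 * a2 * (1 - a2)

# Simplified version: the product collapses to a2 - y
dz2_direct = a2 - y

print(np.allclose(dz2_two_step, dz2_direct))
print(dz2_direct.sum())  # gradient w.r.t. b2, summed over the examples
```

The summed value should land close to the `db2` gradient shown above. Working with `a2 - y` directly avoids dividing by values of `a2` or `1 - a2` that are close to zero, which is the same stability argument that motivated the combined loss function.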


And from there into the hidden layer

In [32]:
dinp, dW, db = single_layer_backward_propagation(
    da, W_val.T, b_val.reshape(-1, 1), z_val, inp_val.reshape(-1, 2, 1), activation="relu"
)
print(np.sum(dW, axis=0).T)
print(np.sum(db, axis=0).reshape(-1))

[[ 2.80253083 -3.07105562 -3.16248658 -0.22098917  0.60381441 -3.33425114
  -2.8293158   1.10426104 -0.40069876  0.52425093  0.88934434  0.03627826
   2.39069446  1.06588626 -0.00704979  1.16125159]
 [-0.79588766  0.38841069 -1.24431291  0.44525222  0.19730798  0.42169825
  -1.00997679 -0.17122365 -4.08593471 -0.84847531  2.84339269 -0.05871465
  -0.30236225  3.00128546 -0.01985054  0.37946132]]
[-1.19882035  0.40968728  1.98412887  0.19213473 -0.35252772  0.44479829
  1.83286822 -0.35094271  2.68911412 -0.83038462 -1.16166641 -0.05746277
 -0.31892523 -0.9777993   0.00646718 -0.67797882]


In [33]:
print(dW_tf)
print(db_tf)

[[ 2.8025308  -3.071056   -3.1624866  -0.2209892   0.6038144  -3.3342512
  -2.829316    1.104261   -0.40069878  0.52425086  0.8893442   0.03627827
   2.3906946   1.065886   -0.00704979  1.1612515 ]
 [-0.7958876   0.38841072 -1.244313    0.44525224  0.19730797  0.42169818
  -1.009977   -0.17122361 -4.085935   -0.84847534  2.8433928  -0.05871465
  -0.3023622   3.0012853  -0.01985054  0.37946135]]
[-1.1988204   0.40968737  1.984129    0.19213468 -0.35252774  0.4447983
  1.8328683  -0.35094273  2.689114   -0.8303844  -1.1616665  -0.05746276
 -0.31892526 -0.9777993   0.00646718 -0.6779789 ]


Great! TensorFlow does the same thing we attempted to do by hand before.