# Numpy VS Tensorflow Neural Network

In this excersis you will build a neural network from scratch using Numpy. For a sake of simplicity, we will not use a OOP approach, instead we will create the NN with simple functions.

Follow the instructions and descriptions for each function and let's create a NN from scratch witn numpy. Then, we will build the same NN with the tensorflow library to compare them.

## Loading data from file

In [1]:
import numpy as np
import pickle
from sklearn.model_selection import train_test_split

with open("data.pickle", "rb") as f:
    data = pickle.load(f)

features = data["features"]
labels = data["labels"]

train_x, test_x, train_y, test_y = train_test_split(features, labels, test_size=0.2)

## NN with numpy 

We will implement all the functions necessary to train a fully connected NN using only the numpy library. The objective is to be able to train a neural network with *any number of layers* in which the last layer will have a **single neuron** with a **sigmoid activation** function and the other layers any number of neurons with a **relu activation** function.

The following figure shows a diagram of how we will implement the NN training process (*take your time to understand it!*):

<img src="diag.png" alt="Neural network training diagram" style="height: 550px;"/>

The code will be *structured in basic functions* that are composed according to the following scheme:

- L_layer_model
  - initialize_parameters
  - L_model_forward
    - linear_activation_forward
      - linear_forward
      - sigmoid
      - relu
  - compute_cost
  - L_model_backward
    - linear_activation_backward
      - linear_backward
      - sigmoid_backward
      - relu_backward
  - update_parameters
- accuracy

**Notation**:
- We denote $L$ the number of layers of the neural network.
- We denote the weight matrix that connects one layer to the next with the letter $W$, whereas we denote the bias vector with the letter $b$.
- Superscript $[l]$ denotes a quantity associated with layer number $l$.
     - Example: $a^{[L]}$ denotes the output of layer number $L$.
     - Example: The variables $W^{[L]}$ and $b^{[L]}$ denote the weight matrix and the bias vector that connect layer $L-1$ with layer $L$ respectively .
- Superscript $(i)$ denotes a quantity associated with the $i$-th example.
     - Example: $x^{(i)}$ is the $i$-th element of the training set.

### Initialize parameters

The weight matrices must be initialized using the normal distribution and the bias vectors must be initialized with zeros.

In [2]:
def initialize_parameters(layer_dims):
    """
    Inputs:
    layer_dims -- list with the dimension of each layer: e.g. [10,5,1]
    
    Returns:
    parameters -- dic with parameters "W1", "b1", ..., "WL", "bL":
                    Wl -- weight matrix (layer_dims[l], layer_dims[l-1])
                    bl -- bias vector with shape (layer_dims[l], 1)
    """
    
    parameters = {}
    L = len(layer_dims)

    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.1
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
    
    return parameters

### Forward propagation

At each layer of a neural network, the neuron inputs are combined linearly before passing through the activation function according to the following formula:

$$Z^{[l]} = W^{[l]}A^{[l-1]} +b^{[l]}$$

In [3]:
def linear_forward(A, W, b):
    """
    Inputs:
     A -- output of previous layer (or input data): (number of neurons in previous layer, number of examples)
     W -- weight matrix: (number of neurons in the current layer, number of neurons in the previous layer)
     b -- bias vector: (number of neurons in the current layer, 1)

    Returns:
    Z -- the entry to the activation function
    cache -- a triplet containing "A", "W", and "b", used later for backward propagation
    """
    
    Z = np.dot(W, A) + b
    cache = (A, W, b)
    
    return Z, cache

Once the linear combination of the inputs of a layer has been calculated, a non-linear activation function must be applied before sending the outputs to the next layer. If we denote $g$ the activation function (in our case relu or sigmoid), we have the following formula:

$$A^{[l]} = g(Z^{[l]}) = g(W^{[l]}A^{[l-1]} + b^{[l]})$$

In [4]:
def sigmoid(Z):
    """
    Inputs:
     Z -- output of linear forward
     
    Returns:
     A -- g(Z), activation function value
     cache -- Z, used later for backward propagation
    """
    A = 1 / (1 + np.exp(-Z))
    cache = Z
    return A, cache

def relu(Z):
    """
    Inputs:
     Z -- output of linear forward
     
    Returns:
     A -- g(Z), activation function value
     cache -- Z, used later for backward propagation
    """
    A = np.maximum(0, Z)
    cache = Z
    return A, cache

In [5]:
def linear_activation_forward(A_prev, W, b, activation):
    """
    Implements forward propagation of a layer including the activation function

    Inputs:
    A_prev -- output of the previous layer (or input data):(number of neurons in the previous layer, number of examples)
    W -- weight matrix: (number of neurons in the current layer, number of neurons in the previous layer)
    b -- bias vector: (number of neurons in the current layer, 1)
    activation -- the name of the activation function to use in the layer: "sigmoid" or "relu"
    
    Outputs:
    A -- the output of the layer after applying the activation function
    cache -- a pair containing "linear_cache" and "activation_cache", then used for backpropagation
    """
    
    Z, linear_cache = linear_forward(A_prev, W, b)
    
    if activation == "sigmoid":    
        A, activation_cache = sigmoid(Z)
    elif activation == "relu":
        A, activation_cache = relu(Z)
        
    cache = (linear_cache, activation_cache)

    return A, cache

Given the input data, the output of the NN is calculated by applying different layers one after another. If we denote the last layer as $L$, the output of the NN corresponds to the output of the last layer $A^{[L]}$.

In [6]:
def L_model_forward(X, parameters):
    """
    Implement forward propagation of the entire neural network
    
    Inputs:
    X -- data: size array (number of variables, number of examples)
    parameters -- output of initialize_parameters() function
    
    Returns:
    AL -- neural network output
    caches -- list of caches containing all the caches of the linear_activation_forward() function, the caches
                indexed from 0 to L-2 correspond to the relu activation function caches and the indexed cache
                as L-1 corresponds to the cache of the sigmoid activation function
    """

    caches = []
    A = X
    L = len(parameters) // 2

    # deep layers
    for l in range(1, L):
        A_prev = A 
        A, cache = linear_activation_forward(A_prev, parameters["W" + str(l)], parameters["b" + str(l)], "relu")
        caches.append(cache)
    
    # output layer
    AL, cache = linear_activation_forward(A, parameters["W" + str(L)], parameters["b" + str(L)], "sigmoid")
    caches.append(cache)
    
    return AL, caches

### Cost function

Now, we can obtain a value that measures the performance of the NN using a cost function $\mathcal{L}$. We will use the log-loss cost function, which is defined by the following formula:

$$\mathcal{L} = -\frac{1}{m} \sum\limits_{i = 1}^{m} (y^{(i)}\log\left(a^{[L] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right))$$

In [7]:
def compute_cost(AL, Y):
    """
    Calculate the cost function

    Inputs:
    AL -- vector containing the output of the network, corresponding to the probabilities predicted by the neural network
            for each example: (1, number of examples)
    Y -- vector with the correct labels for the input data to the network: (1, number of examples)

    Returns:
    cost -- value of the log-loss cost function
    """
    
    m = Y.shape[1]

    cost = -1 * np.mean(np.multiply(np.log(AL), Y) + np.multiply(np.log(1 - AL), (1 - Y)))
    cost = np.squeeze(cost)
    
    return cost

### Backward propagation

To train a neural network it is necessary to calculate the gradient of the cost function with respect to the network parameters, for which we will use backward propagation. Backpropagation consists of applying the chain rule to calculate the gradient of the cost function step by step in each layer.

To apply the chain rule to the linear part of the neuron, suppose we have already calculated the derivative $dZ^{[l]} = \frac{\partial \mathcal{L} }{\partial Z^{[l ]}}$. So, to calculate the derivatives $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$ we can use the following formulas:

$$ dW^{[l]} = \frac{\partial \mathcal{L} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T}$$
$$ db^{[l]} = \frac{\partial \mathcal{L} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l](i)}$$
$$ dA^{[l-1]} = \frac{\partial \mathcal{L} }{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]}$$

In [8]:
def linear_backward(dZ, cache):
    """
    Implements the linear part of backpropagation for a single layer

    Inputs:
    dZ -- derivative of the cost function with respect to the linear output of the current layer
    cache -- triple containing the values (A_prev, W, b), coming from the linear_forward function

    Returns:
    dA_prev -- derivative of the cost function with respect to the output of the previous layer (l-1): has the same size as A_prev
    dW -- derivative of the cost function with respect to the weight matrix W of the current layer (l): has the same size as W
    db -- derivative of the cost function with respect to the bias vector b of the current layer (l): has the same size as b
    """
    
    A_prev, W, b = cache
    m = A_prev.shape[1]

    dW = np.dot(dZ, A_prev.T) / m
    db = np.mean(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)

    return dA_prev, dW, db

The next step is to apply the chain rule to the nonlinear part of the neurons, that is, to the activation functions. For this, if we denote $g$ as the activation function, we can use the following formula:

$$dZ^{[l]} = dA^{[l]} * g'(Z^{[l]}),$$

where $*$ indicates the product component by component.

In [9]:
def sigmoid_backward(dA, cache):
    """
    Inputs:
     dA -- derivative of the cost function with respect to the output of the current layer (l)
     cache -- "activation_cache", coming from the linear_activation_forward function
     
    Returns:
     dZ -- derivative of the activation function
    """
    Z = cache
    s = 1 / (1 + np.exp(-Z))
    dZ = dA * s * (1 - s)
    return dZ

def relu_backward(dA, cache):
    """
    Inputs:
     dA -- derivative of the cost function with respect to the output of the current layer (l)
     cache -- "activation_cache", coming from the linear_activation_forward function
     
    Returns:
     dZ -- derivative of the activation function
    """
    Z = cache
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0
    return dZ

In [10]:
def linear_activation_backward(dA, cache, activation):
    """
    Implements single layer backpropagation including activation function
    
    Arguments:
    dA -- derivative of the cost function with respect to the output of the current layer (l)
    cache -- pair containing "linear_cache" and "activation_cache", coming from the linear_activation_forward function
    activation -- the name of the activation function used in the current layer (l): "sigmoid" or "relu"
    
    Bring back:
    dA_prev -- derivative of the cost function with respect to the output of the previous layer (l-1): has the same size as A_prev
    dW -- derivative of the cost function with respect to the weight matrix W of the current layer (l): has the same size as W
    db -- derivative of the cost function with respect to the bias vector b of the current layer (l): has the same size as b
    """
    
    linear_cache, activation_cache = cache
    
    if activation == "relu":
        dZ = relu_backward(dA, activation_cache)
    elif activation == "sigmoid":
        dZ = sigmoid_backward(dA, activation_cache)
        
    dA_prev, dW, db = linear_backward(dZ, linear_cache)
    
    return dA_prev, dW, db

Finally, it is possible to calculate the derivative of the cost function with respect to any of the parameters by applying the newly implemented functions starting with the last layer. Let's note that to initialize the back propagation it is necessary to first calculate the value of $\frac{\partial \mathcal{L}}{\partial A^{[L]}}$.

In [11]:
def L_model_backward(AL, Y, caches):
    """
    Implement back propagation of the entire neural network
    
    Inputs:
    AL -- neural network output, comes from the L_model_forward function
    Y -- vector with the correct labels for each example in the data set: (1, number of examples)
    caches -- list of caches containing all the caches of the linear_activation_forward() function, the caches
                indexed from 0 to L-2 correspond to the relu activation function caches and the indexed cache
                as L-1 corresponds to the cache of the sigmoid activation function
    
    Returns:
    grads -- A dictionary with the derivatives of the cost function with respect to each variable:
             grads["dA" + str(l)] = ...
             grads["dW" + str(l)] = ...
             grads["db" + str(l)] = ...
    """
    
    grads = {}
    L = len(caches)
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)
    
    # initialize the back propagation
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    
    # last layer gradient
    current_cache = linear_activation_backward(dAL, caches[L-1], "sigmoid")
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = current_cache

    # deep layers gradient
    for l in reversed(range(L-1)):
        current_cache = linear_activation_backward(grads["dA" + str(l + 2)], caches[l], "relu")
        dA_prev_temp, dW_temp, db_temp = current_cache
        grads["dA" + str(l + 1)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp

    return grads

### Updating parameters

Once we have the gradient of the cost function we can use the **gradient descent method** to update the parameters of the neural network. If we denote $\alpha$ the learning rate, the formulas to apply a gradient descent step are:

$$ W^{[l]} = W^{[l]} - \alpha \text{ } dW^{[l]}$$
$$ b^{[l]} = b^{[l]} - \alpha \text{ } db^{[l]}$$

In [12]:
def update_parameters(parameters, grads, learning_rate):
    """
    Update parameters using gradient descent
    
    Inputs:
    parameters -- dictionary containing the neural network parameters
    grads -- dictionary with the derivatives of the cost function with respect to each parameter,
                corresponds to the output of the L_model_backward function
    
    Returns:
    parameters -- dictionary with updated parameters:
                  parameters["W" + str(l)] = ...
                  parameters["b" + str(l)] = ...
    """
    
    L = len(parameters) // 2

    for l in range(L):
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * grads["dW" + str(l+1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * grads["db" + str(l+1)]

    return parameters

### Training the NN

In [13]:
def L_layer_model(X, Y, layers_dims, learning_rate, num_iterations, print_cost):
    """
    Implements a neural network of L layers where the first L-1 layers have activation function relu and
    The last layer has sigmoid activation function.
    
    Inputs:
    X -- data: size array (number of variables, number of examples)
    Y -- vector with the correct labels for each example in the data set: (1, number of examples)
    layers_dims -- length list (number of layers + 1) containing the number of variables and the number
                    of neurons in each layer,
    learning_rate -- learning rate to apply the gradient descent method
    num_iterations -- number of steps to apply gradient descent
    print_cost -- if True, writes the value of the cost function every 10 iterations
    
    Returns:
    parameters -- adjusted neural network parameters
    """
    
    # Parameter initialization
    parameters = initialize_parameters(layers_dims)
    
    for i in range(0, num_iterations):
        # Forward prop
        AL, caches = L_model_forward(X, parameters)
        
        # cost function
        cost = compute_cost(AL, Y)
    
        # backrpop
        grads = L_model_backward(AL, Y, caches)
 
        # params update
        parameters = update_parameters(parameters, grads, learning_rate)
                
        # printing the cost every 10 iterations
        if print_cost and i % 10 == 0:
            print ("Cost after iteration %i: %f" %(i, cost))
    
    return parameters

In [14]:
layers_dims = [100, 20, 5, 1]
parameters = L_layer_model(train_x.T, train_y.reshape(1, -1), layers_dims=layers_dims, learning_rate=0.1, 
                           num_iterations=250, print_cost=True)

Cost after iteration 0: 0.696913
Cost after iteration 10: 0.691836
Cost after iteration 20: 0.686587
Cost after iteration 30: 0.676047
Cost after iteration 40: 0.650420
Cost after iteration 50: 0.590681
Cost after iteration 60: 0.493065
Cost after iteration 70: 0.405640
Cost after iteration 80: 0.345785
Cost after iteration 90: 0.302525
Cost after iteration 100: 0.269174
Cost after iteration 110: 0.242394
Cost after iteration 120: 0.220268
Cost after iteration 130: 0.201715
Cost after iteration 140: 0.186070
Cost after iteration 150: 0.172704
Cost after iteration 160: 0.161199
Cost after iteration 170: 0.151184
Cost after iteration 180: 0.142391
Cost after iteration 190: 0.134606
Cost after iteration 200: 0.127648
Cost after iteration 210: 0.121405
Cost after iteration 220: 0.115761
Cost after iteration 230: 0.110564
Cost after iteration 240: 0.105818


In [16]:
def accuracy(X, y, parameters):
    """
    Calculate the accuracy of the neural network's predictions.
    
    Inputs:
    X -- data: size array (number of variables, number of examples)
    parameters -- parameters of the trained neural network
    
    Returns:
    accuracy -- value between 0 and 1 that represents the accuracy of the neural network
    """
    
    m = X.shape[1]
    p = np.zeros((1,m))
    
    # forward prop
    probs, caches = L_model_forward(X, parameters)

    # ConversiÃ³n de la salida de la red a valores 0 o 1
    for i in range(0, probs.shape[1]):
        if probs[0, i] > 0.5:
            p[0, i] = 1
        else:
            p[0, i] = 0
            
    accuracy = np.sum((p == y)) / m
    
    return accuracy

print("Accuracy: {:.3f}".format(accuracy(test_x.T, test_y.reshape(1, -1), parameters)))

Accuracy: 0.968


## Tensorflow and Keras NN

In [17]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

def keras_model(layers_dims, learning_rate):
    """
    Create, using Keras or tensorflow, a neural network of L fully connected layers where the first L-1 layers
    They have relu activation function and the last layer has sigmoid activation function.
    
    Inputs:
    layers_dims -- length list (number of layers + 1) containing the number of variables and the number
                    of neurons in each layer,
    learning_rate -- learning rate to apply the gradient descent method
    
    Returns:
    model -- Keras object that represents the neural network
    """
    
    L = len(layers_dims)
    
    model = Sequential()
    model.add(Dense(layers_dims[1], input_shape=(layers_dims[0],), activation="relu"))
    
    for l in range(2, L-1):
        model.add(Dense(layers_dims[l], activation="relu", kernel_initializer="random_normal",
                bias_initializer="zeros"))
    
    model.add(Dense(layers_dims[L-1], activation="sigmoid", kernel_initializer="random_normal",
                bias_initializer="zeros"))

    opt=SGD(learning_rate==learning_rate)
    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

In [19]:
layers_dims = [100, 20, 5, 1]
model = keras_model(layers_dims = layers_dims, learning_rate = 0.1)
model.fit(train_x, train_y, epochs=250, batch_size=train_x.shape[0], verbose=1)

Epoch 1/250
Epoch 2/250
Epoch 3/250
Epoch 4/250
Epoch 5/250
Epoch 6/250
Epoch 7/250
Epoch 8/250
Epoch 9/250
Epoch 10/250
Epoch 11/250
Epoch 12/250
Epoch 13/250
Epoch 14/250
Epoch 15/250
Epoch 16/250
Epoch 17/250
Epoch 18/250
Epoch 19/250
Epoch 20/250
Epoch 21/250
Epoch 22/250
Epoch 23/250
Epoch 24/250
Epoch 25/250
Epoch 26/250
Epoch 27/250
Epoch 28/250
Epoch 29/250
Epoch 30/250
Epoch 31/250
Epoch 32/250
Epoch 33/250
Epoch 34/250
Epoch 35/250
Epoch 36/250
Epoch 37/250
Epoch 38/250
Epoch 39/250
Epoch 40/250
Epoch 41/250
Epoch 42/250
Epoch 43/250
Epoch 44/250
Epoch 45/250
Epoch 46/250
Epoch 47/250
Epoch 48/250
Epoch 49/250
Epoch 50/250
Epoch 51/250
Epoch 52/250
Epoch 53/250
Epoch 54/250
Epoch 55/250
Epoch 56/250
Epoch 57/250
Epoch 58/250
Epoch 59/250
Epoch 60/250
Epoch 61/250
Epoch 62/250
Epoch 63/250
Epoch 64/250
Epoch 65/250
Epoch 66/250
Epoch 67/250
Epoch 68/250
Epoch 69/250
Epoch 70/250
Epoch 71/250
Epoch 72/250
Epoch 73/250
Epoch 74/250
Epoch 75/250
Epoch 76/250
Epoch 77/250
Epoch 78

<keras.callbacks.History at 0x19d3778c6d0>

In [20]:
print("Accuracy {:.3f}".format(model.evaluate(test_x, test_y, verbose=0)[1]))

Accuracy 0.983


<div class = "alert alert-success" style="border-radius:15px">
<b>EXCERCISE / TAKE HOME IDEAS:</b> <br>
1) Compare the performance of both implementations. Use different hiperparameters such as:<br>- number of layers <br>
- different dimension for each layer <br>
- epochs <br>
- etc <br>
<br>
2) Program the numpy NN from scratch using OOP.</div>div>