# Shallow neural network (1-hidden layer)

### In this notebook I explain the structure of a 2-layer (1-hidden-layer) neural network. We build the network from scratch using plain Python and NumPy, with Matplotlib for visualization. The model predicts the color of points: label y=0 for red and label y=1 for blue.

## 1. Importing required libraries
- **NumPy** provides multi-dimensional array and matrix data structures, along with the mathematical operations on them.
-  **Matplotlib** is a plotting library used for visualizing data.


In [None]:
import numpy as np
import matplotlib.pyplot as plt

## 2. Generating the dataset that will be used in training the neural network 

 -  Points where exactly one of the x1 and x2 components is above 0.5 have a classification value of 1; otherwise (if both components are greater than 0.5 or both are less than 0.5) the classification value is 0.

In [None]:
def xor(X1, X2):
    return 0 if (X1 > .5 and X2 > .5) or (X1 < .5 and X2 < .5) else 1

- This function generates and returns our dataset. It takes one argument, 'size', which determines the number of samples (observations).

- It returns the input array, which in this example is a (2, size) matrix, and a (1, size) array of classification labels.

In [None]:
def generate_dataset(size):
    
    """
    Generates dataset
    
    Arguments:
    size -- size of the dataset
    
    Returns:
    X -- Input data in shape of (2, size)
    Y -- Output data in shape of (1, size)
    """
    np.random.seed(1)
    
    # Create an array of the given size and fill it with random values between 0 and 1
    X = np.random.rand(2, size)
    
    # Create a list of tuples of each pair of the array created above
    features = list(zip(X[0], X[1]))
    
    # Create an array of classification labels
    Y = np.array([xor(x1 ,x2) for x1, x2 in features])
    
    # Reshape our labels to (1, size) dimensional array
    Y.shape = (1, size)

    # Return the input and output layers
    return X, Y
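
The labeling rule above can also be written without a Python loop. The following is a minimal sketch, not the notebook's own function: the name `generate_dataset_vectorized` and the `seed` argument are hypothetical, and it uses NumPy's `default_rng`, so the sampled points differ from those produced by `np.random.seed(1)`. Points lying exactly on 0.5 are also treated slightly differently than in `xor`.

```python
import numpy as np

def generate_dataset_vectorized(size, seed=1):
    """Hypothetical vectorized variant of generate_dataset."""
    rng = np.random.default_rng(seed)
    # Random points in the unit square, one column per sample
    X = rng.random((2, size))
    # Label is 1 when exactly one coordinate is above 0.5
    Y = np.logical_xor(X[0] > 0.5, X[1] > 0.5).astype(int).reshape(1, size)
    return X, Y

X_demo, Y_demo = generate_dataset_vectorized(8)
print(X_demo.shape, Y_demo.shape)  # (2, 8) (1, 8)
```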

### 3. Activation Functions

- Activation functions are applied to the weighted sum computed by each layer to produce that layer's output. Because they are non-linear, the network can learn complex patterns in the data that a purely linear model could not.

- Types of activation functions: Sigmoid, tanh, ReLU, leaky ReLU, etc.
- I will be using only 2 functions as we have 2 layers in our network:

    - tanh: which is a built-in function in NumPy, and its formula is as follows:
    
        **$$\tanh(x) = \frac{2}{1 + e^{-2x}} - 1$$**

    - sigmoid: 
        
        $$\sigma(x) = \frac{1}{1 + e^{-x}}$$
        
        
- **I will make a comparison notebook explaining the activation functions in depth later.**

In [None]:
def sigmoid(Z):
    
    # np.exp() calculates the exponential of all elements in the input array.
    return 1 / (1 + np.exp(-Z))
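
As a quick sanity check, the closed-form tanh expression 2 / (1 + e^(-2x)) - 1 can be compared numerically with `np.tanh`. The sketch below redefines `sigmoid` only so that it runs on its own; the sample points in `x` are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.linspace(-3, 3, 7)

# tanh written out with the exponential formula
tanh_from_formula = 2 / (1 + np.exp(-2 * x)) - 1

print(np.allclose(tanh_from_formula, np.tanh(x)))       # True
# tanh is a rescaled, shifted sigmoid
print(np.allclose(2 * sigmoid(2 * x) - 1, np.tanh(x)))  # True
print(sigmoid(0.0))                                      # 0.5
```

So tanh maps its input into (-1, 1) while sigmoid maps it into (0, 1), which is why sigmoid is convenient for an output that is read as a probability.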

### 4. Initializing model's parameters
   - The initialization step is critical to the model's overall performance and should be done carefully.
   - The weights should be initialized randomly. If we initialized them to zero, every node in the hidden layer would compute the same features, which prevents different neurons from learning different things. So we have to break the symmetry by initializing them randomly.
   - The bias, unlike the weights, can be initialized to zero.
   
**Parameters initialization will be explained in depth later.**

In [None]:
def initialize_parameters(hidden_nodes, features, output_layers):
    
    """
    Initializing the parameters 
    
    Arguments:
    hidden_nodes -- number of hidden nodes in the hidden layer
    features -- number of features in our dataset
    output_layers -- number of outputs 
    
    Returns:
    W -- dictionary of the randomly initialized weights (W1, W2)
    B -- dictionary of the biases (b1, b2), initialized to zero
    """
    # Initializing the weights with random values
    
    # The first layer's weights form a (hidden_nodes, features) array, as the layer takes the features as input
    W1 = np.random.randn(hidden_nodes, features) * 0.01
    
    # The second layer's weights (the output layer) form an (output_layers, hidden_nodes) array, as the layer takes the hidden layer's output as input
    W2 = np.random.randn(output_layers, hidden_nodes) * 0.01
    
    # Initializing bias with zero
    b1 = np.zeros((hidden_nodes, 1))
    b2 = np.zeros((output_layers, 1))
    
    # Adding the parameters to dictionary
    W = {'W1': W1, 'W2': W2}
    B = {'b1': b1, 'b2': b2}
    
    # Return the parameters
    return W, B
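
The symmetry problem can be seen directly in the gradients. The sketch below is an illustration under assumptions, not part of the notebook's pipeline: it uses a tiny made-up dataset, three hidden nodes, biases fixed at zero, and a hypothetical helper `hidden_gradient`. With constant weights every hidden node receives the same gradient row, so the nodes can never become different. (With weights of exactly zero the weight gradients vanish altogether.)

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((2, 6))
Y = np.logical_xor(X[0] > 0.5, X[1] > 0.5).astype(int).reshape(1, -1)
m = X.shape[1]

def hidden_gradient(W1, W2):
    """Gradient of the cross-entropy cost w.r.t. W1 (biases fixed at zero)."""
    A1 = np.tanh(W1 @ X)
    A2 = 1 / (1 + np.exp(-(W2 @ A1)))
    dZ2 = A2 - Y
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
    return dZ1 @ X.T / m

# Constant initialization: every hidden node starts out identical
dW1_const = hidden_gradient(np.full((3, 2), 0.5), np.full((1, 3), 0.5))

# Small random initialization, in the spirit of initialize_parameters
dW1_rand = hidden_gradient(rng.standard_normal((3, 2)) * 0.01,
                           rng.standard_normal((1, 3)) * 0.01)

print(np.allclose(dW1_const, dW1_const[0]))  # True: all rows are identical
print(np.allclose(dW1_rand, dW1_rand[0]))    # False: rows differ
```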

### 5. Forward propagation
- Forward propagation (or forward pass) refers to the calculation and storage of intermediate variables (including outputs) for a neural network in order from the input layer to the output layer. 
- Each hidden layer accepts the input data, processes it with its activation function, and passes the result to the next layer. We now work step-by-step through the mechanics of a neural network with one hidden layer.


**$$Z^{[1]}=W^{[1]}.X + b^{[1]}$$**

**$$A^{[1]}=\tanh(Z^{[1]})$$**

**$$Z^{[2]}=W^{[2]}.A^{[1]} + b^{[2]}$$**

**$$A^{[2]}=\sigma(Z^{[2]})$$**

In [None]:
def forward_propagation(W, B, X):
    
    """
    Applying forward propagation using the equations given above
    
    Arguments:
    W -- The weights of our layers
    B -- The bias
    X -- Input data in shape of (2, samples)
    
    Returns:
    A1 -- Activations of the hidden layer
    A2 -- Activations of the output layer
    
    """
    
    W1 = W['W1']
    W2 = W['W2']
    
    b1 = B['b1']
    b2 = B['b2']
    
    # Calculating Z of the hidden layer using np.dot(), which computes the matrix product
    Z1 = np.dot(W1, X) + b1
    
    # Calculating activations of the hidden layer using np.tanh, which applies the tanh function element-wise
    A1 = np.tanh(Z1)
    
    # Calculating Z of the output layer
    Z2 = np.dot(W2, A1) + b2
    
    # Calculating activations of the output layer using sigmoid function
    A2 = sigmoid(Z2)
    
    # Return the activation values
    return A1, A2
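
It can help to trace the array shapes through the four equations. The following sketch assumes 2 features, 5 hidden nodes, 1 output and 4 samples; these sizes and the variable names are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_outputs, m = 2, 5, 1, 4

X = rng.random((n_features, m))
W1 = rng.standard_normal((n_hidden, n_features)) * 0.01
b1 = np.zeros((n_hidden, 1))
W2 = rng.standard_normal((n_outputs, n_hidden)) * 0.01
b2 = np.zeros((n_outputs, 1))

Z1 = W1 @ X + b1            # (5, 2) @ (2, 4) -> (5, 4); b1 broadcasts over the columns
A1 = np.tanh(Z1)            # same shape as Z1
Z2 = W2 @ A1 + b2           # (1, 5) @ (5, 4) -> (1, 4)
A2 = 1 / (1 + np.exp(-Z2))  # one probability per sample

print(Z1.shape, A1.shape, Z2.shape, A2.shape)  # (5, 4) (5, 4) (1, 4) (1, 4)
```

Each column of every intermediate array corresponds to one sample, which is what lets the whole dataset pass through the network in a single matrix product per layer.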
    

### 6. Cost
- After calculating the activations, we can calculate the cost. The cost function measures the overall performance of our model by quantifying the error between the predicted values and the expected values, and expresses it as a single real number.

<img src="images/CostEquation.gif" />

- For the log function we can use NumPy's built-in np.log()
- np.squeeze() is used to remove redundant dimensions
- Y.T is the transpose of the Y matrix

In [None]:
def cost(Y, A):
    
    """
    Computes the cost using the equation given above
    
    Arguments:
    Y -- Output data in shape of (1, samples)
    A -- Activations of the output layer
    
    Returns:
    cost -- the cost using the equation given above
    """
    
    # Number of samples
    m = Y.shape[1]
    
    # Calculating the cost using the equation given above
    cost = (-1 / m) * np.sum((np.dot(np.log(A), Y.T) + np.dot(np.log(1 - A), (1 - Y).T)))
    cost = float(np.squeeze(cost))
    
    return cost
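
To get a feel for the numbers, here is a small worked example of the same cross-entropy cost written with element-wise products. It is a sketch with an added assumption: the hypothetical `eps` clipping is not in the function above and is included only so that an activation of exactly 0 or 1 does not produce `log(0)`.

```python
import numpy as np

def cross_entropy(Y, A, eps=1e-12):
    # Keep activations strictly inside (0, 1) to avoid log(0) (an added safeguard)
    A = np.clip(A, eps, 1 - eps)
    return float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

Y = np.array([[1, 0, 1]])
confident = np.array([[0.9, 0.1, 0.8]])  # close to the true labels
unsure = np.array([[0.5, 0.5, 0.5]])     # no information

print(cross_entropy(Y, confident) < cross_entropy(Y, unsure))  # True
print(np.isclose(cross_entropy(Y, unsure), np.log(2)))          # True
```

A model that always outputs 0.5 has a cost of ln 2 (about 0.693), so a cost well below that value indicates the network has learned something.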

### 7. Backward propagation
- Backpropagation repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector.
- Backpropagation aims to minimize the cost function by adjusting network’s weights and biases and the level of adjustment is determined by the gradients of the cost function with respect to those parameters.
- Calculating the gradient tells us how much the weights and biases should change in order to minimize the cost.



**$$\frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)} } = \frac{1}{m} (a^{[2](i)} - y^{(i)})$$**

**$$\frac{\partial \mathcal{J} }{ \partial W_2 } = \sum_i{\frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)} } a^{[1] (i) T}} $$**

**$$\frac{\partial \mathcal{J} }{ \partial b_2 } = \sum_i{\frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)}}}$$**

**$$\frac{\partial \mathcal{J} }{ \partial z_{1}^{(i)} } =  W_2^T \frac{\partial \mathcal{J} }{ \partial z_{2}^{(i)} } * \left( 1 - \left(a^{[1] (i)}\right)^{2}\right) $$**

**$$\frac{\partial \mathcal{J} }{ \partial W_1 } = \sum_i{\frac{\partial \mathcal{J} }{ \partial z_{1}^{(i)} }  x^{(i) T}} $$**

**$$\frac{\partial \mathcal{J} }{ \partial b_1 } = \sum_i{\frac{\partial \mathcal{J} }{ \partial z_{1}^{(i)}}}$$**

   - dW1 = $\frac{\partial \mathcal{J} }{ \partial W_1 }$
   - db1 = $\frac{\partial \mathcal{J} }{ \partial b_1 }$
   - dW2 = $\frac{\partial \mathcal{J} }{ \partial W_2 }$
   - db2 = $\frac{\partial \mathcal{J} }{ \partial b_2 }$

In [None]:
def backward_propagation(W, A, X, Y):
    """
    Applying backward propagation using the equations above
    
    Arguments:
    W -- Python dictionary containing the weights of our network
    A -- Python dictionary containing the activation values of the layers
    X -- Input data in shape of (2, samples)
    Y -- Output data in shape of (1, samples)
    
    Returns:
    gradients -- Python dictionary containing our gradients
    
    """
    
    # Samples count
    m = X.shape[1]
    
    # Getting the weights from W
    W1 = W['W1']
    W2 = W['W2']
    
    # Getting the activations from A
    A1 = A['A1']
    A2 = A['A2']
    
    # Calculating the gradients using the equations given above
    # (the 1/m factor is applied when computing dW and db rather than inside dZ)
    dZ2 = A2 - Y
    dW2 = (1 / m) * np.dot(dZ2, A1.T)
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
    
    dZ1 = np.dot(W2.T, dZ2) * (1 - np.power(A1, 2))
    dW1 = (1 / m) * np.dot(dZ1, X.T)
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)
    
    gradients = {'dW2': dW2,'db2': db2, 'dW1': dW1, 'db1': db1}
    
    return gradients 
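
A common way to gain confidence in hand-derived gradients is a numerical gradient check. The sketch below is self-contained and uses assumptions of its own: a 10-sample toy dataset, 3 hidden nodes, and local helpers `forward` and `cost` that mirror the functions above. It compares the analytic gradient for `W1` with a central-difference estimate.

```python
import numpy as np

def forward(W1, b1, W2, b2, X):
    A1 = np.tanh(W1 @ X + b1)
    A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))
    return A1, A2

def cost(Y, A2):
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

rng = np.random.default_rng(0)
X = rng.random((2, 10))
Y = np.logical_xor(X[0] > 0.5, X[1] > 0.5).astype(int).reshape(1, -1)
m = X.shape[1]

W1 = rng.standard_normal((3, 2)) * 0.5
b1 = np.zeros((3, 1))
W2 = rng.standard_normal((1, 3)) * 0.5
b2 = np.zeros((1, 1))

# Analytic gradient w.r.t. W1, using the same formulas as backward_propagation
A1, A2 = forward(W1, b1, W2, b2, X)
dZ2 = A2 - Y
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
dW1 = dZ1 @ X.T / m

# Numerical gradient via central differences, one entry of W1 at a time
eps = 1e-6
dW1_num = np.zeros_like(W1)
for idx in np.ndindex(*W1.shape):
    W_plus, W_minus = W1.copy(), W1.copy()
    W_plus[idx] += eps
    W_minus[idx] -= eps
    cost_plus = cost(Y, forward(W_plus, b1, W2, b2, X)[1])
    cost_minus = cost(Y, forward(W_minus, b1, W2, b2, X)[1])
    dW1_num[idx] = (cost_plus - cost_minus) / (2 * eps)

# The largest absolute difference should be very small if the formulas agree
max_diff = np.max(np.abs(dW1 - dW1_num))
print(max_diff)
```

The same loop can be repeated for `W2`, `b1` and `b2`; a large difference usually points to a sign error or a missing 1/m factor.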

### 8. Updating the parameters

- After finishing the backward propagation process and getting the gradients we are ready to update our parameters using the general gradient descent rule **$$ \theta = \theta - \alpha \frac{\partial J }{ \partial \theta }$$** where $\alpha$ is the learning rate and $\theta$ represents a parameter.

In [None]:
def update_parameters(W, B, gradients, learning_rate = 1.2):
    
    """
    Updates the parameters using gradient descent rule
    
    Arguments:
    W -- The weights
    B -- The bias
    gradients -- The calculated gradients from backward propagation process
    learning_rate -- the step size at each iteration
    
    Returns:
    W -- The updated weights
    B -- The updated bias
    """
    
    #Getting the parameters
    W1 = W['W1']
    W2 = W['W2']
    
    b1 = B['b1']
    b2 = B['b2']
    
    # Getting the gradients
    dW1 = gradients['dW1']
    dW2 = gradients['dW2']
    
    db1 = gradients['db1']
    db2 = gradients['db2']
    
    #Calculating the new parameters using gradient descent rule
    W1 -= learning_rate * dW1
    W2 -= learning_rate * dW2
    
    b1 -= learning_rate * db1
    b2 -= learning_rate * db2
    
    W = {'W1': W1, 'W2': W2}
    B = {'b1': b1, 'b2': b2}
    
    return W, B
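
The update rule is easier to see on a one-parameter problem. The sketch below is a toy illustration, not part of the model: it minimizes the made-up cost J(θ) = (θ - 3)², whose gradient is 2(θ - 3), with an arbitrary learning rate of 0.1.

```python
theta = 0.0
learning_rate = 0.1

for step in range(100):
    gradient = 2 * (theta - 3)                # derivative of (theta - 3)^2
    theta = theta - learning_rate * gradient  # the gradient descent rule

print(round(theta, 4))  # 3.0
```

Each step moves θ a fraction of the way toward the minimum at 3. A learning rate that is too large for the problem makes the steps overshoot instead of converge.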

### Now we are ready to collect everything together and build our model

In [None]:
def model(X, Y, hidden_nodes, iterations = 20000):
    
    """
    Trains our neural network
    
    Arguments:
    X -- Input data in shape of (2, samples)
    Y -- Output data in shape of (1, samples)
    hidden_nodes -- count of nodes in the hidden layer
    iterations -- Number of iterations in gradient descent loop
    
    Returns:
    W -- The weights learnt by the model
    B -- The bias learnt by the model
    cost_list -- list containing the cost of every iteration
    
    """

    cost_list = []
    
    # Initializing our parameters
    W, B = initialize_parameters(hidden_nodes, X.shape[0], Y.shape[0])
    
    # Gradient descent loop
    for i in range(iterations):
        
        # Calculating the activations
        A1, A2 = forward_propagation(W, B, X)
        
        # Calculating the cost and appending it in the cost_list
        cost_ = cost(Y, A2)
        cost_list.append(cost_)
        
        # Calculating gradients
        gradients = backward_propagation(W, {'A1': A1, 'A2': A2}, X, Y)
        
        #Updating parameters
        W, B = update_parameters(W, B, gradients)
        
        # Print the cost every 1000 iterations
        if i % 1000 == 0:
            print('Cost after %i iterations = %f'%(i, cost_))
            
    return W, B, cost_list

In [None]:
def predict(W, B, X):
    
    """
    Predicts the label of each example
    
    Arguments:
    W -- Weights
    B -- Bias
    X -- Input Data
    
    Returns:
    predictions: the predicted labels
    """
    
    #Calculating activations
    A1, A2 = forward_propagation(W, B, X)
    
    # An output activation higher than 0.5 gives a prediction of True (label 1); otherwise the prediction is False (label 0)
    predictions = (A2 > 0.5)
    
    return predictions
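
The thresholding step can be checked in isolation. In the sketch below the activation values are made up, and the cast to `int` is an optional extra so that the result reads as 0/1 labels rather than booleans.

```python
import numpy as np

# Hypothetical output activations for four samples
A2 = np.array([[0.1, 0.5, 0.51, 0.99]])

# Strictly greater than 0.5 means label 1, so exactly 0.5 maps to label 0
predictions = (A2 > 0.5).astype(int)
print(predictions)  # [[0 0 1 1]]
```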

In [None]:
# Generate a dataset with 500 samples
X, Y = generate_dataset(500)

# Build a model with 5 hidden nodes
iterations = 10000
W, B, costs = model(X, Y, 5, iterations)


In [None]:
#Calculating the accuracy
predictions = predict(W, B, X)
# Accuracy is the share of samples whose predicted label matches the true label
accuracy = np.mean(predictions == Y) * 100
print('Accuracy: %d%%' % accuracy)

In [None]:
#Plotting the decision boundary
xx, yy = np.meshgrid(np.arange(0, 1, 0.01), np.arange(0, 1, 0.01))
# Stack the grid coordinates as rows, then transpose to the (2, samples) layout predict expects
grid = np.c_[xx.ravel(), yy.ravel()]
Z = predict(W, B, grid.T)
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
plt.ylabel('x2')
plt.xlabel('x1')
plt.scatter(X[0, :], X[1, :], c=Y, cmap=plt.cm.Spectral)

### Plotting the cost function with the iterations

In [None]:
plt.plot(list(range(iterations)), costs, '-r')
plt.xlabel('Iterations')
plt.ylabel('Cost')