# Task 2

Implement a basic backward pass for an MLP. Run forward and backward propagation through your network and check your gradients.
This time, the forward pass is implemented for you. Note the matrix notation: inputs have shape $[m, nX, 1]$, where $m$ is the batch size (number of samples) and $nX$ is the length of each sample vector.
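
To see what the $[m, nX, 1]$ layout means in practice, here is a small sketch of how `np.matmul` treats such a batch. The sizes and variable names are illustrative assumptions, not values used by the task.

```python
import numpy as np

m, nX, nUnits = 5, 3, 4           # batch size, input size, layer width (illustrative)
X = np.random.randn(m, nX, 1)     # batch of m column vectors
W = np.random.randn(nUnits, nX)   # one weight matrix shared by all samples
b = np.zeros((nUnits, 1))

# np.matmul broadcasts W over the leading batch axis of X,
# so every sample is multiplied by the same W.
Z = np.matmul(W, X) + b
print(Z.shape)                    # (5, 4, 1)
```

Each `Z[i]` equals `W @ X[i] + b`, so the first axis always indexes samples.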

In [112]:
# Import
import numpy as np

## Activations

Implement the derivatives of the activation functions used in this task (linear, ReLU, sigmoid) in the `derivation` methods below.
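
As a reference point, the derivatives can be sketched as standalone functions. The function names below are hypothetical and are not part of the classes in this notebook; the value of the ReLU derivative at exactly $z = 0$ is a convention (0 is assumed here).

```python
import numpy as np

def linear_derivative(z):
    # d/dz of z is 1 everywhere
    return np.ones_like(z, dtype=float)

def relu_derivative(z):
    # 1 where z > 0, otherwise 0 (including z == 0, by convention)
    return (z > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)
```

All three are element-wise, so they work unchanged on arrays of shape $[m, nX, 1]$.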

In [113]:
#------------------------------------------------------------------------------
#   ActivationFunction class
#------------------------------------------------------------------------------
class ActivationFunction:
    def __init__(self):
        pass

    def __call__(self, z):
        pass

#------------------------------------------------------------------------------
#   LinearActivationFunction class
#------------------------------------------------------------------------------
class LinearActivationFunction(ActivationFunction):
    def __call__(self, z):
        return z

    def derivation(self, z):
        ###>>> start of solution

        ###<<< end of solution
        pass

#------------------------------------------------------------------------------
#   RELUActivationFunction class
#------------------------------------------------------------------------------
class RELUActivationFunction(ActivationFunction):
    def __call__(self, z):
        return np.maximum(z, 0)

    def derivation(self, z):
        ###>>> start of solution

        ###<<< end of solution
        pass

#------------------------------------------------------------------------------
#   SigmoidActivationFunction class
#------------------------------------------------------------------------------
class SigmoidActivationFunction(ActivationFunction):
    def __call__(self, z):
        return 1.0/(1.0+np.exp(-z))

    def derivation(self, z):
        ###>>> start of solution

        ###<<< end of solution
        pass
    
# Activation mapping

MAP_ACTIVATION_FUNCTIONS = {
    "linear": LinearActivationFunction,
    "relu": RELUActivationFunction,
    "sigmoid": SigmoidActivationFunction
}

def CreateActivationFunction(kind):
    if (kind in MAP_ACTIVATION_FUNCTIONS):
        return MAP_ACTIVATION_FUNCTIONS[kind]()
    raise ValueError(kind, "Unknown activation function {0}".format(kind))

## Layer

This is the main class, which can represent different types of layers and provides standard operations such as forward propagation. Implement the `backward` functions for the classes defined below.

`nUnits` - number of neuron units in the layer

`prevLayer` - the previous layer; its shape determines how many weights the current layer needs
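
The backward step of a single dense layer can be sketched as below. `dense_backward` is a hypothetical standalone helper, not the method you are asked to fill in. It assumes `da` holds per-sample derivatives and that the cost is the mean over the batch, so `dW` and `db` are averaged over $m$; if your loss derivative already divides by the batch size, drop that division.

```python
import numpy as np

def dense_backward(da, z, a_prev, W, act_derivative):
    """Gradients of one dense layer for a batch (illustrative sketch).

    da, z   : shape (m, nUnits, 1)
    a_prev  : shape (m, nPrev, 1)
    W       : shape (nUnits, nPrev)
    """
    m = a_prev.shape[0]
    dz = da * act_derivative(z)                        # (m, nUnits, 1)
    a_prev_T = np.transpose(a_prev, (0, 2, 1))         # (m, 1, nPrev)
    dW = np.matmul(dz, a_prev_T).sum(axis=0) / m       # (nUnits, nPrev)
    db = dz.sum(axis=0) / m                            # (nUnits, 1)
    da_prev = np.matmul(W.T, dz)                       # (m, nPrev, 1)
    return dW, db, da_prev
```

`da_prev` stays per-sample because it is handed to the preceding layer as its own `da`.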

In [114]:
#------------------------------------------------------------------------------
#   Layer class
#------------------------------------------------------------------------------
class Layer:
    def __init__(self, act="linear", name="layer"):
        self.shape = (0, 0)
        self.activation = CreateActivationFunction(act)
        self.name = name

    def initialize(self, prevLayer):
        pass

    def forward(self, x):
        pass

#------------------------------------------------------------------------------
#   InputLayer class
#------------------------------------------------------------------------------
class InputLayer(Layer):
    def __init__(self, nUnits, name="Input"):
        super().__init__(act="linear", name=name)
        self.nUnits = nUnits

    def initialize(self, prevLayer):
        self.shape = (self.nUnits, 1)

    def forward(self, x):
        return x

    def backward(self, X):
        return None
    
#------------------------------------------------------------------------------
#   Basic Dense Layer class
#------------------------------------------------------------------------------
class DenseLayer(Layer):
    def __init__(self, nUnits, act="linear", name="Dense"):
        super().__init__(act, name=name)
        # Number of units; weights and bias are created in initialize()
        self.nUnits = nUnits
        self.W = None
        self.b = None

    def initialize(self, prevLayer):
        # Layer shape: (units in this layer, units in the previous layer)
        self.shape = (self.nUnits, prevLayer.shape[0])

        # Initialize weights and bias
        prev_nUnits, _ = prevLayer.shape
        self.W = np.random.randn(self.nUnits, prev_nUnits)
        self.b = np.zeros((self.nUnits, 1), dtype=float)

    def forward(self, X):
        print("Forward of", self.name)
        self.z = np.matmul(self.W, X) + self.b         # Z = W*x + b
        self.a = self.activation(self.z)               # a = activation(Z)
        
        return self.a

    def backward(self, da, aPrev):
        #   da      =   dL/da of this layer's output, received from the layer handled just before it in the backward pass
        #   aPrev   =   activation of the previous layer in the forward pass (this layer's input), needed for dL/dW
        batch_size = aPrev.shape[0]
        print("Backward of", self.name)
        ###>>> start of solution

        ###<<< end of solution
        pass
    

## Loss Functions

Implement two standard loss functions (Binary Cross Entropy and Mean Squared Error), which you will/can use in your implementation of MLP backward pass.
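
A minimal sketch of both losses and their derivatives with respect to the prediction `A` follows. The function names and the clipping constant `EPS` are assumptions made for illustration. The values are returned per element; averaging over the batch is left to the caller.

```python
import numpy as np

EPS = 1e-12  # assumed clipping constant; keeps log() and the divisions finite

def bce(A, Y):
    A = np.clip(A, EPS, 1.0 - EPS)
    return -(Y * np.log(A) + (1.0 - Y) * np.log(1.0 - A))

def bce_derivative(A, Y):
    A = np.clip(A, EPS, 1.0 - EPS)
    return -(Y / A) + (1.0 - Y) / (1.0 - A)

def mse(A, Y):
    return (A - Y) ** 2

def mse_derivative(A, Y):
    return 2.0 * (A - Y)
```

Clipping is one common way to handle the logarithm's domain; without it, a prediction of exactly 0 or 1 produces `inf` or `nan`.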

In [115]:
#------------------------------------------------------------------------------
#   LossFunction class
#------------------------------------------------------------------------------
class LossFunction:
    def __init__(self):
        pass

    def __call__(self, A, Y):
        pass

    def derivation(self, A, Y):
        pass


#------------------------------------------------------------------------------
#   BinaryCrossEntropyLossFunction class
#------------------------------------------------------------------------------
class BinaryCrossEntropyLossFunction(LossFunction):
    def __call__(self, A, Y):
        # Warning! log is only defined for positive arguments - keep A strictly between 0 and 1
        ###>>> start of solution

        ###<<< end of solution
        pass
    
    def derivation(self, A, Y):
        # Warning! The loss uses a logarithm - keep A strictly between 0 and 1 to avoid division by zero
        ###>>> start of solution

        ###<<< end of solution
        pass
    
class MeanSquaredErrorLossFunction(LossFunction):
    def __call__(self, A, Y):
        ###>>> start of solution

        ###<<< end of solution
        pass

    def derivation(self, A, Y):
        ###>>> start of solution

        ###<<< end of solution
        pass


MAP_LOSS_FUNCTIONS = {
    "bce": BinaryCrossEntropyLossFunction,
    "mse": MeanSquaredErrorLossFunction
}

def CreateLossFunction(kind):
    if (kind in MAP_LOSS_FUNCTIONS):
        return MAP_LOSS_FUNCTIONS[kind]()
    raise ValueError(kind, "Unknown loss function {0}".format(kind))

## Model class

This is the basic class that holds all of your layers. Once the model is created and all layers are initialized, it predicts results from your input by running a forward pass through all the layers.

Implement backpropagation.
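
The overall shape of backpropagation is a loop over the layers in reverse order, where each layer receives `da` from the layer after it and needs the activation of the layer before it. The sketch below illustrates that idea only: `forward_all` and `backward_all` are hypothetical helpers that work on a plain list of `(W, b)` pairs and assume a sigmoid on every layer, rather than using the `Layer` classes from this notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_all(params, X):
    """params: list of (W, b). Returns all activations [a0 = X, a1, ..., aL]."""
    acts = [X]
    for W, b in params:
        acts.append(sigmoid(np.matmul(W, acts[-1]) + b))
    return acts

def backward_all(params, acts, dLoss):
    """Walk the layers in reverse, passing each layer's da_prev to the one before it."""
    m = acts[0].shape[0]
    grads = []
    da = dLoss
    layers = zip(reversed(params), reversed(acts[1:]), reversed(acts[:-1]))
    for (W, _b), a, a_prev in layers:
        dz = da * a * (1.0 - a)                               # sigmoid'(z) via a = sigmoid(z)
        dW = np.matmul(dz, np.transpose(a_prev, (0, 2, 1))).sum(axis=0) / m
        db = dz.sum(axis=0) / m
        da = np.matmul(W.T, dz)                               # becomes da of the earlier layer
        grads.append((dW, db))
    return grads[::-1]                                        # same order as params
```

The activations saved during the forward pass are what make the backward pass possible, which is why `DenseLayer.forward` stores `self.z` and `self.a`.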

In [116]:
#------------------------------------------------------------------------------
#   Model class
#------------------------------------------------------------------------------
class Model:
    def __init__(self, lossName):
        self.layers = []
        # Initialize loss function
        self.loss_fn = CreateLossFunction(lossName)
        
    def addLayer(self,  layer):
        self.layers.append(layer)

    def initialize(self):
        # Call initialization sequentially on all layers
        prevLayer = None
        for l in self.layers:
            l.initialize(prevLayer)
            prevLayer = l      
    
    def forward(self, X):
        # Single feed forward
        A = X
        for l in self.layers:
            A = l.forward(A)
            
        return A  
    
    def backward(self, dLoss):
        ###>>> start of solution

        ###<<< end of solution
        pass
    
    def compute_loss(self, A, Y):
        batch_size = Y.shape[0]
        
        ###>>> start of solution

        ###<<< end of solution
        
        pass
    
    def derive_loss(self, A, Y):
        batch_size = Y.shape[0]
        
        ###>>> start of solution

        ###<<< end of solution
        
        pass

### Main Processing Cell

 1. Initialize dataset. 
 2. Declare a simple model (at least 4 layers) with ReLU on the hidden layers and sigmoid on the output layer.
 3. Perform forward pass through the network. 
 4. Compute loss.
 5. Derive loss.
 6. Perform backward pass.
 7. Celebrate and scroll lower.

In [117]:
# Main processing
from dataset import dataset_Circles
# Task A:

X, Y = dataset_Circles(n=16, radius=0.7, noise=0.0)
###>>> start of solution

###<<< end of solution

**How does gradient checking work?**

As in 1) and 2), you want to compare "gradapprox" to the gradient computed by backpropagation. The formula is still:

$$ \frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon} \tag{1}$$

However, $\theta$ is not a scalar anymore. It is a dictionary called "parameters". We implemented a function "`dictionary_to_vector()`" for you. It converts the "parameters" dictionary into a vector called "values", obtained by reshaping all parameters (W1, b1, W2, b2, W3, b3) into vectors and concatenating them.

The inverse function is "`vector_to_dictionary`" which outputs back the "parameters" dictionary.


We have also converted the "gradients" dictionary into a vector "grad" using `gradients_to_vector()`. You don't need to worry about that.


Here is pseudo-code that will help you implement the gradient check.

For each i in num_parameters:
- To compute `J_plus[i]`:
    1. Set $\theta^{+}$ to `np.copy(parameters_values)`
    2. Set $\theta^{+}_i$ to $\theta^{+}_i + \varepsilon$
    3. Calculate $J^{+}_i$ using `forward_propagation_n(x, y, vector_to_dictionary(`$\theta^{+}$ `))`.
- To compute `J_minus[i]`: do the same thing with $\theta^{-}$
- Compute $gradapprox[i] = \frac{J^{+}_i - J^{-}_i}{2 \varepsilon}$

Thus, you get a vector `gradapprox`, where `gradapprox[i]` is an approximation of the gradient with respect to `parameters_values[i]`. You can now compare this `gradapprox` vector to the gradients vector from backpropagation. Just like for the 1D case (Steps 1', 2', 3'), compute:
$$ difference = \frac {\| grad - gradapprox \|_2}{\| grad \|_2 + \| gradapprox \|_2 } \tag{3}$$
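
The pseudo-code and the difference formula above can be sketched for a generic cost function. This is a minimal illustration and not the graded `gradient_check_n`: the helper `gradient_check`, the callable `J`, and the toy cost are assumptions, and `theta` is taken to be a flat 1-D float vector.

```python
import numpy as np

def gradient_check(J, theta, grad, epsilon=1e-7):
    """Relative difference between grad and a central-difference estimate of dJ/dtheta."""
    gradapprox = np.zeros_like(theta, dtype=float)
    for i in range(theta.size):
        theta_plus = np.copy(theta)
        theta_plus[i] += epsilon
        theta_minus = np.copy(theta)
        theta_minus[i] -= epsilon
        gradapprox[i] = (J(theta_plus) - J(theta_minus)) / (2.0 * epsilon)
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    return numerator / denominator

# Toy cost with a known gradient: J(theta) = sum(theta^2), dJ/dtheta = 2 * theta
theta = np.array([1.0, -2.0, 3.0])
J = lambda t: np.sum(t ** 2)
print(gradient_check(J, theta, 2.0 * theta))   # tiny value: the gradient is correct
print(gradient_check(J, theta, 3.0 * theta))   # about 0.2: the gradient is wrong
```

A very small difference indicates that the analytic gradient agrees with the numerical one; a value that is orders of magnitude larger usually points to a bug in the backward pass.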


**The code will be added later** but soon enough ;)

In [118]:
# GRADED FUNCTION: gradient_check_n



## Verification cell

 8. Verify your solution by gradient checking.
 9. Start crying.
 10. Repeat until correct ;)

In [119]:
# gradient_check_n(network, X, Y)
