# Intro to Deep Learning

This notebook accompanies the Intro to Deep Learning workshop run by Hackers at Cambridge.

TODO: Fill in the code corresponding to the comments, and replace the variables set to None.

## Importing Data and Dependencies


First, we will import the dependencies - **numpy**, a Python library we'll be using for matrix multiplication, and **matplotlib** for visualisation purposes.

We'll import the datasets using nice loader functions from **sklearn**.


To run a cell, hover over the [    ] next to the cell and click the play button that appears - the number in the [   ] tells you the order of execution (1, 2, 3 etc.).

In [0]:
import numpy as np
import matplotlib.pyplot as plt
# note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston, load_breast_cancer


We'll load the housing dataset using the **load_boston(return_X_y=True)** function. This returns our inputs $X$ and our labels $y$ as a tuple.
If you inspect them, you'll notice they don't have the dimensions we want, so you'll need to reshape them. We have $m=506$ examples and $n=13$ features.

We want to store $X$ in an $n$ x $m$ matrix (currently it is $m$ x $n$), and $y$ in a $1$ x $m$ matrix (currently it is an $m$-dimensional vector).


It is also good practice to normalise the data, where $\mu$ is the mean and $\sigma$ the standard deviation of each feature *across examples*:

$$X = \frac{X-\mu}{\sigma}$$

as this speeds up learning.

Finally, split the data into training and test sets (80:20 split) - we'll keep the test data to one side to evaluate our model at the end.

Rather than hardcoding dimensions like 506, 13 etc, extract the dimensions from the X.shape tuple and use them - this will allow you to reuse the code for different datasets.

### Useful functions:

          A.shape #returns a tuple consisting of A's dimensions
          
          A = np.reshape(A, (b,c)) # reshapes A into a b x c matrix - (b,c) = tuple of dimensions
          A.T  #returns the transpose of A (flips rows and columns)
          
          np.mean(A, axis=1, keepdims=True) # takes the mean of A across axis 1; keepdims=True ensures the number of dimensions stays the same. If A is an (a,b) dimensional array, axis=1 returns an (a,1) dimensional array and axis=0 returns a (1,b) dimensional array
          
          np.std(A, axis=1, keepdims=True) #ditto but with standard deviation
          
          A[2:5, :] # we can take slices - e.g. this will return all the columns of rows 2-4
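
To see how these functions fit together, here is a minimal sketch of the preprocessing steps on a small synthetic array. The names `X_toy` and `y_toy` and the toy sizes (10 examples, 3 features) are made up for illustration and stand in for the loader output; this is one possible way of doing it, not the only one.

```python
import numpy as np

# Synthetic stand-in for a dataset: m=10 examples, n=3 features (made-up numbers)
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(10, 3))  # (m, n), like the loader output
y_toy = rng.normal(size=10)                           # m-dimensional vector

m, n = X_toy.shape                  # read the dimensions rather than hardcoding them
X_toy = X_toy.T                     # now (n, m): one column per example
y_toy = np.reshape(y_toy, (1, m))   # now (1, m)

# normalise each feature (row) across the examples (axis=1)
mu = np.mean(X_toy, axis=1, keepdims=True)     # shape (n, 1)
sigma = np.std(X_toy, axis=1, keepdims=True)   # shape (n, 1)
X_toy = (X_toy - mu) / sigma

# 80:20 split along the example axis (the columns)
split = int(0.8 * m)
X_toy_train, X_toy_test = X_toy[:, :split], X_toy[:, split:]
y_toy_train, y_toy_test = y_toy[:, :split], y_toy[:, split:]

print(X_toy_train.shape, X_toy_test.shape)
```

Because `mu` and `sigma` have shape (n, 1), numpy broadcasts them across all the columns, so every example is normalised with the same per-feature statistics.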




In [0]:
X,y = load_boston(return_X_y=True)

# reshape to make sure dimensions are right 
X = None
y = None


#normalise the data
X = None

#split data into train and test set
X_train = None
Y_train = None

X_test = None
Y_test = None


## Creating the neural network:

Having preprocessed our data into matrices, it is now time to create the feedforward neural network. 

First we need to initialise parameters: the weights and biases for each layer.

The weights for layer *$l$* are stored in *$ W^{[l]}$*, an *$n_l$ x $n_{l-1}$* matrix, where *$n_l$* is the number of units in layer *$l$*. 
We initialise the weights randomly to break symmetry, and multiply by 0.001 to ensure the weights aren't too large.

The biases for layer *$l$* are stored in *$ b^{[l]}$*, which is a *$n_l$ x 1* matrix.

### Useful functions:
          
          np.random.randn(a, b) # creates a random matrix with dimensions (a,b)
          np.zeros((a,b))  #matrix of zeros of size (a,b) - note the extra set of brackets!
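
As a quick check on the dimensions, here is a sketch for a single layer. The layer sizes in `toy_units` are hypothetical and chosen only so the shapes are easy to read off.

```python
import numpy as np

np.random.seed(0)        # only here to make the sketch repeatable
toy_units = [4, 3, 1]    # hypothetical sizes: 4 inputs, 3 hidden units, 1 output

# layer 1 maps n_0 = 4 values to n_1 = 3 values
W1 = np.random.randn(toy_units[1], toy_units[0]) * 0.001   # (n_1, n_0) = (3, 4)
b1 = np.zeros((toy_units[1], 1))                           # (n_1, 1)   = (3, 1)

print(W1.shape, b1.shape)
```

The biases can start at zero because the random weights are already enough to break the symmetry between units.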

In [0]:
def initialise_parameters(layers_units): #layers_units = list of number of nodes in each layer
    parameters = {}            # create a dictionary containing the parameters
    for l in range(1, len(layers_units)):
        parameters['W' + str(l)] = None
        parameters['b' + str(l)] = None
    return parameters

### Activation Functions:

The activation function $g(z)$ we will be using in the hidden layers is the ReLU function $g(z) = \max(0,z)$.

Another is the sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$

NB: Although the ReLU function is technically non-differentiable when $z=0$, in practice we can set the derivative=0 at $z=0$.


### Useful functions:
      
      np.exp(z) #exponentiates z element-wise
      A>c  #compares each element of A with c, giving a boolean matrix - in Python False=0 and True=1 when multiplied with numbers
 
So if you multiply this element-wise with matrix **A**, you zero out the values where A>c is false.
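
Here is a small sketch of the masking trick on a made-up array `z`, along with the element-wise sigmoid. It is only meant to show what the comparison and `np.exp` produce, not how the functions below have to be written.

```python
import numpy as np

z = np.array([[-2.0, 0.0, 3.0]])   # made-up values

mask = z > 0                  # boolean matrix: [[False, False, True]]
relu_out = z * mask           # entries where the mask is False become 0
relu_grad = mask * 1.0        # 1 where z > 0, 0 elsewhere (including z = 0)
sig = 1 / (1 + np.exp(-z))    # element-wise sigmoid, always between 0 and 1

print(relu_out, relu_grad, sig)
```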

In [0]:
def sigmoid(z):
    return None

def relu(z, deriv = False):
    if(deriv):  #This is for gradients - do this when you do next section!
        return None #gradient = 1 if z>0, 0 otherwise
    else:
        return None

We can now write the code for the forward propagation step.

In each layer $l$, we matrix multiply the output of the previous layer $A^{[l-1]}$ by a weight matrix $W^{[l]}$ and then add a bias term $b^{[l]}$. We then take the result $Z^{[l]}$ and apply the activation function $g(z)$ to it to get the output $A^{[l]}$. $L$ = number of layers.
The equations are thus:
$$Z^{[l]}=W^{[l]}A^{[l-1]} + b^{[l]}$$
$$A^{[l]}=g(Z^{[l]})$$
 
Here, we have $g(z) = ReLU(z)$ in the hidden layers; the final layer has no activation function.

### Useful functions:
          
          C = A.dot(B) #matrix multiplies A, B
          C = np.dot(A,B) #equivalent operation
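
To make the shapes concrete, here is a sketch of one forward step on made-up numbers. The values of `A_prev`, `W` and `b` are purely illustrative, and `np.maximum` is just one way to write ReLU.

```python
import numpy as np

A_prev = np.array([[1.0, 2.0],
                   [3.0, 4.0]])   # (n_{l-1}, m) = (2, 2): two examples as columns
W = np.array([[2.0, -1.0]])       # (n_l, n_{l-1}) = (1, 2)
b = np.array([[0.5]])             # (n_l, 1), broadcast across the m columns

Z = W.dot(A_prev) + b             # shape (n_l, m) = (1, 2)
A = np.maximum(0, Z)              # ReLU applied element-wise

print(Z, A)
```

Note that `b` has only one column but is added to every column of `W.dot(A_prev)` - numpy broadcasts it, so the same bias is applied to every example.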
          
         



In [0]:
def forward_propagation(X,parameters):
    cache = {} #stores all our intermediate calculations since we need them for backpropagation
    L = len(parameters)//2 #final layer
    cache["A0"] = X #ease of notation since input = layer 0
    for l in range(1, L):
        cache['Z' + str(l)] = None
        cache['A' + str(l)] = None #use relu as activation function
    #final layer
    cache['Z' + str(L)] = None
    cache['A' + str(L)] = cache['Z' + str(L)] #no activation function for last layer 
    return cache 

## Implementing the Learning 


Next we can compute the loss function - this is the objective function the neural network will aim to minimise during training:

$m$ = number of training examples, $(x^{(i)},y^{(i)})$ is the $i^{th}$ training example, $a^{[L](i)}$ is the output of the final layer $L$ for that $i^{th}$ training example.


**Mean Squared Error:**

$$ J(W^{[1]}, b^{[1]}, \ldots) = \frac{1}{2m} \sum_{i=1}^{m} (a^{[L](i)} - y^{(i)})^2 $$

### Useful functions:
          
          np.square(A) #square each element in A

          np.sum(A, axis=1, keepdims= True) # just like mean and std, take sum along axis 1
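
As a worked example of the formula, here is the cost for three made-up predictions and labels (`AL_toy` and `Y_toy` are illustrative names, not part of the notebook's code).

```python
import numpy as np

AL_toy = np.array([[3.0, 5.0, 2.0]])   # made-up predictions, shape (1, m)
Y_toy = np.array([[1.0, 5.0, 4.0]])    # made-up labels, shape (1, m)

m_toy = Y_toy.shape[1]                 # number of examples = number of columns
cost_toy = np.sum(np.square(AL_toy - Y_toy)) / (2 * m_toy)

print(cost_toy)   # (4 + 0 + 4) / (2 * 3)
```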



In [0]:
def cost_function(AL,Y):
    m = None
    cost = None
    return cost

### Backpropagation:

Calculating the gradients:

For the final layer:

$$\frac{\partial \mathcal{J} }{\partial Z^{[L]}} = A^{[L]} - Y$$ 


For a general layer $l$, 

$$ \frac{\partial \mathcal{J} }{\partial Z^{[l]}} = \frac{\partial \mathcal{J} }{\partial A^{[l]}}*g^{'}(Z^{[l]})$$

$$ \frac{\partial \mathcal{J} }{\partial W^{[l]}} = \frac{1}{m}\frac{\partial \mathcal{J} }{\partial Z^{[l]}} A^{[l-1] T} $$

$$ \frac{\partial \mathcal{J} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} \frac{\partial \mathcal{J} }{\partial Z^{[l](i)}}$$

$$ \frac{\partial \mathcal{J} }{\partial A^{[l-1]}} = W^{[l] T} \frac{\partial \mathcal{J} }{\partial Z^{[l]}} $$




If you are keen, it's a good exercise to derive them yourself or alternatively check this [post](https://mukul-rathi.github.io/2018/08/31/Backpropagation.html) for a deeper dive into the intuition behind it.
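
One way to convince yourself the formulas are right is a numerical gradient check. The sketch below does this for a single final layer (no activation) on random toy data; the helper `toy_cost`, the array sizes and the step size `eps` are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A_prev = rng.normal(size=(3, 5))   # (n_{l-1}, m): 5 made-up examples
Y = rng.normal(size=(1, 5))
W = rng.normal(size=(1, 3))
b = np.zeros((1, 1))
m = Y.shape[1]

def toy_cost(W, b):
    Z = W.dot(A_prev) + b          # final layer, so A = Z
    return np.sum(np.square(Z - Y)) / (2 * m)

# analytic gradients from the formulas above
dZ = W.dot(A_prev) + b - Y                   # A^[L] - Y
dW = dZ.dot(A_prev.T) / m                    # same shape as W
db = np.sum(dZ, axis=1, keepdims=True) / m   # same shape as b

# numerical estimate for one weight using central differences
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
dW_numeric = (toy_cost(W_plus, b) - toy_cost(W_minus, b)) / (2 * eps)

print(dW[0, 0], dW_numeric)   # the two values should agree closely
```

A useful sanity check while coding backpropagation is that each gradient has the same shape as the parameter it belongs to.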

### Useful functions:
          np.sum(A, axis=1, keepdims= True) # just like mean and std, take sum along axis 1
          
          A.dot(B) # matrix multiplication returns A.B
          
          A*B #returns elementwise multiplication (useful for the dZ equation, which multiplies by g'(Z))

In [0]:
def backpropagation(cache,Y,parameters):
    L = len(parameters)//2 
    m = Y.shape[1]
    grads = {}
    #code up the last layer explicitly
    grads["dZ" + str(L)]= None
    grads["dW" + str(L)]= None
    grads["db" + str(L)]= None
    for l in range(L-1,0,-1): 
        grads["dA" + str(l)]= None
        grads["dZ" + str(l)]= None
        grads["dW" + str(l)]= None
        grads["db" + str(l)]= None
    return grads

### Gradient Descent

Now let's combine the functions created so far to create a model and train it using gradient descent.

The update equations for the parameters are as follows:
$$ W^{[l]} = W^{[l]} - \alpha \frac{\partial \mathcal{J} }{\partial W^{[l]}} $$

$$ b^{[l]} = b^{[l]} - \alpha \frac{\partial \mathcal{J} }{\partial b^{[l]}} $$

where $\alpha$ is the learning rate parameter.
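
To get a feel for the update rule, here is a sketch of gradient descent on a single made-up parameter `w`, minimising the toy function $(w-3)^2$ rather than the network's cost. The starting value, learning rate and number of steps are arbitrary choices for illustration.

```python
w = 0.0        # single toy parameter, arbitrary starting value
alpha = 0.1    # learning rate

for step in range(100):
    grad = 2 * (w - 3.0)     # derivative of (w - 3)^2 with respect to w
    w = w - alpha * grad     # same form as the update equations above

print(w)   # ends up very close to the minimum at w = 3
```

Each step moves `w` a fraction of the way towards the minimum; a learning rate that is too large would overshoot instead.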

In [0]:
def train_model(X_train, Y_train,num_epochs,layers_units,learning_rate): #epoch = one cycle through the dataset
    train_costs = []
    
    #Initialise the parameters
    
    L = len(layers_units)-1 
    for epoch in range (num_epochs):
        #run one step of forward propagation
        
        #calculate the cost
        
        #get gradients using backpropagation
        grads = None

        #iterate through each layer and update the parameters using gradient descent 
        #hint:  weight at layer l  = parameters["W"+ str(l)]  
        for _ in _:
            None

        #periodically output an update on the current cost and performance on the training set for visualisation
        train_costs.append(cost)
        if(epoch%max(1,num_epochs//10)==0): #max(1, ...) avoids a modulo by zero when num_epochs < 10
            print("Training the model, epoch: " + str(epoch+1))
            print("Cost after epoch " + str((epoch)) + ": " + str(cost))
    print("Training complete!")
    #return the trained parameters and the visualisation metrics
    return parameters, train_costs

To evaluate the model, we'll visualise the training set error over the number of iterations. We then output the final value of the evaluation metric for training and test sets. (I've used *matplotlib* to plot the graph).

In [0]:
def evaluate_model(train_costs,parameters,X_train, Y_train, X_test, Y_test):
    #plot the graphs of training set error
    plt.plot(np.squeeze(train_costs))
    plt.ylabel('Cost')
    plt.xlabel('Iterations')
    plt.title("Training Set Error")
    plt.show()
    L = len(parameters)//2
    
    #For train and test sets, perform a step of forward propagation to obtain the trained model's 
    #predictions and evaluate this
    
    train_cache = forward_propagation(X_train,parameters)
    train_AL = train_cache["A"+ str(L)]
    
    print("The train set MSE is: "+str(cost_function(train_AL,Y_train)))
        
    test_cache = forward_propagation(X_test,parameters)
    test_AL = test_cache["A"+ str(L)]
    
    print("The test set MSE is: "+str(cost_function(test_AL,Y_test)))
    

## Training the model

Now it's time to train the model using our helper functions.

Let's define our hyperparameters - I encourage you to play around with these - e.g. add more layers, change number of iterations.

You might find the model does much worse on the test set than on the training set - this is called **overfitting** - and you can read more about it [here](https://mukul-rathi.github.io/2018/09/02/DebuggingLearningCurve.html).



In [0]:
#define the hyperparameters for the model - "tuning knobs"

num_epochs = 1500 #number of passes through the training set
layers_units = [X.shape[0], 1] #layer 0 is the input layer - each value in list = number of nodes in that layer
learning_rate = 1e-4 #size of our step


In [0]:
parameters, train_costs = train_model(X_train, Y_train ,num_epochs,layers_units,learning_rate)         

In [0]:
evaluate_model(train_costs,parameters,X_train, Y_train, X_test, Y_test)

## Summary and Extensions:

You've just trained your first deep learning model! As an extension, try running the code again, but this time, use the **load_breast_cancer()** function instead of **load_boston()**. This is a dataset that classifies breast cancer as malignant/benign.

Remember that sigmoid function in the lectures? We can use it to predict probabilities for classification, so all you need to do is apply it to the output of the final layer.

A couple of other minor tweaks - for classification, the network uses the **cross-entropy loss** as a cost function instead of mean squared error, and you'll want to print out accuracy rather than MSE in the evaluation function.

But the cool thing is that the network structure is the **same**! The same network, just with a sigmoid function applied to the output, can be trained on a *completely different task* and still work.
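
As a rough sketch of those tweaks, the snippet below applies a sigmoid to some made-up final-layer outputs and computes the cross-entropy loss and accuracy. The binary cross-entropy formula used here is the standard one rather than something defined in this notebook, and the 0.5 threshold and the names `Z_out` and `Y_cls` are assumptions for illustration.

```python
import numpy as np

Z_out = np.array([[2.0, -1.0, 0.5, -3.0]])   # made-up final-layer outputs, shape (1, m)
Y_cls = np.array([[1, 0, 0, 0]])             # made-up binary labels

A_out = 1 / (1 + np.exp(-Z_out))             # sigmoid turns outputs into probabilities
m_cls = Y_cls.shape[1]

# binary cross-entropy (standard formula, assumed here)
ce = -np.sum(Y_cls * np.log(A_out) + (1 - Y_cls) * np.log(1 - A_out)) / m_cls

predictions = (A_out > 0.5) * 1              # predict class 1 above the assumed 0.5 cut-off
accuracy = np.mean(predictions == Y_cls)     # fraction of correct predictions

print(ce, accuracy)
```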

That's the power of deep learning! Stay tuned for future workshops on specialised deep learning models for computer vision and natural language processing. If you want to dive deeper, head over to the [blog](http://mukul-rathi.github.io/blog.html).

In [0]:
print("Have a good day!")