# C1_Notes_W1 - Neural Networks Basics

This is the first course of the deep learning specialization at [Coursera](https://www.coursera.org/specializations/deep-learning) which is moderated by [DeepLearning.ai](http://deeplearning.ai/). The course is taught by Andrew Ng.

## Table of contents
* [Course summary](#course-summary)
* [1. Introduction to deep learning](#introduction-to-deep-learning)
  * [1.1 What is a (Neural Network) NN?](#what-is-a-neural-network-nn)
  * [1.2 Supervised learning with neural networks](#supervised-learning-with-neural-networks)
  * [1.3 Why is deep learning taking off?](#why-is-deep-learning-taking-off)
* [2. Neural Networks Basics](#neural-networks-basics)
  * [2.1 Binary classification](#binary-classification)
  * [2.2 Logistic regression](#logistic-regression)
  * [2.3 Logistic regression cost function](#logistic-regression-cost-function)
  * [2.4 Gradient Descent](#gradient-descent)
  * [2.5 Derivatives](#derivatives)
  * [2.6 More Derivatives examples](#more-derivatives-examples)
  * [2.7 Computation graph](#computation-graph)
  * [2.8 Derivatives with a Computation Graph](#derivatives-with-a-computation-graph)
  * [2.9 Logistic Regression Gradient Descent](#logistic-regression-gradient-descent)
  * [2.10 Gradient Descent on m Examples](#gradient-descent-on-m-examples)
  * [2.11 Vectorization](#vectorization)
  * [2.12 Vectorizing Logistic Regression](#vectorizing-logistic-regression)
  * [2.13 Notes on Python and NumPy](#notes-on-python-and-numpy)
  * [2.14 General Notes](#general-notes)

## Course summary

Here is the course summary as given on the [course page](https://www.coursera.org/learn/neural-networks-deep-learning):

> If you want to break into cutting-edge AI, this course will help you do so. Deep learning engineers are highly sought after, and mastering deep learning will give you numerous new career opportunities. Deep learning is also a new "superpower" that will let you build AI systems that just weren't possible a few years ago. 
>
> In this course, you will learn the foundations of deep learning. When you finish this class, you will:
> - Understand the major technology trends driving Deep Learning
> - Be able to build, train and apply fully connected deep neural networks 
> - Know how to implement efficient (vectorized) neural networks 
> - Understand the key parameters in a neural network's architecture 
>
> This course also teaches you how Deep Learning actually works, rather than presenting only a cursory or surface-level description. So after completing it, you will be able to apply deep learning to your own applications. If you are looking for a job in AI, after this course you will also be able to answer basic interview questions.



# 1. Introduction to deep learning

> Be able to explain the major trends driving the rise of deep learning, and understand where and how it is applied today.

## 1.1 What is a (Neural Network) NN?

- Single neuron (with no activation function) == linear regression
- Simple NN graph:
  - ![](Images/Others/01.jpg)
  - Image taken from [tutorialspoint.com](tutorialspoint.com)
- ReLU (rectified linear unit) is currently the most popular activation function; it makes deep NNs train faster.
- Hidden layers learn the relationships between inputs automatically; that's what deep learning is good at.
- A deep NN consists of more hidden layers (deeper layers)
  - ![](Images/Others/02.png)
  - Image taken from [opennn.net](opennn.net)
- Each input is connected to every unit of the hidden layer, and the NN learns how much weight each connection gets.
- Supervised learning means we have (X,Y) pairs and we need to learn the function that maps X to Y.
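
As a small illustration of the single-neuron idea above, here is a minimal sketch of one neuron followed by a ReLU activation. The function names `relu` and `neuron` and all the numbers are assumptions made up for this example; they are not taken from the course.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: max(0, z), applied element-wise."""
    return np.maximum(0, z)

def neuron(x, w, b):
    """One neuron: weighted sum of the inputs plus a bias, then ReLU."""
    return relu(np.dot(w, x) + b)

# Hypothetical inputs and weights, chosen only for illustration.
x = np.array([2.0, 3.0])
w = np.array([0.5, -1.0])

out_negative = neuron(x, w, b=1.0)   # 0.5*2 - 1.0*3 + 1 = -1, ReLU clips it to 0
out_positive = neuron(x, w, b=4.0)   # 0.5*2 - 1.0*3 + 4 = 2, ReLU leaves it unchanged
print(out_negative, out_positive)
```

Without the `relu` call, `neuron` would be exactly the linear model `w.x + b`.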

## 1.2 Supervised learning with neural networks

- Different types of neural networks are used for supervised learning, including:
  - CNN or convolutional neural networks (Useful in computer vision)
  - RNN or Recurrent neural networks (Useful in Speech recognition or NLP)
  - Standard NN (Useful for Structured data)
  - Hybrid/custom NN or a combination of NN types
- Structured data is data like databases and tables.
- Unstructured data is data like images, video, audio, and text.
- Structured data has generated more economic value so far, because companies rely on predictions made from their big data.

## 1.3 Why is deep learning taking off?

- Deep learning is taking off for 3 reasons:
  1. Data:
     - Using this image we can conclude:
       - ![](Images/11.png)
     - For small data, an NN can perform about as well as linear regression or an SVM (support vector machine)
     - For big data, a small NN is better than an SVM
     - For big data, a big NN is better than a medium NN, which is better than a small NN.
     - Fortunately, we now have a lot of data because the world is using computers more and more
       - Mobiles
       - IOT (Internet of things)
  2. Computation:
     - GPUs.
     - Powerful CPUs.
     - Distributed computing.
     - ASICs
  3. Algorithm:
     1. Creative algorithms have appeared that changed the way NNs work.
        - For example, using the ReLU function instead of the sigmoid function makes training an NN much faster because it helps with the vanishing gradient problem.


# 2. Neural Networks Basics

> Learn to set up a machine learning problem with a neural network mindset. Learn to use vectorization to speed up your models.

## 2.1 Binary classification

- This section is mainly about how to use logistic regression to build a binary classifier.
  - ![log](Images/Others/03.png)
  - Image taken from [3.bp.blogspot.com](http://3.bp.blogspot.com)
- The running example is deciding whether an image contains a cat or not.
- Here is some notation:
  - `M is the number of training examples`
  - `Nx is the size of the input vector`
  - `Ny is the size of the output vector`
  - `X(1) is the first input vector`
  - `Y(1) is the first output vector`
  - `X = [x(1) x(2).. x(M)]`
  - `Y = [y(1) y(2).. y(M)]`
- We will use Python in this course.
- NumPy lets us build matrices and operate on them quickly and reliably.

### NOTE: We stack observations along the COLUMNS, NOT ROWS [x.shape = (n,m) where n is num_feats and m is num_obs]. This makes the implementation of neural networks a lot easier.
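
To make the column-stacking convention concrete, here is a minimal NumPy sketch. The three example vectors and their labels are made up for illustration; only the shapes matter.

```python
import numpy as np

# Three hypothetical training examples, each with n_x = 2 features.
x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, 4.0])
x3 = np.array([5.0, 6.0])

# Stack the examples as COLUMNS, so X has shape (n_x, m).
X = np.column_stack([x1, x2, x3])

# The labels form a row vector of shape (1, m).
Y = np.array([[1, 0, 1]])

n_x, m = X.shape
print(X.shape, Y.shape)   # (2, 3) (1, 3)
print(X[:, 0])            # the first training example is the first column
```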

## 2.2 Logistic regression

- Logistic regression is an algorithm used for classification with 2 classes.
- Equations:
  - Simple equation:	`y' = wx + b`
  - If x is a vector: `y' = w(transpose)x + b`
  - If we need y' to be between 0 and 1 (a probability): `y' = sigmoid(w(transpose)x + b)`
  - In some notations this might be used: `y' = sigmoid(w(transpose)x)`
    - Here `b` is folded into `w` as `w0` and we add `x0 = 1`, but we won't use this notation in the course (Andrew said that the first notation is better).
- In binary classification the label `y` is either `0` or `1`, while the prediction `y'` lies between `0` and `1`.
- In the last equation `w` is a vector of size `Nx` and `b` is a real number
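
The equations above can be sketched in NumPy as follows. This is a minimal sketch under assumptions: the names `sigmoid` and `predict_proba` and the parameter values are made up for illustration, and `x` is taken to be a column vector of shape `(Nx, 1)`.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """Compute y' = sigmoid(w^T x + b) for a column vector x of shape (Nx, 1)."""
    return sigmoid(np.dot(w.T, x) + b)

# Hypothetical parameters and input with Nx = 3.
w = np.zeros((3, 1))
b = 0.0
x = np.array([[0.2], [-1.5], [3.0]])

y_hat = predict_proba(w, b, x)
print(y_hat)   # [[0.5]] because all parameters are zero and sigmoid(0) = 0.5
```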


## 2.3 Logistic regression cost function

### NON CONVEX (we won't use this):
- A first candidate loss function would be the squared error:  `L(y',y) = 1/2 (y' - y)^2`
  - But we won't use it because it leads to an optimization problem which is **non convex, meaning it contains local optima.**

### CONVEX (use this one!):
- This is the function that we will use: `L(y',y) = -(y*log(y') + (1-y)*log(1-y'))`
- To understand this function, consider the two cases:
  - if `y = 1` ==> `L(y',1) = -log(y')`  ==> we want `y'` to be as large as possible ==> the largest value `y'` can take is 1
  - if `y = 0` ==> `L(y',0) = -log(1-y')` ==> we want `1-y'` to be as large as possible ==> `y'` should be as small as possible, and the smallest value it can take is 0.
- Then the cost function will be: `J(w,b) = (1/m) * Sum(L(y'[i],y[i]))`
- The loss function computes the error for a single training example; the cost function is the average of the loss functions of the entire training set.
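
A minimal sketch of the loss and cost functions above, assuming natural logarithms and predictions strictly between 0 and 1. The function names and the sample numbers are made up for illustration.

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for a single example (or element-wise for arrays)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(Y_hat, Y):
    """Cost J: the average of the per-example losses."""
    return np.mean(loss(Y_hat, Y))

# Hypothetical predictions and labels for m = 3 examples.
Y_hat = np.array([0.9, 0.1, 0.8])
Y = np.array([1, 0, 1])

good = loss(0.9, 1)   # confident and right: small loss, about 0.105
bad = loss(0.1, 1)    # confident and wrong: large loss, about 2.303
J = cost(Y_hat, Y)
print(good, bad, J)
```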

![](images/log_reg.png)

## 2.4 Gradient Descent

- We want to find the `w` and `b` that minimize the cost function.
- Our cost function is convex.
- First we initialize `w` and `b` to 0,0 or to random values, and then repeatedly improve them until they reach the minimum.
- In logistic regression people usually use 0,0 instead of random values.
- The gradient descent algorithm repeats: `w = w - alpha * dw`
  where alpha is the learning rate and `dw` is the derivative of the cost `J` with respect to `w` (it sets the change to `w`).
  The derivative is also the slope of `J` along the `w` axis.
- It resembles a greedy algorithm: the derivative gives us the direction in which to improve our parameters.


- The actual equations we will implement:
  - `w = w - alpha * dJ(w,b)/dw`        (how much the function slopes in the w direction)
  - `b = b - alpha * dJ(w,b)/db`        (how much the function slopes in the b direction)
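
To see the update rule in action, here is a minimal sketch on a toy convex function of one parameter. The function `J(w) = (w - 3)^2`, the learning rate, and the number of steps are assumptions chosen for illustration; this is not the logistic regression cost.

```python
def J(w):
    """Toy convex cost with its minimum at w = 3."""
    return (w - 3.0) ** 2

def dJ_dw(w):
    """Derivative of the toy cost with respect to w."""
    return 2.0 * (w - 3.0)

alpha = 0.1   # learning rate (assumed value)
w = 0.0       # initialize at zero

for step in range(100):
    w = w - alpha * dJ_dw(w)   # the gradient descent update

print(w, J(w))   # w ends up very close to 3 and J(w) very close to 0
```

Each step shrinks the distance to the minimum by a constant factor here; a learning rate that is too large would make the steps overshoot instead.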

![](images/grad_desc.png)

### NOTE: If J (the cost function) is a function of 2 or more parameters (i.e. J(w,b)), then instead of a lowercase d we use the curly ∂ symbol to denote a 'partial derivative'.

## 2.5 Derivatives

- We will go over some of the required calculus.
- You don't need to be a calculus geek to master deep learning but you'll need some skills from it.
- The derivative of a straight line is its slope.
- ex. `f(a) = 3a`
  - `d(f(a))/d(a) = 3`
- if `a = 2` then `f(a) = 6`
- if we nudge `a` a little bit to `a = 2.001` then `f(a) = 6.003`, which means `f` changed by the derivative (slope, 3) times the size of the nudge (0.001).

### Derivative = Slope: when you nudge Wa up by 2, and J goes up by 4, then the partial derivative dWa (the derivative of J(Wa, b) wrt Wa) is 2 because J went up 2x as much as Wa.
   - **However, as we know, derivatives are defined as the slope of a function for an infinitesimally small nudge (using limits). But the intuition holds.**
   - Also, for non-linear functions (which is most functions), the slope is not the same everywhere!

![](images/slope.png)
## 2.6 More Derivatives examples

- `f(a) = a^2`  ==> `d(f(a))/d(a) = 2a`
  - `a = 2`  ==> `f(a) = 4`
  - `a = 2.0001` ==> `f(a) = 4.0004` approx.

**This just means that at any point on the curve, the slope is 2x the value of 'a'. If we increase 'a' by a small 'a_delta', the value of f(a) increases by approximately 2\*a\*a_delta.**

- `f(a) = a^3`  ==> `d(f(a))/d(a) = 3a^2`

**This means that for any value of 'a' the slope is 3a^2, so if we increase 'a' by a small 'a_delta', f(a) increases by approximately 3\*a^2\*a_delta.**

- `f(a) = log(a)`  ==> `d(f(a))/d(a) = 1/a`
- To conclude, the derivative is the slope, and the slope is different at different points of the function; that's why the derivative is itself a function.
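
The derivative rules above can be checked numerically by nudging `a` and measuring the slope. This is a minimal sketch; the helper `numerical_derivative` and the step size `eps` are assumptions made for this example, and `log` is taken to be the natural logarithm.

```python
import math

def numerical_derivative(f, a, eps=1e-6):
    """Estimate the slope of f at a with a small symmetric nudge."""
    return (f(a + eps) - f(a - eps)) / (2 * eps)

a = 2.0
slope_square = numerical_derivative(lambda t: t ** 2, a)   # rule says 2a = 4
slope_cube = numerical_derivative(lambda t: t ** 3, a)     # rule says 3a^2 = 12
slope_log = numerical_derivative(math.log, a)              # rule says 1/a = 0.5
print(slope_square, slope_cube, slope_log)

# The nudge example from the notes: f(a) = a^2 moved from a = 2 to a = 2.0001.
print(2.0001 ** 2 - 2.0 ** 2)   # roughly 2a * 0.0001 = 0.0004
```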

![](images/calculus1.png)

## NOTE: The rules that give the derivatives of different functions can be found in any calculus textbook. They can all be proved mathematically, but we won't do that here. Take a calculus course for that stuff!

![](images/calculus2.png)


![](images/c1w1n_notes1.jpg)

## 2.7 Computation graph

- It's a graph that organizes the computation from left to right.
- It pretty much breaks the function down into its component parts!
  - ![](Images/02.png)

## 2.8 Derivatives with a Computation Graph

- The calculus chain rule says:
  If `x -> y -> z`          (x affects y and y affects z)
  Then `d(z)/d(x) = d(z)/d(y) * d(y)/d(x)`
- The video illustrates a big example.
  - ![](Images/03.png)
- We compute the derivatives on a graph from right to left, which makes it a lot easier.
- `dvar` means the derivative of the final output variable with respect to an intermediate quantity `var`.

### NOTE: In code, 'dvar' really means dJ/dvar, the derivative of J wrt the variable 'var'
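
Here is a minimal sketch of a forward and backward pass through a small computation graph, using the `dvar` naming convention. The function `J = 3 * (a + b*c)` and the input values are assumptions chosen for illustration.

```python
# Forward pass (left to right): break J = 3 * (a + b*c) into simple steps.
a, b, c = 5.0, 3.0, 2.0
u = b * c        # u = 6
v = a + u        # v = 11
J = 3 * v        # J = 33

# Backward pass (right to left): each dvar means dJ/dvar.
dv = 3.0         # J = 3v, so dJ/dv = 3
du = dv * 1.0    # v = a + u, so dv/du = 1
da = dv * 1.0    # v = a + u, so dv/da = 1
db = du * c      # u = b*c, so du/db = c
dc = du * b      # u = b*c, so du/dc = b

print(J, da, db, dc)   # 33.0 3.0 6.0 9.0
```

Each backward line reuses the result computed just before it, which is the chain rule applied one node at a time.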

## 2.9 Logistic Regression Gradient Descent

- In the video he worked through the derivatives for gradient descent on a single training example with two features `x1` and `x2`.
![](images/c1w1n_logreg.png)
![](Images/04.png)

## 2.10 Gradient Descent on m Examples

- Lets say we have these variables:

  ```
  	X1                  Feature
  	X2                  Feature
  	W1                  Weight of the first feature.
  	W2                  Weight of the second feature.
  	B                   Logistic Regression parameter.
  	M                   Number of training examples
  	Y(i)                Expected output of i
  ```

- So we have:
  ![](Images/09.png)

- Then from right to left we calculate the derivatives of the loss with respect to each quantity:

  ```
  	d(a)  = d(l)/d(a) = -(y/a) + ((1-y)/(1-a))
  	d(z)  = d(l)/d(z) = a - y
  	d(W1) = X1 * d(z)
  	d(W2) = X2 * d(z)
  	d(B) = d(z)
  ```
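
The `d(z) = a - y` line above follows from the chain rule. A short derivation, assuming `a = sigmoid(z)` and the loss `l = -(y*log(a) + (1-y)*log(1-a))` from section 2.3, written in LaTeX:

```latex
\begin{aligned}
\frac{\partial l}{\partial a} &= -\frac{y}{a} + \frac{1-y}{1-a} \\
\frac{\partial a}{\partial z} &= a(1-a) \\
\frac{\partial l}{\partial z} &= \frac{\partial l}{\partial a}\cdot\frac{\partial a}{\partial z}
  = -y(1-a) + (1-y)\,a = a - y \\
\frac{\partial l}{\partial w_1} &= x_1\,\frac{\partial l}{\partial z}, \qquad
\frac{\partial l}{\partial w_2} = x_2\,\frac{\partial l}{\partial z}, \qquad
\frac{\partial l}{\partial b} = \frac{\partial l}{\partial z}
\end{aligned}
```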

- From the above we can conclude the logistic regression pseudo code:

  ```
  	J = 0; dw1 = 0; dw2 = 0; db = 0                 # Accumulators for the cost and the derivatives
  	w1 = 0; w2 = 0; b = 0                           # Parameters (weights and bias)
  	for i = 1 to m
  		# Forward pass
  		z(i) = w1*x1(i) + w2*x2(i) + b
  		a(i) = Sigmoid(z(i))
  		J += -(Y(i)*log(a(i)) + (1-Y(i))*log(1-a(i)))

  		# Backward pass
  		# ADD GRADIENTS TOGETHER FROM EACH OBSERVATION
  		# The same gradients will be added for every additional feature in the logistic regression model
  		# Two features is easy, but you will probably need another for loop to go over ALL features if n grows
  		dz(i) = a(i) - Y(i)
  		dw1 += dz(i) * x1(i)
  		dw2 += dz(i) * x2(i)
  		db  += dz(i)

  	# AVERAGE THE COST AND GRADIENTS OVER M (NUMBER OF OBSERVATIONS)
  	# THESE ARE THE GRADIENTS FOR A SINGLE STEP OF GRADIENT DESCENT
  	J /= m
  	dw1 /= m
  	dw2 /= m
  	db /= m

  	# Gradient descent: SINGLE STEP
  	w1 = w1 - alpha * dw1
  	w2 = w2 - alpha * dw2
  	b = b - alpha * db
  ```

- EVERYTHING ABOVE REPRESENTS **A SINGLE STEP OF GRADIENT DESCENT** because we only update the parameters once. The above code has to run for many iterations to minimize the error.

- So there will be two inner loops to implement logistic regression:
    - 1. A for loop over EACH training example
    - 2. A for loop to do the gradient calcs for each feature weight. With 2 weights we don't need this, but if you have 100 features (n_x = 100) you don't want to write a line for each of dw1, dw2 ... dw100, so you will need another loop!
- **This is SUPER INEFFICIENT!!!** And you will also need another loop to go through the number of gradient descent updates you want. Remember, the code above only updates the parameters ONCE!!!

- Vectorization can help us remove these inefficiencies. **We will do a single iteration of gradient descent WITHOUT ANY FOR LOOPS**.
- Vectorization is very important in deep learning for reducing loops. The whole loop in the code above can be done in one step using vectorization!

## 2.11 Vectorization

- Deep learning shines when the datasets are big. However, for loops will make you wait a long time for a result. That's why we need vectorization to get rid of some of our for loops.
- The NumPy library's `dot` function is vectorized by default.
- Vectorization can be done on a CPU or a GPU through SIMD operations, but it's usually faster on a GPU.
- Whenever possible avoid for loops.
- Most of the NumPy library functions are vectorized.

### Vectorizing (as opposed to for loops) can speed things up by a crazy amount... Below, our code runs roughly 270x faster!!!

    # SIMPLE MULTIPLICATION EXAMPLE
    import time
    import numpy as np

    a = np.random.rand(1000000)
    b = np.random.rand(1000000)

    # RUN VECTORIZED DOT PRODUCT
    tic = time.time()
    c = np.dot(a,b)
    toc = time.time()
    print(c)
    print("Vectorized runtime (milsec): {}".format(1000*(toc-tic)))

    # RUN NON-VECTORIZED DOT PRODUCT
    c = 0
    tic = time.time()
    for i in range(1000000):
        c += a[i]*b[i]
    toc = time.time()
    print(c)
    print("For Loop runtime (milsec): {}".format(1000*(toc-tic)))

Output:

    250329.512541
    Vectorized runtime (milsec): 3.503084182739258
    250329.512541
    For Loop runtime (milsec): 946.1748600006104


    import time
    import math
    import numpy as np

    n=100000
    a = np.random.rand(n)

    # RUN VECTORIZED EXPONENTIAL
    tic = time.time()
    c1 = np.exp(a)
    toc = time.time()
    print("Vectorized runtime (milsec): {}".format(1000*(toc-tic)))

    # RUN NON-VECTORIZED EXPONENTIAL
    c2 = np.zeros((n,))
    tic = time.time()
    for i in range(n):
        c2[i] = math.exp(a[i])
    toc = time.time()
    print("For Loop runtime (milsec): {}".format(1000*(toc-tic)))
    # compare with a tolerance rather than exact float equality
    print("Both vectors equal: ",np.allclose(c1, c2))

Output:

    Vectorized runtime (milsec): 1.5192031860351562
    For Loop runtime (milsec): 45.53103446960449
    Both vectors equal:  True


### NOTE: Always try to use proper shapes for your vectors and matrices (i.e. avoid shapes like `(5,)`).

#### a.shape = (5,) is referred to as a "rank 1 array"

    import numpy as np

    # RANK ONE ARRAY
    print(np.random.rand(5).shape)

    # REGULAR VECTOR
    print(np.random.rand(5,1).shape)

    # use assert statements to check your shapes
    a = np.random.rand(5,1)
    b = np.random.rand(5,)
    assert a.shape == (5,1)
    assert b.shape == (5,1), "DON'T USE 1 RANK ARRAYS U IDIOT!!"   # fails on purpose

Output:

    (5,)
    (5, 1)
    AssertionError: DON'T USE 1 RANK ARRAYS U IDIOT!!
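
To see why rank 1 arrays cause trouble, here is a minimal sketch comparing one with a proper column vector. The values are made up for illustration.

```python
import numpy as np

a = np.arange(5.0)                 # rank 1 array [0, 1, 2, 3, 4], shape (5,)
print(a.T.shape)                   # (5,): transposing a rank 1 array changes nothing
inner = np.dot(a, a.T)             # a scalar (30.0), not a 5x5 matrix
print(inner)

col = a.reshape(5, 1)              # proper column vector, shape (5, 1)
print(col.T.shape)                 # (1, 5): the transpose is now a row vector
outer = np.dot(col, col.T)         # outer product, shape (5, 5)
print(outer.shape)
```

Reshaping once, right after creating the array, keeps every later transpose and dot product unambiguous.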

## 2.12 Vectorizing Logistic Regression
- Whenever possible, we want to avoid for loops (either by using a built-in function or by vectorizing).
    - For instance, np.log and np.exp are way faster than doing for loops.
    - **Always look for a built-in function first before writing a for loop!**
- We will implement logistic regression using one for loop, then without any for loop.
- As input we have a matrix `X` of shape `[Nx, m]` and a matrix `Y` of shape `[Ny, m]`.
- We then compute all the `z` values at once: `[z1,z2...zm] = W.T * X + [b,b,...b]`
    - As we know, we have stacked ALL of the weights into one matrix and ALL of the X features into another. So instead of doing separate calcs for W1X1 and W2X2, we can simply multiply the two matrices together: W1\*X1 + W2\*X2 ==> W.T\*X. This is basic, but it's good to review!

This can be written in python as:
            
        # Vectorization, then broadcasting, Z shape is (1, m)
        Z = np.dot(W.T,X) + b
        
        # Vectorization, A shape is (1, m)
        # VECTORIZED SIGMOID FUNCTION
        A = 1 / (1 + np.exp(-Z))

Vectorizing Logistic Regression's Gradient Output:
            
        # Vectorization, dz shape is (1, m)
        dz = A - Y                 

        # Vectorization, dw shape is (Nx, 1)
        dw = np.dot(X, dz.T) / m

        # Vectorization, db is a scalar
        db = dz.sum() / m    
        
![](images/c1w1n_npgs.png)
![](images/c1w1n_npgs2.png)
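
Putting the vectorized pieces together, here is a minimal sketch of a full training loop. Only the outer loop over gradient descent iterations remains. The function name `train_logistic_regression`, the hyperparameter values, and the toy dataset are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, Y, alpha=0.1, num_iterations=1000):
    """X has shape (n_x, m), Y has shape (1, m). Returns w, b and the final cost."""
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(num_iterations):
        A = sigmoid(np.dot(w.T, X) + b)   # forward pass, shape (1, m)
        dZ = A - Y                        # shape (1, m)
        dw = np.dot(X, dZ.T) / m          # shape (n_x, 1)
        db = np.sum(dZ) / m               # scalar
        w = w - alpha * dw                # gradient descent update
        b = b - alpha * db
    A = sigmoid(np.dot(w.T, X) + b)
    cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return w, b, cost

# Toy, linearly separable data made up for this sketch.
rng = np.random.default_rng(0)
m = 200
X = rng.normal(size=(2, m))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)   # label is 1 when x1 + x2 > 0

w, b, cost = train_logistic_regression(X, Y)
predictions = (sigmoid(np.dot(w.T, X) + b) > 0.5).astype(float)
accuracy = np.mean(predictions == Y)
print(cost, accuracy)
```

With zero-initialized parameters the cost starts at `log(2)`, about 0.693, so a final cost below that value shows the updates are moving in the right direction.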

## 2.13 Notes on Python and NumPy

- In NumPy, `obj.sum(axis = 0)` sums down each column (one result per column) while `obj.sum(axis = 1)` sums across each row (one result per row).
- In NumPy, `obj.reshape(1,4)` returns the same values arranged in the new shape `(1,4)`; the total number of elements must stay the same.
- Reshape is cheap to compute, so use it wherever you're not sure about the shape of an array.
- Broadcasting applies when you do an element-wise operation on arrays whose shapes don't match. If the shapes are compatible, NumPy automatically stretches the smaller array to match the larger one.
- Some tricks to eliminate strange bugs in the code:
  - If you don't specify the shape of a vector, it will take a shape of `(m,)` and the transpose operation will have no effect. You have to reshape it to `(m, 1)`
  - Try not to use rank one arrays in NNs
  - Don't hesitate to use `assert(a.shape == (5,1))` to check if your matrix shape is the required one.
  - If you find a rank one array, run reshape on it.
- Jupyter / IPython notebooks are a very useful tool that makes it easy to combine code and documentation in one place. They run in the browser and don't need an IDE.
  - To open Jupyter Notebook, open the command line and run: `jupyter-notebook`. It has to be installed first.
- To Compute the derivative of Sigmoid:

  ```
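    import numpy as np

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    # derivative using calculus: sigmoid'(x) = s * (1 - s)
    s = sigmoid(np.array([-1.0, 0.0, 1.0]))   # example input
    ds = s * (1 - s)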
  ```

- To make an image of `(width,height,depth)` be a vector, use this:

  ```
  #reshapes the image.
  v = image.reshape(image.shape[0]*image.shape[1]*image.shape[2],1)  
  ```

![](images/c1w1n_broadcast.png)
- Gradient descent converges faster after normalization of the input matrices.
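
The axis, broadcasting, and normalization notes above can be sketched as follows. The matrix values are made up, and normalizing each row to unit length is only one possible normalization, assumed here for illustration.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])            # shape (2, 3)

col_sums = A.sum(axis=0)                   # one sum per column, shape (3,)
row_sums = A.sum(axis=1, keepdims=True)    # one sum per row, shape (2, 1)

# Broadcasting: (2, 3) / (2, 1) divides each row by its own sum.
row_fractions = A / row_sums

# Broadcasting: a (1, 3) row is added to every row of A.
shifted = A + np.array([[10.0, 20.0, 30.0]])

# Normalize each row to unit length (assumed normalization scheme).
norms = np.linalg.norm(A, axis=1, keepdims=True)   # shape (2, 1)
A_normalized = A / norms

print(col_sums, row_sums.ravel())
print(row_fractions.sum(axis=1))           # each row now sums to 1
```

Passing `keepdims=True` keeps the result as a `(2, 1)` matrix instead of a rank 1 array, which is what lets the division broadcast row by row.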

## 2.14 General Notes

- The main steps for building a Neural Network are:
  - Define the model structure (such as number of input features and outputs)
  - Initialize the model's parameters.
  - Loop.
    - Calculate current loss (forward propagation)
    - Calculate current gradient (backward propagation)
    - Update parameters (gradient descent)
- Preprocessing the dataset is important.
- Tuning the learning rate (which is an example of a "hyperparameter") can make a big difference to the algorithm.
- [kaggle.com](kaggle.com) is a good place for datasets and competitions.
- [Pieter Abbeel](https://www2.eecs.berkeley.edu/Faculty/Homepages/abbeel.html) is one of the best in deep reinforcement learning.



# EXTRA: Logistic Regression Cost Function
![](images/c1w1n_extrtalog.png)
![](images/c1w1n_extrtalog2.png)

# Hand Written Notes Review

![](handnotes/c1w1n_notes1.jpg)
![](handnotes/c1w1n_notes2.jpg)
![](handnotes/c1w1n_notes3a.jpg)
![](handnotes/c1w1n_notes3b.jpg)