<img src="figures/nn.gif" alt="nn" style="width: 1000px;"/>

# 1. Why is deep learning taking off

The basic technical ideas behind Deep Learning have been around for decades, so why are they only taking off today?

Deep learning is taking off today for 3 main reasons:

* Data
* Computation
* Algorithm


### 1.1. Data 

The easiest way to answer this question is with the picture below.

<div class="item">
    <img src="figures/why_dl.png" alt="why_dl" width="600px"/>
</div>

The vertical axis of the diagram shows the performance of an algorithm (e.g. its prediction accuracy) and the horizontal axis shows the amount of data it has been given.

The performance of traditional learning algorithms (Logistic Regression, SVMs etc.) increases at first as the amount of data grows, but then plateaus at a certain level and stops improving.

Over the last decades we have accumulated huge amounts of data that traditional learning algorithms cannot take advantage of, which is where Deep Learning comes into play.

Large Neural Networks (i.e. Deep Learning) keep getting better the more data you put into them.

* For a small amount of data, Neural Networks perform about as well as Linear regression or SVM (Support vector machine).
* For a big amount of data, even a small Neural Network tends to be better than SVM.
* For a big amount of data, a deeper Neural Network tends to be better than a medium Neural Network, which in turn is better than a small NN.
* Over the last decade, the world has been generating a huge amount of data from sources such as:
    * Mobiles
    * IOT (Internet of things)
    * ...
    
### 1.2. Computation

The last decade has seen the emergence of

* GPUs and TPUs.
* Powerful CPUs.
* Distributed computing.
* ASICs (Application-specific integrated circuit).

Fast computation is very important because the process of training neural networks is very iterative.

You often have an idea for a neural network architecture, then you code it up and run an experiment which tells you how well your model does. Then you look back at the details of your neural network, change something and run it again. Fast computation shortens each of these cycles, so you find a good solution much sooner. The illustration below shows this loop.


<div class="item">
    <img src="figures/iterative_ml.png" alt="iterative_ml" width="400px"/>
</div>

### 1.3. Algorithms

Better accuracy with deep learning algorithms comes from a better Neural Network, more computational power or larger amounts of data. Eventually you will reach a point where you have no more data left, or where you can't improve the algorithm any further because it would take too long to train.

If you don't have enough data, it doesn't really matter whether you use deep learning or traditional algorithms; what matters is how well you have tailored your model to its prediction goal. So with small amounts of data, reaching a high accuracy often comes down to your skills.

The recent breakthroughs in the development of algorithms are mostly due to making them run much faster than before, which makes it possible to use more and more data. For example, a big advancement came from switching from the Sigmoid function to the Rectified Linear Unit (ReLU) function.

<div class="item">
    <img src="figures/sigmoid_relu.png" alt="sigmoid_relu" width="700px"/>
</div>

One of the problems with using the Sigmoid function in deep learning is that at the far left and far right of the curve the slope of the function (the gradient) is nearly zero, so the parameters change very slowly and learning becomes really slow. With ReLU as the activation function, the gradient is equal to one for all positive input values, so the gradients are much less likely to shrink toward zero.
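The difference in slopes can be checked numerically. The following is a minimal sketch in NumPy; the helper names `sigmoid_grad` and `relu_grad` and the sample inputs are chosen here for illustration and are not part of the original notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Slope of the sigmoid: g(z) * (1 - g(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Slope of max(0, z): 1 for positive inputs, 0 for negative inputs
    return (z > 0).astype(float)

z = np.array([-10.0, -1.0, 1.0, 10.0])
sig_g = sigmoid_grad(z)
relu_g = relu_grad(z)

print("sigmoid gradients:", sig_g)   # tiny at both ends of the curve
print("ReLU gradients:   ", relu_g)  # exactly 1 for every positive input
```

At `z = -10` and `z = 10` the sigmoid slope is far below `1e-4`, while the ReLU slope stays at 1 for every positive input.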

# 2. The Neural Network

## 2.1. Biological Neuron

Neural networks were conceived in the middle of the twentieth century as scientists started understanding how the brain works. What makes the brain so interesting to computer scientists is that in many ways it acts like a mathematical function: input comes in (in the form of electrical and chemical signals), some type of neurological computation happens, and output goes out to other cells. For example, you receive input from your sensory system, perhaps noticing a burning sensation on your hand. Then your brain performs some computation, such as determining that your hand has touched a hot pan. Finally, electrical output is sent to your nervous system causing you to contract your arm muscles and jerk your hand away from the hot surface.

In the 1940s and 50s a group of scientists, including Warren McCulloch, Walter Pitts and Frank Rosenblatt, published a series of papers that laid out the blueprint for the first algorithms based on the brain. They observed that the brain is a network made up of billions of nerve cells called neurons. Neurons consist of branch-like receptors called dendrites, which receive electrical impulses from other upstream neural cells, and a long, thin projection called an axon, which sends signals downstream to other neurons. Once a neuron's dendrites receive an impulse, if a certain threshold is reached, the neuron fires and sends a signal to other neurons through its axon. Since neurons are connected in a large network, an impulse received by one neuron can set off a wide range of electrical activity within the brain, creating a chain reaction.


<div class="item">
    <img src="figures/biological_neuron.jpg" alt="biological_neuron" width="600px"/>
</div>


McCulloch, Pitts and Rosenblatt realized that a simplified version of this neurological process could be used to implement a self-learning computer function. They understood that the brain is able to update the strength of its neural connections automatically over time as it gains new experience, and that these neural adjustments are how people learn concepts and store memory. Based on this “connectionist” approach, they pioneered a completely new approach to computer science.


## 2.2. The Perceptron

Rosenblatt’s artificial neuron, the perceptron, was a popular early implementation of this neurological architecture. The perceptron was designed to be a simple electronic representation of a single neuron. Like a real neuron, it received input, performed a calculation and “fired” if a certain threshold was met. In addition, it updated its connection strength based on its experience, and thus was able to “learn” from experience. Although relatively simple by today’s standards, the perceptron still forms the basis for a class of powerful prediction algorithms. For example, we could use the perceptron to create a program to detect fraudulent credit card transactions. We would do this by compiling a large list of existing transactions that we know to be either legitimate or fraudulent, and feeding the dataset to the perceptron to learn to predict fraud.

The perceptron algorithm takes a series of numbers as inputs and runs them through a set of corresponding numerical weights. The inputs are analogous to a neuron's dendrites receiving electrical impulses, the weights represent the connection strength between neurons, and an activation function determines if the threshold has been reached. (Fancier options for the activation function came later, but the original perceptron used a very simple activation function called a step function, which simply returns 1 if the net input is above the threshold, or 0 otherwise.)

The diagram below gives a better picture of how the perceptron actually worked.

<img src="figures/perceptron.png" alt="perceptron" width="600px" style="margin:10px 100px"/>

After each pass of data through the network, the weights are adjusted based on whether the output was correct or not. As this process is repeated, the network learns to interpret the data better.

You can find more information about the learning process of the perceptron [here](2_perceptron_blueprint.ipynb).

In the modern sense, the perceptron is an algorithm for learning a binary classifier called a threshold function: a function that maps its input ${\displaystyle \mathbf {x} }$  (a real-valued vector) to an output value ${\displaystyle f(\mathbf {x} )}$ (a single binary value):

$$
{\displaystyle f(\mathbf {x} )={\begin{cases}1&{\text{if }}\ \mathbf {w} \cdot \mathbf {x} +b>0,\\0&{\text{otherwise}}\end{cases}}}
$$

where ${\displaystyle \mathbf {w} }$ is a vector of real-valued weights, ${\displaystyle \mathbf {w} \cdot \mathbf {x} }$ is the dot product ${\displaystyle \sum _{i=1}^{n}w_{i}x_{i}}$, where $n$ is the number of inputs to the perceptron, and $b$ is the bias. The bias shifts the decision boundary away from the origin and does not depend on any input value.

The value of ${\displaystyle f(\mathbf {x} )}$ (0 or 1) is used to classify ${\displaystyle \mathbf {x} }$  as either a positive or a negative instance, in the case of a binary classification problem. If b is negative, then the weighted combination of inputs must produce a positive value greater than ${\displaystyle |b|}$ in order to push the classifier neuron over the 0 threshold. Spatially, the bias alters the position (though not the orientation) of the decision boundary.
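As a small illustration of the threshold function above, here is a sketch of the prediction step only (no learning). The function name `perceptron_predict` and the hand-picked weights and bias are assumptions made for this example; they are chosen so that the unit behaves like a logical AND.

```python
import numpy as np

def perceptron_predict(x, w, b):
    """Threshold function: 1 if w . x + b > 0, otherwise 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Hypothetical, hand-picked parameters (not learned): with b = -1.5 the
# weighted sum must exceed |b| = 1.5 before the unit outputs 1.
w = np.array([1.0, 1.0])
b = -1.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_outputs = [perceptron_predict(np.array(x), w, b) for x in inputs]
print(and_outputs)  # [0, 0, 0, 1]
```

Only the input `(1, 1)` produces a weighted sum (2.0) large enough to overcome the negative bias, which is the effect of the bias described above.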


## 2.3. A Single Neuron

The basic unit of computation in a neural network is the neuron, often called a node or unit. It receives input from some other nodes, or from an external source, and computes an output. Each input has an associated weight (w), which is assigned on the basis of its importance relative to the other inputs. The node applies a function f (defined below) to the weighted sum of its inputs, as shown in the figure below:

<div class="item">
    <img src="figures/neuron.png" alt="neuron" width="600px"/>
</div>

The above network takes numerical inputs $x_1$ and $x_2$ and has weights $w_1$ and $w_2$ associated with those inputs. Additionally, there is another input 1 with weight b (called the Bias) associated with it. 

The output $y$ from the neuron is computed as shown in the figure above. The function $f$ is usually non-linear and is called the **Activation Function**. The purpose of the activation function is to introduce non-linearity into the output of a neuron. This is important because most real-world data is non-linear and we want neurons to learn these non-linear representations.

### 2.3.1. A Single Neuron: Linear Regression

Originally, perceptrons were used as binary classifiers, i.e. to predict binary labels (0 or 1). However, if no non-linear activation function is applied to the dot product of the features and weights, the neuron is simply a linear regressor.

If the activation function is the identity $f(x)=x$ and $n$ is the number of features then,

$$
\Large \hat{y} = \sum_{i=1}^n x_i w_i + b
$$

Hence, $\hat{y} = \sum_{i=1}^n x_i w_i + b$ generally represents a hyperplane which is used in **linear regression.**


### 2.3.2. A Single Neuron: Logistic Regression

If the non-linear sigmoid activation function $g(z) = \dfrac{1}{1 + e^{-z}}$ is applied to the dot product of the features and weights (plus the bias), then the neuron is a logistic regressor.

With $n$ the number of features,


\begin{align*} 
& z = \sum_{i=1}^n x_i w_i + b \newline 
& g(z) = \dfrac{1}{1 + e^{-z}} \newline 
& \hat{y} = g ( \sum_{i=1}^n x_i w_i + b ) 
\end{align*}

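Hence, $\hat{y}$ is a value in $(0, 1)$ that can be read as the probability of the positive class, which is the model used in **logistic regression.** The decision boundary $z = 0$ (where $\hat{y} = 0.5$) is a hyperplane.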
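The two cases differ only in the activation applied to the same weighted sum. The sketch below shows this with a hypothetical helper `neuron` and made-up inputs and weights; none of these names or values come from the original notes.

```python
import numpy as np

def identity(z):
    return z

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, activation):
    """Single neuron: activation applied to the weighted sum plus bias."""
    z = np.dot(x, w) + b
    return activation(z)

# Made-up example values, chosen so that z = 0.5*1 - 0.25*2 + 0 = 0
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
b = 0.0

y_linear = neuron(x, w, b, identity)   # linear regression output: z itself
y_logistic = neuron(x, w, b, sigmoid)  # logistic regression output: g(z)
print(y_linear, y_logistic)            # 0.0 0.5
```

With `z = 0` the linear neuron returns 0 and the logistic neuron returns 0.5, which is exactly the point on the decision boundary.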

# 3. Multi-Layer Perceptron

Although the perceptron represented a huge advancement in AI, it suffered from a couple of big problems. First, the process used for predicting the outcome, i.e. using a weighted sum of the inputs, is a form of a linear equation. A linear equation describes a line (or, with more inputs, a plane or hyperplane). In the case of the perceptron, the linear equation is used to draw a boundary between the classes it is trying to predict. While many classification problems can be solved using linear equations, many more cannot. So the fact that the perceptron's predictions are based on a linear equation means it is limited in the types of problems it can address.

<img src="figures/linear_classifier.png" alt="linear_classifier" style="width: 600px;"/>

A symptom of this problem is the fact that the perceptron can’t implement an “Exclusive Or” (XOR). XOR is a simple logic function that returns true if either of its two inputs are true, but not if both are. Pedro Domingos provides a good example of XOR in his book The Master Algorithm. Nike is said to be popular among teenage boys and middle-aged women. Therefore, people who are female or young, may be receptive to a Nike advertisement, but those who are female and young, are a much less attractive prospect. Furthermore, if you are neither young nor female, you’re also an unpromising prospect. If you were using a perceptron to build a targeted marketing engine, you wouldn’t be able to handle this type of relationship.

<img src="figures/xor.svg" alt="xor" style="width: 500px;"/>

Making matters worse, implementing XOR—a fundamental logic function—is trivial using traditional computer science approaches, yet is impossible to do with a perceptron. Theoretically, it was possible to solve XOR with a perceptron using multiple layers of inputs and weights. Such a multilayer perceptron (MLP) would contain an input layer, one or more intermediate hidden layers, and an output layer. (The middle layers are called the “hidden” layers because we don’t see the output of their calculations—they are fed into other intermediate layers or to the output layer directly.)
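To make the multilayer idea concrete, here is a sketch of a tiny MLP that computes XOR with step activations. The weights are hand-picked for illustration (one hidden unit acts as OR, the other as AND); they are an assumption of this example, not learned values and not taken from the original notes.

```python
import numpy as np

def step(z):
    # Step activation: 1 where z > 0, otherwise 0
    return (np.asarray(z) > 0).astype(int)

# Hidden layer (hand-picked): unit 1 fires for "x1 OR x2", unit 2 for "x1 AND x2"
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])

# Output unit (hand-picked): fires for "OR and not AND", which is XOR
W2 = np.array([1.0, -1.0])
b2 = -0.5

def xor_mlp(x):
    h = step(W1 @ x + b1)          # hidden layer activations
    return int(step(W2 @ h + b2))  # output layer activation

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_outputs = [xor_mlp(np.array(x, dtype=float)) for x in inputs]
print(xor_outputs)  # [0, 1, 1, 0]
```

The hidden layer turns the four input points into a representation that a single linear threshold can separate, which a lone perceptron cannot do on the raw inputs.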

<img src="figures/mlp.png" alt="mlp" style="width: 800px;"/>

# 4. Neural Network Representation

## 4.1. Definition

An Artificial Neural Network is a computing system inspired by the biological neural networks that constitute animal brains. Such systems "learn" to perform tasks by considering examples, generally without being programmed with any task-specific rules.

A Neural Network is constructed from 3 types of layers:
<img style="float: right; margin: 0px 0px 15px 15px;" src="figures/nn_archi.png" width="300" />

* Input layer: the initial data for the neural network.
* Hidden layers: the intermediate layers between the input and output layers, where most of the computation is done.
* Output layer: produces the result for the given inputs.

Each node is connected to every node in the next layer, and each connection (black arrow) has a particular weight. A weight can be seen as the impact that a node has on a node in the next layer. So if we look at a single node, it looks just like the single neuron presented in the previous section.

## 4.2. Mathematical Representation

To establish notation for future use, we’ll use $\mathbf{x}^{(i)}$ to denote the “input” variables, also called input features, and $y^{(i)}$ to denote the “output” or target variable that we are trying to predict. A pair $(x^{(i)} , y^{(i)})$ is called a training example, and the dataset that we’ll be using to learn — a list of $m$ training examples $(x^{(i)} , y^{(i)})$; $i=1,...,m$ — is called a training set. Note that the superscript “(i)” in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use $X$ to denote the space of input values, and $Y$ to denote the space of output values. In this example, $X = Y = ℝ$.

\begin{align*}
x_j^{(i)} &= \text{value of feature } j \text{ in the }i^{th}\text{ training example} \newline 
\mathbf{x}^{(i)}& = \text{the input (features) of the }i^{th}\text{ training example} \newline 
m &= \text{the number of training examples} \newline 
n &= \text{the number of features} 
\end{align*}

In the case where there are multiple outputs, the output will then be a vector $\mathbf{y}$. $n_x$ will refer to the number of the features, and $n_y$ will refer to the number of outputs.

We'll use the superscript $\left[ l \right]$ to refer to the ${l^{th}}$ layer of the network and the subscript $j$ to refer to the ${j^{th}}$ neuron in a layer.

To make the mathematical equations concrete, we will use a simple Neural Network model with 3 input nodes, one hidden layer with 4 nodes and one output node.

<img src="figures/nn_representation.png" alt="nn_representation" style="width: 1000px;"/>

Let's vectorize the weights of each line, i.e., the weights that participate in the calculation of the activation at each node in the layer. With <span style="color:red"> $\mathbf{w}_1^{[1]} = \begin{bmatrix}w_{11}^{\left[ 1 \right]} \newline w_{12}^{\left[ 1 \right]} \newline w_{13}^{\left[ 1 \right]} \end{bmatrix}$ </span>,  <span style="color:blue">
$\mathbf{w}_2^{[1]} = \begin{bmatrix}w_{21}^{\left[ 1 \right]} \newline w_{22}^{\left[ 1 \right]} \newline w_{23}^{\left[ 1 \right]}  \end{bmatrix}$ </span>, $\dots \ $; ${\mathbf{a}^{\left[ 0 \right]}} = \mathbf{x} = \begin{bmatrix} x_1 \newline x_2 \newline x_3 \end{bmatrix}$, We can then write

<span style="color:red">
$$\Large z_1^{\left[ 1 \right]} = \mathbf{w}_1^{\left[ 1 \right]{\rm{T}}}{\mathbf{a}^{\left[ 0 \right]}} + {b_1}^{[1]}, \ \ \ {a_1^{\left[ 1 \right]}} = f({z_1^{\left[ 1 \right]}}) \newline$$
</span>


<span style="color:blue">
$$\Large z_2^{\left[ 1 \right]} = \mathbf{w}_2^{\left[ 1 \right]{\rm{T}}}{\mathbf{a}^{\left[ 0 \right]}} + {b_2}^{[1]}, \ \ \ {a_2^{\left[ 1 \right]}} = f({z_2^{\left[ 1 \right]}}) \newline$$
</span>

<span style="color:green">
$$\Large z_3^{\left[ 1 \right]} = \mathbf{w}_3^{\left[ 1 \right]{\rm{T}}}{\mathbf{a}^{\left[ 0 \right]}} + {b_3}^{[1]}, \ \ \ {a_3^{\left[ 1 \right]}} = f({z_3^{\left[ 1 \right]}}) \newline$$
</span>

<span style="color:orange">
$$\Large z_4^{\left[ 1 \right]} = \mathbf{w}_4^{\left[ 1 \right]{\rm{T}}}{\mathbf{a}^{\left[ 0 \right]}} + {b_4}^{[1]}, \ \ \ {a_4^{\left[ 1 \right]}} = f({z_4^{\left[ 1 \right]}}) \newline$$
</span>


Let us group the weight vectors of the first layer together into one matrix, with one weight vector per row:

$ 
{W^{\left[ 1 \right]}} = \begin{bmatrix} 
\color{red} - &  \color{red}{\mathbf{w}_1^{[1]{\rm{T}}}} &  \color{red} -  \newline  
\color{blue} - & \color{blue}{\mathbf{w}_2^{[1]{\rm{T}}}} & \color{blue} -  \newline
\color{green} - & \color{green}{\mathbf{w}_3^{[1]{\rm{T}}}} & \color{green} -  \newline 
\color{orange} - & \color{orange}{\mathbf{w}_4^{[1]{\rm{T}}}} & \color{orange} - 
\end{bmatrix} $ 

Then we can calculate the activation at layers [1] and [2] of our network using the following vectorized form of equations:

activations at layer [1]: $$\Large \mathbf{z}^{\left[ 1 \right]} = W^{\left[ 1 \right]}{\mathbf{a}^{\left[ 0 \right]}} + \mathbf{b}^{[1]}, \ \ \ {\mathbf{a}^{\left[ 1 \right]}} = f({\mathbf{z}^{\left[ 1 \right]}})$$
activations at layer [2]: $$\Large \mathbf{z}^{\left[ 2 \right]} = W^{\left[ 2 \right]}{\mathbf{a}^{\left[ 1 \right]}} + \mathbf{b}^{[2]}, \ \ \ {\mathbf{a}^{\left[ 2 \right]}} = f({\mathbf{z}^{\left[ 2 \right]}})$$

Generalizing to any layer $[l]$, we can write:


$$\Large \mathbf{z}^{\left[ l \right]} = W^{\left[ l \right]}{\mathbf{a}^{\left[ l-1 \right]}} + \mathbf{b}^{[l]}, \ \ \ {\mathbf{a}^{\left[ l \right]}} = f({\mathbf{z}^{\left[ l \right]}})$$
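The general layer equation translates almost directly into code. The following is a minimal sketch for the 3-4-1 network used above, for a single example. The function name `forward`, the random weights, the zero biases and the choice of the sigmoid for $f$ are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, f=sigmoid):
    """Apply z[l] = W[l] a[l-1] + b[l] and a[l] = f(z[l]) layer by layer."""
    a = x  # a[0] = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = f(z)
    return a

# Randomly initialised parameters for a 3-4-1 network (illustrative values only)
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((1, 4))]
biases = [np.zeros((4, 1)), np.zeros((1, 1))]

x = np.array([[0.5], [-1.0], [2.0]])  # one example as a column vector, shape (3, 1)
a2 = forward(x, weights, biases)
print(a2.shape)  # (1, 1)
```

Each row of a weight matrix holds the weights of one neuron, so `W @ a` computes all the weighted sums of a layer in one matrix product.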

### 4.3. Vectorizing across multiple examples:

In the last section, we saw how to compute the prediction on a neural network, given a single training example. In this section, we will see how to vectorize across multiple training examples. 

By stacking the training examples as columns of a matrix, we can take the equations from the previous section and, with very little modification, make the neural network compute the outputs for all the examples at the same time.

So far we have computed $\mathbf{z}^{\left[ 1 \right]}$, $\mathbf{a}^{\left[ 1 \right]}$, $\mathbf{z}^{\left[ 2 \right]}$ and $\mathbf{a}^{\left[ 2 \right]}$. These equations tell us how to compute the output of our neural network for a single training example, starting from the input features $\mathbf{a}^{\left[ 0 \right]} = \mathbf{x}$. For the same network as above and a single example $\mathbf{x}$, the output is calculated as follows:

$$ \mathbf{z}^{\left[ 1 \right]} = W^{\left[ 1 \right]}{\mathbf{a}^{\left[ 0 \right]}} + \mathbf{b}^{[1]}, \ \ \ {\mathbf{a}^{\left[ 1 \right]}} = f({\mathbf{z}^{\left[ 1 \right]}})$$
$$ \mathbf{z}^{\left[ 2 \right]} = W^{\left[ 2 \right]}{\mathbf{a}^{\left[ 1 \right]}} + \mathbf{b}^{[2]}, \ \ \ {\mathbf{a}^{\left[ 2 \right]}} = f({\mathbf{z}^{\left[ 2 \right]}})$$

If we have $m$ training examples, we need to repeat this process $m$ times. 

$$ \mathbf{x} \xrightarrow{\hspace{2cm}} \hat{y} = \mathbf{a}^{\left[ 2 \right]}$$
$$ \mathbf{x}^{(1)} \xrightarrow{\hspace{2cm}} \hat{y}^{(1)} = \mathbf{a}^{\left[ 2 \right](1)}$$
$$ \mathbf{x}^{(2)} \xrightarrow{\hspace{2cm}} \hat{y}^{(2)} = \mathbf{a}^{\left[ 2 \right](2)}$$
$$ \vdots $$
$$ \mathbf{x}^{(m)} \xrightarrow{\hspace{2cm}} \hat{y}^{(m)} = \mathbf{a}^{\left[ 2 \right](m)}$$

Let's group our $m$ observations in one big $(n_x, m)$ input matrix where each column of the matrix represents one observation, then we can write:

$ 
X = \begin{bmatrix} 
\vdots & \vdots &  & \vdots \newline 
\mathbf{x}^{(1)} & \mathbf{x}^{(2)}& & \mathbf{x}^{(m)} \newline
\vdots & \vdots &  & \vdots \newline 
\end{bmatrix} $ 

We can then calculate the activation at each layer for all observations at one time with the following equations:

activations at layer [1]: $$\Large Z^{\left[ 1 \right]} = W^{\left[ 1 \right]}{A^{\left[ 0 \right]}} + \mathbf{b}^{[1]}, \ \ \ {A^{\left[ 1 \right]}} = f({Z^{\left[ 1 \right]}})$$
activations at layer [2]: $$\Large Z^{\left[ 2 \right]} = W^{\left[ 2 \right]}{A^{\left[ 1 \right]}} + \mathbf{b}^{[2]}, \ \ \ {A^{\left[ 2 \right]}} = f({Z^{\left[ 2 \right]}})$$

$ 
Z = \begin{bmatrix} 
\vdots & \vdots &  & \vdots \newline 
\mathbf{z}^{(1)} & \mathbf{z}^{(2)}& & \mathbf{z}^{(m)} \newline
\vdots & \vdots &  & \vdots \newline 
\end{bmatrix} $ and $ 
A = \begin{bmatrix} 
\vdots & \vdots &  & \vdots \newline 
\mathbf{a}^{(1)} & \mathbf{a}^{(2)}& & \mathbf{a}^{(m)} \newline
\vdots & \vdots &  & \vdots \newline 
\end{bmatrix} $ 



### 4.4. Pseudo code for forward propagation for the 2 layers NN:

```python
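# Sketch: W1, b1, W2, b2, x (a sequence of m column vectors), f, sigmoid and m are assumed to be defined
z, a = {}, {}
for i in range(m):                   # loop over the m training examples
    z[1, i] = W1 @ x[i] + b1         # shape of z[1, i] is (noOfHiddenNeurons, 1)
    a[1, i] = f(z[1, i])             # shape of a[1, i] is (noOfHiddenNeurons, 1)
    z[2, i] = W2 @ a[1, i] + b2      # shape of z[2, i] is (1, 1)
    a[2, i] = sigmoid(z[2, i])       # shape of a[2, i] is (1, 1)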
```

Let's say we have $X$ of shape $(n_x, m)$. The loop over the examples can then be replaced by the following vectorized pseudo code:

```python
Z1 = W1 @ X + b1    # shape of Z1 is (noOfHiddenNeurons, m); b1 is broadcast across the m columns
A1 = f(Z1)          # shape of A1 is (noOfHiddenNeurons, m)
Z2 = W2 @ A1 + b2   # shape of Z2 is (1, m)
A2 = sigmoid(Z2)    # shape of A2 is (1, m)
```

Notice that $m$ is always the number of columns.

In the last example we can write X = A0, so the previous step can be rewritten as:

```python
Z1 = W1 @ A0 + b1   # shape of Z1 is (noOfHiddenNeurons, m)
A1 = f(Z1)          # shape of A1 is (noOfHiddenNeurons, m)
Z2 = W2 @ A1 + b2   # shape of Z2 is (1, m)
A2 = sigmoid(Z2)    # shape of A2 is (1, m)
```
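To check that the looped and the vectorized versions agree, here is a self-contained NumPy sketch. The sizes, the random parameters and the choice of `np.tanh` for the hidden activation `f` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

f = np.tanh  # hidden-layer activation (an arbitrary choice for this sketch)

# Illustrative sizes and random parameters
rng = np.random.default_rng(1)
n_x, n_h, m = 3, 4, 5
X = rng.standard_normal((n_x, m))      # one training example per column
W1 = rng.standard_normal((n_h, n_x))
b1 = rng.standard_normal((n_h, 1))
W2 = rng.standard_normal((1, n_h))
b2 = rng.standard_normal((1, 1))

# Version 1: loop over the m examples, one column at a time
A2_loop = np.zeros((1, m))
for i in range(m):
    x_i = X[:, [i]]                    # shape (n_x, 1)
    a1 = f(W1 @ x_i + b1)              # shape (n_h, 1)
    A2_loop[:, [i]] = sigmoid(W2 @ a1 + b2)

# Version 2: all examples at once; b1 and b2 are broadcast across the m columns
A1 = f(W1 @ X + b1)                    # shape (n_h, m)
A2 = sigmoid(W2 @ A1 + b2)             # shape (1, m)

print(np.allclose(A2, A2_loop))        # True
```

The vectorized version relies on NumPy broadcasting to add the `(n_h, 1)` bias to every column of the `(n_h, m)` product.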

# 5. Activation Functions

When we build a neural network, one of the choices we get to make is which activation function to use in the hidden layers as well as in the output units. So far, we've mostly seen the sigmoid activation function, but other choices can sometimes work much better. Let's take a look at some of the options. 

In a neural network, an activation function transforms the weighted input of a neuron and produces an output which is then passed forward into the subsequent layer.  Activation functions **add non-linearity** to the output, which enables neural networks to solve non-linear problems.  In other words, a neural network without a non-linear activation function is essentially just a linear regression model. 

The output of a perceptron is either 0 or 1. That is a big jump, and because the step function is flat everywhere else it gives no useful gradient to learn from. We need something smoother: a function that changes progressively from 0 to 1 with no discontinuity.

The Activation Functions can be basically divided into 2 types
- Linear Activation Functions
- Non-linear Activation Functions

## 5.1. Linear Activation Functions

A linear activation function is simply a straight line, so its output is not confined to any range.

<img src="figures/linear_activation_function.png" alt="linear_activation_function" style="width: 300px;"/>


* Equation : $f(x) = x$
* Range : (-infinity to infinity)

A linear activation adds no expressive power, so it does not help the network capture the complexity of the data that is usually fed to neural networks.

## 5.2. Non-linear Activation Functions

Non-linear activation functions are the most widely used. Non-linearity helps your neural network capture non-linear relationships, as we can see in the figure below. It makes it easier for the model to adapt to a variety of data and to differentiate between outputs.

<img src="figures/non_linearity.png" alt="non_linearity" style="width: 400px;"/>

The main terms needed to understand non-linear functions are:

* Derivative or differential: the change along the y-axis with respect to the change along the x-axis. It is also known as the slope.
* Monotonic function: a function which is either entirely non-increasing or entirely non-decreasing.

The Nonlinear Activation Functions are mainly divided on the basis of their **range or curves**.

<img src="figures/activation_functions.png" alt="activation_functions" style="width: 700px;"/>


### 5.2.1. Sigmoid or Logistic Activation Function

The sigmoid function is defined as follows:

$$g(z) = \dfrac{1}{1 + e^{-z}}$$

The sigmoid curve is S-shaped. It maps any real input in $(-\infty, +\infty)$ to the range $(0, 1)$. The main reason to use the sigmoid function is that its output lies in $(0, 1)$, which makes it a natural choice for models that have to predict a probability, since a probability always lies between 0 and 1.

The sigmoid is not well suited to regression tasks, however; simple linear units, $f(x) = x$, can be used there instead. For multiclass classification, the softmax function is a generalization of the logistic function.

* The function is differentiable, which means we can find the slope of the sigmoid curve at any point.
* The function is monotonic but its derivative is not.

**Problems With Sigmoid Function**

* The exp( ) function is computationally expensive.
* The problem of vanishing gradients: for large positive or negative inputs the slope is close to zero, which can cause a neural network to get stuck during training.

### 5.2.2. Tanh or hyperbolic tangent Activation Function

The tanh function is defined as follows:

$$h(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$


tanh is similar to the logistic sigmoid, but its output is zero-centered. The range of the tanh function is (-1, 1). tanh is also sigmoidal (S-shaped).

The advantage is that negative inputs are mapped to strongly negative outputs and inputs near zero are mapped to outputs near zero. The tanh function is mainly used for classification between two classes. It is also commonly used in state-to-state transition models (recurrent neural networks). 

* The function is differentiable.
* The function is monotonic while its derivative is not monotonic.

**Problems With Tanh Function**

* Like sigmoid, tanh also has a vanishing gradient problem.

In practice, optimization is often easier with tanh, so for hidden layers it is usually preferred over the sigmoid function.

### 5.2.3. ReLU (Rectified Linear Unit) Activation Function

ReLU is one of the most widely used activation functions right now. ReLU is linear (identity) for all positive values, and zero for all negative values: $f(x) = \max(0, x)$.

**Benefits of ReLU** 

* Cheap to compute, as there is no complicated math involved, which makes each training step faster.
* It converges faster. It has been reported to accelerate the convergence of SGD compared to sigmoid and tanh (around 6 times).
* For positive inputs its gradient is constant, so it does not suffer from vanishing gradients in the way tanh or sigmoid do.
* It can output a true zero value, so the activations of hidden layers can contain one or more exact zeros. This is called representational sparsity.

**Problems with ReLU**

* The downside of being zero for all negative values is that the gradient there is zero too. Once a neuron's input becomes negative for all data points, it stops updating and is unlikely to recover. This is called the “dying ReLU” problem.
* If the learning rate is too high, the weights may jump to values that cause the neuron to never activate on any data point again.

### 5.2.4. Leaky ReLU

Leaky ReLUs attempt to fix the “dying ReLU” problem. Instead of the function being zero when x < 0, a leaky ReLU has a small slope for negative inputs.

The leak extends the range of the ReLU function to negative values. Usually, the value of the slope a is 0.01 or so. When a is instead sampled at random during training, the function is called Randomized ReLU.

Therefore the range of the Leaky ReLU is (-infinity to infinity).

Leaky ReLUs allow a small, positive gradient when the unit is not active.

$${\displaystyle f(x)={\begin{cases}x&{\text{if }}x>0,\\0.01x&{\text{otherwise}}.\end{cases}}}$$



### 5.2.5. Parametric ReLU (PReLU)

Parametric ReLU (PReLU) is a type of leaky ReLU that, instead of having a predetermined slope like 0.01, makes it a parameter for the neural network to figure out itself: y = ax when x < 0.

$${\displaystyle f(x)={\begin{cases}x&{\text{if }}x>0,\\ax&{\text{otherwise}}.\end{cases}}}$$

### 5.2.6. Exponential Linear Unit (ELU)

The Exponential Linear Unit has been reported to give better classification results than the traditional ReLU. It follows the same rule as ReLU for x >= 0, and for x < 0 it decays exponentially toward the saturation value -a.

ELU tries to make the mean activations closer to zero which speeds up training.

$${\displaystyle f(x)={\begin{cases}x&{\text{if }}x>0,\\a(e^{x}-1)&{\text{otherwise}},\end{cases}}}$$

where ${\displaystyle a}$ is a hyper-parameter to be tuned, and ${\displaystyle a\geq 0}$ is a constraint.
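The ReLU variants above can be written in a few lines of NumPy. This is a sketch only: the function names and the default values `a=0.01` and `a=1.0` are assumptions for illustration, and for PReLU the slope `a` would be a learned parameter rather than a fixed argument.

```python
import numpy as np

def relu(x):
    # Identity for positive values, zero otherwise
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):
    # Small slope a for non-positive inputs instead of a flat zero
    return np.where(x > 0, x, a * x)

def elu(x, a=1.0):
    # a * (e^x - 1) for non-positive inputs; saturates toward -a.
    # np.minimum keeps the exponent non-positive so large x cannot overflow.
    return np.where(x > 0, x, a * np.expm1(np.minimum(x, 0.0)))

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))        # zero for the negative input
print(leaky_relu(x))  # small negative value for the negative input
print(elu(x))         # value between -1 and 0 for the negative input
```

For the negative input `-2.0`, ReLU returns 0, Leaky ReLU returns `-0.02`, and ELU returns `exp(-2) - 1`, which is roughly `-0.865`.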


## 5.3. Activation Functions in TensorFlow

Activation functions can be found either in the [tf.nn](https://www.tensorflow.org/api_docs/python/tf/nn) module or in [tf.keras.activations](https://www.tensorflow.org/api_docs/python/tf/keras/activations). For example, we can use either tf.nn.relu or tf.keras.activations.relu to apply a ReLU activation function in our Dense layer.


* tf.nn.relu : It comes from the core TensorFlow library and is located in the nn module, which collects low-level neural network operations. If x is a tensor then,

```python
y = tf.nn.relu( x )
```

It is used when creating custom layers and networks. If you use it with Keras, you may run into problems when saving or loading models, or when converting a model to TF Lite.

* tf.keras.activations.relu : It comes from the Keras library included in TensorFlow. It is located in the activations module, which also provides other activation functions. It is mostly used in Keras layers ( tf.keras.layers ) for the activation= argument :

```python
model.add( keras.layers.Dense( 25 , activation=tf.keras.activations.relu  ) )
```

However, it can also be called directly on a tensor, in the same way as tf.nn.relu above. It is more specific to Keras ( Sequential or Model ) than to raw TensorFlow computations.

In short, tf.nn.relu is TensorFlow-specific, whereas tf.keras.activations.relu is geared toward Keras. When building a network with only TensorFlow, tf.nn.relu is the natural choice; when building a Keras Sequential model, tf.keras.activations.relu is.

## 5.4. Why do we need non-linear activation functions?

* Removing the activation function from a layer is equivalent to using a linear activation function: the layer then outputs a linear function of its inputs.
* However many such hidden layers you add, the network as a whole stays linear. With a sigmoid output it is no more expressive than logistic regression, so it is useless for a lot of complex problems.
* One place where you might use a linear activation function is the output layer, when the output is a real number (regression problem). Even then, if the output value is non-negative you could use ReLU instead.
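The collapse of stacked linear layers into a single linear layer can be verified directly. The sketch below uses random parameters for a 3-4-1 network; the variable names `W_eq` and `b_eq` are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))
x = rng.standard_normal((3, 1))

# Two layers with a linear (identity) activation in between
two_layer = W2 @ (W1 @ x + b1) + b2

# The same mapping written as one linear layer:
# W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
one_layer = W_eq @ x + b_eq

print(np.allclose(two_layer, one_layer))  # True
```

Because the product of two matrices is again a matrix, no amount of stacking adds expressive power without a non-linearity in between.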

## 5.5. Derivatives of activation functions

* Derivative of the Sigmoid activation function:

```
f(z)  = 1 / (1 + np.exp(-z))
f'(z) = (1 / (1 + np.exp(-z))) * (1 - (1 / (1 + np.exp(-z))))
f'(z) = f(z) * (1 - f(z))
```

* Derivative of the Tanh activation function:

```
f(z)  = (e^z - e^-z) / (e^z + e^-z)
f'(z) = 1 - np.tanh(z)^2 = 1 - f(z)^2
```

* Derivative of the RELU activation function:

```
f(z)  = np.maximum(0,z)
f'(z) = { 0  if z < 0
          1  if z >= 0  }
```

* Derivative of the leaky RELU activation function:

```
f(z)  = np.maximum(0.01 * z, z)
f'(z) = { 0.01  if z < 0
          1     if z >= 0   }
```
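
As a sanity check, the formulas above can be compared against a central finite-difference approximation. This is a minimal NumPy sketch; the helper `numerical_derivative`, the test points in `z`, and the step size `eps` are assumptions chosen for illustration, and z = 0 is avoided because the RELU variants are not differentiable there.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return (z >= 0).astype(float)  # convention used above: derivative is 1 at z == 0

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)

def leaky_relu_prime(z, slope=0.01):
    return np.where(z < 0, slope, 1.0)

def numerical_derivative(f, z, eps=1e-6):
    """Central finite difference, applied element-wise."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

z = np.array([-2.0, -0.5, 0.3, 1.7])  # test points, deliberately away from z == 0

pairs = {
    "sigmoid": (sigmoid, sigmoid_prime),
    "tanh": (np.tanh, tanh_prime),
    "relu": (relu, relu_prime),
    "leaky_relu": (leaky_relu, leaky_relu_prime),
}
# True for a function when its analytic derivative matches the numerical one.
checks = {
    name: np.allclose(f_prime(z), numerical_derivative(f, z), atol=1e-6)
    for name, (f, f_prime) in pairs.items()
}
print(checks)
```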

# 6. Output Layers

The design of the output layer depends on the use case, and the type of machine learning problem we're trying to solve.

* Regression: We usually use an output layer with a **single neuron** having a linear or an identity activation function. This is equivalent to not having an activation function at all.
* Binary Classification: We usually use an output layer with a **single neuron** having a **sigmoid** activation function (or **tanh**, if the two classes are encoded as -1 and +1).
* Multiclass Classification: We usually use an output layer with a number of neurons equal to the number of classes with a **softmax** activation function. 

<img src="figures/softmax.jpeg" alt="softmax" style="width: 500px;"/>
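
To make the softmax output layer concrete, here is a minimal NumPy sketch. It assumes the logits are stored with one example per column, and it subtracts the column-wise maximum before exponentiating, a common numerical-stability trick that does not change the result. The function name and the example values are illustrative.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax; z has shape (n_classes, n_examples)."""
    shifted = z - np.max(z, axis=0, keepdims=True)  # subtract the max for numerical stability
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=0, keepdims=True)

# Two examples (columns) over three classes (rows).
logits = np.array([[2.0, 1000.0],
                   [1.0, 1000.0],
                   [0.1, 1000.0]])
probs = softmax(logits)

print(probs)              # each column is a probability distribution over the classes
print(probs.sum(axis=0))  # each column sums to 1
```

The second column shows why the shift matters: exponentiating 1000 directly would overflow, whereas the shifted version yields equal probabilities for the three classes.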

# 7. Cost Functions

## 7.1. Regression cost functions

The cost function for a regression problem is the average, over all $m$ training examples, of the squared difference between the network's prediction $\hat{y}^{(i)} = h_{W, \mathbf{b}}(\mathbf{x}^{(i)})$ and the actual output $y^{(i)}$.

\begin{equation}
J(W, \mathbf{b}) = \dfrac {1}{m} \displaystyle \sum _{i=1}^m J^{(i)}(W, \mathbf{b}) = \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left ( \hat{y}^{(i)}- y^{(i)} \right)^2 = \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (h_{W, \mathbf{b}} (\mathbf{x}^{(i)}) - y^{(i)} \right)^2
\end{equation}

This loss is known as the **MSE** loss, or **Mean Squared Error**.
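
As a small worked example of the formula, the sketch below computes the MSE with NumPy on made-up values (the arrays `y` and `y_hat` are illustrative, not data from this document).

```python
import numpy as np

def mse(y_hat, y):
    """Mean of the squared differences between predictions and targets."""
    return np.mean((y_hat - y) ** 2)

y = np.array([3.0, -0.5, 2.0, 7.0])      # actual outputs
y_hat = np.array([2.5, 0.0, 2.0, 8.0])   # predictions

# Squared errors are 0.25, 0.25, 0 and 1, so the mean is 1.5 / 4.
cost = mse(y_hat, y)
print(cost)  # 0.375
```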

## 7.2. Classification cost functions

### 7.2.1. Binary Classification

The main cost function used for binary classification is the same as the one we used for logistic regression. The only difference is the hypothesis itself.

\begin{align*}
J(W, \mathbf{b}) & = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_{W, \mathbf{b}}(\mathbf{x}^{(i)}),y^{(i)}) \newline
& = \dfrac{1}{m} \sum_{i=1}^m \left[-y^{(i)} \log(h_{W, \mathbf{b}}(\mathbf{x}^{(i)})) -(1-y^{(i)}) \log(1-h_{W, \mathbf{b}}(\mathbf{x}^{(i)}))\right]
\end{align*}

This loss is also known as the **cross entropy**, **binary crossentropy**, or **logloss**.
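
The formula can be sketched in NumPy as follows. The clipping constant `eps` is an assumption added to avoid taking the logarithm of zero, and the example predictions are made up to contrast a mostly correct model with a mostly wrong one.

```python
import numpy as np

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Average binary cross entropy between predicted probabilities and 0/1 labels."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # assumption: clip to avoid log(0)
    return np.mean(-y * np.log(y_hat) - (1.0 - y) * np.log(1.0 - y_hat))

y = np.array([1.0, 0.0, 1.0, 0.0])
mostly_right = np.array([0.9, 0.1, 0.8, 0.2])
mostly_wrong = np.array([0.1, 0.9, 0.2, 0.8])

loss_right = binary_cross_entropy(mostly_right, y)
loss_wrong = binary_cross_entropy(mostly_wrong, y)

# Confident correct predictions give a lower loss than confident wrong ones.
print(loss_right, loss_wrong)
```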

### 7.2.2. Multi-class Classification


We now describe the cost function that we’ll use for multi-class classification. In the equation below, $\mathbb{1}_{\left\{.\right\}}$ is the "indicator function", so that $\mathbb{1}_{\left\{\text{a true statement}\right\}}=1$, and $\mathbb{1}_{\left\{\text{a false statement}\right\}}=0$. For example, $\mathbb{1}_{\left\{\text{2+2=4}\right\}}$ evaluates to 1, whereas $\mathbb{1}_{\left\{\text{1+1=5}\right\}}$ evaluates to 0. Our cost function will be:

\begin{align}
J(W, \mathbf{b}) = - \left[ \sum_{i=1}^{m} \sum_{c=1}^{C}  \mathbb{1}_{\left\{y^{(i)} = c\right\}} \log p(y^{(i)} = c | \mathbf{x}^{(i)} ; W, \mathbf{b})\right]
\end{align}

This loss is known as the **categorical crossentropy**.
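
The double sum can be sketched in NumPy by building the indicator as a one-hot matrix. This is an illustrative sketch under a few assumptions: examples are stored one per row, classes are numbered from 0 rather than 1, and probabilities are clipped with a hypothetical `eps` to avoid the logarithm of zero.

```python
import numpy as np

def categorical_cross_entropy(probs, labels, eps=1e-12):
    """probs: (m, C) predicted class probabilities; labels: (m,) integer classes in 0..C-1."""
    m, C = probs.shape
    indicator = np.eye(C)[labels]              # row i is one-hot: 1 where y_i == c, else 0
    log_p = np.log(np.clip(probs, eps, 1.0))   # assumption: clip to avoid log(0)
    per_example = -np.sum(indicator * log_p, axis=1)
    return per_example.sum(), per_example

probs = np.array([[0.05, 0.95, 0.00],
                  [0.10, 0.80, 0.10]])
labels = np.array([1, 2])

total, per_example = categorical_cross_entropy(probs, labels)
print(per_example)  # the loss of each example: -log(0.95) and -log(0.10)
print(total)        # their sum, as in the formula above
```

Only the log-probability of the true class contributes to each example's loss, which is why the second example (probability 0.10 on the true class) is penalised far more than the first.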


## 7.3. Cost Functions in TensorFlow

Loss functions can be found in either the [tf.nn](https://www.tensorflow.org/api_docs/python/tf/nn) module or [tf.keras.losses](https://www.tensorflow.org/api_docs/python/tf/keras/losses). For example, to compute the binary cross entropy between the true labels and the neural network output, we can use either tf.nn.sigmoid_cross_entropy_with_logits, which expects the raw pre-sigmoid outputs (logits), or tf.keras.losses.binary_crossentropy, which by default expects probabilities (pass from_logits=True to give it logits instead).

For multi-class classification:

* If your labels are one-hot encoded: use tf.keras.losses.categorical_crossentropy

* If your labels are encoded as integers: use tf.keras.losses.sparse_categorical_crossentropy

```python
import tensorflow as tf

y_true = [[0, 1, 0], [0, 0, 1]]
y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
loss = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
assert loss.shape == (2,)
loss.numpy()
```

```python
y_true = [1, 2]
y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
assert loss.shape == (2,)
loss.numpy()
```

# 8. Optimization: Gradient Descent For Neural Networks

## 8.1. Gradients of simple Logistic Regression 

To demonstrate how we can calculate the gradients needed by gradient descent for our neural network, let us first consider the simple example of logistic regression.

<img src="figures/lg_vs_nn.png" alt="lg_vs_nn" style="width: 500px;"/>


The logistic regression model can then be expressed using the following three equations:

$$\Large z = \mathbf{w}^{\rm{T}}{\mathbf{x}} + {b}  $$

$$\Large \hat{y} = {a} = g({z}) $$

$$\Large J(a,y) = -\big(y \log(a) + (1-y) \log(1-a)\big) $$


<img src="figures/lg_graph.png" alt="lg_graph" style="width: 500px;"/>

The gradient descent update for every weight $w$ of our algorithm is:

\begin{align*} 
\text{repeat until convergence: } 
\lbrace 
& \newline w &:= w - \alpha \frac{\partial}{\partial w} J 
\newline \rbrace & 
\end{align*}

The question is, how do we calculate $\frac{\partial}{\partial w} J$? For simplicity, any derivative of the form $\frac{\partial}{\partial w} J$ will be written as $dw$.

In logistic regression, we want to modify the parameters $w_1$, $w_2$ and $b$ in order to reduce this loss. The equations above describe the forward propagation steps that compute the loss on a single training example. Now let's go backwards to compute the derivatives.

$$ da = \frac{\partial}{\partial a} J  = -\frac{y}{a} + \frac{1-y}{1-a} $$

$$ dz = \frac{\partial}{\partial z} J = \frac{\partial J}{\partial a} \frac{\partial a}{\partial z}  = \big( -\frac{y}{a} + \frac{1-y}{1-a} \big)\big( a(1-a) \big) = a - y $$

$$ dw_1 = \frac{\partial}{\partial w_1} J = \frac{\partial J}{\partial z} \frac{\partial z}{\partial w_1}  = x_1 \frac{\partial J}{\partial z} $$

$$ dw_2 = \frac{\partial}{\partial w_2} J = \frac{\partial J}{\partial z} \frac{\partial z}{\partial w_2}  = x_2\frac{\partial J}{\partial z} $$

$$ db = \frac{\partial}{\partial b} J = \frac{\partial J}{\partial z} \frac{\partial z}{\partial b}  = \frac{\partial J}{\partial z} $$
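
These derivatives can be verified numerically. The sketch below is a minimal NumPy illustration for a single training example with two features; the parameter values, the learning rate `alpha`, and the finite-difference step `eps` are arbitrary assumptions. It compares the analytic gradients with finite differences and then applies one gradient descent update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    a = sigmoid(w @ x + b)
    return -(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

def gradients(w, b, x, y):
    a = sigmoid(w @ x + b)  # forward pass
    dz = a - y              # dJ/dz
    dw = x * dz             # dJ/dw_j = x_j * dz
    db = dz                 # dJ/db = dz
    return dw, db

# Arbitrary illustrative values for one training example.
w = np.array([0.2, -0.4])
b = 0.1
x = np.array([1.5, -2.0])
y = 1.0

dw, db = gradients(w, b, x, y)

# Finite-difference check of each partial derivative.
eps = 1e-6
num_dw = np.array([
    (loss(w + eps * e, b, x, y) - loss(w - eps * e, b, x, y)) / (2 * eps)
    for e in np.eye(2)
])
num_db = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)
print(np.allclose(dw, num_dw, atol=1e-6), np.isclose(db, num_db, atol=1e-6))

# One gradient descent update with an assumed learning rate.
alpha = 0.1
w_new = w - alpha * dw
b_new = b - alpha * db
print(loss(w, b, x, y), loss(w_new, b_new, x, y))  # loss before and after the step
```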

## 8.2. Gradient Descent for Neural Networks

We can now repeat the same **backpropagation** logic for a neural network of any size.

<img src="figures/nn_graph.png" alt="nn_graph" style="width: 700px;"/>


## 8.3. Forward and Backward Propagation

To summarise, calculating the gradients for gradient descent requires two steps: a forward propagation step over all layers to calculate the activations at each layer, and a backward propagation step over all layers to compute the gradients of the weights and biases at each layer. Vectorizing across all $m$ examples, the pseudo-code for forward and backward propagation can be written as follows:

Pseudo code for forward propagation for layer [l]:

```python
Input  A[l-1]
Z[l] = W[l]A[l-1] + b[l]
A[l] = f[l](Z[l])
Output A[l], cache(Z[l])
```

Pseudo code for back propagation for layer [l]:

```python
Input dA[l], Caches
dZ[l] = dA[l] * f'[l](Z[l])         # element-wise product
dW[l] = (dZ[l]A[l-1].T) / m         # matrix product
db[l] = sum(dZ[l]) / m              # sum over examples: axis=1, keepdims=True
dA[l-1] = W[l].T dZ[l]              # matrix (dot) product, not element-wise
Output dA[l-1], dW[l], db[l]
```
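
The pseudo-code above translates fairly directly into NumPy. The following is a sketch for one layer, not a full training loop. It assumes a RELU activation, one example per column, and a cache holding `A_prev`, `W` and `Z`; the function names and shapes are hypothetical, and `dA1` is a random stand-in for the gradient that would arrive from the next layer.

```python
import numpy as np

def relu(Z):
    return np.maximum(0.0, Z)

def relu_prime(Z):
    return (Z >= 0).astype(float)

def layer_forward(A_prev, W, b, f):
    Z = W @ A_prev + b      # b has shape (n_l, 1) and broadcasts over the m columns
    A = f(Z)
    cache = (A_prev, W, Z)  # values needed again during the backward pass
    return A, cache

def layer_backward(dA, cache, f_prime):
    A_prev, W, Z = cache
    m = A_prev.shape[1]                         # number of examples
    dZ = dA * f_prime(Z)                        # element-wise product
    dW = (dZ @ A_prev.T) / m                    # matrix product
    db = np.sum(dZ, axis=1, keepdims=True) / m  # sum over examples
    dA_prev = W.T @ dZ                          # matrix product
    return dA_prev, dW, db

rng = np.random.default_rng(0)
A0 = rng.normal(size=(3, 5))         # 3 input features, m = 5 examples
W1 = rng.normal(size=(4, 3)) * 0.01  # layer with 4 units
b1 = np.zeros((4, 1))

A1, cache1 = layer_forward(A0, W1, b1, relu)
dA1 = rng.normal(size=A1.shape)      # stand-in for the gradient from the next layer
dA0, dW1, db1 = layer_backward(dA1, cache1, relu_prime)

# Each gradient has the same shape as the quantity it refers to.
print(dW1.shape, db1.shape, dA0.shape)
```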



## 8.4. Random Initialization of weights 

* In logistic regression it wasn't important to initialize the weights randomly, while in NN we have to initialize them randomly.

* If we initialize all the weights with zeros in NN it won't work (initializing bias with zero is OK):

    * all hidden units will be completely identical (symmetric) - compute exactly the same function
    * on each gradient descent iteration all the hidden units will always update the same

* To solve this, we initialize the W's with small random numbers:

```python
import numpy as np
W1 = np.random.randn(2, 2) * 0.01  # randn takes the dimensions as separate arguments; 0.01 keeps the values small
b1 = np.zeros((2, 1))              # a zero bias is fine: the random W1 already breaks the symmetry
```

* We need small values because, with sigmoid or tanh for example, large weights make it more likely that Z is very large even at the very start of training. This saturates the tanh or sigmoid activation function, where the gradient is close to zero, and thus slows down learning. If you don't have any sigmoid or tanh activation functions in your neural network, this is less of an issue.
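
The symmetry problem can be observed on a toy example. The sketch below is an illustration under assumptions, not a reference implementation: it trains a hypothetical network with a 3-unit tanh hidden layer and a sigmoid output on made-up data, once from all-zero weights and once from small random weights, and then compares the rows of the hidden-layer weight matrix.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(W1, W2, X, Y, steps=100, alpha=0.5):
    """Plain gradient descent on a 2-layer network (tanh hidden layer, sigmoid output)."""
    W1, W2 = W1.copy(), W2.copy()
    b1 = np.zeros((W1.shape[0], 1))
    b2 = np.zeros((1, 1))
    m = X.shape[1]
    for _ in range(steps):
        # Forward propagation.
        A1 = np.tanh(W1 @ X + b1)
        A2 = sigmoid(W2 @ A1 + b2)
        # Backward propagation.
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / m
        db2 = dZ2.sum(axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)
        dW1 = dZ1 @ X.T / m
        db1 = dZ1.sum(axis=1, keepdims=True) / m
        # Gradient descent update.
        W1 -= alpha * dW1
        b1 -= alpha * db1
        W2 -= alpha * dW2
        b2 -= alpha * db2
    return W1, W2

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 50))              # made-up data: 2 features, 50 examples
Y = (X[0:1] * X[1:2] > 0).astype(float)   # XOR-like labels, shape (1, 50)

W1_zero, _ = train(np.zeros((3, 2)), np.zeros((1, 3)), X, Y)
W1_rand, _ = train(rng.normal(size=(3, 2)) * 0.01, rng.normal(size=(1, 3)) * 0.01, X, Y)

# With zero initialization every hidden unit (row of W1) stays identical.
print(np.allclose(W1_zero, W1_zero[0]))
# With random initialization the hidden units differ from each other.
print(np.allclose(W1_rand, W1_rand[0]))
```

In the zero-initialized run the hidden weights never move at all: the hidden activations are tanh(0) = 0 and W2 is zero, so every gradient except the one for the output bias vanishes.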

# 9. TensorFlow Playground

Play with a neural network right in the browser at http://playground.tensorflow.org. See if you can figure out the parameters to get the neural network to pattern match to the desired groups. The spiral is particularly challenging!