# 4. Neural Networks: Representation

### Motivation: Non-linear hypotheses

Suppose you have a housing classification problem with 100 original features. To fit a non-linear decision boundary with logistic regression, we would add combinations of these features, and including all the quadratic terms alone gives roughly $\frac{n^2}{2}$ features (about 5,000 for $n = 100$).

Fitting a model on that many features is both time consuming and computationally expensive with our existing techniques.

Using the example of **computer vision**: to recognize a car from a 50x50 pixel image, we could use the raw pixel intensities as features (2,500 features, or 7,500 if RGB), or add *only* the quadratic features with our existing methods (around 3 **million** features for the grayscale case).  
The first approach looks more pragmatic (and a bit magic), so this section will cover how it works.  
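
To make the feature counts above concrete, here is a minimal sketch of the arithmetic. It assumes "quadratic features" means every product $x_i x_j$ with $i \le j$ (squares included), which gives exactly $\frac{n(n+1)}{2}$ terms; the helper name `quadratic_feature_count` is made up for this illustration.

```python
def quadratic_feature_count(n: int) -> int:
    """Number of distinct degree-2 terms x_i * x_j with i <= j."""
    return n * (n + 1) // 2


# 100 housing features, 50x50 grayscale pixels, 50x50 RGB pixels
for n in (100, 2500, 7500):
    print(n, quadratic_feature_count(n))
# 100 5050
# 2500 3126250
# 7500 28128750
```

The exact counts are slightly above the $\frac{n^2}{2}$ approximation, but the order of magnitude is the same.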

### Neurons and the brain

Neural networks are algorithms that try to mimic the brain. Popular in the 80s and early 90s, they have since made a comeback and are now state-of-the-art for several applications (some of them covered later in this section).  

The hypothesis behind the "biological inspiration" is the *one learning algorithm* hypothesis. In short, it states that the physical hardware (brain tissue) does not determine what is learned. Rewiring experiments have shown that, for example, the auditory cortex can _learn_ to see.  
Therefore, there may be a single underlying algorithm that allows these different parts of the brain to learn to perform the same function.  

### Model Representation: Intuition

At a (very, very) simplistic level, a neuron has three parts: the _dendrite_ (input), the _nucleus_ (computation) and the _axon_ (output).

We model our neuron as a **logistic unit**, so our previous function $g(z)$ is referred to as the **sigmoid (logistic) activation function**.

$g(z) = \frac{1}{1+e^{-z}}$
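
As a minimal sketch, the activation function above can be written with NumPy so that it works on scalars and arrays alike. The function name `sigmoid` is a choice made for this example, not something fixed by the notes.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation g(z) = 1 / (1 + exp(-z)), applied element-wise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))


print(sigmoid(0.0))                            # 0.5
print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # values close to 0, exactly 0.5, close to 1
```

The output always lies strictly between 0 and 1, which is why it can be read as a unit's "activation".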

The parameters we previously referred to as theta are also called **weights**, especially in the literature.

A neural network is made of _at least_ 3 different types of _layers_:

1. **Input** layer
2. **Hidden** layer
3. **Output** layer

![Visual Representation of a Neural Network](Images/Neural_Network.jpg)

More precise notation:

$a_i^{(j)}$ = "activation" of unit $i$ in layer $j$  
$\Theta^{(j)}$ = matrix of weights controlling function mapping from layer $j$ to layer $j + 1$

Formally:
    
$a_1^{{(2)}} = g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3)$ 

$a_2^{{(2)}} = g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3)$   

$a_3^{{(2)}} = g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3)$   

$h_\Theta(x) = a_1^{{(3)}} = g(\Theta_{10}^{(2)}a_0^{{(2)}} + \Theta_{11}^{(2)}a_1^{{(2)}} + \Theta_{12}^{(2)}a_2^{{(2)}} + \Theta_{13}^{(2)}a_3^{{(2)}})$  

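In terms of dimensions, if layer $j$ has $s_j$ units and layer $j + 1$ has $s_{j+1}$ units (bias units not counted), then $\Theta^{(j)}$ has dimension $s_{j+1} \times (s_j + 1)$. The extra column holds the weights applied to the bias unit.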
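
The dimension rule can be checked with a small sketch. It reads the rule as "rows = units in layer $j + 1$, columns = units in layer $j$ plus one bias column", with unit counts that exclude the bias units; the helper `theta_shapes` is a hypothetical name introduced only for this example.

```python
def theta_shapes(layer_sizes):
    """Shape of each weight matrix Theta^(j), given the number of units
    per layer (bias units excluded), listed from input to output."""
    return [
        (s_next, s_cur + 1)  # +1 column for the bias unit of layer j
        for s_cur, s_next in zip(layer_sizes, layer_sizes[1:])
    ]


# The network used above: 3 inputs, 3 hidden units, 1 output
print(theta_shapes([3, 3, 1]))  # [(3, 4), (1, 4)]
```

This matches the equations above: each hidden unit has four weights ($\Theta_{i0}^{(1)}$ to $\Theta_{i3}^{(1)}$), and the output unit has four as well.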

The approach we just used is called **forward propagation**. To calculate it more efficiently, we are going to use a vectorized implementation.

But first, let's simplify the above by writing $z_i^{(2)}$ for the weighted sum inside $g$ for unit $i$ of layer 2:

$a_1^{{(2)}} = g(z_1^{{(2)}})$   

$a_2^{{(2)}} = g(z_2^{{(2)}})$   

$a_3^{{(2)}} = g(z_3^{{(2)}})$   

Looking at our coefficients and variables, we can see how both can be represented by vectors.  
The computation is now much easier:

$z^{(2)} = \Theta^{(1)}x$  

$a^{(2)} = g(z^{(2)})$

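where the matrix product $\Theta^{(1)}x$ yields exactly the weighted sums $z_1^{(2)}, z_2^{(2)}, z_3^{(2)}$ written out above, and $g$ is applied element-wise.

To these activations, we need to add the bias unit $a_0^{(2)} = 1$ before computing the next layer.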

Finally:

$z^{(3)} = \Theta^{(2)}a^{(2)}$  

$h_\Theta(x) = a^{(3)} = g(z^{(3)})$
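
The vectorized steps above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than a reference implementation: the function name `forward`, the random weights, and the sample input are all made up for the example, and the network is the 3-input, 3-hidden-unit, 1-output one used throughout this section.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, theta1, theta2):
    """Forward propagation through one hidden layer.

    x:      (n,) input features, without the bias term
    theta1: (s2, n + 1) weights from layer 1 to layer 2
    theta2: (1, s2 + 1) weights from layer 2 to layer 3
    """
    a1 = np.concatenate(([1.0], x))            # prepend x_0 = 1
    z2 = theta1 @ a1                           # z^(2) = Theta^(1) x
    a2 = np.concatenate(([1.0], sigmoid(z2)))  # a^(2) = g(z^(2)), plus a_0^(2) = 1
    z3 = theta2 @ a2                           # z^(3) = Theta^(2) a^(2)
    return sigmoid(z3)                         # h_Theta(x) = a^(3)


# Arbitrary weights and input, only to show the shapes involved
rng = np.random.default_rng(0)
theta1 = rng.normal(size=(3, 4))
theta2 = rng.normal(size=(1, 4))
x = np.array([0.5, -1.2, 2.0])

h = forward(x, theta1, theta2)
print(h.shape)  # (1,)
```

Each matrix product replaces one group of unit-by-unit sums from the earlier equations, so adding more units per layer changes only the matrix shapes, not the code.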

Basically, what a neural network does is similar to logistic regression, **but** instead of using our original features, it uses the hidden layer's features / activations.