# Machine Learning: Week 4 - Neural Networks
## Non-linear Hypothesis
Performing linear regression on a complex data set with many features quickly becomes unwieldy. Say you wanted to create a hypothesis from three features that includes all the quadratic terms:

$$\large
g( \theta _{0} +\theta _{1} x_{1}^{2} +\theta _{2} x_{1} x_{2} +\theta _{3} x_{1} x_{3} +\theta _{4} x_{2}^{2} +\theta _{5} x_{2} x_{3} +\theta _{6} x_{3}^{2})
$$


That gives us 6 features. The exact number of features for all polynomial terms of degree $r$ built from $n$ original features is given by the formula for combinations with repetition (see http://www.mathsisfun.com/combinatorics/combinations-permutations.html): $\frac{( n+r-1) !}{r!( n-1)!}$. In this case we are taking all two-element combinations of three features: $\frac{( 3+2-1) !}{2!( 3-1) !} =\frac{4!}{4} =6$

For 100 features, if we wanted to make them quadratic we would get $\frac{( 100+2-1) !}{( 2!( 100-1) !)} =5050$

We can approximate the growth in the number of new features from all quadratic terms as $O(n^2/2)$. If you wanted to include all cubic terms in your hypothesis, the number of features would grow asymptotically as $O(n^3)$. This growth is very steep: as the number of original features increases, the number of quadratic or cubic features increases very rapidly and quickly becomes impractical.

Example: let our training set be a collection of $50 \times 50$ pixel black-and-white photographs, and our goal will be to classify which ones are photos of cars. Our feature set size is then $n = 2500$, one intensity value per pixel.

Now let’s say we need to make a quadratic hypothesis function. With quadratic features, our growth is $O(n^2/2)$, so our total number of features will be about $2500^2/2=3125000$, which is very impractical.
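The counts quoted in this section can be checked with a few lines of Python. This is a minimal sketch; the function name `polynomial_term_count` is a hypothetical helper introduced here, not something from the course material.

```python
from math import comb


def polynomial_term_count(n: int, r: int) -> int:
    """Number of degree-r terms built from n features.

    Uses combinations with repetition: (n + r - 1)! / (r! (n - 1)!),
    which equals the binomial coefficient C(n + r - 1, r).
    """
    return comb(n + r - 1, r)


print(polynomial_term_count(3, 2))     # 6
print(polynomial_term_count(100, 2))   # 5050
print(polynomial_term_count(2500, 2))  # 3126250
```

For $n = 2500$ the exact count is 3126250, which is close to the $n^2/2$ estimate of 3125000 used above.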

Neural networks offer an alternative way to perform machine learning when we have complex hypotheses with many features.

## Neurons and the Brain
Neural networks are limited imitations of how our own brains work. They’ve had a big recent resurgence because of advances in computer hardware.

There is evidence that the brain uses only one “learning algorithm” for all its different functions. Scientists have tried cutting (in an animal brain) the connection between the ears and the auditory cortex and rewiring the optic nerve to the auditory cortex, and found that the auditory cortex literally learns to see.

This principle is called “neuroplasticity” and is supported by many examples and much experimental evidence.

## Model Representation I
Let’s examine how we will represent a hypothesis function using neural networks.

At a very simple level, neurons are basically computational units that take inputs through **dendrites** as electrical signals (called “spikes”) and channel them to outputs through **axons**.

In our model, our dendrites are like the input features $x_1 ... x_n$, and the output is the result of our hypothesis function. In this model our $x_0$ input node is sometimes called the “bias unit.” It is always equal to $1$.

In neural networks, we use the same logistic function as in classification: $\frac{1}{1+e^{-\theta ^{T} x}}$. In neural networks however we sometimes call it a **sigmoid (logistic) activation function**. Our “theta” parameters are sometimes instead called _**“weights”**_ in the neural networks model.
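As a minimal sketch of a single logistic unit, the activation function and the weighted sum can be written in plain Python. The names `sigmoid` and `neuron` are assumptions made for this illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Sigmoid (logistic) activation: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(theta: list[float], x: list[float]) -> float:
    """Single logistic unit: g(theta^T x).

    x[0] is expected to be the bias unit, which is always 1.
    """
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)


print(sigmoid(0.0))  # 0.5
```

Note that `math.exp` can overflow for very large negative `z`; this sketch ignores that case because the inputs used in these notes are small.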

Visually, a simplistic representation looks like:

$$\large
\begin{bmatrix}
x_{0}\\
x_{1}\\
x_{2}
\end{bmatrix}\rightarrow [ \ \ \ ] \ \rightarrow h_{\theta }( x)
$$

Our input nodes (layer 1) go into another node (layer 2), and are output as the hypothesis function. The first layer is called the “input layer” and the final layer the “output layer,” which gives the final value computed by the hypothesis.

We can have intermediate layers of nodes between the input and output layers, called “hidden layers.”

We label these intermediate or “hidden” layer nodes $a_0^{(2)} ... a_n^{(2)}$ and call them “activation units.”

$$\large
 \begin{array}{l}
a_{i}^{( j)} =activation\ \ of\ \ unit\ \ i\ \ in\ \ layer\ \ j\\
\Theta ^{( j)} =matrix\ \ of\ \ weights\ \ controlling\ \ function\ \ mapping\ \ from\ \ layer\ \ j\ \ to\ \ layer\ \ j+1
\end{array}
$$

If we had one hidden layer, it would look visually something like:

$$\large
\begin{bmatrix}
x_{0}\\
x_{1}\\
x_{2}\\
x_{3}
\end{bmatrix}\rightarrow \begin{bmatrix}
a_{1}^{( 2)}\\
a_{2}^{( 2)}\\
a_{3}^{( 2)}
\end{bmatrix}\rightarrow h_{\Theta }( x)
$$

The values for each of the “activation” nodes are obtained as follows:

$$\large
 \begin{array}{l}
a_{1}^{( 2)} =g\left( \Theta _{10}^{( 1)} x_{0} +\Theta _{11}^{( 1)} x_{1} +\Theta _{12}^{( 1)} x_{2} +\Theta _{13}^{( 1)} x_{3}\right)\\
a_{2}^{( 2)} =g\left( \Theta _{20}^{( 1)} x_{0} +\Theta _{21}^{( 1)} x_{1} +\Theta _{22}^{( 1)} x_{2} +\Theta _{23}^{( 1)} x_{3}\right)\\
a_{3}^{( 2)} =g\left( \Theta _{30}^{( 1)} x_{0} +\Theta _{31}^{( 1)} x_{1} +\Theta _{32}^{( 1)} x_{2} +\Theta _{33}^{( 1)} x_{3}\right)\\
h_{\Theta }( x) =a_{1}^{( 3)} =g\left( \Theta _{10}^{( 2)} a_{0}^{( 2)} +\Theta _{11}^{( 2)} a_{1}^{( 2)} +\Theta _{12}^{( 2)} a_{2}^{( 2)} +\Theta _{13}^{( 2)} a_{3}^{( 2)}\right)
\end{array}
$$

This is saying that we compute our activation nodes by using a $3\times4$ matrix of parameters, $\Theta^{(1)}$. We apply each row of the parameters to our inputs to obtain the value for one activation node. Our hypothesis output is the logistic function applied to the weighted sum of the values of our activation nodes, where the weights come from yet another parameter matrix $\Theta^{(2)}$ for our second layer of nodes.
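The four equations above can be sketched directly in Python. The weight values in `Theta1` and `Theta2` are hypothetical and chosen only for illustration; the structure (a $3\times4$ matrix followed by a $1\times4$ matrix) is what matters.

```python
import math


def g(z: float) -> float:
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights, for illustration only.
# Theta1 is 3 x 4: one row per hidden unit, one column per input (x0..x3).
Theta1 = [
    [0.1, 0.2, -0.3, 0.4],
    [-0.5, 0.6, 0.7, -0.8],
    [0.9, -1.0, 1.1, 1.2],
]
# Theta2 is 1 x 4: one row for the output unit, columns for a0..a3 of layer 2.
Theta2 = [[0.3, -0.6, 0.9, -1.2]]


def forward(x1: float, x2: float, x3: float) -> float:
    """Compute h_Theta(x) for the 3-input, 3-hidden-unit, 1-output network."""
    x = [1.0, x1, x2, x3]  # prepend the bias unit x0 = 1
    # Each row of Theta1 produces one activation a_k^(2).
    a2 = [g(sum(w * xi for w, xi in zip(row, x))) for row in Theta1]
    a2 = [1.0] + a2  # prepend the bias unit a0^(2) = 1
    # The single row of Theta2 produces a_1^(3), which is the hypothesis.
    return g(sum(w * ai for w, ai in zip(Theta2[0], a2)))


print(forward(1.0, 0.0, -1.0))
```

Because the output passes through the sigmoid, the result always lies strictly between 0 and 1.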

Each layer gets its own matrix of weights, $\Theta^{(j)}$.

The dimensions of these matrices of weights are determined as follows:

If a network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\Theta^{(j)}$ will be of dimension $s_{j+1}\times(s_j+1)$.

The $+1$ comes from the extra column in $\Theta^{(j)}$ holding the weights $\Theta^{(j)}_0$ that multiply the “bias node” ($x_0$ in the input layer). In other words, the output nodes will not include the bias node while the inputs will. The following image summarizes our model representation:

![neural_net](./Week4_Images/neural_net.png)

Example: layer 1 has 2 input nodes and layer 2 has 4 activation nodes. Dimension of $\Theta^{(1)}$ is going to be $4\times3$ where $s_j=2$ and $s_{j+1}=4$, so $s_{j+1}\times(s_j+1)=4\times3$.
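The dimension rule is simple enough to express as a small helper. This is a sketch; `theta_shape` and `all_theta_shapes` are hypothetical names, and layer sizes are given without counting bias units.

```python
def theta_shape(s_j: int, s_next: int) -> tuple[int, int]:
    """Shape of Theta^(j) mapping layer j (s_j units) to layer j+1 (s_next units).

    The +1 accounts for the bias unit added to layer j.
    """
    return (s_next, s_j + 1)


def all_theta_shapes(layer_sizes: list[int]) -> list[tuple[int, int]]:
    """Shapes of every weight matrix for a network with the given layer sizes."""
    return [theta_shape(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]


print(theta_shape(2, 4))            # (4, 3)
print(all_theta_shapes([3, 3, 1]))  # [(3, 4), (1, 4)]
```

The second call reproduces the shapes of the network with three inputs, three hidden units, and one output described earlier.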

## Model Representation II
In this section we’ll do a vectorized implementation of the above functions. We’re going to define a new variable $z^{(j)}_k$ that encompasses the weighted sum inside our $g$ function. In our previous example, if we substitute the variable $z$ for each of those sums we get:

$$\large
 \begin{array}{l}
a_{1}^{( 2)} =g\left( z_{1}^{( 2)}\right)\\
a_{2}^{( 2)} =g\left( z_{2}^{( 2)}\right)\\
a_{3}^{( 2)} =g\left( z_{3}^{( 2)}\right)
\end{array}
$$

In other words, for layer $j=2$ and node $k$, the variable $z$ will be:

$$\large
z_{k}^{( 2)} =\Theta _{k,0}^{( 1)} x_{0} + \Theta _{k,1}^{( 1)} x_{1} +...+\Theta _{k,n}^{( 1)} x_{n}
$$

The vector representation of $x$ and $z^{(j)}$ is:

$$\large
x=\begin{bmatrix}
x_{0}\\
x_{1}\\
...\\
x_{n}
\end{bmatrix} \ \ \ \ \ z^{( j)} =\begin{bmatrix}
z_{1}^{( j)}\\
z_{2}^{( j)}\\
...\\
z_{s_{j}}^{( j)}
\end{bmatrix}
$$

Setting $x=a^{(1)}$, we can rewrite the equation as:

$$\large
z^{( j)} =\Theta ^{( j-1)} a^{( j-1)}
$$

We are multiplying our matrix $\Theta^{(j-1)}$ with dimensions $s_j\times(n+1)$ (where $s_j$ is the number of our activation nodes) by our vector $a^{(j-1)}$ with height $(n+1)$. This gives us our vector $z^{(j)}$ with height $s_j$.

Now we can get a vector of our activation nodes for layer $j$ as follows:

$$\large
a^{( j)} =g\left( z^{( j)}\right)
$$

Here our function $g$ is applied element-wise to our vector $z^{(j)}$.

We can then add a bias unit (equal to $1$) to layer $j$ after we have computed $a^{(j)}$. This will be element $a^{(j)}_0$ and will be equal to $1$.

To compute our final hypothesis, let’s first compute another $z$ vector:

$$\large
z^{(j+1)} =\Theta ^{( j)} a^{( j)}
$$

We get this final z vector by multiplying the next theta matrix after $\Theta^{(j-1)}$ with the values of all the activation nodes we just got. This last theta matrix $\Theta^{(j)}$ will have only **one row** so that our result is a single number.

We then get our final result with:
$$\large
h_{\Theta }( x) =a^{( j+1)} =g\left( z^{( j+1)}\right)
$$

Notice that in this **last step**, between layer $j$ and layer $j+1$, we are doing __exactly the same thing__ as we did in logistic regression.

Adding all these intermediate layers in neural networks allows us to more elegantly produce interesting and more complex non-linear hypotheses.
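The vectorized steps in this section ($z^{(j)} = \Theta^{(j-1)} a^{(j-1)}$, then $a^{(j)} = g(z^{(j)})$, then add the bias unit) can be sketched as a loop over the weight matrices. The code below uses plain Python lists rather than a matrix library, and the names `matvec` and `forward_propagate` are assumptions made for this example.

```python
import math


def g(z: float) -> float:
    """Sigmoid activation, applied to one element."""
    return 1.0 / (1.0 + math.exp(-z))


def matvec(Theta: list[list[float]], a: list[float]) -> list[float]:
    """Matrix-vector product Theta * a."""
    return [sum(w * ai for w, ai in zip(row, a)) for row in Theta]


def forward_propagate(thetas: list[list[list[float]]], x: list[float]) -> list[float]:
    """Propagate x (given without the bias unit) through every layer.

    thetas[j] is expected to have shape s_{j+1} x (s_j + 1).
    Returns the activations of the output layer.
    """
    a = list(x)  # a^(1) = x
    for Theta in thetas:
        a = [1.0] + a             # add the bias unit a_0 = 1
        z = matvec(Theta, a)      # z^(j+1) = Theta^(j) a^(j)
        a = [g(zk) for zk in z]   # a^(j+1) = g(z^(j+1)), element-wise
    return a


# A single layer with all-zero weights gives g(0) = 0.5 for any input.
print(forward_propagate([[[0.0, 0.0, 0.0]]], [5.0, -3.0]))  # [0.5]
```

With only one weight matrix that has a single row, the loop runs once and the computation reduces to logistic regression, which matches the observation about the last step above.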

## Examples and Intuitions I
A simple example of applying neural networks is predicting $x_1$ _AND_ $x_2$, the logical ‘and’ operator, which is true only if both $x_1$ and $x_2$ are $1$.

The graph of our functions will look like:

$$\large
\begin{bmatrix}
x_{0}\\
x_{1}\\
x_{2}
\end{bmatrix}\rightarrow \left[ g\left( z^{( 2)}\right)\right]\rightarrow h_{\Theta }( x)
$$

Remember that $x_0$ is our bias variable and is always $1$.

Let’s set our first theta matrix as:

$$\large
\Theta ^{( 1)} =[ -30\ \ \ 20\ \ \ 20]
$$

This will cause the output of our hypothesis to be close to $1$ only if both $x_1$ and $x_2$ are $1$. In other words:

$$\large
h_{\Theta }( x) =g( -30+20x_{1} +20x_{2})
$$
<br>
$$\large
 \begin{array}{l}
x_{1} =0\ \ and\ \ x_{2} =0\ \ then\ \ g( -30) \approx 0\\
x_{1} =0\ \ and\ \ x_{2} =1\ \ then\ \ g( -10) \approx 0\\
x_{1} =1\ \ and\ \ x_{2} =0\ \ then\ \ g( -10) \approx 0\\
x_{1} =1\ \ and\ \ x_{2} =1\ \ then\ \ g( 10) \approx 1
\end{array}
$$

So we have constructed one of the fundamental operations in computers by using a small neural network rather than using an actual AND gate. Neural networks can also be used to simulate all the other logical gates. The following is an example of the logical operator 'OR', meaning either $x_1$ is true or $x_2$ is true, or both:

![neural_net_intuition](./Week4_Images/neural_net_intuition.png)
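The AND unit with weights $[-30\ \ 20\ \ 20]$ can be checked numerically. This is a small sketch; `and_unit` is a hypothetical name, and rounding the sigmoid output to 0 or 1 is a convenience added here for display.

```python
import math


def g(z: float) -> float:
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


THETA_AND = [-30.0, 20.0, 20.0]  # weights for x0 (bias), x1, x2


def and_unit(x1: int, x2: int) -> float:
    """h_Theta(x) = g(-30 + 20*x1 + 20*x2)."""
    return g(THETA_AND[0] + THETA_AND[1] * x1 + THETA_AND[2] * x2)


# Print the truth table, rounding the output to the nearest integer.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(and_unit(x1, x2)))
```

The rounded output is 1 only for the input pair (1, 1), matching the table above.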

## Examples and Intuitions II
The $\Theta^{(1)}$ matrices for AND, NOR, and OR are:

$$\large
 \begin{array}{l}
AND:\\
\ \ \ \ \ \ \ \ \Theta ^{( 1)} =[ -30\ \ \ \ \ 20\ \ \ \ \ 20]\\
NOR:\\
\ \ \ \ \ \ \ \ \Theta ^{( 1)} =[ 10\ \ \ \ -20\ \ \ \ -20]\\
OR:\\
\ \ \ \ \ \ \ \ \Theta ^{( 1)} =[ -10\ \ \ \ \ 20\ \ \ \ \ 20]
\end{array}
$$

We can combine these to get the XNOR logical operator (which gives 1 if $x_1$ and $x_2$ are both $0$ or both $1$).

$$\large
\begin{bmatrix}
x_{0}\\
x_{1}\\
x_{2}
\end{bmatrix}\rightarrow \begin{bmatrix}
a_{1}^{( 2)}\\
a_{2}^{( 2)}
\end{bmatrix}\rightarrow \left[ a^{( 3)}\right]\rightarrow h_{\Theta }( x)
$$

For the transition between the first and second layer, we’ll use a $\Theta^{(1)}$ matrix that combines the values for AND and NOR:

$$\large
\Theta ^{( 1)} =\begin{bmatrix}
-30 & 20 & 20\\
10 & -20 & -20
\end{bmatrix}
$$

For the transition between the second and third layer, we’ll use a $\Theta^{(2)}$ matrix that uses the value for OR:

$$\large
\Theta ^{( 2)} =[ -10\ \ \ \ \ 20\ \ \ \ \ 20]
$$

Let’s write out the values for all our nodes:

$$\large
 \begin{array}{l}
a^{( 2)} =g\left( \Theta ^{( 1)} x\right)\\
a^{( 3)} =g\left( \Theta ^{( 2)} a^{( 2)}\right)\\
h_{\Theta }( x) =a^{( 3)}
\end{array}
$$

And there we have the XNOR operator using a hidden layer with two nodes! The following summarizes the above algorithm:

![XNOR](./Week4_Images/XNOR.png)
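The XNOR network can be verified with a short script that uses the AND and NOR weights in the first layer and the OR weights in the second. This is a sketch under the assumption that the bias unit is prepended before each layer; `layer` and `xnor` are hypothetical helper names.

```python
import math


def g(z: float) -> float:
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


THETA1 = [
    [-30.0, 20.0, 20.0],   # AND  -> a_1^(2)
    [10.0, -20.0, -20.0],  # NOR  -> a_2^(2)
]
THETA2 = [[-10.0, 20.0, 20.0]]  # OR of the two hidden units


def layer(Theta: list[list[float]], a: list[float]) -> list[float]:
    """Add the bias unit, multiply by Theta, and apply g element-wise."""
    a = [1.0] + a
    return [g(sum(w * ai for w, ai in zip(row, a))) for row in Theta]


def xnor(x1: int, x2: int) -> float:
    a2 = layer(THETA1, [float(x1), float(x2)])
    a3 = layer(THETA2, a2)
    return a3[0]


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2)))
```

The rounded output is 1 when both inputs are equal and 0 otherwise.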

## Multiclass Classification
To classify data into multiple classes, we let our hypothesis function return a vector of values. Say we wanted to classify our data into one of four categories. We will use the following example to see how this classification is done. This algorithm takes as input an image and classifies it accordingly: 

![multiclass-classification](./Week4_Images/multiclass_classification.png)

We can define our set of resulting classes as $y$:

$$\large
y^{( i)} =\begin{bmatrix}
1\\
0\\
0\\
0
\end{bmatrix} \ ,\ \begin{bmatrix}
0\\
1\\
0\\
0
\end{bmatrix} \ ,\ \begin{bmatrix}
0\\
0\\
1\\
0
\end{bmatrix} \ ,\ \begin{bmatrix}
0\\
0\\
0\\
1
\end{bmatrix}
$$

Each $y^{(i)}$ is the label for one training image, and the position of the $1$ marks which of the four classes (car, pedestrian, truck, or motorcycle) that image belongs to. Each of the inner layers provides us with some new information, which leads to our final hypothesis function. The setup looks like:

$$\large
\begin{bmatrix}
x_{0}\\
x_{1}\\
x_{2}\\
...\\
x_{n}
\end{bmatrix}\rightarrow \begin{bmatrix}
a_{0}^{( 2)}\\
a_{1}^{( 2)}\\
a_{2}^{( 2)}\\
...
\end{bmatrix}\rightarrow \begin{bmatrix}
a_{0}^{( 3)}\\
a_{1}^{( 3)}\\
a_{2}^{( 3)}\\
...
\end{bmatrix}\rightarrow ...\rightarrow \begin{bmatrix}
h_{\Theta }( x)_{1}\\
h_{\Theta }( x)_{2}\\
h_{\Theta }( x)_{3}\\
h_{\Theta }( x)_{4}
\end{bmatrix}
$$

Our resulting hypothesis for one set of inputs may look like:

$$\large
h_{\Theta }( x) =\begin{bmatrix}
0\\
0\\
1\\
0
\end{bmatrix}
$$

In which case our resulting class is the third one down, or $h_\Theta(x)_3$, which represents a **motorcycle**.
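In practice the outputs are real numbers between 0 and 1 rather than exact zeros and ones, so a common way to read off the class is to take the position of the largest output. The sketch below assumes that convention; `one_hot` and `predict_class` are hypothetical names, and class positions are 1-based to match the $h_\Theta(x)_k$ notation.

```python
def one_hot(class_position: int, num_classes: int) -> list[int]:
    """Build the label vector y with a 1 at the given 1-based position."""
    return [1 if k == class_position else 0 for k in range(1, num_classes + 1)]


def predict_class(h: list[float]) -> int:
    """Return the 1-based position of the largest hypothesis output."""
    best_index = max(range(len(h)), key=lambda k: h[k])
    return best_index + 1


print(one_hot(3, 4))                            # [0, 0, 1, 0]
print(predict_class([0.02, 0.05, 0.91, 0.10]))  # 3
```

The example output values are made up for illustration; they are not results from a trained network.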
