# Neural Networks: Representation

## Neural Networks

- **Model Representation I**

    - Neurons are computational units that receive electrical signals as inputs (through **dendrites**) and channel them to outputs (through **axons**).

    - In neural networks, we use the same logistic function as in classification, $\frac{1}{1+e^{-\theta^Tx}}$, but we sometimes call it a sigmoid (logistic) **activation** function. In this context, the "theta" parameters are sometimes called "weights".

    - A simplistic representation looks like:
    
    $\begin{bmatrix}x_0\\x_1\\x_2\end{bmatrix} \to [\ \ ] \to h_\theta(x)$

    - The input nodes (layer 1), known as the "input layer", feed into another node (layer 2), the "output layer", which outputs the hypothesis function. Intermediate layers of nodes between the input and output layers are called "hidden layers".

    $\begin{bmatrix}x_0\\x_1\\x_2\\x_3\end{bmatrix} \to \begin{bmatrix}a_1^{(2)}\\a_2^{(2)}\\a_3^{(2)}\end{bmatrix} \to h_\theta(x)$

    $a_i^{(j)}$ = "activation" of unit i in layer j

    $\Theta^{(j)}$ = matrix of weights controlling the function mapping from layer j to layer j + 1

    - The values of the "activation" nodes are obtained as follows:

    $a_1^{(2)} = g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3)$

    $a_2^{(2)} = g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3)$

    $a_3^{(2)} = g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3)$

    $h_\Theta(x) = a_1^{(3)} = g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)})$

    - Each layer gets its own matrix of weights, $\Theta^{(j)}$.
    - If a network has $s_j$ units in layer j and $s_{j+1}$ units in layer j + 1, then $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j + 1)$. The +1 comes from the addition in $\Theta^{(j)}$ of the "bias nodes", $x_0$ and $\Theta_0^{(j)}$. For example, the network above has $s_1 = 3$ and $s_2 = 3$, so $\Theta^{(1)}$ is $3 \times 4$.
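
The unit-by-unit equations above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the weight values in `Theta1` and `Theta2`, the input `x`, and the helper names `g` and `unit_activation` are all hypothetical and chosen only to show the shapes involved.

```python
import math

def g(z):
    """Sigmoid (logistic) activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def unit_activation(weights, inputs):
    """Activation of one unit: g(sum over k of weights[k] * inputs[k])."""
    return g(sum(w * v for w, v in zip(weights, inputs)))

# Hypothetical weights, for illustration only.
# Theta1 maps layer 1 (3 inputs + bias) to layer 2 (3 units): 3 x 4.
Theta1 = [
    [0.1, 0.2, -0.3, 0.4],
    [-0.5, 0.6, 0.7, -0.8],
    [0.9, -1.0, 1.1, 1.2],
]
# Theta2 maps layer 2 (3 units + bias) to layer 3 (1 unit): 1 x 4.
Theta2 = [[0.3, -0.2, 0.5, 0.1]]

x = [1.0, 0.5, -1.5, 2.0]  # x_0 = 1 is the bias input

# a_1^(2), a_2^(2), a_3^(2), with the bias unit a_0^(2) = 1 prepended.
a2 = [1.0] + [unit_activation(row, x) for row in Theta1]

# h_Theta(x) = a_1^(3)
h = unit_activation(Theta2[0], a2)
```

Each row of `Theta1` holds the weights for one hidden unit, so the number of rows equals $s_2$ and the number of columns equals $s_1 + 1$.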

- **Model Representation II**

    - For a vectorized implementation of the functions above, we define a new variable $z_k^{(j)}$ that encompasses the parameters inside our g function:

    $a_1^{(2)} = g(z_1^{(2)})$

    $a_2^{(2)} = g(z_2^{(2)})$

    $a_3^{(2)} = g(z_3^{(2)})$

    - In other words, for layer j = 2 and node k, the variable z will be:

    $z_k^{(2)} = \Theta_{k,0}^{(1)}x_0 + \Theta_{k,1}^{(1)}x_1 + ... + \Theta_{k,n}^{(1)}x_n$

    - The vector representation of x and $z^{(j)}$ is:

    $x = \begin{bmatrix}x_0\\x_1\\\vdots\\x_n\end{bmatrix}$
    $z^{(j)} = \begin{bmatrix}z_1^{(j)}\\z_2^{(j)}\\\vdots\\z_{s_j}^{(j)}\end{bmatrix}$

    - Setting x = $a^{(1)}$, we can rewrite the equation as:

    $z^{(j)} = \Theta^{(j-1)}a^{(j-1)}$

    - We multiply the matrix $\Theta^{(j-1)}$, with dimensions $s_j \times (n+1)$ (where n is the number of units in layer j - 1), by the vector $a^{(j-1)}$ with height (n+1). This gives the vector $z^{(j)}$ with height $s_j$. We can then get the vector of activation nodes for layer j as follows:

    $a^{(j)} = g(z^{(j)})$
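
The two vectorized steps, $z^{(j)} = \Theta^{(j-1)}a^{(j-1)}$ and $a^{(j)} = g(z^{(j)})$, can be sketched with NumPy as a loop over layers. The function names `sigmoid` and `forward` and the random weights are assumptions made for illustration. The sketch also assumes that `x` is passed without the bias entry and that a bias unit $a_0 = 1$ is prepended before each multiplication, which is what the +1 column in each weight matrix requires.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid (logistic) activation."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(thetas, x):
    """Forward propagation through every layer.

    thetas: list of weight matrices; the matrix mapping layer j to
            layer j + 1 has shape (s_{j+1}, s_j + 1).
    x:      input features without the bias entry x_0.
    """
    a = np.asarray(x, dtype=float)
    for theta in thetas:
        a = np.concatenate(([1.0], a))  # prepend the bias unit a_0 = 1
        z = theta @ a                   # z^(j) = Theta^(j-1) a^(j-1)
        a = sigmoid(z)                  # a^(j) = g(z^(j))
    return a

# Hypothetical random weights for a 3-input, 3-hidden-unit, 1-output network.
rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))
Theta2 = rng.normal(size=(1, 4))

h = forward([Theta1, Theta2], [0.5, -1.5, 2.0])
```

Here `h` is a length-1 array holding $h_\Theta(x)$. With all-zero weights every unit outputs $g(0) = 0.5$, which is a quick sanity check on the shapes.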

## Applications

- **Examples and Intuitions I**

    - A simple application of neural networks is predicting $x_1$ AND $x_2$. The graph of our function will look like:

    $\begin{bmatrix}x_0\\x_1\\x_2\end{bmatrix} \to [g(z^{(2)})] \to h_\Theta(x)$

    - Let's set our first theta matrix as:

    $\Theta^{(1)} = \begin{bmatrix}-30 & 20 & 20\end{bmatrix}$

    - Our hypothesis will be $h_\Theta(x) = g(-30 + 20x_1 + 20x_2)$. For binary inputs, this is close to 1 only when both $x_1$ and $x_2$ are 1.
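
To see why these weights behave like AND, the hypothesis can be evaluated on all four binary inputs. The sketch below uses only the weights given above; the helper names `g` and `h_and` are introduced here for illustration.

```python
import math

def g(z):
    """Sigmoid (logistic) activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def h_and(x1, x2):
    """Hypothesis for Theta^(1) = [-30, 20, 20], with bias input x_0 = 1."""
    return g(-30 + 20 * x1 + 20 * x2)

# Truth table as (x1, x2, output rounded to 0 or 1).
truth_table = [(x1, x2, round(h_and(x1, x2)))
               for x1 in (0, 1) for x2 in (0, 1)]
# truth_table == [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
```

The argument of `g` is -30, -10, -10 and 10 for the four inputs, so the sigmoid saturates near 0 in the first three cases and near 1 in the last.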

- **Examples and Intuitions II**

    - The $\Theta^{(1)}$ matrices for AND, NOR and OR are:

        AND: $\Theta^{(1)} = \begin{bmatrix}-30 & 20 & 20\end{bmatrix}$

        NOR: $\Theta^{(1)} = \begin{bmatrix}10 & -20 & -20\end{bmatrix}$

        OR: $\Theta^{(1)} = \begin{bmatrix}-10 & 20 & 20\end{bmatrix}$

    - We can combine these to get the XNOR logical operator:

        $\begin{bmatrix}x_0\\x_1\\x_2\end{bmatrix} \to \begin{bmatrix}a_1^{(2)}\\a_2^{(2)}\end{bmatrix} \to [a^{(3)}] \to h_\Theta(x)$

    - Let's write out the values for all our nodes:

        $a^{(2)} = g(\Theta^{(1)} \cdot x)$

        $a^{(3)} = g(\Theta^{(2)} \cdot a^{(2)})$

        $h_\Theta(x) = a^{(3)}$

    - Note: $x_1$ XNOR $x_2$ = ($x_1$ AND $x_2$) OR ($x_1$ NOR $x_2$)
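
One way to assemble the XNOR network from the gates above is sketched below with NumPy. The stacking is an assumption that follows from the note: the first layer holds the AND and NOR weights as its two rows, and the second layer applies the OR weights to the two hidden units. The names `sigmoid` and `xnor` are hypothetical helpers.

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid (logistic) activation."""
    return 1.0 / (1.0 + np.exp(-z))

# Layer 1: one row per hidden unit (assumed stacking of the gates above).
Theta1 = np.array([[-30.0, 20.0, 20.0],    # a_1^(2): x1 AND x2
                   [10.0, -20.0, -20.0]])  # a_2^(2): x1 NOR x2
# Layer 2: OR of the two hidden units.
Theta2 = np.array([[-10.0, 20.0, 20.0]])

def xnor(x1, x2):
    """Forward propagation through the two-layer XNOR network."""
    a1 = np.array([1.0, x1, x2])        # input with bias x_0 = 1
    a2 = sigmoid(Theta1 @ a1)           # hidden activations
    a2 = np.concatenate(([1.0], a2))    # prepend bias a_0^(2) = 1
    a3 = sigmoid(Theta2 @ a2)           # output activation
    return float(a3[0])

outputs = [int(round(xnor(x1, x2))) for x1 in (0, 1) for x2 in (0, 1)]
# outputs == [1, 0, 0, 1] for inputs (0,0), (0,1), (1,0), (1,1)
```

The output is 1 exactly when the two inputs are equal, which is the XNOR truth table.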

- **Multiclass Classification**

    - To classify data into multiple classes, we let our hypothesis function return a vector of values. Say we want to classify our data into one of four categories.

    - We can define our set of resulting classes as y, where each $y^{(i)}$ is one of the following vectors:

    $y^{(i)} = \begin{bmatrix}1\\0\\0\\0\end{bmatrix},\begin{bmatrix}0\\1\\0\\0\end{bmatrix},\begin{bmatrix}0\\0\\1\\0\end{bmatrix},\begin{bmatrix}0\\0\\0\\1\end{bmatrix}$
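
The label vectors above are one-hot encodings, which can be sketched in plain Python. The helpers `one_hot` and `predict_class` are hypothetical names, the output vector `h` is made-up data, and reading the prediction as the index of the largest output is a common convention assumed here rather than something stated in these notes.

```python
def one_hot(label, num_classes):
    """Return a list with 1 at index `label` (0-based) and 0 elsewhere."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in range(num_classes)")
    return [1 if k == label else 0 for k in range(num_classes)]

def predict_class(h):
    """Index of the largest entry of the hypothesis output vector."""
    return max(range(len(h)), key=lambda k: h[k])

# The four possible label vectors for a four-class problem.
labels = [one_hot(k, 4) for k in range(4)]

# Hypothetical network output for one example.
h = [0.05, 0.10, 0.80, 0.05]
predicted = predict_class(h)  # index 2, i.e. the third class
```

With this encoding the output layer has one unit per class, so $h_\Theta(x)$ has the same height as each $y^{(i)}$.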