In [1]:
%run Latex_macros.ipynb


In [2]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import neural_net_helper
%aimport neural_net_helper

nnh = neural_net_helper.NN_Helper()

# Inside a layer: Units/Neurons

## Notation 1
Layer $\ll$, for $1 \le \ll \le L$:
- Produces output vector $\y_\llp$
- $\y_\llp$ is a vector of $n_\llp$ synthetic features
$$
n_\llp = || \y_\llp ||
$$
- Takes as input $\y_{(\ll-1)}$, the output of the preceding layer

- Layer $L$ will typically implement Regression or Classification
- The first $(L-1)$ layers create synthetic features of increasing complexity
- We will use layer $(L+1)$ to compute a Loss

<div>
     <!-- edX: Original: <img src="images/NN_Layers_plus_Loss.png"> replace by EdX created image -->
    <img src="images/Addtl_Loss_Layer_W8_L5_Sl4.png">
</div>

The input $\x$
- Is called "layer 0"
- $\y_{(0)} = \x$

The output $\y_{(L-1)}$ of the penultimate layer $(L-1)$
- Becomes the input of a Classifier/Regression model at layer $L$


# Our first layer type: the Fully Connected/Dense Layer

Let's look inside layer $\ll$ (of a particular type called *Fully Connected* or *Dense*)

<div align="middle">
    <center>Layer</center>
    <br>
    <!-- edX: Original: <img src="images/NN_Layer_multi_unit.png"> replace by EdX created image -->
    <img src=images/Layers_W8_L3_Sl5.png width=60%>
</div>

- Input vector of $n_{(\ll-1)}$ features: $\y_{(\ll-1)}$
- Produces output vector of $n_\llp$ features $\y_\llp$
- Feature $j$ defined by the function 
$$\y_{\llp,j} = \sigma (\y_{(\ll-1)} \cdot \W_{\llp,j} )$$

Each feature $\y_{\llp,j}$ is produced by a *unit* (*neuron*)
- There are $n_\llp$ units in layer $\ll$
- The units are *homogeneous*
    - same input $\y_{(\ll-1)}$ to every unit
    - same functional form for every unit
    - units differ only in $\W_{\llp,j}$

*Units* are also sometimes referred to as *Hidden Units*
- They are internal to a layer.
- From the standpoint of the Input/Output behavior of a layer, the units are "hidden"

The functional form
$$\y_{\llp,j} = \sigma(\y_{(\ll-1)} \cdot \W_{\llp,j} )$$

is called a *Dense* or *Fully Connected* unit.

It is called *Fully Connected* because
- each unit takes as input $\y_{(\ll-1)}$, **all** $n_{(\ll-1)}$ outputs of the preceding layer
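
The computation of a single unit can be sketched in a few lines of NumPy. This is only an illustration: the names `sigmoid` and `dense_unit` and the numeric values are hypothetical, and no bias term is included because the formula above does not show one.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def dense_unit(y_prev, w_j):
    """One Fully Connected unit: sigmoid of the dot product of input and weights."""
    return sigmoid(np.dot(y_prev, w_j))

# Hypothetical values, chosen only for illustration
y_prev = np.array([1.0, 2.0, 3.0])    # output of layer (l-1): 3 features
w_j    = np.array([0.5, -1.0, 0.5])   # weights of unit j: one per input feature

y_j = dense_unit(y_prev, w_j)         # a single scalar feature
print(y_j)                            # dot product is 0, so the output is 0.5
```

Note that `w_j` has one weight for every element of `y_prev`: that is the sense in which the unit is fully connected.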



The *Fully Connected* part can be better appreciated by looking at a diagram of the connectivity
of a *single* unit producing a *single* feature.

A Fully Connected/Dense Layer producing a *single* feature at layer $\ll$ computes
$$
\y_{\llp,1} = a_\llp( \y_{(\ll-1)} \cdot \W_{\llp,1} )
$$

A function, $a_\llp$, is applied to the dot product
- It is called an *activation function*
- A very common choice for activation function is the sigmoid $\sigma$

<div align="middle">
    <center><strong>Fully connected unit, single feature</strong></center>
    <br>
<img src=images/FC_1feature.png>
    </div>

The edges into the single unit of layer $\ll$ correspond to $\W_{\llp,1}$.


A Fully Connected/Dense Layer with multiple units, producing *multiple* features at layer $\ll$, computes
$$
\y_{\llp,j} = a_\llp( \y_{(\ll-1)} \cdot \W_{\llp,j} )
$$

<div align="middle">
    <center><strong>Fully connected, two features</strong></center>
    <br>
<img src=images/FC_2feature.png>
    </div>

The edges into each unit of layer $\ll$ correspond to
- $\W_{\llp,1}, \W_{\llp,2} \ldots$
- Separate colors for each unit/row of $\W$

Each unit $\y_{\llp,j}$ in layer $\ll$ creates a new feature using pattern $\W_{\llp,j}$


The functional form consists of
- A dot product $\y_{(\ll-1)} \cdot \W_{\llp,j}$
    - Which can be thought of as matching input $\y_{(\ll-1)}$ against pattern $\W_{\llp,j}$
- Fed into an activation function $a_\llp$
    - Here, $a_\llp = \sigma$, the *sigmoid* function we have previously encountered in Logistic Regression.

Because the units are homogeneous, we can depict the whole layer as

<div>
    <center><strong>Layer</strong></center>
    <br>
    <!-- edX: Original: <img src="images/NN_Layer_Dense.png"> replace by EdX created image -->
    <img src="images/Layers_W8_L3_Sl18.png" width=60%>
</div>

where
- $\y_\llp$ is a vector of length $n_\llp$
- $\W_\llp$ is a matrix
    - $n_\llp$ rows
    - Row $j$ holds the weights of unit $j$: $\W_\llp^{(j)} = \W_{\llp,j}$
    
Written with the shorthand `Dense(`$n_\llp$`)`
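
Because every unit has the same form, the whole layer reduces to one matrix-vector product followed by an element-wise sigmoid. The sketch below assumes NumPy and uses hypothetical names (`dense_layer`, `n_prev`, `n_l`) and random weights. It checks that the matrix form gives the same result as evaluating the units one at a time.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def dense_layer(y_prev, W):
    """All units at once: row j of W is the weight vector of unit j."""
    return sigmoid(W @ y_prev)

rng = np.random.default_rng(0)       # hypothetical random weights and input
n_prev, n_l = 3, 4                   # sizes of layers (l-1) and l
y_prev = rng.normal(size=n_prev)
W = rng.normal(size=(n_l, n_prev))   # n_l rows, one per unit

y_l = dense_layer(y_prev, W)

# Same result as applying each unit separately
y_loop = np.array([sigmoid(np.dot(y_prev, W[j])) for j in range(n_l)])

print(y_l.shape)                     # (4,)
print(np.allclose(y_l, y_loop))      # True
```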

We will introduce other types of layers.

- Most will be homogeneous
- Not all will be Fully Connected
- The dot product will play a similar role

# Non-linear activation

The sigmoid  function $\sigma$ may be the *most significant part* of the functional form
- The dot product is a *linear* operation
- The output of the sigmoid is a *non-linear* function of its input

So the sigmoid induces a non-linear transformation of the features $\y_{(\ll-1)}$

The outer function $a_\llp$ which applies a non-linear transformation to linear inputs
- Is called an *activation function*
- Sigmoid is one of several activation functions we will study

- The operation of a layer does not always need to be a dot product
- The activation function of a layer need not always be the sigmoid
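
One way to see why the non-linearity matters: without an activation, two stacked layers of dot products compute nothing more than a single layer of dot products. The sketch below, using hypothetical weights and helper names, checks this numerically and shows that inserting a sigmoid breaks the linearity.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and input, chosen only for illustration
W1 = np.array([[ 1.0, -1.0],
               [ 0.5,  2.0],
               [-1.0,  0.0]])      # layer 1: 3 units, 2 inputs
W2 = np.array([[1.0, 2.0, 1.0]])   # layer 2: 1 unit, 3 inputs
x  = np.array([1.0, 2.0])

def linear_net(u):
    """Two layers of dot products with no activation."""
    return W2 @ (W1 @ u)

def sigmoid_net(u):
    """The same two layers, with a sigmoid after the first."""
    return W2 @ sigmoid(W1 @ u)

# Without an activation, the two layers collapse into one with weights W2 @ W1
collapses = np.allclose(linear_net(x), (W2 @ W1) @ x)

# A linear function satisfies f(2x) = 2 f(x); the sigmoid network does not
linear_scales  = np.allclose(linear_net(2 * x), 2 * linear_net(x))
sigmoid_scales = np.allclose(sigmoid_net(2 * x), 2 * sigmoid_net(x))

print(collapses, linear_scales, sigmoid_scales)   # True True False
```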

More generically we write a layer as

<div>
    <center><strong>Layers</strong></center>
    <br>
    <div align="middle">
    <!-- edX: Original: <img src="images/NN_Layers.png"> <!Image source: NN_Layers.drawio; select only one box for export>replace by EdX created image -->
    <img src="images/Layers_W8_L2_Sl12_2.png" width=50%>
    </div>
</div>

$$
\y_\llp = a_\llp \left( f_\llp( \y_{(\ll-1)}, \W_{\llp}) \right) 
$$

where
- $f_\llp$ is a function of $\y_{(\ll-1)}$ and $\W_\llp$
- $a_\llp$ is an activation function
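
The generic form $a_\llp(f_\llp(\cdot))$ maps naturally onto a function that takes $f$ and $a$ as arguments. The sketch below is an illustration under assumptions, not a reference implementation: the names `layer`, `forward`, and `dot_product`, the layer sizes, and the random weights are all hypothetical. It chains layers so that the output of each becomes the input of the next, starting from layer 0 (the input $\x$).

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def dot_product(y_prev, W):
    """f for a Dense layer: one dot product per row of W."""
    return W @ y_prev

def layer(y_prev, W, f=dot_product, a=sigmoid):
    """Generic layer: y = a(f(y_prev, W))."""
    return a(f(y_prev, W))

def forward(x, weights):
    """Feed x (layer 0) through the layers in order; return every layer's output."""
    ys = [x]                           # ys[0] is the input, i.e. layer 0
    for W in weights:
        ys.append(layer(ys[-1], W))    # layer l consumes the output of layer l-1
    return ys

rng = np.random.default_rng(0)
sizes = [3, 4, 2]                      # hypothetical sizes of layers 0, 1, 2
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
x = rng.normal(size=sizes[0])

ys = forward(x, weights)
print([y.shape for y in ys])           # [(3,), (4,), (2,)]

# Swapping the activation only requires passing a different function for a
y_tanh = layer(x, weights[0], a=np.tanh)
```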

So our multi-layer Neural Network (using Dense layers) looks like

<div>
    <center><strong>Layers</strong></center>
    <br>
    <div>
    <!-- edX: Original: <img src="images/NN_Layers.png"> replace by EdX created image -->
    <img src="images/W12_L1_NN_layers1920by1080.png">
    </div>
</div>

# Pattern matching

We again meet our old friend: the dot product.

We have argued that the dot product is nothing more than pattern matching
- $\W_{\llp,j}$ is a pattern
- That layer $\ll$ is trying to match against layer $(\ll-1)$ output $\y_{(\ll-1)}$

What then, is the role of the Sigmoid activation in the Dense layer ?

The Sigmoid
- converts the intensity of the match (the dot product)
- into the range $(0,1)$
- which we can interpret as a *probability* that
    - the input
    - matches the pattern
    
Near the two extremes $0$ and $1$ (which the Sigmoid approaches but never reaches), the Sigmoid output can be interpreted as a binary test

    Does the input match the pattern ?
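
A small numerical sketch of this interpretation, with a hypothetical pattern and three hypothetical inputs: one identical to the pattern, one orthogonal to it, and one opposite to it.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps a match intensity into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pattern (the weights of one unit), chosen only for illustration
pattern = np.array([1.0, -1.0, 1.0, -1.0])

inputs = {
    "same as pattern": pattern,
    "orthogonal":      np.array([1.0, 1.0, 1.0, 1.0]),
    "opposite":        -pattern,
}

probs = {}
for name, y_prev in inputs.items():
    score = np.dot(y_prev, pattern)    # intensity of the match
    probs[name] = sigmoid(score)       # squashed into (0, 1)
    print(f"{name:>16}: score = {score:+.1f}, sigmoid = {probs[name]:.3f}")
```

The matching input scores $+4$ and lands near $1$ (about $0.98$), the orthogonal input scores $0$ and lands at exactly $0.5$, and the opposite input scores $-4$ and lands near $0$ (about $0.02$).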

In [4]:
print("Done")

Done
