<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Introduction to Neural Networks - Additional Reference

_Authors: Justin Pounders (SV), David Yerrington (SF)_

---

In this lesson we will get an overview of basic feed-forward neural networks. The emphasis will be on terminology and the fundamental building blocks of these powerful networks. Later today you will learn how to use Keras, a powerful and (relatively) user-friendly library for building your own networks.

### New Concepts

#### Neurons

A neural network (at its core) is built up of different neurons that are linked together. Each takes in either the original input features or some transformed version of them and puts out a value (or set of values). One neuron looks something akin to this:

![](./images/perceptron.jpg)

Each neuron is going to be the combination of the following:

- A **bias** term (akin to a constant or $B_0$ term in a linear regression)
- The input terms the neuron receives, each multiplied by a **weight**

If our model has one neuron, this looks suspiciously similar to a linear regression:

1. take each term
2. multiply it by a weight
3. sum those new values together 
4. add an additional bias term

As we train our neural network, that output should get closer and closer to the true target value for that specific set of inputs ($x_1...x_n$). As we'll see, the way we train the network and the way we transform our outputs (plus the number of neurons) distinguishes neural networks from linear regression quite strongly.
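
The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name `neuron_output` and the input, weight, and bias values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias term (no activation function yet)."""
    if len(inputs) != len(weights):
        raise ValueError("need exactly one weight per input")
    # Steps 1 and 2: take each input term and multiply it by its weight
    weighted = [w * x for w, x in zip(weights, inputs)]
    # Steps 3 and 4: sum the weighted values, then add the bias
    return sum(weighted) + bias


# Hypothetical values, for illustration only
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 2.0]
b = 0.5

output = neuron_output(x, w, b)
print(output)  # 0.5 - 2.0 + 6.0 + 0.5 = 5.0
```

Training a network amounts to adjusting `w` and `b` so that this output moves toward the target value.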

### Hidden Layers

What makes neural networks tick is the idea of hidden layers. Hidden does not mean anything particularly devious here, just that it is not the input or the output layer.

A model can have:
- any number of hidden layers
- any number of neurons in each hidden layer

Every neuron in a given layer receives the same inputs. However, each neuron transforms those inputs in a different way, because each neuron has its own weights and bias, which are adjusted during training.

![](./images/neuralnet.png)

For the network above, we have two hidden layers and one output layer.

- Hidden Layer 1
    - 4 Neurons
    - Each Neuron has 6 weights and 1 bias term
    - Inputs: the original data
    - Outputs: one number each
- Hidden Layer 2
    - 3 Neurons 
    - Each Neuron has 4 weights and 1 bias term
    - Inputs: the four outputs of Hidden Layer 1 (one from each Neuron)
    - Outputs: one number each
- Output Layer
    - 1 Neuron
    - The one Neuron has 3 weights and 1 bias term
    - Inputs: the three outputs of Hidden Layer 2 (one from each Neuron)
    - Outputs: the final prediction
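
The weight and bias counts listed above follow directly from the layer sizes: each neuron needs one weight per input plus one bias. The sketch below tallies them, assuming the network has 6 input features (implied by the 6 weights per neuron in Hidden Layer 1). The helper name `count_parameters` is hypothetical.

```python
def count_parameters(layer_sizes):
    """Return a (n_weights, n_biases) pair for each layer after the input layer.

    layer_sizes lists the number of units per layer, starting with the inputs.
    """
    counts = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # Each of the n_out neurons has n_in weights and 1 bias
        counts.append((n_in * n_out, n_out))
    return counts


# 6 inputs -> 4 neurons -> 3 neurons -> 1 output neuron
layer_sizes = [6, 4, 3, 1]
counts = count_parameters(layer_sizes)

for name, (n_weights, n_biases) in zip(
    ["Hidden Layer 1", "Hidden Layer 2", "Output Layer"], counts
):
    print(f"{name}: {n_weights} weights, {n_biases} biases")

total = sum(n_weights + n_biases for n_weights, n_biases in counts)
print("Total trainable parameters:", total)  # 28 + 15 + 4 = 47
```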

### Activation

So how exactly does information/data propagate through the network?

We can write the equation for the activation of the $j$th neuron in the $i$th layer as:

### $$ a_j^i = \sigma \left( \sum_k w_{jk}^i a_k^{i-1} + b_j^i \right)$$

There is a decent amount going on in this equation. We will examine the pieces.

- $a_j^i$ represents the activation of the $j$th neuron in the $i$th layer. Note that the superscript corresponds to the layer number and the subscript corresponds to the neuron number within the layer.

- $a_k^{i-1}$ is the activation of the $k$th neuron in the $(i-1)$th layer.

- $\sigma$ represents an "activation function". More on this later, but it is a function that can transform the activation of neurons. The simplest activation function is the linear activation, $f(x) = x$.

- $w_{jk}^i$ represents the weight applied to the activation of the $k$th neuron in the $(i-1)$th layer as it feeds into the $j$th neuron in the $i$th layer. So, $j$ indexes the destination neuron in layer $i$, and $k$ indexes the departure neuron in the previous layer.

- $b_j^i$ is the "bias" of the $j$th neuron in the $i$th layer. The bias adds a constant to the value of the activation.

The gist of the equation is that each neuron's activation is the weighted sum of the activations of the neurons that feed into it, plus a "bias" value, all fed through a final activation function.
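
To make the indices concrete, here is a minimal sketch of the formula for a single neuron $j$. The names `neuron_activation` and `linear`, and all numeric values, are hypothetical. The default $\sigma$ is the linear activation mentioned above, and `math.tanh` from the standard library is passed in as a second example.

```python
import math


def linear(z):
    """The linear activation: f(x) = x."""
    return z


def neuron_activation(weights_j, prev_activations, bias_j, sigma=linear):
    """Compute a_j^i = sigma(sum_k w_jk^i * a_k^(i-1) + b_j^i) for one neuron j."""
    # Sum over k: one weight per neuron in the previous layer
    z = sum(w_jk * a_k for w_jk, a_k in zip(weights_j, prev_activations))
    z += bias_j
    return sigma(z)


# Hypothetical previous-layer activations, weights, and bias
prev = [1.0, 2.0]   # a_k^(i-1) for k = 1, 2
w_j = [0.5, 0.25]   # w_jk^i for k = 1, 2
b_j = 1.0           # b_j^i

a_linear = neuron_activation(w_j, prev, b_j)                  # 0.5 + 0.5 + 1.0 = 2.0
a_tanh = neuron_activation(w_j, prev, b_j, sigma=math.tanh)   # tanh(2.0)
print(a_linear, a_tanh)
```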

The formula becomes cleaner in matrix notation. Here is the vectorized version of the formula above:

### $$ a^i = \sigma(W^i a^{i-1} + b^i) $$

Now there is a weight matrix $W^i$ and a bias vector $b^i$ for each layer $i$, and $\sigma$ is applied element-wise. Row $j$ of $W^i$ holds the weights $w_{jk}^i$ that the $j$th neuron of the current layer applies to the activations of the previous layer.

![](./images/activation.png)
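
The vectorized formula translates almost directly into NumPy. The sketch below pushes one input vector through a network with the 6-4-3-1 layout described earlier. It is an illustration under stated assumptions: the weights are random rather than trained, `np.tanh` is an arbitrary choice of activation applied at every layer (including the output), and the function name `forward` is hypothetical.

```python
import numpy as np


def forward(x, weights, biases, sigma=np.tanh):
    """Apply a^i = sigma(W^i a^(i-1) + b^i) layer by layer, starting from a^0 = x."""
    a = x
    for W, b in zip(weights, biases):
        a = sigma(W @ a + b)
    return a


rng = np.random.default_rng(0)
layer_sizes = [6, 4, 3, 1]  # inputs, hidden layer 1, hidden layer 2, output

# W^i has one row per neuron in layer i and one column per neuron in layer i-1
weights = [
    rng.normal(size=(n_out, n_in))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
biases = [rng.normal(size=n_out) for n_out in layer_sizes[1:]]

x = rng.normal(size=6)  # one made-up observation with 6 features
prediction = forward(x, weights, biases)

print([W.shape for W in weights])  # [(4, 6), (3, 4), (1, 3)]
print(prediction.shape)            # (1,)
```

Storing each $W^i$ with shape (neurons in layer $i$, neurons in layer $i-1$) is what lets a single matrix product replace the sum over $k$ for every neuron at once.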

Neurons also have an activation function that transforms the output in a certain way. Some common examples are:

- [ReLU](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)): Also known as a Rectified Linear Unit, this turns the output to 0 if the output would be less than 0 (i.e., take the output and feed it through $f(y) = \max(0, y)$). This means that the neuron is activated when its output is positive and not activated otherwise. This has the intuitive effect of turning a neuron "on" in certain cases and "off" in other cases.
- [Softmax](https://en.wikipedia.org/wiki/Softmax_function): Used frequently at the output layer, this "squishes" a set of inputs into values between 0 and 1 that sum to 1, which is great for creating something akin to a probability of falling into each class.
- [Sigmoid or Logistic](https://en.wikipedia.org/wiki/Logistic_function): Much like how logistic regression passes the output of a linear model through a logistic (sigmoid) function to get a value between 0 and 1, we can use the same function as an activation to squash a neuron's output to a scale between 0 and 1.
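
Unlike ReLU and sigmoid, softmax acts on a whole vector of values at once rather than on a single number. A minimal NumPy sketch is shown below. The scores are made up, and subtracting the maximum before exponentiating is a common numerical-stability choice assumed here; it does not change the result.

```python
import numpy as np


def softmax(z):
    """Map a vector of scores to positive values that sum to 1."""
    z = np.asarray(z, dtype=float)
    # Shifting by the max leaves the result unchanged but avoids overflow in exp
    shifted = z - np.max(z)
    exps = np.exp(shifted)
    return exps / exps.sum()


# Hypothetical output-layer scores for three classes
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)

print(probs)        # the largest score gets the largest share
print(probs.sum())  # sums to 1 (up to floating-point rounding)
```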

There's a wealth of information on different types of activation functions within [this article](https://en.wikipedia.org/wiki/Activation_function) -- different activation functions, hidden layers, and neurons per layer can change how effective your neural network will be!


### Check for Understanding 2 (5 Minutes plus Over Break)

Independently or with a partner (your choice) -- pick two of the following activation functions:

- Binary Step
    - If $x \le 0$: $f(x) = 0$; else $f(x) = 1$ 
- ReLU
    - If $x \le 0$: $f(x) = 0$; else $f(x) = x$
- Logistic / Sigmoid
    - $f(x) = \frac{1}{1 + e^{-x}}$
- TanH
    - $f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1$
- Softsign
    - $f(x)={\frac {x}{1+|x|}}$
    
Write a function in Python for each. Your functions should take in one value ($x$) and output the transformed version of $x$. 

> Note: $e^x$ can be evaluated using `np.exp(x)`.

In [None]:
# Feel free to use this function to plot a range of x values
import matplotlib.pyplot as plt

def activation_plotter(activation_function):
    # Evaluate the activation function at each integer from -10 to 10
    x = list(range(-10, 11))
    y = [activation_function(val) for val in x]
    plt.plot(x, y)
    plt.ylabel('Activated value for X')
    plt.xlabel('X')

In [None]:
# A: code activation functions here