*References*
[1]: https://machinelearningmastery.com/gradient-descent-for-machine-learning/ "Machine Learning Mastery: Gradient Descent for Neural Networks" 
[2]: https://en.wikipedia.org/wiki/Graph_theory

# Neural Network Basics

## Gradient Descent
- A good starting point for learning about neural networks is [gradient descent][1]
- Gradient descent is an iterative algorithm for finding the parameter values at which a function or system reaches a minimum *(or, by flipping the sign, a maximum)*
- Think of a bowl as the surface that describes a function's output: gradient descent looks for the lowest point by checking the slope at the current position and choosing the direction of descent that most reduces the output
- Repeating these steps eventually reaches a low point where no direction leads further down
- However, gradient descent doesn't guarantee the **global minimum**, only the minimum that lies along the path of descent
- If the bowl has two dips and the one closer to the starting point is shallower than the other, gradient descent settles into that first dip, never detects the lower point, and gets stuck in the **local minimum**
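
The two-dip scenario can be sketched in a few lines of Python. The function `f`, the starting points, the step size, and the number of steps are all illustrative assumptions chosen so that the curve has one shallow dip and one deeper dip; they are not taken from the referenced article.

```python
def f(x):
    # Hypothetical "bowl with two dips": a shallow dip near x = 1.13
    # and a deeper dip near x = -1.30
    return x**4 - 3 * x**2 + x

def f_prime(x):
    # Derivative (slope) of f
    return 4 * x**3 - 6 * x + 1

def gradient_descent(start, step_size=0.01, n_steps=2000):
    x = start
    for _ in range(n_steps):
        # Step against the slope, i.e. downhill
        x -= step_size * f_prime(x)
    return x

x_right = gradient_descent(start=2.0)   # settles in the shallow dip
x_left = gradient_descent(start=-2.0)   # settles in the deeper dip

print(x_right, f(x_right))
print(x_left, f(x_left))
```

Starting on the right, the descent stops in the shallow dip even though a lower point exists on the left, which is the local-minimum problem described above.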

# Perceptrons
- Neural Networks follow the basics of a gradient descent
    1. Take a random point
    2. Find the best path to the next best point
    3. Apply the gradient to the function to get the next point
    4. Repeat till no better path can be found
- Neural networks use networked math and logic functions to achieve this, the basic building blocks of which are known as **perceptrons**.
- **Perceptrons** look at the incoming data and decide, through some internal logic or function, whether or not it fits a category defined by that function (the decision doesn't need to be perfectly accurate)
- They are essentially binary classifiers: whatever comes in either is or isn't something, with no middle ground.
- The more perceptrons, the more complex and nuanced the system becomes in classifying data
- Perceptrons by themselves can't learn, however
- This is where **Weights** become important

## Weights
- **Weights** are applied to perceptrons in order to alter how important each perceptron's decision is to the outcome of the neural network
- A weight with a high absolute value means that perceptron is important and has a bigger effect on the outcome; a weight near zero means it is barely considered, and a weight of exactly zero means it is not considered at all
- During the learning process these weights are modified to refine the neural network
- In terms of [Graph Theory][2], the perceptrons are the nodes of a directed graph and the weights are the values on the edges between them

## Summing the Input
- Each perceptron receives input either from the perceptrons in the previous layer or directly from the input data, multiplies each input by its associated weight, and sums the results into a linear combination
- This produces a single value, which the perceptron then tests to see whether it meets that perceptron's internal requirements for an affirmative output
- When writing equations for neural networks, the weights are always represented by some form of the letter **w**
    - Usually a capital italicized *W* when representing a matrix of weights
    - Usually a subscript is used to identify which weight is used
    - Example: 
    $$ w_{1} \cdot x_{1} + w_{2} \cdot x_{2} $$
- The summation for a single perceptron with bias $b$ is: $$h = \sum_{i} w_{i}x_{i} + b$$
- The result of this summation, $h$, is then passed to the perceptron's **activation function**
- One of the most basic activation functions is the [Heaviside step function](https://en.wikipedia.org/wiki/Heaviside_step_function)
    - Basically it's a math function that's defined as *1* if the input is greater than or equal to *0*, and *0* otherwise
    - This function can be modified mathematically to shift the activation input and the magnitude of the output
    - ![heaviside-eq](http://bit.ly/2wlShop)
$$f(h) = \begin{cases}
            0 & \text{if } h < 0\\
            1 & \text{if } h \geq 0
         \end{cases}
$$
 
- Applying this to the summation of inputs the activation function for a heaviside perceptron becomes: 
    - ![heaviside-perceptron](http://bit.ly/2xaVVyU)
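
As a rough sketch, a Heaviside perceptron can be written as a weighted sum followed by a threshold at zero. The function names `heaviside` and `perceptron`, and the example weights and bias, are hypothetical and only serve to illustrate the idea.

```python
def heaviside(h):
    # 1 when h is greater than or equal to 0, otherwise 0
    return int(h >= 0)

def perceptron(inputs, weights, bias):
    # Linear combination of the inputs and weights, plus the bias
    h = sum(w * x for w, x in zip(weights, inputs)) + bias
    # The activation function turns the sum into a binary decision
    return heaviside(h)

# Illustrative values: both inputs must be on to reach the threshold
out_on = perceptron([1, 1], [0.5, 0.5], -1.0)   # h = 0.0
out_off = perceptron([1, 0], [0.5, 0.5], -1.0)  # h = -0.5
print(out_on, out_off)
```

Changing the bias shifts the point at which the perceptron switches on, which is the "shift the activation input" modification mentioned above.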

## Creating an AND Perceptron
- An AND perceptron is a classic and useful kind of perceptron whose output is true only when all of its inputs are true
- Below is an example graph and logic table for an AND: ![and-perceptron-logic](http://www.byclb.com/TR/Tutorials/neural_networks/ch8_1_dosyalar/image042.jpg)
- Its activation function, using the Heaviside formula, would be: $$f(x_1, x_2) = \begin{cases} 0 & \text{if } w_1 x_1 + w_2 x_2 + b < 0\\ 1 & \text{if } w_1 x_1 + w_2 x_2 + b \geq 0 \end{cases}$$
- Below is an example of how this would be done with Python and Pandas

In [2]:
import pandas as pd

# Function to evaluate a perceptron.
# Relies on the globals weight1, weight2, bias, and outputs being set before it is called.
def validate_perceptron(test_inputs, correct_outputs):
    # Generate and check output
    for test_input, correct_output in zip(test_inputs, correct_outputs):
        linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
        output = int(linear_combination >= 0)
        is_correct_string = 'Yes' if output == correct_output else 'No'
        outputs.append([test_input[0], test_input[1], linear_combination, output, is_correct_string])
    
    # Print output
    num_wrong = len([output[4] for output in outputs if output[4] == 'No'])
    output_frame = pd.DataFrame(outputs, columns=['Input 1', '  Input 2', '  Linear Combination', 
                                                  '  Activation Output', '  Is Correct'])
    if not num_wrong:
        print('Nice!  You got it all correct.\n')
    else:
        print('You got {} wrong.  Keep trying!\n'.format(num_wrong))
    print(output_frame.to_string(index=False))
# TODO: Set weight1, weight2, and bias
weight1 = 1
weight2 = 1
bias = -2


# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [False, False, False, True]
outputs = []

validate_perceptron(test_inputs, correct_outputs)



Nice!  You got it all correct.

Input 1    Input 2    Linear Combination    Activation Output   Is Correct
      0          0                    -2                    0          Yes
      0          1                    -1                    0          Yes
      1          0                    -1                    0          Yes
      1          1                     0                    1          Yes


## OR Perceptron
- **OR** is a logic function where something is true **if at least 1 input is true**
- You can change an **AND** perceptron to an OR by:
    - Increasing the bias
    - Increasing the weights

In [3]:
# Column order for the truth table
labels = ["In_0", "In_1", "Out"]
d = { 
    "In_0": [0, 0, 1, 1],
    "In_1": [0, 1, 0, 1],
    "Out":  [0, 1, 1, 1]}
pd.DataFrame(d, columns=labels)

   In_0  In_1  Out
0     0     0    0
1     0     1    1
2     1     0    1
3     1     1    1


In [4]:
# OR - Perceptron Exercise
# TODO: Set weight1, weight2, and bias
weight1 = 1
weight2 = 1
bias = -1


# DON'T CHANGE ANYTHING BELOW
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [False, True, True, True]
outputs = []

validate_perceptron(test_inputs, correct_outputs)

Nice!  You got it all correct.

Input 1    Input 2    Linear Combination    Activation Output   Is Correct
      0          0                    -1                    0          Yes
      0          1                     0                    1          Yes
      1          0                     0                    1          Yes
      1          1                     1                    1          Yes


## NOT
- Below is a good example of how a weight can be used to completely ignore an input
- In this example the output is simply the inverse *(NOT)* of input 2
- Input 1 has no effect on the output, so its weight is *0*

In [5]:
# TODO: Set weight1, weight2, and bias
weight1 = 0
weight2 = -1
bias = 0


# DON'T CHANGE ANYTHING BELOW
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [True, False, True, False]
outputs = []

validate_perceptron(test_inputs, correct_outputs)

Nice!  You got it all correct.

Input 1    Input 2    Linear Combination    Activation Output   Is Correct
      0          0                     0                    1          Yes
      0          1                    -1                    0          Yes
      1          0                     0                    1          Yes
      1          1                    -1                    0          Yes


## Combining Perceptrons to Form a XOR Function
- Perceptrons can be combined to form more complex behavior, in this case the logic function **XOR**
![xor-perceptron](https://adriantorrie.github.io/images/xor-perceptron.png)
- Exclusive OR (**XOR**) is another useful logic function that is true only if its inputs are not the same, meaning exactly one input is true
- Useful as a differentiator because it checks if the inputs are different
- The table below shows the logic
- Feeding the output of one perceptron into another is what makes more complex behavior such as **XOR** possible

In [6]:
pd.DataFrame({
    "in_0": [0, 1, 0, 1],
    "in_1": [0, 0, 1, 1],
    "out":  [0, 1, 1, 0]
})

   in_0  in_1  out
0     0     0    0
1     1     0    1
2     0     1    1
3     1     1    0
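
One possible way to chain perceptrons into an XOR is sketched below. The wiring (an OR and a NAND perceptron feeding an AND perceptron) and the NAND weights are assumptions made for illustration and may not match the arrangement in the figure; the AND and OR weights are the ones used in the exercises above.

```python
def perceptron(x1, x2, weight1, weight2, bias):
    # Heaviside perceptron: weighted sum plus bias, thresholded at 0
    return int(weight1 * x1 + weight2 * x2 + bias >= 0)

def xor(x1, x2):
    # First layer: OR and NAND both look at the raw inputs
    or_out = perceptron(x1, x2, 1, 1, -1)
    nand_out = perceptron(x1, x2, -1, -1, 1)
    # Second layer: AND combines the two first-layer outputs
    return perceptron(or_out, nand_out, 1, 1, -2)

results = [(a, b, xor(a, b)) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
for a, b, out in results:
    print(a, b, out)
```

No single perceptron of this kind can produce the XOR column on its own, which is why a second layer is needed.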


## A Simple yet Complete Neural Network
![simple-neural-network](https://adriantorrie.github.io/images/xor-perceptron.png)
*A simple neural network that sums two weighted inputs and a bias, and applies the Heaviside function*
$$y = f(w_{1}x_{1} + w_{2}x_{2} + b), \quad f(h) = \begin{cases} 0 & \text{if } h < 0\\ 1 & \text{if } h \geq 0 \end{cases}$$
- The **sigmoid** is another **activation function**; it can be used as a smoother, more analog version of the Heaviside function
![sigmoid](https://d17h27t6h515a5.cloudfront.net/topher/2017/January/58800a83_sigmoid/sigmoid.png)
*The sigmoid function*
$$sigmoid(x) =\frac{1}{1 + e^{-x}}$$
![simple-net](https://d17h27t6h515a5.cloudfront.net/topher/2017/February/589366f0_simple-neuron/simple-neuron.png)
*The simple neural network being implemented*
$$y = f(h) = sigmoid(\sum_{i} w_{i}x_{i} + b)$$


Below is an implementation of this function

In [7]:
import numpy as np

def sigmoid(x):
    # TODO: Implement sigmoid function
    return 1.0 / (1 + np.exp(-1 * x))

inputs = np.array([0.7, -0.3])
weights = np.array([0.1, 0.8])
bias = -0.1

# Calculate the output: linear combination of inputs and weights, plus the bias
h = np.dot(weights, inputs) + bias
output = sigmoid(h)

# validate answer (compare floats with a tolerance rather than ==)
if np.isclose(output, 0.43290709503454572):
    print("That's correct")
else:
    print("That's wrong")

That's correct


# Learning the Right Weights
- Defining a neural network's weights explicitly is not a very effective way to use neural networks
- In fact, defining neural networks explicitly is less efficient *(usually)* than just coding normal equations and algorithms
- Neural networks *(and other learning algorithms)* are useful when an optimal system is arrived at automatically through **training**
- To do this the error of the network needs to be evaluated
- This is usually done using the **Sum of Squared Errors**, or **SSE** for short
$$ E = \frac{1}{2}\sum_{\mu}\sum_{j}[y^{\mu}_{j} - \hat{y}^{\mu}_{j}]^2 $$
*Sum of Squared Errors (**SSE**)*
    - $\hat{y}$ : prediction
    - $y$ : actual value
    - $\mu$ : data points
    - $j$ : output values
    - essentially summing the squared difference of all data points $\mu$
    - The difference is squared to take care of sign issues and to punish outliers more heavily
- For each actual number and predicted number, take the difference, square it, then accumulate it over every combination of the indices $\mu$ and $j$
- Remember, the output depends on the activation function:
$$\hat{y}^{\mu}_{j} = f(\sum_{i}w_{ij}x^{\mu}_{i})$$
- And therefore the error becomes:
$$ E = \frac{1}{2}\sum_{\mu}\sum_{j}[y^{\mu}_{j} - f(\sum_{i}w_{ij}x^{\mu}_{i})]^2 $$
- So the error ultimately ends up depending on the input data and the network's weights
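
A minimal sketch of the SSE calculation with NumPy is shown below. The function name `sse` and the sample values are made up for illustration; rows stand for data points $\mu$ and columns for output values $j$.

```python
import numpy as np

def sse(targets, predictions):
    # targets and predictions have shape (n_data_points, n_outputs)
    # Square every difference, sum over all entries, then halve
    return 0.5 * np.sum((targets - predictions) ** 2)

# Three data points with a single output value each (illustrative numbers)
y = np.array([[1.0], [0.0], [1.0]])
y_hat = np.array([[0.8], [0.1], [0.4]])

error = sse(y, y_hat)
print(error)
```

The last data point is off by 0.6 and contributes most of the total, which shows how squaring punishes larger errors more heavily.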

## Gradient Descent in Neural Networks
- The method to improve the neural network is by gradient descent
- Since real numbers *(and sometimes [complex numbers](https://en.wikipedia.org/wiki/Complex_number))* are being used, calculus can be applied to perform [gradient](https://en.wikipedia.org/wiki/Gradient) descent and minimize the error
- The derivative $f'(x)$ gives the rate of change, or slope, of a function; the gradient generalizes this to functions of several variables
- For example, if $f(x) = x^2$, its derivative is $f'(x) = 2x$
- If the current data point is $x=2$, then $f'(2) = 4$, so the gradient at that point is 4
- Plotted out it looks like:
![gradient-example](https://d17h27t6h515a5.cloudfront.net/topher/2017/January/587bfcfd_derivative-example/derivative-example.png)
*Plotted example of a gradient*
- Remember, just because the gradient has reached zero doesn't mean the error is at its global minimum; it may only be a local minimum
- To help avoid getting stuck in a local minimum, **momentum** can be used
- Weights are updated by computing a step for each weight from the gradient and applying it like this: $w_i = w_i + \Delta w_i$
    - The Greek letter $\Delta$, or *delta*, denotes the change applied to the weight $w_i$
    $$\Delta w_i \propto -\frac{\partial E}{\partial w_i}$$
        - this basically just says that the weight step is proportional to the negative of the partial derivative of the error with respect to that weight, so the weight moves in the direction that reduces the error
        - learning some multivariate calculus with focus on [partial derivatives](http://bit.ly/2xaVynP) could be very helpful in understanding these things
- The **learning rate** $\eta$ scales the gradient to control how much a weight can change at each step
$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$$
- Solving for the gradient of the network function's error with respect to the given weight involves applying the [chain rule](https://en.wikipedia.org/wiki/Chain_rule) repeatedly, but the result is relatively simple: 
$$\frac{\partial E}{\partial w_i} = -(y - \hat{y}) f'(h) x_i$$
- The gradient of the squared error with respect to the weight is then just the negative difference between the actual and predicted values of the network, times the derivative of the activation function $f'(h)$ evaluated at the intermediate value $h$, times the input $x_i$
- Applying the learning rate you get:
$$\Delta w_i = \eta (y - \hat{y})f'(h)x_i$$
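
For reference, the chain-rule steps behind this result can be written out as follows. This is a sketch for the simplified case of a single data point and a single output, so that $E = \frac{1}{2}(y - \hat{y})^2$ with $\hat{y} = f(h)$ and $h = \sum_{i} w_i x_i$; the bias is left out, as in the equations above.

$$\begin{aligned}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2}(y - \hat{y})^2 \\
&= -(y - \hat{y}) \frac{\partial \hat{y}}{\partial w_i} \\
&= -(y - \hat{y}) f'(h) \frac{\partial h}{\partial w_i} \\
&= -(y - \hat{y}) f'(h) x_i
\end{aligned}$$

The last step uses $\frac{\partial h}{\partial w_i} = x_i$, since $w_i$ appears in only one term of the sum.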


## Gradient Descent in Code
- Gradient descent weight updates are determined as 
$$\Delta w_i = \eta \delta x_i$$
- Error term $\delta$ is defined by:
$$\delta = (y - \hat{y})f'(h) = (y - \hat{y})f'(\sum_{i} w_i x_i)$$
    - $(y - \hat{y})$ is the output error
    - $f'(h)$ refers to the derivative of the activation function
In code, *finally*, it would look a little something like this:



In [8]:
# define sigmoid for activation function f(h)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# derivative of sigmoid, f'(h)
def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

# input data
x = np.array([0.1, 0.3])
# target y
y = 0.2
# input to output weights
weights = np.array([-0.8, 0.5])
# learning rate, eta
learn_rate = 0.5

# linearly combine the inputs to get h
h = np.dot(x, weights) # remember, that dot products are linear combinations

# the neural network output y-hat
nn_out = sigmoid(h)

# output error (y - y-hat)
err = y - nn_out

# output gradient f'(h)
out_grad = sigmoid_prime(h)

# error term (lowercase delta)
error_term = err * out_grad

# gradient descent step DELTA of w_i
del_w = learn_rate * error_term * x

del_w

array([-0.0039638 , -0.01189141])

### Gradient exercise

In [10]:
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1/(1+np.exp(-x))

def sigmoid_prime(x):
    """
    Derivative of the sigmoid function
    """
    return sigmoid(x) * (1 - sigmoid(x))

learnrate = 0.5
x = np.array([1, 2, 3, 4])
y = np.array(0.5)

# Initial weights
w = np.array([0.5, -0.5, 0.3, 0.1])

### Calculate one gradient descent step for each weight
### Note: Some steps have been consolidated, so there are
###       fewer variable names than in the above sample code

# TODO: Calculate the node's linear combination of inputs and weights
h = np.dot(x, w)

# TODO: Calculate output of neural network
nn_output = sigmoid(h)

# TODO: Calculate error of neural network
error = y - nn_output

# TODO: Calculate the error term
#       Remember, this requires the output gradient, which we haven't
#       specifically added a variable for.
error_term = error * sigmoid_prime(h)

# TODO: Calculate change in weights
del_w = learnrate * error_term * x

print('Neural Network output:')
print(nn_output)
print('Amount of Error:')
print(error)
print('Change in Weights:')
print(del_w)

Neural Network output:
0.689974481128
Amount of Error:
-0.189974481128
Change in Weights:
[-0.02031869 -0.04063738 -0.06095608 -0.08127477]
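
The exercise computes a single step. As a rough sketch of what repeating that step looks like, the loop below applies the same update ten times, reusing the exercise's inputs, target, starting weights, and learning rate. The number of steps and the `errors` list are additions made for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

learnrate = 0.5
x = np.array([1, 2, 3, 4])
y = 0.5
w = np.array([0.5, -0.5, 0.3, 0.1])

errors = []
for _ in range(10):
    # Forward pass: linear combination, then activation
    output = sigmoid(np.dot(x, w))
    error = y - output
    errors.append(abs(error))
    # Error term: output error times the sigmoid derivative at h,
    # using sigmoid'(h) = sigmoid(h) * (1 - sigmoid(h))
    error_term = error * output * (1 - output)
    # Gradient descent step for every weight
    w = w + learnrate * error_term * x

print(errors[0], errors[-1])
```

With these particular values the absolute error shrinks on every pass; other learning rates or inputs may behave differently.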


### Data Cleanup
- Before completely implementing a gradient descent