## Side notes 
_(code snippets, summaries, resources, etc.)_

__Further reading:__
- [_Neural Networks_ PDF worksheet by Udacity](https://www.evernote.com/shard/s37/nl/1033921335/50316007-f4a1-430e-a914-db8458a7830d/) in Evernote
- [_Gradient Descent - Problem of Hiking Down a Mountain_ PDF worksheet by Udacity](https://www.evernote.com/shard/s37/nl/1033921335/f754539a-a88e-4ac1-85f3-dd5d705e4d37/) in Evernote
- The calculus behind the sigmoid function (see below) is explained at [Wolfram MathWorld](http://mathworld.wolfram.com/SigmoidFunction.html)

# Neural Networks

## Summary of topics covered
![summary of neural networks](neural_networks_images/summary_neural_networks.png)

## Perceptrons
- Type of _neural net unit_

![neural network in brain](neural_networks_images/neural_network_brain.png)

![artificial neural network](neural_networks_images/artificial_neural_network.png)

### Power of a perceptron unit
![perceptron power](neural_networks_images/perceptron_power.png)

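- A perceptron splits the input space into two _halfplanes_; in higher dimensions the dividing line generalizes to a hyperplane
- A perceptron always computes a linear function of its inputs, so its decision boundary is always a hyperplane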

### Boolean logic with perceptrons
- Perceptrons with certain combinations of weights and thresholds behave as a kind of "logic gate"
- These perceptrons can be combined to represent any boolean operator
- Particularly helpful for overcoming decision trees' difficulty with parity, i.e. the `XOR` operator (see below)

![perceptron boolean AND](neural_networks_images/perceptron_and.png)

![perceptron boolean OR](neural_networks_images/perceptron_or.png)

![perceptron boolean NOT](neural_networks_images/perceptron_not.png)

![perceptron boolean XOR](neural_networks_images/perceptron_xor.png)
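
The gates above can be sketched in a few lines of Python. This is a minimal sketch, not a transcription of the images: the function names and the specific weights and thresholds are assumptions (one valid choice among many), and the unit is assumed to fire when the weighted sum is greater than or equal to its threshold.

```python
def perceptron(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation >= threshold else 0


def and_gate(x1, x2):
    # Both inputs must be on to reach 0.75 (assumed weights/threshold).
    return perceptron([x1, x2], weights=[0.5, 0.5], threshold=0.75)


def or_gate(x1, x2):
    # A single active input is enough to reach 0.25.
    return perceptron([x1, x2], weights=[0.5, 0.5], threshold=0.25)


def not_gate(x):
    # A negative weight flips the input: 0 -> 1, 1 -> 0.
    return perceptron([x], weights=[-1.0], threshold=0.0)


def xor_gate(x1, x2):
    # Needs two layers: the AND unit's output is fed in as a third input
    # with a negative weight that cancels the "both on" case.
    return perceptron([x1, x2, and_gate(x1, x2)],
                      weights=[1.0, 1.0, -2.0], threshold=1.0)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_gate(a, b), or_gate(a, b), xor_gate(a, b))
```

Note that `xor_gate` cannot be written as a single `perceptron` call on `x1` and `x2` alone, which is why it is built from two units.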

## Training Neural Networks
- That is, _given examples_, find weights that map inputs to outputs
- Rules for training covered below are:
    1. Perceptron rule (with thresholds)
    2. Gradient descent or delta rule (unthresholded)

### Perceptron rule
- When output _is_ thresholded
- If the data is _linearly separable_, the algorithm below will find a separating hyperplane in a finite number of iterations.
- The algorithm can stop once a pass over the examples no longer changes any weight, i.e. `actual y == y-hat` for every example
- It can be hard to tell whether data is linearly separable, especially with lots of dimensions
- If the algorithm has not terminated after a while, the data may not be linearly separable, but since _finite_ could be any number, we cannot be certain of that.
    - "if we could solve the halting problem, we could solve this, but not necessarily so that problem could be solved another way..."



![perceptron rule calculation part 1](neural_networks_images/perceptron_rule_calc_1.png)
![perceptron rule algorithm part 2](neural_networks_images/perceptron_rule_calc_2.png)
![perceptron rule algorithm part 3](neural_networks_images/perceptron_rule_calc_3.png)
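
A minimal sketch of the perceptron rule in plain Python, assuming the usual form of the update, `w_i += learning_rate * (y - y_hat) * x_i`, with the threshold folded in as a bias weight on a constant input of 1. The names `predict`, `train_perceptron`, `learning_rate` and `max_epochs` are illustrative, not taken from the course material.

```python
def predict(weights, x):
    """Thresholded output: 1 if the weighted sum is non-negative, else 0."""
    activation = sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation >= 0 else 0


def train_perceptron(examples, learning_rate=1.0, max_epochs=1000):
    """Apply the perceptron rule until a full pass changes no weight.

    Each example is (inputs, target). A constant 1 is prepended to the
    inputs so the threshold is learned as an ordinary bias weight.
    Returns (weights, converged).
    """
    n_inputs = len(examples[0][0])
    weights = [0.0] * (n_inputs + 1)
    for _ in range(max_epochs):
        changed = False
        for inputs, target in examples:
            x = [1.0] + list(inputs)
            error = target - predict(weights, x)  # y - y_hat, one of -1, 0, 1
            if error != 0:
                weights = [w + learning_rate * error * xi
                           for w, xi in zip(weights, x)]
                changed = True
        if not changed:
            return weights, True
    return weights, False  # hit the epoch cap: inconclusive


# AND is linearly separable, so training stops on its own.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
and_weights, and_converged = train_perceptron(and_data)

# XOR is not linearly separable, so the loop only ends at the epoch cap.
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
_, xor_converged = train_perceptron(xor_data, max_epochs=50)
```

The `max_epochs` cap is a practical guard only: running into it does not prove the data is not linearly separable, which is exactly the uncertainty described above.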


----------------------------

### Gradient descent or delta rule
- When output _is not_ thresholded
- More robust to data sets that are not linearly separable
    - converges in the limit to a local optimum
- Relies on calculus to minimize the error, i.e. change the weights to push the error down
    - The factor of 1/2 in the error equation does not move the minimum, but it makes the partial derivative cleaner by cancelling the 2 that comes from the squared term.

![gradient descent calculation](neural_networks_images/gradient_descent_calc.png)
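
A minimal sketch of the delta rule in plain Python, assuming the error is `E(w) = 1/2 * sum((y - a)^2)` with unthresholded activation `a = sum(w_i * x_i)`, and assuming a batch update (all examples are summed before the weights move). The function name, learning rate and toy data are illustrative choices.

```python
def train_linear_unit(examples, learning_rate=0.05, epochs=2000):
    """Batch gradient descent on E(w) = 1/2 * sum((y - a)^2), a = w . x."""
    n_inputs = len(examples[0][0])
    weights = [0.0] * (n_inputs + 1)  # weights[0] is the bias weight
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for inputs, target in examples:
            x = [1.0] + list(inputs)
            activation = sum(w * xi for w, xi in zip(weights, x))
            # dE/dw_i = -(y - a) * x_i; the 1/2 cancels the 2 from the square.
            for i, xi in enumerate(x):
                gradient[i] -= (target - activation) * xi
        # Step against the gradient to push the error down.
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights


# Toy data sampled from y = 2x + 1.
line_data = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]
bias, slope = train_linear_unit(line_data)
print(bias, slope)  # approaches 1 and 2
```

The learning rate has to be small enough for the step to stay stable; for this toy data 0.05 is, but a larger value can make the weights diverge.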

#### Perceptron rule vs. gradient descent

![perceptron rule vs gradient descent](neural_networks_images/perceptron_vs_gradient_d.png)

### Sigmoid unit
- Hack on the gradient descent equation that lets the output `y-hat` be substituted and differentiated in place of the activation `a`.
- Uses a _sigmoid function_ to smooth the threshold's jump into a differentiable curve
- The calculus used to reach the final equation below can be explored at [Wolfram MathWorld - Sigmoid Function](http://mathworld.wolfram.com/SigmoidFunction.html)

![sigmoid for differentiable threshold](neural_networks_images/sigmoid.png)
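
A minimal sketch of the sigmoid and its derivative, assuming the standard logistic form `sigmoid(a) = 1 / (1 + e^(-a))`, whose derivative is `sigmoid(a) * (1 - sigmoid(a))`. The finite-difference comparison at the end is only a sanity check, and the function names are illustrative.

```python
import math


def sigmoid(a):
    """Smooth, differentiable stand-in for a hard threshold at 0."""
    return 1.0 / (1.0 + math.exp(-a))


def sigmoid_derivative(a):
    """Closed form of d(sigmoid)/da = sigmoid(a) * (1 - sigmoid(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)


# Compare the closed form with a central finite difference at one point.
a, h = 0.7, 1e-5
numeric_slope = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)
print(sigmoid_derivative(a), numeric_slope)
```

As `a` goes to negative infinity the output approaches 0, and as `a` goes to positive infinity it approaches 1, so the unit still behaves like a threshold far from 0.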

### Backpropagation in Neural Networks
- "A computationally beneficial organization of the chain rule."
- Convenient method to compute derivatives with respect to all the different weights in the network
- Network learns through:
    - Information flows from inputs to outputs
    - Then, error information flows back from the outputs to the inputs
- Could also be called _error backpropagation_
- Can be applied to units built from any other differentiable function
- The error function, in this case the sum of squared errors, can have multiple "local" optima / minima
    - a single unit's error function has one optimum, the bottom of one parabola, but the network's overall error surface combines the parabolas from all units and can have many.

![backpropagation](neural_networks_images/back_propogation.png)
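
A minimal sketch of the forward and backward flow for a tiny network (two inputs, two hidden sigmoid units, one sigmoid output), assuming the error for a single example is `E = 1/2 * (y - y_hat)^2`. The network shape, weight values and function names are all illustrative assumptions. The last lines compare one backpropagated derivative against a finite-difference estimate as a sanity check.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, hidden_w, output_w):
    """Information flows from inputs to the output.

    hidden_w[j] holds [bias, w_1, w_2, ...] for hidden unit j;
    output_w holds [bias, v_1, v_2, ...] over the hidden outputs.
    """
    hidden = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
              for w in hidden_w]
    output = sigmoid(output_w[0]
                     + sum(v * h for v, h in zip(output_w[1:], hidden)))
    return hidden, output


def error(x, target, hidden_w, output_w):
    _, output = forward(x, hidden_w, output_w)
    return 0.5 * (target - output) ** 2


def backprop(x, target, hidden_w, output_w):
    """Return (dE/d hidden_w, dE/d output_w) using the chain rule."""
    hidden, output = forward(x, hidden_w, output_w)
    # Error signal at the output unit: dE/d(net input of the output).
    delta_out = -(target - output) * output * (1.0 - output)
    grad_output = [delta_out] + [delta_out * h for h in hidden]
    grad_hidden = []
    for j, h in enumerate(hidden):
        # Error flows back through the weight linking hidden unit j to the output.
        delta_h = delta_out * output_w[j + 1] * h * (1.0 - h)
        grad_hidden.append([delta_h] + [delta_h * xi for xi in x])
    return grad_hidden, grad_output


hidden_w = [[0.1, 0.4, -0.3], [-0.2, 0.25, 0.6]]
output_w = [0.05, -0.7, 0.5]
x, target = [1.0, 0.0], 1.0
grad_hidden, grad_output = backprop(x, target, hidden_w, output_w)

# Finite-difference estimate of dE/d hidden_w[0][1] for comparison.
eps = 1e-6
raised = [row[:] for row in hidden_w]
raised[0][1] += eps
lowered = [row[:] for row in hidden_w]
lowered[0][1] -= eps
numeric = (error(x, target, raised, output_w)
           - error(x, target, lowered, output_w)) / (2 * eps)
print(grad_hidden[0][1], numeric)
```

The backward pass reuses `delta_out` for every hidden unit instead of re-deriving it per weight, which is the "computationally beneficial organization" the quote refers to.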

### Optimizing Weights, brief intro
- Techniques to address the problem of multiple local optima, which can cause the algorithm to get stuck in one minimum even if it is not the global optimum.
- For the image below: the red bullet points are aspects that add to a model's complexity

![optimizing weights](neural_networks_images/optimizing_weights.png)

### Restriction Bias
__definition__ of restriction bias:
- Describes the _representational power_ of a particular data structure, e.g. of a network of neurons
- Restricts the hypotheses that will be considered

#### Evaluating restriction bias
- _Perceptron unit:_ linear, only considers planes
    - then _networks of perceptrons:_ allow boolean functions like `XOR`
    - then _networks of units with sigmoids & other arbitrary functions:_ allow lots of layers and nodes, so they can become much more complex, with not many restrictions at all
- Neural networks can represent _any_ mapping of inputs to outputs, like:
    - _boolean:_ with network of threshold-like units
    - _continuous:_ as long as smooth curves, connected / no jumps
        - using single hidden layer of nodes
        - each node covers some portion of function
        - nodes are then "stitched together" to give output
    - _arbitrary:_ functions that aren't continuous
        - requires two hidden layers
        - with additional hidden layer, output can be stitched together even with gaps in the function.
        
#### Overfitting
- Danger of overfitting: a neural network can even represent the noise in our training set
- To counter this, restrict the number of hidden nodes and layers in the network
    - A neural network can only capture as much of a function as its bounds allow
    - i.e. a particular network architecture does have restrictions, even though an unbounded neural network does not.
- Other solutions are ones that also apply to other learners, like:
    - Cross validation to decide how many nodes per layer and how large the weights can get before training stops.
- Complexity of a neural network is not only in the nodes and layers, but also in its weights, i.e. how _much_ it is trained

![restriction bias](neural_networks_images/restriction_bias.png)

### Preference Bias
__definition__ of preference bias:
- Characteristics that determine whether one subclass of algorithm would be selected over another.
    - e.g. preferred decision trees are correct ones, one with top nodes having the most information gain, ones that aren't longer than necessary, etc.

#### Evaluating preference bias
- For _neural networks with gradient descent:_
    - prefers models with lower complexity (Occam's razor)
        
![preference bias](neural_networks_images/preference_bias.png)
