# Artificial Neural Networks


## From Biological to Artificial Neurons
- ANNs were first introduced in 1943 by neurophysiologist Warren McCulloch and mathematician Walter Pitts
    - presented a simplified computational model of how biological neurons might work together to perform complex computations using propositional logic
- ### Biological Neurons
    - An unusual-looking cell found in animal brains
    - composed of 
        - cell body - contains the nucleus and most of the cell's complex components
        - Dendrites - branching extensions
        - axon - very long extension
            - axon splits up into many branches called telodendria
                - the tips of these branches are the synaptic terminals (synapses), which are connected to the dendrites or cell bodies of other neurons
    - Produces short electrical impulses called action potentials, which travel along the axon and make the synapses release neurotransmitters
    - When a neuron receives a sufficient amount of these neurotransmitters within a few milliseconds, it fires its own electrical impulses
    - Neurons behave in a simple way but they are interconnected in a network of billions of other neurons, with each neuron connected to thousands of others
    - Complex computations can be performed with a network of simple neurons

*Biological Neural Networks (BNNs) are still subject to active research*. However, some parts of the brain have been mapped, and it seems that neurons are often organized in consecutive layers

- ### Artificial Neuron
    - one or more binary inputs and one binary output
    - the output is activated when more than a certain number of its inputs are active
    - a simple network of artificial neurons can perform complex tasks
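
As a minimal sketch of the artificial neuron described above, the snippet below builds logical AND and OR from a single binary neuron. The function names and the "at least `threshold` active inputs" convention are assumptions made for illustration, not taken from the original 1943 paper.

```python
def mp_neuron(inputs, threshold):
    """Binary neuron: outputs 1 when at least `threshold` inputs are active."""
    return int(sum(inputs) >= threshold)

def logical_and(a, b):
    # Both inputs must be active.
    return mp_neuron([a, b], threshold=2)

def logical_or(a, b):
    # One active input is enough.
    return mp_neuron([a, b], threshold=1)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", logical_and(a, b), "OR:", logical_or(a, b))
```

Combining a few such neurons gives more complex logical expressions, which is the sense in which a simple network can perform complex tasks.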

## The Perceptron
- one of the simplest ANN architectures
- invented by Frank Rosenblatt
- It is based on a slightly different artificial neuron called a *threshold logic unit (TLU)*, aka *linear threshold unit (LTU)*: **the inputs are numbers rather than binary values, and each input connection is associated with a weight**
- the TLU computes the weighted sum of its inputs then applies a step function to that sum
    - the most common step function used in Perceptrons is the Heaviside step function (heaviside(z) = 0 if z < 0, else 1)
- composed of a **single** layer of TLUs w/ each TLU connected to all the inputs
- **Dense layer:** a layer in which every neuron is connected to every neuron in the previous layer (in this case the input neurons)
- all the input neurons form the input layer
- bias: an extra neuron that represents the bias is added. It always outputs 1

equation for the output of a fully connected layer: h(X) = activation_function(XW+b)
- X = matrix of input features. one row per instance, one column per feature
- W = connection weights excl. bias neuron. one row per input neuron, one column per artificial neuron in the layer
- b = bias vector. one bias term per artificial neuron (the connection weights between the bias neuron and the artificial neurons)
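
The shapes of X, W, and b are easier to see in code. Here is a small NumPy sketch of a dense layer of TLUs using the Heaviside step function; the input values, weights, and function names are made up for illustration.

```python
import numpy as np

def heaviside(z):
    """Step function: 0 where z < 0, 1 where z >= 0."""
    return (z >= 0).astype(int)

def dense_layer(X, W, b, activation=heaviside):
    """Compute activation(XW + b) for a whole batch of instances."""
    return activation(X @ W + b)

# Illustrative values: 2 instances, 3 features, 2 TLUs in the layer.
X = np.array([[1.0, 2.0, 3.0],
              [0.0, -1.0, 0.5]])   # one row per instance
W = np.array([[0.5, -1.0],
              [0.2, 0.1],
              [-0.3, 0.4]])        # one row per input, one column per TLU
b = np.array([0.1, -0.5])          # one bias term per TLU

outputs = dense_layer(X, W, b)
print(outputs.shape)  # (2, 2): one row per instance, one column per TLU
print(outputs)
```

NumPy broadcasting adds the bias vector to every row of XW, which is why b only needs one entry per neuron.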

### The TLU/LTU
- one TLU can be used for simple binary classification

### how is a perceptron trained?
- The connection weight between two neurons tends to increase when they fire simultaneously - "Hebb's rule"
    - perceptrons are trained using a variant of this rule that takes into account the error made by the network when it makes a prediction
- It is fed one training instance at a time, and for each instance, predicts the label
- For every output neuron that produced a wrong prediction, the connection weights from the inputs that would have contributed to the correct prediction are reinforced
- training rule: w_next_step = w + learning_rate * (target_output - predicted_output) * x_i, where x_i is the ith input value of the current training instance
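
The training rule can be sketched in a few lines of NumPy. This is an illustrative single-TLU version trained on the logical AND problem (which is linearly separable). The function name, the learning rate of 1.0, and the treatment of the bias as a weight on a constant input of 1 are assumptions of this sketch.

```python
import numpy as np

def train_perceptron(X, y, learning_rate=1.0, n_epochs=20):
    """Train one TLU with the perceptron learning rule."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for x_i, target in zip(X, y):        # one training instance at a time
            predicted = int(x_i @ w + b >= 0)
            error = target - predicted       # 0 when the prediction is right
            w += learning_rate * error * x_i
            b += learning_rate * error       # bias input is always 1
    return w, b

# Logical AND: only the instance (1, 1) belongs to the positive class.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w, b = train_perceptron(X, y)
predictions = (X @ w + b >= 0).astype(int)
print(predictions)  # [0 0 0 1]
```

Weights only change when a prediction is wrong, so once every instance is classified correctly the weights stop moving.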

**perceptrons have a linear decision boundary, so they cannot learn complex patterns (the same is true of logistic regression classifiers)**

Perceptron convergence theorem: If the dataset is linearly separable, the algorithm converges to a solution

**output of perceptron**: predictions are based on a hard threshold. It does not output class probabilities (unlike logistic regression)

### Problem with single-layer perceptrons: cannot solve exclusive OR classification problems (true for any linear classification model)
- **solution**: stack multiple perceptrons together
    - Multilayer Perceptron, which can solve XOR problems
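
To make the XOR claim concrete, here is a sketch of a tiny two-layer network of TLUs with hand-picked weights (not learned). The first hidden neuron acts as AND, the second as OR, and the output neuron fires when OR is active but AND is not. The specific weight values are one possible choice, chosen for illustration.

```python
import numpy as np

def step(z):
    """Heaviside step function applied element-wise."""
    return (z >= 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: column 0 computes AND, column 1 computes OR.
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-1.5, -0.5])

# Output neuron: OR minus AND, thresholded.
W_out = np.array([[-1.0],
                  [1.0]])
b_out = np.array([-0.5])

hidden = step(X @ W_hidden + b_hidden)
xor_output = step(hidden @ W_out + b_out).ravel()
print(xor_output)  # [0 1 1 0]
```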

## Multilayer Perceptrons (MLPs)
- stack of perceptrons
- Composed of:
    - input layer
    - hidden layers - one or more layers of TLUs
        - lower layers - layers closer to input layer
        - upper layers - layers closer to output layer
    - output layer - final layer of TLUs
    - a bias neuron in every layer except the output layer
    
**Deep learning: when an ANN contains <u>deep</u> stacks of hidden layers**. In practice, people often use the term "deep learning" even when the networks involved are fairly shallow

- for many years there was no effective way to train MLPs, until **backpropagation** became widely known
- Backpropagation: two passes through the network (one forward, one backward)
    - computes the gradient of the network's error with regard to every single model parameter. Basically it determines how each connection weight and each bias term should be tweaked in order to reduce the error.
    - Once it has those gradients, it does a gradient descent step and repeats until it converges.
    
#### Backpropagation in more detail
- handles one mini-batch at a time (e.g. 32 instances per batch) and goes through the full training set multiple times. Each full pass through the training set is called an *epoch*
- Each mini-batch is passed to the network's input layer, which sends it to the first hidden layer
    - The algorithm computes the output of all neurons in the first hidden layer, then passes it to the next layer, until the output layer
- **up to this point is called the "forward pass"**: it is just like making predictions, except that the intermediate results are kept because the backward pass needs them
- measures the network's output error w/ a loss function
- computes how much each output connection contributed to the error by applying the chain rule (perhaps the most fundamental rule in calculus), which makes this step fast and precise
- measures how much each connection contributed to the error in the layer below using the chain rule, working backward until the input layer
- final step: performs gradient descent step to tweak all the connection weights in the network, using the gradients it just computed
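
The steps above can be sketched for a tiny MLP in NumPy. This is an illustrative version only: it assumes one sigmoid hidden layer, a linear output, MSE loss, random data, and a learning rate of 0.01. None of those choices come from the notes, and real frameworks compute the gradients automatically.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    """Forward pass: keep the hidden activations for the backward pass."""
    W1, b1, W2, b2 = params
    hidden = sigmoid(X @ W1 + b1)
    output = hidden @ W2 + b2          # linear output layer
    return hidden, output

def mse_loss(params, X, y):
    _, output = forward(params, X)
    return np.mean((output - y) ** 2)

def backward(params, X, y):
    """Backward pass: chain rule from the output layer back to the inputs."""
    W1, b1, W2, b2 = params
    hidden, output = forward(params, X)
    d_output = 2.0 * (output - y) / y.size            # dLoss/dOutput
    grad_W2 = hidden.T @ d_output
    grad_b2 = d_output.sum(axis=0)
    d_hidden = (d_output @ W2.T) * hidden * (1.0 - hidden)  # sigmoid derivative
    grad_W1 = X.T @ d_hidden
    grad_b1 = d_hidden.sum(axis=0)
    return [grad_W1, grad_b1, grad_W2, grad_b2]

rng = np.random.default_rng(42)
X = rng.normal(size=(32, 3))           # one mini-batch of 32 instances
y = rng.normal(size=(32, 1))
params = [rng.normal(size=(3, 4)), np.zeros(4),
          rng.normal(size=(4, 1)), np.zeros(1)]

loss_before = mse_loss(params, X, y)
grads = backward(params, X, y)

# Gradient descent step: move every parameter against its gradient.
learning_rate = 0.01
params = [p - learning_rate * g for p, g in zip(params, grads)]
loss_after = mse_loss(params, X, y)
print(loss_before, loss_after)
```

A full training loop would repeat this for every mini-batch in the training set, and then for several epochs.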

#### Backpropagation in summary:
- for each training instance:
    - the backprop algo first makes a prediction (forward pass) and measures the error
    - then goes through each layer in reverse to measure the error contribution from each connection (reverse pass)
    - finally tweaks the connection weights to reduce the error (gradient descent step)

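*Initialize the hidden layers' connection weights randomly.* **DO NOT INITIALIZE THEM ALL TO THE SAME VALUE**:
- if all weights and biases start out equal, every neuron in a given layer is perfectly identical, so backprop updates them in exactly the same way and they stay identical. The layer then behaves as if it had only one neuron
- Conquer the enemy and break the symmetry: random initialization lets backprop train a diverse set of neurons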
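
A quick way to see the symmetry problem is to compare the hidden-layer gradients under constant and random initialization. The sketch below assumes a tanh hidden layer without biases, a linear output, and MSE loss, purely to keep it short. With constant initialization every hidden neuron receives exactly the same gradient, so the neurons can never become different from one another.

```python
import numpy as np

def hidden_weight_gradient(W1, W2, X, y):
    """Gradient of the MSE loss w.r.t. W1 (tanh hidden layer, linear output)."""
    hidden = np.tanh(X @ W1)
    output = hidden @ W2
    d_output = 2.0 * (output - y) / y.size
    d_hidden = (d_output @ W2.T) * (1.0 - hidden ** 2)   # tanh derivative
    return X.T @ d_hidden              # one column per hidden neuron

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))

# Constant initialization: every hidden neuron starts out identical.
grad_same = hidden_weight_gradient(np.full((3, 4), 0.5),
                                   np.full((4, 1), 0.5), X, y)

# Random initialization: every hidden neuron starts out different.
grad_rand = hidden_weight_gradient(rng.normal(size=(3, 4)),
                                   rng.normal(size=(4, 1)), X, y)

# Are all columns (neurons) of the gradient equal to the first column?
same_init_symmetric = np.allclose(grad_same, grad_same[:, [0]])
rand_init_symmetric = np.allclose(grad_rand, grad_rand[:, [0]])
print(same_init_symmetric, rand_init_symmetric)
```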

The MLP's activation function was changed from the step function to the sigmoid (logistic) function *sigmoid(z) = 1/(1 + exp(-z))*. The step function contains only flat segments, so there is no gradient to work with, whereas the sigmoid is continuous and has a well-defined nonzero derivative everywhere, which lets gradient descent make some progress at every step

other activation functions:
- hyperbolic tangent tanh(z) = 2*sigmoid(2z) - 1
    - S-shaped
    - output from -1 to 1
- Rectified Linear Unit (ReLU) ReLU(z) = max(0,z)
    - not differentiable at z=0
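
The formulas above are short enough to check numerically. This sketch defines the sigmoid and ReLU with NumPy and confirms the identity between tanh and the sigmoid on a handful of sample points (the sample values are arbitrary).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

z = np.linspace(-5.0, 5.0, 11)

# tanh(z) = 2 * sigmoid(2z) - 1
tanh_from_sigmoid = 2.0 * sigmoid(2.0 * z) - 1.0
print(np.allclose(tanh_from_sigmoid, np.tanh(z)))  # True

# ReLU zeroes out negative inputs and passes positive inputs through.
print(relu(np.array([-2.0, 0.0, 3.0])))            # [0. 0. 3.]
```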
    
#### So what the heck is the point of activation functions??
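- An MLP is basically a chain of functions, one per layer. If you chain several linear transformations, all you get is another linear transformation, so without activation functions even a deep stack of layers is equivalent to a single layer.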
- The point of MLPs is to solve more complex problems, not just linear ones
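
The collapse of stacked linear layers can be checked directly. In this sketch (random weights, arbitrary layer sizes), two linear layers without an activation function produce exactly what one combined linear layer produces, while inserting a ReLU between them gives a different function.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

# Two layers with no activation function in between.
two_layers = (X @ W1 + b1) @ W2 + b2

# The same computation expressed as a single linear layer.
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
one_layer = X @ W_combined + b_combined
print(np.allclose(two_layers, one_layer))  # True

# With a nonlinearity in between, the layers no longer collapse.
with_relu = np.maximum(0.0, X @ W1 + b1) @ W2 + b2
```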

## types of MLPs
### Regression MLPs
- for single value predictions, only need a single output neuron
    - to predict two values at once (e.g. 2D coordinates), use two output neurons
    - etc. 
    - 1 output neuron per dimension
- In general, do not use an activation function for the output neurons so they can output any range of values. Scenarios to use **output activation functions:**
    - can use ReLU or softplus to output only positive values
    - can use the logistic function or the hyperbolic tangent to guarantee values fall within a certain range (then scale the labels to that range: 0 to 1 for logistic, -1 to 1 for tanh)
- Loss functions:
    - MSE
    - MAE
    - Huber loss: combination of MSE and MAE
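
To show how the three losses differ in the presence of an outlier, here is a small NumPy sketch. The Huber threshold `delta=1.0` and the sample values are assumptions chosen for illustration. The Huber loss is quadratic for errors up to `delta` and linear beyond it.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for small errors (like MSE), linear for large ones (like MAE)."""
    error = np.abs(y_true - y_pred)
    quadratic = 0.5 * error ** 2
    linear = delta * (error - 0.5 * delta)
    return np.mean(np.where(error <= delta, quadratic, linear))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 3.0, 14.0])   # the last prediction is an outlier

print(mse(y_true, y_pred))    # 25.0625
print(mae(y_true, y_pred))    # 2.625
print(huber(y_true, y_pred))  # 2.40625
```

The single outlier dominates the MSE, while MAE and Huber grow only linearly with it.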
    
### Classification MLPs
- # of neurons:
    - for binary classification, use a single output neuron with the logistic activation function so the output is b/t 0 and 1 (can be interpreted as the estimated probability of the positive class)
    - for multilabel binary classification, e.g. predicting whether an email is ham or spam AND whether it is urgent or not:
        - use 2 output neurons, both w/ the logistic activation function (first neuron outputs the spam probability, second outputs the urgent probability)
        - the output probabilities do not have to add up to 1 b/c the labels are not mutually exclusive (an email can be urgent spam, nonurgent ham, etc.)
        - Basically one output neuron for each positive class
    - for multiclass classification:
        - if each instance belongs to only one class out of three or more possible classes (e.g. 0-9 for digit img classification), use one output neuron per class and the softmax activation function so that the probabilities are b/t 0 and 1 and add up to 1 b/c the classes are mutually exclusive
- loss function:
    - it is predicting probability distributions so use cross-entropy (log) loss
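
The difference between the multilabel and multiclass output layers can be sketched with NumPy. The raw scores (`logits`) and labels below are made-up values, and the helper names are illustrative. The logistic function treats each output independently, while softmax normalizes each row into a probability distribution, which is what cross-entropy expects.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probabilities, class_ids):
    """Mean negative log-probability assigned to the true class."""
    rows = np.arange(len(class_ids))
    return -np.mean(np.log(probabilities[rows, class_ids]))

# Raw outputs for 2 instances and 3 output neurons (illustrative values).
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
labels = np.array([0, 1])              # true class index for each instance

multilabel_probs = sigmoid(logits)     # independent probabilities per label
multiclass_probs = softmax(logits)     # mutually exclusive classes

print(multilabel_probs.sum(axis=1))    # rows need not sum to 1
print(multiclass_probs.sum(axis=1))    # rows sum to 1
loss = cross_entropy(multiclass_probs, labels)
print(loss)
```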
