## 1. Artificial Neuron

![Alt text](images/image-2.png)

- **x<sub>i</sub>** is the **input**: one feature (dimension) of the data. An input can have multiple dimensions: price, size, color, etc.
- **w<sub>i</sub>** is the **weight** learned for input x<sub>i</sub>. Each weight is multiplied with its input dimension, and the products are summed.
- **b** is the **bias** (scalar), a learned weight that is not attached to any input
- **f** is the **activation function**, which determines how the output changes with the sum of all weight-input products
- **y** is the **output**, such as the class an image belongs to

What are we training here? --> THE **w** and the **b**

In practice, we represent **x** and **w** as vectors and matrices so that the multiplications can be done in one operation
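The neuron described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `neuron` and `sigmoid` and all numeric values are assumptions chosen so the arithmetic is easy to follow.

```python
import math

def neuron(x, w, b, f):
    """Return f(w . x + b) for a single neuron."""
    weighted_sum = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return f(weighted_sum)

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical values chosen only for illustration.
x = [0.5, -1.0, 2.0]   # three input features
w = [1.0, 2.0, 0.5]    # one weight per feature
b = 0.5                # bias

y = neuron(x, w, b, sigmoid)
print(y)  # weighted sum is 0.0, so the sigmoid gives 0.5
```

Only `w` and `b` would change during training; `x` comes from the data and `f` is fixed in advance.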

***
## 2. Activation Function

### Linear Activation Function (Separating 2 classes)

Suppose the activation function is a simple linear function

$$y = w * x + b$$

This is a line --> the <mark>decision boundary</mark> is the "line" that separates the 2 sets of data our task is trying to tell apart

![Alt text](images/image-3.png)

The 2 separated classes are labeled y = 1 and y = -1 (these labels are not the raw result of the activation function) --> if the output of the activation function is > 0, then y = 1; otherwise y = -1
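As a rough sketch of this thresholding rule, the snippet below classifies 2D points against a hypothetical boundary. The function name `classify` and the weights are assumptions made for illustration.

```python
def classify(x, w, b):
    """Return 1 if the point lies on the positive side of the line, else -1."""
    score = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1 if score > 0 else -1

# Hypothetical decision boundary: x1 + x2 - 1 = 0
w = [1.0, 1.0]
b = -1.0

print(classify([2.0, 2.0], w, b))   # score = 3.0  -> class 1
print(classify([0.0, 0.0], w, b))   # score = -1.0 -> class -1
```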

### Neuron with a Linear Activation Function

*What's **wrong** with a linear activation function?*

- Most real datasets are not linearly separable, e.g. we can't find a line that separates classes well in a classification problem 
- We can learn non-linear transformations of our data to help (activation function is sin, cos, etc.)  
- Multiple layers with non-linear transformations help
- <mark>**No advantage from multiple linear layers**</mark> (since then, the composite of layers is a linear layer)

$$ w_3(w_2(w_1 x + b_1) + b_2) + b_3 = (w_3 w_2 w_1) x + (w_3 w_2 b_1 + w_3 b_2 + b_3) = w' x + b' $$
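The identity above can be checked numerically. The sketch below uses hypothetical scalar weights and biases and compares three stacked linear layers against the single collapsed layer.

```python
# Hypothetical scalar weights and biases for three stacked linear layers.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5
w3, b3 = 0.5, -2.0

def three_layers(x):
    return w3 * (w2 * (w1 * x + b1) + b2) + b3

# Collapsed single layer: w' = w3*w2*w1, b' = w3*w2*b1 + w3*b2 + b3
w_prime = w3 * w2 * w1
b_prime = w3 * w2 * b1 + w3 * b2 + b3

def one_layer(x):
    return w_prime * x + b_prime

for x in [-2.0, 0.0, 1.5]:
    # Both formulations give the same output for every input.
    assert abs(three_layers(x) - one_layer(x)) < 1e-12

print(w_prime, b_prime)  # -3.0 -3.25
```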

![Alt text](images/image-4.png)

### Early Activation Functions: Perceptrons

Biological neurons either fire or do not fire --> early artificial neurons followed exactly this model, using the Heaviside (unit) step function as the activation

--> The threshold where the step switches from 0 to 1 acts as the decision boundary

**Problem**: These functions are not differentiable, continuous, or smooth
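To make the problem concrete, the sketch below implements a step function and estimates its slope with a finite difference. The helper `numerical_slope` and the threshold at 0 are assumptions for illustration; the point is that the slope is 0 everywhere except at the jump, so it carries no training signal.

```python
def step(a):
    """Heaviside step: output 1 once the input reaches the threshold 0."""
    return 1.0 if a >= 0 else 0.0

def numerical_slope(f, a, h=1e-6):
    """Central finite-difference estimate of df/da."""
    return (f(a + h) - f(a - h)) / (2 * h)

print(step(-0.3), step(0.7))        # 0.0 1.0
print(numerical_slope(step, -2.0))  # 0.0: flat left of the jump
print(numerical_slope(step, 3.0))   # 0.0: flat right of the jump
```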

![Alt text](images/img3.png)

### Sigmoid Activation Function

> We rely on derivatives and gradients to find signals for **w** and **b**

- Easily differentiable, smooth, continuous
- Output range is (-1, 1) for tanh or (0, 1) for the logistic function

There are many sigmoid functions; the most common are tanh and the logistic function:

$$f(x) = \tanh(x)$$
$$f(x) = {1 \over 1 + e^{-x}}$$

**Problem**: Saturated neurons “kill” the gradients. Gradients become **vanishingly small very quickly** away from x=0  

--> For most inputs we get almost no signal to train our model (since the derivative is close to 0)
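The saturation effect can be seen by evaluating the derivative of the logistic function at a few points. This is a small sketch; the sample points are arbitrary.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_grad(a):
    """Derivative of the logistic function: s(a) * (1 - s(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)

# The gradient peaks at 0.25 for a = 0 and shrinks rapidly as |a| grows.
for a in [0.0, 2.0, 5.0, 10.0]:
    print(a, sigmoid_grad(a))
```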

![Alt text](images/img4.png)

### ReLU Activation Function (MODERN)

<mark>Rectified Linear Unit (ReLU)</mark> based activation functions:

ReLU

$$\text{ReLU}(x) = (x)^+ = \max(0, x)$$

LeakyReLU

$$
\text{LeakyReLU}(x) = 
\begin{cases}
    x & \text{if } x > 0 \\
    \text{negative\_slope} \cdot x & \text{otherwise} \\
\end{cases}
$$

Parametric ReLU

$$
\text{PReLU}(x) = 
\begin{cases}
    x & \text{if } x > 0 \\
    ax & \text{otherwise, where } a \text{ is a learned parameter} \\
\end{cases}
$$

ReLU has a **very simple derivative (either 0 or 1); by convention, use derivative = 0 at x = 0**
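A minimal sketch of these functions in plain Python follows. The default `negative_slope=0.01` is an assumed value for illustration, and `relu_grad` follows the convention stated above of using 0 at x = 0.

```python
def relu(x):
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    # negative_slope=0.01 is an assumed default, not taken from the notes.
    return x if x > 0 else negative_slope * x

def relu_grad(x):
    """Derivative of ReLU, using the convention grad = 0 at x = 0."""
    return 1.0 if x > 0 else 0.0

for x in [-2.0, 0.0, 3.0]:
    print(x, relu(x), leaky_relu(x), relu_grad(x))
```

PReLU has the same form as `leaky_relu`, except that the slope is a parameter updated during training rather than a fixed number.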

<img src="images/img7.png" width="70%" height="70%">

If we have billions of neurons, each with this activation function, every neuron contributes a piecewise-linear piece whose derivative is always 0 or 1.

Then we can approximate any function using many small line segments.

***
## 3. Training Artificial Neurons (Supervised Learning)

So, how do we "train" the **w** and **b** based on the prediction error?

> input: **x**, predicted output: y, ground truth label (correct output): t, neuron M(**w;x**)

(in other words, we have y = M(w;x) and we want to match this as close as possible to t)

> 1. Make a prediction for some input data x, with a known correct output t
>
>           y = M(w;x)
>
> 2. Compare the correct output with our predicted output to compute **loss** (<mark>loss function</mark>):
>
>           E = Loss(y, t)
> 
> 3. **Adjust the weights/bias** to make the prediction closer to the ground truth, i.e. minimize error
>
> 4. **Repeat** until we have an acceptable level of error

#### **Inference time**

After training with these 4 steps, we use the learned **w** and **b** to **predict the output for a new x**

#### **<mark>Forward pass and Backward pass</mark>**

- "Forward pass" refers to computing the values of the output layer from the input data, traversing all neurons from the first layer to the last. The loss function is then calculated from the output values.
- "Backward pass" refers to computing the changes to the weights using the gradient descent algorithm (or similar). The computation runs from the last layer backward to the first layer.

Together, one forward pass and one backward pass make up one "iteration".

***

## 4. Loss Function

A loss function computes how bad **predictions are compared to the ground truth labels**.

- Large loss: the network’s prediction differs from the ground truth
- Small loss: the network’s prediction matches the ground truth

We want to calculate the error over **all training samples (average error)**

![Alt text](images/img9.png)

<mark>Each full pass over the training data is called an epoch</mark>

As we can see, the model fits the training data better the more we iterate  
**BUT, the validation result will become worse after a while (the model starts to memorize the training data and loses generalization)**  
even though the training error keeps decreasing

![Alt text](images/img10.png)

Here, we want to classify 3 classes --> we use 3 neurons

The **x** input column, in this example, is just a representation of the input: a vectorized version of the pixels (colors, features, etc.)

### Interpreting the answer:

- The prediction is the one with the highest score (obviously)
- How can we output the **<mark>confidence</mark> of the answer?**

<mark>**logits**</mark> = raw output before normalization

<mark>**Softmax function**</mark> → normalizes the logits into a categorical probability distribution over all possible classes.

![Alt text](images/img11.png)

This outputs probabilities (confidence) of the prediction
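Softmax can be sketched as follows. The logits are hypothetical scores for 3 classes, and subtracting the maximum before exponentiating is a common numerical-stability step that does not change the result.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]   # hypothetical scores for 3 classes
probs = softmax(logits)
print(probs)
print(probs.index(max(probs)))   # index of the predicted class
```

The largest logit still wins, but now each class also has a confidence value between 0 and 1.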

#### We cannot compare "437.9" to cat, how do we do so?

**One-hot encoding** --> Maps categories to vector representation at the beginning **(ground truth label)**

![Alt text](images/img12.png)
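One-hot encoding is simple enough to sketch directly. The class list below is hypothetical; only "cat" is taken from the example above.

```python
def one_hot(label, classes):
    """Return a vector with 1.0 at the label's position and 0.0 elsewhere."""
    return [1.0 if c == label else 0.0 for c in classes]

classes = ["cat", "dog", "ship"]   # hypothetical class list
print(one_hot("cat", classes))     # [1.0, 0.0, 0.0]
print(one_hot("ship", classes))    # [0.0, 0.0, 1.0]
```

The resulting vector has the same length as the network output, so the two can be compared element by element.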

THEN, apply **Cross Entropy (CE)** --> Mostly used for **classification problems**

![Alt text](images/img14.png)
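A minimal sketch of cross entropy against a one-hot target, assuming the standard form CE = -Σ t_k log(p_k). The probability vectors are made-up values to contrast a confident and an unsure prediction.

```python
import math

def cross_entropy(probs, target):
    """CE = -sum_k t_k * log(p_k) for a one-hot target t."""
    # Terms with t_k = 0 contribute nothing, so they are skipped.
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

target = [1.0, 0.0, 0.0]                             # one-hot ground truth
confident = cross_entropy([0.9, 0.05, 0.05], target)
unsure = cross_entropy([0.4, 0.3, 0.3], target)
print(confident, unsure)   # the confident correct prediction has the smaller loss
```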

Alternatively, apply **Mean Squared Error (MSE)** --> Mostly used for **regression problems**

![Alt text](images/img13.png)
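For comparison, here is a small sketch of mean squared error over a handful of predictions. The numbers are hypothetical.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    n = len(predictions)
    return sum((y - t) ** 2 for y, t in zip(predictions, targets)) / n

y = [2.5, 0.0, 2.0]    # hypothetical predictions
t = [3.0, -0.5, 2.0]   # hypothetical ground truth
print(mean_squared_error(y, t))   # (0.25 + 0.25 + 0.0) / 3
```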

NOTE: The outputs we compare are normalized (they sum to 1) --> cross entropy does not discard information by looking only at the probability of the expected class, because raising that probability necessarily lowers the others

![Alt text](images/img15.png)

### Forward-Pass with Error Calculations

(Normally we would use vectorized NumPy operations instead of for loops; this is just to illustrate)

NOTE: This example is classifying 2 classes (either 0 or 1). 4 ground truth labels --> 4 examples (inputs)

```python
import math

x = [[1.0, 0.1, -0.2],   # data
     [1.0, -0.1, 0.9],
     [1.0, 1.2, 0.1],
     [1.0, 1.1, 1.5]]
t = [0, 0, 0, 1]         # labels
w = [1, -1, 1]           # initial weights

# x(4 x 3) * w(3 x 1) = y(4 x 1) -> compared with t

def simple_ANN(x, w, t):
    """Forward pass: return predictions with average BCE and MSE errors."""
    e_bce = []
    e_mse = []
    y = []
    for n in range(len(x)):
        v = 0
        for d in range(len(x[0])):
            v += x[n][d] * w[d]           # weighted sum of the inputs
        y.append(1 / (1 + math.exp(-v)))  # sigmoid
        # binary cross entropy for this example
        e_bce.append(-t[n] * math.log(y[n]) - (1 - t[n]) * math.log(1 - y[n]))
        # squared error for this example
        e_mse.append((y[n] - t[n]) ** 2)

    total_e_bce = sum(e_bce) / len(x)  # average error - BCE
    total_e_mse = sum(e_mse) / len(x)  # average error - MSE
    return ((y, w, total_e_bce), (y, w, total_e_mse))
```


***
## 5. Gradient Descent

### Neural Network Layer (Vector, Matrices, and Tensors)

A neural network layer with two neurons:

$$y_1 = f(w_1 x + b_1)$$
$$y_2 = f(w_2 x + b_2)$$

Say N = number of input features  
M = number of neurons (classes to be classified)

We can represent an NN layer more easily with a weight matrix,  
e.g. where **each neuron's weight vector is a row** of the weight matrix W (M x N)  
and the **input is a column vector x**: (N x 1)

$$\implies y = f(Wx + b)$$

where y is M x 1 (one result for each class)
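The matrix form can be sketched with nested lists, keeping to plain Python. The function name `layer` and all numbers are assumptions for illustration: W has M = 2 rows (neurons) and N = 3 columns (features).

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def layer(W, x, b, f):
    """Compute f(Wx + b); W has one row per neuron (M rows, N columns)."""
    return [f(sum(w_ji * x_i for w_ji, x_i in zip(row, x)) + b_j)
            for row, b_j in zip(W, b)]

# Hypothetical layer with M = 2 neurons and N = 3 input features.
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
b = [0.0, -1.0]
x = [2.0, 1.0, 2.0]

y = layer(W, x, b, sigmoid)
print(y)   # one output per neuron, so len(y) == M
```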

![Alt text](images/img27.png)

### Single layer training: Delta rule

**How do we edit each of our neuron's weights $w_{ji}$ to reduce E (at every step)?**

--> Through <mark>**derivatives (gradient)**</mark>

Vector of partial derivatives for all weights is the **gradient**
- Direction of the gradient is the direction in which the function increases most quickly
- Magnitude of the gradient is the rate of increase

Adjusting the weights against the slope (the negative gradient) guides us toward the minimum error (following the gradient instead leads toward the maximum)
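As a toy illustration, the sketch below runs gradient descent on a hypothetical one-weight error surface E(w) = (w - 3)^2. The starting point, learning rate, and step count are arbitrary choices.

```python
def E(w):
    return (w - 3.0) ** 2      # hypothetical error surface with its minimum at w = 3

def dE_dw(w):
    return 2.0 * (w - 3.0)     # slope of E at w

w = 0.0     # initial guess
lr = 0.1    # learning rate (step size)
for step in range(100):
    w -= lr * dE_dw(w)         # move against the gradient

print(w, E(w))   # w ends up very close to 3, E(w) very close to 0
```

Flipping the sign of the update (`w += lr * dE_dw(w)`) would move uphill instead, away from the minimum.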

![Alt text](images/img17.png)

![Alt text](images/img18.png)

In high-dimensional spaces (deep learning), the probability of getting stuck in a **local minimum** is very small

### Delta Rule for Single Weight/Training Sample

Using the **chain rule**

In this example:

- E = E(y) = MSE formula
- y = f(a) = sigmoid activation function
- a = w * x + b

![Alt text](images/img19.png)
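For the three pieces listed above, the chain rule gives dE/dw = dE/dy · dy/da · da/dw = 2(y - t) · y(1 - y) · x. The sketch below compares that expression with a finite-difference estimate; all numeric values are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def error(w, b, x, t):
    y = sigmoid(w * x + b)
    return (y - t) ** 2

def analytic_grad(w, b, x, t):
    """dE/dw = dE/dy * dy/da * da/dw = 2(y - t) * y(1 - y) * x."""
    y = sigmoid(w * x + b)
    return 2.0 * (y - t) * y * (1.0 - y) * x

# Hypothetical values for a single weight and a single training sample.
w, b, x, t = 0.5, 0.1, 1.5, 1.0
h = 1e-6
numeric = (error(w + h, b, x, t) - error(w - h, b, x, t)) / (2 * h)
print(analytic_grad(w, b, x, t), numeric)   # the two values should agree closely
```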

```python
# Forward-Pass & Backward-Pass
import math

def simple_ANN(x, w, t, iter, lr):
    '''x: input
    w: weight (updated in place)
    t: ground truth
    iter: num iterations (learn iter times)
    lr: learning rate (w new = w old - lr * dE/dw)'''

    for i in range(iter):
        e, y = [], []
        for n in range(len(x)):
            v = 0
            for d in range(len(x[0])):
                v += x[n][d] * w[d]
            y.append(1 / (1 + math.exp(-v)))  # sigmoid
            e.append((y[n] - t[n]) ** 2)      # squared error (MSE term)

            # gradient descent to update weights after each sample
            for p in range(len(w)):
                # chain rule: dE/dw_p = 2(y - t) * y(1 - y) * x_p
                grad = 2 * x[n][p] * (y[n] - t[n]) * (1 - y[n]) * y[n]
                w[p] -= lr * grad
    total_e = sum(e) / len(x)  # average error of the last iteration
    return (y, w, total_e)
```

***
## 6. Neural Network Architectures

### Multiple Layers are Important: XOR

Having a single decision boundary (a single NN layer) is not enough to solve many problems

The most famous such problem is the XOR function, which needs two decision boundaries to solve

We solve this by having **at least 2 neural network layers** (1 hidden)

In fact, in the limit of an infinitely wide neural network with at least one hidden layer, an NN is a **universal function approximator**
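To illustrate how one hidden layer supplies the two decision boundaries XOR needs, here is a sketch with hand-picked weights and a step activation. The weights are assumptions chosen by hand, not learned values.

```python
def step(a):
    return 1 if a > 0 else 0

def xor_net(x1, x2):
    # Hidden layer: two decision boundaries on the same inputs.
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    # Output layer: "OR but not AND".
    return step(h_or - h_and - 0.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))
```

Each hidden neuron alone is a single straight line; only their combination in the output layer reproduces XOR.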

![Alt text](images/img21.png)

### Backpropagation: Solving Credit Assignment Problem

When the last layer receives feedback, the credit assignment problem arises: the network **doesn't know which earlier layers need updating**

$\implies$ Use dynamic programming (DP) to compute the intermediate results of the gradient calculation once and reuse them (save them in a lookup table)

- **Backpropagation** is a way of propagating the total loss back into the NN to know how much of the loss every node is responsible for, and updating those weights
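A minimal sketch of this idea for a tiny network with one hidden neuron and one output neuron, both sigmoid, with squared error. The names `forward` and `backward` and all values are assumptions, and biases are omitted to keep it short. Note how `delta_out` is computed once and reused for the earlier layer.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(w1, w2, x):
    h = sigmoid(w1 * x)   # hidden activation (kept for the backward pass)
    y = sigmoid(w2 * h)   # network output
    return h, y

def backward(w1, w2, x, t):
    """Return (dE/dw1, dE/dw2) for E = (y - t)^2 using stored intermediates."""
    h, y = forward(w1, w2, x)
    delta_out = 2.0 * (y - t) * y * (1.0 - y)       # dE/da2, computed once
    grad_w2 = delta_out * h
    delta_hidden = delta_out * w2 * h * (1.0 - h)   # dE/da1, reuses delta_out
    grad_w1 = delta_hidden * x
    return grad_w1, grad_w2

# Hypothetical values; biases are omitted to keep the sketch short.
w1, w2, x, t = 0.4, -0.7, 1.2, 1.0
g1, g2 = backward(w1, w2, x, t)
print(g1, g2)
```

With more layers the same pattern repeats: each layer's delta is built from the delta of the layer after it, so nothing is recomputed.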

![Alt text](images/img22.png)

### Multiple Layers with Non-Linearity

Each layer is a projection from one space to another

$\implies$ Neural networks can take in data and project it into another space many times, so that **in the final space, a linear boundary can be used to separate the data**  
$\implies$ So in the final space, we normally use only a linear activation function

![Alt text](images/img20.png)

NN can be viewed as a way of **learning features** directly and end-to-end from raw input data

Each layer (before the last layer) has **activations** that can be used as **high-level features** representing the input data

![Alt text](images/img23.png)

**Feed-Forward Network**: Information only flows forward from one layer to a later layer, from the input to the output.

**Fully-Connected Network**: Neurons between adjacent layers are fully connected.

**Number of Layers**: Number of hidden layers + output layer.

The architecture of an NN describes its neurons and their **connectivity**

The choice of architecture greatly affects performance.

In future weeks we will introduce more NN architectures