# **Elements of Fully Connected Neural Networks**

**Objectives:**
- Understand the elements of a perceptron and a multi-layer perceptron (feed-forward NN)
- Understand how to set up a fully connected neural network
- Understand practices in applied machine learning: k-fold cross-validation and regularisation

Here's a *fully-connected neural network*:

<div style="text-align: center;">
    <img src="images/dense_nn.png" width="300" height="300">
</div>

Neural networks can learn complex non-linear mappings from an input space to an output space, provided that a large and informative dataset is available.

### Perceptron/Node

A neural network is made up of layers of nodes. A layer with $W$ nodes is said to have width $W$. The layers between the input and output layers are known as hidden layers.

A node computes a weighted sum of its inputs: it applies a **weight** to each input and adds a **bias** term. It then applies an activation function to the result.

$$
y_i = f\left(b_i + \sum_{j=1}^{N} x_j w_{ij} \right)
$$

where,
- **$y_i$** → Output of node (perceptron) $i$ after applying the activation function.  
- **$f(\cdot)$** → Activation function that introduces non-linearity (e.g., ReLU, sigmoid).  
- **$b_i$** → Bias term of node $i$.  
- **$\sum_{j=1}^{N} x_j w_{ij}$** → Weighted sum of inputs before activation.  
- **$x_j$** → Input feature $j$ to the node.  
- **$w_{ij}$** → Weight associated with input $x_j$ for node $i$.  
- **$N$** → Total number of input features to the node.
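
The node equation above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the helper names (`node_output`, `relu`, `sigmoid`) and the input, weight, and bias values are made up for the example.

```python
import math


def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)


def sigmoid(z):
    """Logistic sigmoid activation: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def node_output(x, w, b, f):
    """Compute y = f(b + sum_j x_j * w_j) for a single node."""
    z = b + sum(x_j * w_j for x_j, w_j in zip(x, w))
    return f(z)


# Hypothetical values, chosen only for illustration.
x = [1.0, 2.0, 3.0]    # inputs x_j (N = 3)
w = [0.5, -1.0, 0.25]  # weights w_ij of one node i
b = 0.75               # bias b_i

y_relu = node_output(x, w, b, relu)
y_sigmoid = node_output(x, w, b, sigmoid)
print(y_relu, y_sigmoid)
```

Here the pre-activation value is $0.75 + 0.5 - 2.0 + 0.75 = 0$, so the same node gives different outputs depending only on the choice of activation function $f$.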

The values of the weights and biases of each node in a NN are learned by minimising an error metric, the loss function $L$:

$$
(\mathbf{w}_{opt}, b_{opt}) = \arg\min_{\mathbf{w}, b} L(\mathbf{w}, b)
$$

At a minimum of $L$, the gradient with respect to the weights and bias is zero:

$$
\nabla_{\mathbf{w}, b } \; L(\mathbf{w}, b) = 0
$$


### Gradient Descent

Gradient descent is an iterative method for finding a minimum of a function. The idea is to start at a random point and repeatedly move in the direction of the negative gradient. We keep moving until we reach a point where the gradient is 0, which is typically a minimum. Note that this method will find a local minimum, but not necessarily the global minimum. Furthermore, the learning rate controls how big each step in the negative gradient direction is. If it is too big, we might overshoot the minimum. If it is too small, it will take much longer to reach the minimum.

Say that we are trying to find the value of $x$ that minimises the function $f(x)$. At each step of the algorithm, we update the parameter using the formula:

$$
x_t = x_{t-1} - \alpha \nabla f(x_{t-1})
$$

where,
- $x_t$ is the parameter at iteration $t$
- $\alpha$ is the learning rate
- $\nabla f (x_{t-1})$ is the gradient of the function at the current point

The iterative process continues until the algorithm converges to a minimum, where the gradient becomes very small (close to 0).

<div style="text-align: center;">
    <img src="images/gradient_descent.jpg" width="500" height="300">
</div>
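
The update rule and the effect of the learning rate can be tried out on a simple one-dimensional function. The sketch below assumes a hypothetical example, $f(x) = (x - 3)^2$ with gradient $2(x - 3)$ and minimum at $x = 3$; the function name `gradient_descent`, the tolerance, and the step limits are illustrative choices, not part of the lecture material.

```python
def gradient_descent(grad, x0, alpha, tol=1e-8, max_iters=1000):
    """Iterate x_t = x_{t-1} - alpha * grad(x_{t-1}) until the gradient is close to 0."""
    x = x0
    for _ in range(max_iters):
        g = grad(x)
        if abs(g) < tol:  # gradient close to 0: treat as converged
            break
        x = x - alpha * g
    return x


def grad_f(x):
    """Gradient of the example function f(x) = (x - 3)**2."""
    return 2.0 * (x - 3.0)


# A moderate learning rate converges to the minimum at x = 3.
x_good = gradient_descent(grad_f, x0=0.0, alpha=0.1)

# A very small learning rate is still far from the minimum after 100 steps.
x_small = gradient_descent(grad_f, x0=0.0, alpha=0.001, max_iters=100)

# A learning rate that is too large overshoots further on every step.
x_big = gradient_descent(grad_f, x0=0.0, alpha=1.1, max_iters=50)

print(x_good, x_small, x_big)
```

For this particular function, each update multiplies the distance to the minimum by $(1 - 2\alpha)$, which is why $\alpha = 0.1$ shrinks the error steadily while $\alpha = 1.1$ makes it grow.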

So, we solve the weights and bias optimisation problem using gradient descent. The gradient descent equations for neural networks are:

$$
\mathbf{w}_{t} = \mathbf{w}_{t-1} - \alpha \nabla_{\mathbf{w}} L(\mathbf{w}_{t-1}, b_{t-1})
$$

$$
b_{t} = b_{t-1} - \alpha \nabla_b L(\mathbf{w}_{t-1}, b_{t-1})
$$
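
As a small worked case of these two updates, the sketch below fits a single node with one input and an identity activation to toy data. Several things here are assumptions made for illustration only: the mean squared error loss $L(w, b) = \frac{1}{n}\sum_i (w x_i + b - y_i)^2$, the data generated from $y = 2x + 1$, the function names, and the zero starting point (used instead of a random one so the run is reproducible).

```python
def mse_gradients(w, b, xs, ys):
    """Gradients of L(w, b) = mean((w * x + b - y)**2) with respect to w and b."""
    n = len(xs)
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
    grad_b = (2.0 / n) * sum(errors)
    return grad_w, grad_b


def train(xs, ys, alpha=0.1, steps=2000):
    """Apply the gradient descent updates for w and b a fixed number of times."""
    w, b = 0.0, 0.0  # fixed starting point (an assumption, for reproducibility)
    for _ in range(steps):
        grad_w, grad_b = mse_gradients(w, b, xs, ys)
        # Both updates use the gradients evaluated at (w_{t-1}, b_{t-1}).
        w, b = w - alpha * grad_w, b - alpha * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical, for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w_opt, b_opt = train(xs, ys)
print(w_opt, b_opt)
```

The learned values approach $w = 2$ and $b = 1$. In a full network the gradients are taken with respect to every weight and bias, but each parameter is updated with the same rule.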
