## 1. Neural Network Algorithm

<img src="img/pic49.png" width=400 height=400 />

#### 1.1 The forward pass

<img src="img/pic42.png" width=500 height=500 />

#### 1.2 The backward pass

<img src="img/pic43.png" width=500 height=500 />

#### Output layer derivatives

<img src="img/pic44.png" width=500 height=500 />

<img src="img/pic45.png" width=500 height=500 />

#### Hidden layer derivatives

<img src="img/pic46.png" width=500 height=500 />

<img src="img/pic48.png" width=300 height=300 />

<img src="img/pic47.png" width=500 height=500 />

**Modern deep learning frameworks such as PyTorch and TensorFlow calculate these derivatives automatically from the model specification. This is known as algorithmic (or automatic) differentiation.**
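To give a feel for what such a framework does internally, here is a minimal sketch of reverse-mode differentiation on scalars in plain Python. It is not how PyTorch or TensorFlow are implemented; the `Value` class, the single-hidden-unit network, the ReLU activation, the squared loss, and all numeric values are assumptions chosen for illustration.

```python
class Value:
    """A scalar that remembers how it was computed (hypothetical helper)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0                  # d(output)/d(self), filled in by backward()
        self._parents = parents          # the Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    @staticmethod
    def _wrap(x):
        return x if isinstance(x, Value) else Value(x)

    def __add__(self, other):
        other = Value._wrap(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = Value._wrap(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def relu(self):
        return Value(max(0.0, self.data), (self,), (1.0 if self.data > 0 else 0.0,))

    def backward(self):
        # Forward pass already happened; order the graph so that each node
        # is processed only after every node that depends on it.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Backward pass: apply the chain rule from the output to the inputs.
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad


# Assumed toy model: y_hat = v * relu(w * x + b) + c, with a squared loss.
x, y = 2.0, 1.0
w, b, v, c = Value(0.5), Value(0.1), Value(-1.5), Value(0.3)

hidden = (w * x + b).relu()
y_hat = v * hidden + c
diff = y_hat + (-y)
loss = diff * diff
loss.backward()

print("loss:", loss.data)
print("dL/dw:", w.grad, "dL/db:", b.grad, "dL/dv:", v.grad, "dL/dc:", c.grad)
```

The forward pass stores each intermediate value, and `backward()` reuses them while applying the chain rule from the loss back to the parameters, which mirrors the two passes described above.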

_________________________

## 2. Measuring performance

- With sufficient capacity (i.e., number of hidden units), a neural network model will often perform perfectly on the training data.

- However, this does not necessarily mean it will generalize well to new test data.

How well a model **generalizes** depends on:
- (i) the inherent uncertainty in the task,
- (ii) the amount of training data, and
- (iii) the choice of model (hyperparameter search)

#### Choosing hyperparameters

- It is typical to divide the data into three parts:
    - **training data** (to learn the model parameters)
    - **validation data** (to choose the hyperparameters)
    - **test data** (to estimate the final performance)
<br>
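As a rough sketch of such a division, the following plain-Python helper shuffles the examples and cuts them into three disjoint parts. The function name `three_way_split` and the 60/20/20 proportions are assumptions for illustration, not a recommendation from these notes.

```python
import random


def three_way_split(examples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the examples and split them into train, validation, and test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

The test part should be set aside and used only once, after the hyperparameters have been chosen on the validation part.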

- However, this division may cause problems when the total number of data examples is limited: holding out validation and test data leaves fewer examples for training, and if the number of training examples is comparable to the model capacity, then the variance of the fitted model will be large.

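- One way to mitigate this problem is to use k-fold cross-validation: the data are partitioned into k folds, and each fold is used once for validation while the remaining k-1 folds are used for training.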

<img src="img/grid_search_cross_validation.png" width=400 height=400 />

See [Model selection and evaluation](https://scikit-learn.org/stable/model_selection.html) in the scikit-learn documentation.
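The fold bookkeeping behind k-fold cross-validation can be sketched in a few lines of plain Python. This is an illustrative helper (the name `k_fold_indices` is made up here), not scikit-learn's implementation; in practice a library routine would normally be used.

```python
import random


def k_fold_indices(n_examples, k, seed=0):
    """Return k (train_indices, val_indices) pairs covering all examples."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal-sized folds
    splits = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train_idx, val_idx))
    return splits


splits = k_fold_indices(n_examples=10, k=5)
for train_idx, val_idx in splits:
    print("train:", sorted(train_idx), "validation:", sorted(val_idx))
```

For each hyperparameter setting, the model would be trained k times, once per split, and the k validation scores averaged; every example is therefore used for validation exactly once.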

<img src="img/pic50.png" width=500 height=500 />

#### Why could the model fail to generalize?


There are three possible sources of error, known as **noise, bias, and variance**.

<img src="img/pic51.png" width=500 height=500 />
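For a squared-error loss, the three sources can be written as an additive decomposition. The notation below is an assumption introduced for this sketch: $\hat{f}_{\mathcal{D}}(x)$ is the model fitted to a training set $\mathcal{D}$, $\bar{f}(x) = \mathbb{E}_{\mathcal{D}}[\hat{f}_{\mathcal{D}}(x)]$ is its average over training sets, $\mu(x)$ is the true underlying function, and the test output $y$ is drawn independently of $\mathcal{D}$ with mean $\mu(x)$ and variance $\sigma^2$.

$$
\mathbb{E}_{\mathcal{D}}\,\mathbb{E}_{y}\!\left[\big(\hat{f}_{\mathcal{D}}(x) - y\big)^2\right]
= \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[\big(\hat{f}_{\mathcal{D}}(x) - \bar{f}(x)\big)^2\right]}_{\text{variance}}
+ \underbrace{\big(\bar{f}(x) - \mu(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\sigma^2}_{\text{noise}}
$$

The cross terms vanish because $y - \mu(x)$ has zero mean and is independent of $\mathcal{D}$, and because $\hat{f}_{\mathcal{D}}(x) - \bar{f}(x)$ has zero mean over training sets.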

#### Noise
- The data generation process includes the addition of noise, so there are multiple possible valid outputs $y$ for each input $x$.
- So even if the model exactly replicates the true underlying function (black line), the noise in the test data (gray points) means that some error will remain.
- **Can we reduce noise?** No. The noise component is irreducible: no choice of model or amount of training data removes it.

#### Bias 
- Bias arises when the model is not flexible enough to fit the true function perfectly.
- **We can reduce bias** by making the model more flexible (increasing the model capacity). In a neural network, this means adding more hidden units and/or hidden layers.

#### Variance

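- We have limited, noisy training data (orange points). When we fit the model, we don't recover the best possible function that the model could represent; the fitted function depends on the particular training sample.

- Variance is the error caused by this dependence: different training sets yield different learned functions, which scatter around the function the model would learn on average.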
  
>> In practice, there might also be additional variance due to the stochastic learning algorithm, which does not necessarily converge to the same solution each time.

- **We can reduce variance** by increasing the quantity of training data to average out the noisy samples

#### Bias-variance trade-off

- For a fixed-size training dataset, the variance term typically increases as the model capacity increases.
- Consequently, although increasing the model capacity reduces the bias, it does not necessarily reduce the test error. This is known as the bias-variance trade-off.

<img src="img/pic52.png" width=500 height=500 />
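The trade-off can be explored with a small simulation. The sketch below is a toy setup invented for illustration, not the experiment behind the figures: the true function is assumed to be $\sin(2\pi x)$, the model is piecewise constant with `n_bins` equal-width bins (so the number of bins plays the role of capacity), and bias and variance are estimated by refitting on many freshly sampled training sets of fixed size.

```python
import math
import random


def true_function(x):
    return math.sin(2 * math.pi * x)  # assumed ground truth for this toy example


def fit_bins(xs, ys, n_bins):
    """Fit a piecewise-constant model: the mean of y inside each bin."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    fallback = sum(ys) / len(ys)  # used for bins that received no training points
    return [sums[b] / counts[b] if counts[b] else fallback for b in range(n_bins)]


def bias_variance(n_bins, n_train=30, noise_std=0.3, n_trials=300, seed=0):
    """Estimate squared bias and variance, averaged over a grid of test inputs."""
    rng = random.Random(seed)
    test_xs = [(i + 0.5) / 50 for i in range(50)]
    predictions = []  # one row of test predictions per sampled training set
    for _ in range(n_trials):
        xs = [rng.random() for _ in range(n_train)]
        ys = [true_function(x) + rng.gauss(0.0, noise_std) for x in xs]
        heights = fit_bins(xs, ys, n_bins)
        predictions.append([heights[min(int(x * n_bins), n_bins - 1)] for x in test_xs])
    bias_sq = variance = 0.0
    for j, x in enumerate(test_xs):
        column = [row[j] for row in predictions]
        mean_pred = sum(column) / n_trials
        bias_sq += (mean_pred - true_function(x)) ** 2
        variance += sum((p - mean_pred) ** 2 for p in column) / n_trials
    return bias_sq / len(test_xs), variance / len(test_xs)


results = {n_bins: bias_variance(n_bins) for n_bins in (1, 4, 16)}
for n_bins, (bias_sq, variance) in results.items():
    print(f"bins={n_bins:2d}  bias^2={bias_sq:.3f}  variance={variance:.3f}")
```

With the training set size held fixed, the estimated squared bias should shrink as the number of bins grows, while the estimated variance should be largest for the most flexible model, because each bin is then fitted from very few noisy points.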