# Model Selection, Underfitting, and Overfitting

#### Overfitting - the phenomenon of fitting our training data more closely than we fit the underlying distribution (the training error is significantly lower than the validation error). A common cause of overfitting is excessive model complexity. To cope with overfitting we use regularization, for example dropout.
###### When we have simple models and abundant data, we expect the generalization error to resemble the training error. When we work with more complex models and fewer examples, we expect the training error to go down but the generalization gap to grow. 
#### Model selection - selecting our final model after evaluating several candidate models.
#### Why use validation data?
Validation dataset - the sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. It helps to cope with overfitting and can be used for model selection. Models with very few hyperparameters are easy to validate and tune, so you can probably reduce the size of your validation set. If your model has many hyperparameters, you will want a large validation set as well (although you should also consider cross-validation).
#### K-Fold Cross-Validation - a method used when training data is too scarce to hold out a proper validation set.
The original training data is split into K non-overlapping subsets. Then model training and validation are executed K times, each time training on K−1 subsets and validating on the remaining subset (the one not used for training in that round). Finally, the training and validation errors are estimated by averaging over the results from the K experiments.
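
The K-fold procedure described above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (`k_fold_indices`, `k_fold_cross_validate`) and the toy "predict the mean" model are hypothetical and chosen only to keep the example self-contained.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k non-overlapping, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def k_fold_cross_validate(xs, ys, k, fit, error):
    """Train k times on k-1 folds, validate on the held-out fold,
    and return the averaged training and validation errors."""
    folds = k_fold_indices(len(xs), k)
    train_errors, valid_errors = [], []
    for i, valid_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        x_tr, y_tr = [xs[j] for j in train_idx], [ys[j] for j in train_idx]
        x_va, y_va = [xs[j] for j in valid_idx], [ys[j] for j in valid_idx]
        model = fit(x_tr, y_tr)
        train_errors.append(error(model, x_tr, y_tr))
        valid_errors.append(error(model, x_va, y_va))
    return sum(train_errors) / k, sum(valid_errors) / k


# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def mse(model, xs, ys):
    return sum((model - y) ** 2 for y in ys) / len(ys)


xs = list(range(10))
ys = [float(x) for x in xs]
folds = k_fold_indices(len(xs), 5)
train_err, valid_err = k_fold_cross_validate(xs, ys, k=5, fit=fit_mean, error=mse)
print(train_err, valid_err)
```

Any model can be plugged in through the `fit` and `error` callables. The folds here are contiguous; shuffling the data first is usually advisable when its order is not random.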
#### Underfitting - the phenomenon when a model is not able to reduce the training error: both the training error and the validation error are high, and the gap between them is small (we have reason to believe that we could get away with a more complex model).
Whether a model overfits or underfits can depend both on the complexity of the model and on the size of the available training dataset.
#### NORMAL
![image.png](attachment:image.png)
#### UNDERFITTING
![image-2.png](attachment:image-2.png)
#### OVERFITTING
![image-3.png](attachment:image-3.png)


# Weight Decay

#### Overfitting can be mitigated by collecting more training data. 
However, it can be costly, time consuming, or entirely out of our control.
#### Regularization is a common method for dealing with overfitting. Regularization adds a complexity term to the loss function on the training set, so that more complex models incur a bigger loss. This reduces the complexity of the learned model.
One choice for keeping the model simple is weight decay using an L2 penalty.
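
As a rough sketch of weight decay, the snippet below adds an L2 penalty `lambd / 2 * sum(w_i ** 2)` to a squared-error loss for a linear model and trains it with full-batch gradient descent. The function names, the toy data, and the values of `lambd` and `lr` are assumptions made for illustration.

```python
def predict(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def penalized_loss(w, b, X, y, lambd):
    """Squared-error loss plus the L2 penalty on the weights (not the bias)."""
    n = len(X)
    data_loss = sum((predict(w, b, x) - yi) ** 2 for x, yi in zip(X, y)) / (2 * n)
    penalty = lambd / 2 * sum(wi ** 2 for wi in w)
    return data_loss + penalty


def gradient_step(w, b, X, y, lambd, lr):
    """One full-batch gradient descent step on the penalized loss."""
    n = len(X)
    residuals = [predict(w, b, x) - yi for x, yi in zip(X, y)]
    # The penalty contributes lambd * w[j] to the gradient: the "decay" term.
    grad_w = [
        sum(r * x[j] for r, x in zip(residuals, X)) / n + lambd * w[j]
        for j in range(len(w))
    ]
    grad_b = sum(residuals) / n
    new_w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
    return new_w, b - lr * grad_b


def train(X, y, lambd, lr=0.1, steps=2000):
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        w, b = gradient_step(w, b, X, y, lambd, lr)
    return w, b


# Toy data generated by y = 2 * x.
X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 6.0]
w_plain, b_plain = train(X, y, lambd=0.0)
w_decay, b_decay = train(X, y, lambd=1.0)
print(w_plain, w_decay)
```

With the penalty switched on, the learned weight is pulled toward zero compared with the unpenalized fit, which is the intended effect of weight decay.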
#### L1 (Lasso Regression) - adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function.
![image-4.png](attachment:image-4.png)
When the absolute value is used as the loss on the errors, minimizing it tries to estimate the median of the data.
Take an arbitrary candidate value and move it in one direction. The points on the side we move toward then contribute less to the loss, while the points on the other side contribute more. The loss stops decreasing only when equally many points lie on each side of the candidate. Therefore, to minimize the loss function, we should pick a value that lies in the middle of the data distribution, and that value is the median.
#### L2 (Ridge Regression) - adds the “squared magnitude” of the coefficients as a penalty term to the loss function.
![image-3.png](attachment:image-3.png)
When the squared value is used as the loss on the errors, minimizing it tries to estimate the mean of the data.
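
The median and mean claims can be checked numerically. The sketch below treats the absolute and squared values as losses on the errors between a single candidate value and each data point, then searches a grid of candidates by brute force. The data and the grid are made up for illustration.

```python
import statistics


def total_abs_error(c, data):
    """L1-style loss: sum of absolute deviations from the candidate c."""
    return sum(abs(x - c) for x in data)


def total_sq_error(c, data):
    """L2-style loss: sum of squared deviations from the candidate c."""
    return sum((x - c) ** 2 for x in data)


# One large outlier pulls the mean far away from the median.
data = [1.0, 2.0, 3.0, 4.0, 100.0]

# Brute-force search over candidates 0.0, 0.1, ..., 100.0.
candidates = [i / 10 for i in range(0, 1001)]
best_l1 = min(candidates, key=lambda c: total_abs_error(c, data))
best_l2 = min(candidates, key=lambda c: total_sq_error(c, data))

print(best_l1, statistics.median(data))
print(best_l2, statistics.mean(data))
```

The outlier barely affects the minimizer of the absolute error, while it shifts the minimizer of the squared error considerably.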
#### Training without Regularization
![image.png](attachment:image.png)
#### Using Weight Decay
![image-2.png](attachment:image-2.png)
#### Co-adaptation - the phenomenon when the powerful connections are learned more while the weaker ones are ignored. 
Over many iterations, only a fraction of the node connections is trained, and the rest stop participating.




# Dropout

#### Bias - the error caused by incorrect (overly simple) assumptions in the learning algorithm. High bias can lead to underfitting. This is distinct from the bias term of a neuron, which shifts the activation function.
#### Variance - the sensitivity of a model to fluctuations in the training data. In practice it shows up as the difference between the validation error and the training error. The bigger the difference, the more likely the model is overfitting.
#### Bias-variance trade-off - the conflict that arises when trying to minimize bias and variance simultaneously: reducing one tends to increase the other.
#### Dropout - a separate layer that randomly shuts down some neurons during training so that they do not participate in predictions.
This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and no weight updates are applied to those neurons on the backward pass.
#### This causes the layer to look like a layer with a different number of nodes
![image.png](attachment:image.png)
Typically, we disable dropout at test time. Here (1-p) is the drop rate, that is, p is the probability of keeping a unit. For intermediate layers of large networks, (1-p) = 0.5 is a common choice. For the input layer, (1-p) should be kept at about 0.2 or lower, because dropping too much of the input data can adversely affect training.
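
A minimal sketch of a dropout layer on a list of activations follows. The parameter `drop_prob` is the drop rate. The rescaling of the surviving activations by `1 / (1 - drop_prob)` is the common "inverted dropout" variant and is an assumption here, since the notes above do not specify it. The function name and the test vector are hypothetical.

```python
import random


def dropout(activations, drop_prob, training=True, rng=None):
    """Zero each activation with probability drop_prob during training.

    Survivors are scaled by 1 / (1 - drop_prob) so that the expected
    value of every activation is unchanged (inverted dropout).
    At test time (training=False) the input is returned unchanged.
    """
    if not 0.0 <= drop_prob <= 1.0:
        raise ValueError("drop_prob must be in [0, 1]")
    if not training or drop_prob == 0.0:
        return list(activations)
    if drop_prob == 1.0:
        return [0.0 for _ in activations]
    rng = rng or random
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
hidden = [1.0] * 10000
train_out = dropout(hidden, drop_prob=0.5, rng=rng)
test_out = dropout(hidden, drop_prob=0.5, training=False)

print(sum(train_out) / len(train_out))  # close to 1.0 on average
print(test_out == hidden)
```

Because of the rescaling, nothing needs to change at test time: the layer simply passes its input through.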

# Forward Propagation, Backward Propagation, and Computational Graphs

#### Forward propagation (forward pass) - we feed an input to the neural network and it produces a prediction. At this point we do not know whether the prediction is right.
The last step in the forward pass is to evaluate the predicted result s against the expected result y.
This evaluation is done by means of a cost function C.
Based on the value of C, the model "knows" how much it needs to adjust its parameters to get closer to the expected result y. This is done using backpropagation.
![image.png](attachment:image.png)
#### Backward propagation (backpropagation) - repeatedly adjusts the weights of the connections in the network to minimize the difference between the actual output vector of the network and the desired output vector.
Backpropagation aims to minimize the cost function by adjusting the network's weights and biases. The size of each adjustment is determined by the gradients of the cost function with respect to those parameters.

Backpropagation reuses the stored intermediate values from forward propagation to avoid duplicate calculations. One of the consequences is that we need to retain the intermediate values until backpropagation is complete. This is also one of the reasons why training requires significantly more memory than plain prediction.
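
To make the forward and backward passes concrete, here is a sketch for the smallest possible "network": a single sigmoid neuron with prediction s, target y and cost C = (s - y)² / 2. The sigmoid activation and the squared-error cost are assumptions chosen for illustration. The point to notice is that `backward` reuses the values cached by `forward` instead of recomputing them.

```python
import math


def forward(x, y, w, b):
    """Forward pass: compute the prediction s and the cost C, caching intermediates."""
    z = w * x + b
    s = 1.0 / (1.0 + math.exp(-z))   # prediction
    cost = 0.5 * (s - y) ** 2        # cost C comparing s with y
    cache = {"x": x, "y": y, "s": s}
    return cost, cache


def backward(cache):
    """Backward pass: chain rule using the cached values from the forward pass."""
    x, y, s = cache["x"], cache["y"], cache["s"]
    dC_ds = s - y
    ds_dz = s * (1.0 - s)            # derivative of the sigmoid
    dC_dz = dC_ds * ds_dz
    return dC_dz * x, dC_dz          # dC/dw, dC/db


x, y, w, b = 1.5, 1.0, 0.3, -0.1
cost_before, cache = forward(x, y, w, b)
dw, db = backward(cache)

# One gradient step with a hypothetical learning rate.
lr = 0.5
w_new, b_new = w - lr * dw, b - lr * db
cost_after, _ = forward(x, y, w_new, b_new)
print(cost_before, cost_after)
```

The `cache` dictionary is the reason training needs more memory than prediction: it must be kept alive until the backward pass has finished.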

# Numerical Stability and Initialization

#### Vanishing gradients problem - describes the situation where a deep multilayer feed-forward network or a recurrent neural network is unable to propagate useful gradient information from the output end of the model back to the layers near the input end of the model.
Vanishing and exploding gradients are common issues in deep networks. Great care in parameter initialization is required to ensure that gradients and parameters remain well controlled.
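
A simplified scalar illustration of why depth causes trouble: by the chain rule, the gradient reaching an early layer is a product of one factor per layer. The sketch below ignores the weight matrices and only multiplies per-layer factors, so it should be read as intuition rather than as a model of a real network. The layer counts and the factor 1.5 are arbitrary choices.

```python
import math


def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum value is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def chained_gradient(per_layer_factors):
    """Multiply the per-layer factors, as the chain rule does across layers."""
    gradient = 1.0
    for factor in per_layer_factors:
        gradient *= factor
    return gradient


# 20 sigmoid layers, each contributing at most 0.25: the product collapses.
vanishing = chained_gradient([sigmoid_grad(0.0)] * 20)

# 50 layers, each amplifying by a hypothetical factor of 1.5: the product blows up.
exploding = chained_gradient([1.5] * 50)

print(vanishing, exploding)
```

Factors consistently below 1 shrink the gradient exponentially with depth, and factors consistently above 1 grow it exponentially, which is why initialization has to keep them close to 1.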

Initialization heuristics are needed to ensure that the initial gradients are neither too large nor too small.

ReLU activation functions mitigate the vanishing gradient problem. This can accelerate convergence.

Random initialization is key to ensure that symmetry is broken before optimization.

Xavier initialization suggests that, for each layer, variance of any output is not affected by the number of inputs, and variance of any gradient is not affected by the number of outputs.
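
Keeping the output variance fixed would require `n_in * sigma**2 = 1`, and keeping the gradient variance fixed would require `n_out * sigma**2 = 1`. Both cannot hold at once unless `n_in == n_out`, so Xavier initialization uses the compromise `sigma**2 = 2 / (n_in + n_out)`. A minimal sketch with Gaussian weights follows. The function name and the layer sizes are hypothetical.

```python
import math
import random


def xavier_normal(n_in, n_out, rng):
    """Return an n_in x n_out weight matrix with zero-mean Gaussian entries
    of variance 2 / (n_in + n_out)."""
    std = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n_in, n_out = 200, 100
weights = xavier_normal(n_in, n_out, rng)

# Compare the empirical variance of the sampled weights with the target.
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
empirical_var = sum((w - mean) ** 2 for w in flat) / len(flat)
target_var = 2.0 / (n_in + n_out)
print(empirical_var, target_var)
```

Drawing the weights at random also breaks the symmetry between units in the same layer, which constant initialization would not do.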
#### Exploding gradients are a problem where large error gradients accumulate and result in very large updates to neural network model weights during training.
This has the effect of your model being unstable and unable to learn from your training data.
An error gradient is the direction and magnitude calculated during the training of a neural network that is used to update the network weights in the right direction and by the right amount. Possible remedies include careful initialization or a smaller batch size.

# Environment and Distribution Shift

#### Distribution shift - when the training and test sets do not come from the same distribution.
#### The risk - the expectation of the loss over the entire population of data drawn from their true distribution. 
However, this entire population is usually unavailable. 
#### Empirical risk - an average loss over the training data to approximate the risk. 
In practice, we perform empirical risk minimization.
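
Empirical risk minimization can be sketched as follows: compute the average loss of each candidate model on the training examples and keep the candidate with the lowest average. The squared-error loss, the one-parameter linear models and the toy data are assumptions chosen for illustration.

```python
def empirical_risk(loss, predict, examples):
    """Average loss of `predict` over the available (x, y) examples."""
    return sum(loss(predict(x), y) for x, y in examples) / len(examples)


def squared_loss(prediction, target):
    return (prediction - target) ** 2


# Toy training set generated by y = 2 * x.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# Hypothetical candidate models: y = slope * x for a few slopes.
candidate_slopes = [1.0, 2.0, 3.0]
risks = {
    slope: empirical_risk(squared_loss, lambda x, s=slope: s * x, examples)
    for slope in candidate_slopes
}

# Empirical risk minimization: pick the candidate with the lowest average loss.
best_slope = min(risks, key=risks.get)
print(risks, best_slope)
```

The true risk would require an expectation over the whole population. The average over `examples` only approximates it, which is why a low empirical risk does not guarantee good generalization.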
#### Internal covariate shift - the change in the distribution of network activations due to the change in network parameters during training.
In neural networks, the output of the first layer feeds into the second layer, the output of the second layer feeds into the third, and so on. When the parameters of a layer change, so does the distribution of inputs to subsequent layers.
Under the corresponding assumptions, covariate and label shift can be detected and corrected for at test time. Failure to account for this bias can become problematic at test time.
#### Batch normalization is a method intended to mitigate internal covariate shift for neural networks.
#### Batch normalization - a method used to make artificial neural networks faster and more stable through normalization of the layers' inputs by re-centering and re-scaling.
Batch normalization normalizes a layer's input by subtracting the mini-batch mean and dividing by the mini-batch standard deviation. A mini-batch is the subset of the training data supplied to the network in a single training step.
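
The normalization step can be sketched in a few lines. This simplified version only re-centers and re-scales each feature of one mini-batch. It leaves out the learnable scale and shift parameters and the running statistics used at prediction time, which full batch normalization layers also include. The small constant `eps` guards against division by zero and its value is an assumption.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize each feature (column) of a mini-batch to roughly
    zero mean and unit variance."""
    n = len(batch)
    n_features = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(n_features)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n
        for j in range(n_features)
    ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(n_features)]
        for row in batch
    ]


# A mini-batch of 3 examples with 2 features on very different scales.
mini_batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = batch_norm(mini_batch)
for row in normalized:
    print(row)
```

After normalization both features live on the same scale, regardless of how different their original ranges were.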