# Overfitting and underfitting
- While minimizing the training error, it is crucial to also keep the gap between training error and test error small, so that the ML algorithm generalizes to previously unseen examples
- When one of these two goals is missed, two situations can arise:
  - Overfitting: the model is too complex for the learning problem. It performs well on the training set but poorly on the test set (large gap between training and test error).
  - Underfitting: the model is too simple for the learning problem. Its error is already too large on the training set.

![Screenshot from 2023-06-19 17-44-32.png](<attachment:Screenshot from 2023-06-19 17-44-32.png>)

**Formal definition**  
A hypothesis $h$ overfits a training set if there is another hypothesis $h'$ that performs worse (e.g. in terms of accuracy) on the training set, but better on the entire dataset, including unseen examples

### Solution 1 to over(under)fitting: modifying the **CAPACITY** 
- Capacity indicates the ability to fit a wide variety of functions
  - Low capacity tends to underfit (few parameters)
  - High capacity tends to overfit (too many parameters)
- We can regulate the capacity by changing the hypothesis space of a model
  - Example: extending linear regression to polynomial regression enlarges the hypothesis space
  - When the hypothesis space becomes too powerful, we overfit
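
The effect of capacity can be sketched by fitting polynomials of increasing degree to the same noisy data. This is only an illustrative sketch: the target function (`sin`), the noise level, the sample sizes and the degrees 1, 3 and 9 are all assumptions chosen for the example, not part of these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical learning problem: noisy samples of sin(pi * x) on [-1, 1].
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    # Mean squared error of the polynomial with the given coefficients.
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 9):  # low, medium and high capacity
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree={degree}: train MSE={results[degree][0]:.4f}, "
          f"test MSE={results[degree][1]:.4f}")
```

The training error can only go down as the degree grows, because each hypothesis space contains the previous one. The test error usually behaves differently: it tends to be high for degree 1 (underfitting) and to rise again for high degrees (overfitting), although the exact numbers depend on the random sample.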

### Solution 2 to overfitting: Regularization
- The basic idea is that regularization "deactivates" some features by making their corresponding weights very small
- It penalizes big weights, unless they significantly improve (i.e. reduce) the cost function
- Small weights are preferable to avoid overfitting because a small change in the inputs then barely changes the network's output
  - This means that, by regularizing, we make the network more robust to small noise

![Screenshot from 2023-06-19 19-03-25.png](<attachment:Screenshot from 2023-06-19 19-03-25.png>)

**But how do we achieve this in practice?**  
- By adding a so-called regularization term to the objective function
- $\lambda$ controls how strongly large weights (the ones that tend to cause overfitting) are penalized
  - $\lambda \uparrow \implies w \downarrow \implies$ lower variance (towards a flat line) $\implies$ less overfitting
  - $\lambda \downarrow \implies w \uparrow \implies$ lower bias (towards a wiggly curve) $\implies$ more overfitting
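
The relation between $\lambda$ and the size of the weights can be checked on a small example. The sketch below assumes an L2 penalty, i.e. the objective $\frac{1}{2m}\lVert Xw - y\rVert^2 + \frac{\lambda}{2m}\lVert w\rVert^2$, and uses its closed-form minimizer $w = (X^\top X + \lambda I)^{-1} X^\top y$. The data, the degree-9 polynomial features and the helper name `ridge_fit` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20
x = rng.uniform(-1.0, 1.0, size=m)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=m)

# Polynomial features x, x^2, ..., x^9 (no intercept, to keep the sketch short).
X = np.column_stack([x ** p for p in range(1, 10)])

def ridge_fit(X, y, lam):
    # Closed-form minimizer of the L2-regularized least-squares objective.
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

lambdas = [1e-3, 1e-1, 10.0]
norms = [float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in lambdas]

for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:g}: ||w|| = {norm:.4f}")
```

As $\lambda$ grows, the norm of the weight vector shrinks, which pushes the fitted curve towards a flatter, lower-variance function.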

![Screenshot from 2023-06-20 10-34-35.png](<attachment:Screenshot from 2023-06-20 10-34-35.png>)

**And the gradient?**
- We will add a term $\frac{\lambda}{m}w_j$ to $\frac{\partial J}{\partial w_j}$
- This has the effect of shrinking $w_j$ at each update
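
A minimal sketch of one gradient descent update with this extra term, assuming a linear model with cost $\frac{1}{2m}\lVert Xw - y\rVert^2 + \frac{\lambda}{2m}\lVert w\rVert^2$ (so that the penalty contributes exactly $\frac{\lambda}{m}w_j$ to the gradient). The function name and the learning rate `alpha` are hypothetical.

```python
import numpy as np

def regularized_gradient_step(w, X, y, lam, alpha):
    """One gradient descent update for L2-regularized linear regression."""
    m = X.shape[0]
    grad_data = X.T @ (X @ w - y) / m  # gradient of the unregularized cost
    grad_reg = (lam / m) * w           # extra term coming from the penalty
    return w - alpha * (grad_data + grad_reg)

# Usage on a tiny made-up problem.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.5])
w_new = regularized_gradient_step(w, X, y, lam=1.0, alpha=0.01)
```

Rearranging the update gives $w_j \leftarrow w_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{\partial J_{\text{data}}}{\partial w_j}$: each weight is first multiplied by a factor slightly smaller than 1 and then moved along the usual gradient, which is why this is often called weight decay.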

### Solution 3 to overfitting: Dropout