In [1]:
import utils
import warnings

warnings.filterwarnings('ignore')
utils.set_css_style('style.css')

# 2. Bias & Variance

Let's look at the following simple example. If you fit a straight line to the data, for instance with a simple linear regression, the fit is not very good. This is a problem of high bias; we also say that the model is underfitting the data. At the opposite end, if you fit an incredibly complex regressor, such as a deep neural network or a neural network with many hidden units, you may fit the training data perfectly, but that doesn't look like a great fit either. Such a regressor has high variance, and we say that the model is overfitting the data. A model in between, with a medium level of complexity, fits the data well.

As mentioned earlier, in machine learning we usually split our data into two subsets, training data and testing data (and sometimes into three: train, validate and test). We fit our model on the training data in order to make predictions on the test data. When we do that, one of two things might happen: we overfit our model or we underfit it.

## 2.1. Overfitting

**Overfitting** means that the model has been trained “too well” and now fits the training dataset too closely. This is also called a **high variance** problem. It usually happens when the model is too complex (for example, too many features/variables compared to the number of observations). Such a model will be very accurate on the training data but will probably be inaccurate on new, unseen data. The model does not generalize, so it cannot make reliable predictions on new data, which is, ultimately, what you are trying to do. Basically, when this happens, the model learns the “noise” in the training data instead of the actual relationships between variables in the data.

## 2.2. Underfitting

In contrast to overfitting, an **underfitted** model does not fit the training data well and therefore misses the trends in the data. This is also called a **high bias** problem. It also means the model cannot generalize to new data. As you probably guessed (or figured out!), this is usually the result of a model that is too simple (not enough predictors/independent variables). It can also happen when, for example, we fit a linear model (like linear regression) to data that is not linear. It almost goes without saying that this model will have poor predictive ability even on the training data.

<img src="figures/bias-variance.png" alt="bias-variance" style="width: 700px;"/>

One way to check if your model is suffering from high bias or high variance is to compare the training error with the validation set error.


## 2.3. Bias & variance diagnosis

We need to distinguish whether bias or variance is the problem contributing to bad predictions.

The training error will tend to decrease as we increase the complexity of the model (for example, the degree $d$ of the polynomial in a polynomial regression model). The validation/test error, by contrast, will tend to decrease as complexity increases up to a point and then increase again, forming a U-shaped curve.

* **High bias** (underfitting): both $J_{train}(\theta)$ and $J_{test}(\theta)$ will be high. Also, $J_{test}(\theta) \approx J_{train}(\theta)$.

* **High variance** (overfitting): $J_{train}(\theta)$ will be low and $J_{test}(\theta)$ will be much greater than $J_{train}(\theta)$.

This is summarized in the figure below:

<img src="figures/bias-variance-diagnosis.png" alt="bias-variance-diagnosis" style="width: 400px;"/>
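The behaviour of the two errors can be reproduced numerically. The following is a minimal sketch, assuming a synthetic noisy sine dataset and a hold-out validation set (neither comes from this notebook); it fits polynomials of increasing degree $d$ with `np.polyfit` and reports $J_{train}$ and the validation cost for each.

```python
import numpy as np

# Synthetic 1-D data (an assumption for illustration): a noisy sine wave.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=30)
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=30)

# Hold out the last 10 points as a validation set.
x_train, y_train = x[:20], y[:20]
x_val, y_val = x[20:], y[20:]


def cost(coeffs, x, y):
    """Half mean squared error, J = 1/(2m) * sum((h(x) - y)^2)."""
    residuals = np.polyval(coeffs, x) - y
    return float(np.mean(residuals ** 2) / 2.0)


degrees = list(range(1, 9))
j_train, j_val = [], []
for d in degrees:
    coeffs = np.polyfit(x_train, y_train, deg=d)  # least-squares polynomial fit
    j_train.append(cost(coeffs, x_train, y_train))
    j_val.append(cost(coeffs, x_val, y_val))

# The degree with the lowest validation cost is a reasonable complexity choice.
best_degree = degrees[int(np.argmin(j_val))]
for d, jt, jv in zip(degrees, j_train, j_val):
    print(f"d={d}  J_train={jt:.4f}  J_val={jv:.4f}")
print("degree with lowest validation error:", best_degree)
```

Because each polynomial family contains the previous one, the training cost can only go down as $d$ grows, while the validation cost is free to rise again once the model starts fitting noise.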



## 2.4. Bias & variance correction

In the previous section, we saw how looking at the training error and the validation error can help you diagnose whether your algorithm has a bias problem, a variance problem, or maybe both. Knowing whether your model is overfitting or underfitting your data helps you take the right measures to improve your algorithm's performance systematically.

If your algorithm has a high bias, the following are some of the possible remedies:
* Try to make your model more complex.
* Add more features if possible.
* Try a different model that is suitable for your data.
* Train your model longer.

On the other hand, if your algorithm has a high variance, you can:
* Get more data.
* Use regularization.
* Try a different model that is suitable for your data.

<img src="figures/bias-variance-tradeoff.jpeg" alt="bias-variance" style="width: 600px;"/>
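The diagnosis and the remedies listed above can be tied together in a small helper. This is only a sketch: the function name `diagnose`, the `target_error` and `gap_tolerance` thresholds, and the example numbers are all hypothetical and would have to be chosen for your own problem.

```python
# Remedies copied from the lists above, keyed by diagnosis.
REMEDIES = {
    "high bias": [
        "make the model more complex",
        "add more features",
        "try a different model",
        "train longer",
    ],
    "high variance": [
        "get more data",
        "use regularization",
        "try a different model",
    ],
}


def diagnose(j_train, j_val, target_error, gap_tolerance):
    """Classify a model from its training and validation errors.

    target_error and gap_tolerance are problem-specific assumptions:
    a training error above target_error is treated as high bias, and a
    validation/training gap above gap_tolerance as high variance.
    """
    problems = []
    if j_train > target_error:
        problems.append("high bias")
    if j_val - j_train > gap_tolerance:
        problems.append("high variance")
    return problems


# Hypothetical error values, for illustration only.
for j_train, j_val in [(0.30, 0.32), (0.02, 0.25), (0.30, 0.50), (0.05, 0.07)]:
    problems = diagnose(j_train, j_val, target_error=0.10, gap_tolerance=0.05)
    print(j_train, j_val, problems or ["ok"])
    for problem in problems:
        print("   ", problem, "->", ", ".join(REMEDIES[problem]))
```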




## 2.5. Regularization

Consider the problem of predicting $y$ from $x \in \mathbb{R}$. The leftmost figure below shows the result of fitting $y = \theta_0+\theta_1 x$ to a dataset. We see that the data doesn’t really lie on a straight line, so the fit is not very good.

Instead, if we add an extra feature $x^2$ and fit $y = \theta_0 + \theta_1 x + \theta_2 x^2$, we obtain a slightly better fit to the data (see the middle figure). Naively, it might seem that the more features we add, the better. However, there is also a danger in adding too many features. The rightmost figure is the result of fitting a $5^{th}$-order polynomial:

\begin{equation}
h_{\theta}(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \theta_5 x^5  
\end{equation}

We see that even though the fitted curve passes through the data perfectly, we would not expect this to be a very good predictor. This is a problem of **overfitting**.
 
<img src="figures/reg_example.png" alt="reg-example" style="width: 700px;"/>
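The three fits can be imitated with `np.polyfit`. The six data points below are made up for illustration and are not the ones plotted in the figure; the point is only that a $5^{th}$-order polynomial has six parameters and can therefore pass through six points exactly.

```python
import numpy as np

# Six made-up (x, y) points standing in for the dataset in the figure.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.6, 3.1, 3.9, 4.0, 4.6])

train_mse = {}
for degree in (1, 2, 5):
    coeffs = np.polyfit(x, y, deg=degree)       # least-squares polynomial fit
    residuals = np.polyval(coeffs, x) - y       # errors on the training points
    train_mse[degree] = float(np.mean(residuals ** 2))
    print(f"degree {degree}: training MSE = {train_mse[degree]:.2e}")
```

The training error of the degree-5 fit is zero up to rounding, which says nothing about how well it would predict $y$ for a new $x$.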
  
If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their corresponding cost. Let's suppose we want to reduce the influence of $\theta_4 x^4$ and $\theta_5 x^5$. Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function, and the optimisation problem becomes:
  
\begin{equation}
\min_{\theta} \dfrac {1}{2m} \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2 + 1000 \times \theta_4^2 + 1000 \times \theta_5^2
\end{equation}

The reason we've added two extra terms at the end is to inflate the cost of $\theta_4$ and $\theta_5$ in order to reduce the impact of the corresponding features. Now, in order for the cost function to get close to zero, we will have to reduce the values of $\theta_4$ and $\theta_5$ to near zero. As a result, the higher-order terms contribute very little, and we may see that the new hypothesis follows the underlying trend of the data better.
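For a linear-in-parameters hypothesis, this penalized problem can be solved in closed form. Below is a minimal sketch on made-up data (an assumption, not the notebook's dataset): setting the gradient of the cost to zero gives $\left(\frac{1}{m} X^T X + 2P\right)\theta = \frac{1}{m} X^T y$, where $P$ is a diagonal matrix holding the penalty weight of each parameter.

```python
import numpy as np

# Made-up data, an assumption for illustration only.
x = np.linspace(0.0, 1.0, 6)
y = np.array([1.0, 2.6, 3.1, 3.9, 4.0, 4.6])
m = len(x)

# Design matrix with columns 1, x, ..., x^5, so theta[j] multiplies x^j.
X = np.vander(x, N=6, increasing=True)

# Plain least squares: minimizes 1/(2m) * sum((X @ theta - y)^2).
theta_plain, *_ = np.linalg.lstsq(X, y, rcond=None)

# Penalized problem: the cost gains 1000 * theta_4^2 + 1000 * theta_5^2.
# Zero gradient  <=>  (X^T X / m + 2 P) theta = X^T y / m.
P = np.diag([0.0, 0.0, 0.0, 0.0, 1000.0, 1000.0])
theta_pen = np.linalg.solve(X.T @ X / m + 2.0 * P, X.T @ y / m)

print("theta_4, theta_5 without penalty:", theta_plain[4:])
print("theta_4, theta_5 with penalty:   ", theta_pen[4:])
```

With the penalty in place, $\theta_4$ and $\theta_5$ are driven very close to zero, so the hypothesis behaves almost like a cubic even though its form is unchanged.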
 
We could also **regularize** all of our $\theta$ parameters in a single summation as:

\begin{equation}
\min_{\theta} \dfrac {1}{2m} \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2 + \lambda \sum_{j=1}^n \theta_j^2
\end{equation}

 
The $\lambda$, or lambda, is the **regularization parameter**. It determines how much the costs of our $\theta$ parameters are inflated.

Using the above cost function with the extra summation, we can smooth the output of our hypothesis function to reduce overfitting. If $\lambda$ is chosen to be too large, it may smooth out the function too much and cause underfitting. 
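The effect of $\lambda$ can be checked numerically. The sketch below assumes a synthetic noisy sine dataset and a degree-5 polynomial hypothesis; `fit_l2` is a hypothetical helper that solves the regularized problem in closed form, leaving $\theta_0$ unpenalized because the sum starts at $j=1$.

```python
import numpy as np

# Synthetic data (an assumption for illustration): a noisy sine wave.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=x.size)

X = np.vander(x, N=6, increasing=True)  # columns 1, x, ..., x^5


def fit_l2(X, y, lam):
    """Minimize 1/(2m) * ||X theta - y||^2 + lam * sum_{j>=1} theta_j^2."""
    m, n_params = X.shape
    D = np.eye(n_params)
    D[0, 0] = 0.0  # theta_0 is not penalized
    # Zero gradient  <=>  (X^T X / m + 2 lam D) theta = X^T y / m.
    return np.linalg.solve(X.T @ X / m + 2.0 * lam * D, X.T @ y / m)


lambdas = [1e-4, 1e-2, 1.0, 100.0]
norms, train_costs = [], []
for lam in lambdas:
    theta = fit_l2(X, y, lam)
    norms.append(float(np.sum(theta[1:] ** 2)))
    train_costs.append(float(np.mean((X @ theta - y) ** 2) / 2.0))
    print(f"lambda={lam:<8g} sum(theta_j^2)={norms[-1]:.4f}  J_train={train_costs[-1]:.4f}")
```

As $\lambda$ grows, the parameters shrink and the training cost rises; for the largest value the fit is close to a constant, which is the underfitting regime described above.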

There are many other regularization techniques; the one shown above is known as L2 regularization. Regularization techniques include:

* Early Stopping
* Parameter Norm Penalties 
    * L1 regularization
    * L2 regularization
    * Max-norm regularization
* Dataset Augmentation
* Noise Robustness (Dropout, ...)
* Sparse Representations
* ...

For more information, you can [check this article](https://medium.com/inveterate-learner/deep-learning-book-chapter-7-regularization-for-deep-learning-937ff261875c)!

## 2.6. Learning curves

Learning curves are often very useful to plot, whether you want to sanity check that your algorithm is working correctly or you want to improve its performance.

Learning Curve Theory:

* A learning curve is a graph that compares the performance of a model on training and testing data over a varying number of training instances
* We should generally see performance on the testing data improve as the number of training points increases
* By plotting the training and testing performance separately:
    * We can get an idea of how well the model can generalize to new data
* A learning curve lets us see when a model has learned as much as it can about the data. When this occurs:
    * The performances on the training and testing sets reach a plateau
    * There is a consistent gap between the two error rates
* The key is to find the sweet spot that balances bias and variance by choosing the right level of model complexity
* Of course, with more data any model can improve, and a different model may become optimal

Types of learning curves:

* **Bad Learning Curve: High Bias**
    - When training and testing errors converge and are high
    - No matter how much data we feed the model, the model cannot represent the underlying relationship and has high systematic errors
    - Poor fit
    - Poor generalization
* **Bad Learning Curve: High Variance**
    - When there is a large gap between the training and testing errors
    - Requires more data to improve
    - Can simplify the model with fewer or less complex features
* **Ideal Learning Curve**
    - Model that generalizes to new data
    - Testing and training learning curves converge at similar values
    - The smaller the gap, the better our model generalizes
    
    
The following example is a typical case of high variance:

<img src="figures/learning-curve-high-variance.jpg" alt="learning-curve-high-variance" style="width: 500px;"/>

The following diagram is a typical case of high bias:

<img src="figures/learning-curve-high-bias.jpg" alt="learning-curve-high-bias" style="width: 500px;"/>
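Learning curves like these can be computed by training on progressively larger subsets of the training data. The sketch below assumes a synthetic noisy sine dataset and polynomial models fitted with `np.polyfit`; `learning_curve` here is a hypothetical helper written for this example, not a library function.

```python
import numpy as np

# Synthetic data (an assumption for illustration): a noisy sine wave.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=60)
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=60)
x_train, y_train = x[:40], y[:40]
x_val, y_val = x[40:], y[40:]


def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


def learning_curve(degree, sizes):
    """Fit on the first m training points for each m; return (train, val) errors."""
    train_err, val_err = [], []
    for m in sizes:
        coeffs = np.polyfit(x_train[:m], y_train[:m], deg=degree)
        train_err.append(mse(coeffs, x_train[:m], y_train[:m]))
        val_err.append(mse(coeffs, x_val, y_val))
    return train_err, val_err


# A straight line is too simple for this data: the high-bias case.
sizes = [2, 5, 10, 20, 30, 40]
train_err, val_err = learning_curve(degree=1, sizes=sizes)

# A degree-8 polynomial is much more flexible: the higher-variance case.
sizes_flex = [12, 20, 30, 40]
train_flex, val_flex = learning_curve(degree=8, sizes=sizes_flex)

for m, tr, va in zip(sizes, train_err, val_err):
    print(f"degree 1, m={m:2d}: train={tr:.3f}  val={va:.3f}")
for m, tr, va in zip(sizes_flex, train_flex, val_flex):
    print(f"degree 8, m={m:2d}: train={tr:.3f}  val={va:.3f}")
```

For the straight line, the training error starts at zero with two points and then climbs to a high plateau that more data does not lower. Plotting each pair of lists against the training set size gives curves of the kind shown in the figures.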
