# Machine Learning: Week 6 - Advice for Applying Machine Learning
## Deciding What to Try Next
When your hypothesis makes large errors in its predictions, you can troubleshoot by:
- Getting more training examples
- Trying smaller sets of features
- Trying additional features
- Trying polynomial features
- Increasing or decreasing $\lambda$

Don’t just pick one of these avenues at random. We’ll explore diagnostic techniques for choosing one of the above solutions in the following sections.

## Evaluating a Hypothesis
A hypothesis may have a low error for the training examples but still be inaccurate (because of overfitting). Thus, to evaluate a hypothesis, given a dataset of training examples, we can split up the data into two sets: a **training set** and a **test set**. Typically, the training set consists of $70\%$ of your data and the test set is the remaining $30\%$.

The new procedure using these two sets is then:
- Learn $\Theta$ by minimizing $J_{train}(\Theta)$ on the training set
- Compute the test set error $J_{test}(\Theta)$
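
The two steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the synthetic data, the helper names `split_train_test` and `squared_error_cost`, and the use of `np.linalg.lstsq` to minimize the training cost are all assumptions made for the example. The cost used is the unregularized squared-error cost with the $\frac{1}{2m}$ factor.

```python
import numpy as np

def split_train_test(X, y, train_fraction=0.7, seed=0):
    """Shuffle the examples, then split them into training and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_train = int(round(train_fraction * len(y)))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

def squared_error_cost(theta, X, y):
    """J(theta) = 1/(2m) * sum((X @ theta - y)^2), with no regularization term."""
    residuals = X @ theta - y
    return float(residuals @ residuals) / (2 * len(y))

# Synthetic data (an assumption for illustration): y = 1 + 2x plus noise.
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=100)
X = np.column_stack([np.ones_like(x), x])  # prepend the bias column x_0 = 1
y = 1 + 2 * x + rng.normal(scale=0.1, size=100)

X_train, y_train, X_test, y_test = split_train_test(X, y)

# Least squares finds the theta that minimizes J_train.
theta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

j_train = squared_error_cost(theta, X_train, y_train)
j_test = squared_error_cost(theta, X_test, y_test)
print(f"J_train = {j_train:.4f}, J_test = {j_test:.4f}")
```

Shuffling before splitting matters when the data is stored in some meaningful order; otherwise the test set may not resemble the training set.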

### The test set error
1. For linear regression: 

$$\large
J_{test}( \Theta ) =\frac{1}{2m_{test}}\sum _{i=1}^{m_{test}}\left( h_{\Theta }\left( x_{test}^{( i)}\right) \ -\ y_{test}^{( i)}\right)^{2}
$$

2. For classification, the misclassification error (aka $0/1$ misclassification error):

$$\large
err( h_{\Theta }( x) ,y) =\begin{cases}
1 & \text{if } \left( h_{\Theta }( x) \geqslant 0.5 \text{ and } y=0\right) \text{ or } \left( h_{\Theta }( x) < 0.5 \text{ and } y=1\right)\\
0 & \text{otherwise}
\end{cases}
$$

This gives us a binary $0$ or $1$ error result based on a misclassification. The average test error for the test set is:

$$\large
\text{Test Error}=\frac{1}{m_{test}}\sum _{i=1}^{m_{test}} err\left( h_{\Theta }\left( x_{test}^{( i)}\right) ,y_{test}^{( i)}\right)
$$

This gives us the proportion of the test data that was misclassified.
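
As a small sketch of the $0/1$ misclassification error, the function below thresholds the hypothesis outputs at $0.5$ and averages the mismatches. The function name `misclassification_error` and the five example values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def misclassification_error(h, y, threshold=0.5):
    """Fraction of examples whose thresholded prediction differs from the label."""
    h = np.asarray(h, dtype=float)
    y = np.asarray(y)
    predictions = (h >= threshold).astype(int)  # predict 1 when h >= 0.5, else 0
    return float(np.mean(predictions != y))     # average of the 0/1 errors

# Hypothetical hypothesis outputs and true labels for five test examples.
h_test = np.array([0.9, 0.2, 0.6, 0.4, 0.5])
y_test = np.array([1, 0, 0, 1, 1])

test_error = misclassification_error(h_test, y_test)
print(test_error)  # the third and fourth examples are misclassified: 2 / 5
```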

## Model Selection and Train/Validation/Test Sets
Just because a learning algorithm fits a training set well does not mean it is a good hypothesis. It could overfit, and as a result your predictions on the test set would be poor. The error of your hypothesis as measured on the data set with which you trained the parameters is likely to be lower than the error on any other data set.

Given many models with different polynomial degrees, we can use a systematic approach to identify the 'best' function. In order to choose the model of your hypothesis, you can test each degree of polynomial and look at the error result.

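If we picked the degree by comparing test set errors, the test error would no longer be a fair estimate of how well the chosen model generalizes. Instead, we split the dataset into three sets. One way to break it down is: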
- Training set: $60\%$
- Cross validation set: $20\%$
- Test set: $20\%$

We can now calculate three separate error values for the three different sets using the following method:
1. Optimize the parameters in $\Theta$ using the training set for each polynomial degree.
2. Find the polynomial degree $d$ with the least error using the cross validation set.
3. Estimate the generalization error using the test set with $J_{test}(\Theta^{(d)})$, where $\Theta^{(d)}$ is the parameter vector learned for the polynomial of degree $d$ that had the lowest cross validation error.

This way, the degree of the polynomial $d$ has not been chosen using the test set, so the test error remains a fair estimate of the generalization error.
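
The three-step method can be sketched as below. Everything specific here is an assumption for illustration: the synthetic quadratic data, the range of degrees $1$ to $8$, the helper names `poly_features` and `cost`, and fitting each model with `np.linalg.lstsq`.

```python
import numpy as np

def poly_features(x, degree):
    """Map a 1-D input to the columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def cost(theta, X, y):
    """Unregularized squared-error cost: 1/(2m) * sum((X @ theta - y)^2)."""
    residuals = X @ theta - y
    return float(residuals @ residuals) / (2 * len(y))

# Synthetic data (assumed): a noisy quadratic.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.1, size=200)

# 60% / 20% / 20% split into training, cross validation, and test indices.
order = rng.permutation(200)
train, cv, test = order[:120], order[120:160], order[160:]

# Step 1: learn theta on the training set for each polynomial degree.
# Step 2: record the cross validation error of each learned theta.
results = {}
for d in range(1, 9):
    theta, *_ = np.linalg.lstsq(poly_features(x[train], d), y[train], rcond=None)
    results[d] = (theta, cost(theta, poly_features(x[cv], d), y[cv]))

# Pick the degree with the lowest cross validation error.
best_d = min(results, key=lambda d: results[d][1])
best_theta = results[best_d][0]

# Step 3: estimate the generalization error on the untouched test set.
j_test = cost(best_theta, poly_features(x[test], best_d), y[test])
print(f"best degree = {best_d}, J_test = {j_test:.4f}")
```

The test indices are used exactly once, after the degree has been fixed, which is what keeps $J_{test}$ an honest estimate.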

## Diagnosing Bias vs. Variance
In this section we examine the relationship between the degree of the polynomial $d$ and the underfitting or overfitting of our hypothesis.
- We need to distinguish whether **bias** or **variance** is the problem contributing to bad predictions.
- High bias is underfitting and high variance is overfitting. Ideally, we need to find a golden mean between these two.

The training error will tend to **decrease** as we increase the degree $d$ of the polynomial.

At the same time, the cross validation error will tend to **decrease** as we increase $d$ up to a point, and then it will **increase** as $d$ is increased further, forming a roughly convex (U-shaped) curve.

**High bias (underfitting):** both $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ will be high. Also, $J_{CV}(\Theta) \approx J_{train}(\Theta)$

**High variance (overfitting):** $J_{train}(\Theta)$ will be low and $J_{CV}(\Theta)$ will be much greater than $J_{train}(\Theta)$

The tradeoff is summarized in the figure below:

![BiasVariance](./Week6_Images/BiasVariance.png)
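
To see these two curves numerically, the sketch below fits polynomials of increasing degree and prints $J_{train}$ and $J_{CV}$ for each. The data (a noisy sine curve with a deliberately small training set), the degree range, and the helper names are assumptions for illustration; the exact numbers depend on the random seed.

```python
import numpy as np

def poly_features(x, degree):
    """Map a 1-D input to the columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def cost(theta, X, y):
    """Unregularized squared-error cost: 1/(2m) * sum((X @ theta - y)^2)."""
    residuals = X @ theta - y
    return float(residuals @ residuals) / (2 * len(y))

rng = np.random.default_rng(1)

def make_data(m):
    """Assumed data source: y = sin(pi * x) plus Gaussian noise."""
    x = rng.uniform(-1, 1, size=m)
    return x, np.sin(np.pi * x) + rng.normal(scale=0.1, size=m)

x_train, y_train = make_data(30)   # small training set, so overfitting is possible
x_cv, y_cv = make_data(100)

degrees = list(range(1, 10))
j_train, j_cv = [], []
for d in degrees:
    theta, *_ = np.linalg.lstsq(poly_features(x_train, d), y_train, rcond=None)
    j_train.append(cost(theta, poly_features(x_train, d), y_train))
    j_cv.append(cost(theta, poly_features(x_cv, d), y_cv))

for d, jt, jc in zip(degrees, j_train, j_cv):
    print(f"d={d}: J_train={jt:.4f}  J_CV={jc:.4f}")
```

At $d = 1$ both errors are high and close to each other (high bias). As $d$ grows, $J_{train}$ never increases, because each higher-degree model contains the lower-degree ones; whether $J_{CV}$ turns back up depends on the data and noise.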


## Regularization and Bias/Variance

![regularization](./Week6_Images/regularization.png)

Instead of looking at the degree $d$ contributing to bias/variance, now we will look at the regularization parameter $\lambda$.
- Large $\lambda$: High bias (underfitting)
- Intermediate $\lambda$: just right
- Small $\lambda$: High variance (overfitting)

A large $\lambda$ heavily penalizes all the $\Theta$ parameters, which greatly simplifies the line of our resulting function and so causes underfitting.
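
The shrinking effect of a large $\lambda$ can be checked directly. The sketch below solves regularized linear regression with the normal equation for several values of $\lambda$ and prints the size of the learned parameters. It assumes the common convention of leaving the bias term $\theta_0$ unpenalized; the data, the degree-5 features, and the name `fit_regularized` are also assumptions made for the example.

```python
import numpy as np

def fit_regularized(X, y, lam):
    """Solve (X^T X + lam * L) theta = X^T y, where L is the identity matrix
    with L[0, 0] = 0 so the bias term theta_0 is not penalized (assumed convention)."""
    penalty = np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)

# Synthetic data (assumed): a noisy sine curve with degree-5 polynomial features.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = np.sin(np.pi * x) + rng.normal(scale=0.1, size=50)
X = np.vander(x, 6, increasing=True)  # columns [1, x, ..., x^5]

lambdas = [0.0, 0.01, 1.0, 100.0, 10000.0]
norms = []
for lam in lambdas:
    theta = fit_regularized(X, y, lam)
    norms.append(float(np.linalg.norm(theta[1:])))  # size of the penalized parameters
    print(f"lambda={lam:>8}: ||theta[1:]|| = {norms[-1]:.4f}")
```

As $\lambda$ grows, the penalized parameters are pushed toward zero and the fitted curve flattens toward a constant, which is the underfitting described above.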

The relationship of $\lambda$ to the training error and the cross validation error is as follows:
- Low $\lambda$: $J_{train}(\Theta)$ is low and $J_{CV}(\Theta)$ is high (high variance/overfitting).
- Intermediate $\lambda$: $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ are somewhat low and $J_{train}( \Theta ) \approx J_{CV}( \Theta )$.
- Large $\lambda$: both $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ will be high (high bias/underfitting).

The figure below illustrates the relationship between $\lambda$ and the hypothesis:

![bias_variance_lambda](./Week6_Images/bias_variance_lambda.png)

In order to choose the model and the regularization term $\lambda$, we need to:
- Create a list of lambdas (e.g. $\lambda \in \{0,\ 0.01,\ 0.02,\ 0.04,\ 0.08,\ 0.16,\ 0.32,\ 0.64,\ 1.28,\ 2.56,\ 5.12,\ 10.24\}$).
- Create a set of models with different degrees or any other variants.
- Iterate through the $\lambda$s and, for each $\lambda$, go through all the models to learn some $\Theta$.
- Compute the cross validation error $J_{CV}(\Theta)$ using the learned $\Theta$ (computed with $\lambda$), but evaluate it **without** regularization, i.e. with $\lambda = 0$.
- Select the best combo that produces the lowest error on the cross validation set.
- Using the best combo of $\Theta$ and $\lambda$, compute $J_{test}(\Theta)$ to see whether it generalizes well to the problem.
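
The selection procedure above can be sketched as a small grid search. The list of $\lambda$ values is the one given above; everything else is an assumption for illustration: the synthetic data, the degree range $1$ to $8$, the helper names, the normal-equation solver, and leaving $\theta_0$ unpenalized.

```python
import numpy as np

def poly_features(x, degree):
    """Map a 1-D input to the columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(X, y, lam):
    """Regularized normal equation; theta_0 is not penalized (assumed convention)."""
    penalty = np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)

def cost(theta, X, y):
    """Squared-error cost WITHOUT the regularization term (lambda = 0)."""
    residuals = X @ theta - y
    return float(residuals @ residuals) / (2 * len(y))

# Synthetic data (assumed): a noisy quadratic, split 60/20/20.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.1, size=200)
order = rng.permutation(200)
train, cv, test = order[:120], order[120:160], order[160:]

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]
degrees = range(1, 9)

best = None  # holds (J_CV, lambda, degree, theta) for the best combo so far
for lam in lambdas:
    for d in degrees:
        # Learn theta WITH regularization on the training set.
        theta = fit_regularized(poly_features(x[train], d), y[train], lam)
        # Evaluate on the cross validation set WITHOUT regularization.
        j_cv = cost(theta, poly_features(x[cv], d), y[cv])
        if best is None or j_cv < best[0]:
            best = (j_cv, lam, d, theta)

best_j_cv, best_lam, best_d, best_theta = best
j_test = cost(best_theta, poly_features(x[test], best_d), y[test])
print(f"lambda={best_lam}, degree={best_d}, J_CV={best_j_cv:.4f}, J_test={j_test:.4f}")
```

Regularization is used only while learning $\Theta$; the reported errors leave the penalty out so that models trained with different $\lambda$ values are compared on the same scale.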