#### Machine Learning Diagnostics

A diagnostic is a test you can run to gain insight into what is or isn't working with a learning algorithm.

Diagnostics can take time to implement, but doing so can be a very good use of time.


#### Evaluating a Hypothesis

A hypothesis may have a low error on the training set but still be inaccurate (because of overfitting). To evaluate a hypothesis, it's recommended to split the data into at least two sets: a *training set* and a *test set*. Other methods add a third set, the *cross validation set*.

Break the dataset into around 70% training set and the remainder as test set.

Procedure using two sets is then:

- Learn $\Theta$ by minimising $J_{train}(\Theta)$ on the training set
- Compute the test set error $J_{test}(\Theta)$
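
The two-step procedure above can be sketched with NumPy as follows. This is a minimal illustration, not a reference implementation: the synthetic data, the helper names (`split_train_test`, `fit_linear`, `squared_error_cost`) and the fixed random seeds are all assumptions made for the example.

```python
import numpy as np

def split_train_test(X, y, train_fraction=0.7, seed=0):
    """Shuffle the examples, then split them into training and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_train = int(round(train_fraction * len(y)))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

def fit_linear(X, y):
    """Least-squares fit, which minimises the squared-error cost on (X, y)."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def squared_error_cost(theta, X, y):
    """J(theta) = 1/(2m) * sum((X @ theta - y)^2), without regularisation."""
    residuals = X @ theta - y
    return float(residuals @ residuals) / (2 * len(y))

# Synthetic data (an assumption for illustration): y = 1 + 2x + noise
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=100)
y = 1 + 2 * x + rng.normal(scale=0.1, size=100)
X = np.column_stack([np.ones_like(x), x])  # prepend the intercept column

X_train, y_train, X_test, y_test = split_train_test(X, y)
theta = fit_linear(X_train, y_train)            # step 1: learn on the training set
j_train = squared_error_cost(theta, X_train, y_train)
j_test = squared_error_cost(theta, X_test, y_test)  # step 2: evaluate on the test set
print(f"J_train = {j_train:.4f}, J_test = {j_test:.4f}")
```

Shuffling before splitting matters when the data is ordered in some way; otherwise the test set may not resemble the training set.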

###### Test set error function

- For linear regression: $J_{test}(\Theta) = \dfrac{1}{2m_{test}} \sum_{i=1}^{m_{test}}(h_\Theta(x^{(i)}_{test}) - y^{(i)}_{test})^2$

- For classification: $err(h_\Theta(x),y) = \begin{cases} 1 & \mbox{if } (h_\Theta(x) \geq 0.5 \mbox{ and } y = 0) \mbox{ or } (h_\Theta(x) < 0.5 \mbox{ and } y = 1) \newline 0 & \mbox{otherwise} \end{cases}$

This gives us a binary 0 or 1 error result based on a misclassification. The average test error for the test set is:

$$\text{Test Error} = \dfrac{1}{m_{test}} \sum^{m_{test}}_{i=1} err(h_\Theta(x^{(i)}_{test}), y^{(i)}_{test})$$

This gives us the proportion of the test data that was misclassified.
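
As a small sketch of the misclassification error, assuming the hypothesis outputs are already available as probabilities (the function name and the sample values are made up for illustration):

```python
import numpy as np

def misclassification_error(probabilities, labels, threshold=0.5):
    """Fraction of examples where the thresholded prediction differs from the label."""
    predictions = (np.asarray(probabilities) >= threshold).astype(int)
    return float(np.mean(predictions != np.asarray(labels)))

# Hypothetical hypothesis outputs h(x) and true labels y for five test examples
h_test = [0.9, 0.2, 0.6, 0.4, 0.5]
y_test = [1, 0, 0, 1, 1]

test_error = misclassification_error(h_test, y_test)
print(test_error)  # 0.4: the third and fourth examples are misclassified
```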


###### Model selection using train, cross validation, and test sets

Given models with different polynomial degrees, we can use a systematic approach to identify the 'best' function: fit each degree of polynomial and compare the resulting errors.

Break the dataset into around 60% training set, 20% cross validation set (_cv_) and the remaining 20% as test set (_test_).

Procedure using three sets is:

- Optimize the parameters in $\Theta$ using the training set for each polynomial degree

- Find the polynomial degree d with the least error using the cross validation set

- Estimate the generalization error using the test set with $J_{test}(\Theta^{(d)})$, where $\Theta^{(d)}$ is the parameter vector learned for the degree d with the lowest cross validation error

This way, the degree d has not been chosen using the test set, so the test error remains a fair estimate of the generalization error.
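
The three-set procedure might look like this in NumPy. It is a sketch under assumptions: the data is synthetic (a noisy quadratic), the candidate degrees 1 to 8 are arbitrary, and `polynomial_features`, `fit` and `cost` are illustrative helpers rather than part of any library.

```python
import numpy as np

def polynomial_features(x, degree):
    """Design matrix with columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit(X, y):
    """Least-squares fit of theta on (X, y)."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum((X @ theta - y)^2)."""
    residuals = X @ theta - y
    return float(residuals @ residuals) / (2 * len(y))

# Synthetic data (an assumption): a quadratic with noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 1 - 2 * x + 3 * x**2 + rng.normal(scale=0.1, size=200)

# 60% / 20% / 20% split into train, cross validation, and test indices
order = rng.permutation(200)
train, cv, test = order[:120], order[120:160], order[160:]

thetas, cv_errors = {}, {}
for d in range(1, 9):
    # Step 1: fit each degree on the training set only
    thetas[d] = fit(polynomial_features(x[train], d), y[train])
    # Step 2: score each degree on the cross validation set
    cv_errors[d] = cost(thetas[d], polynomial_features(x[cv], d), y[cv])

best_d = min(cv_errors, key=cv_errors.get)
# Step 3: report the generalization error of the chosen degree on the test set
j_test = cost(thetas[best_d], polynomial_features(x[test], best_d), y[test])
print(f"best degree = {best_d}, J_test = {j_test:.4f}")
```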


###### Training/testing procedure for Linear Regression

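Learn the parameters $\theta$ from the training data (minimizing the training error $J(\theta)$).

Compute the test set error:

$$J_{test} (\theta) = \frac{1}{2m_{test}} \sum\limits_{i=1}^{m_{test}}(h_\theta(x_{test}^{(i)}) - y_{test}^{(i)})^2$$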


*Training error*:
$$J_{train} (\theta) = \frac{1}{2m} \sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2$$


*Cross Validation error*:
$$J_{cv} (\theta) = \frac{1}{2m_{cv}} \sum\limits_{i=1}^{m_{cv}}(h_\theta(x_{cv}^{(i)}) - y_{cv}^{(i)})^2$$

*Test error*:
$$J_{test} (\theta) = \frac{1}{2m_{test}} \sum\limits_{i=1}^{m_{test}}(h_\theta(x_{test}^{(i)}) - y_{test}^{(i)})^2$$


#### Diagnosing Bias vs. Variance

To choose the best degree of polynomial d and avoid underfitting or overfitting our hypothesis, we need to distinguish whether *bias* or *variance* is the problem contributing to bad predictions.

High bias is underfitting and high variance is overfitting. Ideally, we need to find a golden mean between these two.

The training error tends to decrease as we increase the degree d of the polynomial.

At the same time, the cross validation error tends to decrease as we increase d up to a point, and then it increases as d grows further, forming a convex curve.

![image.png](attachment:image.png)

High bias (underfitting): both $J_{train}(\Theta)$ and $J_{cv}(\Theta)$ will be high. Also, $J_{cv}(\Theta) \approx J_{train}(\Theta)$.

High variance (overfitting): $J_{train}(\Theta)$ will be low and $J_{cv}(\Theta)$ will be much greater than $J_{train}(\Theta)$.
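
The two rules above can be turned into a rough check. This is only a heuristic sketch: what counts as a "high" error is problem-specific, so `acceptable_error` and `gap_factor` are hypothetical thresholds that would have to be chosen for the task at hand.

```python
def diagnose(j_train, j_cv, acceptable_error, gap_factor=2.0):
    """Rough bias/variance diagnosis from training and cross validation errors.

    acceptable_error and gap_factor are assumed thresholds, not standard values.
    """
    if j_train > acceptable_error:
        # The model cannot even fit the training data well
        return "high bias"
    if j_cv > acceptable_error and j_cv > gap_factor * j_train:
        # Low training error, but much higher cross validation error
        return "high variance"
    return "no clear problem"

print(diagnose(j_train=0.50, j_cv=0.55, acceptable_error=0.1))  # high bias
print(diagnose(j_train=0.01, j_cv=0.40, acceptable_error=0.1))  # high variance
```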

#### Linear regression with regularisation

Regularisation helps to prevent overfitting.

As $\lambda$ increases, our fit becomes more rigid and tends to underfit. On the other hand, as $\lambda$ approaches 0, we tend to overfit the data. To choose the 'just right' parameter $\lambda$ we need to:

- Create a list of lambdas (i.e. $\lambda \in \{0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24\}$);
- Create a set of models with different degrees or any other variants.
- Iterate through the λs and for each λ go through all the models to learn some Θ.
- Compute the cross validation error $J_{cv}(\Theta)$ using the learned Θ (computed with λ), but evaluate it without the regularization term (i.e. with λ = 0 in the error itself).
- Select the best combo that produces the lowest error on the cross validation set.
- Using the best combo Θ and λ, compute $J_{test}(\Theta)$ to see whether it generalizes well.
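
The steps above can be sketched as a grid search over λ and polynomial degree. The assumptions here are loud ones: the data is synthetic, the candidate degrees are arbitrary, and the fit uses the regularised normal equation $(X^TX + \lambda L)\,\theta = X^Ty$ (with $L$ the identity except for a zero in the intercept position), which is one way to minimise the regularised cost rather than the only one.

```python
import numpy as np

def polynomial_features(x, degree):
    """Design matrix with columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit_regularised(X, y, lam):
    """Minimise the regularised squared-error cost via the normal equation."""
    penalty = np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # the intercept term is not regularised
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)

def cost(theta, X, y):
    """Unregularised J(theta) = 1/(2m) * sum((X @ theta - y)^2)."""
    residuals = X @ theta - y
    return float(residuals @ residuals) / (2 * len(y))

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=200)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=200)
order = rng.permutation(200)
train, cv, test = order[:120], order[120:160], order[160:]

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]
degrees = [2, 4, 6, 8]  # hypothetical set of candidate models

models, cv_errors = {}, {}
for lam in lambdas:
    for d in degrees:
        # Train with regularisation strength lam ...
        theta = fit_regularised(polynomial_features(x[train], d), y[train], lam)
        models[(d, lam)] = theta
        # ... but score on the cross validation set without the penalty term
        cv_errors[(d, lam)] = cost(theta, polynomial_features(x[cv], d), y[cv])

best_d, best_lam = min(cv_errors, key=cv_errors.get)
j_test = cost(models[(best_d, best_lam)],
              polynomial_features(x[test], best_d), y[test])
print(f"degree = {best_d}, lambda = {best_lam}, J_test = {j_test:.4f}")
```

Scoring without the penalty term keeps the cross validation errors comparable across different values of λ.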

#### Learning Curves

Experiencing high bias:

Low training set size: causes $J_{train}(\Theta)$ to be low and $J_{cv}(\Theta)$ to be high.

Large training set size: causes both $J_{train}(\Theta)$ and $J_{cv}(\Theta)$ to be high with $J_{train}(\Theta) \approx J_{cv}(\Theta)$. With high bias, getting more training data will not (by itself) help much.


Experiencing high variance:

Low training set size: $J_{train}(\Theta)$ will be low and $J_{cv}(\Theta)$ will be high.

Large training set size: $J_{train}(\Theta)$ increases with training set size and $J_{cv}(\Theta)$ continues to decrease without leveling off. Also, $J_{train}(\Theta) < J_{cv}(\Theta)$, but the difference between them remains significant.

If a learning algorithm is suffering from high variance, getting more training data is likely to help (adding more hidden layers isn't likely to help).
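
A learning curve can be computed by training on progressively larger subsets of the training set and recording both errors each time. The sketch below deliberately fits a straight line to quadratic data, so it should show the high bias pattern; the data, the subset sizes and the helper names are assumptions for illustration.

```python
import numpy as np

def fit(X, y):
    """Least-squares fit of theta on (X, y)."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum((X @ theta - y)^2)."""
    residuals = X @ theta - y
    return float(residuals @ residuals) / (2 * len(y))

def learning_curve(X_train, y_train, X_cv, y_cv, sizes):
    """Train on the first m examples for each m; return (train, cv) error lists."""
    train_errors, cv_errors = [], []
    for m in sizes:
        theta = fit(X_train[:m], y_train[:m])
        # Training error is measured only on the m examples actually used
        train_errors.append(cost(theta, X_train[:m], y_train[:m]))
        # Cross validation error is always measured on the full cv set
        cv_errors.append(cost(theta, X_cv, y_cv))
    return train_errors, cv_errors

# Synthetic data (an assumption): quadratic target, straight-line model
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=200)
y = x**2 + rng.normal(scale=0.05, size=200)
X = np.column_stack([np.ones_like(x), x])  # linear features only: underfits

X_train, y_train = X[:150], y[:150]
X_cv, y_cv = X[150:], y[150:]
sizes = [2, 5, 10, 20, 50, 100, 150]

train_errors, cv_errors = learning_curve(X_train, y_train, X_cv, y_cv, sizes)
for m, jt, jc in zip(sizes, train_errors, cv_errors):
    print(f"m = {m:3d}  J_train = {jt:.4f}  J_cv = {jc:.4f}")
```

Plotting `train_errors` and `cv_errors` against `sizes` gives the curves described above: with this underfitting model, both errors should end up high and close to each other.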

#### Conclusion

Our decision process can be broken down as follows:

- Getting more training examples: fixes high variance
- Trying smaller sets of features: fixes high variance
- Adding features: fixes high bias
- Adding polynomial features: fixes high bias
- Decreasing λ: fixes high bias
- Increasing λ: fixes high variance