# Week 6: Advice for Applying Machine Learning

Advice on which avenues are most promising when deciding what ML tool or technique to try next.

## Debugging a Learning Algorithm

Suppose you have implemented regularized linear regression to predict house prices:

$$J(\theta) = \frac{1}{2m}\left[ \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^n \theta_j^2 \right]$$

When testing on a new set of houses, you find large errors in the predictions.  

**What to do next?**
- get more training samples? 
    - :( sometimes this does not help!
- try smaller set of features
    - to prevent overfitting
- try getting additional features
- try adding polynomial features ($x_1 x_2$)
- try increasing/decrease $\lambda$

**Fortunately, there is a simple technique to rule out half the list above**, which saves a lot of time pursuing something that will not work.

## Machine Learning Diagnostic

A diagnostic is a test that tells you what is or is not working with an algorithm.

It also suggests which things are promising to try to improve performance.

Diagnostics can take time to implement.

## Evaluating A Hypothesis

Get low training error? 
- that alone does not mean the hypothesis is good: the problem may be overfitting

How can you tell overfitting?
- can plot for low dimensional problem / few features.
- what about for many many features?
    - split the data: training set (70%) + test set (30%)
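The 70/30 split can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the helper name `train_test_split`, the `seed` argument, and the toy data are assumptions made here. Shuffling before splitting matters when the data is stored in some sorted order.

```python
import random

def train_test_split(examples, train_fraction=0.7, seed=0):
    """Shuffle a copy of `examples` and split it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = int(round(train_fraction * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]

# Toy (x, y) pairs, purely illustrative.
data = [(x, 2.0 * x + 1.0) for x in range(10)]
train, test = train_test_split(data)
print(len(train), len(test))
```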

### Training/Testing Process for Linear Reg

This is the standard technique for evaluating how good a learned hypothesis is.

- learn $\theta$ from training data (minimize $J(\theta)$).
- compute test set error (linear regression): $J_{test}(\theta) = \frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} \left( h_\theta(x^{(i)}_{test}) - y^{(i)}_{test} \right)^2$
- for logistic regression, the test set error is: $J_{test}(\theta) = -\frac{1}{m_{test}} \sum_{i=1}^{m_{test}} \left[ y_{test}^{(i)} \log h_{\theta}\left(x_{test}^{(i)}\right) + \left(1-y_{test}^{(i)}\right) \log \left(1 - h_{\theta}\left(x_{test}^{(i)}\right)\right) \right]$

- Misclassification error:
\begin{align}
    \operatorname{err}\left(h_{\theta}(x), y\right)=\left\{\begin{array}{ll}
1 & \text{if } h_{\theta}(x) \geq 0.5 \text{ and } y=0, \text{ or if } h_{\theta}(x)<0.5 \text{ and } y=1 \\
0 & \text{otherwise}
\end{array}\right.
\end{align}

\begin{align}
\text{Test error} = \frac{1}{m_{test}} \sum_{i=1}^{m_{test}} \operatorname{err}\left(h_{\theta}\left(x_{test}^{(i)}\right), y_{test}^{(i)}\right)
\end{align}
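As a rough sketch, the squared-error test cost and the misclassification error can be computed as below. The function names and the toy predictions are illustrative assumptions; the inputs stand for $h_\theta(x_{test}^{(i)})$ already evaluated on the test set.

```python
def squared_test_error(predictions, targets):
    """J_test for linear regression: 1/(2 m_test) * sum of squared residuals."""
    m_test = len(targets)
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / (2 * m_test)

def misclassification_error(probabilities, labels, threshold=0.5):
    """Fraction of test examples where the thresholded prediction disagrees with y."""
    mistakes = sum(
        1 for p, y in zip(probabilities, labels)
        if (p >= threshold) != (y == 1)  # err(h, y) = 1 in exactly this case
    )
    return mistakes / len(labels)

# Hypothetical classifier outputs h_theta(x) and true labels.
probs = [0.9, 0.2, 0.6, 0.4]
labels = [1, 0, 0, 1]
print(misclassification_error(probs, labels))  # 0.5: the last two are wrong
```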

## Model Selection and Train/Validation/Test Sets

Once params are fit to a data set (e.g. the training set), the error measured on that same data is likely to be lower than the true generalization error.

### Model Selection

Each model type will yield a different $\theta$ fit. 
Each one has a corresponding $J_{test}$. We could take the model with the lowest $J_{test}$.

<img src="figures/fig1.png" width=500>

**How well does this model generalize?**

We can report $J_{test}(\theta^{(d)})$.  
The problem: this is likely to be an optimistic estimate of the generalization error.

Essentially, we are fitting an extra parameter $d$ (the degree of the polynomial) to the test set, because we are **choosing** $d$ based on test set performance.  

Since $d$ was fit to the test set, the performance of the hypothesis on that test set is likely to be misleading about its general performance.

### How to Evaluate The Hypothesis:
This is what to do instead:  

training set = 60%, cross-validation 20%, test set = 20%

$\Rightarrow$ $(x^{(i)}, y^{(i)} )$,   $(x_{cv}^{(i)}, y_{cv}^{(i)} )$,   $(x_{test}^{(i)}, y_{test}^{(i)} )$

Produces three errors:  

Training error:
$$J_{train}(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2$$

Cross Validation error:
$$J_{cv}(\theta) = \frac{1}{2m_{cv}} \sum_{i=1}^{m_{cv}} (h_{\theta}(x_{cv}^{(i)}) - y_{cv}^{(i)})^2$$

Test error:
$$J_{test}(\theta) = \frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} (h_{\theta}(x_{test}^{(i)}) - y_{test}^{(i)})^2$$

## Model Selection part 2

Instead of using the test set to select the model, use the cross-validation set. Follow the same steps as before, but evaluate each candidate on the cv data and choose the one with the lowest $J_{cv}(\theta^{(d)})$.

<img src="figures/fig2.png" width=500>

Now estimate the generalization error on the test set, $J_{test}(\theta^{(d)})$. This keeps the choice of $d$ independent of the test data. It is considered good practice to have separate train, validation, and test sets.
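A minimal sketch of this selection loop, assuming NumPy and a synthetic 1-D data set (the data, the 60/20/20 slicing, the candidate degrees 1 to 6, and the helper `half_mse` are all illustrative choices, not part of the notes). Each degree is fit on the training set only, $d$ is picked by $J_{cv}$, and the test set is touched once at the end.

```python
import numpy as np

def half_mse(theta, x, y):
    """J = 1/(2m) * sum((h(x) - y)^2) for a polynomial hypothesis h."""
    residual = np.polyval(theta, x) - y
    return float(residual @ residual) / (2 * len(y))

# Synthetic data (assumed): a noisy quadratic.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=100)

# 60% train, 20% cross-validation, 20% test (x is already in random order).
x_train, y_train = x[:60], y[:60]
x_cv, y_cv = x[60:80], y[60:80]
x_test, y_test = x[80:], y[80:]

# Fit theta^(d) on the training set for each candidate degree d.
fits = {d: np.polyfit(x_train, y_train, d) for d in range(1, 7)}

# Choose d using the cross-validation error only.
cv_errors = {d: half_mse(theta, x_cv, y_cv) for d, theta in fits.items()}
best_d = min(cv_errors, key=cv_errors.get)

# Report generalization error on data that played no role in choosing d.
test_error = half_mse(fits[best_d], x_test, y_test)
print(best_d, test_error)
```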

## Diagnosing Bias vs. Variance

<img src="figures/fig3.png" width=500>

Suppose we are using a polynomial of degree $d$. 

Plot the training error vs. $d$ and overlay the cross validation error:

<img src="figures/fig4.png" width=600>

Suppose we have a learning algo and it is not performing as well as hoped. Is it a bias problem or a variance problem?

Look at the regions:    
Region 1 has high bias. CV error is high because the model underfits, and the train error is high for the same reason. The model is not capturing enough of the relationships in the data.

Region 2 has high variance. CV error is high because the model has overfit the training data, so it is not resilient to new data. Train error is low because the model is fit too closely to this particular training set.

Bias (underfit):  
- $J_{train}(\theta)$ high
- $J_{cv}(\theta) \approx J_{train}(\theta)$ 

Variance (overfit):
- $J_{train}(\theta)$ low
- $J_{cv}(\theta) \gg J_{train}(\theta)$
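The two signatures above can be turned into a rough rule of thumb. This is only a sketch: the function `diagnose`, the `acceptable_error` target, and the `gap_ratio` used to decide what counts as "much greater" are assumptions made here and would need tuning per problem.

```python
def diagnose(j_train, j_cv, acceptable_error, gap_ratio=2.0):
    """Classify a fit from its training and cross-validation errors.

    acceptable_error and gap_ratio are hypothetical thresholds, not
    values taken from the notes.
    """
    if j_train > acceptable_error:
        # Training error itself is high: the model underfits.
        return "high bias"
    if j_cv > acceptable_error and j_cv > gap_ratio * j_train:
        # Training error is low but cv error is far above it: overfit.
        return "high variance"
    return "ok"

print(diagnose(j_train=0.9, j_cv=1.0, acceptable_error=0.3))
print(diagnose(j_train=0.05, j_cv=0.8, acceptable_error=0.3))
```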

<img src="figures/fig5.png" width=300>

## Regularization and Bias/Variance

Large $\lambda$ -> high bias (underfit). Large parameter values are penalized so heavily that the $\theta_j$ shrink toward zero.  

Intermediate $\lambda$ -> good fit.  

Small $\lambda$ -> high variance (overfit). The penalty barely constrains the parameters, which can lead to an overfit.



<img src="figures/fig6.png" width=600 >

### Choosing $\lambda$

For this setting with regularization, we have the same error functions as before.  

We outline how to automatically choose $\lambda$ using these.

\begin{array}{l}
h_{\theta}(x)=\theta_{0}+\theta_{1} x+\theta_{2} x^{2}+\theta_{3} x^{3}+\theta_{4} x^{4} \\[2ex]
J(\theta)=\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}+\frac{\lambda}{2 m} \sum_{j=1}^{n} \theta_{j}^{2} \\[2ex]
J_{train}(\theta)=\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2} \\
J_{cv}(\theta)=\frac{1}{2 m_{cv}} \sum_{i=1}^{m_{cv}}\left(h_{\theta}\left(x_{cv}^{(i)}\right)-y_{cv}^{(i)}\right)^{2} \\
J_{test}(\theta)=\frac{1}{2 m_{test}} \sum_{i=1}^{m_{test}}\left(h_{\theta}\left(x_{test}^{(i)}\right)-y_{test}^{(i)}\right)^{2}
\end{array}

Given model:  
$$h_\theta(x) = \sum_j \theta_j x^j$$

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2$$
    
Try $\lambda=0$, minimize $J(\theta)$, and call the resulting parameter vector $\theta^{(1)}$.  

Repeat with $\lambda = 0.01$, minimize $J(\theta)$, and call the result $\theta^{(2)}$.   

Repeat for each further candidate value of $\lambda$ ...

**Then** Evaluate $J_{cv}(\theta^{(\lambda)})$ on the **cv** set and choose the minimum of these.

**Test Error:** $J_{test}(\theta^{(\lambda)})$.

This is similar to the previous model-selection scheme, where the minimum was chosen over $d$.  
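The procedure can be sketched with NumPy as follows. Several things here are assumptions for illustration: the synthetic data, the closed-form (normal equation) solver in place of an iterative minimizer, and the grid of $\lambda$ values beyond the 0 and 0.01 mentioned above. Note that the regularizer appears only when fitting; $J_{train}$, $J_{cv}$, and $J_{test}$ are computed without it.

```python
import numpy as np

def poly_features(x, degree):
    """Design matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(X, y, lam):
    """Minimize J(theta) in closed form; theta_0 is left unpenalized."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

def half_mse(theta, X, y):
    """Unregularized error 1/(2m) * sum((X theta - y)^2)."""
    r = X @ theta - y
    return float(r @ r) / (2 * len(y))

# Synthetic data (assumed), with the degree-4 hypothesis from above.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=60)
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=60)
X = poly_features(x, 4)
X_train, y_train = X[:36], y[:36]
X_cv, y_cv = X[36:48], y[36:48]
X_test, y_test = X[48:], y[48:]

# Candidate values of lambda: an illustrative grid, doubling each step.
lambdas = [0.0] + [0.01 * 2**k for k in range(11)]

thetas = {lam: fit_regularized(X_train, y_train, lam) for lam in lambdas}
cv_errors = {lam: half_mse(thetas[lam], X_cv, y_cv) for lam in lambdas}
best_lambda = min(cv_errors, key=cv_errors.get)
test_error = half_mse(thetas[best_lambda], X_test, y_test)
print(best_lambda, test_error)
```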



We can plot each error against $\lambda$; the curves look like the mirror image of the plot against $d$.

As $\lambda \to \infty$, we get an increasing bias problem (underfit), because the penalty shrinks the $\theta_j$ toward zero and the hypothesis flattens out.   

As $\lambda \to 0$, we get an increasing variance problem (overfit), because the penalty no longer constrains the higher-order terms.

The ideal here is some middle point that's **just right**.

<img src="figures/fig7.png" width=500>

## Learning Curves

Artificially reduce training set size (10, 20, 40 train examples), then plot training and cv errors on them.  

\begin{align}
    \begin{array}{l}
    J_{train}(\theta)=\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2} \\[2.0ex]
    J_{cv}(\theta)=\frac{1}{2 m_{cv}} \sum_{i=1}^{m_{cv}}\left(h_{\theta}\left(x_{cv}^{(i)}\right)-y_{cv}^{(i)}\right)^{2}
    \end{array}
\end{align}
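A learning curve can be computed roughly as below. This is a sketch under assumptions: NumPy, synthetic quadratic data, a straight-line hypothesis (chosen deliberately so the model underfits), and an arbitrary list of training set sizes. The training error is measured only on the $m$ examples used for the fit, while the cv error always uses the full cv set.

```python
import numpy as np

def half_mse(theta, x, y):
    """1/(2m) * sum((h(x) - y)^2) for a polynomial hypothesis h."""
    residual = np.polyval(theta, x) - y
    return float(residual @ residual) / (2 * len(y))

def learning_curve(x_train, y_train, x_cv, y_cv, degree, sizes):
    """Return a list of (m, J_train, J_cv), one entry per training set size m."""
    curve = []
    for m in sizes:
        theta = np.polyfit(x_train[:m], y_train[:m], degree)
        j_train = half_mse(theta, x_train[:m], y_train[:m])  # only the m points used
        j_cv = half_mse(theta, x_cv, y_cv)                   # whole cv set every time
        curve.append((m, j_train, j_cv))
    return curve

# Synthetic data (assumed): quadratic target.
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=140)
y = 1.0 - 3.0 * x**2 + rng.normal(scale=0.1, size=140)

sizes = [5, 10, 20, 40, 80]
curve = learning_curve(x[:100], y[:100], x[100:], y[100:], degree=1, sizes=sizes)
for m, j_train, j_cv in curve:
    print(m, round(j_train, 4), round(j_cv, 4))
```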

Watch what happens as we increase the number of training examples...

<img src="figures/fig8.png" width=300>

Then we plot the training error on only the points used.

<img src="figures/fig9.png" width=300>

### What if High Bias?

High bias means we are underfitting. Let's see what the error curves look like.  

Even if we add more and more data, the improvements will **plateau out** (diminishing returns).  

The cv and training errors will be close to each other in the high bias regime.  

Noticing this early can save us from wasting time collecting more data.    

Both curves also level off at a **high error**.

<img src="figures/fig10.png" width=400>

<img src="figures/fig11.png" width=300>

### What if High Variance?

What if we are overfitting/hi variance?  

The training set error will be **low** because we are overfitting to the data here. There will be a large gap between the CV error and the training error.

**More data is likely to help** in a high variance problem.

<img src="figures/fig12.png" width=300>

<img src="figures/fig13.png" width=300>

## Deciding What To Do Next Part 2

If things don't work as well as you'd like, here is what to try next and which problem each option fixes:

- get more training samples **high variance**
- try smaller sets of features **high variance**
- try getting additional features **high bias**
- try adding polynomial features ($x_1x_2\dots$) **high bias** 
- try decreasing $\lambda$ **high bias**
- try increasing $\lambda$ **high variance**

## Overfitting Neural Networks

Usually the larger the network, the better; the main disadvantage is computational cost (\$\$\$).  

Larger networks are more prone to overfitting, so use regularization ($\lambda$) to address it.

<img src="figures/fig14.png" width=700>