# Week 6

## 1. Evaluating a Learning Algorithm

### - The test set error
* For linear regression<br>
&nbsp; $J_{test}(\Theta)=\frac{1}{2m_{test}}\sum_{i=1}^{m_{test}}(h_\Theta(x_{test}^{(i)})-y_{test}^{(i)})^2$
* For classification<br>
&nbsp; $J_{test}(\Theta)=\frac{1}{m_{test}}\sum_{i=1}^{m_{test}}err(h_\Theta(x_{test}^{(i)}), y_{test}^{(i)})$<br>
&nbsp; where $err(h_\Theta(x), y) = 1$ if ($h_\Theta(x) \geq 0.5$ and $y=0$) or ($h_\Theta(x) \lt 0.5$ and $y=1$), and $0$ otherwise. $J_{test}(\Theta)$ is then the fraction of misclassified test examples.
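
As a minimal sketch, the two test errors can be computed with NumPy as below. The function names are hypothetical, the inputs are assumed to be the hypothesis outputs on the test set, and the regression error is assumed to use the same $\frac{1}{2m}$ scaling as the regularized cost $J(\theta)$ later in these notes.

```python
import numpy as np

def regression_test_error(predictions, y):
    """Half mean squared error over the test set (assumed 1/(2m) scaling)."""
    predictions = np.asarray(predictions, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((predictions - y) ** 2) / (2 * y.size)

def classification_test_error(probabilities, y, threshold=0.5):
    """Fraction of test examples whose thresholded prediction differs from y."""
    probabilities = np.asarray(probabilities, dtype=float)
    y = np.asarray(y)
    predicted = (probabilities >= threshold).astype(int)  # predict 1 if h >= 0.5
    return float(np.mean(predicted != y))

# Toy values, made up for illustration only.
reg_err = regression_test_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
cls_err = classification_test_error([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1])
```

Here `reg_err` is $4 / (2 \cdot 3)$ and `cls_err` is $2/4$, since the last two examples are misclassified.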

### - Model Selection and Training/Validation/Test Sets
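1. Optimize the parameters in $\Theta$ using the training set for each polynomial degree (e.g. $d = 1, \dots, 10$).
2. Find the polynomial degree $d$ with the least error using the cross validation set.
3. Estimate the generalization error using the test set with $J_{test}(\Theta^{(d)})$. Because $d$ was chosen to fit the cross validation set, $J_{CV}$ is an optimistic estimate, so the test set must be kept separate.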
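
The three steps above can be sketched as follows. This is an illustration under assumptions: the data is synthetic, the 60/20/20 split is one common choice, and `np.polyfit` (unregularized least squares) stands in for whatever optimizer is used to learn $\Theta$. The helper names are hypothetical.

```python
import numpy as np

def half_mse(coeffs, x, y):
    """Squared-error cost with 1/(2m) scaling for a polynomial fit."""
    return np.sum((np.polyval(coeffs, x) - y) ** 2) / (2 * x.size)

def select_degree(x_train, y_train, x_cv, y_cv, degrees=range(1, 11)):
    """Fit each degree on the training set, pick the one with lowest CV error."""
    best = None
    for d in degrees:
        coeffs = np.polyfit(x_train, y_train, d)   # step 1: learn Theta^(d)
        cv_err = half_mse(coeffs, x_cv, y_cv)      # step 2: evaluate on CV set
        if best is None or cv_err < best[1]:
            best = (d, cv_err, coeffs)
    return best

# Synthetic quadratic data (an assumption for the demo).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 300)
y = 1 + 2 * x - 3 * x**2 + rng.normal(0, 0.1, 300)
x_train, y_train = x[:180], y[:180]
x_cv, y_cv = x[180:240], y[180:240]
x_test, y_test = x[240:], y[240:]

best_d, best_cv_err, best_coeffs = select_degree(x_train, y_train, x_cv, y_cv)
test_err = half_mse(best_coeffs, x_test, y_test)   # step 3: generalization estimate
```

The test set is touched only once, after the degree has been fixed, so `test_err` is not biased by the selection step.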

## 2. Bias vs Variance

* High Bias (Underfitting)<br>
: Both $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ will be high. Also, $J_{train}(\Theta)\approx J_{CV}(\Theta)$
* High Variance (Overfitting)<br>
:  $J_{train}(\Theta) \ll J_{CV}(\Theta)$

### - Regularization and Bias/Variance
* Linear regression with regularization<br>
$J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$<br>
Small $\lambda \rightarrow$ High Variance (overfit)<br>
Large $\lambda \rightarrow$ High Bias (underfit)<br><br>
* To choose a good value of $\lambda$:
1. Create a list of candidate values (e.g. $\lambda \in \{0, 0.01, 0.02, 0.04, 0.08, \dots, 10.24\}$, doubling at each step).<br>
2. Create a set of models with different degrees or any other variants.
3. Iterate through the $\lambda$'s and, for each $\lambda$, go through all the models to learn some $\Theta$.
4. Compute the cross validation error $J_{CV}(\Theta)$ using the learned $\Theta$ (which was trained with $\lambda$), but evaluate $J_{CV}$ itself without the regularization term (i.e. with $\lambda = 0$).
5. Select the combination of model and $\lambda$ that produces the lowest error on the cross validation set.
6. Using the best $\Theta$ and $\lambda$, compute $J_{test}(\Theta)$ to check how well the model generalizes.
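
A minimal sketch of this procedure for a single model is shown below. Assumptions: one fixed degree-8 polynomial model instead of a set of models, synthetic data, and a closed-form solution of the regularized cost in place of an iterative optimizer. All function names are hypothetical.

```python
import numpy as np

def poly_design(x, degree):
    """Design matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(X, y, lam):
    """Minimize J(theta) in closed form; theta_0 (bias) is not regularized."""
    reg = lam * np.eye(X.shape[1])
    reg[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + reg, X.T @ y)

def unregularized_cost(X, y, theta):
    """Squared-error cost with 1/(2m) scaling and no lambda term."""
    return np.sum((X @ theta - y) ** 2) / (2 * y.size)

def select_lambda(X_train, y_train, X_cv, y_cv, lambdas):
    best = None
    for lam in lambdas:
        theta = fit_regularized(X_train, y_train, lam)   # train with lambda
        cv_err = unregularized_cost(X_cv, y_cv, theta)   # evaluate without it
        if best is None or cv_err < best[2]:
            best = (lam, theta, cv_err)
    return best

# Candidate list: 0, then 0.01 doubled up to 10.24.
lambdas = [0.0] + [0.01 * 2**k for k in range(11)]

# Synthetic data (an assumption for the demo).
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 90)
y = np.sin(3 * x) + rng.normal(0, 0.2, 90)
X = poly_design(x, 8)
X_train, y_train = X[:30], y[:30]
X_cv, y_cv = X[30:60], y[30:60]
X_test, y_test = X[60:], y[60:]

best_lam, best_theta, best_cv_err = select_lambda(X_train, y_train, X_cv, y_cv, lambdas)
test_err = unregularized_cost(X_test, y_test, best_theta)
```

Extending this to several models only adds an outer loop over the model variants, with the best (model, $\lambda$) pair chosen by the same cross validation error.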

### - Learning Curves
* When experiencing high bias: getting more training data will **NOT** (by itself) help much.<br>
&nbsp;$\rightarrow$ Low training set size causes $J_{train}(\Theta)$ to be low and $J_{CV}(\Theta)$ to be high.<br>
&nbsp;$\rightarrow$ Large training set size causes both $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ to be high with $J_{train}(\Theta)\approx J_{CV}(\Theta)$<br>
* When experiencing high variance: getting more training data is likely to **help**.<br>
&nbsp;$\rightarrow$ Low training set size : $J_{train}(\Theta)$ will be low and $J_{CV}(\Theta)$ will be high.<br>
&nbsp;$\rightarrow$ Large training set size : $J_{train}(\Theta)$ increases with training set size and $J_{CV}(\Theta)$ continues to decrease without leveling off. Also, $J_{train}(\Theta)\lt J_{CV}(\Theta)$ but the difference between them remains significant.
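
A learning curve can be sketched by training on growing prefixes of the training set and recording both errors. The example below is an illustration of the high bias case under assumptions: synthetic quadratic data fitted with a straight line, unregularized least squares via `np.linalg.lstsq`, and hypothetical helper names.

```python
import numpy as np

def half_mse(X, y, theta):
    """Squared-error cost with 1/(2m) scaling."""
    return np.sum((X @ theta - y) ** 2) / (2 * y.size)

def learning_curve(X_train, y_train, X_cv, y_cv, sizes):
    """Train on the first m examples for each m; report train and CV error."""
    train_err, cv_err = [], []
    for m in sizes:
        theta, *_ = np.linalg.lstsq(X_train[:m], y_train[:m], rcond=None)
        train_err.append(half_mse(X_train[:m], y_train[:m], theta))  # on the m examples used
        cv_err.append(half_mse(X_cv, y_cv, theta))                   # on the full CV set
    return np.array(train_err), np.array(cv_err)

# Quadratic data fitted with a line: a deliberately underfitting model.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 2000)
y = 1 - 3 * x**2 + rng.normal(0, 0.1, 2000)
X = np.column_stack([np.ones_like(x), x])   # bias column plus one linear feature

sizes = [2, 5, 10, 50, 200, 1000]
train_err, cv_err = learning_curve(X[:1000], y[:1000], X[1000:], y[1000:], sizes)
```

With two training examples the line fits them exactly, so the training error starts near zero. As $m$ grows, both errors settle at a similar, high value, which is the high bias signature described above. Plotting `train_err` and `cv_err` against `sizes` gives the curve.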


### - Debugging
* Getting more training examples: Fixes high variance
* Trying smaller sets of features: Fixes high variance
* Adding features: Fixes high bias
* Adding polynomial features: Fixes high bias
* Decreasing $\lambda$ : Fixes high bias
* Increasing $\lambda$ : Fixes high variance

### - Diagnosing Neural Networks
* A neural network with fewer parameters is prone to **underfitting**. It is also **computationally cheaper**.
* A large neural network with more parameters is prone to **overfitting**. It is also **computationally expensive**. In this case you can use regularization (increase $\lambda$) to address the overfitting.


## 3. Approach to Machine Learning Problems
* Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
* Plot learning curves to decide if more data, more features, etc. are likely to help.
* Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.

## 4. Handling Skewed Data
### - Precision vs Recall
* Precision : True positives / Predicted positives
* Recall : True positives / Actual positives
* cf) Accuracy : (True positives + True negatives) / Total examples
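
The sketch below computes the three metrics from binary labels and shows why accuracy alone is misleading on skewed data. The 1% positive rate and the function names are made up for illustration, and returning 0 when a denominator is zero is a convention assumed here, not something the definitions above specify.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels in {0, 1}."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_pred & y_true))
    fp = int(np.sum(y_pred & ~y_true))
    fn = int(np.sum(~y_pred & y_true))
    tn = int(np.sum(~y_pred & ~y_true))
    return tp, fp, fn, tn

def precision_recall_accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0  # TP / predicted positives
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0     # TP / actual positives
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Skewed data: 1 positive among 100 examples, classifier always predicts 0.
y_true = np.array([1] * 1 + [0] * 99)
always_negative = np.zeros(100, dtype=int)
precision, recall, accuracy = precision_recall_accuracy(y_true, always_negative)
```

The always-negative classifier reaches an accuracy of 0.99 while its recall is 0, which is why precision and recall are the preferred metrics for skewed classes.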

### - Trading off Precision and Recall
* Predict 1 if $h_\theta(x) \gt$ threshold<br>
$\rightarrow$ high threshold $\Rightarrow$ higher precision, lower recall<br>
$\rightarrow$ low threshold $\Rightarrow$ lower precision, higher recall
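
To make the trade-off concrete, the sketch below evaluates precision and recall at several thresholds. The probabilities and labels are made-up toy values, and `precision_recall_at` is a hypothetical helper that treats an undefined ratio as 0.

```python
import numpy as np

def precision_recall_at(probabilities, y_true, threshold):
    """Precision and recall when predicting 1 iff h(x) > threshold."""
    predicted = np.asarray(probabilities) > threshold
    actual = np.asarray(y_true) == 1
    tp = np.sum(predicted & actual)
    predicted_pos = np.sum(predicted)
    actual_pos = np.sum(actual)
    precision = tp / predicted_pos if predicted_pos > 0 else 0.0
    recall = tp / actual_pos if actual_pos > 0 else 0.0
    return float(precision), float(recall)

# Toy hypothesis outputs and true labels (illustrative only).
probabilities = np.array([0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10])
y_true = np.array([1, 1, 0, 1, 0, 1, 0, 0])

p_low, r_low = precision_recall_at(probabilities, y_true, 0.3)    # (0.6, 0.75)
p_mid, r_mid = precision_recall_at(probabilities, y_true, 0.5)    # (0.75, 0.75)
p_high, r_high = precision_recall_at(probabilities, y_true, 0.8)  # (1.0, 0.5)
```

On this toy data, raising the threshold from 0.3 to 0.8 increases precision from 0.6 to 1.0 and lowers recall from 0.75 to 0.5.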

### - $F_1$ Score
$F_1 = 2\frac{PR}{P+R}$ (P: Precision, R: Recall)
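
A short worked example, using illustrative values chosen here rather than taken from any dataset, shows why $F_1$ is a better single-number summary than the plain average $\frac{P+R}{2}$:

* Classifier A, $P=0.5$, $R=0.4$:<br>
&nbsp; $F_1 = 2\cdot\frac{0.5 \cdot 0.4}{0.5 + 0.4} = \frac{0.4}{0.9} \approx 0.444$, average $= 0.45$
* Classifier B, $P=0.02$, $R=1.0$ (e.g. predicting 1 almost always):<br>
&nbsp; $F_1 = 2\cdot\frac{0.02 \cdot 1.0}{0.02 + 1.0} = \frac{0.04}{1.02} \approx 0.039$, average $= 0.51$

The average ranks B above A, while $F_1$ ranks A far above B, because $F_1$ is high only when both precision and recall are reasonably high.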