# Coursera ML - Week 6

https://www.coursera.org/learn/machine-learning/home/week/6

## Improving ML algorithm performance
### E.g. linear regression
- Get more training examples
- Try smaller set of features - prevent overfitting
- Try getting additional features
- Try adding polynomial features
- Try increasing $\lambda$
- Try decreasing $\lambda$

**These can be time-consuming**

### Diagnostic

Need a test to see what is/isn't working

#### Training Set vs Test Set
- Shuffle data randomly!
- Split data into two sets, e.g. 70/30
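The shuffle-then-split step can be sketched in plain Python as follows. The function name `shuffle_split`, the fixed `seed`, and the toy `data` list are assumptions made for illustration, not part of the course material.

```python
import random

def shuffle_split(examples, train_fraction=0.7, seed=0):
    """Shuffle a copy of `examples`, then split it into (train, test)."""
    shuffled = list(examples)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed only for reproducibility
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy data set of (x, y) pairs
data = [(x, 2 * x) for x in range(10)]
train, test = shuffle_split(data)
print(len(train), len(test))
```

Shuffling first matters when the data is stored in some order (e.g. sorted by label); otherwise the test set would not resemble the training set.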

#### Training/testing
1. Learn parameters $\theta$, minimizing $J(\theta)$
2. Compute test set error $J_\textrm{test}(\theta)$
    - use appropriate cost function for linear regression, classification, ...
3. For classification, alternatively use the misclassification error (0/1 misclassification error):
$$ \textrm{err}(h,y) = \begin{cases}
1, & \textrm{wrong classification (after thresholding/max)}\\
0, & \textrm{correct classification}
\end{cases}$$

$$\textrm{Test error} = \frac{1}{m_\textrm{test}} \sum_{i=1}^{m_\textrm{test}} \textrm{err}(h_\theta(x_\textrm{test}^{(i)}), y_\textrm{test}^{(i)})$$
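A minimal sketch of the 0/1 misclassification error for a binary classifier, assuming the hypothesis outputs are thresholded at 0.5. The lists `h_test` and `y_test` are made-up values for illustration.

```python
def misclassification_error(h_values, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction differs from the label."""
    errors = 0
    for h, y in zip(h_values, labels):
        predicted = 1 if h >= threshold else 0
        errors += int(predicted != y)        # err(h, y) is 1 when wrong, 0 when right
    return errors / len(labels)

# Hypothetical hypothesis outputs and true labels on a test set
h_test = [0.9, 0.2, 0.6, 0.4]
y_test = [1, 0, 0, 1]
test_error = misclassification_error(h_test, y_test)
print(test_error)
```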

## Model Selection

E.g. choose polynomial order of linear regression model

Order $d = 1..10$, test set errors $J_\textrm{test}(\theta^{(d)})$

**Choosing $d$ based on  $J_\textrm{test}$ is fitting $d$ to the test set!!!** (just like $\theta$ is fitted to training set)

### Training - Cross Validation - Test Sets
Split say:
- Training set - 60%
- Cross validation (CV) set - 20%
- Test set - 20%

Define error for each: $J_\textrm{train}$, $J_\textrm{CV}$, $J_\textrm{test}$

- Select $d$ based on **CV set**
- Estimate generalization error on the **test set**
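The selection procedure can be sketched in plain Python. Everything here is an illustrative assumption: the helper names `fit_poly`, `cost` and `columns`, the noise-free quadratic toy data, and the restriction to orders 1 to 4 (the normal equations solved naively become ill-conditioned for high orders).

```python
import random

def fit_poly(xs, ys, d):
    """Least-squares polynomial of order d via the normal equations."""
    n = d + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    for col in range(n):                      # Gaussian elimination, partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    theta = [0.0] * n
    for i in reversed(range(n)):              # back substitution
        s = sum(A[i][j] * theta[j] for j in range(i + 1, n))
        theta[i] = (b[i] - s) / A[i][i]
    return theta

def cost(theta, xs, ys):
    """Squared-error cost 1/(2m) * sum (h(x) - y)^2, without regularization."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = sum(t * x ** k for k, t in enumerate(theta))
        total += (h - y) ** 2
    return total / (2 * len(xs))

def columns(rows):
    """Turn a list of (x, y) pairs into (xs, ys)."""
    return [x for x, _ in rows], [y for _, y in rows]

# Hypothetical noise-free quadratic data, shuffled and split 60/20/20
data = [(i / 10, 1 + 2 * (i / 10) - 3 * (i / 10) ** 2) for i in range(-10, 10)]
random.Random(0).shuffle(data)
train, cv, test = data[:12], data[12:16], data[16:]

thetas, cv_errors = {}, {}
for d in range(1, 5):
    thetas[d] = fit_poly(*columns(train), d)       # fit theta on the training set
    cv_errors[d] = cost(thetas[d], *columns(cv))   # score each d on the CV set

best_d = min(cv_errors, key=cv_errors.get)          # choose d on the CV set only
test_error = cost(thetas[best_d], *columns(test))   # report error on untouched test set
print(best_d, test_error)
```

The test set is used exactly once, after $d$ has been fixed, so the reported error is not biased by the selection step.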

## Diagnosing Bias (underfit) vs Variance (overfit)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

Bias (underfit): $J_\textrm{train} \approx J_\textrm{CV}$, **both high**

Variance (overfit): $J_\textrm{train}$ **low**, $J_\textrm{CV}$ **high**
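As a rough sketch, the two rules above can be turned into a check. The notes do not say what counts as a "high" error, so the `acceptable_error` threshold and the function name `diagnose` are hypothetical choices.

```python
def diagnose(j_train, j_cv, acceptable_error):
    """Classify a (J_train, J_CV) pair using a caller-supplied error threshold."""
    if j_train > acceptable_error:
        # Training error is already high, so J_CV will be about as high or higher
        return "high bias (underfit)"
    if j_cv > acceptable_error:
        # Fits the training set well but does not generalize
        return "high variance (overfit)"
    return "ok"

print(diagnose(0.90, 1.00, acceptable_error=0.2))
print(diagnose(0.05, 0.80, acceptable_error=0.2))
```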

## Regularization and Bias/Variance

Large $\lambda$ - high bias - underfit

Small $\lambda$ - high variance - overfit

- $J(\theta)$: training set, **with regularization term**
- $J_\textrm{train}(\theta)$: training set, **w/o** regularization term
- $J_\textrm{CV}(\theta)$: CV set, **w/o** regularization term
- $J_\textrm{test}(\theta)$: test set, **w/o** regularization term
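For regularized linear regression, the difference between the first two definitions can be written out as follows. The notation is assumed here: $m$ training examples, $n$ features, hypothesis $h_\theta$, and $\theta_0$ left out of the penalty.

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

$$J_\textrm{train}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

$J_\textrm{CV}$ and $J_\textrm{test}$ have the same form as $J_\textrm{train}$, summed over the CV and test examples respectively. $\theta$ is learned by minimizing $J(\theta)$; the other three only measure how well that $\theta$ fits, so comparing them across different $\lambda$ values is fair.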

![image.png](attachment:image.png)

## Learning Curves

Plot $J_\textrm{train}$ and $J_\textrm{CV}$ vs training set size $m$
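A minimal sketch of how the points of such a curve could be computed, assuming a one-feature linear model fitted in closed form. The helper names and the toy quadratic data are made up for illustration; a straight line underfits this data, so this corresponds to the high-bias case.

```python
def fit_line(xs, ys):
    """Closed-form least squares for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    theta1 = sxy / sxx if sxx else 0.0
    return mean_y - theta1 * mean_x, theta1

def cost(theta, xs, ys):
    """Squared-error cost 1/(2m) * sum (h(x) - y)^2, without regularization."""
    theta0, theta1 = theta
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

# Hypothetical toy data: y = x^2
train_x = [0.0, 4.0, 1.0, 3.0, 2.0, 5.0]
train_y = [x ** 2 for x in train_x]
cv_x = [0.5, 1.5, 2.5, 3.5, 4.5]
cv_y = [x ** 2 for x in cv_x]

curve = []
for m in range(2, len(train_x) + 1):
    theta = fit_line(train_x[:m], train_y[:m])     # train on the first m examples only
    j_train = cost(theta, train_x[:m], train_y[:m])  # error on those same m examples
    j_cv = cost(theta, cv_x, cv_y)                   # error on the full CV set
    curve.append((m, j_train, j_cv))

for m, j_train, j_cv in curve:
    print(m, round(j_train, 3), round(j_cv, 3))
```

Note that $J_\textrm{train}$ is evaluated on the same $m$ examples used for fitting, while $J_\textrm{CV}$ always uses the whole CV set.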

### High bias
![image.png](attachment:image.png)

### High variance
![image.png](attachment:image.png)

## Solutions to pursue:
### High bias (underfit)
- get additional features
- add polynomial features
- decrease $\lambda$

### High variance (overfit)
- get more training examples
- use smaller set of features
- increase $\lambda$


## Neural networks and overfitting

### Small
- prone to underfitting
- computationally cheap

### Large
- prone to overfitting
- computationally expensive

---

# Machine Learning System Design
## Text Classifier
e.g. spam

X = vector of 1s and 0s marking the occurrence of, say, the 10000-50000 most frequent words
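A small sketch of how such a feature vector could be built. The five-word `vocabulary`, the whitespace tokenization, and the function name `binary_features` are simplifying assumptions; a real vocabulary would hold the most frequent words of the training corpus.

```python
def binary_features(email_text, vocabulary):
    """Return a list with 1 where the vocabulary word occurs in the email, else 0."""
    words = set(email_text.lower().split())   # naive tokenization, no stemming
    return [1 if word in words else 0 for word in vocabulary]

# Hypothetical tiny vocabulary
vocabulary = ["buy", "deal", "discount", "meeting", "now"]
X = binary_features("Buy now and get a DEAL", vocabulary)
print(X)
```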

"stemming"

## Approach
- start with simple algorithm
- plot learning curves to decide if more data, features etc. will help
- **Error analysis**: manually examine misclassified examples
- use cross-validation error - **numerical evaluation**

## Skewed classes
say 99% of examples are in the same class 0

simple prediction:
```
y = 0  # for any x!!!
```

classification accuracy can be a poor indicator; need alternative
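To see why accuracy misleads here, consider a made-up data set of 100 examples with a single positive, scored against the always-zero predictor.

```python
# Hypothetical skewed data set: 1 positive example, 99 negatives
labels = [1] * 1 + [0] * 99
predictions = [0 for _ in labels]             # "y = 0 for any x"

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
positives_found = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)

print(accuracy)          # high accuracy ...
print(positives_found)   # ... yet no positive example is ever detected
```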

### Precision/Recall
| |Actual: 1|Actual: 0|
|---|---|---|
|Predicted: 1| True positive | False positive|
|Predicted: 0| False negative | True negative|

`Precision = True pos/(True pos + False pos)`

`Recall = True pos/(True pos + False neg)`
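Both quantities can be computed directly from the table counts. In this sketch the function name `precision_recall` and the example lists are illustrative; returning 0.0 when a denominator is zero is an assumed convention, since the ratio is undefined in that case.

```python
def precision_recall(predictions, labels):
    """Compute (precision, recall) for binary predictions against binary labels."""
    pairs = list(zip(predictions, labels))
    true_pos = sum(1 for p, y in pairs if p == 1 and y == 1)
    false_pos = sum(1 for p, y in pairs if p == 1 and y == 0)
    false_neg = sum(1 for p, y in pairs if p == 0 and y == 1)
    # Assumed convention: report 0.0 when the ratio would divide by zero
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Hypothetical predictions and labels
predictions = [1, 1, 1, 0, 0, 0, 0, 0]
labels      = [1, 1, 0, 1, 1, 1, 0, 0]
precision, recall = precision_recall(predictions, labels)
print(precision, recall)
```

With the always-zero predictor from the skewed-classes example, recall is 0, which exposes the problem that accuracy hides.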

#### Trading off precision/recall
Logistic regression: $0\leq h \leq 1$

##### Avoid false positives:

Modify threshold:
- predict 1 if $h\geq0.9$
- predict 0 if $h < 0.9$

"predict cancer only if 90% certain"

**-> High precision/low recall**

##### Avoid false negatives:

Modify threshold:
- predict 1 if $h\geq0.3$
- predict 0 if $h < 0.3$

"predict cancer if 30% chance"

**-> High recall/low precision**
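The effect of moving the threshold can be sketched on a handful of made-up hypothesis outputs. The names `precision_recall_at`, `h_values` and `labels` are illustrative, and the values are chosen by hand, not taken from a real model.

```python
def precision_recall_at(h_values, labels, threshold):
    """Threshold the hypothesis outputs, then compute (precision, recall)."""
    predictions = [1 if h >= threshold else 0 for h in h_values]
    pairs = list(zip(predictions, labels))
    true_pos = sum(1 for p, y in pairs if p == 1 and y == 1)
    false_pos = sum(1 for p, y in pairs if p == 1 and y == 0)
    false_neg = sum(1 for p, y in pairs if p == 0 and y == 1)
    # Assumed convention: report 0.0 when the ratio would divide by zero
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Hypothetical logistic regression outputs and true labels
h_values = [0.95, 0.92, 0.70, 0.60, 0.40, 0.35, 0.20, 0.10]
labels   = [1,    1,    1,    0,    1,    0,    0,    0]

strict = precision_recall_at(h_values, labels, threshold=0.9)   # avoid false positives
lenient = precision_recall_at(h_values, labels, threshold=0.3)  # avoid false negatives
print(strict, lenient)
```

On this toy data the high threshold gives higher precision and lower recall than the low threshold, matching the two cases above.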

How to choose best pair (precision, recall) from different algorithms/parameters?

#### F score:
$$F_1 = 2\frac{PR}{P+R}$$
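A short sketch of using $F_1$ to rank candidates. The three (precision, recall) pairs and their names are illustrative values, and returning 0.0 when $P + R = 0$ is an assumed convention.

```python
def f1_score(precision, recall):
    """F1 = 2PR / (P + R); returns 0.0 when both are zero (assumed convention)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical (precision, recall) pairs measured on the CV set
candidates = {
    "algorithm 1": (0.5, 0.4),
    "algorithm 2": (0.7, 0.1),
    "algorithm 3": (0.02, 1.0),
}
best = max(candidates, key=lambda name: f1_score(*candidates[name]))
print(best)
```

Unlike the plain average $(P+R)/2$, $F_1$ is low whenever either value is low, so a classifier that predicts 1 for everything (recall 1, tiny precision) does not win.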

---

## Data

Could a human expert predict $y$ from $x$? --> more data can help!