## Evaluating a Hypothesis

Once we have done some troubleshooting for errors in our predictions by:

* Getting more training examples
* Trying smaller sets of features
* Trying additional features
* Trying polynomial features
* Increasing or decreasing $\lambda$

We can move on to evaluate our new hypothesis.

A hypothesis may have a low error for the training examples but still be inaccurate (because of overfitting). Thus, to evaluate a hypothesis, given a dataset of training examples, we can split up the data into two sets: a training set and a test set. Typically, the training set consists of 70 % of your data and the test set is the remaining 30 %.

The new procedure using these two sets is then:

1. Learn $\Theta$ and minimize $J_{train}(\Theta)$ using the training set
2. Compute the test set error $J_{test}(\Theta)$
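
As a minimal sketch of the 70/30 split described above (the helper name `train_test_split`, the `seed` parameter and the toy data are assumptions made for illustration, not part of the course material), the data can be shuffled and then cut at 70%:

```python
import random

def train_test_split(examples, train_fraction=0.7, seed=0):
    """Shuffle the examples and split them into (training set, test set)."""
    shuffled = list(examples)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # shuffle in case the data is ordered
    n_train = int(round(train_fraction * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]

# Toy dataset of (x, y) pairs, purely illustrative
data = [(x, 2 * x + 1) for x in range(10)]
train_set, test_set = train_test_split(data)
print(len(train_set), len(test_set))
```

Shuffling first matters when the data is stored in some order (for example sorted by label), since otherwise the test set would not be representative.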

The test set error is computed as follows:

1. For linear regression: $J_{test}(\Theta) = \dfrac{1}{2m_{test}} \sum_{i=1}^{m_{test}}(h_\Theta(x^{(i)}_{test}) - y^{(i)}_{test})^2$
2. For classification, the misclassification error (also called the 0/1 misclassification error):

>$err(h_\Theta(x),y) = \begin{cases} 1 & \text{if } (h_\Theta(x) \geq 0.5 \text{ and } y = 0) \text{ or } (h_\Theta(x) < 0.5 \text{ and } y = 1) \\ 0 & \text{otherwise} \end{cases}$

This gives us a binary 0 or 1 error result based on a misclassification. The average test error for the test set is:

>$\text{Test Error} = \dfrac{1}{m_{test}} \sum^{m_{test}}_{i=1} err(h_\Theta(x^{(i)}_{test}), y^{(i)}_{test})$

This gives us the proportion of the test data that was misclassified.
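
The two test error formulas above can be sketched in plain Python as follows. The function names and the small input lists are hypothetical; the inputs are assumed to be the hypothesis outputs $h_\Theta(x^{(i)}_{test})$ and the matching labels $y^{(i)}_{test}$.

```python
def linear_regression_test_error(predictions, targets):
    """J_test = 1/(2 * m_test) * sum of squared differences."""
    m_test = len(targets)
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / (2 * m_test)

def misclassification_error(probabilities, labels, threshold=0.5):
    """Average 0/1 error: the fraction of test examples that are misclassified."""
    mistakes = 0
    for p, y in zip(probabilities, labels):
        predicted = 1 if p >= threshold else 0
        mistakes += int(predicted != y)
    return mistakes / len(labels)

# Squared differences are 0, 1 and 4, so the result is (0 + 1 + 4) / 6
print(linear_regression_test_error([1.0, 2.0, 4.0], [1.0, 3.0, 2.0]))
# The 3rd and 4th examples fall on the wrong side of 0.5, so 2 of 4 are wrong
print(misclassification_error([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1]))
```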

## Model Selection and Train/Validation/Test Sets

Just because a learning algorithm fits a training set well does not mean it is a good hypothesis. It could overfit, and as a result your predictions on the test set would be poor. The error of your hypothesis as measured on the data set with which you trained the parameters will generally be lower than the error on any other data set.

Given many models with different polynomial degrees, we can use a systematic approach to identify the 'best' function. In order to choose the model of your hypothesis, you can test each degree of polynomial and look at the error result.

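If we chose the degree by looking at the test set error, that error would become an optimistic estimate of how well the model generalizes. To avoid this, we can break our dataset into three sets, for example: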

* Training set: 60%
* Cross validation set: 20%
* Test set: 20%

We can now calculate three separate error values for the three different sets using the following method:

1. Optimize the parameters in $\Theta$ using the training set for each polynomial degree.
2. Find the polynomial degree $d$ with the least error using the cross validation set.
3. Estimate the generalization error using the test set with $J_{test}(\Theta^{(d)})$, where $\Theta^{(d)}$ is the parameter vector learned for the polynomial degree $d$ that had the lowest cross validation error.

This way, the degree of the polynomial $d$ has not been chosen using the test set.
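
The three-step procedure above might look like the following sketch. It assumes NumPy is available and uses synthetic data (a noisy quadratic); the helper `half_mse` and all variable names are made up for this illustration.

```python
import numpy as np

def half_mse(theta, x, y):
    """J(theta) = 1/(2m) * sum((h(x) - y)^2) for a polynomial hypothesis."""
    residuals = np.polyval(theta, x) - y
    return float(np.sum(residuals ** 2) / (2 * len(y)))

# Synthetic data (illustrative only): a noisy quadratic
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x - 1.5 * x ** 2 + rng.normal(0, 1.0, size=200)

# 60/20/20 split; x was drawn at random, so slicing by position is fine here
x_train, y_train = x[:120], y[:120]
x_cv, y_cv = x[120:160], y[120:160]
x_test, y_test = x[160:], y[160:]

degrees = range(1, 9)
# Step 1: learn one parameter vector per polynomial degree on the training set
thetas = {d: np.polyfit(x_train, y_train, d) for d in degrees}
# Step 2: pick the degree with the lowest cross validation error
cv_errors = {d: half_mse(thetas[d], x_cv, y_cv) for d in degrees}
best_d = min(cv_errors, key=cv_errors.get)
# Step 3: estimate the generalization error on the untouched test set
test_error = half_mse(thetas[best_d], x_test, y_test)
print(best_d, cv_errors[best_d], test_error)
```

Because the test set plays no part in choosing `best_d`, `test_error` remains a fair estimate of the generalization error.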

## Bias vs. Variance

### Diagnosing Bias vs. Variance

In this section we examine the relationship between the degree of the polynomial d and the underfitting or overfitting of our hypothesis.

* We need to distinguish whether bias or variance is the problem contributing to bad predictions.
* High bias is underfitting and high variance is overfitting. Ideally, we need to find a golden mean between these two.

The training error will tend to decrease as we increase the degree d of the polynomial.

At the same time, the cross validation error will tend to decrease as we increase d up to a point, and then it will increase as d is increased, forming a convex curve.

High bias (underfitting): both $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ will be high. Also, $J_{CV}(\Theta) \approx J_{train}(\Theta)$.

High variance (overfitting): $J_{train}(\Theta)$ will be low and $J_{CV}(\Theta)$ will be much greater than $J_{train}(\Theta)$.

This is summarized in the figure below:

<img src="img/01-LME.png" align="left">

### Regularization and Bias/Variance

Note: The regularization term below and throughout the video should be $\frac \lambda {2m} \sum _{j=1}^n \theta_j ^2$ and NOT $\frac \lambda {2m} \sum _{j=1}^m \theta_j ^2$. The sum runs over the $n$ features, not the $m$ training examples.

<img src="img/02-LME.png">
          
In the figure above, we see that as $\lambda$ increases, our fit becomes more rigid. On the other hand, as $\lambda$ approaches 0, we tend to overfit the data. So how do we choose our parameter $\lambda$ to get it 'just right'? In order to choose the model and the regularization parameter $\lambda$, we need to:

1. Create a list of lambdas (e.g. $\lambda \in \{0,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10.24\}$).
2. Create a set of models with different degrees or any other variants.
3. Iterate through the $\lambda$s and, for each $\lambda$, go through all the models to learn some $\Theta$.
4. Compute the cross validation error $J_{CV}(\Theta)$ using the learned $\Theta$ (computed with $\lambda$), but **without** the regularization term (that is, with $\lambda = 0$ in the cost).
5. Select the combination that produces the lowest error on the cross validation set.
6. Using the best combination of $\Theta$ and $\lambda$, compute $J_{test}(\Theta)$ to see whether it generalizes well to the problem.
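
The selection loop above can be sketched as follows, assuming NumPy and regularized linear regression solved with the normal equation. The helper names (`poly_features`, `fit_regularized`, `unregularized_cost`), the choice of degrees and the synthetic sine data are all assumptions made for this example.

```python
import numpy as np

def poly_features(x, degree):
    """Design matrix with columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(X, y, lam):
    """Minimize 1/(2m) * sum((X @ theta - y)^2) + lam/(2m) * sum(theta[1:]^2)."""
    penalty = np.eye(X.shape[1])
    penalty[0, 0] = 0.0                      # theta_0 (the bias term) is not regularized
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)

def unregularized_cost(X, y, theta):
    """J(theta) with no regularization term, as used for J_cv and J_test."""
    return float(np.sum((X @ theta - y) ** 2) / (2 * len(y)))

# Synthetic data (illustrative only): a noisy sine wave on [-1, 1]
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=100)
y = np.sin(3 * x) + rng.normal(0, 0.2, size=100)
x_train, y_train = x[:60], y[:60]
x_cv, y_cv = x[60:80], y[60:80]
x_test, y_test = x[80:], y[80:]

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]
degrees = [1, 2, 4, 8]

best = None  # (cv_error, lam, degree, theta) of the best combination so far
for lam in lambdas:
    for degree in degrees:
        theta = fit_regularized(poly_features(x_train, degree), y_train, lam)
        cv_error = unregularized_cost(poly_features(x_cv, degree), y_cv, theta)
        if best is None or cv_error < best[0]:
            best = (cv_error, lam, degree, theta)

best_cv_error, best_lam, best_degree, best_theta = best
test_error = unregularized_cost(poly_features(x_test, best_degree), y_test, best_theta)
print(best_lam, best_degree, best_cv_error, test_error)
```

Note that $\lambda$ only enters `fit_regularized`; the cross validation and test errors are computed with the plain squared-error cost so that models trained with different $\lambda$ values are compared on the same scale.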

## Learning Curves

Training an algorithm on a very small number of data points (such as 1, 2 or 3) will easily give 0 error, because we can always find a quadratic curve that passes exactly through that many points. Hence:

* As the training set gets larger, the error for a quadratic function increases.
* The error value will plateau out after a certain m, or training set size.

**Experiencing high bias**:

**Low training set size**: causes $J_{train}(\Theta)$ to be low and $J_{CV}(\Theta)$ to be high.

**Large training set size**: causes both $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ to be high with $J_{train}(\Theta) \approx J_{CV}(\Theta)$.

If a learning algorithm is suffering from **high bias**, getting more training data will not (**by itself**) help much.

<img src="img/03-LME.png">

**Experiencing high variance**:

**Low training set size**: $J_{train}(\Theta)$ will be low and $J_{CV}(\Theta)$ will be high.

**Large training set size**: $J_{train}(\Theta)$ increases with training set size and $J_{CV}(\Theta)$ continues to decrease without leveling off. Also, $J_{train}(\Theta) < J_{CV}(\Theta)$ but the difference between them remains significant.

If a learning algorithm is suffering from **high variance**, getting more training data is likely to help.

<img src="img/04-LME.png">
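
The learning curves described in this section can be computed with a loop like the one below. This is a sketch assuming NumPy and synthetic data; `learning_curve` and `half_mse` are hypothetical helper names. For each training set size $m$, the model is fit on the first $m$ training examples, $J_{train}$ is measured on those same $m$ examples, and $J_{CV}$ is measured on the full cross validation set.

```python
import numpy as np

def half_mse(theta, x, y):
    """J(theta) = 1/(2m) * sum((h(x) - y)^2) for a polynomial hypothesis."""
    residuals = np.polyval(theta, x) - y
    return float(np.sum(residuals ** 2) / (2 * len(y)))

def learning_curve(x_train, y_train, x_cv, y_cv, degree, sizes):
    """Return (train_errors, cv_errors), one entry per training set size."""
    train_errors, cv_errors = [], []
    for m in sizes:
        theta = np.polyfit(x_train[:m], y_train[:m], degree)
        train_errors.append(half_mse(theta, x_train[:m], y_train[:m]))
        cv_errors.append(half_mse(theta, x_cv, y_cv))
    return train_errors, cv_errors

# Synthetic data (illustrative only): a noisy quadratic
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=150)
y = 1.0 + 2.0 * x - 1.5 * x ** 2 + rng.normal(0, 1.0, size=150)
x_train, y_train, x_cv, y_cv = x[:100], y[:100], x[100:], y[100:]

sizes = range(2, 101)
# A straight line (degree=1) cannot capture the quadratic trend: a high bias model
train_errors, cv_errors = learning_curve(x_train, y_train, x_cv, y_cv, degree=1, sizes=sizes)
print(train_errors[0], train_errors[-1], cv_errors[-1])
```

With two training points the line fits them exactly, so the first training error is essentially 0. As $m$ grows, both errors should settle at a high value, which is the high bias pattern described above. Passing a larger `degree` lets you explore the high variance case in the same way.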

## Deciding What to Do Next Revisited

Our decision process can be broken down as follows:

* **Getting more training examples**: Fixes high variance

* **Trying smaller sets of features**: Fixes high variance

* **Adding features**: Fixes high bias

* **Adding polynomial features**: Fixes high bias

* **Decreasing $\lambda$**: Fixes high bias

* **Increasing $\lambda$**: Fixes high variance

**Diagnosing Neural Networks**

* A neural network with **fewer parameters** is **prone to underfitting**. It is also **computationally cheaper**.
* A large neural network with **more parameters** is **prone to overfitting**. It is also **computationally expensive**. In this case you can use regularization (increase $\lambda$) to address the overfitting.

Using a single hidden layer is a good starting default. You can train neural networks with different numbers of hidden layers, evaluate each of them on your cross validation set, and then select the one that performs best.

**Model Complexity Effects**:

* Lower-order polynomials (low model complexity) have high bias and low variance. In this case, the model fits poorly consistently.
* Higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly. These have low bias on the training data, but very high variance.
* In reality, we would want to choose a model somewhere in between, that can generalize well but also fits the data reasonably well.

# Machine learning system design 

## Prioritizing What to Work On: Spam Classification Example

Given a data set of emails, we could construct a vector for each email. Each entry in this vector represents a word. The vector normally contains 10,000 to 50,000 entries gathered by finding the most frequently used words in our data set. If a word is found in the email, we assign its respective entry a 1; if it is not found, that entry is a 0. Once we have all our $x$ vectors ready, we train our algorithm and, finally, use it to classify whether an email is spam or not.
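
As a rough sketch of this feature construction (the tokenizer, the function names and the tiny example emails are assumptions chosen for illustration; a real system would use a vocabulary of 10,000 to 50,000 words), the vocabulary and the binary vectors could be built like this:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocabulary(emails, vocab_size):
    """Return the vocab_size most frequently used words in the data set."""
    counts = Counter(word for email in emails for word in tokenize(email))
    return [word for word, _ in counts.most_common(vocab_size)]

def email_to_vector(email, vocabulary):
    """Entry j is 1 if vocabulary[j] appears in the email, else 0."""
    words = set(tokenize(email))
    return [1 if word in words else 0 for word in vocabulary]

emails = ["buy now", "buy deal", "buy now please"]
vocabulary = build_vocabulary(emails, vocab_size=2)
print(vocabulary)
print(email_to_vector("Buy this NOW!", vocabulary))
```

Note that the vector records only whether a word is present, not how many times it occurs.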


**Building a Spam Classifier**

This is a supervised learning problem: $x$ = features of the email, $y$ = spam (1) or not spam (0).

Features: choose 100 words indicative of spam/not spam.

<img src="img/01-spam-classifier.png">

Note: In practice, take the most frequently occurring words (10,000 to 50,000) in the training set, rather than manually picking 100 words.

So how could you spend your time to improve the accuracy of this classifier?

* Collect lots of data (for example, through a "honeypot" project, although more data does not always help)
* Develop sophisticated features (for example, using email header data in spam emails)
* Develop algorithms to process your input in different ways (for example, recognizing misspellings in spam)

It is difficult to tell which of the options will be most helpful.

## Error Analysis

The recommended approach to solving machine learning problems is to:

* Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
* Plot learning curves to decide if more data, more features, etc. are likely to help.
* Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.

For example, assume that we have 500 emails and our algorithm misclassifies 100 of them. We could manually analyze the 100 emails and categorize them based on what type of emails they are. We could then try to come up with new cues and features that would help us classify these 100 emails correctly. Hence, if most of our misclassified emails are those which try to steal passwords, then we could find some features that are particular to those emails and add them to our model. We could also see how classifying each word according to its root changes our error rate:

<img src="img/02-spam-classifier.png">

It is very important to get error results as a single, numerical value. Otherwise it is difficult to assess your algorithm's performance. For example if we use stemming, which is the process of treating the same word with different forms (fail/failing/failed) as one word (fail), and get a 3% error rate instead of 5%, then we should definitely add it to our model. However, if we try to distinguish between upper case and lower case letters and end up getting a 3.2% error rate instead of 3%, then we should avoid using this new feature. Hence, we should try new things, get a numerical value for our error rate, and based on our result decide whether we want to keep the new feature or not.

## Error Metrics for Skewed Classes
It is sometimes difficult to tell whether a reduction in error is actually an improvement of the algorithm.

**For example**: In predicting a cancer diagnosis where 0.5% of the examples have cancer, we find our
learning algorithm has a 1% error. However, if we were to simply classify every single example as a 0,
then our error would reduce to 0.5% even though we did not improve the algorithm.

<img src="img/skewed-classes.png">

This usually happens with `skewed classes`; that is, when one class is very rare in the entire data set.
Or to say it another way, when we have a lot more examples from one class than from the other class.

For this we can use **Precision/Recall**.

* Predicted: 1, Actual: 1 --- True positive
* Predicted: 0, Actual: 0 --- True negative
* Predicted: 0, Actual: 1 --- False negative
* Predicted: 1, Actual: 0 --- False positive

**Precision**: Of all patients for whom we predicted $y = 1$, what fraction actually has cancer?

$ \displaystyle \frac {\text{True Positives}}{\text{Total number of predicted positives}} = \frac {\text{True Positives}} {\text{True Positives}+\text{False positives}} $

**Recall**: Of all the patients that actually have cancer, what fraction did we correctly detect as having cancer?

$ \displaystyle \frac {\text{True Positives}}{\text{Total number of actual positives}} = \frac {\text{True Positives}} {\text{True Positives}+\text{False negatives}} $

These two metrics give us a better sense of how our classifier is doing. We want both **precision** and **recall** to be high.
In the example at the beginning of the section, if we classify all patients as 0, then there are no true positives and our recall will be
$\frac{0}{0+f} = 0$, where $f$ is the number of false negatives. So despite having a lower error percentage, we can quickly see it has worse recall.

* **Note 1**: If an algorithm predicts only negatives, as it does in one of the exercises, precision is not defined because the denominator (true positives + false positives) is 0. The `F1 score` is then undefined as well.
* **Note 2**: Calculating precision and the other metrics by hand is error prone. It is easy to set up a spreadsheet (for example in Excel) instead: enter the four input values in a 2×2 table and name the cells TruePositive, FalsePositive, TrueNegative and FalseNegative. In another cell, labelled AllExamples, add the formula =SUM(TruePositive,FalsePositive,TrueNegative,FalseNegative). In a cell labelled Accuracy, add the formula =SUM(TruePositive,TrueNegative)/AllExamples. The other metrics follow the same pattern.

So for the problem of skewed classes, `precision/recall` gives us more direct insight into how the learning algorithm is doing. When the classes are very skewed, this is often a much better way to evaluate our learning algorithms than looking at classification error or classification accuracy.
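
The definitions in this section can be checked with a short script. The sketch below uses hypothetical function names and a made-up data set of 1,000 patients of whom 5 have cancer (the 0.5% from the example), evaluated against a classifier that always predicts 0. Returning `None` for an undefined metric is a choice made here, not a standard convention.

```python
def confusion_counts(predicted, actual):
    """Return (true positives, false positives, false negatives, true negatives)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    return tp, fp, fn, tn

def precision_recall(predicted, actual):
    """Return (precision, recall); a metric is None when its denominator is 0."""
    tp, fp, fn, _ = confusion_counts(predicted, actual)
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return precision, recall

# 1,000 patients, 5 of whom have cancer; the classifier always predicts 0
actual = [1] * 5 + [0] * 995
always_zero = [0] * 1000

tp, fp, fn, tn = confusion_counts(always_zero, actual)
accuracy = (tp + tn) / len(actual)
precision, recall = precision_recall(always_zero, actual)
print(accuracy, precision, recall)
```

The accuracy is 0.995, which looks excellent, yet recall is 0 and precision is undefined, which exposes the classifier as useless for finding cancer cases.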
 

## Trading Off Precision and Recall
We might want a confident prediction of two classes using logistic regression. One way is to increase our
threshold:
>$\begin{align*} & \text{Predict 1 if:}\quad h_\theta(x) \geq 0.7 \\ & \text{Predict 0 if:} \quad h_\theta(x) < 0.7 \end{align*}$

This way, we only predict cancer if the patient has at least a 70% estimated chance of having it.
Doing this, we will have **higher precision** but **lower recall** (refer to the definitions in the previous section).
In the opposite example, we can lower our threshold:
>$\begin{align*} & \text{Predict 1 if:}\quad h_\theta(x) \geq 0.3 \\ & \text{Predict 0 if:} \quad h_\theta(x) < 0.3 \end{align*}$

That way, we get a very **safe** prediction, in the sense that we are less likely to miss a case of cancer. This will cause **higher recall** but **lower precision**.

* The greater the threshold, the greater the precision and the lower the recall.
* The lower the threshold, the greater the recall and the lower the precision.

In order to turn these two metrics into one single number, we can take the **F value**.
One way is to take the average:

$\displaystyle \frac{(P + R)}{2}$

This does not work well. If we predict $y = 1$ for every example, recall is 1, which pulls the average up even though precision is very low. Likewise, a classifier that almost always predicts $y = 0$ can have high precision but recall close to 0, and the average can still look acceptable.
A better way is to compute the **F Score** (or *F1 score*):

$\displaystyle \text{F Score}=2\frac{PR}{P + R}$

In order for the F Score to be large, both precision and recall must be large.
We want to evaluate precision and recall on the **cross validation set** so as not to bias our test set.
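
To see why the F Score is preferred over the plain average, the following sketch compares the two on three made-up (precision, recall) pairs. The numbers and names are hypothetical, and treating the F Score as 0 when $P + R = 0$ is a convention assumed here.

```python
def f_score(precision, recall):
    """F1 score; defined as 0 here when precision + recall is 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical (precision, recall) pairs for three classifiers
classifiers = {
    "algorithm_1": (0.5, 0.4),
    "algorithm_2": (0.7, 0.1),
    "algorithm_3": (0.02, 1.0),   # predicts y = 1 almost always
}

for name, (p, r) in classifiers.items():
    print(name, "average:", (p + r) / 2, "F score:", round(f_score(p, r), 3))

best_by_average = max(classifiers, key=lambda n: sum(classifiers[n]) / 2)
best_by_f_score = max(classifiers, key=lambda n: f_score(*classifiers[n]))
print(best_by_average, best_by_f_score)
```

The plain average ranks `algorithm_3` first even though its precision is only 0.02, while the F Score ranks `algorithm_1` first because it is the only one where both values are reasonably large.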


To summarize, we talked about the trade-off between precision and recall, and how we can vary the threshold that decides whether we predict $y=1$ or $y=0$. The threshold says whether we need to be at least 70% confident, or 90% confident, or some other value, before we predict $y=1$. By varying the threshold, we control the trade-off between precision and recall. We also talked about the F Score, which combines precision and recall into a single real-number evaluation metric. If the goal is to set the threshold automatically, one reasonable way is to try a range of threshold values, evaluate each of them on the cross validation set, and pick the threshold that gives the highest F Score there.
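
The automatic threshold selection described above can be sketched as below. The function names, the candidate thresholds and the small cross validation set are assumptions made for illustration, as is the convention of scoring a threshold as 0 when it produces no true positives.

```python
def f1_at_threshold(probabilities, labels, threshold):
    """F Score obtained when predicting 1 whenever probability >= threshold."""
    tp = fp = fn = 0
    for p, y in zip(probabilities, labels):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
    if tp == 0:
        return 0.0  # precision or recall is 0 or undefined; scored as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(probabilities, labels, thresholds):
    """Return the candidate threshold with the highest F Score."""
    return max(thresholds, key=lambda t: f1_at_threshold(probabilities, labels, t))

# Hypothetical classifier outputs h(x) and true labels on a cross validation set
cv_probabilities = [0.95, 0.8, 0.65, 0.6, 0.4, 0.35, 0.2, 0.1]
cv_labels = [1, 1, 0, 1, 0, 1, 0, 0]
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

chosen = best_threshold(cv_probabilities, cv_labels, thresholds)
print(chosen, f1_at_threshold(cv_probabilities, cv_labels, chosen))
```

On this toy data the threshold 0.3 gives precision 4/6 and recall 1, for an F Score of 0.8, which is the highest among the candidates.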

## Data for Machine Learning
How much data should we train on?
In certain cases, an "inferior algorithm," if given enough data, can outperform a superior algorithm with less
data.


We must choose our features to have **enough** information. A useful test is: Given input $x$, would a human expert be able to confidently predict $y$?

**Rationale for large data**: if we have a **low bias** algorithm (many features or hidden units making a very
complex function), then the larger the training set we use, the less overfitting we will have (and the more
accurate the algorithm will be on the test set).