# Week 6 - Advice for Applying Machine Learning & Machine Learning System Design

In Week 6, we will be learning about systematically improving our learning algorithm. We will cover how to tell when a learning algorithm is doing poorly, and describe the best practices for how to debug your learning algorithm and go about improving its performance.

We will also be covering machine learning system design. To optimize a machine learning algorithm, you’ll need to first understand where the biggest improvements can be made. In these lessons, we discuss how to understand the performance of a machine learning system with multiple parts, and also how to deal with skewed data.

When you're applying machine learning to real problems, a solid grasp of this week's content will easily save you a large amount of work.

The topics we'll cover this week are:
* Advice for Applying Machine Learning
  * Evaluating a Learning Algorithm
    * Deciding What to Try Next
    * Evaluating a Hypothesis
    * Model Selection and Train/Validation/Test Sets
  * Bias vs. Variance
    * Diagnosing Bias vs. Variance
    * Regularization and Bias/Variance
    * Learning Curves
    * Deciding What to Try Next, Revisited
* Machine Learning System Design
  * Building a Spam Classifier
    * Prioritizing What to Work On
    * Error Analysis
  * Handling Skewed Data
    * Error Metrics for Skewed Classes
    * Trading Off Precision and Recall
  * Using Large Data Sets
  
## Advice for Applying Machine Learning

### Evaluating a Learning Algorithm

#### Deciding What to Try Next

By now, we know a good deal about some machine learning techniques, but one thing that will help greatly is knowing how to appropriately apply these techniques and how to perform debugging.

Let's use the example of predicting housing prices. Say we've implemented regularized linear regression and minimized the cost function on our training set. Now suppose we run the model on new test cases and find unacceptably large errors in its predictions. What do we try next?
1. We could add more training samples. It could be that the training set wasn't complete enough. But sometimes this doesn't help and can waste a lot of time.
2. We could use a smaller set of features to avoid overfitting.
3. We could *add* features if the current features don't cover the important bases.
4. We could add polynomial features.
5. We could decrease the regularization coefficient.
6. We could increase the regularization coefficient.

Unfortunately, people often lean on "gut feelings" to pick one of these options, which can lead to a huge waste of time. There is a simple technique to narrow down the options: we'll learn "machine learning diagnostics", tests we can run to figure out what is or isn't working with a learning algorithm and to gain guidance on how to improve its performance. Implementing the diagnostics can take some time, but it is usually time well spent.

#### Evaluating a Hypothesis

One goal of choosing our hypothesis (linear, polynomial, logistic, etc.) is to minimize the cost function. But we've already seen that overfitting is problematic because the hypothesis can fail to generalize to new examples that aren't in the training set. How do you tell if a hypothesis is overfitting? The standard way to evaluate this is as follows:
1. Randomly split the data (usually about 70% / 30%) into a training set and a test set.
2. Learn the parameters in the hypothesis on the training set, and then compute the test set error:

   $$ J_{\text{test}} = \frac{1}{2 m_\text{test}} \sum_{i=1}^{m_\text{test}} \Big( h_\theta (x_\text{test}^{(i)}) - y_\text{test}^{(i)} \Big)^2 .$$

   Or for logistic regression:
   
   $$ J_{\text{test}} = - \frac{1}{m_\text{test}} \sum_{i=1}^{m_\text{test}} \Big( y_\text{test}^{(i)} \log h_\theta (x_\text{test}^{(i)}) + (1-y_\text{test}^{(i)}) \log \big( 1 - h_\theta (x_\text{test}^{(i)}) \big) \Big) .$$

   There's an alternative method for logistic regression, called the misclassification error (0/1 misclassification error):
   
   $$\text{err}(h_\theta(x),y) = 
      \begin{cases}
       1 & \text{ if } (h_\theta(x) \geq 0.5 \text{ and } y = 0) \text{ or } (h_\theta(x) < 0.5 \text{ and } y = 1), \\
       0 & \text{ otherwise }
      \end{cases}$$
      
   And the test error is then
   
      $$ \frac{1}{m_\text{test}} \sum_{i=1}^{m_\text{test}} \text{err} \big( h_\theta (x_\text{test}^{(i)}), y_\text{test}^{(i)} \big) .$$
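
As a rough sketch of the procedure above, the following Python splits a data set 70/30 and evaluates the three error measures on the held-out part. The function names, the fixed random seed, and the toy data are assumptions made for illustration; `h` stands for any already-trained hypothesis.

```python
import math
import random

def split_train_test(examples, test_fraction=0.3, seed=0):
    """Shuffle a copy of the (x, y) pairs and return (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def squared_error_cost(h, examples):
    """Linear regression test cost: 1/(2m) * sum of squared errors."""
    return sum((h(x) - y) ** 2 for x, y in examples) / (2 * len(examples))

def logistic_cost(h, examples):
    """Logistic regression test cost; assumes 0 < h(x) < 1."""
    total = sum(y * math.log(h(x)) + (1 - y) * math.log(1 - h(x))
                for x, y in examples)
    return -total / len(examples)

def misclassification_error(h, examples, threshold=0.5):
    """Fraction of examples whose thresholded prediction disagrees with y."""
    wrong = sum(1 for x, y in examples if (h(x) >= threshold) != (y == 1))
    return wrong / len(examples)

# Toy usage: a hypothesis that happens to match the data exactly.
data = [(x, 2.0 * x) for x in range(10)]
train, test = split_train_test(data)
print(len(train), len(test), squared_error_cost(lambda x: 2.0 * x, test))
```

Shuffling before the split matters when the data is stored in some meaningful order; if it is already randomly ordered, taking the first 70% works just as well.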
      
#### Model Selection and Training/Validation/Test Sets

Suppose you want to decide what degree of polynomial to fit to the data set. Or perhaps you want to decide which regularization parameter to use for a learning algorithm. This is the model selection process.

As we've seen with overfitting, the training set error alone is not a good predictor of how well the hypothesis will do on new examples, so it is not a good measure of how well the model performs.

Let's say we are trying to decide what degree of polynomial to use as a hypothesis. In addition to the parameters of the model, there's essentially a new parameter to solve for: the degree of the polynomial, $d$. Let's look at an example:

* $d = 1 : h_\theta (x) = \theta_0 + \theta_1 x$
* $d = 2 : h_\theta (x) = \theta_0 + \theta_1 x + \theta_2 x^2$
* $d = 3 : h_\theta (x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$
* $\vdots$
* $d = 10 : h_\theta (x) = \theta_0 + \theta_1 x + \cdots + \theta_{10} x^{10}$

For each model, there will be a new set of parameters, denoted as

$$ \theta^{(d)} $$

One thing we can do is to examine the test set error for the different sets of parameters, as such

$$ J_\text{test} (\theta^{(d)}) $$

Let's say we did this and chose the fifth-degree polynomial. Now let's see how well the model generalizes. Simply looking at the test error for this particular model will likely give an *optimistic* estimate of the generalization error. We've essentially fit our *new* parameter, the degree, to the test set, which is just another layer of overfitting.

Instead of splitting the dataset into a training set and a test set, it's better to add yet another layer:
* training set (about 60%)
* cross-validation set (about 20%)
* test set (about 20%)

For a linear regression model, the cross-validation cost is

$$ J_{\text{cv}} = \frac{1}{2 m_\text{cv}} \sum_{i=1}^{m_\text{cv}} \Big( h_\theta (x_\text{cv}^{(i)}) - y_\text{cv}^{(i)} \Big)^2 .$$
   
Back to our polynomial example, what we would do is minimize the cost function on our training set to get our parameters

$$ \theta^{(d)} $$

and then evaluate the cost function using those parameters on the cross-validation data set,

$$ J_{\text{cv}} (\theta^{(d)}) .$$

So now we'd pick the model with the lowest cross-validation cost and then report its error on the test set. For example, if the fourth-degree polynomial had the lowest cross-validation cost, we'd report

$$ J_{\text{test}} (\theta^{(4)}) $$

as the generalization error.
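
The selection procedure can be sketched end to end in plain Python. Everything here is an illustrative assumption: the synthetic quadratic data, the 60/20/20 split, the small normal-equation solver `fit_polynomial` (adequate for toy problems, not a substitute for a numerical library), and the restriction to degrees 1 through 6.

```python
import random

def fit_polynomial(points, degree):
    """Least-squares fit of theta_0 + theta_1 x + ... via the normal equations."""
    n = degree + 1
    # A = X^T X and b = X^T y for the design matrix X[i][j] = x_i ** j.
    A = [[sum(x ** (r + c) for x, _ in points) for c in range(n)] for r in range(n)]
    b = [sum(y * x ** r for x, y in points) for r in range(n)]
    # Gaussian elimination with partial pivoting, then back substitution.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    theta = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(A[r][c] * theta[c] for c in range(r + 1, n))
        theta[r] = (b[r] - tail) / A[r][r]
    return theta

def cost(theta, points):
    """Squared-error cost 1/(2m) * sum((h(x) - y)^2), with no regularization."""
    def h(x):
        return sum(t * x ** j for j, t in enumerate(theta))
    return sum((h(x) - y) ** 2 for x, y in points) / (2 * len(points))

# Synthetic data: a noisy quadratic (an assumption for illustration).
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(100)]
data = [(x, 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.gauss(0.0, 0.1)) for x in xs]
train, cv, test = data[:60], data[60:80], data[80:]

# Fit every candidate degree on the training set, score it on the CV set.
thetas = {d: fit_polynomial(train, d) for d in range(1, 7)}
cv_costs = {d: cost(thetas[d], cv) for d in thetas}
best_degree = min(cv_costs, key=cv_costs.get)

# Only the chosen model touches the test set.
print("chosen degree:", best_degree)
print("estimated generalization error:", cost(thetas[best_degree], test))
```

The key design point is that the test set is used exactly once, after the degree has been fixed, so it has not influenced any fitted quantity.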

### Bias vs. Variance

#### Diagnosing Bias vs. Variance

If you've run the learning algorithm and it doesn't work as well as you'd hoped, it's almost always a result of high bias or high variance, and it's important to figure out which one it is. High bias is "underfitting" and high variance is "overfitting".

As we increase the degree of the polynomial, the training error decreases, since at a high enough degree the hypothesis can match the available training samples exactly. The cross-validation error behaves differently: it initially drops until the model reaches a minimum error at some intermediate degree, and then rises again as higher degrees overfit the training set. This is shown in the figure below:

![Error vs. Degree](./images/error-vs-deg.png "Error vs. Degree")

So if the algorithm has a high bias problem,
* $J_\text{train}$ will be high
* $J_\text{cv} (\approx J_\text{train})$ will be high

If there is a high variance problem,
* $J_\text{train}$ will be low
* $J_\text{cv} (\gg J_\text{train})$ will be high
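
These two signatures can be turned into a rough rule of thumb. The sketch below is only a heuristic: `acceptable_error` and `gap_ratio` are hypothetical thresholds that you would have to choose for your own problem, and `j_cv` is the error on held-out data.

```python
def diagnose(j_train, j_cv, acceptable_error, gap_ratio=2.0):
    """Classify a (J_train, J_cv) pair using two hypothetical thresholds.

    acceptable_error: the largest cost we would be happy with.
    gap_ratio: how many times larger than J_train the held-out cost must be
               before we call the gap "large".
    """
    high_train = j_train > acceptable_error
    large_gap = j_cv > gap_ratio * j_train
    if high_train and large_gap:
        return "high bias and high variance"
    if high_train:
        return "high bias"      # both errors high and close together
    if large_gap and j_cv > acceptable_error:
        return "high variance"  # low training error, much higher held-out error
    return "no obvious problem"

print(diagnose(0.90, 1.00, acceptable_error=0.2))
print(diagnose(0.01, 0.80, acceptable_error=0.2))
```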

#### Regularization and Bias/Variance

Regularization can prevent overfitting, but how does it come into play in bias vs. variance? Let's look at an example of linear regression with regularization. Suppose we have a fourth-degree polynomial:

$$ h_\theta (x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 \\
J (\theta) = \frac{1}{2 m} \sum_{i=1}^{m} \Big( h_\theta(x^{(i)}) - y^{(i)} \Big)^2  + 
\frac{\lambda}{2 m} \sum_{j=1}^{4} \theta_j^2 $$

Let's assume that the *best* fit for the training set is a quadratic model. A large regularization coefficient greatly simplifies the model, leading to high bias. Making it too small can leave the model with high variance. So the optimal choice for the regularization coefficient lies somewhere in between. A good way to find it is to step through candidate values, doubling at each step:

$$ \lambda \in \{0, 0.01, 0.02, 0.04, 0.08, \cdots, 10.24\} $$

Each of these regularization coefficients will result in its own set of parameters. We then use the cross-validation cost to select the optimal coefficient. Finally, we compute the test set error with the selected coefficient to check how well the model generalizes.

If we compare the training and cross-validation costs for increasing lambda, we should find that the training error starts small for small coefficients and increases, since we're effectively ranging from high variance to high bias. Meanwhile, the cross-validation error starts off large in the high-variance regime at small lambda, reaches a minimum at an optimal coefficient, and then increases again as the large coefficient leads to high bias.
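
The sweep over lambda can be sketched with a model small enough to solve by hand. The code below is an illustration under stated assumptions: it uses a single-feature line rather than the polynomial above, a closed-form solution derived for that special case, synthetic data, and a candidate list that starts at 0 and then doubles from 0.01 up to 10.24. Because a single-feature line cannot really overfit, the lowest $J_\text{cv}$ may well occur at a small lambda here; the point is the mechanics of the sweep.

```python
import random

def fit_regularized_line(points, lam):
    """Minimize 1/(2m) * sum((t0 + t1*x - y)^2) + lam/(2m) * t1^2 in closed form."""
    m = len(points)
    x_mean = sum(x for x, _ in points) / m
    y_mean = sum(y for _, y in points) / m
    sxx = sum((x - x_mean) ** 2 for x, _ in points)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    theta1 = sxy / (sxx + lam)          # the penalty shrinks the slope toward 0
    theta0 = y_mean - theta1 * x_mean   # the intercept is not penalized
    return theta0, theta1

def cost(theta, points):
    """Unregularized squared-error cost, used for both J_train and J_cv."""
    theta0, theta1 = theta
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in points) / (2 * len(points))

# Candidate coefficients: 0, then 0.01 doubled ten times (ends at 10.24).
lambdas = [0.0] + [0.01 * 2 ** k for k in range(11)]

# Synthetic noisy line, split into training and cross-validation sets.
rng = random.Random(1)
data = [(x, 0.5 + 1.5 * x + rng.gauss(0.0, 0.5))
        for x in (rng.uniform(0.0, 2.0) for _ in range(40))]
train, cv = data[:25], data[25:]

results = []
for lam in lambdas:
    theta = fit_regularized_line(train, lam)
    results.append((lam, cost(theta, train), cost(theta, cv)))

best_lambda = min(results, key=lambda row: row[2])[0]
for lam, j_train, j_cv in results:
    print(f"lambda={lam:6.2f}  J_train={j_train:.4f}  J_cv={j_cv:.4f}")
print("lambda with lowest J_cv:", best_lambda)
```

Note that the regularization term is used only when fitting; the costs that are compared across lambda values are the plain squared-error costs.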

#### Learning Curves

Learning curves are a sanity check that the algorithm is working correctly. We plot our training and cross-validation costs as a function of the training set size, which we do by artificially restricting the size of the training set. At the extreme, with a training set of 1, essentially any model can fit that one point exactly, so the training cost is zero. As the training set grows larger, a given model will struggle harder and harder to fit all of the training samples, and the training cost will gradually increase. On the other hand, a model trained on a single sample generalizes poorly, so with many points in our cross-validation set the cross-validation cost will be large. As we increase the number of samples in the training set, the model generalizes better and the cross-validation cost diminishes.

If we have high bias, increasing the training set size will increase the training error, while the cross-validation error will drop slightly, just as we'd expect. What stands out is that, for large training set sizes, the training and cross-validation errors are both large and quite close together. This emphasizes the point that if we have high bias, getting more training data alone *will not* help much.

![High Bias](./images/learning-curve_bias.png "High Bias")

Meanwhile, if we have high variance, the training error will be lower and the cross-validation error will be larger. The training error will still grow with increasing training set size, while the cross-validation error will decrease, but this time the gap between the two is much larger. Since the curves are still converging, getting more training data will likely help a model suffering from high variance.

![High Variance](./images/learning-curve_variance.png "High Variance")

Of course, these descriptions are idealized; real learning curves will probably be somewhat noisier.
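
A learning curve can be computed with a short loop. The following is a minimal sketch, assuming a straight-line model fitted to synthetic quadratic data (so it illustrates the high-bias case); the helper names and the data are made up for illustration.

```python
import random

def fit_line(points):
    """Ordinary least-squares line; falls back to a constant for degenerate input."""
    m = len(points)
    x_mean = sum(x for x, _ in points) / m
    y_mean = sum(y for _, y in points) / m
    sxx = sum((x - x_mean) ** 2 for x, _ in points)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    theta1 = sxy / sxx if sxx > 0 else 0.0
    return y_mean - theta1 * x_mean, theta1

def cost(theta, points):
    """Squared-error cost 1/(2m) * sum((h(x) - y)^2)."""
    theta0, theta1 = theta
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in points) / (2 * len(points))

def learning_curve(train, cv):
    """Return [(m, J_train on the first m examples, J_cv on the full CV set), ...]."""
    curve = []
    for m in range(1, len(train) + 1):
        subset = train[:m]
        theta = fit_line(subset)
        curve.append((m, cost(theta, subset), cost(theta, cv)))
    return curve

# Synthetic data: a noisy quadratic fitted with a straight line (high bias).
rng = random.Random(2)
data = [(x, x ** 2 + rng.gauss(0.0, 0.1))
        for x in (rng.uniform(-1.0, 1.0) for _ in range(60))]
train, cv = data[:40], data[40:]

for m, j_train, j_cv in learning_curve(train, cv)[::5]:
    print(f"m={m:2d}  J_train={j_train:.4f}  J_cv={j_cv:.4f}")
```

The training cost is always measured on the subset that was actually used for fitting, while the cross-validation cost is always measured on the full cross-validation set.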

#### Deciding What to Try Next, Revisited

Earlier, we said that after applying our model to new data and finding unacceptably large errors, we have the following options:
1. You can try adding more training samples.
   * We saw that this can fix high variance
2. You could also try using a smaller set of features to avoid overfitting
   * This also fixes high variance
3. Maybe you need to add features if your features don't cover the important bases.
   * This usually fixes high bias
4. We could also add polynomial features
   * This also usually fixes high bias
5. We could decrease the regularization coefficient
   * This also usually fixes high bias
6. We could increase the regularization coefficient
   * This usually fixes high variance
   
So it's a good idea to plot these curves to figure out what problem we're having and which solution would help.

We can apply this to neural networks (NNs) as well. A "small" NN (fewer parameters, one hidden layer) is more prone to underfitting but computationally cheaper. A "large" NN (more parameters, possibly multiple hidden layers) is more prone to overfitting while being computationally more expensive. To address overfitting in a large network, we use regularization.

## Machine Learning System Design

In this section, we'll talk about some best practices that can potentially save a lot of time.

### Building a Spam Classifier

We'll work first with a motivating example of a spam classifier.

#### Prioritizing What to Work On

Let's say we are using supervised learning where
* $x$ is the features of the email
* $y$ = spam (1) or not spam (0).

For our features, let's start by choosing 100 words indicative of spam or not-spam emails. Some examples are "deal", "buy", "discount", "Dylan", "now", .... We can then look at an email and identify which of those keywords are present. The features for each email will have the values

$$x_j = \begin{cases} 1 & \text{if word } j \text{ appears  in email} \\ 0 & \text{otherwise} \end{cases}$$

In practice, rather than manually selecting a list of words like this, we take the most frequently occurring words (10,000 to 50,000 words) in the training set.
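
A minimal sketch of this feature construction is shown below. The tokenizer (a simple regular expression over lowercased text) and the function names are assumptions for illustration; a real spam filter would need more careful text handling.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocabulary(emails, size):
    """Pick the `size` most frequent words across the training emails."""
    counts = Counter(word for email in emails for word in tokenize(email))
    return [word for word, _ in counts.most_common(size)]

def email_to_features(email, vocabulary):
    """x_j = 1 if vocabulary word j appears in the email, else 0."""
    present = set(tokenize(email))
    return [1 if word in present else 0 for word in vocabulary]

vocabulary = ["deal", "buy", "discount", "now"]
print(email_to_features("Buy now! Best DEAL of the year.", vocabulary))  # [1, 1, 0, 1]
```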

So what's the best use of your time to make sure the filter has low error?
* One idea is to collect lots of data
  * This will often help, but not always
* Develop more sophisticated features from email routing information
  * Can look at things like the routing information in the email header
* Develop more sophisticated features for the message body
  * Should "discount" and "discounts" be treated as the same word? "Deal" vs. "dealer"? Punctuation?
* Develop methods for detecting deliberate misspellings
  * m0rtgage, w4tches, med1cine, etc.
  
It's hard to know which option is best to work on. We'll talk next about how to identify what may be a good use of time.

#### Error Analysis

Recommended approach:
* Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data.
* Plot learning curves to decide if more data, more features, etc. are likely to help
* Error analysis: manually examine the examples (in the cross-validation set) that your algorithm made errors on. See if you can spot any systematic trend in the types of examples the model struggles to predict correctly.

Let's look at a specific example, using the spam filter again. Let's say our cross-validation set has 500 samples, and the algorithm misclassifies 100 emails. We can manually examine the errors and categorize them based on
1. What type of email it is
   * e.g., 12 pharma, 4 replica/fakes, 53 password theft, 31 other emails
2. What cues (features) you think would have helped the algorithm classify them correctly.
   * e.g., 5 deliberate misspellings, 16 unusual email routing, 32 unusual punctuation
   
By manually examining the errors and categorizing like this, we can identify what to best change about the model to improve.
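
Once the misclassified emails have been labelled by hand, tallying them is straightforward. The records below are hypothetical notes for a handful of emails, not real data, and the dictionary layout is just one possible way to store them.

```python
from collections import Counter

# Hypothetical hand-labelled notes for a few misclassified CV emails.
misclassified = [
    {"kind": "password theft", "cues": ["unusual punctuation"]},
    {"kind": "password theft", "cues": ["unusual email routing"]},
    {"kind": "pharma", "cues": ["deliberate misspelling", "unusual punctuation"]},
    {"kind": "other", "cues": []},
]

kind_counts = Counter(note["kind"] for note in misclassified)
cue_counts = Counter(cue for note in misclassified for cue in note["cues"])

# The largest buckets suggest where new features are most likely to pay off.
for kind, count in kind_counts.most_common():
    print(f"{kind}: {count}")
for cue, count in cue_counts.most_common():
    print(f"{cue}: {count}")
```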

Another useful tip is to have a numerical evaluation of the learning algorithm. Maybe that's accuracy or error, but it should be a single numerical performance indicator. As an example, let's consider whether discount/discounts/discounted/discounting should be treated as the same word (word stemming). Stemming can unfortunately also merge unrelated words like universe/university, so it isn't obvious that it will help. The only way to find out is to try it and see if it works. We may implement stemming and see that without stemming there's 5% error, but with stemming the error drops to 3%. In that case it's probably worthwhile to keep.

The point is that you'll be trying out a lot of different ideas, and manually examining the errors for each new idea can be cumbersome. So having a single numerical performance indicator will be valuable.

### Handling Skewed Data

#### Error Metrics for Skewed Classes

Sometimes, our data can be skewed. Let's consider the example of classifying cancer in patients. Say we train a logistic regression model and find that it has 1% error on the test set, which sounds pretty good. However, only 0.5% of patients have cancer. So a trivially simple model that predicts that *no one* has cancer will have, on a representative sample, only 0.5% error. This is clearly not a useful model, yet it scores better than our logistic regression.

Because so few cases are positive, this is an example of *skewed classes*. Skewed classes are a problem because classification accuracy alone makes it hard to tell whether a change actually improves the model. Instead, we can use precision and recall, which break the predictions down into the four outcomes below, where the positive class is the rare one we want to detect.

| &nbsp; | **Actual Positive (1)** | **Actual Negative (0)** |
| --- | --- | --- |
| **Predicted Positive (1)** | True Positive | False Positive |
| **Predicted Negative (0)** | False Negative | True Negative |

These two metrics give more insight than classification accuracy when the classes are skewed.
* Precision - of all of the cases predicted positive, what fraction was *actually* positive?

  $$\frac{\text{number of true positives}}{\text{number of predicted positives}} = 
    \frac{\text{number of true positives}}{\text{number of true positives + number of false positives}}$$

* Recall - of all the cases that are *actually* positive, what fraction did we correctly predict?

  $$\frac{\text{number of true positives}}{\text{number of actual positives}} = 
    \frac{\text{number of true positives}}{\text{number of true positives + number of false negatives}}$$
    
So our bad example of predicting that no one ever has cancer would have a recall of 0%, since none of the cases could be true positives. In this precision/recall usage, we set the positive case to be the rare case by convention.
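
The cancer example can be checked numerically. This sketch builds a made-up population of 1000 patients with 5 positive cases and scores the "always predict negative" model; the function names are assumptions, and returning 0.0 for an undefined precision is a convention chosen here.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the rare class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision(tp, fp):
    # Undefined when nothing is predicted positive; 0.0 is a convention chosen here.
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

# 1000 patients, 5 of whom (0.5%) actually have cancer.
y_true = [1] * 5 + [0] * 995
always_negative = [0] * 1000

tp, fp, fn, tn = confusion_counts(y_true, always_negative)
accuracy = (tp + tn) / len(y_true)
print(accuracy, precision(tp, fp), recall(tp, fn))
```

The accuracy comes out at 99.5% even though the model never detects a single positive case, which is exactly what recall exposes.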

#### Trading Off Precision and Recall

Recall that, for our logistic regression, 

$$ 0 \leq h_\theta (x) \leq 1 \\
   p = \begin{cases} 1 & \text{ if } h_\theta (x) \geq 0.5 \\ 
                     0 & \text{ if } h_\theta (x) < 0.5  \end{cases} $$
                     
Suppose we want to predict positive cases only if we're *very* confident. So we might change the above as

$$ p = \begin{cases} 1 & \text{ if } h_\theta (x) \geq 0.7 \\ 
                     0 & \text{ if } h_\theta (x) < 0.7  \end{cases} $$
                     
In this case, we'll have higher precision, because we're more sure that our predicted positives are true positives. However, our recall is reduced because we're making predictions of positive cases for a smaller set of patients.

But suppose we want to avoid missing too many cases of cancer (avoid false negatives). So when in doubt, we want to assume the cases are positive. We might instead change our prediction threshold as such:

$$ p = \begin{cases} 1 & \text{ if } h_\theta (x) \geq 0.3 \\ 
                     0 & \text{ if } h_\theta (x) < 0.3  \end{cases} $$
                     
This time, we'll have higher recall, but lower precision. Recall is higher because we're correctly flagging a larger fraction of the actually positive cases, but precision is lower because a smaller fraction of the predicted positives are actually positive.
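
The effect of moving the threshold can be seen on a small, made-up set of classifier outputs. The probabilities and labels below are invented for illustration; only the thresholds 0.7, 0.5, and 0.3 come from the discussion above.

```python
def precision_recall_at(threshold, probabilities, labels):
    """Predict 1 when h(x) >= threshold, then return (precision, recall)."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up classifier outputs h(x) and true labels, for illustration only.
probabilities = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
labels        = [1,    1,    1,    0,    1,    0,    1,    0,    0,    0]

for threshold in (0.7, 0.5, 0.3):
    p, r = precision_recall_at(threshold, probabilities, labels)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.7 moves precision from about 0.71 to 1.0 while recall falls from 1.0 to 0.6.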

So there is a tradeoff with precision and recall. The larger the precision, the lower the recall, and the larger the recall, the lower the precision. So is there a way to best choose our prediction threshold?

| &nbsp; | Precision | Recall |
| --- | --- | --- |
| **Algorithm 1** | 0.5 | 0.4 |
| **Algorithm 2** | 0.7 | 0.1 |
| **Algorithm 3** | 0.02 | 1.0 |

One thing we can do is look at the average of precision and recall, but this isn't always a great idea. Algorithm 3 is an example of an algorithm that *always* predicts the positive case; it has the best average while suffering greatly in precision.

| &nbsp; | Precision | Recall | Average |
| --- | --- | --- | --- |
| **Algorithm 1** | 0.5 | 0.4 | 0.45 |
| **Algorithm 2** | 0.7 | 0.1 | 0.4 |
| **Algorithm 3** | 0.02 | 1.0 | 0.51 |

In contrast, we can look at the "F score":

$$ F_1 = 2 \frac{ P R } {P + R} $$

| &nbsp; | Precision | Recall | F Score |
| --- | --- | --- | --- |
| **Algorithm 1** | 0.5 | 0.4 | 0.444 |
| **Algorithm 2** | 0.7 | 0.1 | 0.175 |
| **Algorithm 3** | 0.02 | 1.0 | 0.0392 |

And this is what people frequently use in ML. Note that if either P or R is 0, then F = 0, and in the perfect case where P = 1 and R = 1, F = 1.
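
The table values can be reproduced with a few lines of Python. The function name `f1_score` is an assumption for this sketch, and returning 0 when both precision and recall are 0 is a convention that avoids dividing by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

algorithms = {
    "Algorithm 1": (0.5, 0.4),
    "Algorithm 2": (0.7, 0.1),
    "Algorithm 3": (0.02, 1.0),
}

for name, (p, r) in algorithms.items():
    print(f"{name}: average={(p + r) / 2:.3f}, F1={f1_score(p, r):.4f}")

# The plain average and the F score disagree about which algorithm is best.
best_by_average = max(algorithms, key=lambda n: sum(algorithms[n]) / 2)
best_by_f1 = max(algorithms, key=lambda n: f1_score(*algorithms[n]))
print(best_by_average, best_by_f1)
```

The plain average favors Algorithm 3, while the F score favors Algorithm 1, because the harmonic mean is dragged down by whichever of the two values is small.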

### Using Large Datasets

Under certain conditions, obtaining a lot of data can be useful for training your learning algorithm. Let's lay out the assumptions behind the large-data rationale.
* Assume the features have sufficient information to predict the output accurately
  * An example where this is not true is predicting housing prices from only the size of the house and no other features.
  * A useful test is: given the input features, can a human expert confidently predict the output?
* Use a learning algorithm with many parameters
  * These are therefore low-bias algorithms, and hopefully the training error will be small
* Use a very large training set (unlikely to overfit)
  * The result is therefore low variance, and the training error and the test error will be similar
  * Given the previous assumption, the test error will also be small