## ML system design

### System Design Example: Supervised Learning for Spam Classification

Given a data set of emails, we can construct a feature vector for each email. Each entry in this vector represents a word. The vector normally contains 10,000 to 50,000 entries, chosen by finding the most frequently used words in the data set. If a word appears in the email, we set its entry to 1; otherwise, we set it to 0. Once all the x vectors are ready, we train our algorithm on them and then use it to classify whether an email is spam or not.
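The feature construction described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names `build_vocabulary` and `email_to_vector`, the simple regex tokenizer, and the three toy emails are all made up for this example, and the vocabulary is cut to 5 words instead of 10,000 to 50,000.

```python
import re
from collections import Counter


def build_vocabulary(emails, vocab_size):
    """Return the vocab_size most frequent words across all emails."""
    counts = Counter()
    for text in emails:
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:vocab_size]]


def email_to_vector(text, vocabulary):
    """Binary feature vector: 1 if the vocabulary word occurs in the email, else 0."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return [1 if word in words else 0 for word in vocabulary]


# Toy data set (hypothetical emails, for illustration only).
emails = [
    "Buy cheap watches now",
    "Meeting moved to Monday",
    "Cheap deals, buy now",
]
vocabulary = build_vocabulary(emails, vocab_size=5)
x = email_to_vector("Buy now and save", vocabulary)
print(vocabulary)  # ['buy', 'cheap', 'now', 'deals', 'meeting']
print(x)           # [1, 0, 1, 0, 0]
```

Words outside the vocabulary ("and", "save") are simply ignored, which is why the vector length stays fixed no matter how long the email is.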

<br>
<img src="../img/system_design/spam_ham_example.png" width="600"/>

So how could you spend your time to improve the accuracy of this classifier?
1. Collect lots of data (for example, through a "honeypot" project, although more data doesn't always help)
2. Develop sophisticated features (for example, using email header data)
3. Develop algorithms to process your input in different ways (for example, recognizing misspellings in spam)

### Error Analysis

The recommended approach to solving machine learning problems is to:
- Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
- Plot learning curves to decide if more data, more features, etc. are likely to help.
- Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.
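The manual error review in the last step can be supported by a simple tally. The sketch below assumes you have already looked at each misclassified cross validation email and assigned it a category by hand; the category names, the email ids, and the helper `error_breakdown` are hypothetical.

```python
from collections import Counter

# Hypothetical hand-assigned categories for misclassified cross validation emails.
misclassified = [
    {"id": 17, "category": "phishing"},
    {"id": 42, "category": "pharma"},
    {"id": 58, "category": "phishing"},
    {"id": 73, "category": "replica"},
    {"id": 91, "category": "phishing"},
]


def error_breakdown(errors):
    """Return (category, count, share of all errors), most common category first."""
    counts = Counter(error["category"] for error in errors)
    total = len(errors)
    return [(category, n, n / total) for category, n in counts.most_common()]


for category, n, share in error_breakdown(misclassified):
    print(f"{category}: {n} errors ({share:.0%})")
```

The category with the largest share of errors is the most promising place to invest in new features.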

<br>
<img src="../img/system_design/nlp_stemming.png" width="600"/>

### Error metrics for Skewed Classes

<img src="../img/error_metrics/precision_recall.png" width="600"/>
<br>
<img src="../img/error_metrics/trading_off_precision_recall.png" width="600"/>
<br>
<img src="../img/error_metrics/f1_score.png" width="600"/>
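As a rough companion to the figures above, precision, recall, and the F1 score can be computed directly from predicted and true labels. This is a minimal sketch: the function name `precision_recall_f1` and the toy labels are assumptions, and returning 0 when a denominator is zero is a convention chosen here for the example.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1, treating y = 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Precision: of everything predicted positive, how much really is positive?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything that really is positive, how much did we catch?
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels (made up): 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, it stays low unless precision and recall are both high.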

### Data For Machine Learning

#### It's not who has the best algorithm that wins. It's who has the most data.

<img src="../img/system_design/the_most_data.png" width="600"/>

### QUESTIONS:

Having a large training set can help significantly improve a learning algorithm’s performance. However, a large training set is unlikely to help when:

- The features x do not contain enough information to predict y accurately (such as predicting a house’s price from only its size), and we are using a simple learning algorithm such as logistic regression.
- The features x do not contain enough information to predict y accurately (such as predicting a house’s price from only its size), even if we are using a neural network with a large number of hidden units.


1.
<img src="../img/system_design/recall_calculation.png" width="900"/>

2. Suppose a massive dataset is available for training a learning algorithm. Training on a lot of data is likely to give good performance when two of the following conditions hold true.

- We train a learning algorithm with a large number of parameters (that is able to learn/represent fairly complex functions).
    (You should use a "low bias" algorithm with many parameters, as it will be able to make use of the large dataset provided. If the model has too few parameters, it will underfit the large training set.)

- The features 'x' contain sufficient information to predict 'y' accurately. (For example, one way to verify this is to check whether a human expert in the domain can confidently predict y when given only x.)
    (It is important that the features contain sufficient information, as otherwise no amount of data can solve a learning problem in which the features do not contain enough information to make an accurate prediction.)

3. Suppose you have trained a logistic regression classifier that outputs hθ(x). Currently, you predict 1 if hθ(x) ≥ threshold and predict 0 if hθ(x) < threshold, where the threshold is set to 0.5. Suppose you lower the threshold. Which of the following is true?

- The classifier is likely to now have higher recall.
    (Lowering the threshold means more y = 1 predictions. This tends to increase the number of true positives and decrease the number of false negatives, so recall will increase.)
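The effect of the threshold on recall can be checked on a small made-up example. In the sketch below, the scores stand in for hθ(x), and the lower threshold of 0.3, the labels, and the helper names are all assumptions chosen for illustration.

```python
def predict(scores, threshold):
    """Predict 1 when the score is at or above the threshold, else 0."""
    return [1 if score >= threshold else 0 for score in scores]


def recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0


def precision(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if tp + fp else 0.0


# Hypothetical classifier outputs and true labels.
scores = [0.9, 0.6, 0.4, 0.35, 0.2, 0.1]
y_true = [1, 1, 1, 0, 0, 0]

for threshold in (0.5, 0.3):
    y_pred = predict(scores, threshold)
    print(threshold, recall(y_true, y_pred), precision(y_true, y_pred))
```

On this toy data, lowering the threshold from 0.5 to 0.3 raises recall from 2/3 to 1 while precision drops from 1 to 3/4, which is the usual trade-off between the two.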

4. Suppose you are working on a spam classifier, where spam emails are positive examples (y=1) and non-spam emails are negative examples (y=0). You have a training set of emails in which 99% of the emails are non-spam and the other 1% is spam. Which of the following statements are true?

- If you always predict non-spam (output y=0), your classifier will have an accuracy of 99%.
    (Since 99% of the examples are y = 0, always predicting 0 gives an accuracy of 99%. Note, however, that this is not a good spam system, as you will never catch any spam.)

- A good classifier should have both a high precision and high recall on the cross validation set.
    (For data with skewed classes like this spam data, we want to achieve a high F1 score, which requires both high precision and high recall.)

- If you always predict non-spam (output y=0), your classifier will have 99% accuracy on the training set, and it will likely perform similarly on the cross validation set.
    (The classifier achieves 99% accuracy on the training set because of how skewed the classes are. We can expect that the cross-validation set will be skewed in the same fashion, so the classifier will have approximately the same accuracy.)
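The 99% figure is easy to reproduce. The sketch below assumes a made-up set of 1,000 emails with the same 99% / 1% split and a "classifier" that always outputs y = 0; the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0


# 1% spam (y = 1) and 99% non-spam (y = 0), as in the question.
y_true = [1] * 10 + [0] * 990
# A trivial classifier that always predicts non-spam.
y_pred = [0] * len(y_true)

print(accuracy(y_true, y_pred))  # 0.99
print(recall(y_true, y_pred))    # 0.0
```

Accuracy looks excellent, yet recall is 0 because not a single spam email is caught, which is why accuracy alone is misleading on skewed classes.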


5.
- On skewed datasets (e.g., when there are many more positive examples than negative examples), accuracy is not a good measure of performance and you should instead use the F1 score based on precision and recall.
    (You can always achieve high accuracy on skewed datasets by predicting the same output (the most common one) for every input. Thus the F1 score is a better way to measure performance.)

- Using a very large training set makes it unlikely for the model to overfit the training data.
    (A sufficiently large training set is hard to overfit, as the model cannot overfit some of the examples without doing poorly on the others.)