# Week 1: ML Strategy

Streamline and optimize your ML production workflow by implementing strategic guidelines for goal-setting and applying human-level performance to help define key priorities.

## Learning Objectives

* Explain why Machine Learning strategy is important
* Apply satisficing and optimizing metrics to set up your goal for ML projects
* Choose a correct train/dev/test split of your dataset
* Define human-level performance
* Use human-level performance to define key priorities in ML projects
* Make the correct ML strategy decision based on observed performance and the dataset

---

## Orthogonalization in ML

This section introduces the concept of **orthogonalization** as a strategy for developing effective machine learning systems, advocating for clarity on *what* to tune to achieve a single, specific effect.

### Concept of Orthogonalization

Orthogonalization means designing controls (or hyperparameters) so that each control ideally affects only one specific dimension of the system's performance.

For example, a well-designed TV has separate knobs for height, width, rotation, etc. A non-orthogonal knob would change all these aspects simultaneously, making precise tuning nearly impossible.

Having orthogonal controls makes the process of tuning and debugging much easier because you know exactly which knob to turn when a specific problem is identified.

### The Four Performance Criteria

For a supervised learning system to perform well, it generally needs to achieve four goals sequentially. Orthogonalization helps by providing a distinct "knob" for addressing each potential failure point:

#### 1. Fit the Training Set Well (Low Bias)

* **Problem:** The algorithm is not performing well on the training data.
* **Orthogonal Knobs:** Use a bigger network (more capacity) or switch to a better optimization algorithm (e.g., Adam).

#### 2. Fit the Development Set Well (Low Variance)

* **Problem:** The algorithm fits the training set well but performs poorly on the dev set (overfitting).
* **Orthogonal Knobs:** Apply regularization (to reduce variance) or get a bigger training set (to improve generalization).

#### 3. Fit the Test Set Well

* **Problem:** Performance is good on the dev set but poor on the test set.
* **Orthogonal Knob:** Get a bigger dev set. Good dev performance that does not carry over to the test set suggests the model choices were overfit to the dev set, which is more likely when the dev set is too small.

#### 4. Perform Well in the Real World

* **Problem:** Performance is good on the test set metric, but the system doesn't deliver the desired real-world value.
* **Orthogonal Knob:** Change the Dev/Test set distribution or change the cost function because the current metric is not accurately measuring the real-world goal.

### Example of Non-Orthogonal Control

**Early Stopping:** This technique is considered less orthogonal because it simultaneously affects two criteria:
1.  It improves dev set performance (reducing variance).
2.  It reduces training set performance (increasing bias/not fitting training data as well).

While it is not a bad technique, using more orthogonal controls makes the system easier to reason about and tune.

### Summary

The goal is to diagnose the exact bottleneck in performance (which of the four criteria is failing) and then use a specific, orthogonal set of controls to fix only that problem.
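
The diagnose-then-fix loop above can be sketched as a small lookup. This is a minimal illustration, not a prescribed procedure: the function name `first_failing_criterion` and the boolean inputs are hypothetical, and deciding whether a criterion is "ok" is left to the reader's own thresholds.

```python
# Hypothetical helper: map the first failing criterion to the knobs
# listed for it in this section. Criteria are checked in order.
CRITERIA = [
    ("fit training set", ["bigger network", "better optimization algorithm (e.g., Adam)"]),
    ("fit dev set", ["regularization", "bigger training set"]),
    ("fit test set", ["bigger dev set"]),
    ("perform well in real world", ["change dev/test set distribution", "change cost function"]),
]


def first_failing_criterion(train_ok, dev_ok, test_ok, real_world_ok):
    """Return (criterion, knobs) for the first failing check, or (None, [])."""
    flags = [train_ok, dev_ok, test_ok, real_world_ok]
    for ok, (name, knobs) in zip(flags, CRITERIA):
        if not ok:
            return name, knobs
    return None, []


criterion, knobs = first_failing_criterion(True, False, True, True)
print(criterion, knobs)
```

Checking the criteria in order reflects the idea that the four goals are achieved sequentially: there is little point tuning dev-set knobs while the training set is still fit poorly.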

---

## Single Number Evaluation Metric

This section emphasizes the critical role of a **single, real-number evaluation metric** in accelerating the iterative process of developing and improving machine learning algorithms.

### The Need for a Single Metric

A single real-number evaluation metric allows a team to quickly and definitively determine whether a new idea, hyperparameter change, or algorithm modification is better or worse than the previous one.
* **Empirical Process:** Machine learning development is highly empirical (Idea $\rightarrow$ Code $\rightarrow$ Experiment $\rightarrow$ Refine). A single metric speeds up this iterative loop.

### Dealing with Multiple Criteria

When using multiple metrics (like Precision and Recall), it becomes difficult to choose between competing classifiers when none of them dominates all metrics.

#### Example 1: Precision and Recall

* **Metrics:**
    * **Precision (P):** Of the examples the classifier labels as positive (e.g., cats), what percentage are actually positive?
    * **Recall (R):** Of all the actual positive examples (real cats), what percentage does the classifier correctly identify?
* **Solution:** Combine multiple metrics into a single number. The standard way to combine Precision and Recall is the **F1 Score**, the harmonic mean of $P$ and $R$. Unlike a simple average, it is high only when both $P$ and $R$ are high:

$$F1 = \frac{2}{\frac{1}{P} + \frac{1}{R}}$$
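
As a quick sketch of how the formula turns two numbers into one, the snippet below computes F1 for two classifiers. The precision and recall values are illustrative assumptions, not results from any real experiment.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in (0, 1])."""
    if precision <= 0 or recall <= 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    return 2 / (1 / precision + 1 / recall)


# Illustrative (precision, recall) pairs; neither classifier wins on both.
classifiers = {"A": (0.95, 0.90), "B": (0.98, 0.85)}

scores = {name: f1_score(p, r) for name, (p, r) in classifiers.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: F1 = {score:.4f}")
print("best:", best)
```

With these numbers, B has higher precision and A has higher recall, so neither dominates; the single F1 number still gives a clear ranking.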

#### Example 2: Error Across Geographies

* **Metrics:** Tracking error rates across multiple markets (e.g., US, China, India, Other).
* **Solution:** Compute a simple average of the error rates across all markets. This provides a single number to compare algorithms quickly, assuming average performance is a reasonable proxy for overall success.
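
A minimal sketch of the averaging step, using made-up per-market error rates for two hypothetical algorithms:

```python
from statistics import mean

# Made-up error rates per market, for illustration only.
errors = {
    "A": {"US": 0.03, "China": 0.07, "India": 0.05, "Other": 0.09},
    "B": {"US": 0.05, "China": 0.06, "India": 0.05, "Other": 0.10},
}

# Collapse each algorithm's per-market errors into one number.
avg_error = {name: mean(list(by_market.values())) for name, by_market in errors.items()}
best = min(avg_error, key=avg_error.get)

for name, err in avg_error.items():
    print(f"{name}: average error = {err:.3f}")
print("best:", best)
```

An unweighted mean treats every market as equally important; if markets differ a lot in traffic or business value, a weighted average may be a better proxy.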

#### Conclusion

A well-defined development (dev) set combined with a single, real-number evaluation metric is essential for efficiently comparing and selecting the best algorithm during the development process.

---

## Satisficing and Optimizing Metric

This section introduces the concepts of **Optimizing** and **Satisficing** metrics as a practical strategy for evaluating and selecting machine learning models when multiple factors are important and difficult to combine into a single formula.

### When to Use

This approach is useful when you care about multiple performance metrics (e.g., accuracy, speed, cost) but find it difficult or arbitrary to combine them mathematically (e.g., using a weighted sum). If you have $N$ metrics you care about, it's often practical to:
* Choose one metric to be optimizing.
* Choose the remaining $N-1$ metrics to be satisficing.

### Optimizing Metric

The single metric you wish to maximize (or minimize, depending on the goal). You aim to achieve the best possible performance on this metric.

**Example (Cat Classifier):** Accuracy (or F1 Score).

### Satisficing Metric

One or more metrics that only need to reach a **"good enough" threshold**. Once the threshold is met, further improvement on this metric is not prioritized.

**Example (Cat Classifier):** Running Time must be $\le 100$ milliseconds.
* In a set of classifiers, you first filter out any that fail the satisficing criteria.
* You then select the classifier from the remaining set that has the highest value for the optimizing metric. (E.g., Classifier B maximizes accuracy while still meeting the running time requirement).
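
The filter-then-maximize rule can be sketched in a few lines. The accuracy and running-time figures below are hypothetical values chosen so that Classifier B comes out as the pick; the function name `select_classifier` is also an assumption made for this example.

```python
# Hypothetical candidates: accuracy is the optimizing metric,
# running time is the satisficing metric.
candidates = [
    {"name": "A", "accuracy": 0.90, "runtime_ms": 80},
    {"name": "B", "accuracy": 0.92, "runtime_ms": 95},
    {"name": "C", "accuracy": 0.95, "runtime_ms": 1500},
]


def select_classifier(candidates, max_runtime_ms=100):
    """Drop candidates that miss the threshold, then maximize accuracy."""
    feasible = [c for c in candidates if c["runtime_ms"] <= max_runtime_ms]
    if not feasible:
        return None  # nothing satisfies the satisficing constraint
    return max(feasible, key=lambda c: c["accuracy"])


chosen = select_classifier(candidates)
print(chosen["name"])
```

Classifier C has the highest accuracy but is excluded by the running-time threshold, which is the point of treating running time as a constraint and not as something to trade off.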

### Summary

By defining one optimizing metric and one or more satisficing metrics, you create a clear, almost automatic decision rule for quickly selecting the "best" algorithm among many choices, thus speeding up the development iteration cycle.

---

## Train/Dev/Test Distributions

This section provides crucial advice on setting up development (dev) and test sets to maximize the efficiency and effectiveness of machine learning development, primarily by ensuring they share the same data distribution.

### Importance of Setup

The way you define your training, development (dev), and test sets significantly impacts how quickly your team can make progress. The *dev set + single real-number evaluation metric* defines the *target* that the team aims for.

### Bad Practice: Different Distributions

If the dev and test sets come from different distributions, the team might spend months optimizing performance for the dev set, only to find the model performs poorly on the test set. This is akin to training for one target and then suddenly being asked to hit a different, unexpected target.

### Best Practice: Same Distribution

The dev set and the test set should come from the same distribution. For example:

* Take all available data (e.g., data from all eight geographic regions, or all income levels) and **randomly shuffle** it before splitting it into dev and test sets.
* Ensure the dev and test sets reflect the same type of data you expect to encounter and want to perform well on in the future.
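
One way to sketch the shuffle-then-split step, assuming the data fits in memory as a list. The region names, the 50/50 dev/test ratio, and the fixed seed are assumptions made for this example.

```python
import random


def split_dev_test(examples, dev_fraction=0.5, seed=0):
    """Shuffle a copy of the pooled data, then cut it into dev and test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]


# Pool data from all eight (hypothetical) regions before splitting.
regions = [f"region_{i}" for i in range(8)]
pool = [(region, j) for region in regions for j in range(100)]

dev, test = split_dev_test(pool)
print(len(dev), len(test))
```

Because the split is made after pooling and shuffling, both sets are drawn from the same mixture of regions, unlike assigning some regions to dev and the others to test.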

### Real-World Pitfall Example

A team spent about three months optimizing a loan approval model using a dev set composed of data from *medium-income postal codes*. They then tested the model on data from *low-income postal codes* (a different distribution), and the model failed completely, wasting those three months of work.

### Summary

* The dev set and the evaluation metric define the target your team works to hit.
* By ensuring the dev and test sets are drawn from the same, representative data distribution, you ensure that optimizing performance on the dev set directly translates to good, expected performance on the test set, leading to much more efficient iteration.

---

## Size of Dev and Test Sets

This section discusses how the guidelines for splitting data into training, development (dev), and test sets have changed in the deep learning (big data) era, shifting away from traditional percentage splits.

---

#### Key Bullet Points: Sizing Dev and Test Sets

##### Shift in Data Splitting Rules
* **Old Rule of Thumb:** In earlier eras with smaller datasets (hundreds to tens of thousands of examples), the common split was **70% Train / 30% Test** or **60% Train / 20% Dev / 20% Test**.
* **Modern Deep Learning Trend:** Due to the large size of modern datasets (e.g., millions of examples) and the high data hunger of deep learning algorithms, the trend is to allocate a much **larger fraction to the training set** and a much **smaller fraction to the dev/test sets**.

##### Guidelines for Modern Splitting
1.  **Training Set (Largest Fraction):** The training set should consume the largest fraction of the data, potentially **98% or more** when dealing with millions of examples.
2.  **Dev Set (Sufficient for Evaluation):** The dev set's purpose is to evaluate different ideas and choose the best algorithm. It needs to be big enough to give confidence in rank-ordering different models (e.g., 1% of 1 million examples, or 10,000 examples, might be sufficient).
3.  **Test Set (Sufficient for Final Confidence):** The test set's purpose is to provide an unbiased evaluation of the final system before deployment.
    * It must be **big enough to give high confidence** in the system's overall performance.
    * Similar to the dev set, this size may be far **less than 30%** of the total data (e.g., 1% or 10,000 examples).
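
To make the arithmetic concrete, here is a small sketch that turns fractions into example counts. The helper name `split_sizes` and its default fractions are assumptions for illustration.

```python
def split_sizes(n_examples, dev_fraction=0.01, test_fraction=0.01):
    """Return (n_train, n_dev, n_test); the training set gets the remainder."""
    n_dev = round(n_examples * dev_fraction)
    n_test = round(n_examples * test_fraction)
    n_train = n_examples - n_dev - n_test
    return n_train, n_dev, n_test


# Modern split on a large dataset: 98% / 1% / 1%.
print(split_sizes(1_000_000))          # (980000, 10000, 10000)

# Older 60% / 20% / 20% rule of thumb on a small dataset.
print(split_sizes(10_000, 0.2, 0.2))   # (6000, 2000, 2000)
```

Note that 1% of a million examples yields a larger dev set than 20% of ten thousand, which is why the fraction can shrink as the dataset grows.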

##### Train/Dev/Test Set vs. Train/Dev Set
* **Ideal Practice:** It is reassuring to maintain a separate test set to get an unbiased estimate of the final system's performance.
* **Train/Dev Only:** In some situations, a team might only use a Train/Dev split and not worry about a separate test set, especially if the dev set is very large. However, this is unusual and not generally recommended, as it means the team is optimizing directly to their final evaluation set.

---

## When to Change Dev/Test Sets and Metrics

This section explains that the chosen evaluation metric and development (dev) set are like a **target** for a machine learning team, and if they stop correctly rank-ordering algorithms based on real-world preferences, they must be changed.

---

#### Key Bullet Points: Moving the Target (Changing Metrics and Data)

##### When to Change the Target
The core guideline is that if doing well on your current metric and dev/test set does not correspond to doing well on the application you actually care about, change the metric and/or the data.

1.  **Metric Misranks Algorithms (Unacceptable Errors):**
    * **Problem:** An algorithm (A) achieves better performance on the simple classification error metric (e.g., 3% error) but is fundamentally worse because it fails a crucial constraint (e.g., letting through unacceptable content like pornography).
    * **Solution:** Change the Evaluation Metric - Introduce **weighted misclassification error** where unacceptable errors (e.g., mislabeling a pornographic image as a cat) are penalized much more heavily (e.g., a weight of 10x or 100x).
      $$\text{Error} = \frac{1}{\sum_i w^{(i)}}\sum_i w^{(i)}\, I\{\hat y^{(i)} \neq y^{(i)}\}$$

      where $I\{\cdot\}$ is 1 when the prediction differs from the label and 0 otherwise, and $w^{(i)}$ is defined as below:
      $$w^{(i)} = \begin{cases} 1 & \text{if } x^{(i)} \text{ is non-porn} \\ 10 & \text{if } x^{(i)} \text{ is porn} \end{cases}$$
      
2.  **Dev/Test Set Does Not Reflect Reality (Data Mismatch):**
    * **Problem:** The dev/test set contains high-quality, well-framed images (e.g., downloaded from the internet), but the deployed application uses low-quality, blurry, or poorly framed user-uploaded images. An algorithm that performs better in the real app might perform worse on the high-quality dev set.
    * **Solution:** Change the Dev/Test Set Distribution - Update your evaluation data to better reflect the true distribution of data the algorithm will encounter in production (e.g., include blurrier, less professional photos from the mobile app).
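
The weighted misclassification error from the first case can be sketched as follows. The labels, predictions, and the weight of 10 for pornographic images are toy values chosen to show the ranking flip; they are not taken from any real dataset.

```python
def weighted_error(y_true, y_pred, weights):
    """Sum of weights on misclassified examples, divided by total weight."""
    wrong = sum(w for yt, yp, w in zip(y_true, y_pred, weights) if yt != yp)
    return wrong / sum(weights)


# Toy dev set: 1 = cat, 0 = not cat. Two of the non-cat images are porn.
y_true = [1, 0, 0, 1, 0]
is_porn = [False, False, True, False, True]
weights = [10 if porn else 1 for porn in is_porn]

pred_a = [1, 0, 1, 1, 0]  # one mistake, but it labels a porn image as a cat
pred_b = [0, 1, 0, 1, 0]  # two mistakes, both on harmless images

uniform = [1] * len(y_true)
print("unweighted:", weighted_error(y_true, pred_a, uniform), weighted_error(y_true, pred_b, uniform))
print("weighted:  ", weighted_error(y_true, pred_a, weights), weighted_error(y_true, pred_b, weights))
```

Under the unweighted metric A looks better (1 mistake versus 2), while the weighted metric prefers B, which matches the preference the metric is supposed to encode.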

##### Orthogonalization Principle
* **Target Setting (Step 1):** Defining the metric is a distinct step from achieving good performance on that metric.
* **Aiming and Shooting (Step 2):** After defining the metric, the separate step is figuring out how to do well on it (e.g., changing the neural network's internal cost function, $J$, to align with the new weighted external metric).

#### Final Guidance
* **Start Quickly:** Don't wait for the "perfect" metric or dev set; quickly set something up to start the iteration process. Running without any metric and dev set slows down team efficiency.
* **Be Flexible:** It is perfectly acceptable to change the metric and/or data later if you discover a better approach that more accurately captures your application's true performance requirements.

---

## Why Human-Level Performance?

This section explains that **Human-Level Performance (HLP)** serves as a crucial estimate for the **Bayes Optimal Error** (the theoretically lowest possible error rate) and is essential for properly diagnosing whether a model's poor performance is due to **high bias** or **high variance**.

---

#### Key Bullet Points: Human-Level Performance and Error Analysis

##### 1. Human-Level Error as a Proxy for Bayes Error
* **Bayes Optimal Error (Bayes Error):** The lowest error rate that any function mapping inputs to labels can achieve on the underlying data distribution. A model can only appear to beat it on a particular dataset by overfitting that dataset.
* **Human-Level Performance (HLP):** For tasks humans are very good at (like computer vision), HLP is a reasonable, practical proxy or estimate for the Bayes Error.
* **Goal:** The ultimate goal is to reach or closely approach the Bayes Error. You should avoid reducing training error below the Bayes Error, as this likely leads to overfitting.

##### 2. Decomposing Error: Avoidable Bias vs. Variance
The comparison between HLP, Training Error, and Dev Error determines the focus of improvement:

| Error Component | Calculation | Problem Type | Tactics to Reduce |
| :--- | :--- | :--- | :--- |
| **Avoidable Bias** | $\text{Training Error} - \text{HLP (Bayes Error estimate)}$ | High Bias (Underfitting) | Train a bigger network, run longer training, try a better optimization algorithm. |
| **Variance** | $\text{Dev Error} - \text{Training Error}$ | High Variance (Overfitting) | Apply regularization, get more training data. |

##### 3. Impact of HLP on Strategy (Example Comparison)

| Scenario | HLP (Bayes Error Est.) | Training Error | Dev Error | Avoidable Bias | Variance | Focus |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **Case 1** | **1.0%** | 8.0% | 10.0% | **7.0%** | 2.0% | **Reduce Bias** (7.0% gap is large) |
| **Case 2** | **7.5%** | 8.0% | 10.0% | **0.5%** | 2.0% | **Reduce Variance** (2.0% gap is larger) |

In Case 2, reducing the training error below 7.5% offers little theoretical benefit and risks overfitting, so efforts shift to reducing the variance gap.
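
The two cases in the table can be reproduced with a short sketch. The function name `diagnose` is hypothetical, and "focus on the larger gap" is a simplified reading of the table, not a complete decision procedure.

```python
def diagnose(hlp_error, train_error, dev_error):
    """Return (avoidable_bias, variance, focus), all errors in percent."""
    avoidable_bias = train_error - hlp_error  # gap to the Bayes error estimate
    variance = dev_error - train_error        # gap between dev and training
    # Simplification: ties are resolved in favor of reducing variance.
    focus = "reduce bias" if avoidable_bias > variance else "reduce variance"
    return avoidable_bias, variance, focus


print(diagnose(1.0, 8.0, 10.0))  # Case 1
print(diagnose(7.5, 8.0, 10.0))  # Case 2
```

The training and dev errors are identical in both calls; only the human-level estimate changes, and that alone switches the recommended focus.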

#### Accuracy Improvement Over Time

![Human Level Performance](images/hlp.png)