# Week 2: ML Strategy

---


## Carrying out Error Analysis

This section explains the process of **Error Analysis**, a crucial manual diagnostic procedure used in machine learning to systematically prioritize which types of mistakes are most worthwhile to fix.

### Purpose of Error Analysis

* Error analysis is used when a learning algorithm's performance is below the desired level (often human-level performance).
* The goal is to quickly estimate the **ceiling on performance** (maximum potential improvement) for fixing a specific type of error, thereby helping to prioritize development effort.
* The method is to manually examine a sample of examples from the development (dev) set that the algorithm mislabeled (misclassified).

### The Simple Counting Procedure

1.  **Collect Sample:** Get a sample of mislabeled dev set examples (e.g., 100 examples).
2.  **Manual Inspection:** Manually examine each mislabeled example.
3.  **Count and Estimate:** Count how many errors fall into a specific category (e.g., dogs misclassified as cats).
4.  **Calculate Ceiling:** Estimate the maximum possible improvement in accuracy if that specific error category were completely solved.

* **Example:** If the current error rate is $10\%$ and $5\%$ of the errors are due to dogs:
    * Maximum improvement is $5\%$ of the total error.
    * New error ceiling: $10\% - (10\% \times 0.05) = 9.5\%$ error. (A small relative gain.)
* **Example (High Potential):** If $50\%$ of the errors are due to dogs:
    * New error ceiling: $10\% - (10\% \times 0.50) = 5\%$ error. (A huge relative gain, worth significant effort.)
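
The ceiling arithmetic in the two examples above can be sketched in a few lines of Python. The function name `error_ceiling` is an illustrative choice, not something from the lecture.

```python
def error_ceiling(current_error: float, category_fraction: float) -> float:
    """Best-case error rate if every mistake in one category were fixed.

    current_error: overall dev error, e.g. 0.10 for 10%.
    category_fraction: share of the inspected errors in that category, e.g. 0.05.
    """
    if not 0.0 <= category_fraction <= 1.0:
        raise ValueError("category_fraction must be between 0 and 1")
    return current_error * (1.0 - category_fraction)


low_potential = error_ceiling(0.10, 0.05)   # dogs are 5% of the errors
high_potential = error_ceiling(0.10, 0.50)  # dogs are 50% of the errors
print(f"{low_potential:.3f} {high_potential:.3f}")
```

The result is a best case: it assumes the category is solved completely and that no new errors are introduced along the way.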

### Error Analysis with Multiple Categories

When considering multiple ideas for improvement, a structured table or spreadsheet is recommended:

| Idea/Category | Description | Data Collection |
| :--- | :--- | :--- |
| **Setup:** | List misclassified images (e.g., 1 to 100). | Create columns for each error idea (e.g., Dogs, Great Cats, Blurry Images). |
| **Execution:** | For each image, place a checkmark in the relevant column(s). | Use a comments section to briefly describe the mistake (e.g., "Pit bull picture," "Lion, rainy day at zoo"). |
| **Prioritization:** | Calculate the percentage of total errors belonging to each category. | Focus effort on the categories that account for the largest fraction of the errors (highest performance ceiling). |
| **Adaptability:** | New error categories can be added on the fly during the manual inspection process (e.g., "Instagram Filters"). | This allows the analysis to be guided by what the data is actually showing. |
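
The spreadsheet described in the table can also be kept as a small list of records and tallied in code. The sketch below is one possible layout, assuming each inspected image carries a list of category tags; the records and tag names are made up for illustration.

```python
from collections import Counter

# Hypothetical inspection log: one record per misclassified dev image.
inspection_log = [
    {"image": 1, "tags": ["dog"], "comment": "Pit bull picture"},
    {"image": 2, "tags": ["great_cat", "blurry"], "comment": "Lion, rainy day at zoo"},
    {"image": 3, "tags": ["blurry"], "comment": ""},
    {"image": 4, "tags": ["instagram_filter"], "comment": "category added mid-review"},
    {"image": 5, "tags": ["great_cat"], "comment": ""},
]


def category_shares(log):
    """Fraction of inspected errors carrying each tag, largest first."""
    counts = Counter(tag for row in log for tag in row["tags"])
    total = len(log)
    return {tag: n / total for tag, n in counts.most_common()}


shares = category_shares(inspection_log)
for tag, share in shares.items():
    print(f"{tag:18s} {share:.0%}")
```

Because one image can be tagged with several categories, the shares may add up to more than 100%.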

### Conclusion

Error analysis is a fast, low-effort procedure (often 5-10 minutes for 100 examples) that provides crucial data for making strategic, high-impact decisions, helping developers avoid spending months of work on problems with a low performance ceiling.

---

## Cleaning Up Incorrectly Labeled Data

This section addresses the issue of **incorrectly labeled examples** (errors in the true $Y$ values) in a dataset and provides guidelines on whether and how to fix them, particularly in the context of error analysis.

### Training Set Errors

* **Robustness to Random Errors:** Deep learning algorithms are generally robust to random errors in the training set labels, provided the total dataset size is large and the percentage of errors is not too high. It's often acceptable to leave minor random errors as they are.
* **Vulnerability to Systematic Errors:** Deep learning algorithms are less robust to systematic errors (e.g., if the labeler consistently labels all white dogs as cats). Systematic errors must be addressed as they introduce damaging bias.
* **Training vs. Dev/Test:** It is less critical to fix labels in the training set than in the dev/test sets, which are used for crucial evaluation.

### Dev/Test Set Errors (The Priority)

The dev and test sets are used to evaluate models and choose between them, making label accuracy here more important.

* **Error Analysis Integration:** During the manual error analysis process, an extra column should be added to count the percentage of mistakes where the classifier disagreed with the label simply because the label itself was incorrect in the dev set.
* **Decision Criterion:** Fix incorrect labels in the dev/test sets only if they make a significant difference to the ability to accurately evaluate and compare classifiers.

| Scenario | Overall Dev Error | Error Due to Incorrect Labels | Error Due to Other Causes | Conclusion |
| :--- | :--- | :--- | :--- | :--- |
| **Case 1 (Low Impact)** | $10\%$ | $0.6\%$ ($6\%$ of total error) | $9.4\%$ | **Low Priority:** The $0.6\%$ error is a small fraction of the total $10\%$ error. Focus on the larger $9.4\%$ algorithmic error. |
| **Case 2 (High Impact)** | $2\%$ | $0.6\%$ ($30\%$ of total error) | $1.4\%$ | **High Priority:** The $0.6\%$ error is now a large fraction ($30\%$) of the total $2\%$ error. This noise makes it difficult to reliably compare two high-performing classifiers (e.g., $2.1\%$ vs $1.9\%$ error). **Fix labels first.** |
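
The numbers in the two cases can be reproduced with a short calculation. This is a minimal sketch; the helper name `label_error_share` is hypothetical, and comparing the label noise against the gap between two classifiers is a rough illustration rather than a formal test.

```python
def label_error_share(overall_dev_error, label_error):
    """Split dev error into the share caused by bad labels and the remainder."""
    share = label_error / overall_dev_error
    other_causes = overall_dev_error - label_error
    return share, other_causes


case1_share, case1_other = label_error_share(0.10, 0.006)
case2_share, case2_other = label_error_share(0.02, 0.006)
print(f"Case 1: {case1_share:.0%} of dev error comes from labels")
print(f"Case 2: {case2_share:.0%} of dev error comes from labels")

# Case 2 comparison from the table: classifiers at 2.1% and 1.9% dev error.
classifier_gap = 0.021 - 0.019
# True when the label noise is larger than the difference being measured.
labels_swamp_gap = 0.006 > classifier_gap
```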

### Guidelines for Fixing Labels

If you decide to manually fix labels in the dev/test sets, follow these principles:

1.  **Apply to Both:** Apply the label correction process consistently to both the dev and test sets to ensure they maintain the same data distribution.
2.  **Examine Correct and Incorrect Examples (Ideal):** Ideally, you should examine the labels for both examples the algorithm got wrong and examples it got right.
    * *Reality Check:* This is often impractical if the algorithm is highly accurate ($98\%$ correct), as it would require checking $98\%$ of the data. Often, teams only check labels for examples the classifier got wrong.
3.  **Training Set Optional:** You can choose to fix labels only in the smaller dev/test sets and leave the errors in the much larger training set. This is acceptable, although it introduces a slight distribution difference between the training and dev/test sets.

### Importance of Human Insight

* **Beyond Automation:** While deep learning emphasizes feeding data to an algorithm, building practical systems still requires manual error analysis and human insight.
* **Prioritization Tool:** Spending a short amount of time (minutes or a few hours) manually examining data to count error categories is an extremely efficient way to prioritize development directions.

---

## Build your First System Quickly, then Iterate

This section advises adopting an **iterative, build-it-quick** approach when starting a brand new machine learning application to avoid prematurely prioritizing the wrong technical direction.

### The Challenge of Prioritization

For any new machine learning application (e.g., speech recognition), there are dozens of plausible areas for improvement (e.g., noisy backgrounds, accented speech, far-field speakers, children's speech, output fluency).

Even experts find it difficult to prioritize which direction to focus on without first analyzing the problem's specific characteristics and current system performance.

### The Recommended Iterative Strategy

The recommended approach for starting a new machine learning application is to "Build the first system quickly, then iterate."

1.  **Quick Setup (The Target):** Quickly set up your dev/test sets and define the core evaluation metric. It's okay if this target needs to be adjusted later.
2.  **Quick Build (The Baseline):** Build an initial, functional machine learning system quickly. This should be a "quick and dirty" implementation; don't overcomplicate it.
3.  **Diagnosis and Prioritization:** Use the trained initial system to diagnose performance:
    * **Bias/Variance Analysis:** Determine if the problem is primarily due to avoidable bias (underfitting) or variance (overfitting).
    * **Error Analysis:** Manually examine the mistakes the system is making to quantify and prioritize the most frequent error categories (e.g., how many errors are due to "far-field speech").
4.  **Iterate:** Use the results of the diagnosis (e.g., high error rate on far-field speech) to rationally prioritize the next development step.

### The Value of the Initial System

The main value of the initial quick-and-dirty system is that it allows the team to localize the problems through analysis:

* It moves the team from guessing what the biggest problem is to knowing what the biggest problem is (e.g., $50\%$ of errors are far-field).
* It provides the necessary data to apply strategic tools like Bias/Variance analysis and Error Analysis.

### When This Advice Applies

* **Strongly Applies:** When tackling a brand new application area where the team lacks significant prior experience.
* **Less Strongly Applies:** When working in an application area with significant prior experience or a large body of academic literature (e.g., face recognition) that provides a clear starting architecture.

### Common Pitfall

A common mistake is to overthink the problem and build a system that is too complicated at the start, wasting valuable time before the team even knows whether it is prioritizing the right issues.

---

## Training and Testing on Different Distributions

This section addresses the common practice in the deep learning era of training models on data from a different distribution than the target distribution (dev/test sets) to maximize the training set size.

### The Deep Learning Data Dilemma

Deep learning algorithms perform best with large amounts of labeled training data. Often, the available labeled data comes from two sources:
1.  A small amount of data from the target distribution (the data you actually care about).
2.  A large amount of data from a different, easily accessible distribution (e.g., crawled web images).

### Strategy for Setting Up Train/Dev/Test Sets

The critical rule is that the Dev and Test sets must come from the target distribution to ensure the team is optimizing performance where it matters most.

* **Scenario:** Building a cat classifier for a mobile app (target distribution: blurry cell phone photos) using supplemental web-crawled data (different distribution: professional high-res photos).

| Splitting Option | Dev/Test Distribution | Outcome | Recommendation |
| :--- | :--- | :--- | :--- |
| **Option 1 (Random Shuffle)** | Mixed (mostly web photos) | Sets the team's target on optimizing for the *web image* distribution, which is not the product's goal. | **Avoid this option.** |
| **Option 2 (Targeted Split)** | All Mobile App photos | Aims the target correctly, forcing the algorithm to generalize to the data that matters (mobile app photos). | **Recommended strategy.** |

* **Recommended Data Split Example (Option 2):**
    * **Training Set:** Large, combined dataset (e.g., 200k web images + 5k mobile app images).
    * **Dev Set:** Small dataset (e.g., 2.5k) entirely from the mobile app distribution.
    * **Test Set:** Small dataset (e.g., 2.5k) entirely from the mobile app distribution.
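
A minimal sketch of the Option 2 split, assuming the examples are held in two Python lists (one per source). The function name `targeted_split` and the placeholder tuples are illustrative; a real pipeline would hold file paths or tensors instead.

```python
import random


def targeted_split(web, mobile, dev_size, test_size, seed=0):
    """Dev and test come only from the target (mobile) data.

    Whatever mobile data is left over is added to the web data for training.
    """
    rng = random.Random(seed)
    mobile = list(mobile)
    rng.shuffle(mobile)
    dev = mobile[:dev_size]
    test = mobile[dev_size:dev_size + test_size]
    train = list(web) + mobile[dev_size + test_size:]
    rng.shuffle(train)
    return train, dev, test


# Placeholder examples tagged with their source distribution.
web = [("web", i) for i in range(200_000)]
mobile = [("mobile", i) for i in range(10_000)]
train, dev, test = targeted_split(web, mobile, dev_size=2_500, test_size=2_500)
print(len(train), len(dev), len(test))
```

With 10k mobile images in total, this reproduces the sizes listed above: 205k training examples (200k web + 5k mobile) and 2.5k each for dev and test.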

### Implications of Different Distributions

* **Advantage:** Allows the team to utilize a much larger training set, leading to better overall performance.
* **Disadvantage:** The training distribution now differs from the dev/test distributions. This introduces the new challenge of **Data Mismatch**, which requires specialized analysis techniques (to be discussed later).

### General Application

This strategy applies broadly whenever the data available for training differs from the data on which the final product will be evaluated (e.g., training a car speech system using general purpose speech data accumulated over years).

---

## Bias and Variance with Mismatched Data Distributions

This section explains how the analysis of bias and variance must be modified when the training data distribution differs from the dev/test data distribution, introducing the new diagnostic tool: the **Training-Dev Set**.

### The Challenge of Data Mismatch

When the training set distribution differs from the dev/test set distribution, a large gap between training error and dev error can no longer be definitively attributed to variance alone. The increase in error could be due to:
1.  **Variance:** The algorithm failed to generalize to unseen data from the *same* distribution.
2.  **Data Mismatch:** The dev set distribution is inherently *harder* or different than the training set distribution.

### Introducing the Training-Dev Set

To isolate the effects of variance and data mismatch, a new dataset subset is required:

* **Training-Dev Set:** A subset of data randomly carved out from the Training Set distribution (the large source data).
* **Purpose:** The model is trained only on the Training Set proper, not the Training-Dev set. The Training-Dev error is measured on data that is unseen but from the same distribution as the training data.
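
Carving out the training-dev set amounts to a random holdout from the training pool. The sketch below assumes the pool fits in a Python list; `carve_training_dev` is a hypothetical helper name.

```python
import random


def carve_training_dev(training_pool, holdout_size, seed=0):
    """Randomly hold out part of the training pool as the training-dev set.

    Returns (training_set_proper, training_dev_set). The model should be
    trained only on the first of the two.
    """
    rng = random.Random(seed)
    shuffled = list(training_pool)
    rng.shuffle(shuffled)
    return shuffled[holdout_size:], shuffled[:holdout_size]


pool = list(range(1_000))  # stand-in for training examples
train_proper, train_dev = carve_training_dev(pool, holdout_size=100)
print(len(train_proper), len(train_dev))
```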

### Diagnostic Gaps in Data Mismatch Setting

By measuring error at five points (human-level error (HLE), Training, Training-Dev, Dev, Test), four distinct problems can be diagnosed:

| Gap | Calculation | Problem | Focus for Improvement |
| :--- | :--- | :--- | :--- |
| **Gap 1** | Training Error - HLE | Avoidable Bias | Focus on fitting the training data better (e.g., bigger model, better optimization). |
| **Gap 2** | Training-Dev Error - Training Error | Variance | Focus on generalization (e.g., regularization, more data from the training distribution). |
| **Gap 3** | Dev Error - Training-Dev Error | Data Mismatch | Focus on making the model robust to differences between the source data and the target data. |
| **Gap 4** | Test Error - Dev Error | Overfitting to Dev Set | Indicates the team over-optimized to the dev set. Fix: Get a larger dev set. |

### Example Scenarios

| Training Error | Training-Dev Error | Dev Error | Diagnosis | Primary Focus |
| :---: | :---: | :---: | :--- | :--- |
| $1\%$ | $9\%$ | $10\%$ | **High Variance** (Large Gap 2) | Reduce variance (regularization). |
| $1\%$ | $1.5\%$ | $10\%$ | **High Data Mismatch** (Large Gap 3) | Address data mismatch (techniques for distribution shift). |
| $10\%$ | $11\%$ | $12\%$ | **High Avoidable Bias** (Large Gap 1, assuming HLE $\approx 0\%$) | Reduce bias (bigger model). |
| $10\%$ | $11\%$ | $20\%$ | High Bias + **High Data Mismatch** (Large Gaps 1 & 3) | Address bias and data mismatch. |
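
The gap calculations can be applied to the four scenarios with a short script. This is a sketch under one stated assumption: human-level error is taken as $0\%$ in every row, which the table only states for the third. The function names are illustrative.

```python
def diagnose(human_level, train, train_dev, dev, test=None):
    """Return each diagnostic gap as a difference of error rates."""
    gaps = {
        "avoidable_bias": train - human_level,
        "variance": train_dev - train,
        "data_mismatch": dev - train_dev,
    }
    if test is not None:
        gaps["dev_overfitting"] = test - dev
    return gaps


def largest_gap(gaps):
    """Name of the gap with the biggest value."""
    return max(gaps, key=gaps.get)


# (training error, training-dev error, dev error) from the table rows.
scenarios = [
    (0.01, 0.09, 0.10),
    (0.01, 0.015, 0.10),
    (0.10, 0.11, 0.12),
    (0.10, 0.11, 0.20),
]
# Assumption: human-level error of 0% for every scenario.
results = [largest_gap(diagnose(0.0, *row)) for row in scenarios]
for row, name in zip(scenarios, results):
    print(row, "->", name)
```

Picking only the single largest gap is a simplification. In the last scenario the data-mismatch gap (9 points) is almost as large as the bias gap (10 points), so both deserve attention, as the table notes.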

---

## Addressing Data Mismatch

This section focuses on the practical steps and inherent risks of addressing data mismatch.

### Diagnosis and Root Cause Analysis

When the gap between dev error and training-dev error (Dev Error - Training-Dev Error) indicates a significant **data mismatch problem**, the first step is to gain human insight into the distribution difference.

* **Manual Error Analysis:** Manually examine misclassified examples in the **Dev Set** (not the Test Set, to avoid overfitting the final evaluation).
* **Identify Differences:** Try to understand *how* the Dev Set data is different or harder than the Training Set data.
    * *Example (Speech Recognition):* The Dev Set contains a high frequency of **car noise** (a new source of error) or requires accurate recognition of **street numbers** (a new priority).

### Strategic Solutions (Fixing the Mismatch)

Once the cause is identified, the goal is to make the Training Data distribution more similar to the Dev/Test distribution. The two main approaches are:

| Strategy | Action | Example |
| :--- | :--- | :--- |
| **Data Collection** | Deliberately collect more real data that matches the Dev/Test distribution features. | Collect more audio recordings of people speaking street addresses. |
| **Artificial Data Synthesis (ADS)** | Programmatically manipulate existing clean data to simulate the hard-to-collect noise/features found in the Dev Set. | Synthesize in-car noise by adding separately recorded car audio to large amounts of clean speech audio. |

### Cautions and Risks of Artificial Data Synthesis (ADS)

While ADS can provide significant performance boosts, it carries a major risk: **Overfitting to the Synthesized Features.**

* **Risk of Impoverished Subset:** If a large dataset of clean data (e.g., 10,000 hours of speech) is combined with a very small, unique noise source (e.g., 1 hour of car noise) that is simply repeated, the model may overfit to the subtle patterns of that single hour of noise.
* **The Human Perception Trap:** The synthesized data may sound perfectly fine to a human ear even though it was built from only a tiny, unrepresentative subset of the total possible noise space.
* **Best Practice:** Ideally, the component being added (e.g., car noise) should itself be large and varied (e.g., 10,000 unique hours of car noise) to match the scale of the clean data and prevent the model from overfitting to specific noise artifacts.
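
The risk can be illustrated with a toy synthesis routine. This is a simplified sketch, not a real audio pipeline: signals are plain lists of floats filled with random values, the noise is added at a fixed gain, and the names `synthesize` and `noise_gain` are assumptions made for the example.

```python
import random


def synthesize(clean_clips, noise_clips, noise_gain=0.3, seed=0):
    """Overlay a randomly chosen noise clip on each clean clip.

    Noise shorter than the clean clip wraps around (is repeated).
    Returns the mixed clips and the set of noise-clip indices that were used.
    """
    rng = random.Random(seed)
    mixed, used = [], set()
    for clean in clean_clips:
        idx = rng.randrange(len(noise_clips))
        noise = noise_clips[idx]
        start = rng.randrange(len(noise))
        mixed.append([
            sample + noise_gain * noise[(start + i) % len(noise)]
            for i, sample in enumerate(clean)
        ])
        used.add(idx)
    return mixed, used


rng = random.Random(1)
clean_clips = [[rng.uniform(-1, 1) for _ in range(160)] for _ in range(50)]
one_noise = [[rng.uniform(-1, 1) for _ in range(80)]]
many_noise = [[rng.uniform(-1, 1) for _ in range(80)] for _ in range(50)]

mixed_one, used_one = synthesize(clean_clips, one_noise)
mixed_many, used_many = synthesize(clean_clips, many_noise)
print(len(used_one), "noise source vs", len(used_many), "noise sources")
```

With a single noise clip, every synthesized example shares the same underlying noise, however large the clean dataset is. Drawing from many distinct clips spreads the examples over more of the noise space.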