# Week 2: ML Strategy

---


## Carrying out Error Analysis

This section explains the process of **Error Analysis**, a crucial manual diagnostic procedure used in machine learning to systematically prioritize which types of mistakes are most worthwhile to fix.

### Purpose of Error Analysis

* Error analysis is used when a learning algorithm's performance is below the desired level (often human-level performance).
* The goal is to quickly estimate the **ceiling on performance** (maximum potential improvement) for fixing a specific type of error, thereby helping to prioritize development effort.
* The method is to manually examine a sample of misclassified examples (examples the algorithm got wrong) from the development (dev) set.

### The Simple Counting Procedure

1.  **Collect Sample:** Get a sample of misclassified dev set examples (e.g., 100 examples).
2.  **Manual Inspection:** Manually examine each misclassified example.
3.  **Count and Estimate:** Count how many errors fall into a specific category (e.g., dogs misclassified as cats).
4.  **Calculate Ceiling:** Estimate the maximum possible improvement in accuracy if that specific error category were completely solved.

* **Example (Low Potential):** If the current error rate is $10\%$ and $5\%$ of the errors are due to dogs:
    * Fixing every dog error removes at most $5\%$ of the total error, i.e. $0.5$ percentage points.
    * Best-case error: $10\% - (10\% \times 0.05) = 9.5\%$. (A small relative gain.)
* **Example (High Potential):** If $50\%$ of the errors are due to dogs:
    * Best-case error: $10\% - (10\% \times 0.50) = 5\%$. (A huge relative gain, worth significant effort.)
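The arithmetic in the two examples above can be written as a small helper. This is a minimal sketch: the function name `best_case_error` and its parameters are illustrative choices, not something the course defines.

```python
def best_case_error(current_error: float, category_fraction: float) -> float:
    """Error rate that would remain if one error category were fully fixed.

    current_error: overall dev-set error rate, as a fraction (0.10 for 10%).
    category_fraction: share of the sampled errors that fall in the category.
    """
    if not 0.0 <= current_error <= 1.0:
        raise ValueError("current_error must be between 0 and 1")
    if not 0.0 <= category_fraction <= 1.0:
        raise ValueError("category_fraction must be between 0 and 1")
    # Removing the category removes that share of the current error.
    return current_error * (1.0 - category_fraction)


# The two scenarios from the notes: dogs are 5% vs. 50% of the errors.
low_potential = best_case_error(0.10, 0.05)
high_potential = best_case_error(0.10, 0.50)

print(f"Dogs are 5% of errors:  best case {low_potential:.1%} error")
print(f"Dogs are 50% of errors: best case {high_potential:.1%} error")
```

The result is a bound, not a forecast: it assumes every error in the category is eliminated and that the sampled fraction is representative of the whole dev set.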

### Error Analysis with Multiple Categories

When considering multiple ideas for improvement, a structured table or spreadsheet is recommended:

| Step | What to Do | Details |
| :--- | :--- | :--- |
| **Setup:** | List misclassified images (e.g., 1 to 100). | Create columns for each error idea (e.g., Dogs, Great Cats, Blurry Images). |
| **Execution:** | For each image, place a checkmark in the relevant column(s). | Use a comments section to briefly describe the mistake (e.g., "Pit bull picture," "Lion, rainy day at zoo"). |
| **Prioritization:** | Calculate the percentage of total errors belonging to each category. | Focus effort on the categories that account for the largest fraction of the errors (highest performance ceiling). |
| **Adaptability:** | New error categories can be added on the fly during the manual inspection process (e.g., "Instagram Filters"). | This allows the analysis to be guided by what the data is actually showing. |
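The spreadsheet tally can also be done in a few lines of code once the manual review is recorded. The sketch below assumes a hypothetical review log (`reviewed`) in which each misclassified image carries a set of category tags and a free-text comment; the tag names and the four sample rows are made up for illustration.

```python
from collections import Counter

# Hypothetical review log: one entry per misclassified dev-set image.
# An image may carry several tags, or none if no category fits.
reviewed = [
    {"image": 1, "tags": {"dog"}, "comment": "Pit bull picture"},
    {"image": 2, "tags": {"great_cat", "blurry"}, "comment": "Lion, rainy day at zoo"},
    {"image": 3, "tags": {"blurry"}, "comment": ""},
    {"image": 4, "tags": set(), "comment": "No obvious category"},
]


def category_shares(rows):
    """Return {category: fraction of reviewed errors}, largest share first."""
    if not rows:
        return {}
    counts = Counter(tag for row in rows for tag in row["tags"])
    total = len(rows)
    return {tag: n / total for tag, n in counts.most_common()}


shares = category_shares(reviewed)
for tag, share in shares.items():
    print(f"{tag:<10} {share:.0%}")
```

Because one image can be checked in more than one column, the shares may sum to more than 100%. Adding a new category mid-review only requires using a new tag on later rows.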

### Conclusion

Error analysis is a fast, low-effort procedure (often 5-10 minutes for 100 examples) that provides crucial data for making strategic, high-impact decisions, helping developers avoid spending months of work on problems with a low performance ceiling.

---

## Cleaning Up Incorrectly Labeled Data

This section addresses the issue of **incorrectly labeled examples** (errors in the true $Y$ values) in a dataset and provides guidelines on whether and how to fix them, particularly in the context of error analysis.

### Training Set Errors

* **Robustness to Random Errors:** Deep learning algorithms are generally robust to random errors in the training set labels, provided the total dataset size is large and the percentage of errors is not too high. It's often acceptable to leave minor random errors as they are.
* **Vulnerability to Systematic Errors:** Deep learning algorithms are less robust to systematic errors (e.g., if the labeler consistently labels all white dogs as cats). Systematic errors must be addressed as they introduce damaging bias.
* **Training vs. Dev/Test:** It is less critical to fix labels in the training set than in the dev/test sets, which are used for crucial evaluation.

### Dev/Test Set Errors (The Priority)

The dev and test sets are used to evaluate models and choose between them, making label accuracy here more important.

* **Error Analysis Integration:** During manual error analysis, add an extra column that marks the examples where the classifier disagreed with the label only because the dev set label itself was incorrect, then compute the percentage of mistakes that fall in that column.
* **Decision Criterion:** Fix incorrect labels in the dev/test sets only if they make a significant difference to the ability to accurately evaluate and compare classifiers.

| Scenario | Overall Dev Error | Error Due to Incorrect Labels | Error Due to Other Causes | Conclusion |
| :--- | :--- | :--- | :--- | :--- |
| **Case 1 (Low Impact)** | $10\%$ | $0.6\%$ ($6\%$ of total error) | $9.4\%$ | **Low Priority:** The $0.6\%$ error is a small fraction of the total $10\%$ error. Focus on the larger $9.4\%$ algorithmic error. |
| **Case 2 (High Impact)** | $2\%$ | $0.6\%$ ($30\%$ of total error) | $1.4\%$ | **High Priority:** The $0.6\%$ error is now a large fraction ($30\%$) of the total $2\%$ error. This noise makes it difficult to reliably compare two high-performing classifiers (e.g., $2.1\%$ vs $1.9\%$ error). **Fix labels first.** |
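The numbers in the table follow from two simple quantities: the share of the dev error caused by bad labels, and the error left over from all other causes. A minimal sketch, with the hypothetical helper name `label_error_breakdown`:

```python
def label_error_breakdown(overall_dev_error: float, label_error: float):
    """Split dev error into the part caused by bad labels and the rest.

    Both arguments are fractions of the whole dev set (0.006 for 0.6%).
    Returns (share_of_error_due_to_labels, error_from_other_causes).
    """
    if overall_dev_error <= 0.0:
        raise ValueError("overall_dev_error must be positive")
    if not 0.0 <= label_error <= overall_dev_error:
        raise ValueError("label_error must be between 0 and overall_dev_error")
    share = label_error / overall_dev_error
    other = overall_dev_error - label_error
    return share, other


# Case 1: 10% overall dev error; Case 2: 2% overall dev error.
# In both cases 0.6% of the dev set is mislabeled.
case1_share, case1_other = label_error_breakdown(0.10, 0.006)
case2_share, case2_other = label_error_breakdown(0.02, 0.006)

print(f"Case 1: {case1_share:.0%} of errors from labels, {case1_other:.1%} from other causes")
print(f"Case 2: {case2_share:.0%} of errors from labels, {case2_other:.1%} from other causes")
```

The same absolute amount of label noise matters far more in Case 2 because it is comparable in size to the gap between the classifiers being compared.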

### Guidelines for Fixing Labels

If you decide to manually fix labels in the dev/test sets, follow these principles:

1.  **Apply to Both:** Apply the same label correction process to both the dev and test sets so that they continue to come from the same distribution.
2.  **Examine Correct and Incorrect Examples (Ideal):** Ideally, you should examine the labels for both examples the algorithm got wrong and examples it got right.
    * *Reality Check:* This is often impractical if the algorithm is highly accurate ($98\%$ correct), as it would require checking $98\%$ of the data. Often, teams only check labels for examples the classifier got wrong.
3.  **Training Set Optional:** You can choose to fix labels only in the smaller dev/test sets and leave the errors in the much larger training set. This is acceptable, although it introduces a slight distribution difference between the training and dev/test sets.

### Importance of Human Insight

* **Beyond Automation:** While deep learning emphasizes feeding data to an algorithm, building practical systems still requires manual error analysis and human insight.
* **Prioritization Tool:** Spending a short amount of time (minutes or a few hours) manually examining data to count error categories is an extremely efficient way to prioritize development directions.