# Week 3: Advice for applying machine learning

## Deciding what to try next

Here we introduce the topic of **Machine Learning Diagnostics**, stressing the importance of making efficient, data-driven decisions during a machine learning project to save significant time and effort.

### 1. The Need for Efficiency
* **Time Savings:** An efficient team can often complete a machine learning system in a few weeks, while a less skilled team might take six months for the same project.
* **Effective Decisions:** The key to efficiency is the ability to repeatedly make **good decisions about what to do next** to improve the model's performance.

### 2. Common Improvement Strategies
When a machine learning model (e.g., regularized linear regression) shows unacceptable error, there are many common, but not always fruitful, steps one might try:
* **Data Quantity:** Get **more training examples**.
* **Feature Quantity:** Try a **smaller set of features** (feature selection).
* **Feature Quality:** Get **additional features** (feature engineering).
* **Feature Transformation:** Add **polynomial features** (e.g., $x_1^2, x_1x_2$).
* **Regularization:** **Decrease** the regularization parameter ($\lambda$).
* **Regularization:** **Increase** the regularization parameter ($\lambda$).

### 3. The Role of Diagnostics
* **Problem:** Blindly trying the strategies above often wastes time (e.g., spending months collecting data that doesn't ultimately help).
* **Solution:** A **diagnostic** is a test you can run to gain insight into **what is or isn't working** with your current learning algorithm.
* **Guidance:** Diagnostics provide objective guidance on which investment of time (e.g., which strategy from the list above) is most likely to improve the algorithm's performance.
* **Value:** While diagnostics take time to implement, running them is a crucial step that can save months of wasted effort.

### 4. Next Step
The immediate next step in this study is to learn **how to evaluate the performance** of a learning algorithm, which is necessary before any meaningful diagnostic can be run.

## Evaluating a model

Here we introduce the fundamental process of **evaluating a machine learning model** by splitting the dataset into training and test sets and calculating error metrics.

### 1. The Need for Systematic Evaluation
* **Problem:** For models with many features (more than 1 or 2), it is impossible to visually plot the function $f(x)$ to determine if the model is a good fit.
* **Goal:** A systematic method is required to tell if a model that fits the **training data** well will also **generalize** to new, unseen examples.

### 2. Splitting the Dataset
* **Method:** The entire dataset is split into two subsets:
    * **Training Set ($m_{\text{train}}$):** Typically **70% to 80%** of the data. Used exclusively to **train** the model (i.e., minimize the cost function $J$).
    * **Test Set ($m_{\text{test}}$):** Typically **20% to 30%** of the data. Used exclusively to **evaluate** the trained model's generalization ability.

### 3. Evaluating Performance (Regression - Squared Error Cost)
* **Training Cost (Used to fit $W, b$):** The standard cost function **including the regularization term** is minimized to find the parameters:
    $$J(W, b) = \frac{1}{2m_{\text{train}}} \sum_{i=1}^{m_{\text{train}}} (f(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m_{\text{train}}} \sum_{j=1}^{n} w_j^2$$
* **Training Error ($J_{\text{train}}$):** The average squared error calculated **only on the training set** and **without the regularization term**:
    $$J_{\text{train}}(W, b) = \frac{1}{2m_{\text{train}}} \sum_{i=1}^{m_{\text{train}}} (f(x^{(i)}) - y^{(i)})^2$$
* **Test Error ($J_{\text{test}}$):** The average squared error calculated **only on the test set** and **without the regularization term**:
    $$J_{\text{test}}(W, b) = \frac{1}{2m_{\text{test}}} \sum_{i=1}^{m_{\text{test}}} (f(x_{\text{test}}^{(i)}) - y_{\text{test}}^{(i)})^2$$
* **Diagnosis:** A large difference where $J_{\text{train}}$ is low but $J_{\text{test}}$ is high indicates a **poorly generalizing model** (overfitting).
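
The two error formulas above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy data, the 70/30 split, the straight-line model, and the helper name `squared_error_cost` are all hypothetical and not part of the course material.

```python
import numpy as np

def squared_error_cost(y_pred, y_true):
    """Average squared error with the 1/(2m) convention, no regularization term."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    m = y_true.shape[0]
    return np.sum((y_pred - y_true) ** 2) / (2 * m)

# Hypothetical toy data: a noisy quadratic.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=50)
y = x ** 2 + rng.normal(scale=0.5, size=50)

# Shuffle, then keep 70% for training and 30% for testing.
idx = rng.permutation(len(x))
n_train = int(round(0.7 * len(x)))
train_idx, test_idx = idx[:n_train], idx[n_train:]

# Fit a straight line using the training set only.
w, b = np.polyfit(x[train_idx], y[train_idx], deg=1)

# Both errors are computed without any regularization term.
j_train = squared_error_cost(w * x[train_idx] + b, y[train_idx])
j_test = squared_error_cost(w * x[test_idx] + b, y[test_idx])
print(f"J_train = {j_train:.3f}, J_test = {j_test:.3f}")
```

The point of the sketch is that the parameters come from the training examples alone, while $J_{\text{test}}$ is computed on examples the fit never saw.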

### 4. Evaluating Performance (Classification)
For classification (e.g., logistic regression), two evaluation metrics are common:

| Evaluation Metric | Description |
| :--- | :--- |
| **Average Loss** | The average logistic loss (cross-entropy) calculated on the test set (analogous to $J_{\text{test}}$ in regression, without regularization). |
| **Misclassification Error** (Most Common) | The **fraction of examples** in the test set where the model's prediction ($\hat{y}$) does not equal the actual label ($y$). The model predicts $\hat{y} = 1$ if $f(x) \ge 0.5$ and $\hat{y} = 0$ if $f(x) < 0.5$; the error is the count of examples with $\hat{y} \ne y$ divided by $m_{\text{test}}$. |
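
As a minimal sketch of the misclassification error described above, assuming the model outputs are probabilities $f(x)$ and the labels are 0 or 1 (the function name and the five example values are made up for illustration):

```python
import numpy as np

def misclassification_error(probs, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction differs from the label."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    y_hat = (probs >= threshold).astype(int)  # y_hat = 1 if f(x) >= 0.5, else 0
    return np.mean(y_hat != labels)

# Hypothetical model outputs f(x) and true labels for five test examples.
probs = [0.9, 0.4, 0.5, 0.2, 0.7]
labels = [1, 1, 1, 0, 0]
error = misclassification_error(probs, labels)
print(error)  # 0.4
```

Here the second and fifth examples are misclassified, so the error is 2 out of 5.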

### 5. Future Steps
Splitting the data into training and test sets is the first step toward enabling an algorithm to **automatically choose the best model** (e.g., deciding between a second-order or fourth-order polynomial), which requires one further refinement to the data splitting idea.

## Model selection and training/cross validation/test sets

Here we detail the proper procedure for **Model Selection** in machine learning, which involves splitting the data into three subsets to prevent an overly optimistic estimate of generalization error.

### 1. The Flaw in Two-Set Evaluation
* **Problem:** If you split your data into only a training set and a test set, and then use the test set to **select** the best model (e.g., choosing a 5th-degree polynomial because it has the lowest $J_{\text{test}}$), the resulting $J_{\text{test}}$ error will be an **overly optimistic (lower) estimate** of the true generalization error.
* **Reason:** You have effectively **fit a hyperparameter** (the polynomial degree, $d$) to the test set, making the test set error an unreliable estimate of performance on truly new data.

### 2. The Solution: Three-Way Data Split
To enable automatic model selection without corrupting the final performance estimate, the data must be split into three distinct subsets:

| Subset | Typical Split | Purpose | Notation |
| :--- | :--- | :--- | :--- |
| **Training Set** | 60% | Used to **fit (train)** the model's parameters ($W, b$). | $m_{\text{train}}$ |
| **Cross-Validation Set** | 20% | Used to **select** the best model (e.g., choose polynomial degree $d$, or neural network architecture). | $m_{\text{cv}}$ |
| **Test Set** | 20% | Used only **once** at the very end to give an **unbiased final estimate** of the generalization error. | $m_{\text{test}}$ |

* **Terminology:** The Cross-Validation set (CV set) is also commonly called the **Validation Set** or the **Development Set (Dev Set)**.
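
One possible way to produce the 60/20/20 split from the table is sketched below with plain NumPy. The helper name `three_way_split`, its default fractions, and the toy arrays are assumptions made for this example.

```python
import numpy as np

def three_way_split(X, y, train_frac=0.6, cv_frac=0.2, seed=0):
    """Shuffle the examples and split them into train / cross-validation / test."""
    X, y = np.asarray(X), np.asarray(y)
    m = X.shape[0]
    idx = np.random.default_rng(seed).permutation(m)
    n_train = int(round(train_frac * m))
    n_cv = int(round(cv_frac * m))
    # Whatever remains after the train and CV slices becomes the test set.
    train, cv, test = np.split(idx, [n_train, n_train + n_cv])
    return (X[train], y[train]), (X[cv], y[cv]), (X[test], y[test])

# Hypothetical dataset with 50 examples and 2 features.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)
(train_X, train_y), (cv_X, cv_y), (test_X, test_y) = three_way_split(X, y)
print(len(train_y), len(cv_y), len(test_y))  # 30 10 10
```

Shuffling before slicing matters when the file is ordered (for example by date or by label), since otherwise the three subsets could come from different distributions.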

### 3. The Model Selection Procedure
The process for automatically choosing the best model (Model Selection) uses all three sets:

1.  **Fit Models:** Train all candidate models (e.g., $d=1$ through $d=10$ polynomials, or different neural network architectures) only on the **Training Set**. This yields a set of parameters ($W^1, b^1$), ($W^2, b^2$), etc.
2.  **Select Best Model:** Evaluate the performance of all trained models using the **Cross-Validation Error** ($J_{\text{cv}}$).
    * **Pick the model** that yields the lowest $J_{\text{cv}}$. (e.g., if $J_{\text{cv}}$ is lowest for the 4th-order polynomial, you select that model).
3.  **Final Report:** Report the generalization performance of the **selected model** using its error on the **Test Set** ($J_{\text{test}}$).
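
The three steps above can be sketched as follows. This is an illustration only: the synthetic quadratic data, the fixed 120/40/40 split, and the use of `np.polyfit` (unregularized least squares) as the training step are assumptions for the example.

```python
import numpy as np

def cost(y_pred, y_true):
    """Squared error cost with the 1/(2m) convention."""
    return np.mean((y_pred - y_true) ** 2) / 2

# Hypothetical data generated from a quadratic plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 2.0 * x - 1.5 * x ** 2 + rng.normal(scale=0.3, size=200)

# 60% / 20% / 20% split (the samples are already in random order).
x_train, y_train = x[:120], y[:120]
x_cv, y_cv = x[120:160], y[120:160]
x_test, y_test = x[160:], y[160:]

# Step 1: fit every candidate degree on the training set only.
degrees = range(1, 11)
models, j_cv = {}, {}
for d in degrees:
    coeffs = np.polyfit(x_train, y_train, deg=d)
    models[d] = coeffs
    # Step 2: score each fitted model on the cross-validation set.
    j_cv[d] = cost(np.polyval(coeffs, x_cv), y_cv)

best_d = min(j_cv, key=j_cv.get)

# Step 3: touch the test set once, for the selected model only.
j_test = cost(np.polyval(models[best_d], x_test), y_test)
print(f"selected d = {best_d}, J_cv = {j_cv[best_d]:.4f}, J_test = {j_test:.4f}")
```

Note that the test set plays no role in choosing `best_d`, which is what keeps `j_test` a fair estimate.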

### 4. Best Practice
* **Rule:** All decisions about the model (fitting parameters $W, b$, and choosing hyperparameters like polynomial degree or architecture) must be made using only the **Training** and **Cross-Validation** sets.
* **Integrity:** The **Test Set must be untouched** during the decision-making process. This ensures that the final reported $J_{\text{test}}$ is a **fair** (not overly optimistic) estimate of how well the chosen model will perform on truly new data in the real world.

The next topic, **Bias and Variance**, is introduced as the most powerful diagnostic tool, relying on these three error measures.

## Notes from C2W3_Lab_01_Model_Evaluation_and_Selection

Scikit-learn also has a built-in `mean_squared_error()` function that you can use. Note, though, that [as per the documentation](https://scikit-learn.org/stable/modules/model_evaluation.html#mean-squared-error), scikit-learn's implementation divides only by `m` and not `2*m`, where `m` is the number of examples. As mentioned in Course 1 of this Specialization (cost function lectures), dividing by `2m` is a convention these notes follow, but the calculations still work whether or not you include it. Thus, to match the $J_{\text{train}}$, $J_{\text{cv}}$, and $J_{\text{test}}$ formulas used in these notes, call the scikit-learn function and divide the result by 2. A for-loop implementation of the same formula should give the same value.
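
A short check of that equivalence, assuming scikit-learn and NumPy are installed. The four target and prediction values are made up for the example and are not the lab's data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# scikit-learn divides by m, so halve the result to follow the 1/(2m) convention.
cost_sklearn = mean_squared_error(y_true, y_pred) / 2

# Equivalent for-loop implementation of the same formula.
total = 0.0
for yt, yp in zip(y_true, y_pred):
    total += (yp - yt) ** 2
cost_loop = total / (2 * len(y_true))

print(cost_sklearn, cost_loop)  # both 0.1875
```

The squared errors are 0.25, 0.25, 0, and 1, so the sum is 1.5 and dividing by $2m = 8$ gives 0.1875 either way.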

## Diagnosing bias and variance

Here we explain the concepts of **Bias** and **Variance** in machine learning, and how to use the **Training Error ($J_{\text{train}}$)** and **Cross-Validation Error ($J_{\text{cv}}$)** to diagnose these problems, which then guides the next steps for model improvement.

### 1. The Purpose of Bias and Variance Diagnostics
* **Goal:** To systematically determine the reason for a model's poor performance to guide the next steps for improvement.
* **Method:** Instead of relying on visualizations (which aren't possible with many features), a model's performance on the **training set** and the **cross-validation set** provides key diagnostic insights.

### 2. Diagnosing Bias (Underfitting)
* **Visual Analogy:** Fitting a straight line to curvilinear data ($d=1$).
* **Model Characteristic:** The model is too simple (e.g., too few features or a low-degree polynomial) and cannot capture the complexity of the data.
* **Diagnostic Signature:**
    * **$J_{\text{train}}$ is HIGH:** The model performs poorly even on the data it was trained on. This is the **strongest indicator of high bias**.
    * **$J_{\text{cv}}$ is also HIGH:** The cross-validation error is similar to the training error (both are high).

### 3. Diagnosing Variance (Overfitting)
* **Visual Analogy:** Fitting a very wiggly, high-degree polynomial ($d=4$) to a small dataset.
* **Model Characteristic:** The model is too complex and fits the noise and specific examples of the training data too closely, failing to generalize to new data.
* **Diagnostic Signature:**
    * **$J_{\text{train}}$ is LOW:** The model performs great on the training set.
    * **$J_{\text{cv}}$ is MUCH GREATER than $J_{\text{train}}$:** The large gap between the two errors indicates the model performs well on seen data but poorly on unseen data. This is the **strongest indicator of high variance**.

### 4. The "Just Right" Model
* **Diagnostic Signature:**
    * **$J_{\text{train}}$ is LOW** (but not zero).
    * **$J_{\text{cv}}$ is LOW** and is **close to** $J_{\text{train}}$.

### 5. Relationship Between Error and Model Complexity
When plotting the error metrics against the model complexity (e.g., degree of polynomial, $d$):
* **$J_{\text{train}}$:** Typically **decreases** as model complexity ($d$) increases, because a more complex model can fit the training data at least as well as a simpler one.
* **$J_{\text{cv}}$:** Forms a **U-shaped curve**:
    * **Low $d$ (High Bias):** $J_{\text{cv}}$ is high because the model underfits.
    * **Optimal $d$:** $J_{\text{cv}}$ is at its minimum ("just right").
    * **High $d$ (High Variance):** $J_{\text{cv}}$ rises again because the model overfits and fails to generalize.

![Variance and Bias](images/var_bias.png)

### 6. Simultaneous High Bias and High Variance
* **Possibility:** While less common in simple linear regression, it is possible in complex models (like neural networks) to simultaneously have both problems.
* **Diagnostic Signature:**
    * **$J_{\text{train}}$ is HIGH** (indicating bias/underfitting on the training data).
    * **$J_{\text{cv}}$ is MUCH GREATER than $J_{\text{train}}$** (indicating variance/poor generalization).

**Key Takeaways:**
* **High Bias $\implies$ $J_{\text{train}}$ is High.**
* **High Variance $\implies$ $J_{\text{cv}}$ $\gg J_{\text{train}}$.**

## Regularization and bias/variance

Here we touch on the effect of the **regularization parameter ($\lambda$)** on a model's bias and variance, and demonstrate how to use the **cross-validation set** to select the optimal value for $\lambda$.

### 1. Effect of $\lambda$ on Bias and Variance

The regularization parameter $\lambda$ controls the trade-off between fitting the training data well and keeping the model weights ($W$) small. This directly impacts bias and variance:

| Value of $\lambda$ | Model Behavior | Diagnostic Signature | Problem |
| :--- | :--- | :--- | :--- |
| **Very Large** (e.g., 10,000) | Parameters $W$ are forced to near zero, making $f(x) \approx b$ (a constant line). | **$J_{\text{train}}$ is HIGH** and $J_{\text{cv}}$ is HIGH. | **High Bias (Underfitting)** |
| **Very Small** (e.g., 0) | No regularization; the model is free to fit the training noise. | $J_{\text{train}}$ is LOW, but **$J_{\text{cv}}$ is MUCH GREATER than $J_{\text{train}}$**. | **High Variance (Overfitting)** |
| **Intermediate** | The model is "just right," balancing fit and simplicity. | $J_{\text{train}}$ is LOW and $J_{\text{cv}}$ is LOW and close to $J_{\text{train}}$. | **Good Generalization** |

### 2. Using Cross-Validation to Select $\lambda$

The cross-validation set is used to automatically select the best $\lambda$ value, similar to how it's used to select the best polynomial degree ($d$).

1.  **Define a Range:** Choose a wide range of potential $\lambda$ values to test (e.g., $\lambda=0, 0.01, 0.02, 0.04, \dots, 10$).
2.  **Train Models:** For each chosen $\lambda$, train the model (e.g., a 4th-order polynomial) by minimizing the cost function on the **Training Set**. Each $\lambda$ yields a different set of parameters ($W_i, b_i$).
3.  **Evaluate on CV Set:** Calculate the **Cross-Validation Error** ($J_{\text{cv}}$) for each set of parameters ($W_i, b_i$).
4.  **Select Optimal $\lambda$:** Pick the $\lambda$ value that resulted in the **lowest $J_{\text{cv}}$ error**.
5.  **Final Report:** Report the model's final generalization performance using the **Test Set Error** ($J_{\text{test}}$) only on the chosen parameters.
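
The five steps above can be sketched as below. Several things here are assumptions for illustration: the synthetic data, the helper names, the doubling grid that ends at 10.24, and the use of a closed-form solution of the regularized squared-error cost instead of gradient descent.

```python
import numpy as np

def poly_features(x, degree):
    """Columns x, x^2, ..., x^degree (the bias column is added separately)."""
    return np.column_stack([x ** p for p in range(1, degree + 1)])

def fit_regularized(X, y, lam):
    """Minimize (1/2m)*sum((Xw + b - y)^2) + (lam/2m)*sum(w^2) in closed form."""
    m = X.shape[0]
    A = np.column_stack([np.ones(m), X])  # first column multiplies the bias b
    penalty = lam * np.eye(A.shape[1])
    penalty[0, 0] = 0.0                   # the bias b is not regularized
    return np.linalg.solve(A.T @ A + penalty, A.T @ y)

def cost(theta, X, y):
    """Squared error cost without the regularization term."""
    A = np.column_stack([np.ones(X.shape[0]), X])
    return np.mean((A @ theta - y) ** 2) / 2

# Hypothetical data, 4th-order polynomial model, 60/20/20 split.
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=100)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=100)
X = poly_features(x, degree=4)
X_train, y_train = X[:60], y[:60]
X_cv, y_cv = X[60:80], y[60:80]
X_test, y_test = X[80:], y[80:]

# Step 1: candidate values 0, 0.01, 0.02, 0.04, ..., 10.24.
lambdas = [0.0] + [0.01 * 2 ** k for k in range(11)]

# Steps 2 and 3: train on the training set, evaluate on the CV set.
results = []
for lam in lambdas:
    theta = fit_regularized(X_train, y_train, lam)
    results.append((lam, theta, cost(theta, X_train, y_train), cost(theta, X_cv, y_cv)))

# Step 4: pick the lambda with the lowest J_cv.
best_lam, best_theta, _, best_j_cv = min(results, key=lambda r: r[3])

# Step 5: report J_test for the chosen parameters only.
j_test = cost(best_theta, X_test, y_test)
print(f"best lambda = {best_lam}, J_cv = {best_j_cv:.4f}, J_test = {j_test:.4f}")
```

The third entry of each tuple in `results` is $J_{\text{train}}$, which can be inspected to see how the training error changes as $\lambda$ grows.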

### 3. Error vs. $\lambda$ Plot

When plotting the errors as a function of $\lambda$ (on a logarithmic scale):

* **$J_{\text{train}}$ Curve:** **Increases** as $\lambda$ increases. A higher $\lambda$ forces the optimizer to prioritize smaller $W$ values over minimizing training error.
* **$J_{\text{cv}}$ Curve:** Forms a **U-shaped curve**:
    * It is high on the left (small $\lambda$, high variance).
    * It is high on the right (large $\lambda$, high bias).
    * It dips to a minimum in the middle, representing the optimal $\lambda$ value for best generalization.

This diagnostic plot reinforces that cross-validation is essential for finding the **sweet spot** where the model avoids both high bias and high variance.

![Variance and bias for lambda](images/var_bias_lambda.png)

## Establishing a baseline level of performance

This section refines the diagnosis of **Bias and Variance** by introducing the concept of a **Baseline Level of Performance** (often Human-Level Performance). This benchmark allows for a more meaningful judgment of whether the training error is "high."

### 1. The Need for a Baseline

* **Problem with Absolute Error:** Simply looking at the training error ($J_{\text{train}}$) in isolation (e.g., concluding 10.8% error is "high") can be misleading. For many tasks (especially those with noisy unstructured data like speech or images), achieving 0% error is impossible.
* **Baseline Defined:** The **Baseline Level of Performance** is the error rate you can reasonably hope your learning algorithm can eventually reach.
    * **Common Baseline:** **Human-Level Performance** (HLP) is a good benchmark for tasks that humans are proficient at (vision, speech, text).
    * **Other Baselines:** A previous algorithm's performance, a competitor's result, or a guess based on prior experience.

### 2. Refining the Bias-Variance Diagnosis

The diagnosis is based on two key gaps, relative to the Baseline:

####  Gap 1: Baseline to Training Error ($\text{Baseline} \rightarrow J_{\text{train}}$)
This gap indicates the severity of the **High Bias** (Underfitting) problem.

* **Formula:** $\text{Gap}_1 = J_{\text{train}} - \text{Baseline Error}$
* **Interpretation:** If $\text{Gap}_1$ is **LARGE**, it means the algorithm is not even fitting the training data as well as reasonably expected. **$\implies$ High Bias Problem.**
* **Example 1 (Speech):** HLP is 10.6%, $J_{\text{train}}$ is 10.8%. $\text{Gap}_1 = 0.2\%$. This is small, meaning the algorithm is fitting the training data well **relative to the baseline**.

#### Gap 2: Training Error to Cross-Validation Error ($J_{\text{train}} \rightarrow J_{\text{cv}}$)
This gap indicates the severity of the **High Variance** (Overfitting) problem.

* **Formula:** $\text{Gap}_2 = J_{\text{cv}} - J_{\text{train}}$
* **Interpretation:** If $\text{Gap}_2$ is **LARGE** ($J_{\text{cv}}$ is much larger than $J_{\text{train}}$), it means the algorithm is generalizing poorly. **$\implies$ High Variance Problem.**
* **Example 1 (Speech):** $J_{\text{train}}$ is 10.8%, $J_{\text{cv}}$ is 14.8%. $\text{Gap}_2 = 4.0\%$. This is large, indicating a **High Variance Problem.**

### 3. Concluding the Diagnosis

| Example Case | HLP / Baseline | $J_{\text{train}}$ | $J_{\text{cv}}$ | Bias Gap ($\text{HLP} \rightarrow J_{\text{train}}$) | Variance Gap ($J_{\text{train}} \rightarrow J_{\text{cv}}$) | Conclusion |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **Case 1** | 10.6% | 10.8% | 14.8% | Small (0.2%) | Large (4.0%) | **High Variance** |
| **Case 2** | 10.6% | 15.0% | 15.4% | Large (4.4%) | Small (0.4%) | **High Bias** |
| **Case 3** | 10.6% | 15.0% | 19.7% | Large (4.4%) | Large (4.7%) | **High Bias & High Variance** |

* **Final Rule of Thumb:**
    * **High Bias** is confirmed if $J_{\text{train}}$ is large **relative to the baseline**.
    * **High Variance** is confirmed if $J_{\text{cv}}$ is large **relative to $J_{\text{train}}$**.
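
The two gaps and the three cases in the table can be reproduced with a small helper. The function name `diagnose` and the cutoff of 2 percentage points for calling a gap "large" are assumptions made for this sketch; the notes themselves give no numeric threshold.

```python
def diagnose(baseline, j_train, j_cv, threshold=2.0):
    """Return (bias_gap, variance_gap, verdict) for error rates given in percent.

    `threshold` is a hypothetical cutoff (in percentage points) above which a
    gap is treated as large.
    """
    bias_gap = j_train - baseline    # Gap 1: baseline -> training error
    variance_gap = j_cv - j_train    # Gap 2: training error -> CV error
    high_bias = bias_gap > threshold
    high_variance = variance_gap > threshold
    if high_bias and high_variance:
        verdict = "high bias and high variance"
    elif high_bias:
        verdict = "high bias"
    elif high_variance:
        verdict = "high variance"
    else:
        verdict = "no large gap"
    return bias_gap, variance_gap, verdict

# The three cases from the table above: (baseline, J_train, J_cv).
cases = {
    "Case 1": (10.6, 10.8, 14.8),
    "Case 2": (10.6, 15.0, 15.4),
    "Case 3": (10.6, 15.0, 19.7),
}
verdicts = {name: diagnose(*values)[2] for name, values in cases.items()}
for name, verdict in verdicts.items():
    print(name, "->", verdict)
```

In practice, what counts as a large gap depends on the application, so the threshold is better treated as a judgment call than as a fixed constant.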

## Learning curves

Here we introduce **Learning Curves** as a powerful diagnostic tool to understand a model's performance by plotting the training and cross-validation errors as a function of the **training set size ($m_{\text{train}}$)**.


### 1. What a Learning Curve Shows

* **General Trends (for a good model):**
    * **$J_{\text{cv}}$ (Cross-Validation Error):** Typically **decreases** as $m_{\text{train}}$ increases (more data usually yields a model that generalizes better).
    * **$J_{\text{train}}$ (Training Error):** Typically **increases** as $m_{\text{train}}$ increases. It's easy to fit 1-3 examples perfectly (error $\approx 0$), but harder to fit hundreds perfectly.
    * **Relationship:** $J_{\text{cv}}$ is typically higher than $J_{\text{train}}$ because the parameters are fitted to the training set.
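
A learning curve can be computed by training the same model on progressively larger subsets of the training set and recording both errors each time. The sketch below assumes synthetic quadratic data, a quadratic model fitted with `np.polyfit`, and an arbitrary list of subset sizes.

```python
import numpy as np

def cost(coeffs, x, y):
    """Squared error cost (1/(2m) convention) of a polynomial with these coefficients."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2) / 2

# Hypothetical data: a noisy quadratic, 200 training and 100 CV examples.
rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=300)
y = x ** 2 + rng.normal(scale=0.5, size=300)
x_train, y_train = x[:200], y[:200]
x_cv, y_cv = x[200:], y[200:]

sizes = [5, 10, 20, 50, 100, 200]
curve = []
for m in sizes:
    # Train on the first m training examples only.
    coeffs = np.polyfit(x_train[:m], y_train[:m], deg=2)
    j_train = cost(coeffs, x_train[:m], y_train[:m])  # error on the examples used to fit
    j_cv = cost(coeffs, x_cv, y_cv)                   # error on the full CV set
    curve.append((m, j_train, j_cv))

for m, j_train, j_cv in curve:
    print(f"m_train = {m:3d}  J_train = {j_train:.3f}  J_cv = {j_cv:.3f}")
```

Plotting the second and third columns against the first gives the two curves discussed in this section.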

### 2. Learning Curve for High Bias (Underfitting)
* **Diagnosis:** Both $J_{\text{train}}$ and $J_{\text{cv}}$ flatten out (plateau) as $m_{\text{train}}$ grows, and the flattened error is significantly **above the desired Baseline Level of Performance** (e.g., human-level performance). The model is too simple (e.g., a straight line) for the data.
* **Implication for Improvement:** **Getting more training data WILL NOT help much.** Because the model is too simple, adding more data will not change the simple function being fitted, and the errors will remain high. To fix high bias, you must increase model complexity.

---

![High bias](images/learning_curve_bias.png)


### 3. Learning Curve for High Variance (Overfitting)
* **Diagnosis:** There is a **large gap** between a low $J_{\text{train}}$ and a much higher $J_{\text{cv}}$. The model is too complex and is overfitting the noise in the training data, leading to poor generalization.
* **Implication for Improvement:** **Getting more training data IS LIKELY to help.** By extending the curve to the right:
    * $J_{\text{train}}$ will continue to rise.
    * $J_{\text{cv}}$ will continue to fall and approach $J_{\text{train}}$.
    * The gap closes, and the cross-validation error moves closer to the desired baseline performance.

---
![High variance](images/learning_curve_variance.png)


### 4. Practical Considerations
* **Computational Cost:** Plotting learning curves by training many models on different subsets of the data can be **computationally expensive** and is not done often in practice.
* **Mental Model:** However, having the **mental picture** of the learning curves is essential for understanding whether your primary problem is high bias or high variance, which then guides your strategy for improving the algorithm.

## What to do next?

The diagnosis (checking $J_{\text{train}}$ and $J_{\text{cv}}$) guides the choice of solution. Different strategies address either a High Bias (Underfitting) or a High Variance (Overfitting) problem.

### 1. Strategies for High Variance (Overfitting)

A high variance model is too complex and fits the training data (and noise) too well, leading to a large gap between $J_{\text{train}}$ and $J_{\text{cv}}$. The goal is to **simplify the model** or **provide more data**.

* **Get More Training Examples:** $\uparrow$ Data is often an effective way to reduce variance, as it forces the model to generalize across a wider range of examples.
* **Try a Smaller Set of Features:** $\downarrow$ Features reduces the model's flexibility to find complex, "wiggly" functions, thereby simplifying the model.
* **Increase Regularization ($\lambda$):** $\uparrow \lambda$ forces the parameters ($W$) to be smaller (closer to zero), resulting in a smoother, less complex function.

### 2. Strategies for High Bias (Underfitting)

A high bias model is too simple and fails to capture the complexity of the data, resulting in a high $J_{\text{train}}$ (poor performance even on the training set). The goal is to **increase the model's complexity/power**.

* **Get Additional Features:** Adding relevant, new features provides the model with more information to better predict the output.
* **Add Polynomial Features:** Introducing terms like $x^2, x^3, x_1x_2$ allows a linear model to fit non-linear data (i.e., increases its complexity).
* **Decrease Regularization ($\lambda$):** $\downarrow \lambda$ reduces the penalty on the parameters, giving the model more freedom to fit the training data better.

### 3. The Systematic Workflow

1.  **Diagnose:** Look at the error metrics ($J_{\text{train}}$ and $J_{\text{cv}}$) and the **Baseline Performance** to determine the primary issue:
    * **High Bias:** $J_{\text{train}}$ is high (relative to baseline).
    * **High Variance:** $J_{\text{cv}}$ $\gg J_{\text{train}}$.
2.  **Act:** Choose the corresponding set of solutions.
    * *Note:* Do not try to fix high bias by reducing the training set size. A smaller training set may lower $J_{\text{train}}$, but it tends to worsen the cross-validation error and overall performance.

This systematic approach replaces trial-and-error, allowing developers to be far more effective in improving their machine learning systems.

## Iterative loop of ML development

### 1. The Iterative ML Development Loop
Developing a successful machine learning system is rarely a one-shot process; it requires multiple iterations:
1.  **Architecture & Data:** Decide on the model, data, and initial hyperparameters.
2.  **Implement & Train:** Train the initial model (which seldom works perfectly right away).
3.  **Diagnostics:** Analyze performance using tests like **Bias and Variance** and **Error Analysis** (to be covered next).
4.  **Decide & Refine:** Based on diagnostic insights, make changes (e.g., change model size, adjust $\lambda$, add/remove data or features).
5.  **Iterate:** Repeat the loop until the desired performance is reached.

### 2. Example: Email Spam Classification
* **Goal:** Build a supervised learning algorithm where input $\mathbf{X}$ (email features) predicts output $y$ (1 for spam, 0 for non-spam).
* **Feature Construction (Bag of Words):** A common method is to select the top 10,000 words in a dictionary and create a feature vector where:
    * $x_i = 1$ if the $i$-th word appears in the email.
    * $x_i = 0$ if the $i$-th word does not appear.
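
A minimal sketch of this feature construction, assuming a tiny six-word vocabulary in place of the 10,000-word dictionary and a simple regular-expression tokenizer (both are choices made only for this example):

```python
import re

def bag_of_words(email_text, vocabulary):
    """Return x where x[i] = 1 if vocabulary[i] appears in the email, else 0."""
    words = set(re.findall(r"[a-z0-9']+", email_text.lower()))
    return [1 if word in words else 0 for word in vocabulary]

# Hypothetical vocabulary and email.
vocabulary = ["a", "andrew", "buy", "deal", "discount", "now"]
x = bag_of_words("Buy now! Great DEAL on watches.", vocabulary)
print(x)  # [0, 0, 1, 1, 0, 1]
```

Each position in `x` corresponds to one vocabulary word, so the vector length equals the vocabulary size regardless of how long the email is.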

### 3. Multiple Improvement Ideas (and the Challenge of Prioritization)
Once a baseline model is trained, several ideas for improvement emerge, and choosing the most fruitful path is crucial for efficiency:

| Improvement Idea | Description |
| :--- | :--- |
| **Collect More Data** | Launching large projects (like "honeypots") to acquire more spam emails. |
| **Sophisticated Routing Features** | Using information from the email header (the sequence of servers an email traveled through) to identify spam paths. |
| **Sophisticated Body Features** | Normalizing text by treating variations (e.g., "discounting" and "discount") as the same word. |
| **Misspelling Detection** | Creating algorithms to recognize deliberate misspellings ("w4tches," "m0rtgage") used by spammers to defeat simple filters. |

### 4. Diagnostics Guide Decisions
* **Bias/Variance as a Filter:** Decisions must be guided by diagnostics. For example, if the model has **High Bias**, spending months on a honeypot project (collecting more data) will be a waste of time.
* **Error Analysis:** The next key diagnostic, Error Analysis, will provide more detailed insights on *which types of errors* the model makes, further guiding feature engineering and architecture decisions.

## Error Analysis

**Error Analysis** is the second most important diagnostic tool (after bias/variance) for guiding effective machine learning development. It involves manually inspecting misclassified examples to prioritize where to focus improvement efforts.

### 1. What is Error Analysis?
* **Definition:** Manually examining a set of examples that the learning algorithm **misclassified** (from the cross-validation set) to gain insight into the common types and sources of error.
* **Procedure:**
    1.  Get a set of misclassified examples (e.g., the 100 examples the algorithm misclassified out of 500 cross-validation examples). If the error set is too large, randomly **sample a subset** (e.g., 100-200 examples) for manual review.
    2.  Manually inspect these examples and group them into **common, non-mutually exclusive categories** (e.g., pharmaceutical spam, phishing, deliberate misspellings, embedded image spam).
    3.  **Count** the number of misclassified examples falling into each category.
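
The counting step can be sketched with `collections.Counter`. The tags below are invented for illustration; in a real review they would come from reading each misclassified email by hand. Because the categories are not mutually exclusive, each example carries a set of tags.

```python
from collections import Counter

# Hypothetical manual tags for six misclassified emails.
tagged_errors = [
    {"pharma"},
    {"phishing", "misspelling"},
    {"pharma", "embedded_image"},
    {"phishing"},
    {"pharma"},
    set(),  # an error that fits none of the categories
]

# Count how many misclassified examples fall into each category.
counts = Counter(tag for tags in tagged_errors for tag in tags)

total = len(tagged_errors)
for category, n in counts.most_common():
    print(f"{category}: {n}/{total} ({100 * n / total:.0f}% of reviewed errors)")
```

Since one email can carry several tags, the counts may add up to more than the number of reviewed examples.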

### 2. Prioritizing Work
* The counts provide empirical data for prioritizing development work.
* **High-Impact Categories:** Categories with a large number of misclassified examples (e.g., Pharmaceutical spam: 21/100, Phishing: 18/100) are the most promising areas to focus engineering effort.
* **Low-Impact Categories:** Categories with few examples (e.g., Deliberate Misspellings: 3/100) are less critical. Even if a perfect solution is built for this category, it would only fix 3% of the current errors.
* **Benefit:** Error analysis helps avoid spending significant time on problems that ultimately have little impact on overall performance (a mistake the speaker admits to making early in his career).

### 3. Error Analysis Guides Solutions
The insights gained from counting errors inspire specific, targeted improvements:

* **Pharmaceutical Spam:** Might inspire collecting **more specific data** (only pharma spam emails) or creating **new features** related to specific drug names.
* **Phishing/Password Theft:** Might inspire creating special code to extract features from **suspicious URLs** within the email body.

### 4. Limitations
* Error analysis is **easiest for tasks that humans are good at** (e.g., classifying emails, recognizing images) because a human can easily determine why the algorithm made a mistake.
* It is **harder for tasks that humans struggle with** (e.g., predicting ad click-through rates), as a manual inspection is less insightful.

### 5. Synergy with Bias/Variance
Error analysis works in conjunction with bias/variance diagnostics:
* **Bias/Variance** determines **if** a strategy (like collecting more data) will help.
* **Error Analysis** determines **what kind of data** or **what kind of features** should be the focus of the engineering effort.

## Adding data

Here we provide several techniques for efficiently **adding, collecting, or creating data** to improve a machine learning model, particularly emphasizing a **data-centric approach** to development.

### 1. Targeted Data Collection
* **Efficiency:** Instead of indiscriminately collecting "more data of everything" (which is slow and expensive), use **Error Analysis** to identify and target specific error subsets.
* **Example:** If errors are dominated by **pharmaceutical spam**, focus efforts on labeling or acquiring more examples *only* of pharma spam to give the algorithm a targeted performance boost.

### 2. Data Augmentation
* **Definition:** Taking existing training examples ($\mathbf{X}, \mathbf{y}$) and applying **distortions or transformations** to the input ($\mathbf{X}$) to generate new, unique training examples that share the original label ($\mathbf{y}$).
* **Image Data (OCR):**
    * Apply basic distortions: **Rotation, scaling (enlarging/shrinking), or adjusting contrast**.
    * Apply advanced distortions: **Random grid warping** to create a richer variety of examples.
* **Audio Data (Speech Recognition):**
    * Add realistic **background noise** (e.g., crowd noise, car noise).
    * Simulate **recording quality issues** (e.g., bad cell phone connection).
* **Guiding Principle:** Distortions should be **representative of the types of noise or variations** expected in the test set. Adding purely random or unrealistic noise is generally not helpful.
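
As one illustration of audio augmentation, the sketch below mixes a noise clip into a clean clip at a chosen signal-to-noise ratio. The SNR parameterization, the function name, and the synthetic sine and random-noise signals are assumptions for this example; a real pipeline would use recorded speech and recorded background noise.

```python
import numpy as np

def add_background_noise(clean, noise, snr_db):
    """Mix `noise` into `clean` so the result has the requested SNR in decibels."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(clean)]  # noise must be at least as long
    signal_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that signal_power / scaled_noise_power = 10^(snr_db / 10).
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Hypothetical one-second clips at 16 kHz: a tone standing in for speech,
# and random samples standing in for crowd or car noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
clean = np.sin(2 * np.pi * 440 * t)
noise = rng.normal(size=16000)

noisy = add_background_noise(clean, noise, snr_db=10)
# The new example (noisy, y) keeps the same label y as the original clip.
```

Choosing SNR values that resemble real recording conditions follows the guiding principle above: the distortion should look like what the test set is expected to contain.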

### 3. Data Synthesis
* **Definition:** Creating entirely **new examples from scratch** rather than modifying existing ones.
* **Example (Photo OCR):** Generating images of text by using a computer's text editor, varying **fonts, colors, contrasts, and backgrounds**.
* **Benefit:** Can quickly generate a very large number of realistic-looking training examples, particularly effective for computer vision tasks.
* **Cost:** Requires significant initial effort to write the code to generate high-quality, realistic synthetic data.

### 4. Shifting to a Data-Centric Approach
* **Model-Centric (Traditional):** Holding the dataset fixed and focusing effort on improving the model/code (e.g., trying different algorithms, optimizing code).
* **Data-Centric (Modern Focus):** Holding the model/algorithm (e.g., neural network) fixed and focusing effort on **engineering the data** used by the algorithm.
* **Rationale:** Modern algorithms (neural networks, logistic regression, etc.) are already highly effective. Therefore, focusing on getting **high-quality, targeted data** is often the more efficient path to improving performance.

### 5. Preview: Transfer Learning
* **Context:** A technique for scenarios where data is scarce and hard to collect.
* **Concept:** Using data and knowledge learned from a **different task** (one that shares the same input type but may otherwise be only loosely related) to boost the performance of the current, data-poor application.

## Transfer learning: using data from a different task

**Transfer Learning** is a powerful technique for improving model performance, especially when labeled data for a specific application is scarce. It involves leveraging knowledge gained from training a model on a large, related dataset.

### 1. The Transfer Learning Process

Transfer learning consists of two main steps:

1.  **Supervised Pre-training (Knowledge Acquisition):**
    * Train a large neural network on a **very large, readily available dataset** (e.g., 1 million images with 1,000 different classes like cats, dogs, cars).
    * This process teaches the network to extract **generic, low-level features** (edges, corners, curves, basic shapes) from the input data.
    * Researchers often **post these pre-trained models online** so others can skip this time-consuming step.
2.  **Fine-Tuning (Application-Specific Adaptation):**
    * Take the pre-trained network and **replace the output layer** with a new, smaller output layer corresponding to your specific task (e.g., 10 units for handwritten digits 0-9).
    * **Initialize** the weights ($W, b$) of the lower layers using the pre-trained values.
    * **Train the new network** on your smaller, specific dataset. Two options exist:
        * **Option 1 (Very Small Dataset):** Only train the parameters of the new output layer ($W^{[5]}, b^{[5]}$ when the output layer is layer 5). **Hold the weights of the lower layers fixed.**
        * **Option 2 (Slightly Larger Dataset):** Use the pre-trained weights as **initialization** and **train all the parameters** in the network.
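
The mechanics of Option 1 can be sketched in plain Python. This is a toy illustration under stated assumptions, not a real pre-trained network: `PRETRAINED_W` and `PRETRAINED_B` are hypothetical hand-written values standing in for downloaded weights, the dataset is made up, and the new output layer is a single sigmoid unit trained with batch gradient descent.

```python
import math

# Hypothetical "pre-trained" hidden layer (2 inputs -> 3 ReLU units).
# In a real project these values would be loaded from a published model.
PRETRAINED_W = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]
PRETRAINED_B = [0.0, 0.0, -0.5]


def hidden_features(x):
    """Frozen lower layer: ReLU(W x + b). Never updated during fine-tuning."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(PRETRAINED_W, PRETRAINED_B)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(data, w_out, b_out):
    """Average binary cross-entropy of the output layer on `data`."""
    total = 0.0
    for x, y in data:
        p = sigmoid(sum(w * h for w, h in zip(w_out, hidden_features(x))) + b_out)
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(data)


def fine_tune_output_layer(data, epochs=500, lr=0.5):
    """Option 1: train only the new output layer; the lower layer stays fixed."""
    w_out, b_out = [0.0, 0.0, 0.0], 0.0  # freshly initialized output layer
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0, 0.0], 0.0
        for x, y in data:
            h = hidden_features(x)
            err = sigmoid(sum(w * hi for w, hi in zip(w_out, h)) + b_out) - y
            grad_w = [g + err * hi for g, hi in zip(grad_w, h)]
            grad_b += err
        w_out = [w - lr * g / len(data) for w, g in zip(w_out, grad_w)]
        b_out -= lr * grad_b / len(data)
    return w_out, b_out


# Tiny made-up target dataset: the label is 1 when x1 > x2.
small_data = [([2.0, 0.0], 1), ([1.5, 0.5], 1), ([0.0, 2.0], 0), ([0.5, 1.5], 0)]
loss_before = mean_loss(small_data, [0.0, 0.0, 0.0], 0.0)
w_out, b_out = fine_tune_output_layer(small_data)
loss_after = mean_loss(small_data, w_out, b_out)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Option 2 would differ only in that the gradient step also updates `PRETRAINED_W` and `PRETRAINED_B`, using the pre-trained values as the starting point.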

![Transfer Learning](images/transfer_learning.png)

### 2. Why Transfer Learning Works

* The early layers of a deep neural network learn features that are often **generic and transferable** across similar input types (e.g., edges and corners are useful for recognizing both cats and handwritten digits).
* By starting with these pre-trained feature detectors, the new network begins training from a **much better starting point**, requiring far fewer examples to achieve good results on the new task.

### 3. Key Constraints and Benefits

* **Input Type Must Match:** The input data type ($\mathbf{X}$) for pre-training and fine-tuning **must be the same**.
    * Pre-trained on **images** $\implies$ fine-tune on **images**.
    * Pre-trained on **audio** $\implies$ fine-tune on **audio**.
    * Pre-trained on **text** $\implies$ fine-tune on **text**.
* **Data Scarcity Solution:** Transfer learning helps dramatically when the target application has a **small dataset** (e.g., getting good results with only 50 or 1,000 labeled images when pre-training used a million).
* **Community Contribution:** Transfer learning thrives on the generous sharing of pre-trained models (e.g., vision networks pre-trained on ImageNet, or language models such as BERT and GPT-3), allowing the entire machine learning community to build better systems faster.

## Full cycle of a machine learning project

### 1. The Iterative Development Stages
Developing a valuable ML system involves a recurring cycle of stages:
1.  **Project Scoping:** Defining the project and its goals (e.g., building speech recognition for mobile voice search).
2.  **Data Collection:** Gathering the necessary input data (X) and corresponding labels (Y) for the task.
3.  **Model Training & Iteration:** Training the model, performing **Error Analysis** and **Bias-Variance Diagnostics**, and iteratively improving performance.
    * This stage often loops back to the Data Collection step (e.g., using data augmentation to get more **car-noise speech data** after error analysis reveals poor performance in that domain).
4.  **Deployment (Production):** Making the model available for users to utilize.
5.  **Monitoring & Maintenance:** Continuously tracking the system's performance and updating the model as needed.

---

![Full Cycle of ML Project](images/full_cycle_ml_project.png)

### 2. Deployment in Production
Deployment typically involves setting up an **Inference Server** to host the trained model and serve predictions:
* **API Calls:** A user-facing application (e.g., a mobile app) sends the input ($\mathbf{X}$, like an audio clip) via an API call to the server.
* **Prediction:** The server runs the machine learning model and returns the prediction ($\hat{Y}$, like the text transcript).
* **Software Engineering:** Deployment requires significant software engineering effort to ensure the system is:
    * **Reliable and Efficient:** Making fast, accurate predictions.
    * **Scalable:** Managing large numbers of users.
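
The request/response flow above can be sketched as a single handler function. Everything here is an assumption made for illustration: the JSON payload shape (`{"x": [...]}`), the function names, and `toy_model`, which is only a stand-in for a trained model. A real inference server would wrap a handler like this in a web framework and add authentication, batching, and timeouts.

```python
import json


def toy_model(features):
    """Hypothetical stand-in for a trained model: returns a score in [0, 1]."""
    return min(1.0, max(0.0, sum(features) / len(features)))


def handle_predict_request(body: str) -> str:
    """Decode the API call, run the model, and encode the prediction as JSON."""
    try:
        payload = json.loads(body)
        features = [float(v) for v in payload["x"]]
        if not features:
            raise ValueError("empty input")
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed requests get an error response instead of crashing the server.
        return json.dumps({"error": f"bad request: {exc}"})
    return json.dumps({"y_hat": toy_model(features)})


response = handle_predict_request('{"x": [0.2, 0.4, 0.9]}')
bad_response = handle_predict_request("not json")
print(response)
print(bad_response)
```

Validating the input before calling the model is one small part of the reliability requirement: a single bad request should not take the service down.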

---

![ML Deployment](images/ml_deployment.png)

### 3. Monitoring and Maintenance
Maintaining a deployed ML system is critical because the real-world data distribution can change:
* **Logging:** Logging input data ($\mathbf{X}$) and predictions ($\hat{Y}$) is essential (while respecting privacy and consent).
* **System Monitoring:** Logs are used to monitor the system's performance. For example, monitoring can reveal that the system is performing poorly on new search terms (e.g., names of new celebrities or politicians) that weren't in the original training data.
* **Model Updates:** If monitoring detects a performance drop (for example, because the input data distribution has shifted), retrain the model and deploy the updated version in place of the older one.
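
One simple way to turn logs into a retraining signal is a sliding-window accuracy check. The sketch below assumes that true labels eventually become available for at least some logged predictions; the class name, window size, and accuracy threshold are all hypothetical choices, not values from the course.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks accuracy over the most recent `window` labeled predictions."""

    def __init__(self, window=100, min_accuracy=0.9):
        self.outcomes = deque(maxlen=window)  # True where prediction == label
        self.min_accuracy = min_accuracy

    def log(self, y_hat, y_true):
        self.outcomes.append(y_hat == y_true)

    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def needs_retraining(self):
        acc = self.accuracy()
        # Only raise the flag once the window is full, to avoid noisy alarms.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return acc is not None and window_full and acc < self.min_accuracy


monitor = AccuracyMonitor(window=5, min_accuracy=0.8)
for y_hat, y_true in [(1, 1), (0, 0), (1, 1), (0, 0), (1, 1)]:
    monitor.log(y_hat, y_true)
healthy = monitor.needs_retraining()             # every recent prediction was right

for y_hat, y_true in [(1, 0), (0, 1), (1, 0)]:   # simulated shift: errors appear
    monitor.log(y_hat, y_true)
degraded = monitor.needs_retraining()
print(healthy, degraded, monitor.accuracy())
```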

### 4. MLOps (Machine Learning Operations)
* **Definition:** MLOps is a growing field that focuses on the systematic practices for **building, deploying, and maintaining** machine learning systems in a production environment.
* **Focus:** Ensuring reliability, scalability, cost-efficiency, logging, and monitoring for ML models serving a large user base.

## Fairness, bias, and ethics

This section emphasizes the critical importance of **ethics, fairness, and bias** in machine learning development, providing practical steps and general guidance for mitigating potential negative impacts.

### 1. The Problem of Bias in ML Systems
The history of machine learning includes documented cases of unacceptable bias leading to harm:
* **Discrimination:** Systems have been shown to **discriminate against subgroups** (e.g., a hiring tool biased against women, bank loan approval systems biased against certain groups).
* **Inaccurate Recognition:** Face recognition systems have demonstrated **higher error rates for dark-skinned individuals**, leading to misidentification.
* **Reinforcing Stereotypes:** Algorithms can unintentionally **reinforce negative societal stereotypes** (e.g., search results for certain professions failing to show diversity).

### 2. Negative and Adverse Use Cases
* **Deepfakes:** Technology can be used to generate **fake videos** or content without consent and proper disclosure.
* **Harmful Content:** Social media algorithms, by optimizing solely for engagement, have contributed to the **spreading of toxic or incendiary speech**.
* **Fraud and Misuse:** Machine learning is used by malicious actors to create **fake content**, commit fraud, or build harmful products. Developers are urged to **walk away** from projects deemed unethical.

### 3. General Guidance for Ethical ML Development
While a simple ethical checklist does not exist, developers should incorporate these practices:
* **Assemble a Diverse Team:** Prior to deployment, assemble a team diverse in gender, ethnicity, culture, and other dimensions to **brainstorm possible harms**, particularly to vulnerable groups. Diverse teams are better at spotting potential problems.
* **Literature Search:** Conduct research on **industry standards or emerging guidelines** relevant to the application (e.g., standards for fairness in financial loan approval systems).
* **Audit Against Harm:** Before deployment, **audit the system** to measure performance across identified dimensions of possible bias (e.g., measure accuracy for different genders or ethnicities) to ensure problems are identified and fixed.
* **Develop a Mitigation Plan:** Create a plan for rapid response if harm occurs after deployment (e.g., a plan to roll back to a known fair system).
* **Monitor Post-Deployment:** **Continuously monitor** the system for unexpected harm or ethical failures in production to allow for quick execution of the mitigation plan.
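
The audit step can be made concrete by slicing an evaluation set by subgroup and comparing metrics. The sketch below is a minimal illustration: the group labels `"A"` and `"B"`, the records, and the 5-point gap threshold are made up, and accuracy is only one of several metrics an audit might compare.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group, y_true, y_pred). Returns {group: accuracy}."""
    correct, total = defaultdict(int), defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {g: correct[g] / total[g] for g in total}


def flag_gaps(per_group, max_gap=0.05):
    """Return groups whose accuracy trails the best group by more than max_gap."""
    best = max(per_group.values())
    return sorted(g for g, acc in per_group.items() if best - acc > max_gap)


# Made-up evaluation records: (group, true label, predicted label).
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 0),
    ("B", 1, 0), ("B", 0, 0), ("B", 1, 1), ("B", 0, 1),
]
per_group = accuracy_by_group(records)
flagged = flag_gaps(per_group)
print(per_group, flagged)
```

A flagged group is a prompt for investigation (more data, error analysis on that slice), not a verdict by itself.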

## Error metrics for skewed datasets

When dealing with **skewed datasets** (where one class is much rarer than the other), standard **accuracy** is an inadequate performance metric because a simple, non-learning model that always predicts the majority class can achieve artificially high accuracy.

### 1. The Problem with Accuracy
* **Example:** If a disease is present in only 0.5% of the population, a "dummy" algorithm that **always predicts "no disease"** ($y=0$) achieves 99.5% accuracy (0.5% error).
* **Diagnosis Issue:** This makes it impossible to tell if a learning algorithm that achieves, say, 99% accuracy is any better than the useless dummy algorithm.

### 2. The Solution: Precision and Recall
To properly evaluate performance on skewed datasets, we use metrics derived from the **Confusion Matrix**, which categorize a classifier's predictions:

| Actual Class | Predicted Class 1 (Positive) | Predicted Class 0 (Negative) |
| :--- | :--- | :--- |
| **Actual Class 1** | **True Positive (TP)** | **False Negative (FN)** |
| **Actual Class 0** | **False Positive (FP)** | **True Negative (TN)** |

### 3. Defining Precision and Recall

These two metrics, which must both be high for a useful model, are defined as follows:

| Metric | Definition | Interpretation |
| :--- | :--- | :--- |
| **Precision** | $\text{TP} / (\text{TP} + \text{FP})$ | **Of all the times the model predicted positive**, what fraction was correct? (Measures how reliable positive predictions are; not the same as overall accuracy.) |
| **Recall** | $\text{TP} / (\text{TP} + \text{FN})$ | **Of all the actual positive cases**, what fraction did the model correctly detect? (Measures completeness/sensitivity.) |
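
The two definitions translate directly into code. This is a small sketch on made-up labels; returning `0.0` when a denominator is zero is a convention chosen here for convenience, since the ratio is strictly undefined in that case.

```python
def confusion_counts(y_true, y_pred):
    """Count (TP, FP, FN, TN) for binary labels in {0, 1}."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision(tp, fp):
    """TP / (TP + FP); 0.0 by convention when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); 0.0 by convention when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up example: 3 actual positives among 10 cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision(tp, fp):.3f} recall={recall(tp, fn):.3f}")
```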

![Confusion Matrix](images/confusion_matrix.png)

### 4. Detecting Useless Algorithms
* **Zero Prediction:** An algorithm that always predicts the negative class ($y=0$) will have **zero True Positives (TP)**.
* **Recall Check:** Since $\text{Recall} = 0 / (0 + \text{FN})$, the recall will be **zero**, immediately signaling that the model is useless (i.e., it failed to detect any actual positive cases). This avoids the deception of high accuracy.
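
A quick numeric check of this argument, using a made-up population of 1,000 patients in which 0.5% have the disease:

```python
# 1,000 hypothetical patients, 5 of whom (0.5%) have the disease.
y_true = [1] * 5 + [0] * 995
y_pred = [0] * len(y_true)  # "dummy" model: always predicts no disease

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.3f}, TP={tp}, FN={fn}, recall={recall:.1f}")
```

Precision for this model is $0 / (0 + 0)$, which is undefined because it never predicts the positive class. That is another sign that the classifier is not doing anything useful.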

### 5. Goal for Rare Classes
When working with a rare class, the goal is to build a classifier that achieves **high values for both precision and recall**.

## Trading off precision and recall

Here we discuss the **trade-off between Precision and Recall** that arises in classification problems, particularly those with skewed datasets. We show how the choice of output threshold affects this balance and introduce the **F1 score** as a single metric for choosing among trade-offs.

### 1. The Precision-Recall Trade-Off

* **Ideal Goal:** High **Precision** (when the model predicts positive, it's usually correct) and high **Recall** (the model successfully finds most actual positive cases).
* **The Dilemma:** In practice, improving one often comes at the expense of the other.

#### Adjusting the Threshold (Logistic Regression)
By default, a logistic regression model predicts $y=1$ when $f(\mathbf{x}) \ge 0.5$. Raising or lowering this threshold changes the trade-off:

| Threshold Adjustment | Impact on Prediction | Effect on P & R | Scenario Example |
| :--- | :--- | :--- | :--- |
| **Raise Threshold** (e.g., to $\ge 0.7$ or $0.9$) | Predicts $y=1$ only when **very confident**. | $\uparrow$ **Precision** (fewer false positives, predictions are more accurate) / $\downarrow$ **Recall** (misses more true positives) | When treatment is invasive/expensive, prioritizing diagnosis accuracy. |
| **Lower Threshold** (e.g., to $\ge 0.3$) | Predicts $y=1$ even when **less confident**. | $\downarrow$ **Precision** (more false positives) / $\uparrow$ **Recall** (finds more true positives) | When leaving the disease untreated has severe consequences, prioritizing detection completeness. |
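
The effect of moving the threshold can be seen on a handful of scores. The values below are made up so that the trend is easy to read; on real data, recall can only fall or stay flat as the threshold rises, while precision usually rises but is not guaranteed to do so at every step.

```python
def precision_recall_at(threshold, scores, y_true):
    """Predict y=1 when score >= threshold, then return (precision, recall)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # convention for 0/0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up model outputs f(x) and true labels for eight examples.
scores = [0.95, 0.85, 0.75, 0.60, 0.45, 0.35, 0.20, 0.10]
y_true = [1, 1, 1, 0, 1, 0, 0, 0]

results = {t: precision_recall_at(t, scores, y_true) for t in (0.3, 0.5, 0.9)}
for t, (p, r) in results.items():
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

Sweeping the threshold over many values and plotting the resulting pairs gives the precision-recall curve.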

![Precision Recall Curve](images/precision_recall_curve.png)

### 2. Choosing the Best Model

When comparing multiple models, having two metrics (Precision and Recall) makes selection difficult.

#### The F1 Score
To select the best model, it's helpful to combine Precision ($P$) and Recall ($R$) into a single score. The simple average is inadequate because it can favor a model with very low P or very low R (e.g., a model that always predicts $y=1$).

* **F1 Score Definition:** The **harmonic mean** of Precision and Recall. It gives greater emphasis to whichever of $P$ or $R$ is the lower value, thereby penalizing models with a severe imbalance.
* **Formula:**
    $$\text{F1 Score} = \frac{1}{\frac{1}{2}\left(\frac{1}{P} + \frac{1}{R}\right)} = 2 \cdot \frac{P \cdot R}{P + R}$$
* **Usage:** By computing the F1 Score for different algorithms or for different threshold choices, the developer can pick the one with the **highest F1 Score**, which represents the best balanced trade-off between Precision and Recall.
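
To see why the harmonic mean is preferred over the simple average, compare the two on three illustrative models. The precision and recall values below are made up for the comparison; model `"C"` mimics a classifier that predicts $y=1$ almost all the time.

```python
def f1_score(p, r):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Illustrative (precision, recall) pairs for three hypothetical models.
models = {"A": (0.5, 0.4), "B": (0.7, 0.1), "C": (0.02, 1.0)}

for name, (p, r) in models.items():
    print(f"{name}: average={(p + r) / 2:.3f}, F1={f1_score(p, r):.3f}")

best_by_average = max(models, key=lambda m: sum(models[m]) / 2)
best_by_f1 = max(models, key=lambda m: f1_score(*models[m]))
print(f"best by average: {best_by_average}, best by F1: {best_by_f1}")
```

The simple average ranks `"C"` highest even though its precision is only 2%, while the F1 score picks `"A"`, the only model with reasonable values for both metrics.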