# Week 3: Classification

### Logistic Regression

![logistic regression definition](../images/lr1.png)

![logistic regression interpretation](../images/lr2.png)

### Decision Boundary

![logistic regression decision boundary](../images/lreg_db.png)

![logistic regression decision boundary-linear](../images/lreg_db1.png)

![logistic regression decision boundary-nonlinear](../images/lreg_db2.png)

### Cost function for logistic regression

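The squared-error cost used for linear regression is a poor fit for logistic regression. Combined with the sigmoid, it produces a non-convex cost surface with many local minima where gradient descent can get stuck: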

![squared cost function](../images/squared_cost_function.png)

Logistic regression therefore uses a different loss function, the log loss (also called binary cross-entropy), which yields a convex cost:

![log-loss loss function1](../images/lreg_loss_function1.png)

![log-loss loss function2](../images/lreg_loss_function2.png)

![log-loss cost function](../images/lreg_cost_function.png)
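
The log-loss cost pictured above can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's reference code: the names `sigmoid`, `logistic_cost`, and the toy data are assumptions, and the `eps` clipping is an added safeguard so that `log()` never receives exactly 0.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(X, y, w, b, eps=1e-15):
    """Average log loss over m examples.

    X: (m, n) features, y: (m,) labels in {0, 1},
    w: (n,) weights, b: scalar bias.
    """
    f = sigmoid(X @ w + b)
    f = np.clip(f, eps, 1.0 - eps)  # keep log() finite (added safeguard)
    return float(np.mean(-y * np.log(f) - (1.0 - y) * np.log(1.0 - f)))

# Toy data, made up for illustration
X_demo = np.array([[0.5], [1.5], [2.5], [3.5]])
y_demo = np.array([0.0, 0.0, 1.0, 1.0])

# With all-zero parameters every prediction is 0.5, so the cost is log(2)
cost_at_zero = logistic_cost(X_demo, y_demo, np.zeros(1), 0.0)

# Parameters that place the boundary at x = 2 give a lower cost
cost_better = logistic_cost(X_demo, y_demo, np.array([2.0]), -4.0)
```

Confident, correct predictions contribute a loss near 0, while confident, wrong predictions are penalized heavily. That is the behavior the two loss curves above describe.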

### Gradient Descent Implementation

* **Objective**: To fit a **logistic regression** model, the goal is to find the optimal values for parameters **$w$** and **$b$** that **minimize the cost function**, **$J(w, b)$**, using an optimization algorithm called **gradient descent**.
---
* **Gradient Descent Algorithm**: The process involves repeatedly updating each parameter (e.g., **$w_j$** and **$b$**) by subtracting the **learning rate ($α$)** multiplied by the **derivative of the cost function** with respect to that parameter.
    * **Derivative with respect to $w_j$**: $\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x_{j}^{(i)}$, where $f_{w,b}(x^{(i)})$ is the model's prediction for example $i$.
    * **Derivative with respect to $b$**: $\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$.
    * **Simultaneous Updates**: All parameter updates must be calculated first and then applied simultaneously to ensure the algorithm works correctly.
---
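
The two derivatives and the simultaneous update can be sketched as follows. The function names and the toy data are hypothetical, chosen only for illustration. Both gradients are computed from the old `(w, b)` before either parameter changes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradients(X, y, w, b):
    """Return (dJ/dw, dJ/db) for the unregularized logistic cost."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y   # f(x^(i)) - y^(i), shape (m,)
    dj_dw = X.T @ err / m          # one entry per feature j
    dj_db = float(np.sum(err) / m)
    return dj_dw, dj_db

def gradient_step(X, y, w, b, alpha):
    """One simultaneous update of w and b."""
    dj_dw, dj_db = logistic_gradients(X, y, w, b)  # both use the old (w, b)
    return w - alpha * dj_dw, b - alpha * dj_db

# Toy data, made up for illustration
X_demo = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y_demo = np.array([0.0, 0.0, 1.0, 1.0])

w0, b0 = np.zeros(2), 0.0
dj_dw, dj_db = logistic_gradients(X_demo, y_demo, w0, b0)  # [-0.5, -0.5], 0.0
w1, b1 = gradient_step(X_demo, y_demo, w0, b0, alpha=0.1)  # [0.05, 0.05], 0.0
```
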
* **Key Distinction from Linear Regression**: Although the gradient descent update equations for logistic regression appear identical to those for linear regression, the algorithms are fundamentally different. This is because the definition of the prediction function, **$f(x)$**, has changed:
    * **Linear Regression**: $f(x) = w \cdot x + b$
    * **Logistic Regression**: $f(x) = \text{sigmoid}(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}}$. The sigmoid function maps the linear output to a value between 0 and 1, which is interpreted as the probability that $y = 1$.
---
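
To make the distinction concrete, here is a small sketch with the same parameters fed through both prediction functions. The helper names and input values are assumptions for illustration.

```python
import numpy as np

def linear_predict(X, w, b):
    """Linear regression: unbounded real-valued output."""
    return X @ w + b

def logistic_predict(X, w, b):
    """Logistic regression: sigmoid of the same linear output."""
    z = X @ w + b
    return 1.0 / (1.0 + np.exp(-z))

# Made-up inputs and parameters
X_demo = np.array([[-10.0], [0.0], [10.0]])
w_demo, b_demo = np.array([1.0]), 0.0

lin = linear_predict(X_demo, w_demo, b_demo)     # [-10, 0, 10]
prob = logistic_predict(X_demo, w_demo, b_demo)  # all strictly between 0 and 1
```

Because `prob` replaces `lin` inside the error term $f(x^{(i)}) - y^{(i)}$, the same-looking update formula produces different gradients.
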
* **Best Practices and Convergence**:
    * You can monitor gradient descent to ensure it converges, just as in linear regression.
    * **Feature scaling**—adjusting features to similar ranges (e.g., -1 to +1)—is highly recommended as it can significantly speed up the convergence of the gradient descent algorithm.
    * **Vectorization** can be used to implement gradient descent more efficiently and make it run faster.
---

![Logistic regression gradient descent](../images/lreg_gradient_descent.png)
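
Putting the pieces together, the sketch below runs vectorized gradient descent on z-score-scaled features and records the cost at every iteration so convergence can be checked. Everything here (function names, the toy data, `alpha=0.5`, 500 iterations) is an assumption for illustration, not a recommended setting.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, w, b):
    f = np.clip(sigmoid(X @ w + b), 1e-15, 1.0 - 1e-15)
    return float(np.mean(-y * np.log(f) - (1.0 - y) * np.log(1.0 - f)))

def zscore(X):
    """Scale each feature to zero mean and unit standard deviation."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

def fit_logistic(X, y, alpha=0.1, num_iters=1000):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    history = []
    for _ in range(num_iters):
        err = sigmoid(X @ w + b) - y   # vectorized over all m examples
        dj_dw = X.T @ err / m
        dj_db = err.mean()
        w, b = w - alpha * dj_dw, b - alpha * dj_db  # simultaneous update
        history.append(cost(X, y, w, b))
    return w, b, history

# Toy data with features on very different scales (made up)
X_raw = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0],
                  [4.0, 5000.0], [5.0, 4000.0], [6.0, 6000.0]])
y_demo = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

X_scaled, mu, sigma = zscore(X_raw)
w_fit, b_fit, history = fit_logistic(X_scaled, y_demo, alpha=0.5, num_iters=500)
predictions = (sigmoid(X_scaled @ w_fit + b_fit) >= 0.5).astype(float)
```

Plotting `history` against the iteration number gives the learning curve. A curve that keeps decreasing and flattens out suggests convergence, while one that rises suggests the learning rate is too large. New inputs must be scaled with the stored `mu` and `sigma` before prediction.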

### The Problem of Overfitting

#### Underfitting (High Bias)

Underfitting occurs when a model is too simple to capture the underlying patterns in the training data. This leads to poor performance on both the training data and new, unseen data. It often results from having too few features or a model that isn't complex enough. The model has a strong, incorrect assumption or "bias" about the data's nature. 

#### Overfitting (High Variance)

Overfitting happens when a model is too complex and learns the noise and random fluctuations of the training data instead of just the core patterns. It performs extremely well on the training data but fails to **generalize** to new examples. This is also called "high variance" because the fitted model changes substantially when the training data changes even slightly, so its predictions are highly **variable**.

#### The "Just Right" Model

The goal of machine learning is to find a "just right" model—one that avoids both underfitting and overfitting. This model is complex enough to capture the data's main trends but simple enough to ignore the noise, ensuring it can make accurate predictions on new data.

---
![overfitting](../images/overfitting.png)
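
One way to see the three regimes is to fit polynomials of increasing degree to the same noisy data and compare training and test error. The sketch below assumes synthetic data from a made-up quadratic and uses `numpy.polynomial.Polynomial.fit` for least-squares fitting. The degrees 1, 2, and 9 are arbitrary choices for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(num_points):
    """Noisy samples from a made-up quadratic."""
    x = rng.uniform(-1.0, 1.0, num_points)
    y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.3, num_points)
    return x, y

x_train, y_train = make_data(12)
x_test, y_test = make_data(200)

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 2, 9):
    model = Polynomial.fit(x_train, y_train, degree)  # least-squares fit
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))

for degree, (train_err, test_err) in results.items():
    print(f"degree {degree}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")
```

Training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. Test error typically does not follow: the degree-1 fit tends to be poor on both sets (underfitting), while the degree-9 fit tends to show a large gap between training and test error (overfitting).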

### Addressing Overfitting

To fix an overfitted model (high variance), you can take one of three main actions:

#### 1. Get More Data
Collecting more training data, when it is available, is often the most effective way to combat overfitting. More data points provide a clearer picture of the underlying patterns, making it harder for the model to just memorize noise and random fluctuations from a small sample. This helps the model generalize better.

#### 2. Use Fewer Features
If you have many features and a limited amount of data, overfitting is likely. You can manually select a smaller, more relevant subset of features, a process known as **feature selection**. While this can reduce overfitting, it may cause you to lose potentially useful information.

#### 3. Regularization
Regularization is a powerful and widely used technique that allows you to keep all your features while reducing their impact. It works by adding a penalty to the model's cost function to discourage overly large parameter values (**$w$**). This encourages a simpler, smoother model that is less likely to overfit and performs better on new data.

---

![Addressing Overfitting](../images/addressing_overfitting.png)

### Cost function with regularization

Regularization is a technique used to **reduce overfitting** in machine learning models by adding a penalty term to the cost function. This encourages the model's parameters ($W$) to be small.

#### The Regularized Cost Function (L2 Regularization)

The standard cost function $J$ is modified by adding a **regularization term** that penalizes large parameter values. This specific form is known as **L2 regularization** (or Ridge Regression in the context of linear models).

* **Original Goal:** Minimize the error between predictions and actual values (e.g., Mean Squared Error).
* **New Goal:** Minimize the error **AND** keep the $W$ parameters small.

The modified cost function, written here with the squared-error loss of linear regression, is:
$$J_{regularized}(\mathbf{W}, b) = \underbrace{\frac{1}{2m} \sum_{i=1}^{m} \left(f_{\mathbf{W}, b}(\mathbf{x}^{(i)}) - y^{(i)}\right)^2}_{\text{Fit the data well}} + \underbrace{\frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2}_{\text{Regularization term (Keep } W \text{ small)}}$$

* **$\lambda$ (Lambda):** This is the **regularization parameter**, a hyperparameter that controls the trade-off between fitting the training data well (minimizing the first term) and keeping the weights small (minimizing the second term).
* **Convention:** By convention, the bias term ($b$) is usually **not** penalized, as it makes very little difference in practice.
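
As a rough sketch, the regularized squared-error cost can be computed as below. The function name, the toy data, and the choice `lambda_ = 4.0` are assumptions for illustration. Note that `b` does not appear in the penalty.

```python
import numpy as np

def regularized_cost(X, y, w, b, lambda_):
    """Squared-error cost plus an L2 penalty on w (b is not penalized)."""
    m = X.shape[0]
    err = X @ w + b - y
    fit_term = np.sum(err ** 2) / (2 * m)
    reg_term = lambda_ / (2 * m) * np.sum(w ** 2)
    return float(fit_term + reg_term)

# Toy data that w = [1, 2], b = 0 fits exactly (made up)
X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
y_demo = np.array([5.0, 11.0])
w_demo, b_demo = np.array([1.0, 2.0]), 0.0

cost_no_reg = regularized_cost(X_demo, y_demo, w_demo, b_demo, lambda_=0.0)  # 0.0
cost_reg = regularized_cost(X_demo, y_demo, w_demo, b_demo, lambda_=4.0)     # 5.0
```

Even a perfect fit pays a price when $\lambda > 0$: here the penalty is $\frac{4}{2 \cdot 2}(1^2 + 2^2) = 5$.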

#### How Regularization Works

* **Simpler Model:** Penalizing large $W$ values effectively simplifies the model. By forcing $W$ values close to zero (without eliminating them entirely), the model reduces the influence of those features, resulting in a smoother, less "wiggly" function that is less prone to overfitting.
* **Trade-off:** The model must balance two conflicting goals:
    * **Low $\lambda$ (e.g., $\lambda=0$):** No regularization. The model is free to choose large weights, resulting in a complex, wiggly curve that **overfits** the data.
    * **High $\lambda$ (e.g., $\lambda=10^{10}$):** Heavy penalty. The only way to minimize the cost is to drive all $W$ values close to zero. This simplifies the model excessively, making it too rigid (like a horizontal line), which causes the model to **underfit** the data.
    * **Just Right $\lambda$:** An intermediate value of $\lambda$ balances the two terms, leading to a model that generalizes well by fitting the data's pattern without overfitting to the noise.
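
The effect of $\lambda$ on the weights can be checked numerically. The sketch below swaps gradient descent for the closed-form minimizer of the regularized squared-error cost (ridge regression), purely to keep the example short. The function name, polynomial features, and data values are made up. Centering the data is how the unpenalized bias is handled in this closed form.

```python
import numpy as np

def ridge_fit(X, y, lambda_):
    """Closed-form minimizer of the L2-regularized squared-error cost.

    Centering X and y lets b stay unpenalized: b = mean(y) - mean(X) . w
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    n = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lambda_ * np.eye(n), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

# Degree-5 polynomial features of a single input (made-up data)
x = np.linspace(-1.0, 1.0, 8)
X_poly = np.column_stack([x**k for k in range(1, 6)])
y_demo = np.array([1.1, 0.4, 0.3, -0.1, 0.1, 0.2, 0.6, 0.9])

norms = []
for lambda_ in (0.0, 1.0, 1e10):
    w, b = ridge_fit(X_poly, y_demo, lambda_)
    norms.append(float(np.linalg.norm(w)))

w_big, b_big = ridge_fit(X_poly, y_demo, 1e10)
```

The weight norm shrinks as $\lambda$ grows. At $\lambda = 10^{10}$ the weights are essentially zero and the prediction collapses to the constant `b_big`, which equals the mean of `y_demo`: the horizontal line described above.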

### Regularized linear regression

Gradient descent for regularized linear regression is an extension of the standard algorithm, incorporating the penalty term for the weights ($W$) to mitigate overfitting.

#### Regularized Cost Function
The goal is to minimize the regularized cost function, $J(\mathbf{W}, b)$:
$$J(\mathbf{W}, b) = \frac{1}{2m} \sum_{i=1}^{m} (f_{\mathbf{W}, b}(\mathbf{x}^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$

#### Gradient Descent Updates
The core gradient descent update rules are modified to include the derivative of the regularization term. Remember that the bias parameter ($b$) is **not** regularized.

| Parameter | Update Rule | Notes |
| :--- | :--- | :--- |
| **$w_j$** | $w_j := w_j - \alpha \left[ \left( \frac{1}{m} \sum_{i=1}^{m} (f_{\mathbf{W}, b}(\mathbf{x}^{(i)}) - y^{(i)})x_j^{(i)} \right) + \frac{\lambda}{m} w_j \right]$ | Includes the new $\frac{\lambda}{m} w_j$ term to shrink $w_j$. |
| **$b$** | $b := b - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} (f_{\mathbf{W}, b}(\mathbf{x}^{(i)}) - y^{(i)}) \right]$ | Remains unchanged from unregularized linear regression. |

* **Simultaneous Updates:** All $w_j$ and $b$ must be updated simultaneously.
* **$f_{\mathbf{W}, b}(\mathbf{x}^{(i)})$** is the linear regression prediction: $\mathbf{w} \cdot \mathbf{x}^{(i)} + b$.
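
A single update following the rules in the table can be sketched as below. The function name, toy data, and hyperparameter values are assumptions for illustration.

```python
import numpy as np

def regularized_linear_step(X, y, w, b, alpha, lambda_):
    """One simultaneous gradient descent update for regularized linear regression."""
    m = X.shape[0]
    err = X @ w + b - y                        # f(x^(i)) - y^(i)
    dj_dw = X.T @ err / m + (lambda_ / m) * w  # data gradient plus penalty gradient
    dj_db = err.mean()                         # b is not regularized
    return w - alpha * dj_dw, b - alpha * dj_db

# Toy data, made up for illustration
X_demo = np.array([[1.0], [2.0], [3.0]])
y_demo = np.array([2.0, 4.0, 6.0])

w_new, b_new = regularized_linear_step(
    X_demo, y_demo, w=np.array([1.0]), b=0.0, alpha=0.1, lambda_=3.0
)
```

Working it by hand: the errors are $[-1, -2, -3]$, so the data gradient for $w$ is $-14/3$ and the penalty gradient is $\frac{3}{3} \cdot 1 = 1$. The step gives $w = 1 - 0.1(-14/3 + 1) \approx 1.367$ and $b = 0 - 0.1(-2) = 0.2$.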

#### Intuition (Weight Decay)
The update rule for $w_j$ can be algebraically rearranged to reveal a clearer interpretation:
$$w_j := w_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \left( \frac{1}{m} \sum_{i=1}^{m} (f_{\mathbf{W}, b}(\mathbf{x}^{(i)}) - y^{(i)})x_j^{(i)} \right)$$

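* **Weight Shrinkage:** Because $\alpha$, $\lambda$, and $m$ are positive and $\alpha \frac{\lambda}{m}$ is typically very small, the factor $\left( 1 - \alpha \frac{\lambda}{m} \right)$ is slightly less than 1. For example, $\alpha = 0.01$, $\lambda = 1$, and $m = 50$ give $1 - 0.0002 = 0.9998$.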
* **Effect:** On every iteration, $w_j$ is multiplied by a number slightly less than 1 before the standard update is applied. This effectively shrinks the value of $w_j$ toward zero, giving the method its alternative name: **Weight Decay**.
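
A quick numerical check that the rearranged form is the same update. The weights, the data-gradient values, and the hyperparameters below are made up.

```python
import numpy as np

def update_standard(w, data_grad, alpha, lambda_, m):
    """w - alpha * (data gradient + (lambda / m) * w)."""
    return w - alpha * (data_grad + (lambda_ / m) * w)

def update_weight_decay(w, data_grad, alpha, lambda_, m):
    """Shrink w first, then apply the unregularized gradient step."""
    shrink = 1.0 - alpha * lambda_ / m
    return shrink * w - alpha * data_grad

w = np.array([2.0, -4.0])
data_grad = np.array([0.5, -1.0])  # stands in for (1/m) * sum(err * x_j)
alpha, lambda_, m = 0.01, 1.0, 50

shrink_factor = 1.0 - alpha * lambda_ / m  # 0.9998
w_a = update_standard(w, data_grad, alpha, lambda_, m)
w_b = update_weight_decay(w, data_grad, alpha, lambda_, m)
```

Both functions return the same vector, approximately `[1.9946, -3.9892]`. The only difference is the order in which the terms are grouped.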

### Regularized logistic regression

To address overfitting in logistic regression, a **regularization term** is added to the cost function, similar to linear regression.

#### Regularized Cost Function

The cost function for logistic regression, $J(\mathbf{W}, b)$, is modified to include the $\text{L2}$ regularization term:

$$J_{regularized}(\mathbf{W}, b) = \underbrace{\frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)}\log(f_{\mathbf{W}, b}(\mathbf{x}^{(i)})) - (1-y^{(i)})\log(1-f_{\mathbf{W}, b}(\mathbf{x}^{(i)})) \right]}_{\text{Original Logistic Regression Cost}} + \underbrace{\frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2}_{\text{Regularization Term}}$$

* This term $\frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$ penalizes large values of the weights ($\mathbf{W}$) to prevent an overly complex decision boundary and improve generalization.
* **$f_{\mathbf{W}, b}(\mathbf{x})$** is the logistic regression prediction: $g(\mathbf{w} \cdot \mathbf{x} + b)$, where $g$ is the sigmoid function.

#### Gradient Descent Updates

Gradient descent is used to minimize the new cost function. The update rules for the parameters have the same form as those for regularized linear regression. Only the definition of the prediction function ($f$) is different.

* **Update for $w_j$** (for $j=1$ to $n$):
    $$w_j := w_j - \alpha \left[ \left( \frac{1}{m} \sum_{i=1}^{m} (f_{\mathbf{W}, b}(\mathbf{x}^{(i)}) - y^{(i)})x_j^{(i)} \right) + \frac{\lambda}{m} w_j \right]$$
    * The term **$\frac{\lambda}{m} w_j$** is the derivative of the regularization term, which ensures that the weights are shrunk toward zero in every iteration.
* **Update for $b$** (Bias):
    $$b := b - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} (f_{\mathbf{W}, b}(\mathbf{x}^{(i)}) - y^{(i)}) \right]$$
    * The bias parameter **$b$ is not regularized**, so its update rule remains the same as in the unregularized version.

This method allows logistic regression to use many features (including high-order polynomial features) while still finding a simpler, more reasonable **decision boundary** that avoids overfitting. 
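
The cost and update rules above can be sketched together as follows. The function names, the toy data, and the settings (`alpha=0.5`, 1000 iterations, `lambda_=1.0`) are assumptions for illustration. The clipping inside the cost is an added numerical safeguard.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_cost(X, y, w, b, lambda_):
    """Log loss plus an L2 penalty on w (b is not penalized)."""
    m = X.shape[0]
    f = np.clip(sigmoid(X @ w + b), 1e-15, 1.0 - 1e-15)
    log_loss = np.mean(-y * np.log(f) - (1.0 - y) * np.log(1.0 - f))
    return float(log_loss + lambda_ / (2 * m) * np.sum(w ** 2))

def fit_regularized_logistic(X, y, alpha, lambda_, num_iters):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(num_iters):
        err = sigmoid(X @ w + b) - y
        dj_dw = X.T @ err / m + (lambda_ / m) * w  # penalty gradient on w only
        dj_db = err.mean()
        w, b = w - alpha * dj_dw, b - alpha * dj_db  # simultaneous update
    return w, b

# Linearly separable toy data (made up)
X_demo = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_demo = np.array([0.0, 0.0, 1.0, 1.0])

w_plain, b_plain = fit_regularized_logistic(X_demo, y_demo, 0.5, 0.0, 1000)
w_reg, b_reg = fit_regularized_logistic(X_demo, y_demo, 0.5, 1.0, 1000)
start_cost = regularized_logistic_cost(X_demo, y_demo, np.zeros(1), 0.0, 1.0)
```

On separable data like this, the unregularized weight keeps growing as training continues, while the regularized weight settles at a smaller value. Both models still place the decision boundary between the two classes.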