<h2 style="text-align: center;"><strong>Week 3: Classification</strong></h2>

- Introduction to Classification
- Logistic Regression
- Decision Boundary
- Cost Function for Logistic Regression
- Gradient Descent Implementation for Logistic Regression
- Overfitting
- Regularization in Machine Learning

---

## **Introduction to Classification**

Classification is a **supervised learning task** where the output variable $y$ can take on only **a limited set of discrete values**, rather than any value in a continuous range.  

In binary classification, $y$ takes only **two possible values**, typically $0$ or $1$.  

Examples of binary classification problems include:

- Predicting whether an email is **spam ($1$)** or **not spam ($0$)**  
- Determining if an online transaction is **fraudulent ($1$)** or **legitimate ($0$)**  
- Diagnosing a tumor as **malignant ($1$)** or **benign ($0$)**  

### **Positive and Negative Classes**

By convention:

- The class corresponding to $y = 1$ is called the **positive class**  
- The class corresponding to $y = 0$ is called the **negative class**  

For example:

| Problem                  | Positive Class ($y=1$) | Negative Class ($y=0$) |
|--------------------------|------------------------|------------------------|
| Email spam detection      | Spam                   | Not Spam               |
| Transaction fraud detection | Fraudulent           | Legitimate             |
| Tumor diagnosis          | Malignant              | Benign                 |

*Note: Positive and negative do **not imply good or bad**. They simply indicate the presence or absence of a particular property.*

### **Why Linear Regression is Not Suitable for Classification**

Linear regression predicts a **continuous output**, which can take any real value.  

For classification, this causes several issues:

- Predictions can be **less than $0$** or **greater than $1$**, neither of which is a valid probability  
- A threshold (e.g., $0.5$) can be applied to the fitted line, but the fit itself is **highly sensitive to outliers**  
- Adding a single extreme data point can **shift the decision boundary** dramatically, reducing reliability
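
The outlier effect can be illustrated with a small sketch. The tumor sizes and labels below are made-up values, and the helper names `fit_line` and `threshold_crossing` are hypothetical; the code fits an ordinary least-squares line and reports where it crosses $0.5$, before and after one extreme (but correctly labeled) example is added.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def threshold_crossing(xs, ys, threshold=0.5):
    """Return the x at which the fitted line equals the threshold."""
    slope, intercept = fit_line(xs, ys)
    return (threshold - intercept) / slope


# Hypothetical tumor sizes (arbitrary units); 0 = benign, 1 = malignant.
sizes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [0, 0, 0, 1, 1, 1]

before = threshold_crossing(sizes, labels)
# Add one very large malignant tumor: correctly labeled, just extreme.
after = threshold_crossing(sizes + [30.0], labels + [1])

print(f"crossing without outlier: {before:.2f}")
print(f"crossing with outlier:    {after:.2f}")
```

With this data the crossing point moves from 3.5 to roughly 4.5, so the malignant example of size 4 would now fall on the "benign" side of the threshold even though nothing about it changed.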

### **Need for Logistic Regression**

Logistic regression addresses these limitations by:

- Constraining outputs to the open interval **$(0, 1)$**  
- Providing a **probabilistic interpretation** of predictions  
- Producing **decision boundaries that are far less sensitive** to extreme data points that lie well on the correct side of the boundary  

This makes logistic regression a **preferred method for binary classification problems**, forming the foundation for more advanced classification techniques.

---

## **Logistic Regression**

Logistic Regression is a **classification algorithm** used for problems where the output label $y$ can take only **two values**, typically $0$ or $1$. In applications such as tumor classification, the goal is to predict whether a tumor is **benign (0)** or **malignant (1)**.

Linear regression is not suitable for this type of problem because its predictions are unbounded and cannot be interpreted as probabilities. Logistic regression addresses this limitation by fitting an **S-shaped curve** that outputs values strictly between **0 and 1**.

### **Sigmoid (Logistic) Function**

A key component of logistic regression is the **sigmoid function**, also known as the **logistic function**. This function maps any real-valued number to a value between 0 and 1.

The sigmoid function is defined as:

$$
g(z) = \frac{1}{1 + e^{-z}}
$$

Where:
- $z$ is any real number  
- $e$ is a mathematical constant approximately equal to $2.718$

### **Behavior of the Sigmoid Function**

The sigmoid function has a characteristic S-shape:

- When $z$ is a **large positive number**, $e^{-z}$ becomes very small and $g(z)$ approaches **1**
- When $z$ is a **large negative number**, $e^{-z}$ becomes very large and $g(z)$ approaches **0**
- When $z = 0$:

$$
g(0) = \frac{1}{1 + 1} = 0.5
$$

This is why the sigmoid curve crosses the vertical axis at **0.5**.
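
As a minimal sketch, the sigmoid can be written in plain Python as follows. The split into two branches is an implementation choice made here (not part of the definition) so that `math.exp` is never called with a large positive argument.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, e^(-z) can overflow, so use the equivalent form e^z / (1 + e^z).
    ez = math.exp(z)
    return ez / (1.0 + ez)


for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"g({z:+.1f}) = {sigmoid(z):.5f}")
```

The printed values approach 0 for large negative $z$, equal 0.5 at $z = 0$, and approach 1 for large positive $z$, matching the three cases listed above.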

### **Logistic Regression Model**

Logistic regression is built in two steps.

First, we compute a linear combination of the input features:

$$
z = \mathbf{w}^T \mathbf{x} + b
$$

Next, we apply the sigmoid function to this value:

$$
h_{\mathbf{w}, b}(\mathbf{x}) = g(z) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}}
$$

This equation defines the **logistic regression model**. The later sections write the same function more briefly as $f(x)$.

### **Interpretation of the Output**

The output of logistic regression is interpreted as a **probability**:

$$
h_{\mathbf{w}, b}(\mathbf{x}) = P(y = 1 \mid \mathbf{x})
$$

For example:
- If the model outputs $0.7$, it predicts a **70% probability** that $y = 1$
- Since $y$ can only be $0$ or $1$, the probability that $y = 0$ is:

$$
P(y = 0 \mid \mathbf{x}) = 1 - 0.7 = 0.3
$$

Although the model outputs values like $0.7$ or $0.3$, the true label is always **0 or 1**.
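
The two-step model and its probabilistic reading can be sketched as below. The parameter values and the input are arbitrary numbers chosen for illustration, and `predict_proba` is a hypothetical helper name.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Return g(w . x + b), the modeled probability that y = 1."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return sigmoid(z)


# Hypothetical parameters and input, chosen only for illustration.
w = [1.5, -0.5]
b = -1.0
x = [2.0, 1.0]

p_one = predict_proba(w, b, x)   # P(y = 1 | x)
p_zero = 1.0 - p_one             # P(y = 0 | x)
print(f"P(y=1|x) = {p_one:.3f}, P(y=0|x) = {p_zero:.3f}")
```

Here $z = 1.5 \cdot 2 - 0.5 \cdot 1 - 1 = 1.5$, so the model assigns a probability of about 0.82 to $y = 1$ and the remaining 0.18 to $y = 0$.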

### **Why Logistic Regression Works for Classification**

- Outputs are constrained between **0 and 1**
- Predictions have a clear probabilistic interpretation
- The sigmoid function models uncertainty near decision boundaries
- Forms the foundation for many real-world classification systems

Logistic regression is one of the most important supervised learning algorithms and serves as a stepping stone to more advanced classification methods.

---

## **Decision Boundary**

In logistic regression, the **decision boundary** explains **how the model converts a probability output into a class prediction (0 or 1)**. While logistic regression outputs values between 0 and 1, the decision boundary defines **where the prediction switches from class 0 to class 1**.

### **Recap: Logistic Regression Model**

Logistic regression computes predictions in two steps:

1. **Linear combination of features**
$$
z = w^T x + b
$$

2. **Apply the Sigmoid (logistic) function**
$$
f(x) = g(z) = \frac{1}{1 + e^{-z}}
$$

The output $f(x)$ is interpreted as:
$$
f(x) = P(y = 1 \mid x; w, b)
$$

This value represents the **probability that the label is 1 given the input features**.

### **From Probability to Prediction**

To make a **hard classification**, we introduce a threshold. The most common choice is $0.5$:

$$
\hat{y} =
\begin{cases}
1 & \text{if } f(x) \ge 0.5 \\
0 & \text{if } f(x) < 0.5
\end{cases}
$$

Because the sigmoid function satisfies:
$$
g(z) \ge 0.5 \iff z \ge 0
$$

This means:
$$
\hat{y} = 1 \quad \text{when} \quad w^T x + b \ge 0
$$
$$
\hat{y} = 0 \quad \text{when} \quad w^T x + b < 0
$$

### **Definition of the Decision Boundary**

The **decision boundary** is defined by:
$$
w^T x + b = 0
$$

This equation describes the set of points where $f(x) = 0.5$, so the model considers both classes equally likely. On one side of this boundary the prediction is 1; on the other side it is 0. With the $\ge$ convention above, points exactly on the boundary are assigned to class 1.

### **Decision Boundary with Two Features**

For two input features $x_1$ and $x_2$:
$$
z = w_1 x_1 + w_2 x_2 + b
$$

The decision boundary becomes:
$$
w_1 x_1 + w_2 x_2 + b = 0
$$

**Example:**
$$
w_1 = 1,\quad w_2 = 1,\quad b = -3
$$

Decision boundary:
$$
x_1 + x_2 = 3
$$

- Points where $x_1 + x_2 \ge 3 \implies \hat{y} = 1$  
- Points where $x_1 + x_2 < 3 \implies \hat{y} = 0$  

This is a **straight line** in the $x_1$–$x_2$ plane.
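
A short sketch of this example, using the parameters from the text ($w_1 = w_2 = 1$, $b = -3$). The test points are arbitrary, and `linear_score` and `predict` are hypothetical helper names.

```python
import math


def linear_score(w, b, x):
    """z = w . x + b"""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


def predict(w, b, x, threshold=0.5):
    """Hard prediction: 1 if g(z) >= threshold, else 0."""
    prob = 1.0 / (1.0 + math.exp(-linear_score(w, b, x)))
    return 1 if prob >= threshold else 0


w, b = [1.0, 1.0], -3.0   # the example parameters from the text

for point in ([1.0, 1.0], [2.0, 1.0], [3.0, 2.0], [0.0, 2.5]):
    z = linear_score(w, b, point)
    # With a 0.5 threshold, the prediction depends only on the sign of z.
    print(point, "z =", z, "->", predict(w, b, point))
```

The point $(2, 1)$ lies exactly on the line $x_1 + x_2 = 3$, so $z = 0$ and the prediction is 1 under the $\ge$ convention.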

### **Non-linear Decision Boundaries with Polynomial Features**

By introducing **polynomial features**, logistic regression can learn **non-linear decision boundaries**. For example:

$$
z = w_1 x_1^2 + w_2 x_2^2 + b
$$

Decision boundary:
$$
w_1 x_1^2 + w_2 x_2^2 + b = 0
$$

**Example:**
$$
w_1 = 1,\quad w_2 = 1,\quad b = -1
$$

Decision boundary:
$$
x_1^2 + x_2^2 = 1
$$

- Inside the circle ($x_1^2 + x_2^2 < 1$) $\implies \hat{y} = 0$  
- Outside the circle ($x_1^2 + x_2^2 \ge 1$) $\implies \hat{y} = 1$  
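
The circular example can be sketched the same way. The only change from the linear case is that the model sees the squared features; the function names below are hypothetical and the default parameters are the ones from the example.

```python
def polynomial_features(x1, x2):
    """Map (x1, x2) to the squared features used in the circular example."""
    return [x1 ** 2, x2 ** 2]


def predict_circle(x1, x2, w=(1.0, 1.0), b=-1.0):
    """Predict 1 when w1*x1^2 + w2*x2^2 + b >= 0, which matches f(x) >= 0.5."""
    features = polynomial_features(x1, x2)
    z = sum(w_j * f_j for w_j, f_j in zip(w, features)) + b
    return 1 if z >= 0 else 0


print(predict_circle(0.0, 0.0))   # center of the circle, inside
print(predict_circle(0.5, 0.5))   # 0.25 + 0.25 = 0.5 < 1, inside
print(predict_circle(1.0, 0.0))   # on the boundary, z = 0
print(predict_circle(1.0, 1.0))   # 1 + 1 = 2 >= 1, outside
```

The model is still linear in its parameters; the boundary is curved only because the inputs were transformed before the linear step.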

Higher-order polynomial features allow logistic regression to **learn very complex boundaries**, including ellipses or intricate shapes, giving flexibility to fit real-world data.

### **Key Points**

- Decision boundary separates predicted classes  
- Linear logistic regression $\implies$ linear boundaries  
- Polynomial features $\implies$ non-linear boundaries  
- Defined by $w^T x + b = 0$ (or polynomial equivalent)  
- Critical for interpreting logistic regression predictions

---

## **Cost Function for Logistic Regression**

In logistic regression, the **cost function** measures how well a particular set of parameters $(w, b)$ fits the training data. Choosing an appropriate cost function is crucial because it guides the optimization process to find the best parameters.

### **Why Squared Error Is Not Ideal**

For linear regression, the squared error is used:

$$
J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} (f(x^{(i)}) - y^{(i)})^2
$$

However, using this cost function for logistic regression is problematic:

- The hypothesis in logistic regression is 
$$
f(x) = g(w^T x + b) = \frac{1}{1 + e^{-z}}, \quad z = w^T x + b
$$
- Plugging this into the squared error cost generally results in a **non-convex function** of $(w, b)$  
- A non-convex cost can have **multiple local minima**, so gradient descent may settle far from the best parameters  

Hence, a **different cost function** is needed so that gradient descent can reliably reach the global minimum.

### **Loss Function for a Single Training Example**

Logistic regression uses a **log loss** (also called cross-entropy loss) for a single example:

$$
\text{Loss}(f(x), y) =
\begin{cases} 
-\log(f(x)) & \text{if } y = 1 \\[2mm]
-\log(1 - f(x)) & \text{if } y = 0
\end{cases}
$$

- If the model predicts close to the true label, the loss is small  
- If the prediction is far from the true label, the loss increases sharply  

This **penalizes confident but wrong predictions** heavily, making the algorithm learn effectively.

### **Simplified Single-Equation Loss Function**

Since $y \in \{0, 1\}$, the two cases can be combined into a single formula:

$$
\text{Loss}(f(x), y) = - \Big[ y \log(f(x)) + (1 - y) \log(1 - f(x)) \Big]
$$

- When $y = 1$: $- [1 \cdot \log(f(x)) + 0 \cdot \log(1 - f(x))] = -\log(f(x))$  
- When $y = 0$: $- [0 \cdot \log(f(x)) + 1 \cdot \log(1 - f(x))] = -\log(1 - f(x))$  

This compact form is convenient for implementation.
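
A quick check, as a sketch, that the two forms agree. The function names are hypothetical, and the predictions are assumed to lie strictly between 0 and 1 so that the logarithms are defined.

```python
import math


def loss_piecewise(f, y):
    """Two-case form of the log loss."""
    return -math.log(f) if y == 1 else -math.log(1.0 - f)


def loss_combined(f, y):
    """Single-expression form: -[y log f + (1 - y) log(1 - f)]."""
    return -(y * math.log(f) + (1 - y) * math.log(1.0 - f))


for y in (0, 1):
    for f in (0.1, 0.5, 0.9):
        assert math.isclose(loss_piecewise(f, y), loss_combined(f, y))
        print(f"y={y}, f(x)={f}: loss = {loss_combined(f, y):.4f}")
```

For $y = 1$ the loss is about 0.105 when $f(x) = 0.9$ but about 2.303 when $f(x) = 0.1$, which shows how much more heavily a confident wrong prediction is penalized.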

### **Cost Function for the Entire Training Set**

The cost over $m$ training examples is the **average loss**:

$$
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \text{Loss}(f(x^{(i)}), y^{(i)})
$$

Substituting the simplified loss function:

$$
J(w, b) = - \frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log(f(x^{(i)})) + (1 - y^{(i)}) \log(1 - f(x^{(i)})) \Big]
$$

- This is the standard **logistic regression cost function** used in practice  
- It is **convex** in $(w, b)$, so any local minimum is also a global minimum and gradient descent with a suitable learning rate cannot get trapped in a poor local minimum  
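
The cost over a training set can be sketched as below. The data are made-up values, and the clipping constant `eps` is a numerical safeguard added here, not part of the formula.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(X, y, w, b, eps=1e-12):
    """Average log loss J(w, b) over the rows of X."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        f_i = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
        # Clip so that log never receives exactly 0 (a numerical safeguard).
        f_i = min(max(f_i, eps), 1.0 - eps)
        total += -(y_i * math.log(f_i) + (1 - y_i) * math.log(1.0 - f_i))
    return total / len(X)


# Hypothetical data: two features per example.
X = [[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]]
y = [0, 0, 0, 1, 1, 1]

print(f"J at w=(0,0), b=0:  {cost(X, y, [0.0, 0.0], 0.0):.4f}")
print(f"J at w=(1,1), b=-3: {cost(X, y, [1.0, 1.0], -3.0):.4f}")
```

With all parameters at zero every prediction is 0.5, so the cost equals $\log 2 \approx 0.693$; the second parameter setting separates these particular points better and gives a lower cost.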

### **Intuition Behind the Cost Function**

- **y = 1**: if $f(x)$ is close to 1, loss is near 0; if $f(x)$ is close to 0, loss is very high  
- **y = 0**: if $f(x)$ is close to 0, loss is near 0; if $f(x)$ is close to 1, loss is very high  

Graphically, this creates a **smooth, convex surface** for $J(w, b)$, unlike the wiggly, non-convex surface of squared error for logistic regression.

### **Statistical Rationale**

This cost function can be derived from **maximum likelihood estimation (MLE)**:

- Logistic regression predicts probabilities $P(y = 1 \mid x; w, b)$  
- The likelihood of the observed data is maximized by minimizing this cost function  
- MLE justifies the choice of **logarithmic loss** from a statistical perspective  
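
The connection can be written out as a short derivation, assuming the training examples are independent. The likelihood of the observed labels is

$$
L(w, b) = \prod_{i=1}^{m} f(x^{(i)})^{\,y^{(i)}} \left(1 - f(x^{(i)})\right)^{1 - y^{(i)}}
$$

Taking the logarithm turns the product into a sum:

$$
\log L(w, b) = \sum_{i=1}^{m} \Big[ y^{(i)} \log(f(x^{(i)})) + (1 - y^{(i)}) \log(1 - f(x^{(i)})) \Big]
$$

so that

$$
J(w, b) = -\frac{1}{m} \log L(w, b)
$$

and minimizing $J(w, b)$ is the same as maximizing the likelihood.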

### **Key Points**

- Squared error is not suitable for logistic regression due to non-convexity  
- The logistic loss (cross-entropy) ensures **convexity** and reliable gradient descent  
- Single-example loss:
$$
\text{Loss}(f(x), y) = - \big[y \log(f(x)) + (1-y)\log(1-f(x)) \big]
$$
- Overall cost function:
$$
J(w, b) = - \frac{1}{m} \sum_{i=1}^{m} \big[ y^{(i)} \log(f(x^{(i)})) + (1 - y^{(i)}) \log(1 - f(x^{(i)})) \big]
$$
- Provides strong penalties for wrong predictions and small penalties for correct ones  
- Forms the foundation for gradient descent optimization in logistic regression

---

## **Gradient Descent Implementation for Logistic Regression**

To train a logistic regression model, we aim to find parameters $(w, b)$ that **minimize the cost function**:

$$
J(w, b) = - \frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log(f(x^{(i)})) + (1 - y^{(i)}) \log(1 - f(x^{(i)})) \Big]
$$

where 

$$
f(x^{(i)}) = g(z^{(i)}) = \frac{1}{1 + e^{-z^{(i)}}}, \quad z^{(i)} = w^T x^{(i)} + b
$$

is the logistic regression model, with $g$ the sigmoid function.

### **Gradient Descent Update Rules**

Gradient descent iteratively updates the parameters in the direction that **reduces the cost function**:

- For each weight $w_j$:

$$
w_j := w_j - \alpha \frac{\partial J(w, b)}{\partial w_j} 
$$

- For the bias term $b$:

$$
b := b - \alpha \frac{\partial J(w, b)}{\partial b}
$$

### **Gradients**

The partial derivatives of the cost function are:

- With respect to $w_j$:

$$
\frac{\partial J(w, b)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \Big( f(x^{(i)}) - y^{(i)} \Big) x_j^{(i)}
$$

- With respect to $b$:

$$
\frac{\partial J(w, b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \Big( f(x^{(i)}) - y^{(i)} \Big)
$$

Here $x_j^{(i)}$ is the $j$-th feature of the $i$-th training example.
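
One way to gain confidence in these formulas is a numerical gradient check. The sketch below compares the analytic gradients with central finite differences of the cost; the data, parameter values, and step size `h` are arbitrary choices for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_all(X, w, b):
    return [sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b) for x_i in X]


def cost(X, y, w, b):
    f = predict_all(X, w, b)
    return -sum(y_i * math.log(f_i) + (1 - y_i) * math.log(1 - f_i)
                for f_i, y_i in zip(f, y)) / len(X)


def gradients(X, y, w, b):
    """Analytic gradients: (1/m) sum (f - y) x_j and (1/m) sum (f - y)."""
    m = len(X)
    errors = [f_i - y_i for f_i, y_i in zip(predict_all(X, w, b), y)]
    dj_dw = [sum(e * x_i[j] for e, x_i in zip(errors, X)) / m for j in range(len(w))]
    dj_db = sum(errors) / m
    return dj_dw, dj_db


# Hypothetical data and parameters for the check.
X = [[0.5, 1.5], [1.0, 1.0], [3.0, 0.5], [2.0, 2.0]]
y = [0, 0, 1, 1]
w, b, h = [0.3, -0.2], 0.1, 1e-6

dj_dw, dj_db = gradients(X, y, w, b)
# Central finite differences approximate the same partial derivatives.
num_dw = []
for j in range(len(w)):
    w_plus = [w_k + (h if k == j else 0.0) for k, w_k in enumerate(w)]
    w_minus = [w_k - (h if k == j else 0.0) for k, w_k in enumerate(w)]
    num_dw.append((cost(X, y, w_plus, b) - cost(X, y, w_minus, b)) / (2 * h))
num_db = (cost(X, y, w, b + h) - cost(X, y, w, b - h)) / (2 * h)

print("analytic:", dj_dw, dj_db)
print("numeric: ", num_dw, num_db)
```

The two lines should agree to several decimal places; a large mismatch would usually point to a bug in either the cost or the gradient code.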

### **Simultaneous Updates**

All parameters $(w_1, w_2, ..., w_n, b)$ should be updated **simultaneously**:

1. Compute all gradients using current parameters  
2. Update all parameters using the gradients at the same time  

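This ensures that every gradient is evaluated at the same parameter values. Updating $w_1$ first and then using the new $w_1$ to compute the gradient for $w_2$ would mix old and new parameters within a single step.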

### **Vectorized Implementation**

The gradient descent updates can also be written in **vectorized form** for efficiency:

$$
\mathbf{w} := \mathbf{w} - \alpha \frac{1}{m} X^T (f(X) - \mathbf{y})
$$

$$
b := b - \alpha \frac{1}{m} \sum_{i=1}^{m} (f(x^{(i)}) - y^{(i)})
$$

Where:
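- $X$ is the $m \times n$ feature matrix  
- $\mathbf{w}$ is the $n \times 1$ weight vector  
- $\mathbf{y}$ is the $m \times 1$ vector of labels  
- $f(X) = g(X\mathbf{w} + b)$ is the $m \times 1$ vector of predictions for all training examples, with $g$ applied element-wise  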

Vectorization significantly speeds up computation compared to looping over each training example.

### **Feature Scaling**

- Feature scaling is recommended to **speed up gradient descent**  
- Scale each feature to a similar range (e.g., $[-1, 1]$)  
- Helps ensure all weights converge efficiently
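
One common way to put features on a similar scale is z-score normalization, sketched below. This is one option among several (min-max scaling is another), the data are made-up values, and `zscore_columns` is a hypothetical helper name.

```python
import statistics


def zscore_columns(X):
    """Scale each feature (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*X))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    scaled = [[(x_j - mu) / sd for x_j, mu, sd in zip(row, means, stds)] for row in X]
    return scaled, means, stds


# Hypothetical features on different scales: tumor size (cm) and patient age (years).
X = [[1.2, 35.0], [3.4, 62.0], [2.1, 48.0], [4.8, 71.0]]
X_scaled, means, stds = zscore_columns(X)
for row in X_scaled:
    print([round(v, 2) for v in row])
```

The returned `means` and `stds` would need to be reused to scale any new input before prediction. The sketch assumes no feature is constant, since a zero standard deviation would cause a division by zero.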

### **Algorithm Summary**

1. Initialize $w_j = 0$ and $b = 0$  
2. Repeat until convergence:
   - Compute predictions $f(x^{(i)})$ using the sigmoid function
   - Compute gradients $\frac{\partial J}{\partial w_j}$ and $\frac{\partial J}{\partial b}$
   - Update parameters using the learning rate $\alpha$
3. Return optimized parameters $(w, b)$
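
The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a tuned implementation: it runs a fixed number of iterations instead of testing for convergence, and the learning rate, iteration count, and toy data are assumptions made for the example.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train(X, y, alpha=0.1, iterations=1000):
    """Batch gradient descent for logistic regression, starting from zeros."""
    m, n = len(X), len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(iterations):
        # Step 1: predictions and errors with the current parameters.
        errors = [sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b) - y_i
                  for x_i, y_i in zip(X, y)]
        # Step 2: all gradients are computed before any parameter changes.
        dj_dw = [sum(e * x_i[j] for e, x_i in zip(errors, X)) / m for j in range(n)]
        dj_db = sum(errors) / m
        # Step 3: simultaneous update.
        w = [w_j - alpha * g_j for w_j, g_j in zip(w, dj_dw)]
        b = b - alpha * dj_db
    return w, b


# Hypothetical one-feature data, already centered around zero.
X = [[-2.5], [-1.5], [-0.5], [0.5], [1.5], [2.5]]
y = [0, 0, 0, 1, 1, 1]

w, b = train(X, y)
predictions = [1 if sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b >= 0 else 0
               for x_i in X]
print("w =", w, "b =", b)
print("predictions:", predictions)
```

On this toy data the learned weight is positive and the bias stays essentially at zero, so the decision boundary sits at $x = 0$ and every training example is classified correctly. Because the data are perfectly separable, the weight would keep growing with more iterations, which is one practical reason a stopping rule or regularization is normally used.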

### **Key Points**

- The gradient descent updates for logistic regression have the same form as those for linear regression  
- The difference is that $f(x)$ is now the **sigmoid** of $w^T x + b$ rather than $w^T x + b$ itself  
- Because the cost function is convex, any minimum that gradient descent reaches is a **global minimum**  
- Feature scaling improves convergence speed  
- Vectorized implementations are more efficient for large datasets  

This forms the foundation for training logistic regression models on real-world classification problems.

---

## **Overfitting**

Overfitting is a common problem in machine learning where a model performs **extremely well on training data** but **fails to generalize to new, unseen examples**. It occurs when a model learns not only the underlying patterns but also the **noise or random fluctuations** in the training data. Understanding overfitting requires contrasting it with underfitting and exploring the concepts of **bias, variance, and generalization**.

### **Underfitting vs Overfitting**

1. **Underfitting (High Bias):**  
   - Occurs when a model is too simple to capture the underlying structure of the data.  
   - Example: Predicting housing prices using a straight line when the actual relationship is nonlinear.  
   - The model performs poorly on both the training set and new data.  
   - Also called **high bias**, because the model has a strong assumption about the data (e.g., "prices are linear with size") that prevents it from fitting the data well.

2. **Overfitting (High Variance):**  
   - Occurs when a model is too complex and tries to fit every training example perfectly, including noise.  
   - Example: Fitting a fourth-order polynomial to only five house price examples. The model passes through all points but creates a wiggly curve that makes poor predictions on new houses.  
   - The model performs very well on training data but **generalizes poorly**.  
   - Also called **high variance**, because small changes in the training data can lead to large changes in the model.

3. **Just Right Model:**  
   - A model that **balances bias and variance**, fits the training data well, and generalizes to new examples.  
   - Example: A quadratic model might capture the curvature in house prices without being overly complex.

This relationship is often illustrated with the **Goldilocks principle**:  
- Too simple → underfit  
- Too complex → overfit  
- Just right → optimal balance
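
The five-house example can be made concrete with a small sketch. The sizes and prices are made-up numbers. Lagrange interpolation is used here as a convenient way to evaluate the unique fourth-order polynomial that passes through all five points, which is the same curve a perfect degree-4 fit would produce.

```python
def lagrange_predict(xs, ys, x_new):
    """Evaluate the unique degree-(n-1) polynomial through n points at x_new."""
    total = 0.0
    for i, (x_i, y_i) in enumerate(zip(xs, ys)):
        term = y_i
        for j, x_j in enumerate(xs):
            if j != i:
                term *= (x_new - x_j) / (x_i - x_j)
        total += term
    return total


def line_predict(xs, ys, x_new):
    """Least-squares straight line, for comparison."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y + slope * (x_new - mean_x)


# Hypothetical houses: size in 1000 sq ft, price in $1000s.
sizes = [1.0, 1.5, 2.0, 2.5, 3.0]
prices = [200.0, 280.0, 290.0, 360.0, 370.0]

# The fourth-order polynomial reproduces every training price exactly.
for s, p in zip(sizes, prices):
    print(s, p, round(lagrange_predict(sizes, prices, s), 6))

# Just outside the training range the two models disagree sharply.
print("quartic at 3.5:", round(lagrange_predict(sizes, prices, 3.5), 1))
print("line at 3.5:   ", round(line_predict(sizes, prices, 3.5), 1))
```

For this data the quartic has zero training error yet predicts a negative price (about -50) for a house of size 3.5, while the straight line predicts about 426. Perfect training fit and poor generalization are exactly the combination described under high variance.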

### **Overfitting in Regression and Classification**

#### **Regression Example**
- Features: House size, polynomial terms ($x$, $x^2$, $x^3$, $x^4$).  
- **High-order polynomial** → overfit → perfect training fit but unrealistic predictions.  
- **Quadratic polynomial** → just right → reasonable fit, good generalization.

#### **Classification Example**
- Features: Tumor size ($x_1$) and age ($x_2$).  
- Logistic regression:
  - Simple model: Straight decision boundary → underfit.  
  - Quadratic features: Elliptical decision boundary → just right.  
  - High-order polynomial: Complex contoured boundary → overfit, poor generalization.

### **Causes of Overfitting**
1. **Too many features** relative to the number of training examples.  
2. **Highly flexible models** (e.g., high-degree polynomials).  
3. **Noisy training data** with random fluctuations.

### **Detecting Overfitting**
- Large gap between **training accuracy** and **validation/test accuracy**.
- Visual inspection: Overly complex curves in regression, twisted decision boundaries in classification.

### **Techniques to Reduce Overfitting**

1. **Collect More Training Data**
   - Increasing the number of examples can reduce the variance of the model.  
   - More data allows high-complexity models to generalize better.  
   - Limitation: Not always feasible.

2. **Feature Selection**
   - Reduce the number of input features to **simplify the model**.  
   - Example: Select only the most relevant features for predicting house prices (size, bedrooms, age).  
   - Benefit: Reduces the chance of overfitting.  
   - Limitation: Discards some information, and choosing which features to keep takes care; the choice can be made by hand or with automated selection methods.

3. **Regularization**
   - Penalizes large parameter values to **limit model complexity**.  
   - Instead of removing features, it **shrinks their weights** by adding a penalty term to the cost function, where $\lambda \ge 0$ controls the penalty strength:  
     - $L_2$ (Ridge) regularization: $\frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$  
     - $L_1$ (Lasso) regularization: $\frac{\lambda}{m} \sum_{j=1}^{n} |w_j|$  
   - Keeps any single feature from dominating the prediction, which tends to reduce overfitting.  
   - Typically, we regularize $w_1, w_2, ..., w_n$, but not the bias term $b$.  
   - Maintains the flexibility of the model while improving generalization.

### **Summary of Overfitting Remedies**
| Method                 | How it helps                                   | Limitation                                          |
|------------------------|------------------------------------------------|-----------------------------------------------------|
| **More Training Data** | Reduces variance, smooths the model            | Not always available                                |
| **Feature Selection**  | Reduces model complexity                       | May discard useful information                      |
| **Regularization**     | Penalizes large weights, reduces overfitting   | Requires choosing the regularization parameter $\lambda$ |

### **Key Concepts Recap**
- **Bias:** Model’s inability to capture data patterns → underfitting.  
- **Variance:** Model’s sensitivity to training data → overfitting.  
- **Generalization:** Ability of the model to perform well on unseen data.  
- **Just Right Model:** Achieves balance between bias and variance.  

By carefully managing model complexity, selecting appropriate features, and using regularization, you can **prevent overfitting** and train models that generalize well to real-world data.

---

## **Regularization in Machine Learning**

Regularization is a key technique to **reduce overfitting** in machine learning models. It works by **penalizing large values of model parameters** (weights), effectively simplifying the model while retaining all features. Regularization is used in both **linear regression** and **logistic regression** and is particularly useful when the model has **many features** relative to the number of training examples.

### **Intuition Behind Regularization**

1. **Problem Scenario**:  
   - Consider fitting a high-degree polynomial (e.g., fourth-order) to a small dataset of housing prices.  
   - Without any constraints, the model might produce a **highly wiggly curve** that fits the training data perfectly but **generalizes poorly**.

2. **Idea**:  
   - If we could **force the parameters of higher-order terms** (like $w_3$ and $w_4$) to be very small, the model would behave more like a **simpler quadratic curve**.  
   - Regularization achieves this by **adding a penalty term** to the cost function that increases as the parameters grow large.

### **Modified Cost Function with Regularization**

For **linear regression**, the regularized cost function becomes:

$$
J(w, b) = \frac{1}{2m} \sum_{i=1}^m \left(f(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m} \sum_{j=1}^n w_j^2
$$

Where:

- $f(x) = w \cdot x + b$ is the predicted output.  
- $m$ = number of training examples.  
- $n$ = number of features.  
- $w_j$ = parameter for feature $j$.  
- $\lambda$ = **regularization parameter** controlling the trade-off between fitting the data and keeping parameters small.  
- The bias term $b$ is typically **not regularized**.

**Key Points:**

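- **First Term:** Measures how well the model fits the training data (mean squared error for linear regression); minimizing it alone pushes the model to fit the data as closely as possible.  
- **Second Term (Regularization Term):** Penalizes large weights, which discourages overfitting.  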
- **Choosing $\lambda$:**  
  - $\lambda = 0$ → no regularization → may overfit.  
  - Very large $\lambda$ → heavy regularization → underfitting (weights shrink toward $0$).  
  - Moderate $\lambda$ → balances fitting the data and keeping the model simple.
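
The regularized cost for linear regression can be sketched as below. The data and parameter values are made up, and `regularized_cost` is a hypothetical helper name.

```python
def regularized_cost(X, y, w, b, lam):
    """Squared-error cost plus the L2 penalty (lam / 2m) * sum(w_j^2)."""
    m = len(X)
    squared_error = sum(
        (sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b - y_i) ** 2
        for x_i, y_i in zip(X, y)
    ) / (2 * m)
    penalty = lam / (2 * m) * sum(w_j ** 2 for w_j in w)   # b is not penalized
    return squared_error + penalty


# Hypothetical data and parameters.
X = [[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]]
y = [3.0, 3.0, 4.0]
w, b = [1.0, 1.0], 0.0

for lam in (0.0, 1.0, 10.0):
    print(f"lambda = {lam:>4}: J = {regularized_cost(X, y, w, b, lam):.4f}")
```

The parameters are the same in all three calls, so the data-fit term is unchanged and the increase in $J$ comes entirely from the penalty, which grows in proportion to $\lambda$.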

### **Gradient Descent with Regularization**

To implement regularized gradient descent, the update rules change slightly:

**Linear Regression Updates:**

$$
w_j := w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^m \left(f(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m} w_j \right]
$$

$$
b := b - \alpha \frac{1}{m} \sum_{i=1}^m \left(f(x^{(i)}) - y^{(i)}\right)
$$

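- The extra term $\frac{\lambda}{m} w_j$ **shrinks $w_j$ slightly** on each iteration.  
- Intuition: the update can be rewritten as $w_j := w_j \left(1 - \alpha \frac{\lambda}{m}\right) - \alpha \frac{1}{m} \sum_{i=1}^m \left(f(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$, so each step first multiplies $w_j$ by a factor slightly below $1$ and then applies the usual unregularized update.  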
- The update for $b$ **remains unchanged**.
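
The two ways of writing the weight update can be compared numerically. In the sketch below the data, learning rate, and $\lambda$ are arbitrary values, and the function names are hypothetical; one function adds $\frac{\lambda}{m} w_j$ to the gradient, the other shrinks $w_j$ first and then takes the unregularized step.

```python
def data_gradient(w, b, X, y):
    """Unregularized gradients of the squared-error cost for linear regression."""
    m = len(X)
    errors = [sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b - y_i
              for x_i, y_i in zip(X, y)]
    dj_dw = [sum(e * x_i[j] for e, x_i in zip(errors, X)) / m for j in range(len(w))]
    return dj_dw, sum(errors) / m


def step_penalty_form(w, b, X, y, alpha, lam):
    """Update with the (lam / m) * w_j term added to the gradient."""
    m = len(X)
    dj_dw, dj_db = data_gradient(w, b, X, y)
    w_new = [w_j - alpha * (g_j + lam / m * w_j) for w_j, g_j in zip(w, dj_dw)]
    return w_new, b - alpha * dj_db          # b is not regularized


def step_shrink_form(w, b, X, y, alpha, lam):
    """Same update, written as 'shrink w_j, then take the usual step'."""
    m = len(X)
    dj_dw, dj_db = data_gradient(w, b, X, y)
    shrink = 1 - alpha * lam / m
    w_new = [shrink * w_j - alpha * g_j for w_j, g_j in zip(w, dj_dw)]
    return w_new, b - alpha * dj_db


# Hypothetical data and hyperparameters.
X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 6.5]
w_a, b_a = step_penalty_form([1.5], 0.0, X, y, alpha=0.01, lam=1.0)
w_b, b_b = step_shrink_form([1.5], 0.0, X, y, alpha=0.01, lam=1.0)
print(w_a, b_a)
print(w_b, b_b)
```

Both functions return the same parameters up to floating-point rounding, which confirms that the regularization term acts as a multiplicative shrink of about $1 - \alpha \frac{\lambda}{m}$ per step.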

**Logistic Regression Updates:**

- Regularization is applied in the **same way**; the only difference is the function $f(x)$:

$$
f(x) = g(z) = \frac{1}{1 + e^{-z}}, \quad z = w \cdot x + b
$$

- The update for $w_j$ includes the regularization term $\frac{\lambda}{m} w_j$; the update for $b$ does not, because $b$ is not regularized.
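
Written out for reference, adding the same $L_2$ penalty to the logistic regression cost gives:

$$
J(w, b) = - \frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log(f(x^{(i)})) + (1 - y^{(i)}) \log(1 - f(x^{(i)})) \Big] + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2
$$

The corresponding updates have the same form as for linear regression, with $f(x)$ now the sigmoid of $w \cdot x + b$:

$$
w_j := w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^m \left(f(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m} w_j \right]
$$

$$
b := b - \alpha \frac{1}{m} \sum_{i=1}^m \left(f(x^{(i)}) - y^{(i)}\right)
$$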

### **Effects of Regularization**

| $\lambda$ Value      | Effect on Model |
|---------------------|----------------|
| 0                   | No regularization → may overfit. |
| Moderate            | Balances fit and simplicity → reduces overfitting, improves generalization. |
| Very Large          | Heavy regularization → underfits, weights approach $0$. |

- By keeping weights small, regularization **smooths the model**, preventing extreme fluctuations that cause overfitting.  
- It allows us to **use all features** without the model relying too heavily on any single one.
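
The effect of $\lambda$ on the weights can be seen in a deliberately simplified setting: linear regression with a single feature and no bias term (an assumption made here only to get a closed form). Setting the derivative of $\frac{1}{2m}\sum_i (w x^{(i)} - y^{(i)})^2 + \frac{\lambda}{2m} w^2$ to zero gives $w = \frac{\sum_i x^{(i)} y^{(i)}}{\sum_i (x^{(i)})^2 + \lambda}$. The data below are made up.

```python
def ridge_weight_1d(xs, ys, lam):
    """Minimizer of (1/2m) sum (w*x - y)^2 + (lam/2m) w^2 for one weight, no bias."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


# Hypothetical data lying roughly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for lam in (0.0, 1.0, 10.0, 100.0, 1000.0):
    print(f"lambda = {lam:>6}: w = {ridge_weight_1d(xs, ys, lam):.4f}")
```

With $\lambda = 0$ the weight is close to 2, the slope suggested by the data. As $\lambda$ grows the weight is pulled steadily toward 0, and for very large $\lambda$ the model predicts nearly 0 everywhere, which is the underfitting case in the table.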

### **Key Concepts**

1. **Regularization Parameter ($\lambda$)**: Controls the trade-off between **training error** and **model complexity**.  
2. **Weight Shrinking**: Each weight $w_j$ is shrunk slightly on every gradient descent step, which helps reduce overfitting.  
3. **Bias Term ($b$)**: Typically not regularized; in practice, regularizing $b$ makes little difference.  
4. **General Principle**: Regularization makes complex models behave more like simpler models without discarding features.

### **Summary**

Regularization is a **powerful tool** to prevent overfitting:

1. Modify the cost function to include a **penalty for large weights**.  
2. Adjust **gradient descent updates** to shrink weights slightly each iteration.  
3. Choose a **$\lambda$ value** that balances fit and simplicity.  
4. Apply the same idea to **both linear and logistic regression**.  

By implementing regularization, even **high-dimensional models** (many features) can often generalize well to unseen data, making this technique essential for practical machine learning applications.

---