# Gradient Descent for Logistic Regression (Explicit $w$ and $b$)

## Overview
Gradient descent iteratively adjusts the **weight vector $\vec{w}$** and the **bias term $b$** to minimize the log loss cost function. This formulation keeps $\vec{w}$ and $b$ as separate, explicit parameters rather than folding them into a single vector $\vec{\theta}$.

---

## Key Components

### 1. **Hypothesis Function**

$$
h_{\vec{w}, b}(\vec{x}) = \sigma(\vec{w}^T \vec{x} + b) = \frac{1}{1 + e^{-(\vec{w}^T \vec{x} + b)}}
$$
- $\vec{w} = [w_1, w_2, \dots, w_n]^T$: **Weight vector** (excluding bias).  
- $b$: **Bias term** (scalar).  
- $\vec{x} = [x_1, x_2, \dots, x_n]^T$: Feature vector (**no** added $1$ for bias).  
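
As a minimal sketch of the hypothesis function, the snippet below evaluates $\sigma(\vec{w}^T \vec{x} + b)$ for a single example using only the standard library. The function names and the numeric values are illustrative assumptions, not part of the formulation above. This naive sigmoid can overflow for very large negative inputs, so a production version would likely need a numerically stable variant.

```python
import math

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(w, b, x):
    """h_{w,b}(x) = sigma(w^T x + b) for one feature vector x (no leading 1)."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return sigmoid(z)

# Illustrative values only: two features with hand-picked parameters.
w = [0.5, -0.25]
b = 0.1
x = [2.0, 4.0]

# Here z = 0.5*2.0 - 0.25*4.0 + 0.1 = 0.1, so p is slightly above 0.5.
p = hypothesis(w, b, x)
```

With all parameters at zero, $z = 0$ and the predicted probability is exactly $0.5$ for any input.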

### 2. **Cost Function (Log Loss)**

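$$
J(\vec{w}, b) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(h_{\vec{w}, b}(\vec{x}^{(i)})) + (1-y^{(i)}) \log(1 - h_{\vec{w}, b}(\vec{x}^{(i)})) \right]
$$
- $m$: Number of training examples.  
- $(\vec{x}^{(i)}, y^{(i)})$: The $i$-th training example, with label $y^{(i)} \in \{0, 1\}$.  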
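
The cost can be sketched directly from the formula. The tiny dataset, the function name `log_loss`, and the clipping constant `eps` are assumptions made for illustration; clipping is a common guard against $\log(0)$ and is not part of the mathematical definition.

```python
import math

def log_loss(w, b, X, y, eps=1e-12):
    """Average log loss J(w, b) over m examples (X is a list of feature lists)."""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        h = 1.0 / (1.0 + math.exp(-z))
        # Assumption: clip h away from 0 and 1 so log() never receives 0.
        h = min(max(h, eps), 1.0 - eps)
        total += y_i * math.log(h) + (1 - y_i) * math.log(1.0 - h)
    return -total / m

# Hypothetical toy data: 4 examples, 2 features, binary labels.
X = [[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]]
y = [0, 0, 1, 1]

# With w = 0 and b = 0 every prediction is 0.5, so J equals log(2).
cost_at_zero = log_loss([0.0, 0.0], 0.0, X, y)
```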

### 3. **Gradients**
#### Partial Derivative w.r.t. Weight $w_j$:

$$
\frac{\partial J(\vec{w}, b)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^m \left( h_{\vec{w}, b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}
$$
#### Partial Derivative w.r.t. Bias $b$:

$$
\frac{\partial J(\vec{w}, b)}{\partial b} = \frac{1}{m} \sum_{i=1}^m \left( h_{\vec{w}, b}(\vec{x}^{(i)}) - y^{(i)} \right)
$$
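
One way to build confidence in these partial derivatives is to compare them against a central finite-difference approximation of $J$. The sketch below does this on a hypothetical toy dataset; the helper names (`predict`, `cost`, `gradients`) and the step size are illustrative choices.

```python
import math

def predict(w, b, x):
    """h_{w,b}(x) for a single example."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, b, X, y):
    """Log loss J(w, b), written without clipping for brevity."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = predict(w, b, x_i)
        total += y_i * math.log(h) + (1 - y_i) * math.log(1.0 - h)
    return -total / len(X)

def gradients(w, b, X, y):
    """Analytic partial derivatives: (dJ/dw_1 ... dJ/dw_n, dJ/db)."""
    m, n = len(X), len(w)
    dw = [0.0] * n
    db = 0.0
    for x_i, y_i in zip(X, y):
        err = predict(w, b, x_i) - y_i      # h - y for this example
        for j in range(n):
            dw[j] += err * x_i[j]
        db += err
    return [g / m for g in dw], db / m

# Hypothetical toy data and an arbitrary parameter point.
X = [[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]]
y = [0, 0, 1, 1]
w, b = [0.2, -0.1], 0.05

dw, db = gradients(w, b, X, y)

# Central differences: (J(p + step) - J(p - step)) / (2 * step).
step = 1e-6
numeric_db = (cost(w, b + step, X, y) - cost(w, b - step, X, y)) / (2 * step)
numeric_dw0 = (cost([w[0] + step, w[1]], b, X, y)
               - cost([w[0] - step, w[1]], b, X, y)) / (2 * step)
```

If the analytic and numeric values disagree by more than a small tolerance, the gradient code (or the cost code) most likely contains a bug.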

---

## Gradient Descent Steps

### 1. **Initialize Parameters**
- Set $\vec{w}$ to initial values (e.g., zeros).  
- Set $b = 0$.  

### 2. **Update Rules**
For each weight $w_j$:  

$$
w_j := w_j - \alpha \frac{\partial J(\vec{w}, b)}{\partial w_j}
$$
For bias $b$:  

$$
b := b - \alpha \frac{\partial J(\vec{w}, b)}{\partial b}
$$

### 3. **Repeat**
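- Update all $w_j$ and $b$ **simultaneously**: compute every gradient from the current parameters before changing any of them.  
- Iterate until convergence, e.g., until the decrease in $J(\vec{w}, b)$ between iterations falls below a small tolerance or a maximum number of iterations is reached.  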
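
The steps above can be sketched as a single update function. The point of the example is the simultaneous update: all errors and gradients are computed from the old parameters, and only then are $\vec{w}$ and $b$ replaced. The function name `gradient_step`, the toy data, and the learning rate $\alpha = 0.1$ are assumptions for illustration.

```python
import math

def predict(w, b, x):
    """h_{w,b}(x) for a single example."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(w, b, X, y, alpha):
    """One simultaneous gradient descent update; returns new (w, b)."""
    m, n = len(X), len(w)
    # All errors use the OLD parameters, so no update leaks into another.
    errors = [predict(w, b, x_i) - y_i for x_i, y_i in zip(X, y)]
    dw = [sum(e * x_i[j] for e, x_i in zip(errors, X)) / m for j in range(n)]
    db = sum(errors) / m
    new_w = [w_j - alpha * g for w_j, g in zip(w, dw)]
    new_b = b - alpha * db
    return new_w, new_b

# Hypothetical toy data; parameters initialized to zero as in step 1.
X = [[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]]
y = [0, 0, 1, 1]
w, b = [0.0, 0.0], 0.0
alpha = 0.1

w1, b1 = gradient_step(w, b, X, y, alpha)
```

Updating `w` in place inside the loop over `j` and then computing `db` from the modified weights would break the simultaneous-update requirement, which is why the function builds new values and returns them together.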

---

## Matrix Form (Efficient Computation)
Let:  
- $\mathbf{X}$: Design matrix of shape $(m \times n)$ (**no** added bias column).  
- $\vec{y}$: Label vector of shape $(m \times 1)$.  
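- $\vec{h} = \sigma(\mathbf{X} \vec{w} + b)$: Vector of predicted probabilities of shape $(m \times 1)$; the scalar $b$ is added to every entry and $\sigma$ is applied element-wise.  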

### Gradients
- **Weight gradient vector**:  

$$
\nabla_{\vec{w}} J = \frac{1}{m} \mathbf{X}^T (\vec{h} - \vec{y})
$$
- **Bias gradient**:  

$$
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^m (h^{(i)} - y^{(i)})
$$

### Parameter Updates

$$
\vec{w} := \vec{w} - \alpha \nabla_{\vec{w}} J
$$

$$
b := b - \alpha \frac{\partial J}{\partial b}
$$
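
The matrix form maps closely onto NumPy. The following is a sketch under a few assumptions: the data is a made-up toy set, the function names are illustrative, and vectors are stored as 1-D arrays of shape `(m,)` and `(n,)` rather than explicit column vectors, which is a common convention but not the only one.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def gradients(X, y, w, b):
    """Vectorized gradients of the log loss with respect to w and b."""
    m = X.shape[0]
    h = sigmoid(X @ w + b)          # scalar b broadcasts across all m entries
    residual = h - y                # shape (m,)
    grad_w = X.T @ residual / m     # shape (n,)
    grad_b = residual.mean()        # (1/m) * sum_i (h_i - y_i)
    return grad_w, grad_b

# Hypothetical toy data: m = 4 examples, n = 2 features, no bias column.
X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = np.zeros(2)
b = 0.0
alpha = 0.1

# One parameter update in matrix form.
grad_w, grad_b = gradients(X, y, w, b)
w = w - alpha * grad_w
b = b - alpha * grad_b
```

A single matrix product replaces the nested loops over examples and features, which is where the efficiency gain comes from.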

---

## Summary of Algorithm
1. Compute $\vec{h} = \sigma(\mathbf{X} \vec{w} + b)$.  
2. Calculate gradients:  
   - $\nabla_{\vec{w}} J = \frac{1}{m} \mathbf{X}^T (\vec{h} - \vec{y})$  
   - $\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^m (h^{(i)} - y^{(i)})$  
3. Update parameters:  
   - $\vec{w} := \vec{w} - \alpha \nabla_{\vec{w}} J$  
   - $b := b - \alpha \frac{\partial J}{\partial b}$  
4. Repeat until convergence.  
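
Putting the four steps together, a training loop might look like the sketch below. The function name `fit_logistic`, the default hyperparameters, the stopping rule (stop when the cost decreases by less than `tol`), and the one-feature toy dataset are all assumptions for illustration; other convergence criteria are equally valid.

```python
import numpy as np

def fit_logistic(X, y, alpha=0.1, max_iters=10_000, tol=1e-8):
    """Batch gradient descent for logistic regression; returns (w, b, cost)."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    prev_cost = np.inf
    for _ in range(max_iters):
        # Step 1: predicted probabilities.
        h = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        # Track the cost (clipped to avoid log(0)) to detect convergence.
        h_safe = np.clip(h, 1e-12, 1.0 - 1e-12)
        cost = -np.mean(y * np.log(h_safe) + (1.0 - y) * np.log(1.0 - h_safe))
        if prev_cost - cost < tol:
            break
        prev_cost = cost
        # Step 2: gradients.
        residual = h - y
        grad_w = X.T @ residual / m
        grad_b = residual.mean()
        # Step 3: simultaneous update.
        w = w - alpha * grad_w
        b = b - alpha * grad_b
    return w, b, cost

# Hypothetical, deliberately non-separable data with one feature.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

w_fit, b_fit, final_cost = fit_logistic(X, y)
p_low = 1.0 / (1.0 + np.exp(-(w_fit[0] * 0.0 + b_fit)))
p_high = 1.0 / (1.0 + np.exp(-(w_fit[0] * 5.0 + b_fit)))
```

On perfectly separable data the weights can keep growing without bound, so a cap such as `max_iters` (or regularization, which is outside the scope of these notes) is a sensible safeguard.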

---

## Notes
- **Interpretation**: The bias $b$ acts as an "offset" in the linear combination $\vec{w}^T \vec{x} + b$.  
- **Implementation**: In code, $\vec{w}$ and $b$ are often combined into a single parameter vector $\vec{\theta} = [b, w_1, \dots, w_n]^T$ by prepending a column of $1$s to $\mathbf{X}$.  
- **Equivalence**: This formulation is mathematically equivalent to the $\vec{\theta}$ notation; it only keeps $\vec{w}$ and $b$ as separate parameters.  
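
The equivalence can be checked numerically. The sketch below, on hypothetical toy data with an arbitrary parameter point, builds the augmented matrix with a leading column of ones and confirms that the $\vec{\theta}$ gradient is the bias gradient stacked on top of the weight gradient. Variable names such as `X_aug` are illustrative.

```python
import numpy as np

# Hypothetical toy data and an arbitrary parameter point.
X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = np.array([0.2, -0.1])
b = 0.05
m = X.shape[0]

# Combined form: prepend a column of ones and stack theta = [b, w_1, ..., w_n].
X_aug = np.hstack([np.ones((m, 1)), X])
theta = np.concatenate([[b], w])

# Predictions from both formulations.
h_split = 1.0 / (1.0 + np.exp(-(X @ w + b)))
h_theta = 1.0 / (1.0 + np.exp(-(X_aug @ theta)))

# Gradients from both formulations.
grad_w = X.T @ (h_split - y) / m
grad_b = np.mean(h_split - y)
grad_theta = X_aug.T @ (h_theta - y) / m

# The first entry of grad_theta is grad_b; the remaining entries are grad_w.
same_predictions = np.allclose(h_split, h_theta)
same_gradients = np.allclose(grad_theta, np.concatenate([[grad_b], grad_w]))
```

The bias gradient falls out of the combined form because the first row of `X_aug.T` is all ones, so its product with the residual is just the sum of the residuals.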