# Cost Function for Logistic Regression

Logistic regression is a classification algorithm for predicting binary outcomes (0 or 1). Its cost function is derived by **maximum likelihood estimation (MLE)** and penalizes confident wrong predictions heavily: the loss grows without bound as the probability assigned to the true class approaches 0. The sections below break down the per-example logistic loss and the overall cost function.

---

## 1. **Hypothesis Function (Logistic/Sigmoid Function)**
The logistic regression hypothesis maps input features to an estimated probability $P(y = 1 \mid x; \theta)$, strictly between 0 and 1, using the **sigmoid function**:

$$
h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}
$$
- $\theta$: Model parameters (weights).
- $x$: Input feature vector.
- $\sigma(z)$: Sigmoid function, which squashes any real $z$ into the open interval $(0, 1)$.
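
Evaluating $\frac{1}{1 + e^{-z}}$ literally can overflow in floating point when $z$ is a large negative number. The sketch below is one common way to avoid that, assuming NumPy; the function name `sigmoid` is an illustrative choice, not something fixed by the text above.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid for scalars or arrays."""
    z = np.asarray(z, dtype=float)
    # exp(-|z|) always lies in (0, 1], so it can never overflow.
    e = np.exp(-np.abs(z))
    # z >= 0: 1 / (1 + exp(-z));  z < 0: exp(z) / (1 + exp(z)). Both equal sigma(z).
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

print(sigmoid(0.0))                       # midpoint of the curve
print(sigmoid(np.array([-1000.0, 1000.0])))  # saturates without overflow warnings
```

Both branches are algebraically identical to the formula above; the split only changes which exponential is computed.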

---

## 2. **Logistic Loss Function**
The loss function for a single training example $(x^{(i)}, y^{(i)})$ is the **negative log-likelihood**, also known as the **logistic loss** or **cross-entropy loss**:

$$
L(h_\theta(x^{(i)}), y^{(i)}) = 
\begin{cases} 
-\log(h_\theta(x^{(i)})) & \text{if } y^{(i)} = 1, \\
-\log(1 - h_\theta(x^{(i)})) & \text{if } y^{(i)} = 0.
\end{cases}
$$
This can be compactly written as:

$$
L(h_\theta(x^{(i)}), y^{(i)}) = -y^{(i)} \log(h_\theta(x^{(i)})) - (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)}))
$$
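
To see how the loss behaves, the following minimal sketch evaluates the compact formula for a few predicted probabilities. It assumes NumPy; the name `logistic_loss` and the clipping constant `eps` are assumptions added here to keep `log` away from 0, not part of the definition.

```python
import numpy as np

def logistic_loss(h, y, eps=1e-12):
    """Per-example loss -y*log(h) - (1-y)*log(1-h) for a predicted probability h."""
    # Clip so that log never receives exactly 0 (eps is an illustrative choice).
    h = np.clip(h, eps, 1.0 - eps)
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

# The true label is 1 in every row; only the predicted probability changes.
for h in (0.9, 0.5, 0.1):
    print(f"h = {h:.1f}, y = 1 -> loss = {logistic_loss(h, 1):.4f}")
```

The loss is small when the model assigns high probability to the correct class and grows quickly as that probability shrinks.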

---

## 3. **Cost Function**
The overall cost function $J(\theta)$ is the **average loss** over all $m$ training examples:

$$
J(\theta) = \frac{1}{m} \sum_{i=1}^m \left[ -y^{(i)} \log(h_\theta(x^{(i)})) - (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]
$$
To prevent overfitting, a **regularization term** (e.g., L2 regularization) is often added:

$$
J(\theta) = \frac{1}{m} \sum_{i=1}^m \left[ -y^{(i)} \log(h_\theta(x^{(i)})) - (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2
$$
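- $\lambda$: Regularization parameter ($\lambda \geq 0$); larger values shrink the weights more strongly.
- $n$: Number of features.
- $\theta_j$: Model weights (excluding the bias term $\theta_0$, which is not penalized).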
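
The regularized cost can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: the first column of `X` is all ones so that `theta[0]` plays the role of the bias $\theta_0$, and the names `cost`, `lam`, and `eps` are hypothetical choices made for this example.

```python
import numpy as np

def cost(theta, X, y, lam=0.0, eps=1e-12):
    """Regularized logistic cost J(theta); X[:, 0] is assumed to be all ones."""
    m = X.shape[0]
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))      # h_theta(x) for every example
    h = np.clip(h, eps, 1.0 - eps)              # guard against log(0)
    data_term = np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))
    reg_term = lam / (2 * m) * np.sum(theta[1:] ** 2)  # bias theta[0] excluded
    return data_term + reg_term

# Tiny made-up data set: a bias column plus one feature.
X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])

print(cost(np.zeros(2), X, y))                  # every h is 0.5, so J = log(2)
print(cost(np.array([0.0, 2.0]), X, y, lam=1.0))
```

With all parameters at zero every prediction is 0.5, so the cost equals $\log 2 \approx 0.693$ regardless of the data, which makes a convenient sanity check.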

---

## 4. **Gradient of the Cost Function**
To optimize $\theta$, we compute the gradient of $J(\theta)$:

$$
\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \quad \text{(for } j \geq 1 \text{)}
$$
For the bias term ($j = 0$, with $x_0^{(i)} = 1$) and for the unregularized case ($\lambda = 0$), the penalty term drops out:

$$
\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
$$
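
A quick way to gain confidence in the gradient formula is to compare it against central finite differences of the cost. The sketch below does that on random data, assuming NumPy, a leading column of ones in `X`, and hypothetical helper names (`cost`, `gradient`, `numerical_gradient`); the step size `1e-6` is an arbitrary illustrative choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, lam):
    m = len(y)
    h = sigmoid(X @ theta)
    data_term = np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))
    return data_term + lam / (2 * m) * np.sum(theta[1:] ** 2)

def gradient(theta, X, y, lam):
    """Component-wise gradient, written to mirror the formula term by term."""
    m = len(y)
    residual = sigmoid(X @ theta) - y           # h_theta(x^(i)) - y^(i) for all i
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        grad[j] = np.mean(residual * X[:, j])
        if j >= 1:                              # the bias theta_0 is not regularized
            grad[j] += lam / m * theta[j]
    return grad

def numerical_gradient(theta, X, y, lam, step=1e-6):
    """Central finite-difference approximation of the gradient."""
    num = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = step
        num[j] = (cost(theta + e, X, y, lam) - cost(theta - e, X, y, lam)) / (2 * step)
    return num

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])  # bias column + 2 features
y = (rng.random(20) < 0.5).astype(float)                      # random 0/1 labels
theta = rng.normal(size=3)

max_diff = np.max(np.abs(gradient(theta, X, y, 1.0) - numerical_gradient(theta, X, y, 1.0)))
print("largest disagreement:", max_diff)
```

If the analytic and numerical gradients disagree by more than a small tolerance, the usual culprits are a sign error or a penalty applied to the bias.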

---

## 5. **Vectorized Form**
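Let $X \in \mathbb{R}^{m \times (n+1)}$ be the design matrix with a leading column of ones, $y \in \{0, 1\}^m$ the labels, and $\theta \in \mathbb{R}^{n+1}$ the parameters. To stay consistent with the component-wise formulas, let $\tilde{\theta}$ denote $\theta$ with its first entry (the bias $\theta_0$) set to zero, so that the bias is not penalized:

$$
J(\theta) = -\frac{1}{m} \left[ y^T \log(h_\theta(X)) + (1 - y)^T \log(1 - h_\theta(X)) \right] + \frac{\lambda}{2m} \tilde{\theta}^T \tilde{\theta}
$$

$$
\nabla J(\theta) = \frac{1}{m} X^T (h_\theta(X) - y) + \frac{\lambda}{m} \tilde{\theta}
$$
where $h_\theta(X) = \sigma(X\theta)$, with $\sigma$ and $\log$ applied element-wise and $1$ denoting the all-ones vector.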
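
The vectorized expressions translate almost line for line into NumPy. The sketch below pairs them with plain batch gradient descent on synthetic data. It assumes `X` has a leading column of ones and that the bias is left out of the penalty; the learning rate, iteration count, `true_theta`, and all function names are hypothetical choices for illustration, not recommended settings.

```python
import numpy as np

def sigmoid(z):
    e = np.exp(-np.abs(z))                       # overflow-safe formulation
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def cost_and_grad(theta, X, y, lam=0.0, eps=1e-12):
    """Vectorized J(theta) and its gradient; theta[0] is treated as the bias."""
    m = X.shape[0]
    h = sigmoid(X @ theta)
    hc = np.clip(h, eps, 1.0 - eps)              # only used inside the logs
    penalized = theta.copy()
    penalized[0] = 0.0                           # do not regularize the bias
    J = -(y @ np.log(hc) + (1 - y) @ np.log(1 - hc)) / m \
        + lam / (2 * m) * (penalized @ penalized)
    grad = X.T @ (h - y) / m + lam / m * penalized
    return J, grad

def fit(X, y, lam=0.0, lr=0.1, n_iters=2000):
    """Batch gradient descent starting from theta = 0."""
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(n_iters):
        J, grad = cost_and_grad(theta, X, y, lam)
        history.append(J)
        theta -= lr * grad
    return theta, history

# Synthetic data drawn from a known (made-up) parameter vector.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
true_theta = np.array([-0.5, 2.0, -1.0])
y = (rng.random(200) < sigmoid(X @ true_theta)).astype(float)

theta, history = fit(X, y, lam=1.0)
print("initial cost:", history[0])
print("final cost:  ", history[-1])
print("fitted theta:", theta)
```

In practice a library optimizer would usually replace the hand-written loop; the loop is kept here only to make the role of $\nabla J(\theta)$ explicit.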

---

## 6. **Key Properties**
1. **Convexity**: The logistic loss is convex in $\theta$, so every local minimum is a global minimum and gradient descent with a suitable step size converges toward it.
2. **Probabilistic Interpretation**: Minimizing $J(\theta)$ maximizes the likelihood of the observed data.
3. **Regularization**: The $\lambda$-term penalizes large weights to avoid overfitting.
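
A short derivation backs up the first two properties. It considers the unregularized cost and assumes the labels are independent Bernoulli draws with $P(y = 1 \mid x; \theta) = h_\theta(x)$; the symbols $\ell$ and $S$ are notation introduced only for this sketch, and $X$ denotes the matrix whose rows are the $x^{(i)T}$.

$$
P(y^{(i)} \mid x^{(i)}; \theta) = h_\theta(x^{(i)})^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}}
$$

$$
\ell(\theta) = \log \prod_{i=1}^m P(y^{(i)} \mid x^{(i)}; \theta) = \sum_{i=1}^m \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] = -m \, J(\theta)
$$

Maximizing the log-likelihood $\ell(\theta)$ is therefore the same as minimizing $J(\theta)$. For convexity, differentiating the gradient once more gives the Hessian:

$$
\nabla^2 J(\theta) = \frac{1}{m} X^T S X, \qquad S = \mathrm{diag}\left( h_\theta(x^{(i)}) \left(1 - h_\theta(x^{(i)})\right) \right)
$$

Every diagonal entry of $S$ is positive, so $v^T \nabla^2 J(\theta) \, v = \frac{1}{m} \sum_{i=1}^m S_{ii} \, (x^{(i)T} v)^2 \geq 0$ for any $v$. The Hessian is positive semidefinite, hence $J$ is convex; the L2 penalty only adds a non-negative diagonal term and preserves this.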

---

## 7. **Why Not Mean Squared Error (MSE)?**
MSE is unsuitable for logistic regression because:
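- Combined with the sigmoid hypothesis, the squared-error loss is non-convex in $\theta$, so gradient descent can stall in flat regions or poor local minima.
- The targets are binary labels, for which the Bernoulli likelihood (and hence cross-entropy) is the natural model; MSE instead corresponds to assuming Gaussian noise on a real-valued target.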
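
The non-convexity claim can be checked numerically on the smallest possible case: a single example with $x = 1$, $y = 1$, and one parameter $\theta$. A convex function has non-negative second differences on any evenly spaced grid, so a negative value is a witness of non-convexity. The grid range and spacing below are arbitrary illustrative choices.

```python
import numpy as np

theta = np.linspace(-6.0, 6.0, 241)      # evenly spaced grid of parameter values
h = 1.0 / (1.0 + np.exp(-theta))         # sigmoid(theta * x) with x = 1

mse = (h - 1.0) ** 2                     # squared error against y = 1
ce = -np.log(h)                          # cross-entropy against y = 1

def second_diff(f):
    """Discrete second derivative (up to the factor 1/step^2)."""
    return f[:-2] - 2.0 * f[1:-1] + f[2:]

print("MSE, smallest second difference:", second_diff(mse).min())
print("CE,  smallest second difference:", second_diff(ce).min())
```

The squared-error curve bends downward for sufficiently negative $\theta$, where the sigmoid saturates, while the cross-entropy curve bends upward everywhere on the grid.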