# Logistic Regression: Gradient Descent with Log Loss


Logistic Regression is used when the output variable $y$ is **binary** (0 or 1).
The goal is to learn parameters $w$ and $b$ such that the predicted probability:

$$
\hat{y} = P(y=1 \mid x)
$$

matches the true labels as closely as possible.

---

# 1. The Hypothesis Function

Logistic Regression applies the **sigmoid function** to a linear model of a single input feature $x$:

$$
z = wx + b
$$

$$
\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}
$$

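Interpretation: $\hat{y}$ is the estimated probability that the example belongs to class 1. Thresholding it at 0.5 gives the predicted class:

- If $\hat{y} \geq 0.5$, the model predicts class **1**
- If $\hat{y} < 0.5$, the model predicts class **0**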
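As a minimal sketch of the hypothesis function, the snippet below evaluates $z = wx + b$ and $\hat{y} = \sigma(z)$ for a few inputs. The values `w = 1.0` and `b = -3.0` are illustrative assumptions, not learned parameters.

```python
import numpy as np

def sigmoid(z):
    """Map any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative parameters (assumed for this example, not learned from data)
w, b = 1.0, -3.0

for x in [0.0, 3.0, 6.0]:
    z = w * x + b
    print(f"x = {x}: z = {z:+.1f}, y_hat = {sigmoid(z):.4f}")

# x = 0.0: z = -3.0, y_hat = 0.0474
# x = 3.0: z = +0.0, y_hat = 0.5000
# x = 6.0: z = +3.0, y_hat = 0.9526
```

Large negative $z$ pushes $\hat{y}$ toward 0, large positive $z$ pushes it toward 1, and $z = 0$ gives exactly 0.5.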

---

# 2. Loss Function: Log Loss
Log loss (also called binary cross-entropy) is defined for each training example as:

$$
L_i = -\left[
y_i \log(\hat{y}_i)
+
(1 - y_i)\log(1 - \hat{y}_i)
\right]
$$

Properties:

- If $y_i = 1$, only the first term remains:
  $$
  L_i = -\log(\hat{y}_i)
  $$
- If $y_i = 0$, only the second term remains:
  $$
  L_i = -\log(1 - \hat{y}_i)
  $$
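To make these two cases concrete, here is a small sketch that evaluates the per-example loss for confident correct and confident wrong predictions. The helper name `log_loss_single` and the probabilities 0.9 and 0.1 are chosen only for illustration.

```python
import numpy as np

def log_loss_single(y, y_hat):
    """Per-example log loss L_i for a label y in {0, 1} and a prediction y_hat in (0, 1)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# (label, predicted probability of class 1)
for y, y_hat in [(1, 0.9), (1, 0.1), (0, 0.1), (0, 0.9)]:
    print(f"y = {y}, y_hat = {y_hat}: loss = {log_loss_single(y, y_hat):.4f}")

# y = 1, y_hat = 0.9: loss = 0.1054
# y = 1, y_hat = 0.1: loss = 2.3026
# y = 0, y_hat = 0.1: loss = 0.1054
# y = 0, y_hat = 0.9: loss = 2.3026
```

A prediction close to the true label costs little, while a confident wrong prediction is penalized heavily.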

The cost is the average loss over all $n$ training examples:

$$
J(w,b) = \frac{1}{n} \sum_{i=1}^{n} L_i
$$

Our goal:
$$
\min_{w,b} J(w,b)
$$

---

# 3. The Chain Rule Structure

We must compute:

$$
\frac{\partial J}{\partial w},
\qquad
\frac{\partial J}{\partial b}
$$

For one example, the computation flows like this:

$$
w, b
\quad \longrightarrow \quad
z = wx + b
\quad \longrightarrow \quad
\hat{y} = \sigma(z)
\quad \longrightarrow \quad
L
$$

Applying the chain rule:

$$
\frac{\partial L}{\partial w}
=
\frac{\partial L}{\partial \hat{y}}
\cdot
\frac{\partial \hat{y}}{\partial z}
\cdot
\frac{\partial z}{\partial w}
$$

$$
\frac{\partial L}{\partial b}
=
\frac{\partial L}{\partial \hat{y}}
\cdot
\frac{\partial \hat{y}}{\partial z}
\cdot
\frac{\partial z}{\partial b}
$$

We compute each component next.

---

# 4. Step-by-Step Derivatives

## 4.1 Derivative of the Sigmoid Function

$$
\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}
$$

A very useful identity:

$$
\frac{d\hat{y}}{dz} = \hat{y}(1 - \hat{y})
$$
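As a short sketch of where this identity comes from, differentiate $\sigma(z) = (1 + e^{-z})^{-1}$ with the chain rule and use $1 - \hat{y} = \frac{e^{-z}}{1 + e^{-z}}$:

$$
\frac{d\hat{y}}{dz}
= \frac{e^{-z}}{(1 + e^{-z})^2}
= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}
= \hat{y}(1 - \hat{y})
$$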

---

## 4.2 Derivative of Log Loss w.r.t. Prediction $\hat{y}$

$$
L = -\left[
y\log(\hat{y}) + (1 - y)\log(1 - \hat{y})
\right]
$$

Differentiate term-by-term:

$$
\frac{\partial L}{\partial \hat{y}}
=
-\frac{y}{\hat{y}}
+
\frac{1 - y}{1 - \hat{y}}
$$

---

## 4.3 Derivative of $z$

$$
z = wx + b
$$

$$
\frac{\partial z}{\partial w} = x,
\qquad
\frac{\partial z}{\partial b} = 1
$$

---

# 5. Combine Using Chain Rule

### First compute $\frac{\partial L}{\partial z}$:

$$
\frac{\partial L}{\partial z}
=
\frac{\partial L}{\partial \hat{y}}
\cdot
\frac{\partial \hat{y}}{\partial z}
$$

Substitute:

$$
\frac{\partial L}{\partial z}
=
\left(
-\frac{y}{\hat{y}}
+
\frac{1-y}{1-\hat{y}}
\right)
\cdot
\hat{y}(1 - \hat{y})
$$

Simplify each term:

$$
-\frac{y}{\hat{y}} \cdot \hat{y}(1 - \hat{y})
= -y(1 - \hat{y})
$$

$$
\frac{1-y}{1-\hat{y}} \cdot \hat{y}(1 - \hat{y})
= (1 - y)\hat{y}
$$

Combine:

$$
\frac{\partial L}{\partial z}
= -y(1-\hat{y}) + (1-y)\hat{y}
$$

Expand and cancel the $y\hat{y}$ terms:

$$
\frac{\partial L}{\partial z}
= -y + y\hat{y} + \hat{y} - y\hat{y}
= \hat{y} - y
$$
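The result $\frac{\partial L}{\partial z} = \hat{y} - y$ can be sanity-checked numerically. The sketch below compares it with a central finite difference of the loss at a few arbitrarily chosen $(z, y)$ pairs. The helper `loss_from_z` and the step size `eps` are assumptions made for this check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_from_z(z, y):
    """Log loss written as a function of the pre-activation z."""
    y_hat = sigmoid(z)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

eps = 1e-6   # finite-difference step (assumed)
checks = []
for z, y in [(0.5, 1.0), (-1.2, 0.0), (2.0, 0.0)]:
    numeric = (loss_from_z(z + eps, y) - loss_from_z(z - eps, y)) / (2 * eps)
    analytic = sigmoid(z) - y   # the derived formula: y_hat - y
    checks.append((numeric, analytic))
    print(f"z = {z:+.1f}, y = {y:.0f}: numeric = {numeric:.6f}, analytic = {analytic:.6f}")
```

The two columns should agree to roughly six decimal places, which is what one would expect if the derivation is correct.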

---

# 6. Final Gradients for One Example

Using:

$$
\frac{\partial z}{\partial w} = x,
\qquad
\frac{\partial z}{\partial b} = 1
$$

### Gradient w.r.t. $w$:
$$
\frac{\partial L}{\partial w}
=
(\hat{y} - y)x
$$

### Gradient w.r.t. $b$:
$$
\frac{\partial L}{\partial b}
=
\hat{y} - y
$$

---

# 7. Gradients for All Training Samples

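Since the cost is the average of the per-example losses:

$$
J(w,b) = \frac{1}{n} \sum_{i=1}^{n} L_i
$$

and differentiation is linear, its gradients are the averages of the per-example gradients: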

$$
\frac{\partial J}{\partial w}
=
\frac{1}{n}
\sum_{i=1}^{n}
(\hat{y}_i - y_i)x_i
$$

$$
\frac{\partial J}{\partial b}
=
\frac{1}{n}
\sum_{i=1}^{n}
(\hat{y}_i - y_i)
$$
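The two sums translate directly into a loop over the training examples. The sketch below uses plain Python and a six-point toy dataset (hours studied versus pass/fail), evaluated at the starting point $w = 0$, $b = 0$. The function name `gradients` is chosen for illustration.

```python
import math

# Toy data: hours studied and pass (1) / fail (0)
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradients(w, b, X, y):
    """Average the per-example gradients (y_hat_i - y_i) * x_i and (y_hat_i - y_i)."""
    n = len(X)
    dw_sum = 0.0
    db_sum = 0.0
    for x_i, y_i in zip(X, y):
        error_i = sigmoid(w * x_i + b) - y_i
        dw_sum += error_i * x_i
        db_sum += error_i
    return dw_sum / n, db_sum / n

dw, db = gradients(0.0, 0.0, X, y)
print(dw, db)   # -0.75 0.0
```

At $w = b = 0$ every prediction is 0.5, so the errors are $\pm 0.5$. They cancel in the bias gradient, while the weight gradient is negative because the positive examples have the larger $x$ values.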

---

# 8. Gradient Descent Update Rules

Given learning rate $\alpha$:

$$
w \leftarrow w - \alpha \frac{\partial J}{\partial w}
$$

$$
b \leftarrow b - \alpha \frac{\partial J}{\partial b}
$$

Substitute the gradients:

$$
w \leftarrow
w - \alpha \left(
\frac{1}{n}
\sum_{i=1}^{n}
(\hat{y}_i - y_i)x_i
\right)
$$

$$
b \leftarrow
b - \alpha \left(
\frac{1}{n}
\sum_{i=1}^{n}
(\hat{y}_i - y_i)
\right)
$$
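As a worked example, the sketch below performs a single update step starting from $w = 0$, $b = 0$ with $\alpha = 0.1$ on a six-point toy dataset (hours studied versus pass/fail). The data and learning rate match the training script in this document, so the result can be compared with the first iteration of its printed output.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5, 6], dtype=float)  # hours studied
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)  # 1 = pass, 0 = fail

alpha = 0.1
w, b = 0.0, 0.0

# Forward pass: every prediction is 0.5 because z = 0 for all examples
y_hat = 1.0 / (1.0 + np.exp(-(w * X + b)))
error = y_hat - y

# Averaged gradients
dw = np.mean(error * X)
db = np.mean(error)

# One gradient descent step
w -= alpha * dw
b -= alpha * db

print(f"dw = {dw:.4f}, db = {db:.4f}")  # dw = -0.7500, db = 0.0000
print(f"w = {w:.4f}, b = {b:.4f}")      # w = 0.0750, b = 0.0000
```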

---

```python
import numpy as np

# -----------------------------
# 1. Sample data (6 examples)
# -----------------------------
X = np.array([1, 2, 3, 4, 5, 6], dtype=float)  # hours studied
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)  # 1 = pass, 0 = fail

# -----------------------------
# 2. Sigmoid function
# -----------------------------
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# -----------------------------
# 3. Log loss cost function
# -----------------------------
def compute_cost(w, b, X, y):
    z = w * X + b
    y_hat = sigmoid(z)
    # The small constant keeps log() finite if y_hat reaches exactly 0 or 1
    cost = -np.mean(y * np.log(y_hat + 1e-9) + (1 - y) * np.log(1 - y_hat + 1e-9))
    return cost

# -----------------------------
# 4. Compute gradients dw, db
# -----------------------------
def compute_gradients(w, b, X, y):
    z = w * X + b
    y_hat = sigmoid(z)
    error = y_hat - y          # (y_hat_i - y_i) for every example
    dw = np.mean(error * X)    # dJ/dw
    db = np.mean(error)        # dJ/db
    return dw, db

# -----------------------------
# 5. Gradient descent
# -----------------------------
w = 0.0
b = 0.0
alpha = 0.1
iterations = 2000

cost_history = []

for i in range(iterations):
    dw, db = compute_gradients(w, b, X, y)

    # Update parameters
    w -= alpha * dw
    b -= alpha * db

    # Save the cost (after the update) for monitoring
    cost = compute_cost(w, b, X, y)
    cost_history.append(cost)

    # Print every 200 steps
    if i % 200 == 0:
        print(f"Iteration {i}: Cost = {cost:.6f}, w = {w:.4f}, b = {b:.4f}")

# -----------------------------
# 6. Final parameters
# -----------------------------
print("\nTraining complete!")
print(f"Final weight: w = {w:.4f}")
print(f"Final bias  : b = {b:.4f}")

# -----------------------------
# 7. Try predictions
# -----------------------------
def predict(x, w, b):
    # Threshold the predicted probability at 0.5
    prob = sigmoid(w * x + b)
    return 1 if prob >= 0.5 else 0

test_values = [1, 2, 3, 4, 5, 6]

print("\nPredictions:")
for x in test_values:
    print(f"Hours {x} → Predicted: {predict(x, w, b)} (Prob = {sigmoid(w*x+b):.4f})")
```

Output:

```text
Iteration 0: Cost = 0.647499, w = 0.0750, b = 0.0000
Iteration 200: Cost = 0.324548, w = 0.7819, b = -2.2807
Iteration 400: Cost = 0.236686, w = 1.1202, b = -3.5519
Iteration 600: Cost = 0.194605, w = 1.3592, b = -4.4346
Iteration 800: Cost = 0.169109, w = 1.5473, b = -5.1223
Iteration 1000: Cost = 0.151564, w = 1.7043, b = -5.6929
Iteration 1200: Cost = 0.138511, w = 1.8405, b = -6.1851
Iteration 1400: Cost = 0.128278, w = 1.9615, b = -6.6208
Iteration 1600: Cost = 0.119952, w = 2.0709, b = -7.0139
Iteration 1800: Cost = 0.112987, w = 2.1712, b = -7.3733

Training complete!
Final weight: w = 2.2637
Final bias  : b = -7.7039

Predictions:
Hours 1 → Predicted: 0 (Prob = 0.0043)
Hours 2 → Predicted: 0 (Prob = 0.0401)
Hours 3 → Predicted: 0 (Prob = 0.2864)
Hours 4 → Predicted: 1 (Prob = 0.7943)
Hours 5 → Predicted: 1 (Prob = 0.9738)
Hours 6 → Predicted: 1 (Prob = 0.9972)
```
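One way to read the learned parameters is through the decision boundary, the input at which $\hat{y} = 0.5$. That happens when $z = wx + b = 0$, i.e. at $x = -b/w$. The sketch below plugs in the final values printed above. They are rounded to four decimals, so the result is approximate.

```python
import math

# Final parameters as printed in the training output (rounded to 4 decimals)
w, b = 2.2637, -7.7039

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# y_hat = 0.5 exactly where z = w * x + b = 0
boundary = -b / w

print(f"Decision boundary: x = {boundary:.2f} hours")                   # x = 3.40 hours
print(f"P(y=1) at the boundary: {sigmoid(w * boundary + b):.4f}")      # 0.5000
```

A boundary of about 3.40 hours is consistent with the predictions above: 3 hours falls on the class-0 side and 4 hours on the class-1 side.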
