### Step 1: Linear combination (same as linear regression)

We first compute:

$$
z = m \cdot x + b
$$

This is the same weighted sum of the input that linear regression computes, with `m` as the weight and `b` as the bias.
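As a minimal sketch, the linear step for a single feature could look like the following. The helper name `linear_score` and the parameter values are illustrative assumptions, not part of the original notebook.

```python
def linear_score(x, m, b):
    """Return the raw score z = m * x + b for one input value."""
    return m * x + b

# Illustrative values only (not trained parameters)
z = linear_score(3.0, m=2.0, b=-5.0)
print(z)  # 1.0
```

The result `z` can be any real number, which is why a further step is needed before it can be read as a probability.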

---

### Step 2: Sigmoid Activation

We pass the output $z$ into a **sigmoid function** to squash it between 0 and 1:

$$
\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}
$$

This gives us the **predicted probability** that the input belongs to class 1.
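To see the squashing behaviour concretely, here is a small sketch that evaluates the sigmoid at a few scores. It uses a branch on the sign of `z` so that `math.exp` is only ever called with a non-positive argument; this variant and the name `stable_sigmoid` are assumptions added for illustration, and the plain formula above is equivalent for moderate values of `z`.

```python
import math

def stable_sigmoid(z):
    """Sigmoid that avoids overflow in math.exp for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # z < 0 here, so exp(z) lies in [0, 1)
    return ez / (1.0 + ez)

# Very negative scores map close to 0, very positive scores close to 1,
# and a score of exactly 0 maps to 0.5.
for z in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(z, round(stable_sigmoid(z), 4))
```

With the direct formula `1 / (1 + math.exp(-z))`, a score such as `z = -1000` would raise `OverflowError`, because `math.exp(1000)` exceeds the float range.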

---

### Step 3: Binary Cross-Entropy Loss

We use the **log loss** (a.k.a. binary cross-entropy) to measure prediction quality:

$$
\text{Loss} = - \frac{1}{n} \sum_{i=1}^{n} \left[ y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i) \right]
$$

Why?

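* It is designed for probabilistic outputs: minimizing it is equivalent to maximizing the likelihood of the observed labels under the model
* It penalizes confident wrong predictions much more heavily than hesitant ones, because the loss grows without bound as the predicted probability of the true class approaches 0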
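The second point is easy to check numerically. The sketch below computes the per-example loss for a true label of 1 at several predicted probabilities; `log_loss_single` is a hypothetical helper name, and the small `eps` guard against `log(0)` is an added assumption.

```python
import math

def log_loss_single(y, y_hat, eps=1e-8):
    """Binary cross-entropy for one example; eps guards against log(0)."""
    return -(y * math.log(y_hat + eps) + (1 - y) * math.log(1 - y_hat + eps))

# The true label is 1 in every case; only the predicted probability changes.
for y_hat in (0.9, 0.6, 0.4, 0.1):
    print(f"y_hat = {y_hat}: loss = {log_loss_single(1, y_hat):.4f}")
```

A confident correct prediction (`y_hat = 0.9`) costs about 0.11, while a confident wrong one (`y_hat = 0.1`) costs about 2.30, roughly twenty times more.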

---

### Step 4: Gradient Descent to Update Parameters

We compute the gradients of the loss with respect to `m` and `b`, then update them:

$$
\frac{\partial \text{Loss}}{\partial m} = \frac{1}{n} \sum_{i=1}^{n} ( \hat{y}_i - y_i ) \cdot x_i
$$

$$
\frac{\partial \text{Loss}}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} ( \hat{y}_i - y_i )
$$
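
These expressions follow from the chain rule. As a short derivation sketch, write $\ell_i$ (notation introduced here for illustration) for the loss on a single example, with $z_i = m \cdot x_i + b$ and $\hat{y}_i = \sigma(z_i)$:

$$
\frac{\partial \ell_i}{\partial \hat{y}_i} = \frac{\hat{y}_i - y_i}{\hat{y}_i (1 - \hat{y}_i)}, \qquad
\frac{\partial \hat{y}_i}{\partial z_i} = \hat{y}_i (1 - \hat{y}_i), \qquad
\frac{\partial z_i}{\partial m} = x_i, \qquad
\frac{\partial z_i}{\partial b} = 1
$$

Multiplying the factors, the $\hat{y}_i (1 - \hat{y}_i)$ terms cancel:

$$
\frac{\partial \ell_i}{\partial m} = ( \hat{y}_i - y_i ) \cdot x_i, \qquad
\frac{\partial \ell_i}{\partial b} = \hat{y}_i - y_i
$$

Averaging over the $n$ examples gives the two gradients above.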

Then update:

$$
m = m - \alpha \cdot \frac{\partial \text{Loss}}{\partial m}, \quad b = b - \alpha \cdot \frac{\partial \text{Loss}}{\partial b}
$$
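
One way to gain confidence in the gradient formulas is to compare them with finite-difference estimates and then take a single update step. The sketch below does this on the small study-hours dataset; the starting values of `m` and `b`, the step `h`, and the helper names are arbitrary choices made for illustration.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def mean_log_loss(m, b, xs, ys):
    """Average binary cross-entropy over the dataset."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(m * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)

def analytic_gradients(m, b, xs, ys):
    """Gradients from the closed-form expressions: mean of (y_hat - y) * x and (y_hat - y)."""
    n = len(xs)
    dm = sum((sigmoid(m * x + b) - y) * x for x, y in zip(xs, ys)) / n
    db = sum((sigmoid(m * x + b) - y) for x, y in zip(xs, ys)) / n
    return dm, db

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]
m, b = 0.3, -0.2      # arbitrary starting point
h = 1e-6              # finite-difference step
alpha = 0.1           # learning rate

dm, db = analytic_gradients(m, b, xs, ys)

# Central differences approximate the same partial derivatives numerically.
dm_num = (mean_log_loss(m + h, b, xs, ys) - mean_log_loss(m - h, b, xs, ys)) / (2 * h)
db_num = (mean_log_loss(m, b + h, xs, ys) - mean_log_loss(m, b - h, xs, ys)) / (2 * h)
print(f"dm: analytic = {dm:.6f}, numeric = {dm_num:.6f}")
print(f"db: analytic = {db:.6f}, numeric = {db_num:.6f}")

# One gradient descent step should lower the loss for a small enough alpha.
loss_before = mean_log_loss(m, b, xs, ys)
m_new, b_new = m - alpha * dm, b - alpha * db
loss_after = mean_log_loss(m_new, b_new, xs, ys)
print(f"loss before = {loss_before:.4f}, after one step = {loss_after:.4f}")
```

If the analytic and numeric values disagree noticeably, the gradient code (or the loss it is paired with) most likely contains a mistake.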



In [1]:
import math
import random

In [2]:
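# ---------- Training Data ----------
# Small hand-made example: study hours vs. fail (0) / pass (1)
x_train = [0, 1, 2, 3, 4, 5]     # Feature (e.g., study hours)
y_train = [0, 0, 0, 1, 1, 1]     # Labels (e.g., fail/pass)

# NOTE: the lines below overwrite the lists above with 100 random samples.
# The labels are drawn independently of x, so there is no real pattern to learn.
lst = [1, 2, 3, 4, 5]
x_train = [random.choice(lst) for _ in range(100)]      # Feature: hours of study
y_train = [random.choice([0, 1]) for _ in range(100)]   # Labels: random 0/1, unrelated to x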

In [3]:
# ---------- Initialize Parameters ----------
m = 0.0   # weight
b = 0.0   # bias

learning_rate = 0.1
epochs = 10
n = len(x_train)

In [4]:
# ---------- Sigmoid Function ----------
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

In [5]:
# ---------- Training Loop ----------
for epoch in range(epochs):
    total_loss = 0
    dm = 0
    db = 0

    for i in range(n):
        z = m * x_train[i] + b
        y_pred = sigmoid(z)

        # Compute loss (just to monitor)
        loss = - (y_train[i] * math.log(y_pred + 1e-8) + (1 - y_train[i]) * math.log(1 - y_pred + 1e-8))
        total_loss += loss

        # Compute gradients
        error = y_pred - y_train[i]
        dm += error * x_train[i]
        db += error

    # Average gradients and update parameters
    m -= learning_rate * (dm / n)
    b -= learning_rate * (db / n)

    # Log every 100 epochs (with epochs = 10, only epoch 0 is printed)
    if epoch % 100 == 0:
        avg_loss = total_loss / n
        print(f"Epoch {epoch}: Loss = {avg_loss:.4f}, m = {m:.4f}, b = {b:.4f}")

print(f"\nFinal model: sigmoid({m:.2f}x + {b:.2f})")

Epoch 0: Loss = 0.6931, m = -0.0200, b = -0.0020

Final model: sigmoid(-0.07x + 0.01)
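
Once `m` and `b` are fitted, a class label is usually obtained by thresholding the predicted probability. The sketch below assumes the common 0.5 cutoff, which the steps above do not specify; `predict_proba`, `predict`, and the demo parameters are hypothetical and chosen only so that the decision boundary falls at `x = 2.5`.

```python
import math

def predict_proba(x, m, b):
    """Predicted probability of class 1 for input x."""
    return 1 / (1 + math.exp(-(m * x + b)))

def predict(x, m, b, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if predict_proba(x, m, b) >= threshold else 0

# Illustrative parameters (not the trained values): boundary where m*x + b = 0
m_demo, b_demo = 1.5, -3.75

for hours in (1, 2, 3, 4):
    p = predict_proba(hours, m_demo, b_demo)
    print(f"hours = {hours}: p = {p:.3f}, class = {predict(hours, m_demo, b_demo)}")
```

With a 0.5 threshold, the predicted class flips exactly where `m * x + b` changes sign, so inputs below 2.5 hours are labeled 0 and inputs above are labeled 1 in this example.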


## 4. Why Each Step Matters

| Step                  | What It Does                                        | Why It's Needed                           |
| --------------------- | --------------------------------------------------- | ----------------------------------------- |
| `sigmoid(z)`          | Converts raw score to a probability between 0 and 1 | Enables classification                    |
| `log loss`            | Measures how wrong the predicted probabilities are  | Provides the objective training minimizes |
| `gradients`           | Tells us the direction to adjust `m` and `b`        | Minimizes the loss                        |
| `learning_rate`       | Controls step size                                  | Prevents overshooting or slow convergence |
| `looping over epochs` | Repeated updates                                    | Allows learning from data gradually       |
