### Start by importing some libraries

In [2]:
import random

### Step 1: **Understanding the Goal**

We want to learn a **linear relationship** between input (`x`) and output (`y`):

$$
\text{Prediction: } \hat{y} = m \cdot x + b
$$

Our **goal** is to find the best values for:

* `m` (slope)
* `b` (intercept)

So that predictions (`ŷ`) are **as close as possible** to the actual values `y`.
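As a minimal sketch of the prediction formula, the helper below evaluates `m * x + b` for a single input. The name `predict` and the sample values are illustrative assumptions, not part of the notebook's own cells.

```python
def predict(x_value, m_value, b_value):
    """Return the linear model's prediction m * x + b for one input."""
    return m_value * x_value + b_value

# Illustrative values only: 10 points per hour of study, starting from 40.
print(predict(3, m_value=10.0, b_value=40.0))  # 10 * 3 + 40 = 70.0
```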

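For this example, we'll consider a sample dataset generated with random functions, in which we predict a test score from the number of hours studied. Note that `x` and `y` are drawn independently here, so the data contains no real relationship between the two; it only serves to exercise the algorithm.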

In [3]:
lst = [1, 2, 3, 4, 5]        
x = [random.choice(lst) for _ in range(100)]        # Hours of study
y = [random.randint(25,100) for _ in range(100)]    # Test scores
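To watch gradient descent recover a known line, one option is to generate scores that actually depend on hours. The sketch below is an assumption-laden alternative, not the notebook's dataset: the slope `TRUE_M`, intercept `TRUE_B`, noise range, seed, and the names `x_lin` and `y_lin` are all made up for illustration. It uses its own `random.Random` instance so the global random state is left alone.

```python
import random

rng = random.Random(0)          # separate, seeded generator (assumed seed)
TRUE_M, TRUE_B = 10.0, 40.0     # hypothetical "true" slope and intercept

x_lin = [rng.choice([1, 2, 3, 4, 5]) for _ in range(100)]   # hours of study
# score = TRUE_M * hours + TRUE_B, plus uniform noise in [-5, 5]
y_lin = [TRUE_M * xi + TRUE_B + rng.uniform(-5, 5) for xi in x_lin]

print(x_lin[:5], [round(yi, 1) for yi in y_lin[:5]])
```

Training on `x_lin` and `y_lin` instead of `x` and `y` should move `m` and `b` toward the assumed values, given enough epochs.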

### Step 2: **Start with an Initial Guess**

We start by setting `m = 0` and `b = 0`. At this point, our model has **no knowledge**, so predictions will be poor.

**Why?**

* This is a **common practice** in training models. You start with a guess, then improve using data.

In [4]:
# Initialize parameters
m = 0.0  # slope
b = 0.0  # intercept

# Hyperparameters
learning_rate = 0.01
epochs = 10

n = len(x)

### Step 3: **Measure How Bad the Predictions Are (Loss Function)**

We need a way to measure **how far off** our predictions are from the actual values.

We use **Mean Squared Error (MSE)**:

$$
\text{MSE} = \frac{1}{n} \sum (y_i - \hat{y}_i)^2
$$

Where:

* $y_i$ is the actual value
* $\hat{y}_i = m x_i + b$ is the predicted value

```
y_pred = [m * xi + b for xi in x]
error = [y[i] - y_pred[i] for i in range(n)]
```

**Why MSE?**

* It penalizes **large errors** more than small ones (because of squaring)
* It is smooth and differentiable (important for gradient descent)
* It has a clear geometric meaning: the average squared vertical distance from points to the line
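The snippet above stops at the per-point errors; the following sketch finishes the calculation and illustrates the first point in the list. The function name `mse` and the toy numbers are assumptions chosen for illustration. Both prediction sets have the same total absolute error (12), but the one that concentrates it in a single point gets a larger MSE.

```python
def mse(y_true, y_hat):
    """Mean squared error between two equal-length sequences."""
    return sum((yt - yh) ** 2 for yt, yh in zip(y_true, y_hat)) / len(y_true)

actual = [50, 60, 70]
spread_out = [54, 64, 74]   # three errors of 4 each
one_big = [50, 60, 82]      # a single error of 12

print(mse(actual, spread_out))  # (16 + 16 + 16) / 3 = 16.0
print(mse(actual, one_big))     # (0 + 0 + 144) / 3 = 48.0
```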

### Step 4: **Use Gradient Descent to Improve**

Now that we can measure the error, we want to **reduce it**.

**Gradient Descent** is an algorithm that:

* Measures the **slope of the loss** with respect to the model parameters
* Updates the parameters to **reduce the loss**

We take the **partial derivatives** of the loss with respect to `m` and `b`.

---

### Step 5: **Derive the Gradients**

#### Gradient with respect to `m`:

$$
\frac{\partial \text{MSE}}{\partial m} = -\frac{2}{n} \sum x_i (y_i - \hat{y}_i)
$$

#### Gradient with respect to `b`:

$$
\frac{\partial \text{MSE}}{\partial b} = -\frac{2}{n} \sum (y_i - \hat{y}_i)
$$

These tell us **how to change** `m` and `b` to **decrease the loss**.

```
dm = (-2/n) * sum([x[i] * error[i] for i in range(n)])
db = (-2/n) * sum(error)
```
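One way to gain confidence in these formulas is to compare them with a numerical estimate of the slope of the loss. The sketch below uses a central finite difference; the helper names, the toy data, and the step size `h` are assumptions made for illustration.

```python
def mse_loss(m_val, b_val, xs, ys):
    """MSE of the line y = m_val * x + b_val on the given points."""
    return sum((yi - (m_val * xi + b_val)) ** 2 for xi, yi in zip(xs, ys)) / len(xs)

def analytic_gradients(m_val, b_val, xs, ys):
    """Gradients of MSE with respect to m and b, using the formulas above."""
    n_points = len(xs)
    errors = [yi - (m_val * xi + b_val) for xi, yi in zip(xs, ys)]
    dm_val = (-2 / n_points) * sum(xi * ei for xi, ei in zip(xs, errors))
    db_val = (-2 / n_points) * sum(errors)
    return dm_val, db_val

xs_demo, ys_demo = [1, 2, 3, 4], [52, 61, 68, 83]   # made-up points
m0, b0, h = 0.5, 1.0, 1e-6                          # arbitrary test point and step

dm_val, db_val = analytic_gradients(m0, b0, xs_demo, ys_demo)
# Central differences: (f(p + h) - f(p - h)) / (2h) for each parameter
dm_num = (mse_loss(m0 + h, b0, xs_demo, ys_demo) - mse_loss(m0 - h, b0, xs_demo, ys_demo)) / (2 * h)
db_num = (mse_loss(m0, b0 + h, xs_demo, ys_demo) - mse_loss(m0, b0 - h, xs_demo, ys_demo)) / (2 * h)

print(f"dm: analytic={dm_val:.4f}, numeric={dm_num:.4f}")
print(f"db: analytic={db_val:.4f}, numeric={db_num:.4f}")
```

If the two columns agree to several decimal places, the derivation and the code are consistent with each other.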
    

### Step 6: **Update Parameters**

We update `m` and `b` using a small step in the opposite direction of the gradient:

$$
m = m - \alpha \cdot \frac{\partial \text{MSE}}{\partial m}
$$

$$
b = b - \alpha \cdot \frac{\partial \text{MSE}}{\partial b}
$$

Where `α` (alpha) is the **learning rate** – a small constant like `0.01`.

**Why learning rate?**

* Controls how big each update is
* Too large: might overshoot the minimum
* Too small: slow learning
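These three cases can be seen on a deliberately simple stand-in problem. The sketch below minimizes `f(w) = w**2` (gradient `2 * w`) rather than the MSE from this notebook; the function name `descend` and the learning rates tried are assumptions chosen for illustration.

```python
def descend(learning_rate_demo, steps=20, w_start=1.0):
    """Run gradient descent on f(w) = w**2, whose gradient is 2 * w."""
    w = w_start
    for _ in range(steps):
        w -= learning_rate_demo * 2 * w
    return w

for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: w after 20 steps = {descend(lr):.4g}")
```

With the smallest rate, `w` has barely moved from 1 after 20 steps; with 0.1 it is close to the minimum at 0; with 1.1 each step overshoots and `w` grows in magnitude instead of shrinking.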


### Step 7: **Repeat for Many Epochs**

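A larger number of epochs gives the parameters more time to settle, for example:

```
epochs = 1000
```

The output recorded below shows only `Epoch 0`, which is consistent with the loop having run with `epochs = 10` from the earlier hyperparameter cell rather than 1000.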

We keep repeating:

1. Predict
2. Calculate loss (MSE)
3. Calculate gradients
4. Update `m` and `b`

After many iterations, the model gradually **learns better parameters**, and the loss decreases.

In [5]:
for epoch in range(epochs):

    # Predict with the current m and b (y_pred = m * x + b),
    # then compute the per-point errors (actual - predicted)
    y_pred = [m * xi + b for xi in x]
    error = [y[i] - y_pred[i] for i in range(n)]
    
    # Compute gradients
    dm = (-2/n) * sum([x[i] * error[i] for i in range(n)])
    db = (-2/n) * sum(error)
    
    # Update parameters
    m -= learning_rate * dm
    b -= learning_rate * db

    # Print the loss every 100 epochs. The loss is computed from y_pred,
    # i.e. before this epoch's update, while m and b are the updated values.
    if epoch % 100 == 0:
        loss = sum([(y[i] - y_pred[i]) ** 2 for i in range(n)]) / n
        print(f"Epoch {epoch}: Loss={loss:.4f}, m={m:.4f}, b={b:.4f}")

print(f"\nFinal model: y = {m:.2f}x + {b:.2f}")

Epoch 0: Loss=5261.0400, m=4.3022, b=1.3808

Final model: y = 15.62x + 6.53


### Final Step: **Converge to Best Line**

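With enough epochs and a suitable learning rate, the updates become negligible and the model converges (or stops improving significantly). At this point:

* `m` and `b` approximate the best-fit line, meaning the line that minimizes MSE on the training data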
* You can now **predict y for any x**
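"Stops improving significantly" can be turned into an explicit stopping rule. The sketch below is one possible version, not the notebook's loop: the function name `fit_line`, the tolerance `tol`, the default learning rate, and the noise-free toy data are all assumptions made for illustration.

```python
def fit_line(xs, ys, learning_rate=0.05, max_epochs=10_000, tol=1e-12):
    """Fit y = m*x + b by gradient descent on MSE.

    Stops early once the loss improves by less than `tol` between epochs.
    """
    m_fit, b_fit = 0.0, 0.0
    n_points = len(xs)
    prev_loss = float("inf")
    epochs_run = 0
    for _ in range(max_epochs):
        errors = [yi - (m_fit * xi + b_fit) for xi, yi in zip(xs, ys)]
        loss = sum(e ** 2 for e in errors) / n_points
        if prev_loss - loss < tol:
            break  # improvement is negligible (or the loss went up): stop
        prev_loss = loss
        dm_fit = (-2 / n_points) * sum(xi * e for xi, e in zip(xs, errors))
        db_fit = (-2 / n_points) * sum(errors)
        m_fit -= learning_rate * dm_fit
        b_fit -= learning_rate * db_fit
        epochs_run += 1
    return m_fit, b_fit, epochs_run

# Noise-free toy data lying exactly on y = 2x + 1
m_fit, b_fit, epochs_run = fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
print(f"m={m_fit:.3f}, b={b_fit:.3f}, stopped after {epochs_run} epochs")
```

On this toy data the fitted values should land very close to 2 and 1, well before `max_epochs` is reached. The default learning rate was chosen for these small `x` values; other data may need a smaller one.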

---

### Summary of Steps

| Step | What Happens                            | Why It Matters                            |
| ---- | --------------------------------------- | ----------------------------------------- |
| 1    | Initialize `m=0`, `b=0`                 | Start with a baseline                     |
| 2    | Make predictions `ŷ = mx + b`          | Compute model output                      |
| 3    | Calculate MSE loss                      | Measure how bad predictions are           |
| 4    | Compute gradients (∂MSE/∂m and ∂MSE/∂b) | Find how to change weights to reduce loss |
| 5    | Update `m` and `b` using gradients      | Learn better values                       |
| 6    | Repeat many times                       | Gradually converge to optimal solution    |


In [6]:
lst = [1, 2, 3, 4, 5]        
x_test = [random.choice(lst) for _ in range(10)]        # Hours of study
y_test_actual = [random.randint(25,100) for _ in range(10)]    # Ground truth (made-up for this example)



In [10]:
# Predict using the learned model
y_test_pred = [round(m * xi + b,3) for xi in x_test]


In [11]:
# Calculate MSE on test data
mse_test = sum([(y_test_actual[i] - y_test_pred[i])**2 for i in range(len(x_test))]) / len(x_test)



In [12]:
print("\n--- TEST RESULTS ---")
print(f"Final learned model: y = {m:.2f}x + {b:.2f}")
print("Test Predictions:", y_test_pred)
print("Actual Values:   ", y_test_actual)
print(f"Test MSE: {mse_test:.4f}")


--- TEST RESULTS ---
Final learned model: y = 15.62x + 6.53
Test Predictions: [53.407, 69.032, 22.158, 53.407, 22.158, 37.782, 69.032, 22.158, 53.407, 37.782]
Actual Values:    [96, 63, 29, 30, 47, 37, 30, 44, 60, 80]
Test MSE: 688.9383
