### **1. Introduction: What Are We Doing?**

We’re combining three things:

- **Linear Regression Model** 📈: Fits a straight line to the training data and uses that line to make predictions.
- **Squared Error Cost Function** 🧮: Measures how far off the predictions are from the actual values.
- **Gradient Descent** ⚙️: An algorithm to minimize the cost function by adjusting the parameters $ w $ (weight) and $ b $ (bias).

### **2. The Goal**

We want to train the linear regression model so it fits the data by using:

- The **cost function** to measure the error.
- **Gradient descent** to reduce that error and find the best $ w $ and $ b $.

---

### **3. Linear Regression Model and Cost Function**

#### **Linear Regression Model** 📊

The formula for a linear regression model is:

$$
f(w, b, x) = w \cdot x + b
$$

Here:

- $ w $: Weight (how much $ x $ affects the prediction).
- $ b $: Bias (shifts the line up or down).
- $ x $: Input feature (data point).
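To make the formula concrete, here is a minimal Python sketch of the model. The function name `predict` and the parameter values are hypothetical, chosen only for illustration.

```python
def predict(x, w, b):
    """Return the model's prediction f(w, b, x) = w * x + b for one input."""
    return w * x + b


# Hypothetical parameter values, picked only to show the arithmetic.
w, b = 2.0, 1.0
predictions = [predict(x, w, b) for x in [0.0, 1.0, 2.0]]
print(predictions)  # [1.0, 3.0, 5.0]
```

Changing `w` tilts the line, while changing `b` moves every prediction up or down by the same amount.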

#### **Squared Error Cost Function** 🧾

The cost function measures how far predictions are from actual values. The formula is:

$$
J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f(w, b, x^i) - y^i \right)^2
$$

Here:

- $ m $: Number of training examples.
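- $ y^i $: Actual value of the $ i $-th training example (the superscript $ i $ is an index, not an exponent).
- $ f(w, b, x^i) $: Predicted value for the $ i $-th training example.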

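The $ \frac{1}{2} $ is included because it cancels the factor of 2 that appears when the square is differentiated. Scaling the cost by a constant does not change which $ w $ and $ b $ minimize it.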
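The cost formula can be sketched in a few lines of Python. The function name `compute_cost` and the tiny dataset are assumptions made for this example; the data was generated from $ y = 2x + 1 $ so the expected costs are easy to check by hand.

```python
def compute_cost(xs, ys, w, b):
    """Squared error cost J(w, b) = (1 / 2m) * sum((f(w, b, x) - y)^2)."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        error = (w * x + b) - y  # predicted value minus actual value
        total += error ** 2
    return total / (2 * m)


# Hypothetical training data that lies exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

print(compute_cost(xs, ys, 2.0, 1.0))  # 0.0, the line fits perfectly
print(compute_cost(xs, ys, 0.0, 0.0))  # (9 + 25 + 49) / 6, about 13.83
```

A cost of zero means every prediction matches its target; the further the line is from the data, the larger the cost.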

---

### **4. Gradient Descent**

Gradient descent helps find the best $ w $ and $ b $ by minimizing the cost function $ J(w, b) $.

#### **Update Rules**

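The parameters $ w $ and $ b $ are updated simultaneously using these formulas (both derivatives are evaluated at the current $ w $ and $ b $ before either parameter changes):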

$$
w = w - \alpha \cdot \frac{\partial J}{\partial w}, \quad b = b - \alpha \cdot \frac{\partial J}{\partial b}
$$

- $ \alpha $: Learning rate (step size).
- $ \frac{\partial J}{\partial w} $: Partial derivative of $ J $ with respect to $ w $.
- $ \frac{\partial J}{\partial b} $: Partial derivative of $ J $ with respect to $ b $.
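A single update can be sketched as follows. The function name `update_step` and the gradient values passed to it are hypothetical; in practice the two derivatives would be computed from the training data at the current $ w $ and $ b $.

```python
def update_step(w, b, dj_dw, dj_db, alpha):
    """Apply one gradient descent update to w and b.

    Both new values are computed from the old parameters and the gradients
    that were evaluated at those old parameters, then returned together.
    """
    new_w = w - alpha * dj_dw
    new_b = b - alpha * dj_db
    return new_w, new_b


# Hypothetical gradients: negative values mean the cost falls as w and b grow,
# so the update increases both parameters.
w, b = update_step(w=0.0, b=0.0, dj_dw=-4.0, dj_db=-2.0, alpha=0.1)
print(w, b)  # 0.4 0.2
```

The learning rate scales the step: doubling `alpha` doubles how far each parameter moves for the same gradient.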

---

### **5. Calculating Derivatives**

Applying the chain rule to $ J(w, b) $ gives the following partial derivatives:

#### **Derivative with Respect to $ w $:**

$$
\frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left( f(w, b, x^i) - y^i \right) \cdot x^i
$$

- $ f(w, b, x^i) - y^i $: Error term (difference between predicted and actual values).
- $ x^i $: Input feature.
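For readers who want the intermediate steps, here is a sketch of the derivation. It substitutes $ f(w, b, x^i) = w \cdot x^i + b $ into the cost function and applies the chain rule; the factor of 2 from the square cancels the $ \frac{1}{2} $ in front.

$$
\begin{aligned}
\frac{\partial J}{\partial w}
&= \frac{\partial}{\partial w} \, \frac{1}{2m} \sum_{i=1}^{m} \left( w \cdot x^i + b - y^i \right)^2 \\
&= \frac{1}{2m} \sum_{i=1}^{m} 2 \left( w \cdot x^i + b - y^i \right) \cdot x^i \\
&= \frac{1}{m} \sum_{i=1}^{m} \left( f(w, b, x^i) - y^i \right) \cdot x^i
\end{aligned}
$$

The same steps apply to $ b $, except that the inner derivative of $ w \cdot x^i + b - y^i $ with respect to $ b $ is 1 rather than $ x^i $.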

#### **Derivative with Respect to $ b $:**

$$
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( f(w, b, x^i) - y^i \right)
$$

- This is similar to the formula for $ w $, but without the $ x^i $ term.

These formulas are plugged into the gradient descent algorithm to compute updates for $ w $ and $ b $.
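Both derivative formulas can be computed in one pass over the data. This is a minimal sketch; the function name `compute_gradients` and the dataset (generated from $ y = 2x + 1 $) are assumptions made for illustration.

```python
def compute_gradients(xs, ys, w, b):
    """Return (dJ/dw, dJ/db) for the squared error cost of f(w, b, x) = w * x + b."""
    m = len(xs)
    dj_dw = 0.0
    dj_db = 0.0
    for x, y in zip(xs, ys):
        error = (w * x + b) - y  # error term shared by both derivatives
        dj_dw += error * x       # the w derivative multiplies the error by x
        dj_db += error           # the b derivative uses the error alone
    return dj_dw / m, dj_db / m


# Hypothetical training data that lies exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

print(compute_gradients(xs, ys, 0.0, 0.0))  # both negative: increase w and b
print(compute_gradients(xs, ys, 2.0, 1.0))  # (0.0, 0.0) at the minimum
```

At the best-fit parameters both derivatives are zero, so the update rules leave $ w $ and $ b $ unchanged.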

---

### **6. Convex Cost Function and Global Minimum**

The squared error cost function is **convex** 🥣 (bowl-shaped). This means:

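- It has **one global minimum** (the lowest point) and no other local minima.
- Gradient descent converges to this point as long as the learning rate $ \alpha $ is small enough. A learning rate that is too large can overshoot the minimum and make the cost grow instead.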

Unlike cost functions with multiple local minima, this one carries no risk of gradient descent getting stuck in a suboptimal local minimum.
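The following sketch illustrates both points on a small example. The function name `gradient_descent`, the dataset (generated from $ y = 2x + 1 $), the starting points, and the learning rates are all assumptions chosen for this demonstration.

```python
def gradient_descent(xs, ys, w, b, alpha, steps):
    """Run a fixed number of gradient descent steps and return the final (w, b)."""
    m = len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        dj_dw = sum(e * x for e, x in zip(errors, xs)) / m
        dj_db = sum(errors) / m
        # Simultaneous update: both gradients were computed from the old w and b.
        w, b = w - alpha * dj_dw, b - alpha * dj_db
    return w, b


# Hypothetical training data that lies exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

# Two very different starting points end up at the same minimum.
from_zeros = gradient_descent(xs, ys, w=0.0, b=0.0, alpha=0.1, steps=5000)
from_far = gradient_descent(xs, ys, w=-10.0, b=10.0, alpha=0.1, steps=5000)
print(from_zeros)  # close to (2.0, 1.0)
print(from_far)    # close to (2.0, 1.0)

# For this particular data, alpha = 0.5 is too large: each step overshoots
# and the parameters move further from the minimum.
diverged = gradient_descent(xs, ys, w=0.0, b=0.0, alpha=0.5, steps=20)
print(diverged)
```

The starting point only affects how long convergence takes, while the learning rate decides whether it happens at all.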

---

### **7. Algorithm Recap**

Here’s the **gradient descent algorithm** step-by-step:

1. Initialize $ w $ and $ b $ (randomly or as zeros).
2. Compute the cost $ J(w, b) $.
3. Calculate derivatives $ \frac{\partial J}{\partial w} $ and $ \frac{\partial J}{\partial b} $.
4. Update $ w $ and $ b $ using the formulas:
   $$
   w = w - \alpha \cdot \frac{\partial J}{\partial w}, \quad b = b - \alpha \cdot \frac{\partial J}{\partial b}
   $$
5. Repeat steps 2–4 until the cost $ J(w, b) $ stops decreasing (convergence).
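The five steps can be put together in a short Python sketch. The helper names, the `tolerance` used to decide that the cost has stopped decreasing, the `max_steps` safety limit, and the dataset are all assumptions made for this example, not part of the algorithm itself.

```python
def compute_cost(xs, ys, w, b):
    """Squared error cost J(w, b)."""
    m = len(xs)
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def compute_gradients(xs, ys, w, b):
    """Partial derivatives (dJ/dw, dJ/db)."""
    m = len(xs)
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    dj_dw = sum(e * x for e, x in zip(errors, xs)) / m
    dj_db = sum(errors) / m
    return dj_dw, dj_db


def train(xs, ys, alpha=0.1, tolerance=1e-12, max_steps=100_000):
    """Fit w and b by gradient descent; stop when the cost barely decreases."""
    w, b = 0.0, 0.0                                       # step 1: initialize
    cost = compute_cost(xs, ys, w, b)                     # step 2: compute cost
    for _ in range(max_steps):
        dj_dw, dj_db = compute_gradients(xs, ys, w, b)    # step 3: derivatives
        w, b = w - alpha * dj_dw, b - alpha * dj_db       # step 4: update
        new_cost = compute_cost(xs, ys, w, b)
        if cost - new_cost < tolerance:                   # step 5: converged?
            cost = new_cost
            break
        cost = new_cost
    return w, b, cost


# Hypothetical training data that lies exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

w, b, final_cost = train(xs, ys)
print(w, b, final_cost)  # w close to 2, b close to 1, cost close to 0
```

The `max_steps` limit is a safety net: if the learning rate were too large, the cost would never settle and the loop would otherwise run forever.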

---

### **Summary**

- The **linear regression model** predicts $ y $ using $ f(w, b, x) = w \cdot x + b $.
- The **cost function** $ J(w, b) $ measures prediction errors.
- **Gradient descent** minimizes $ J(w, b) $ by adjusting $ w $ and $ b $.
- The partial derivatives $ \frac{\partial J}{\partial w} $ and $ \frac{\partial J}{\partial b} $ follow from the chain rule and tell gradient descent how to adjust each parameter.
- The cost function is convex, so gradient descent with a suitable learning rate converges to the global minimum.

---

### **Interactive Note**

Fill in the blanks and check your answers!

1. The formula for a linear regression model is $ f(w, b, x) = \_ \cdot x + \_ $.
2. The cost function $ J(w, b) $ measures the **\_\_\_\_** between predicted and actual values.
3. The derivative of $ J(w, b) $ with respect to $ w $ includes the **\_\_\_\_** term $ x^i $, but the derivative with respect to $ b $ does not.
4. A **\_\_\_\_** cost function ensures gradient descent, with a suitable learning rate, converges to the global minimum.
5. The two key parameters updated in gradient descent are $ \_\_\_\_$ and $ \_\_\_\_$.

---

### **Answer Key**

1. $ w $, $ b $
2. error/difference
3. input feature
4. convex
5. $ w $, $ b $
