## **Gradient Descent Intuition**

#### **1. What is Gradient Descent?**

Gradient descent is a method used in machine learning to adjust parameters (like $ w $ and $ b $) to minimize a cost function. The cost function measures how far the model's predictions are from the true values, so a lower cost means a better fit. The goal of gradient descent is to find the lowest point of the cost function, meaning the values of $ w $ and $ b $ that make predictions as accurate as possible.
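To make the idea of a cost function concrete, here is a minimal sketch. It assumes a linear model $ f(x) = wx + b $ and a squared-error cost with a $ \frac{1}{2m} $ factor; neither is specified in these notes, and the function name `cost` and the toy data are made up for illustration.

```python
def cost(w, b, xs, ys):
    """Squared-error cost for the assumed model f(x) = w * x + b."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = w * x + b
        total += (prediction - y) ** 2
    # The 1/(2m) factor is a common convention, assumed here.
    return total / (2 * m)


# Toy data that lies exactly on the line y = 2x (illustrative only).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

cost_good = cost(2.0, 0.0, xs, ys)  # parameters that fit the data perfectly
cost_poor = cost(1.0, 0.0, xs, ys)  # parameters that fit the data badly

print(cost_good, cost_poor)
```

The parameters that match the data give a cost of zero, while the poorer choice gives a larger cost. Gradient descent searches for parameters that push this number down.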

---

#### **2. Understanding the Learning Rate $ \alpha $**

- The symbol $ \alpha $ (alpha) is called the **learning rate**.
- The learning rate controls **how big of a step** we take when updating the parameters $ w $ and $ b $.
- If $ \alpha $ is too big, we might **overshoot** the minimum. The updates can then bounce back and forth across it, or even move farther away with each step, and never settle at the lowest point.
- If $ \alpha $ is too small, the steps will be tiny, making the learning process **very slow**.
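The effect of the learning rate can be seen in a small experiment. The sketch below assumes the example cost $ J(w) = w^2 $ (minimum at $ w = 0 $, derivative $ 2w $); the function name `run_gradient_descent` and the three values of $ \alpha $ are chosen only for illustration.

```python
def run_gradient_descent(alpha, w_start=1.0, steps=5):
    """Apply w = w - alpha * dJ/dw for the example cost J(w) = w**2."""
    w = w_start
    history = [w]
    for _ in range(steps):
        slope = 2 * w  # derivative of w**2
        w = w - alpha * slope
        history.append(w)
    return history


small = run_gradient_descent(alpha=0.01)     # tiny steps, barely moves
moderate = run_gradient_descent(alpha=0.25)  # halves w on every step
large = run_gradient_descent(alpha=1.5)      # overshoots and grows

print(small[-1])  # about 0.904, still far from the minimum at 0
print(moderate)   # [1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125]
print(large)      # [1.0, -2.0, 4.0, -8.0, 16.0, -32.0]
```

With the large learning rate, $ w $ jumps across the minimum on every step and ends up farther away each time. With the tiny one, $ w $ has barely moved after five steps.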

---

#### **3. What is the Derivative?**

- The **derivative** tells us the slope of a function at a particular point.
- The notation $ \frac{d}{dw} $ means **taking the derivative** of the function with respect to $ w $.
- The derivative tells us two things:
  - Its **sign** tells us whether to **increase** or **decrease** $ w $.
  - Its **magnitude**, scaled by $ \alpha $, sets **how big of a step** we take.

When the cost function depends on more than one parameter (such as both $ w $ and $ b $), the precise term is **partial derivative**, written $ \frac{\partial}{\partial w} $. For now, we’ll just call it a derivative.
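One way to build intuition for the slope is to estimate it numerically. The sketch below assumes the example cost $ J(w) = (w - 3)^2 $, whose minimum is at $ w = 3 $, and uses a central difference to approximate the derivative. The helper `numerical_slope` is a hypothetical name, not something from these notes.

```python
def J(w):
    """Example cost function with its minimum at w = 3 (an assumption)."""
    return (w - 3) ** 2


def numerical_slope(f, w, h=1e-6):
    """Approximate the derivative of f at w with a central difference."""
    return (f(w + h) - f(w - h)) / (2 * h)


slope_right = numerical_slope(J, 5.0)  # right of the minimum: positive slope
slope_left = numerical_slope(J, 1.0)   # left of the minimum: negative slope
slope_min = numerical_slope(J, 3.0)    # at the minimum: slope close to zero

print(slope_right, slope_left, slope_min)
```

The sign of the estimated slope flips as we cross the minimum, and the slope is approximately zero at the minimum itself.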

---

#### **4. How Gradient Descent Works Step by Step**

Let's simplify the problem and assume we only have **one parameter**, $ w $, and a cost function $ J(w) $.

The update rule in gradient descent is:

$$
w = w - \alpha \times \frac{d}{dw} J(w)
$$

This means:

1. Take the current value of $ w $.
2. Subtract the learning rate $ \alpha $ **times** the slope of the cost function at that point.
3. Repeat the process until $ w $ stops changing noticeably, which indicates we are at (or very near) a minimum.
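The three steps above can be sketched as a short loop. This is a minimal illustration, assuming the example cost $ J(w) = (w - 3)^2 $ with derivative $ 2(w - 3) $. The stopping rule (`tolerance`, `max_steps`) is one reasonable choice and is not prescribed by these notes.

```python
def dJ_dw(w):
    """Derivative of the example cost J(w) = (w - 3)**2."""
    return 2 * (w - 3)


def gradient_descent(w, alpha=0.1, tolerance=1e-8, max_steps=1000):
    """Repeat w = w - alpha * dJ/dw until the update becomes negligible."""
    for step in range(max_steps):
        update = alpha * dJ_dw(w)  # learning rate times the slope
        w = w - update             # the update rule
        if abs(update) < tolerance:
            break                  # w has effectively stopped changing
    return w, step + 1


w_final, steps_taken = gradient_descent(w=0.0)
print(w_final, steps_taken)
```

Starting from $ w = 0 $, the loop ends with $ w $ very close to 3, the minimum of this example cost, well before the step limit is reached.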

---

![A](images/L14_1.png)

---

#### **5. Visualizing Gradient Descent with a Graph**

##### **Case 1: When We Start from the Right Side**

- Imagine a graph where the **horizontal axis** is $ w $ and the **vertical axis** is $ J(w) $ (cost function), with a bowl-shaped curve.
- We start from a point to the **right** of the minimum.
- The **slope of the tangent line** at that point is **positive**.
- A **positive slope** means:
  - $ \frac{d}{dw} J(w) > 0 $ (the cost is increasing as $ w $ increases).
  - Using the update rule: $ w = w - \alpha \times \text{(positive number)} $.
  - This **decreases** $ w $, moving left towards the minimum.
  - The cost function $ J(w) $ decreases, which is what we want.

##### **Case 2: When We Start from the Left Side**

- If we start on the **left side**, the slope of the tangent line is **negative**.
- A **negative slope** means:
  - $ \frac{d}{dw} J(w) < 0 $ (the cost is decreasing as $ w $ increases).
  - Using the update rule: $ w = w - \alpha \times \text{(negative number)} $.
  - Subtracting a negative number is the same as **adding** a positive number.
  - This **increases** $ w $, moving right towards the minimum.
  - Again, $ J(w) $ decreases, showing gradient descent is working correctly.
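Both cases can be checked with a single update step each. The sketch below assumes the example cost $ J(w) = w^2 $ (minimum at $ w = 0 $) and a learning rate of 0.25; the starting points are chosen only for illustration.

```python
def J(w):
    """Example bowl-shaped cost with its minimum at w = 0 (an assumption)."""
    return w ** 2


def dJ_dw(w):
    """Derivative of the example cost."""
    return 2 * w


alpha = 0.25

# Case 1: start to the right of the minimum, where the slope is positive.
w_right = 4.0
w_right_new = w_right - alpha * dJ_dw(w_right)  # 4 - 0.25 * 8 = 2.0

# Case 2: start to the left of the minimum, where the slope is negative.
w_left = -4.0
w_left_new = w_left - alpha * dJ_dw(w_left)     # -4 - 0.25 * (-8) = -2.0

print(w_right_new, J(w_right), J(w_right_new))  # 2.0 16.0 4.0
print(w_left_new, J(w_left), J(w_left_new))     # -2.0 16.0 4.0
```

From the right, $ w $ decreases; from the left, $ w $ increases. In both cases the cost drops from 16 to 4.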

---

![A](images/L14_2.png)

---

**Key observation**:  
For a bowl-shaped cost function like this one, and a learning rate that is not too large, gradient descent moves us towards the **minimum** no matter which side we start from.

---

#### **6. Why This Process Works**

- The derivative **tells us the direction** we should move.
- The learning rate **controls the step size** so we don't move too fast or too slow.
- Gradient descent **repeats this process** until $ w $ stops changing noticeably, which happens near a minimum, where the slope is close to zero.

---

#### **Summary**

1. **Gradient Descent** is a method to find the best values of parameters by minimizing a cost function.
2. The **learning rate $ \alpha $** determines how big each step is.
3. The **derivative** tells us whether to increase or decrease $ w $.
4. If the **slope is positive**, gradient descent **decreases $ w $**.
5. If the **slope is negative**, gradient descent **increases $ w $**.
6. The process **continues until the updates become negligible**, meaning $ w $ has settled at a minimum of the cost function.
