## Week 2: Optimization Algorithms

### Mini-batch Gradient Descent

Mini-batch gradient descent is a highly effective optimization algorithm used to train neural networks faster, especially with large datasets (big data regime).

#### The Problem with Batch Gradient Descent

* **Batch Gradient Descent (BGD):** This is the traditional method where you process the **entire training set** ($m$ examples) to compute one gradient step.
* **Slow for Big Data:** If $m$ is very large (e.g., 5 million), BGD is extremely slow because you must process all 5 million examples before taking even a single weight update step.

#### Mini-Batch Gradient Descent (MBGD)

* **Core Idea:** MBGD splits the large training set into smaller, more manageable subsets called **mini-batches**. The algorithm then performs a gradient descent step on each mini-batch sequentially.
* **Notation:**
    * Superscript curly braces **$\{t\}$** are used to index the mini-batch: **$X^{\{t\}}$** and **$Y^{\{t\}}$**.
    * If a dataset has $m=5,000,000$ examples and each mini-batch contains 1,000 examples, you will have $T=5,000$ mini-batches.
* **The Algorithm:**
    * An **Epoch** is defined as one single pass through the entire training set.
    * In a single epoch, a **For loop** runs through all mini-batches ($t=1$ to $T$).
    * For each mini-batch ($X^{\{t\}}, Y^{\{t\}}$):
        1.  **Forward Propagation:** Run forward prop on $X^{\{t\}}$. (Vectorization is used to process all examples in the mini-batch efficiently).
        2.  **Compute Cost:** Compute the cost $J^{\{t\}}$ on the mini-batch.
        3.  **Backpropagation:** Compute gradients $dW$ and $dB$ using only $X^{\{t\}}$ and $Y^{\{t\}}$.
        4.  **Update Weights:** Update $W$ and $B$ using the gradients and the learning rate $\alpha$.
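
The four steps above can be sketched in NumPy for a deliberately simple model. This is a minimal illustration under stated assumptions, not the course's implementation: the linear model, the squared-error cost, the lowercase `b` for the bias, and the names `run_epoch`, `batch_size`, and `lr` are all hypothetical choices made for the example.

```python
import numpy as np

def run_epoch(X, Y, W, b, batch_size, lr):
    """One pass over the data, taking one gradient step per mini-batch.

    X has shape (n_features, m) and Y has shape (1, m): one column per example.
    The model is an assumed linear regressor, Y_hat = W @ X + b.
    """
    m = X.shape[1]
    costs = []
    for start in range(0, m, batch_size):
        X_t = X[:, start:start + batch_size]   # mini-batch X^{t}
        Y_t = Y[:, start:start + batch_size]   # mini-batch Y^{t}
        m_t = X_t.shape[1]

        Y_hat = W @ X_t + b                    # 1. forward propagation (vectorized)
        err = Y_hat - Y_t
        costs.append(float(np.sum(err ** 2) / (2 * m_t)))  # 2. cost J^{t}

        dW = err @ X_t.T / m_t                 # 3. gradients from this mini-batch only
        db = np.sum(err, axis=1, keepdims=True) / m_t

        W = W - lr * dW                        # 4. parameter update
        b = b - lr * db
    return W, b, costs

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 1000))
true_W = np.array([[2.0, -1.0, 0.5]])
Y = true_W @ X + 4.0                           # noiseless synthetic targets

W, b = np.zeros((1, 3)), np.zeros((1, 1))
for epoch in range(20):                        # 20 epochs x 10 mini-batches = 200 updates
    W, b, costs = run_epoch(X, Y, W, b, batch_size=100, lr=0.1)
```

With 1,000 examples and a batch size of 100, each call to `run_epoch` performs 10 parameter updates, whereas batch gradient descent would perform one.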

#### Key Advantages

* **Speed:** MBGD allows the algorithm to start making progress and taking weight update steps after processing only a small fraction of the data (e.g., 1,000 examples), rather than waiting for the entire dataset (5 million examples).
* **Vectorization Efficiency:** Even though the batches are "mini," the use of **vectorization** within each mini-batch ensures high computational efficiency, leveraging parallel processing capabilities.
* **Progress Per Epoch:** With BGD, one epoch results in **one** gradient step. With MBGD, one epoch results in $T$ (e.g., **5,000**) gradient steps, drastically accelerating training.

### Understanding Mini-batch Gradient

#### Cost Function Behavior

* **Batch Gradient Descent (BGD):** The cost function $J$ is calculated over the entire dataset and is expected to **decrease monotonically** on every iteration. If it ever goes up, something is wrong; the most common cause is a learning rate ($\alpha$) that is too high.
* **Mini-Batch Gradient Descent (MBGD):** The cost $J^{\{t\}}$ is calculated only on the current mini-batch ($X^{\{t\}}, Y^{\{t\}}$). Since each mini-batch is a different subset of data (some are "easier," some are "harder"), the cost will **oscillate but should trend downwards** over time. It's normal for the cost to increase on individual iterations.

#### Choosing the Mini-Batch Size

The mini-batch size is a critical hyperparameter that determines the algorithm's behavior. The size must be neither too small nor too large.

| Mini-Batch Size | Algorithm Name | Pros | Cons |
| :---: | :---: | :--- | :--- |
| **$m$** (Entire dataset) | **Batch Gradient Descent (BGD)** | Low noise; each step moves steadily toward the minimum (given a suitable learning rate). | **Too slow** per iteration (must process all $m$ examples). |
| **1** (Single example) | **Stochastic Gradient Descent (SGD)** | Fastest progress per example (1 update/example). | **High noise**; may not converge, instead oscillating around the minimum; **loses vectorization speedup** (inefficient processing). |
| **$1 < \text{size} < m$** | **Mini-Batch Gradient Descent (MBGD)** | **Fastest learning in practice.** Leverages vectorization speedup while taking frequent updates. | **Oscillates** (but trends down); requires hyperparameter tuning. |

#### Building mini-batches from the training set $(X,Y)$

There are two steps:
- **Shuffle**: Create a shuffled version of the training set (X, Y) as shown below. Each column of X and Y represents a training example. The random shuffling is done synchronously between X and Y, so that after the shuffling the $i^{th}$ column of X is still the example corresponding to the $i^{th}$ label in Y. The shuffling step ensures that examples are split randomly into different mini-batches.

![Mini batch shuffle](images/kiank_shuffle.png)

- **Partition**: Partition the shuffled (X, Y) into mini-batches of size `mini_batch_size` (here 64). The number of training examples is not always divisible by `mini_batch_size`, so the last mini-batch may be smaller than the others; this is harmless. In that case the partition looks like this:

![Mini batch partition](images/kiank_partition.png)
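
The shuffle and partition steps can be sketched as follows. This is one possible implementation, assuming examples are stored as columns; the function name `random_mini_batches` and the `seed` parameter are illustrative choices, not something fixed by the text above.

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle the columns of X and Y together, then cut them into mini-batches."""
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)            # one permutation, applied to both X and Y
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]

    batches = []
    for start in range(0, m, mini_batch_size):
        end = start + mini_batch_size    # slicing past the end yields a shorter last batch
        batches.append((X_shuf[:, start:end], Y_shuf[:, start:end]))
    return batches

X = np.arange(2 * 148).reshape(2, 148).astype(float)
Y = X[0:1, :] * 10                       # label tied to its example, so pairing can be checked
batches = random_mini_batches(X, Y, mini_batch_size=64)
```

With 148 examples and a mini-batch size of 64, this produces two full mini-batches and a final one with 20 examples. Using a single permutation for both arrays is what keeps each example paired with its label.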

#### Practical Guidelines for Mini-Batch Size

1.  **Small Dataset ($m \le 2000$):** Use **Batch Gradient Descent** (mini-batch size = $m$). Vectorization is fast enough, and you get the benefit of stable updates.
2.  **Large Dataset ($m > 2000$):** Use **Mini-Batch Gradient Descent**.
    * **Typical Sizes:** Common values are powers of 2, ranging from **64 to 512** (e.g., 64, 128, 256, 512). Powers of 2 often allow for faster processing due to how computer memory and GPUs are structured.
    * **Memory Constraint:** The mini-batch must fit entirely within your CPU/GPU memory. If it doesn't, performance will drop sharply.
3.  **Hyperparameter Tuning:** The mini-batch size is a hyperparameter that should be tuned to find the value that results in the fastest convergence for your specific problem.

### Exponentially Weighted Averages

**Exponentially Weighted Averages (EWA)**, also known as Exponentially Weighted Moving Averages, are a crucial mathematical tool behind the faster optimization algorithms used in deep learning.

#### 1. Definition and Purpose
* **Definition:** EWA is a way to calculate a moving average of a time series (like daily temperature) where more recent data points are given exponentially greater weight than older data points.
* **Notation:** The formula to compute the average for a given day $t$ ($V_t$) is:
    $$V_t = \beta V_{t-1} + (1 - \beta) \theta_t$$
    where:
    * $\mathbf{V_t}$ is the exponentially weighted average at day $t$.
    * $\mathbf{V_{t-1}}$ is the average from the previous day.
    * $\mathbf{\theta_t}$ is the measured value (e.g., temperature) on day $t$.
    * $\mathbf{\beta}$ (beta) is the weight parameter (typically between 0 and 1).
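
As a small sketch of the recursion, the snippet below computes $V_1, \dots, V_n$ for a short series. The starting value $V_0 = 0$, the function name `ewa`, and the temperature values are assumptions made for the example.

```python
def ewa(values, beta):
    """Return the running exponentially weighted averages V_1..V_n, with V_0 = 0."""
    v = 0.0
    out = []
    for theta in values:
        v = beta * v + (1 - beta) * theta   # V_t = beta * V_{t-1} + (1 - beta) * theta_t
        out.append(v)
    return out

temps = [40.0, 49.0, 45.0, 44.0, 50.0]      # made-up daily temperatures
averages = ewa(temps, beta=0.9)
```

Here the first value is $0.1 \times 40 = 4$ and the second is $0.9 \times 4 + 0.1 \times 49 = 8.5$. Both sit far below the data because the average starts from zero.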

#### 2. Interpretation of the Beta ($\beta$) Parameter
* **Averaging Window:** $V_t$ can be roughly thought of as the average temperature over the last $\mathbf{\frac{1}{1 - \beta}}$ days.
* **Effect of $\beta$ on the Average:**

| $\beta$ Value | Averaging Window ($\approx \frac{1}{1 - \beta}$ days) | Resulting Curve | Characteristics |
| :--- | :--- | :--- | :--- |
| **Medium (e.g., 0.9)** | 10 days | **Smoother (Red Line)** | Good balance; represents a reasonably accurate trend. |
| **High (e.g., 0.98)** | 50 days | **Very Smooth (Green Line)** | High $\beta$ means more weight on $V_{t-1}$ (older data). Curve has **more latency** and adapts **slowly** to changes. |
| **Low (e.g., 0.5)** | 2 days | **Very Noisy (Yellow Line)** | Low $\beta$ means less weight on $V_{t-1}$. Curve is more **susceptible to outliers** but adapts **quickly** to changes. |

![EWA](images/ewa_example.png)

#### 3. Application in Optimization
* EWA allows algorithms to track the general direction of the gradient over time, filtering out high-frequency noise and leading to faster, more efficient convergence.
* By varying the $\beta$ parameter (which acts as a **hyperparameter** in learning algorithms), practitioners can tune the trade-off between smoothness and responsiveness.

#### **Bonus:** Derivation $\frac{1}{1-\beta}$ to approximate averaging window

The term $1/(1-\beta)$ is a widely used **approximation** for the averaging window. It doesn't represent a precise, fixed-size window like a simple average, but rather the effective number of days whose temperature significantly contributes to the current $V_t$.

Here is how the approximation $1/(1-\beta)$ is derived:

#### The Derivation of $1/(1-\beta)$

The core idea is to determine how many terms $\theta_t$ (the daily values) are needed for their contribution to the current average, $V_t$, to drop to a negligible level. We typically define "negligible" as the point where the weight is $\frac{1}{e}$, or roughly **$36.8\%$** of the starting weight.

#### 1. The EWA Formula

Let's look at the expanded formula for $V_t$. By recursively substituting $V_{t-1}$, $V_{t-2}$, and so on, we see the weight assigned to any past value $\theta_{t-k}$ is:

$$\text{Weight} = (1 - \beta) \beta^k$$

Where:
* $k$ is the number of steps (days) in the past.
* $\theta_{t-k}$ is the value $k$ days ago.

#### 2. Finding the Drop-Off Point

We want to find the number of days $k$ at which the decay factor $\beta^k$ (the weight on $\theta_{t-k}$ relative to the weight on the newest value $\theta_t$) drops to the threshold $\frac{1}{e}$.

$$\beta^k \approx \frac{1}{e}$$

#### 3. Using the Natural Log

To solve for $k$, we take the natural logarithm ($\ln$) of both sides:

$$\ln(\beta^k) \approx \ln\left(\frac{1}{e}\right)$$

$$k \cdot \ln(\beta) \approx -1$$

$$k \approx \frac{-1}{\ln(\beta)}$$

#### 4. The Final Approximation

For small values of $(1-\beta)$ (which is true when $\beta$ is close to 1, like 0.9 or 0.98), the following Taylor series approximation holds:

$$\ln(\beta) = \ln(1 - (1-\beta)) \approx -(1-\beta)$$

Substituting this back into the formula for $k$:

$$k \approx \frac{-1}{-(1-\beta)}$$

$$k \approx \frac{1}{1-\beta}$$


#### Conclusion

The formula **$1/(1-\beta)$** is derived from calculating the time it takes for the weight of a past data point to decay to $\frac{1}{e}$ (about $36.8\%$) of its original value, using a common mathematical approximation for $\ln(\beta)$.

It's a useful rule of thumb because it directly links the hyperparameter $\beta$ to a conceptual averaging window:

* If $\beta = 0.9$: $1/(1-0.9) = 1/0.1 = 10$ (Averaging over $\approx 10$ days)
* If $\beta = 0.98$: $1/(1-0.98) = 1/0.02 = 50$ (Averaging over $\approx 50$ days)
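
To see how good the approximation is, the exact solution $k = -1/\ln(\beta)$ can be compared with $1/(1-\beta)$ numerically. The helper name `window_estimates` is hypothetical; the snippet is only a quick check, not part of any algorithm.

```python
import math

def window_estimates(beta):
    """Return (exact k, rule-of-thumb k, beta raised to the rule-of-thumb k)."""
    exact_k = -1.0 / math.log(beta)      # solves beta**k == 1/e exactly
    approx_k = 1.0 / (1.0 - beta)        # rule-of-thumb averaging window
    return exact_k, approx_k, beta ** approx_k

for beta in (0.5, 0.9, 0.98):
    exact_k, approx_k, weight = window_estimates(beta)
    print(f"beta={beta}: exact k={exact_k:.2f}, 1/(1-beta)={approx_k:.0f}, "
          f"beta**k={weight:.3f}")
```

For $\beta = 0.9$ the exact value is about 9.5 against the rule-of-thumb 10, and for $\beta = 0.98$ it is about 49.5 against 50. For $\beta = 0.5$ the gap is larger (about 1.4 against 2), as expected, since the Taylor approximation assumes $\beta$ is close to 1.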

### Understanding Exponentially Weighted Averages

Here, we discuss the mathematical intuition behind the Exponentially Weighted Averages (EWA) formula, explaining why it's called "exponentially weighted" and how the $\beta$ parameter dictates the effective averaging window.

#### 1. Mathematical Breakdown of EWA

* **Recursive Expansion:** By repeatedly substituting the formula $V_t = \beta V_{t-1} + (1 - \beta) \theta_t$, the current average ($V_t$) can be expressed as a sum of all past data points ($\theta_k$) multiplied by decaying weights.
* **Exponential Decay:** The weight assigned to a data point $k$ days ago ($\theta_{t-k}$) is proportional to **$\beta^k$**. Since $\beta < 1$, this factor decays exponentially as $k$ increases (i.e., as the data point gets older).
* **Weighted Average:** $V_t$ is effectively a weighted sum of past $\theta$ values, where the weights decay exponentially. This confirms why it's called an **exponentially weighted average**.
* **Coefficient Sum:** Up to a detail called "bias correction" (discussed in the next section), the coefficients of all past $\theta$ terms sum to 1, confirming $V_t$ is indeed an average.
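
Written out explicitly, and assuming the average starts from $V_0 = 0$, the recursion unrolls to:

$$V_t = (1-\beta)\,\theta_t + (1-\beta)\beta\,\theta_{t-1} + (1-\beta)\beta^2\,\theta_{t-2} + \dots + (1-\beta)\beta^{t-1}\,\theta_1 = (1-\beta)\sum_{k=0}^{t-1} \beta^k\,\theta_{t-k}$$

The coefficients form a geometric series, so their sum is:

$$(1-\beta)\sum_{k=0}^{t-1} \beta^k = 1 - \beta^t$$

This approaches 1 as $t$ grows. The shortfall $\beta^t$ at small $t$ is exactly the gap that bias correction accounts for.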

#### 2. Intuition for the $1/(1-\beta)$ Averaging Window

* **The $1/e$ Rule:** The rule of thumb that $V_t$ averages over $\approx \frac{1}{1-\beta}$ days comes from observing how quickly the weight $\beta^k$ decays.
* **Decay Time:** It takes approximately $k = \frac{1}{1-\beta}$ steps for the weight $\beta^k$ to decay to about $\frac{1}{e}$ (where $e \approx 2.718$), which is roughly $36.8\%$ of the original weight.
    * Example: If $\beta = 0.9$, then $\frac{1}{1-0.9} = 10$. After 10 days, the weight given to that older temperature is $0.9^{10} \approx 0.35$ times the weight given to the current day's temperature, which is close to $\frac{1}{e} \approx 0.37$.
* **Rule of Thumb:** This gives a useful (though not formal) way to interpret $\beta$: a higher $\beta$ means a larger window and a smoother average (e.g., $\beta=0.98 \rightarrow 50$ days).

#### 3. Implementation Efficiency (Why EWA is Used in ML)

* **Low Memory Usage:** EWA is extremely efficient because it only requires storing **one single variable** ($V$) in memory to compute the average. It continuously overwrites this single value.
* **Computational Efficiency:** It is very fast to compute, taking only a single line of code.
* **Trade-off:** While explicitly summing over a moving window might yield a slightly better statistical average, the extreme memory and computational efficiency of EWA make it the preferred choice when tracking averages for millions of parameters in deep learning.

### Bias Correction in Exponentially Weighted Averages

We introduce **bias correction**, a refinement that makes Exponentially Weighted Averages (EWA) more accurate, particularly during the initial phase of tracking.

#### 1. The Bias Problem in EWA
* **Initial Bias:** When using the standard EWA formula ($V_t = \beta V_{t-1} + (1 - \beta) \theta_t$) and initializing $V_0 = 0$, the initial estimates ($V_1, V_2, \dots$) are significantly **lower** than the true value. This is called **bias**.
    * *Example:* If $\beta = 0.98$ and $\theta_1 = 40$, then $V_1 = 0.02 \times 40 = 0.8$. This $0.8$ is a poor, highly biased estimate of the first day's temperature.
* **Visual Effect:** The calculated average (the "purple curve") starts off very low and only gradually rises to meet the true average (the desired "green curve").

![EWA bias correction](images/ewa_example2.png)


#### 2. The Bias Correction Formula
* **Correction Method:** To remove this initial bias, the EWA estimate $V_t$ is divided by a correction factor.
* **Bias-Corrected Estimate:**
    $$\text{Corrected } V_t = \frac{V_t}{1 - \beta^t}$$
    where $t$ is the current time step (day or iteration number).

#### 3. How Bias Correction Works
* **Initial Stages ($t$ is small):** When $t$ is small, the denominator $1 - \beta^t$ is a small number (close to 0). Dividing $V_t$ by this small number significantly **inflates** the initial estimate, making it more accurate and closer to the true value (shifting the purple line to the green line).
* **Later Stages ($t$ is large):** As $t$ grows large, $\beta^t$ approaches 0 (since $\beta < 1$). The denominator $1 - \beta^t$ approaches 1, meaning the correction factor essentially disappears.
* **Practical Use in ML:** In many machine learning implementations, practitioners often **skip bias correction** because they can tolerate the slightly biased estimates for the first few initial iterations, after which the difference becomes negligible. However, if accuracy is required from the very start, bias correction is necessary.
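
The effect of the correction is easiest to see on a constant signal, where the true average is known. The sketch below assumes $V_0 = 0$ and counts $t$ from 1; the function name `ewa_bias_corrected` and the data are made up for the example.

```python
def ewa_bias_corrected(values, beta):
    """Return (raw, corrected) EWA sequences, with V_0 = 0 and t counted from 1."""
    v = 0.0
    raw, corrected = [], []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        raw.append(v)
        corrected.append(v / (1 - beta ** t))   # divide by 1 - beta^t
    return raw, corrected

temps = [40.0] * 5                 # constant signal, so the true average is 40
raw, corrected = ewa_bias_corrected(temps, beta=0.98)
```

The raw estimate starts at $0.02 \times 40 = 0.8$ and is still below 4 after five steps, while the corrected estimate equals 40 at every step (up to floating-point rounding).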

### Gradient Descent with Momentum

We introduce the **Momentum** optimization algorithm, a method that significantly speeds up standard Gradient Descent by dampening oscillations and prioritizing convergence toward the minimum.

#### 1. The Problem with Standard Gradient Descent
* **Oscillations:** When optimizing a cost function with steep, elongated contours (like an ellipse or a "bowl"), standard Gradient Descent takes a jagged path, oscillating heavily along the vertical (steep) axis.
* **Slow Learning:** These oscillations force the use of a small learning rate ($\alpha$) to prevent divergence, which makes the convergence along the horizontal (shallow) axis very slow.
* **Goal:** The desired solution is to slow down learning vertically while accelerating learning horizontally.

#### 2. The Momentum Algorithm
The core idea is to calculate an **Exponentially Weighted Average (EWA) of the gradients** and use this average to update the weights.

On iteration $t$, after computing the current gradients ($dW$ and $dB$):

1.  **Compute EWA of Gradients (Velocity):**
    $$\mathbf{v}_{dW} = \beta \mathbf{v}_{dW} + (1 - \beta) dW$$
    $$\mathbf{v}_{dB} = \beta \mathbf{v}_{dB} + (1 - \beta) dB$$
2.  **Update Parameters:**
    $$W = W - \alpha \cdot \mathbf{v}_{dW}$$
    $$B = B - \alpha \cdot \mathbf{v}_{dB}$$

#### 3. How Momentum Solves the Problem
* **Dampening Oscillations:** In the vertical (steep) direction, the gradients often alternate signs (up, then down). The EWA effectively averages these opposite signs out, resulting in a **small $\mathbf{v}_{dW}$** vertically.
* **Accelerating Convergence:** In the horizontal (shallow) direction, the gradients consistently point toward the minimum. The EWA averages these consistent directions, resulting in a **large $\mathbf{v}_{dW}$** horizontally.
* **Net Result:** The update vector takes a more direct, efficient path toward the minimum, allowing for a larger learning rate and faster convergence.

#### 4. Implementation Details and Hyperparameters
* **Initialization:** Initialize velocity terms ($\mathbf{v}_{dW}, \mathbf{v}_{dB}$) to a matrix/vector of zeros with the same dimensions as the parameters.
* **Hyperparameters:**
    * **Learning Rate ($\alpha$):** Still needs to be tuned.
    * **$\beta$ (EWA weight):** Controls the averaging window. The most common, robust default value is **$\mathbf{0.9}$** (averaging over $\approx 10$ iterations).
* **Bias Correction:** Bias correction ($\div (1 - \beta^t)$) is typically **not used** because the moving average quickly "warms up" after a few iterations (e.g., 10 iterations) and the bias becomes negligible.
* **Formula Variation:** Some literature omits the $(1-\beta)$ term in the EWA formula, which requires re-tuning the learning rate $\alpha$. The version including $(1-\beta)$ is often preferred for clarity and stability.
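
A minimal NumPy sketch of the update, using the version that keeps the $(1-\beta)$ term and no bias correction. The dictionary layout and the name `momentum_step` are assumptions for illustration. The two hand-picked gradients mimic a steep direction whose sign flips and a shallow direction whose sign stays the same.

```python
import numpy as np

def momentum_step(params, grads, velocity, lr=0.01, beta=0.9):
    """Apply one momentum update to every entry of `params` (dicts are modified)."""
    for key in params:
        velocity[key] = beta * velocity[key] + (1 - beta) * grads[key]
        params[key] = params[key] - lr * velocity[key]
    return params, velocity

params = {"W": np.array([1.0, 1.0])}
velocity = {"W": np.zeros(2)}            # initialized to zeros, same shape as W

# First coordinate: consistent gradient. Second coordinate: sign alternates.
for g in (np.array([1.0, 10.0]), np.array([1.0, -10.0])):
    params, velocity = momentum_step(params, {"W": g}, velocity, lr=0.1)
```

After the two steps the velocity is $[0.19, -0.1]$: the consistent component has accumulated, while the alternating component has largely cancelled even though its raw gradients were ten times larger.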

### RMSprop

We introduce **RMSprop (Root Mean Square Propagation)**, an optimization algorithm that speeds up Gradient Descent by adapting the learning rate for each parameter, specifically slowing down updates in directions that oscillate excessively.

#### 1. The Goal of RMSprop
* **Problem:** Standard Gradient Descent suffers from large, slow oscillations in directions where the cost function is steep (high gradient), which limits the overall learning rate ($\alpha$) that can be used.
* **Solution:** RMSprop aims to **slow down learning** in the oscillating (steep) directions and **accelerate/maintain learning** in the steady (shallow) directions.

#### 2. The RMSprop Mechanism
RMSprop uses an Exponentially Weighted Average (EWA) of the **squared gradients** to adapt the learning rate for each parameter.

1.  **Compute EWA of Squared Gradients:** For each parameter (e.g., $W$ and $B$), an EWA of the squared gradient is tracked. The squaring is an **element-wise operation**.
    $$\mathbf{S}_{dW} = \beta_2 \mathbf{S}_{dW} + (1 - \beta_2) dW^2$$
    $$\mathbf{S}_{dB} = \beta_2 \mathbf{S}_{dB} + (1 - \beta_2) dB^2$$
    *(Note: The weight $\beta$ is often called $\beta_2$ to distinguish it from the $\beta$ used in Momentum).*

2.  **Update Parameters:** The standard gradient update is scaled by the square root of the averaged squared gradients.
    $$W = W - \alpha \frac{dW}{\sqrt{\mathbf{S}_{dW}} + \epsilon}$$
    $$B = B - \alpha \frac{dB}{\sqrt{\mathbf{S}_{dB}} + \epsilon}$$

#### 3. Intuition and Effect
* **Adaptive Learning Rate:** The algorithm divides the update by $\sqrt{\mathbf{S}}$, effectively creating an adaptive learning rate for each parameter.
* **Dampening Oscillations:** In the steep (oscillating) direction (e.g., the $B$ direction), the gradient $dB$ is large. Therefore, $S_{dB}$ becomes large. Dividing the update $\alpha \cdot dB$ by a large $\sqrt{\mathbf{S}_{dB}}$ **damps down** the vertical oscillations.
* **Accelerating Progress:** In the shallow direction (e.g., the $W$ direction), the gradient $dW$ is small, so $S_{dW}$ remains small. Dividing the update by a small $\sqrt{\mathbf{S}_{dW}}$ **maintains or accelerates** horizontal progress.
* **Result:** The convergence path is smoother, faster, and allows for a larger overall learning rate $\alpha$.

#### 4. Implementation Details
* **Hyperparameter:** $\beta_2$ is the EWA weight (commonly $0.9$ or $0.999$).
* **Numerical Stability:** A small constant $\mathbf{\epsilon}$ (e.g., $10^{-8}$) is added to the denominator ($\sqrt{\mathbf{S}} + \epsilon$) to prevent division by zero or an extremely small number.
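
The update can be sketched in NumPy as below. The dictionary layout, the name `rmsprop_step`, and the example gradient are assumptions; the single step shown starts from $S = 0$ without bias correction.

```python
import numpy as np

def rmsprop_step(params, grads, s, lr=0.01, beta2=0.9, eps=1e-8):
    """Apply one RMSprop update to every entry of `params` (dicts are modified)."""
    for key in params:
        s[key] = beta2 * s[key] + (1 - beta2) * grads[key] ** 2   # element-wise square
        params[key] = params[key] - lr * grads[key] / (np.sqrt(s[key]) + eps)
    return params, s

params = {"W": np.array([1.0, 1.0])}
s = {"W": np.zeros(2)}
grads = {"W": np.array([0.1, 10.0])}     # shallow direction vs. steep direction
params, s = rmsprop_step(params, grads, s, lr=0.01)
step = np.array([1.0, 1.0]) - params["W"]
```

Although the two gradient components differ by a factor of 100, both coordinates move by the same amount (about 0.032), because each one is divided by the square root of its own averaged squared gradient.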

### Adam Optimization Algorithm

The **Adam (Adaptive Moment Estimation)** algorithm is one of the most effective and widely adopted optimization techniques in deep learning. It achieves rapid convergence by combining the benefits of **Momentum** and **RMSprop**.
 
#### 1. The Adam Algorithm (per iteration $t$)

| Step | Purpose | Formula for Weights ($\mathbf{W}$) | Formula for Biases ($\mathbf{b}$) |
| :--- | :--- | :--- | :--- |
| **Initialization** | Initialize the first and second moment vectors to zero. | $\mathbf{V}_{dW} = 0, \mathbf{S}_{dW} = 0$ | $\mathbf{V}_{db} = 0, \mathbf{S}_{db} = 0$ |
| **1. Momentum Update** (First Moment) | Compute the Exponentially Weighted Average (EWA) of the gradients. | $\mathbf{V}_{dW} = \beta_1 \mathbf{V}_{dW} + (1 - \beta_1) dW$ | $\mathbf{V}_{db} = \beta_1 \mathbf{V}_{db} + (1 - \beta_1) db$ |
| **2. RMSprop Update** (Second Moment) | Compute the EWA of the **squared** gradients (element-wise). | $\mathbf{S}_{dW} = \beta_2 \mathbf{S}_{dW} + (1 - \beta_2) dW^2$ | $\mathbf{S}_{db} = \beta_2 \mathbf{S}_{db} + (1 - \beta_2) db^2$ |
| **3. Bias Correction (First Moment)** | Correct the bias in the first moment estimate. | $\mathbf{V}_{dW}^{\text{corrected}} = \mathbf{V}_{dW}/(1 - \beta_1^t)$ | $\mathbf{V}_{db}^{\text{corrected}} = \mathbf{V}_{db}/(1 - \beta_1^t)$ |
| **4. Bias Correction (Second Moment)** | Correct the bias in the second moment estimate. | $\mathbf{S}_{dW}^{\text{corrected}} = \mathbf{S}_{dW}/(1 - \beta_2^t)$ | $\mathbf{S}_{db}^{\text{corrected}} = \mathbf{S}_{db}/(1 - \beta_2^t)$ |
| **5. Parameter Update** | Update the parameter using the adaptive step size. | $W = W - \alpha \frac{\mathbf{V}_{dW}^{\text{corrected}}}{\sqrt{\mathbf{S}_{dW}^{\text{corrected}}} + \epsilon}$ | $b = b - \alpha \frac{\mathbf{V}_{db}^{\text{corrected}}}{\sqrt{\mathbf{S}_{db}^{\text{corrected}}} + \epsilon}$ |

#### 2. Hyperparameters and Default Values

Adam is effective because it often works well with robust default settings, minimizing the need for extensive tuning of all parameters except the learning rate ($\alpha$).

| Hyperparameter | Description | Recommended Default Value |
| :--- | :--- | :--- |
| **$\alpha$** | **Learning Rate** | **Must be tuned** (the most important hyperparameter). |
| **$\beta_1$** | **Momentum EWA Weight** (First Moment) | **$0.9$** |
| **$\beta_2$** | **RMSprop EWA Weight** (Second Moment) | **$0.999$** |
| **$\epsilon$** | **Numerical Stability Term** | **$10^{-8}$** |
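
Putting the update steps and the default values together, one Adam iteration might look like the sketch below. The dictionary layout and the name `adam_step` are illustrative assumptions, and the default `lr=0.001` is only a placeholder since $\alpha$ has to be tuned.

```python
import numpy as np

def adam_step(params, grads, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `t` is the 1-based iteration count used for bias correction."""
    for key in params:
        v[key] = beta1 * v[key] + (1 - beta1) * grads[key]         # first moment
        s[key] = beta2 * s[key] + (1 - beta2) * grads[key] ** 2    # second moment
        v_hat = v[key] / (1 - beta1 ** t)                          # bias correction
        s_hat = s[key] / (1 - beta2 ** t)
        params[key] = params[key] - lr * v_hat / (np.sqrt(s_hat) + eps)
    return params, v, s

params = {"W": np.array([1.0, -2.0]), "b": np.array([0.5])}
v = {k: np.zeros_like(p) for k, p in params.items()}
s = {k: np.zeros_like(p) for k, p in params.items()}
grads = {"W": np.array([0.1, -10.0]), "b": np.array([3.0])}
params, v, s = adam_step(params, grads, v, s, t=1)
```

On the very first iteration the bias-corrected moments reduce to the gradient and its square, so every parameter moves by almost exactly `lr` in the direction opposite its gradient, regardless of the gradient's magnitude.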

#### 3. Naming and Rationale

  * **Adam** stands for **Adaptive Moment Estimation**.
      * The $\mathbf{V}$ term (EWA of gradients) is the estimate of the **first moment** (the mean).
      * The $\mathbf{S}$ term (EWA of squared gradients) is the estimate of the **second moment** (related to variance).
  * **Effectiveness:** Adam has become a standard, highly recommended optimizer due to its proven ability to converge quickly and robustly across a wide variety of deep learning architectures and problems.

### Learning Rate Decay

**Learning Rate Decay** is a technique used in optimization to gradually reduce the learning rate ($\alpha$) over time to ensure better convergence.

#### 1. Rationale for Learning Rate Decay
* **Convergence Noise:** When using Mini-Batch Gradient Descent (MBGD) with a fixed learning rate $\alpha$, the optimization path doesn't precisely converge to the minimum. Instead, it **wanders or oscillates** around the minimum due to the noise introduced by different mini-batches.
* **Tighter Convergence:** By slowly reducing $\alpha$ over time, the algorithm is forced to take smaller, slower steps later in training. This causes the oscillations to become much tighter, allowing the parameters to settle **closer to the true minimum** rather than wandering far away.
* **Fast Initial Learning:** A large $\alpha$ is maintained during the initial phases for fast learning, but it decreases when learning approaches convergence.

#### 2. Common Learning Rate Decay Formulas
The learning rate $\alpha$ is often calculated based on the **epoch number** (or mini-batch number).

* **Standard (Inverse Time) Decay:**
    $$\alpha = \frac{1}{1 + \text{decay\_rate} \cdot \text{epoch\_num}} \cdot \alpha_0$$
    * **$\alpha_0$** is the initial learning rate (a hyperparameter to tune).
    * **Decay Rate** is a new hyperparameter that controls how quickly $\alpha$ decreases (also needs to be tuned).
    
    With only a few epochs this works well, but when the number of epochs is large, $\alpha$ shrinks toward zero so quickly that the parameters effectively stop updating. One common fix is to decay the learning rate only once every fixed number of epochs (the time interval) instead of every epoch. This is called fixed interval scheduling:

    $$\alpha = \frac{1}{1 + \text{decay\_rate} \cdot \left\lfloor \text{epoch\_num} / \text{time\_interval} \right\rfloor} \cdot \alpha_0$$

* **Exponential Decay:**
    $$\alpha = 0.95^{\text{epoch\_num}} \cdot \alpha_0$$
    (The base $0.95$ is only an example; any number less than 1 can be used.)
* **Other Forms:**
    * $\alpha \propto \frac{1}{\sqrt{\text{epoch\_num}}}$
    * **Discrete Staircase:** $\alpha$ is held constant for several epochs and then suddenly dropped (e.g., cut in half).
* **Manual Decay:** In cases where only a small number of models are trained over many days, the user may manually decrease $\alpha$ after observing a slowdown in learning progress.
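
The formula-based schedules above can be written as small functions. The function names are hypothetical, and the interval version assumes that `epoch_num / time_interval` is rounded down, so the rate changes only once per interval.

```python
def inverse_time_decay(alpha0, epoch_num, decay_rate):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

def scheduled_decay(alpha0, epoch_num, decay_rate, time_interval):
    """Inverse time decay applied once every `time_interval` epochs (assumed floor)."""
    return alpha0 / (1 + decay_rate * (epoch_num // time_interval))

def exponential_decay(alpha0, epoch_num, base=0.95):
    """alpha = base**epoch_num * alpha0, with base < 1."""
    return alpha0 * base ** epoch_num

for epoch in (0, 1, 2, 3):
    print(epoch, inverse_time_decay(0.2, epoch, decay_rate=1.0))
```

With $\alpha_0 = 0.2$ and a decay rate of 1, the inverse time schedule gives 0.2, 0.1, about 0.067, and 0.05 for epochs 0 through 3. With a time interval of 1,000, the scheduled version instead holds $\alpha$ at 0.2 for the first 1,000 epochs before dropping to 0.1.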

#### 3. Practical Implementation Advice
* **Hyperparameters:** Learning Rate Decay introduces an additional hyperparameter (the **decay rate**), which must be tuned along with the initial learning rate $\alpha_0$.
* **Priority:** Optimizing the initial **fixed value of $\alpha$** has a greater overall impact on performance and should typically be a higher priority than implementing learning rate decay. While decay is useful, it is often a lower-priority technique to try.

### The Problem of Local Optima

Here, we aim to clarify modern deep learning intuition regarding optimization challenges, shifting the focus from local optima to the problem of **saddle points and plateaus** in high-dimensional spaces.

#### 1. Re-evaluating Local Optima
* **Traditional Worry (Low Dimensions):** In the early days, researchers worried about the optimization algorithm getting stuck in many different "hills and valleys" (local optima), as often depicted in 2D plots.
* **Modern View (High Dimensions):** This intuition is largely incorrect for large neural networks operating in **very high-dimensional spaces** (e.g., 20,000 parameters).
    * For a zero-gradient point to be a true local minimum, the cost function must curve upward (be locally convex) in **all** directions.
    * With 20,000 parameters, the chance that the surface curves upward in all 20,000 directions at once is extremely small.

#### 2. The Problem: Saddle Points
* **Saddle Points are Common:** Instead of local optima, most points where the gradient is zero in high-dimensional spaces are **saddle points**.
    * A saddle point is a point where the function curves **up** in some directions but curves **down** in others (like a saddle on a horse).
    * Optimization algorithms are much more likely to encounter saddle points than true local optima.

#### 3. The Real Slowdown: Plateaus
* **Plateaus:** A more significant problem for optimization is running into **plateaus**.
    * A plateau is a large, flat region where the derivative (gradient) is close to zero for a long time.
    * When an algorithm hits a plateau, it takes a **very long time** to move across the flat surface and eventually find a descent path off the plateau.
* **Solution:** Sophisticated optimization algorithms like **Momentum, RMSprop, or Adam** are crucial for dealing with plateaus. Their accumulated "velocity" helps the optimization algorithm sweep across flat regions faster, speeding up the overall learning process.

#### 4. Conclusion on Optimization Challenges
* **Good News:** You are **unlikely to get stuck** in a bad local optimum, provided you are training a reasonably large network with many parameters.
* **Challenge:** The primary challenges are **saddle points** and **plateaus**, which are characteristic of optimization over high-dimensional, complex cost surfaces.
* **Takeaway:** This is the key reason why using advanced optimizers (Adam being a top choice) is essential: they are better designed to navigate the high-dimensional challenges of plateaus and saddle points than standard gradient descent.