# ⚙️ Level 9 - Adaptive Optimizers: RMSProp vs Adam vs AdamW

> **Objective:**  
> To visualize and compare the behavior of **RMSProp**, **Adam**, and **AdamW** optimizers  
> on a curved 3D loss surface, highlighting how adaptive learning rates and weight decay  
> influence convergence speed, path smoothness, and stability.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import animation
from mpl_toolkits.mplot3d import Axes3D
from IPython.display import HTML

# Define 3D loss surface
def f(x, y):
    """Non-convex loss surface."""
    return np.log(1 + x**2 + 2*y**2) + 0.3*np.sin(3*x) * np.cos(3*y)

def grad(x, y):
    """Gradient of the loss function."""
    dfdx = (2*x / (1 + x**2 + 2*y**2)) + 0.9*np.cos(3*x)*np.cos(3*y)
    dfdy = (4*y / (1 + x**2 + 2*y**2)) - 0.9*np.sin(3*x)*np.sin(3*y)
    return np.array([dfdx, dfdy])
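
A hand-derived gradient is easy to get subtly wrong, so it can help to compare it against a finite-difference estimate before trusting the optimizer paths. The sketch below repeats `f` and `grad` so that it runs on its own; `numerical_grad`, the step size `h`, and the test points are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def f(x, y):
    """Loss surface from this notebook, repeated so the check is standalone."""
    return np.log(1 + x**2 + 2*y**2) + 0.3*np.sin(3*x) * np.cos(3*y)

def grad(x, y):
    """Analytic gradient from this notebook."""
    denom = 1 + x**2 + 2*y**2
    dfdx = 2*x / denom + 0.9*np.cos(3*x)*np.cos(3*y)
    dfdy = 4*y / denom - 0.9*np.sin(3*x)*np.sin(3*y)
    return np.array([dfdx, dfdy])

def numerical_grad(x, y, h=1e-6):
    """Central-difference approximation (hypothetical helper)."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return np.array([dfdx, dfdy])

# Compare both gradients at a few arbitrary points.
test_points = [(2.8, 2.5), (0.3, -1.2), (-1.5, 0.7)]
max_error = max(
    np.max(np.abs(grad(x, y) - numerical_grad(x, y))) for x, y in test_points
)
print(f"max abs error: {max_error:.2e}")
```

A small maximum error (well below `1e-6` here) suggests the analytic derivatives match the surface definition.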


In [None]:
def rmsprop(start, lr=0.08, beta=0.9, eps=1e-8, steps=70):
    x, y = start
    s = np.zeros(2)
    path = [(x, y)]
    for _ in range(steps):
        g = grad(x, y)
        s = beta * s + (1 - beta) * (g ** 2)
        x -= lr * g[0] / (np.sqrt(s[0]) + eps)
        y -= lr * g[1] / (np.sqrt(s[1]) + eps)
        path.append((x, y))
    return np.array(path)

def adam(start, lr=0.08, beta1=0.9, beta2=0.999, eps=1e-8, steps=70):
    x, y = start
    m = np.zeros(2)
    v = np.zeros(2)
    path = [(x, y)]
    for t in range(1, steps + 1):
        g = grad(x, y)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * (g ** 2)
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat[0] / (np.sqrt(v_hat[0]) + eps)
        y -= lr * m_hat[1] / (np.sqrt(v_hat[1]) + eps)
        path.append((x, y))
    return np.array(path)

def adamw(start, lr=0.08, beta1=0.9, beta2=0.999, weight_decay=0.02, eps=1e-8, steps=70):
    x, y = start
    m = np.zeros(2)
    v = np.zeros(2)
    path = [(x, y)]
    for t in range(1, steps + 1):
        g = grad(x, y)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * (g ** 2)
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Decoupled weight decay: shrink the parameters directly instead of
        # adding weight_decay * theta to the gradient (that would be Adam + L2).
        x -= lr * (m_hat[0] / (np.sqrt(v_hat[0]) + eps) + weight_decay * x)
        y -= lr * (m_hat[1] / (np.sqrt(v_hat[1]) + eps) + weight_decay * y)
        path.append((x, y))
    return np.array(path)


In [None]:
start = np.array([2.8, 2.5])
steps = 70

rms_path = rmsprop(start, steps=steps)
adam_path = adam(start, steps=steps)
adamw_path = adamw(start, steps=steps)

paths = {"RMSProp": rms_path, "Adam": adam_path, "AdamW": adamw_path}


In [None]:
fig = plt.figure(figsize=(9, 7))
ax = fig.add_subplot(111, projection="3d")

# Surface grid
X = np.linspace(-3, 3, 150)
Y = np.linspace(-3, 3, 150)
X, Y = np.meshgrid(X, Y)
Z = f(X, Y)

ax.plot_surface(X, Y, Z, cmap="plasma", alpha=0.7, linewidth=0.4)
ax.set_title("3D Adaptive Optimizers: RMSProp vs Adam vs AdamW", fontsize=13)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("Loss")

colors = {"RMSProp": "cyan", "Adam": "orange", "AdamW": "red"}
lines, points = {}, {}

for name in paths:
    lines[name], = ax.plot([], [], [], lw=2, color=colors[name], label=name)
    points[name], = ax.plot([], [], [], "o", color=colors[name])

ax.legend()


In [None]:
def init():
    for name in paths:
        lines[name].set_data([], [])
        lines[name].set_3d_properties([])
        points[name].set_data([], [])
        points[name].set_3d_properties([])
    return list(lines.values()) + list(points.values())

def update(frame):
    """Draw each path up to and including step `frame`."""
    for name, path in paths.items():
        # Clamp so a shorter path simply rests at its final point.
        end = min(frame + 1, len(path))
        x, y = path[:end, 0], path[:end, 1]
        z = f(x, y)
        lines[name].set_data(x, y)
        lines[name].set_3d_properties(z)
        points[name].set_data(x[-1:], y[-1:])
        points[name].set_3d_properties(z[-1:])
    return list(lines.values()) + list(points.values())

# Each path holds steps + 1 points (the start plus one per update).
ani = animation.FuncAnimation(fig, update, init_func=init,
                              frames=steps + 1, interval=150, blit=False)

from IPython.display import display
import matplotlib

plt.close(fig)  # hide the static figure so only the animation is rendered
matplotlib.rcParams["animation.html"] = "jshtml"
display(HTML(ani.to_jshtml()))


## 🧠 Mathematical Insight

### RMSProp
- Adapts the learning rate of each parameter using an exponential moving average of squared gradients:
  $$
  s_t = \beta s_{t-1} + (1 - \beta)(\nabla f_t)^2
  $$
  $$
  \theta_{t+1} = \theta_t - \eta \frac{\nabla f_t}{\sqrt{s_t} + \epsilon}
  $$
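
To make the normalization concrete, here is a minimal sketch that applies the update above to a gradient whose components differ by a factor of 1000. The helper name `rmsprop_step` and the specific gradient values are assumptions chosen for illustration; the parameters are held fixed so that only the running average $s_t$ evolves.

```python
import numpy as np

def rmsprop_step(theta, g, s, lr=0.08, beta=0.9, eps=1e-8):
    """One RMSProp update; returns the new parameters and running average."""
    s = beta * s + (1 - beta) * g**2
    theta = theta - lr * g / (np.sqrt(s) + eps)
    return theta, s

theta = np.array([1.0, 1.0])
s = np.zeros(2)
g = np.array([10.0, 0.01])  # gradient components differing by 1000x

# Feed the same gradient repeatedly so s converges toward g**2.
# theta is deliberately not updated; only s accumulates.
for _ in range(200):
    theta_next, s = rmsprop_step(theta, g, s)

step = theta - theta_next
print(step)  # both components are close to lr = 0.08 despite the 1000x gap
```

Once the running average has warmed up, the step size depends on the learning rate rather than on the raw gradient magnitude, which is the sense in which the learning rate is "adaptive".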

### Adam
- Adds momentum (the first-moment estimate $m_t$) and bias correction on top of RMSProp:
  $$
  m_t = \beta_1 m_{t-1} + (1 - \beta_1)\nabla f_t, \quad
  v_t = \beta_2 v_{t-1} + (1 - \beta_2)(\nabla f_t)^2
  $$
  $$
  \theta_{t+1} = \theta_t - \eta \frac{m_t / (1 - \beta_1^t)}{\sqrt{v_t / (1 - \beta_2^t)} + \epsilon}
  $$
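
The role of the bias-correction terms is easiest to see at the first step, where both moments start from zero. The following sketch, using an arbitrary gradient chosen purely for illustration, compares the step with and without the $1 - \beta^t$ factors at $t = 1$.

```python
import numpy as np

beta1, beta2, lr, eps = 0.9, 0.999, 0.08, 1e-8
g = np.array([10.0, -0.01])  # arbitrary first gradient

# First step (t = 1), starting from m = v = 0.
m = (1 - beta1) * g
v = (1 - beta2) * g**2

# Without correction, both moments are biased toward zero, and the
# ratio m / sqrt(v) is inflated by (1 - beta1) / sqrt(1 - beta2).
raw_step = lr * m / (np.sqrt(v) + eps)

# With correction, m_hat = g and v_hat = g**2 at t = 1.
m_hat = m / (1 - beta1**1)
v_hat = v / (1 - beta2**1)
corrected_step = lr * m_hat / (np.sqrt(v_hat) + eps)

print(raw_step)        # roughly 3.16 * lr * sign(g)
print(corrected_step)  # roughly lr * sign(g)
```

With these default betas, the uncorrected first step would be about three times larger than the nominal learning rate; the correction brings it back to roughly `lr` per coordinate.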

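### AdamW
- Decouples **weight decay** from the adaptive gradient update. Adam with L2 regularization folds the penalty into the gradient, $\nabla f_t \leftarrow \nabla f_t + \lambda \theta_t$, so the decay term is rescaled by the adaptive denominator. AdamW leaves the gradient untouched and shrinks the parameters directly:
  $$
  \theta_{t+1} = \theta_t - \eta \left( \frac{m_t / (1 - \beta_1^t)}{\sqrt{v_t / (1 - \beta_2^t)} + \epsilon} + \lambda \theta_t \right)
  $$
  Every parameter then decays at the same relative rate regardless of its gradient history,  
  which in practice often yields **better generalization** and **stable training**.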
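
To see what decoupling changes, here is a minimal sketch comparing one step of Adam with an L2 penalty folded into the gradient against one step with the decay applied directly to the parameters. It assumes a flat loss (zero gradient) so that only the decay term acts; `adam_direction` is a hypothetical helper and the parameter values are arbitrary.

```python
import numpy as np

def adam_direction(g, t=1, m=0.0, v=0.0, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected Adam direction m_hat / (sqrt(v_hat) + eps) for one step."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return m_hat / (np.sqrt(v_hat) + eps)

lr, weight_decay = 0.08, 0.02
theta = np.array([2.0, 0.5])
loss_grad = np.zeros(2)  # flat loss: only the decay term acts

# Coupled (Adam + L2): decay is added to the gradient, then normalized.
coupled = theta - lr * adam_direction(loss_grad + weight_decay * theta)

# Decoupled (AdamW): decay bypasses the adaptive normalization.
decoupled = theta - lr * (adam_direction(loss_grad) + weight_decay * theta)

print(theta - coupled)    # about lr per coordinate, independent of weight_decay
print(theta - decoupled)  # lr * weight_decay * theta, proportional to each weight
```

In the coupled form the normalization erases the magnitude of the penalty, so both weights move by about `lr` no matter how small `weight_decay` is. In the decoupled form each weight shrinks in proportion to its own size, which is the behavior weight decay is meant to have.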


## 🧩 Key Observations

| Optimizer | Descent Path | Adaptivity | Stability | Comment |
|------------|--------------|-------------|-------------|----------|
| **RMSProp** | Fluctuates mildly | ✅ Adaptive | ⚠️ Slight noise | Good, but lacks momentum |
| **Adam** | Smooth adaptive descent | ✅ Adaptive | ✅ Stable | Fast convergence |
| **AdamW** | Smoothest, steady | ✅ Adaptive + Regularized | ✅ Most Stable | Best generalization & balance |


## 🧭 Takeaway

- **RMSProp** introduced adaptive learning rates.  
- **Adam** fused momentum + adaptive updates for fast convergence.  
- **AdamW** further improved it by decoupling weight decay from the adaptive step, which removes the distortion that adaptive scaling introduces into L2 regularization and tends to improve generalization.

**Many modern deep learning models**, especially Transformers and LLMs, are commonly trained with **AdamW** as the default optimizer.  
This level connects pure mathematical optimization to large-scale model training in practice.


In [None]:
ani.save("adaptive_optimizers_3D.gif", writer="pillow", fps=8)
