# Stamp 04: Adam Math at t=1

**Goal:** Work through the Adam update equation at t=1 and confirm (or refute) that bias correction causes the update to reduce to -η · sign(g).

No training, no model. Just math.

---

*Jeffery Harrell & Alpha, December 1, 2025*

## The Adam Update Rule

Standard Adam (Kingma & Ba, 2014):

**Given:**
- $g_t$ = gradient at step $t$
- $\eta$ = learning rate
- $\beta_1$ = first moment decay (default: 0.9)
- $\beta_2$ = second moment decay (default: 0.999)
- $\epsilon$ = numerical stability term (default: 1e-8)

**Update equations:**

1. First moment estimate (momentum):
$$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$$

2. Second moment estimate (RMSprop-like):
$$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$$

3. Bias-corrected estimates:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

4. Parameter update:
$$\theta_t = \theta_{t-1} - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
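The four equations above can be sketched as one NumPy function. This is a minimal illustration and not PyTorch's implementation; the name `adam_step`, its argument order, and the toy inputs are assumptions made for this example.

```python
import numpy as np

def adam_step(theta, g, m_prev, v_prev, t,
              eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (hypothetical helper). Returns (theta_new, m, v)."""
    m = beta1 * m_prev + (1 - beta1) * g        # 1. first moment estimate
    v = beta2 * v_prev + (1 - beta2) * g**2     # 2. second moment estimate
    m_hat = m / (1 - beta1**t)                  # 3. bias-corrected estimates
    v_hat = v / (1 - beta2**t)
    theta_new = theta - eta * m_hat / (np.sqrt(v_hat) + eps)  # 4. update
    return theta_new, m, v

# Toy values, chosen only for illustration
theta = np.array([0.5, -0.5])
g = np.array([0.2, -3.0])
theta1, m1, v1 = adam_step(theta, g, np.zeros(2), np.zeros(2), t=1)
print(theta1 - theta)
```

Passing `t=1` with zero moments reproduces the special case worked through in the next section.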

## At t=1

**Initial conditions:** $m_0 = 0$, $v_0 = 0$

### Step 1: Compute $m_1$ and $v_1$

$$m_1 = \beta_1 \cdot 0 + (1 - \beta_1) \cdot g_1 = (1 - \beta_1) \cdot g_1$$

$$v_1 = \beta_2 \cdot 0 + (1 - \beta_2) \cdot g_1^2 = (1 - \beta_2) \cdot g_1^2$$

### Step 2: Bias correction

$$\hat{m}_1 = \frac{(1 - \beta_1) \cdot g_1}{1 - \beta_1^1} = \frac{(1 - \beta_1) \cdot g_1}{1 - \beta_1} = g_1$$

$$\hat{v}_1 = \frac{(1 - \beta_2) \cdot g_1^2}{1 - \beta_2^1} = \frac{(1 - \beta_2) \cdot g_1^2}{1 - \beta_2} = g_1^2$$

**The bias correction terms cancel perfectly at t=1!**
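The cancellation does not depend on the particular decay rates. As a quick sanity check, the sketch below repeats Step 2 in exact rational arithmetic for a few arbitrary $(\beta_1, \beta_2)$ pairs; the helper name `bias_corrected_t1` and the test values are assumptions for this example.

```python
from fractions import Fraction

def bias_corrected_t1(g, beta1, beta2):
    """Return (m_hat_1, v_hat_1) starting from m_0 = v_0 = 0."""
    m1 = (1 - beta1) * g
    v1 = (1 - beta2) * g**2
    return m1 / (1 - beta1**1), v1 / (1 - beta2**1)

g = Fraction(-7, 1000)  # arbitrary nonzero gradient
results = []
for beta1, beta2 in [(Fraction(9, 10), Fraction(999, 1000)),  # Adam defaults
                     (Fraction(1, 2), Fraction(1, 2)),
                     (Fraction(0), Fraction(99, 100))]:
    m_hat, v_hat = bias_corrected_t1(g, beta1, beta2)
    # Exact comparison: no floating-point tolerance needed with Fraction
    results.append((m_hat == g, v_hat == g**2))
    print(beta1, beta2, m_hat == g, v_hat == g**2)
```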

### Step 3: The update

$$\Delta\theta_1 = -\eta \cdot \frac{\hat{m}_1}{\sqrt{\hat{v}_1} + \epsilon} = -\eta \cdot \frac{g_1}{\sqrt{g_1^2} + \epsilon} = -\eta \cdot \frac{g_1}{|g_1| + \epsilon}$$

If $|g_1| \gg \epsilon$:

$$\Delta\theta_1 \approx -\eta \cdot \frac{g_1}{|g_1|} = -\eta \cdot \text{sign}(g_1)$$
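To see how tight this approximation is, the following sketch compares the exact t=1 update with $-\eta \cdot \text{sign}(g)$ for a vector of gradients that differ wildly in size but all sit well above $\epsilon$. The gradient values are made up for illustration.

```python
import numpy as np

eta, eps = 1e-3, 1e-8

# Illustrative gradients spanning many orders of magnitude, all >> eps
g = np.array([3e-4, -2e-1, 5.0, -4e3, 1e8])

exact = -eta * g / (np.abs(g) + eps)   # update at t=1
approx = -eta * np.sign(g)             # sign approximation

# Error of the approximation, relative to eta
rel_err = np.abs(exact - approx) / eta
print(rel_err.max())
```

The relative error works out to $\epsilon / (|g| + \epsilon)$, so the smallest gradient in the vector determines the worst case.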

## Conclusion (Algebraic)

**Yes, the claim is correct**, provided the gradient is well above $\epsilon$. At t=1, with $m_0 = v_0 = 0$:

$$\boxed{\Delta\theta_1 \approx -\eta \cdot \text{sign}(g_1)}$$

(assuming $|g_1| \gg \epsilon$)

This means:
1. The *magnitude* of the gradient doesn't matter at t=1
2. Only the *sign* matters
3. Every parameter moves by almost exactly $\eta$ (in absolute value)
4. This is true whether $g_1$ is 1e-4 or 1e+10, as long as $|g_1| \gg \epsilon$

## Wait. Let's check the ε regime.

What if $|g_1|$ is *not* much greater than $\epsilon$?

With $\epsilon = 10^{-8}$ (PyTorch default for Adam):

| $\lvert g_1 \rvert$ | $\lvert g_1 \rvert / (\lvert g_1 \rvert + \epsilon)$ | Effective multiplier |
|---------|------------------------------|----------------------|
| 1e-2 | 0.999999 | ≈ 1 |
| 1e-4 | 0.9999 | ≈ 1 |
| 1e-6 | 0.99 | ≈ 1 |
| 1e-8 | 0.5 | 0.5 |
| 1e-10 | 0.0099 | ≈ 0.01 |
| 1e-12 | 0.0001 | ≈ 0 |

**Oh.**

If $|g_1| \approx \epsilon$, the update is halved.

If $|g_1| \ll \epsilon$, the update approaches:

$$\Delta\theta_1 \approx -\eta \cdot \frac{g_1}{\epsilon} = -\frac{\eta}{\epsilon} \cdot g_1$$

This is just scaled gradient descent with a huge learning rate $\eta/\epsilon = 10^{-3}/10^{-8} = 10^5$.

But wait: $g_1$ is tiny, so the product might still be small...
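A small sketch makes the tiny-gradient regime concrete by comparing the exact step size with the linear approximation $\eta \cdot |g| / \epsilon$. The function names and the sample magnitudes are assumptions for this example.

```python
eta, eps = 1e-3, 1e-8

def step_size(g_abs):
    """Exact |delta theta| at t=1."""
    return eta * g_abs / (g_abs + eps)

def linear_regime(g_abs):
    """Approximation for |g| << eps: eta * |g| / eps."""
    return eta * g_abs / eps

for g_abs in [1e-9, 1e-10, 1e-12, 1e-14]:
    exact = step_size(g_abs)
    approx = linear_regime(g_abs)
    rel = (approx - exact) / exact
    print(f"|g|={g_abs:.0e}  exact={exact:.3e}  linear={approx:.3e}  rel. error={rel:.1e}")
```

Algebraically the relative error of the linear form is $|g| / \epsilon$, so it is about 1% at $|g| = 10^{-10}$. The large multiplier $\eta/\epsilon$ does not rescue the step: the update stays far below $\eta$.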

## Let's compute actual numbers

From our sanity check:
- Dead token gradient norm at step 1: **1.79e-10**
- Live token gradient norm at step 1: **7.65e-3**

With $\eta = 10^{-3}$, $\epsilon = 10^{-8}$:

### Live tokens:
$$|g| = 7.65 \times 10^{-3} \gg \epsilon$$
$$|\Delta\theta| \approx \eta = 10^{-3}$$

### Dead tokens:
$$|g| = 1.79 \times 10^{-10} \ll \epsilon$$
$$|\Delta\theta| \approx \eta \cdot \frac{|g|}{\epsilon} = 10^{-3} \cdot \frac{1.79 \times 10^{-10}}{10^{-8}} = 10^{-3} \cdot 1.79 \times 10^{-2} = 1.79 \times 10^{-5}$$

So dead tokens should move by about **1.79e-5** at step 1 (**1.76e-5** when the $|g|$ term in the denominator is not dropped), not **1e-3** like live tokens.
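The arithmetic above can be reproduced in a few lines. Like the text, this sketch treats the quoted gradient norms as if they were per-element magnitudes, which is a simplifying assumption because Adam acts element-wise.

```python
eta, eps = 1e-3, 1e-8
g_live, g_dead = 7.65e-3, 1.79e-10   # gradient norms quoted above

def step(g_abs):
    """Exact |delta theta| at t=1."""
    return eta * g_abs / (g_abs + eps)

live, dead = step(g_live), step(g_dead)
dead_linear = eta * g_dead / eps     # the |g| << eps approximation

print(f"live:                 {live:.3e}")
print(f"dead (exact):         {dead:.3e}")
print(f"dead (linear approx): {dead_linear:.3e}")
print(f"live/dead ratio:      {live / dead:.1f}")
```

Under these assumptions the live step is roughly 57 times larger than the dead step.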

## The Real Question

The math shows that if $|g| \ll \epsilon$, the update is proportional to $g$, not $\text{sign}(g)$.

But **why is the dead token gradient so small in the first place?**

Theory says:
$$\frac{\partial L}{\partial W_i} = p_i \cdot h$$

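where $p_i$ is the softmax probability assigned to token $i$, and $h$ is the hidden state. This form applies to tokens that never appear as the target, which is what makes them dead; for the target token the factor is $p_i - 1$ instead of $p_i$.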
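The formula can be checked against finite differences on a toy problem. This sketch assumes a single training example, a bias-free unembedding matrix `W` with logits `W @ h`, and cross-entropy loss; the sizes and random values are placeholders, not the real model.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 4                      # toy vocab size and hidden size
W = rng.normal(size=(V, d))      # unembedding matrix, one row per token
h = rng.normal(size=d)           # hidden state
target = 2                       # the observed next token

def loss(W):
    """Cross-entropy loss of softmax(W @ h) against the target token."""
    logits = W @ h
    logits = logits - logits.max()               # stabilize the softmax
    log_p = logits - np.log(np.exp(logits).sum())
    return -log_p[target]

logits = W @ h
p = np.exp(logits - logits.max())
p /= p.sum()

i = 5                            # any token other than the target
analytic = p[i] * h              # claimed gradient for row i

# Central finite differences on row i of W
numeric = np.zeros(d)
step = 1e-6
for k in range(d):
    dW = np.zeros_like(W)
    dW[i, k] = step
    numeric[k] = (loss(W + dW) - loss(W - dW)) / (2 * step)

print(np.allclose(analytic, numeric, atol=1e-7))
```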

For dead tokens, $p_i$ should be approximately $1/V$ (uniform) when the model is untrained, giving:
$$|g_{dead}| \approx \frac{1}{V} \cdot |h| = \frac{1}{3988} \cdot |h|$$

With $|h| \approx 11$ (from our h_mean norm), we'd expect:
$$|g_{dead}| \approx \frac{11}{3988} \approx 2.8 \times 10^{-3}$$

But we measured **1.79e-10**. That's 7 orders of magnitude smaller!

**The mystery isn't in Adam. The mystery is in the gradient computation itself.**

## Numerical Verification

In [1]:
import numpy as np

# Adam parameters (PyTorch defaults)
eta = 1e-3      # learning rate
beta1 = 0.9     # first moment decay
beta2 = 0.999   # second moment decay  
eps = 1e-8      # numerical stability

# At t=1, with m_0 = v_0 = 0
def adam_update_t1(g):
    """Compute Adam update at t=1 for gradient g."""
    # Raw moments
    m1 = (1 - beta1) * g
    v1 = (1 - beta2) * g**2
    
    # Bias correction at t=1
    m1_hat = m1 / (1 - beta1**1)  # = g
    v1_hat = v1 / (1 - beta2**1)  # = g^2
    
    # Update
    delta = -eta * m1_hat / (np.sqrt(v1_hat) + eps)
    return delta

print("Adam update at t=1 for various gradient magnitudes:")
print("="*60)
print(f"{'|g|':>12} | {'|Δθ|':>12} | {'|Δθ|/η':>12} | Notes")
print("-"*60)

for g in [1e-2, 1e-4, 1e-6, 1e-8, 1e-10, 1e-12, 7.65e-3, 1.79e-10]:
    delta = adam_update_t1(g)
    ratio = abs(delta) / eta
    note = ""
    if g == 7.65e-3:
        note = "← Live tokens"
    elif g == 1.79e-10:
        note = "← Dead tokens"
    print(f"{g:>12.2e} | {abs(delta):>12.2e} | {ratio:>12.4f} | {note}")

Adam update at t=1 for various gradient magnitudes:
============================================================
         |g| |         |Δθ| |       |Δθ|/η | Notes
------------------------------------------------------------
    1.00e-02 |     1.00e-03 |       1.0000 | 
    1.00e-04 |     1.00e-03 |       0.9999 | 
    1.00e-06 |     9.90e-04 |       0.9901 | 
    1.00e-08 |     5.00e-04 |       0.5000 | 
    1.00e-10 |     9.90e-06 |       0.0099 | 
    1.00e-12 |     1.00e-07 |       0.0001 | 
    7.65e-03 |     1.00e-03 |       1.0000 | ← Live tokens
    1.79e-10 |     1.76e-05 |       0.0176 | ← Dead tokens


In [2]:
# What gradient would we EXPECT for dead tokens?
V = 3988  # vocab size
h_norm = 11.0  # approximate h_mean norm at step 1

# If softmax is uniform, p_i = 1/V for all tokens
p_uniform = 1 / V
expected_grad = p_uniform * h_norm

print("Expected dead token gradient (uniform softmax theory):")
print(f"  p_dead = 1/V = 1/{V} = {p_uniform:.2e}")
print(f"  |h| ≈ {h_norm}")
print(f"  |g_dead| = p_dead × |h| ≈ {expected_grad:.2e}")
print()
print("Observed dead token gradient: 1.79e-10")
print(f"Ratio (expected/observed): {expected_grad / 1.79e-10:.2e}")

Expected dead token gradient (uniform softmax theory):
  p_dead = 1/V = 1/3988 = 2.51e-04
  |h| ≈ 11.0
  |g_dead| = p_dead × |h| ≈ 2.76e-03

Observed dead token gradient: 1.79e-10
Ratio (expected/observed): 1.54e+07


## Summary

1. **The Adam math is correct.** At t=1, bias correction cancels and $\Delta\theta = -\eta \cdot g / (|g| + \epsilon)$.

2. **For large gradients** ($|g| \gg \epsilon$), this reduces to $-\eta \cdot \text{sign}(g)$.

3. **For tiny gradients** ($|g| \ll \epsilon$), this reduces to $-\eta \cdot g / \epsilon$, which is proportional to $g$.

4. **The mystery is NOT in Adam.** Adam would happily move dead tokens if they had reasonable gradients.

5. **The mystery is in the gradient itself.** Dead tokens get gradients of 1e-10 when theory predicts ~1e-3. That's a 7 order-of-magnitude discrepancy.

**Next question:** Why is $\partial L / \partial W_{dead}$ so small? What's happening in the forward/backward pass that makes dead tokens invisible to the gradient?