
rmsprop implementation in optax and torch #532

Closed
vwxyzjn opened this issue Jun 1, 2023 · 9 comments · Fixed by #595

Comments

@vwxyzjn (Contributor) commented Jun 1, 2023

Hello, thanks for this helpful library.

I was wondering if optax's rmsprop implementation is equivalent to torch's rmsprop implementation.

Recently, I have been working on a distributed reinforcement learning library called Cleanba, which replicates IMPALA. I then compared Cleanba's IMPALA against torchbeast's IMPALA, considering the following four optimizer settings:

torch setting in torchbeast:

```python
optimizer = torch.optim.Adam(
    learner_model.parameters(),
    lr=0.00025,
    eps=1e-5,
)
# ...
nn.utils.clip_grad_norm_(learner_model.parameters(), 0.5)
```

optax setting in our cleanba:

```python
optax.chain(
    optax.clip_by_global_norm(0.5),
    optax.adam(
        learning_rate=0.00025,
        eps=1e-5,
    ),
)
```
torch setting in torchbeast:

```python
optimizer = torch.optim.RMSprop(
    learner_model.parameters(),
    lr=0.0006,
    momentum=0,
    eps=0.01,
    alpha=0.99,
)
# ...
nn.utils.clip_grad_norm_(learner_model.parameters(), 40)
```

optax setting in our cleanba:

```python
optax.chain(
    optax.clip_by_global_norm(40),
    optax.rmsprop(
        learning_rate=0.0006,
        eps=0.01,
        decay=0.99,  # decay corresponds to torch's `alpha`, right?
    ),
)
```
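On the decay/alpha question: in both libraries this coefficient drives an exponential moving average of squared gradients. A minimal pure-Python sketch of that accumulator (illustrative only, not either library's actual code):

```python
def update_second_moment(n, g, decay):
    # EMA of squared gradients; torch.optim.RMSprop calls this
    # coefficient `alpha`, optax.rmsprop calls it `decay`.
    return decay * n + (1 - decay) * g ** 2

# Accumulate over a few made-up gradient values.
n = 0.0
for g in [0.5, -0.25, 0.1]:
    n = update_second_moment(n, g, decay=0.99)
print(n)
```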

The results are presented in the figure below. When using the same Adam optimizer settings in torchbeast and cleanba, torchbeast's IMPALA and cleanba's IMPALA have similar performance. However, when using the same RMSprop optimizer settings, the performance differs significantly, with torchbeast's IMPALA obtaining a much higher median human normalized score. This leads me to wonder if optax has the same RMSprop implementation as torch... Was wondering if you would have any thoughts on the performance discrepancies. Thanks!


The experiments were run for three random seeds, and the individual learning curves suggest torchbeast's IMPALA seems to perform qualitatively better.

[figure: main_10CPU learning curves]

@Rupt (Contributor) commented Aug 13, 2023

@vwxyzjn both of your "cleanba" examples use optax.adam, although the second has an invalid decay= argument. Is this an error in your reproduction in this issue?

@Rupt (Contributor) commented Aug 13, 2023

> This leads me to wonder if optax has the same RMSprop implementation as torch... Was wondering if you would have any thoughts on the performance discrepancies. Thanks!

@vwxyzjn Yes, the implementations differ. They treat eps differently.

In the denominator, optax uses $\sqrt{v + \varepsilon}$ (see here and here), and torch uses $(\sqrt v + \varepsilon)$ (see here).
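The difference is easy to see numerically when the second-moment estimate is small. A quick illustrative check (the values below are made up, using the eps=0.01 from the RMSprop settings above, not taken from either library's internals):

```python
import math

# A small second-moment estimate v, with eps = 0.01 as in the settings above.
v, eps = 1e-8, 0.01

optax_denom = math.sqrt(v + eps)   # optax convention: sqrt(v + eps)
torch_denom = math.sqrt(v) + eps   # torch convention: sqrt(v) + eps

print(optax_denom, torch_denom)
```

For small v, optax's denominator is near sqrt(eps) = 0.1 while torch's is near eps = 0.01, so the resulting update magnitudes differ by roughly an order of magnitude.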

@vwxyzjn (Contributor, Author) commented Aug 13, 2023

Hi @Rupt, thanks for the reply. The snippet I shared had a typo; it should be optax.rmsprop, which is what was actually used in the experiment.


Also, thanks for checking the differences. So to create a torch-equivalent optimizer, I just need to do the following?

```diff
  def update_fn(updates, state, params=None):
    del params
    nu = update_moment_per_elem_norm(updates, state.nu, decay, 2)
    updates = jax.tree_util.tree_map(
-        lambda g, n: g * jax.lax.rsqrt(n + eps), updates, nu)
+        lambda g, n: g * jax.lax.rsqrt(n) + eps, updates, nu)
    return updates, ScaleByRmsState(nu=nu)
```

@Rupt (Contributor) commented Aug 13, 2023

Very welcome @vwxyzjn.

Not quite: the eps should still be in the denominator. I think you want this:

```diff
-        lambda g, n: g * jax.lax.rsqrt(n + eps), updates, nu)
+        lambda g, n: g / (jax.lax.sqrt(n) + eps), updates, nu)
```
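As a sanity check on the corrected form, here is a plain-Python comparison of the two update rules with made-up numbers (a sketch independent of optax and torch):

```python
import math

def optax_style(g, n, eps):
    # optax's scale_by_rms denominator: sqrt(n + eps)
    return g / math.sqrt(n + eps)

def torch_style(g, n, eps):
    # torch's RMSprop denominator: sqrt(n) + eps
    return g / (math.sqrt(n) + eps)

g, n, eps = 0.5, 0.04, 0.01
print(optax_style(g, n, eps), torch_style(g, n, eps))
```

With these values the torch-style update is 0.5 / (0.2 + 0.01), noticeably larger than the optax-style 0.5 / sqrt(0.05).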

@vwxyzjn (Contributor, Author) commented Aug 13, 2023

Thanks so much @Rupt I will give it a try :)

@mtthss (Collaborator) commented Aug 14, 2023

@vwxyzjn did this work?

@vwxyzjn (Contributor, Author) commented Aug 16, 2023

@mtthss Actually yeah, I used the optimizer @Rupt suggested and can now match the performance of monobeast with RMSprop (the purple curve now matches the blue curve). The PyTorch RMSprop setting seems quite good, giving a performance boost in almost all the games tested.

Would you like a PR to add a warning like #571?

https://github.com/vwxyzjn/cleanba/blob/d0f5edebe8539231855d657e57e46daf7c590bc7/cleanba/cleanba_impala_envpool_machado_atari_wrapper_rmsprop_pt.py#L135-L179

[figures: main_10CPU_sample_walltime_efficiency, main_10CPU]

@Rupt (Contributor) commented Aug 16, 2023

@vwxyzjn Glad it worked, thanks for sharing this nice reproduction.

> The PyTorch RMSprop setting seems quite good, giving a performance boost in almost all the games tested.

Note that the two implementations will prefer different eps values. You might get more similar results by using eps**2 in the optax version, because for small nu we get $1/\sqrt{0+\varepsilon^2} = 1/(\sqrt{0} + \varepsilon)$.
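That equivalence at nu = 0 can be checked directly (a quick illustrative check, not library code):

```python
import math

eps = 0.01

# With eps**2 inside the square root, the optax convention matches the
# torch convention exactly when the second moment is zero: both are 1/eps.
inside = 1 / math.sqrt(0 + eps ** 2)   # optax form with eps**2
outside = 1 / (math.sqrt(0) + eps)     # torch form with eps

print(inside, outside)
```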

@mtthss (Collaborator) commented Oct 10, 2023

> Would you like a PR to add a warning like #571?

@vwxyzjn that would be great thanks!
