
Keras Nadam optimizer behaves differently from ApplyAdamOp with use_nesterov option #15710

Closed

albertz opened this issue Nov 27, 2021 · 2 comments

@albertz

albertz commented Nov 27, 2021

This is copied and adapted from tensorflow/tensorflow#53204.

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04 (but irrelevant)
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
  • TensorFlow installed from (source or binary): binary (but irrelevant)
  • TensorFlow version (use command below): 2.3, 2.7.0, current master
  • Python version: 3.7
  • Bazel version (if compiling from source): -
  • GCC/Compiler version (if compiling from source): -
  • CUDA/cuDNN version: -
  • GPU model and memory: -

Describe the current behavior

Nadam as implemented in the kernel via the use_nesterov option (tensorflow/tensorflow/core/kernels/training_ops_gpu.cu.cc) behaves differently from tf.keras.optimizers.Nadam.

We observed this problem (here: rwth-i6/returnn#766 (comment)): a model that converged in TF 1 with the old Nadam no longer converges in TF 2 with the new Nadam.

Specifically, in the kernel, you have this code:

const T mul_factor = (*lr_) * sqrt(static_cast<T>(1.0) - (*beta2_power_)) /
                     (static_cast<T>(1.0) - (*beta1_power_));

...

auto m_i = m[i];
auto g_i = grad[i];
auto v_i = v[i];

m_i += one_minus_beta1 * (g_i - m_i);
v_i += one_minus_beta2 * (g_i * g_i - v_i);
if (use_nesterov) {
  var[i] -= mul_factor * (m_i * beta1 + one_minus_beta1 * g_i) /
            (epsilon + sqrt(v_i));
} else {
  var[i] -= mul_factor * m_i / (epsilon + sqrt(v_i));
}

m[i] = m_i;
v[i] = v_i;

And in the Keras Nadam optimizer, we have this code:

g_prime = grad / coefficients['one_minus_m_schedule_new']
m_t = (coefficients['beta_1_t'] * m +
       coefficients['one_minus_beta_1_t'] * grad)
m_t = tf.compat.v1.assign(m, m_t, use_locking=self._use_locking)
m_t_prime = m_t / coefficients['one_minus_m_schedule_next']
v_t = (coefficients['beta_2_t'] * v +
       coefficients['one_minus_beta_2_t'] * tf.square(grad))
v_t = tf.compat.v1.assign(v, v_t, use_locking=self._use_locking)
v_t_prime = v_t / coefficients['v_t_prime_denominator']
m_t_bar = (coefficients['one_minus_m_t'] * g_prime +
           coefficients['m_t_1'] * m_t_prime)
var_t = var - coefficients['lr_t'] * m_t_bar / (
    tf.sqrt(v_t_prime) + coefficients['epsilon'])

There are differences in the calculation of m_t and v_t. While mathematically equivalent, the variant in the kernel is probably numerically more stable. The difference is:

m_i = m_i + one_minus_beta1 * (g_i - m_i);

vs

m_i = beta1 * m_i + one_minus_beta1 * g_i;

I think the first variant is better here and should be adopted in the Keras code.
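
To illustrate the numerical aspect, here is a rough NumPy sketch (my own, not from either implementation): run both update forms in float32 on the same random gradient stream and compare each against a float64 reference.

import numpy as np

rng = np.random.default_rng(0)
beta1 = 0.9
grads = rng.normal(size=10000)

def run_m_update(dtype, kernel_form):
    m = dtype(0.0)
    b1 = dtype(beta1)
    one_minus_b1 = dtype(1.0) - b1
    for g in grads.astype(dtype):
        if kernel_form:
            m = m + one_minus_b1 * (g - m)   # kernel form: m += (1 - beta1) * (g - m)
        else:
            m = b1 * m + one_minus_b1 * g    # Keras form: m = beta1 * m + (1 - beta1) * g
    return float(m)

# float64 reference (both forms agree closely at this precision)
ref = run_m_update(np.float64, kernel_form=True)
print("kernel form, float32 error:", abs(run_m_update(np.float32, True) - ref))
print("Keras form,  float32 error:", abs(run_m_update(np.float32, False) - ref))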

Another difference is how the bias correction is applied: the kernel folds it into mul_factor, while Keras applies lr_t together with the per-term corrections (g_prime, m_t_prime, v_t_prime), so var[i] (var_t) is calculated differently.
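
To make that structural difference concrete, here is a schematic single-step comparison in NumPy (my own sketch; the Keras momentum schedule m_schedule is simplified to plain powers of beta1, which only approximates what the Keras code actually computes):

import numpy as np

lr, beta1, beta2, eps, t = 0.001, 0.9, 0.999, 1e-7, 1
var, m, v, g = 1.0, 0.0, 0.0, 0.5

# Kernel (ApplyAdam with use_nesterov=True): the beta1/beta2 bias corrections are
# folded into a single mul_factor that scales the whole step (including epsilon).
mul_factor = lr * np.sqrt(1 - beta2**t) / (1 - beta1**t)
m_k = m + (1 - beta1) * (g - m)
v_k = v + (1 - beta2) * (g * g - v)
step_kernel = mul_factor * (beta1 * m_k + (1 - beta1) * g) / (eps + np.sqrt(v_k))

# Keras Nadam (momentum schedule simplified): m and v are bias-corrected
# separately, and epsilon is added after the corrected sqrt.
m_t = beta1 * m + (1 - beta1) * g
v_t = beta2 * v + (1 - beta2) * g * g
g_prime = g / (1 - beta1**t)          # assumption: m_schedule ~ beta1**t
m_prime = m_t / (1 - beta1**(t + 1))  # assumption: m_schedule_next ~ beta1**(t + 1)
v_prime = v_t / (1 - beta2**t)
m_bar = (1 - beta1) * g_prime + beta1 * m_prime
step_keras = lr * m_bar / (np.sqrt(v_prime) + eps)

print(step_kernel, step_keras)  # the two per-step updates differ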

Describe the expected behavior

I would expect Nadam as implemented in the kernel via the use_nesterov option (tensorflow/tensorflow/core/kernels/training_ops_gpu.cu.cc; available since TF 1 and still present in current master), which could be used via tensorflow.contrib.opt.NadamOptimizer in TF 1, to be conceptually the same as tf.keras.optimizers.Nadam.

Also, TF already provides ApplyAdamOp with Nesterov support. I would expect Keras Nadam to use that op anyway, as it should also be a bit more efficient.
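
For completeness, a rough end-to-end sketch (mine, not part of the original report) that drives both paths with the same constant gradient; tf.raw_ops.ResourceApplyAdam is the resource-variable form of the ApplyAdamOp kernel that contrib's NadamOptimizer used:

import tensorflow as tf

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-7
grad = tf.constant([0.5, -0.25])

# Path 1: the kernel op with use_nesterov=True.
var_a = tf.Variable([1.0, 1.0])
m_a = tf.Variable(tf.zeros_like(var_a))
v_a = tf.Variable(tf.zeros_like(var_a))
for t in range(1, 4):
    tf.raw_ops.ResourceApplyAdam(
        var=var_a.handle, m=m_a.handle, v=v_a.handle,
        beta1_power=tf.constant(beta1 ** t), beta2_power=tf.constant(beta2 ** t),
        lr=tf.constant(lr), beta1=tf.constant(beta1), beta2=tf.constant(beta2),
        epsilon=tf.constant(eps), grad=grad, use_nesterov=True)

# Path 2: the Keras optimizer under discussion.
var_b = tf.Variable([1.0, 1.0])
opt = tf.keras.optimizers.Nadam(learning_rate=lr, beta_1=beta1, beta_2=beta2, epsilon=eps)
for _ in range(3):
    opt.apply_gradients([(grad, var_b)])

print(var_a.numpy(), var_b.numpy())  # expected to differ, per the analysis above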


@albertz

albertz commented Aug 9, 2022

This was closed without comment. Has this been fixed? In which commit?
