
Keras Nadam optimizer behaves differently from ApplyAdamOp with use_nesterov option #15710

Closed

albertz opened this issue Nov 27, 2021 · 2 comments

@albertz

albertz commented Nov 27, 2021

This is copied and adapted from tensorflow/tensorflow#53204.

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04 (but irrelevant)
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
  • TensorFlow installed from (source or binary): binary (but irrelevant)
  • TensorFlow version (use command below): 2.3, 2.7.0, current master
  • Python version: 3.7
  • Bazel version (if compiling from source): -
  • GCC/Compiler version (if compiling from source): -
  • CUDA/cuDNN version: -
  • GPU model and memory: -

Describe the current behavior

Nadam as implemented in the kernel via the use_nesterov option (tensorflow/tensorflow/core/kernels/training_ops_gpu.cu.cc) behaves differently from tf.keras.optimizers.Nadam.

We observed this problem (here: rwth-i6/returnn#766 (comment)): a model that converged in TF 1 with the old Nadam no longer converges in TF 2 with the new Nadam.

Specifically, in the kernel, you have this code:

const T mul_factor = (*lr_) * sqrt(static_cast<T>(1.0) - (*beta2_power_)) /
                     (static_cast<T>(1.0) - (*beta1_power_));

...

auto m_i = m[i];
auto g_i = grad[i];
auto v_i = v[i];

m_i += one_minus_beta1 * (g_i - m_i);
v_i += one_minus_beta2 * (g_i * g_i - v_i);
if (use_nesterov) {
  var[i] -= mul_factor * (m_i * beta1 + one_minus_beta1 * g_i) /
            (epsilon + sqrt(v_i));
} else {
  var[i] -= mul_factor * m_i / (epsilon + sqrt(v_i));
}

m[i] = m_i;
v[i] = v_i;

And in the Keras Nadam optimizer, we have this code:

g_prime = grad / coefficients['one_minus_m_schedule_new']
m_t = (coefficients['beta_1_t'] * m +
       coefficients['one_minus_beta_1_t'] * grad)
m_t = tf.compat.v1.assign(m, m_t, use_locking=self._use_locking)
m_t_prime = m_t / coefficients['one_minus_m_schedule_next']
v_t = (coefficients['beta_2_t'] * v +
       coefficients['one_minus_beta_2_t'] * tf.square(grad))
v_t = tf.compat.v1.assign(v, v_t, use_locking=self._use_locking)
v_t_prime = v_t / coefficients['v_t_prime_denominator']
m_t_bar = (coefficients['one_minus_m_t'] * g_prime +
           coefficients['m_t_1'] * m_t_prime)
var_t = var - coefficients['lr_t'] * m_t_bar / (
    tf.sqrt(v_t_prime) + coefficients['epsilon'])

There are differences in the calculation of m_t and v_t. While mathematically equivalent, the variant in the kernel is probably numerically more stable. The difference is:

m_i = m_i + one_minus_beta1 * (g_i - m_i);

vs

m_i = beta1 * m_i + one_minus_beta1 * g_i;

I think the first variant is better here and should be adopted in the Keras code.
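
To illustrate the numerical aspect, here is a rough NumPy sketch (my own, not from either implementation): run both update forms in float32 on the same random gradient stream and compare each against a float64 reference.

import numpy as np

rng = np.random.default_rng(0)
beta1 = 0.9
grads = rng.normal(size=10000)

def run_m_update(dtype, kernel_form):
    m = dtype(0.0)
    b1 = dtype(beta1)
    one_minus_b1 = dtype(1.0) - b1
    for g in grads.astype(dtype):
        if kernel_form:
            m = m + one_minus_b1 * (g - m)   # kernel form: m += (1 - beta1) * (g - m)
        else:
            m = b1 * m + one_minus_b1 * g    # Keras form: m = beta1 * m + (1 - beta1) * g
    return float(m)

# float64 reference (both forms agree closely at this precision)
ref = run_m_update(np.float64, kernel_form=True)
print("kernel form, float32 error:", abs(run_m_update(np.float32, True) - ref))
print("Keras form,  float32 error:", abs(run_m_update(np.float32, False) - ref))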

Another difference is how the bias correction is applied: the kernel folds it into mul_factor, while Keras applies lr_t together with the per-term corrections (g_prime, m_t_prime, v_t_prime), so var[i] (var_t) is calculated differently.
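
To make that structural difference concrete, here is a schematic single-step comparison in NumPy (my own sketch; the Keras momentum schedule m_schedule is simplified to plain powers of beta1, which only approximates what the Keras code actually computes):

import numpy as np

lr, beta1, beta2, eps, t = 0.001, 0.9, 0.999, 1e-7, 1
var, m, v, g = 1.0, 0.0, 0.0, 0.5

# Kernel (ApplyAdam with use_nesterov=True): the beta1/beta2 bias corrections are
# folded into a single mul_factor that scales the whole step (including epsilon).
mul_factor = lr * np.sqrt(1 - beta2**t) / (1 - beta1**t)
m_k = m + (1 - beta1) * (g - m)
v_k = v + (1 - beta2) * (g * g - v)
step_kernel = mul_factor * (beta1 * m_k + (1 - beta1) * g) / (eps + np.sqrt(v_k))

# Keras Nadam (momentum schedule simplified): m and v are bias-corrected
# separately, and epsilon is added after the corrected sqrt.
m_t = beta1 * m + (1 - beta1) * g
v_t = beta2 * v + (1 - beta2) * g * g
g_prime = g / (1 - beta1**t)          # assumption: m_schedule ~ beta1**t
m_prime = m_t / (1 - beta1**(t + 1))  # assumption: m_schedule_next ~ beta1**(t + 1)
v_prime = v_t / (1 - beta2**t)
m_bar = (1 - beta1) * g_prime + beta1 * m_prime
step_keras = lr * m_bar / (np.sqrt(v_prime) + eps)

print(step_kernel, step_keras)  # the two per-step updates differ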

Describe the expected behavior

I would expect Nadam as implemented in the kernel via the use_nesterov option (tensorflow/tensorflow/core/kernels/training_ops_gpu.cu.cc; available since TF 1 and still present in current master), which could be used via tensorflow.contrib.opt.NadamOptimizer in TF 1, to be conceptually the same as tf.keras.optimizers.Nadam.

Also, TF already provides ApplyAdamOp with Nesterov support. I would expect Keras Nadam to use that op anyway, as it should also be a bit more efficient.
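
For completeness, a rough end-to-end sketch (mine, not part of the original report) that drives both paths with the same constant gradient; tf.raw_ops.ResourceApplyAdam is the resource-variable form of the ApplyAdamOp kernel that contrib's NadamOptimizer used:

import tensorflow as tf

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-7
grad = tf.constant([0.5, -0.25])

# Path 1: the kernel op with use_nesterov=True.
var_a = tf.Variable([1.0, 1.0])
m_a = tf.Variable(tf.zeros_like(var_a))
v_a = tf.Variable(tf.zeros_like(var_a))
for t in range(1, 4):
    tf.raw_ops.ResourceApplyAdam(
        var=var_a.handle, m=m_a.handle, v=v_a.handle,
        beta1_power=tf.constant(beta1 ** t), beta2_power=tf.constant(beta2 ** t),
        lr=tf.constant(lr), beta1=tf.constant(beta1), beta2=tf.constant(beta2),
        epsilon=tf.constant(eps), grad=grad, use_nesterov=True)

# Path 2: the Keras optimizer under discussion.
var_b = tf.Variable([1.0, 1.0])
opt = tf.keras.optimizers.Nadam(learning_rate=lr, beta_1=beta1, beta_2=beta2, epsilon=eps)
for _ in range(3):
    opt.apply_gradients([(grad, var_b)])

print(var_a.numpy(), var_b.numpy())  # expected to differ, per the analysis above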


@albertz

albertz commented Aug 9, 2022

This was closed without comment. Has this been fixed? In which commit?
