
LSTM - different outputs for same weights across CPU and GPU, when using float32 + tf-keras + NVIDIA A100 #772

Open
lbortolotti opened this issue Apr 10, 2024 · 4 comments

@lbortolotti

System information

  • Custom Code: YES
  • OS: SUSE Linux Enterprise High Performance Computing 15 SP5
  • TensorFlow installed from: DOCKER (tensorflow/tensorflow:2.16.1-gpu-jupyter)
  • TensorFlow version: v2.16.1-0-g5bc9d26649c 2.16.1
  • Python version: 3.11
  • GPU model and memory: NVIDIA A100-PCIE-40GB
  • Code to reproduce: see below

Describe the problem
I have a model composed almost entirely of LSTM layers. If I load the same weights into two copies of the model, one instantiated to run on CPU and one on GPU, the results differ.

The issue disappears (the GPU results change to match the CPU) if I change any of the following:

  • Move from
    • SLES + NVIDIA A100 + Driver Version: 550.54.14 + CUDA Version: 12.4
      to
    • Ubuntu 22.04.4 LTS + NVIDIA V100 + Driver Version: 535.161.07 + CUDA Version: 12.2
  • Set keras.backend.set_floatx('float64')
  • Use keras 3 instead of tf-keras

In all these cases I'm running the same (official) Docker image; my only modification was to install tf-keras==2.16.0 and plotly.

Standalone code to reproduce the issue.

!pip install plotly
!pip install tf-keras==2.16.0
import os
import tensorflow as tf

import numpy as np

USE_TF_KERAS = True

if USE_TF_KERAS:
    import tf_keras as keras
    from tf_keras import layers
    from tf_keras import initializers
    from tf_keras import backend as K
else:
    import keras
    from keras import layers
    from keras import initializers
    from keras import backend as K

# Setting float64 as default dtype removes the discrepancy between CPU and GPU!
# keras.backend.set_floatx('float64')
from plotly import graph_objects as go

ROOT_DIR = os.getcwd()

n_time_steps = 800

theta = np.linspace(0, 2 * np.pi, n_time_steps).reshape(1, -1)

np.random.seed(42)
tf.random.set_seed(42)
dummy_input_dict = {
    "input_a": 800
    * np.stack((np.cos(theta), np.sin(theta)), axis=-1).astype(np.float32),
    "input_b": np.random.rand(1, n_time_steps, 5).astype(np.float32),
}


def build_model():
    input_a = layers.Input(shape=(n_time_steps, 2), name="input_a")
    input_b = layers.Input(shape=(n_time_steps, 5), name="input_b")

    x = layers.Concatenate()([input_a, input_b])
    for idx in range(8):
        lstm_layer = layers.LSTM(
            1024,
            kernel_initializer=initializers.RandomNormal(seed=42 + idx),
            recurrent_initializer=initializers.RandomNormal(seed=52 + idx),
            return_sequences=True,
        )
        x = lstm_layer(x)
    y = layers.Dense(1)(x)
    model = keras.Model(inputs=[input_a, input_b], outputs=y)

    return model


def main(device):
    with tf.device(device):
        model = build_model()
        model.load_weights("my_initial_weights.h5")

        features = ["input_a", "input_b"]
        dummy_input = [dummy_input_dict[k] for k in features]
        preds = model.predict(dummy_input)

    return preds

# Save one set of weights once, so that the CPU and GPU models load identical weights
with tf.device("/device:CPU:0"):
    model = build_model()
    model.save_weights("my_initial_weights.h5")


tf.config.list_logical_devices()

cpu_preds = main("/device:CPU:0")
gpu_preds = main("/device:GPU:0")

cpu_output = cpu_preds[0, :, 0]
gpu_output = gpu_preds[0, :, 0]

fig = go.Figure()
fig.add_trace(go.Scatter(y=cpu_output, name="CPU"))
fig.add_trace(go.Scatter(y=gpu_output, name="GPU"))
fig.show()

Resulting plot:

[figure: overlaid CPU and GPU prediction traces across the 800 time steps, showing a clear divergence]
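
To share the size of the discrepancy numerically rather than only as a plot, a small addition at the end of the script (not part of the original repro) could print the worst-case absolute and relative differences:

abs_diff = np.abs(cpu_preds - gpu_preds)
rel_diff = abs_diff / (np.abs(cpu_preds) + 1e-12)  # guard against division by zero
print("max abs diff:", abs_diff.max())
print("max rel diff:", rel_diff.max())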

As mentioned at the beginning, any of the following works around the issue, and the GPU prediction then matches the CPU prediction:

  • changing host to my V100 host
  • uncommenting # keras.backend.set_floatx('float64')
  • setting USE_TF_KERAS = False

I also reiterate that all of this was run in the official tensorflow/tensorflow:2.16.1-gpu-jupyter container on both hosts.

@tilakrayal
Collaborator

@sachinprasadhs,
I was able to reproduce the issue on TensorFlow v2.15 with tf-keras. Kindly find the gist of it here.

@lbortolotti
Author

@tilakrayal - the gist shows a very small difference between CPU/GPU predictions, similar to what I see on my V100 host. I wouldn't be surprised if differences that small were in fact expected.

But on my A100 host the difference becomes orders of magnitude larger. Is there a way to replicate my "problematic system" (NVIDIA A100 + Driver Version: 550.54.14 + CUDA Version: 12.4) on Colab, so that hopefully you can also see the magnitude of the problem beyond the screenshots I can share?

Thanks!

@lbortolotti
Author

I've updated the V100 system. It now has exactly the same driver and CUDA versions as the A100 system (Driver Version: 550.54.14 + CUDA Version: 12.4) and still does not reproduce the issue, so the problem seems specific to execution on the A100. How can we replicate this on Colab? Thanks.

@lbortolotti
Author

Latest update: I got hold of an H200 system, which exhibits the same issue I see on the A100. I've also become aware of the relatively new TF32 format (https://www.tensorflow.org/api_docs/python/tf/config/experimental/enable_tensor_float_32_execution), which is apparently enabled by default on A100 and newer GPUs. (The V100 is a Volta-generation GPU with no TF32 support, which would explain why it never showed the problem.)

Indeed, if I modify my example script and set tf.config.experimental.enable_tensor_float_32_execution(False), the numerical issues disappear, and the A100 system produces the same output as the V100 and CPUs.
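
For reference, this is the one-line change, placed near the top of the repro script (the exact placement is my choice; the flag is global TensorFlow state):

import tensorflow as tf

# Disable TensorFloat-32 so that float32 matmuls on Ampere-and-newer GPUs
# run at full float32 precision instead of the reduced-precision TF32 path.
tf.config.experimental.enable_tensor_float_32_execution(False)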

I find it quite concerning that TensorFlow would take such liberties with data types by default.

In any case, the main question I have at this point is why I don't see the same numerical issues with multi-backend Keras. Is it actually using full float32 rather than TF32? Which Keras implementation is doing the right thing?
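
One way to narrow that down, since the TF32 flag is global TensorFlow state independent of which Keras frontend is imported (a hedged suggestion; I haven't verified this under both frontends), is to query the flag under each configuration:

import tensorflow as tf

# True means float32 matmuls may execute via TF32 on Ampere-and-newer GPUs.
# Comparing this value under tf-keras vs. Keras 3 would show whether one of
# the frontends is toggling the flag behind the scenes.
print(tf.config.experimental.tensor_float_32_execution_enabled())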
