
using a custom layer, loaded_model cannot give the same predicted value compared with original model [results not reproducible] #64

Closed
h44632186 opened this issue Dec 30, 2022 · 11 comments

Comments

@h44632186

System Info

  • Tensorflow Version: 2.4.3
  • Custom Code: Yes
  • OS Platform and Distribution: CentOS Linux release 8.2.2004
  • Python version: 3.8
  • CUDA/cuDNN version: CUDA11, cuDNN8
  • GPU model and memory: RTX 3090, 24268MiB

Current Behaviour:
I implemented a custom layer and used it to build a model. After training, the original model gives a predicted value A, and the model is saved as an .h5 file. When I load the model back from the .h5 file, the loaded model gives a different predicted value B, i.e. the results are not reproducible. Normally the two models should give identical predictions.
The custom layer is as simple as a Dense layer, and I have already narrowed the problem down to it: if I comment out the line with the custom layer and uncomment the line below it (a stock tf Dense layer), the original model and the loaded model give the same results.
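For reference, a minimal sketch of the kind of layer that triggers this (the full code is in the colab below; this inline version is a simplification that creates a fresh Dense, with fresh random weights, on every call):

```python
import numpy as np
import tensorflow as tf

class Custom_Layer(tf.keras.layers.Layer):
    """Simplified sketch: a Dense is constructed inside call(),
    so every invocation gets a brand-new randomly initialized kernel."""
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units

    def call(self, x):
        dense = tf.keras.layers.Dense(self.units)  # bug: new layer per call
        return dense(x)

layer = Custom_Layer(1)
x = np.random.rand(5, 4).astype("float32")
y1 = layer(x).numpy()
y2 = layer(x).numpy()
assert not np.allclose(y1, y2)  # same input, different outputs
```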

Standalone code to reproduce the issue
https://colab.research.google.com/drive/19_8DqzfC2JadKM9ZykJRLDEcxEkzjrT7?usp=sharing
See also the link where the issue was first posted: tensorflow/tensorflow#59041

Relevant log output
2022-12-29 10:33:52.034627: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-12-29 10:33:53.136713: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-12-29 10:33:53.138072: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2022-12-29 10:33:53.198743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:3d:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2022-12-29 10:33:53.198834: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-12-29 10:33:53.203166: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-12-29 10:33:53.203273: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2022-12-29 10:33:53.204328: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-12-29 10:33:53.204657: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-12-29 10:33:53.209031: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-12-29 10:33:53.209857: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-12-29 10:33:53.210030: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-12-29 10:33:53.212456: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-12-29 10:33:53.335970: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-29 10:33:53.348062: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-12-29 10:33:53.349549: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:3d:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2022-12-29 10:33:53.349604: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-12-29 10:33:53.349644: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-12-29 10:33:53.349654: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2022-12-29 10:33:53.349664: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-12-29 10:33:53.349673: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-12-29 10:33:53.349682: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-12-29 10:33:53.349691: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-12-29 10:33:53.349701: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-12-29 10:33:53.352041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-12-29 10:33:53.352076: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-12-29 10:33:53.873305: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-12-29 10:33:53.873357: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2022-12-29 10:33:53.873366: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2022-12-29 10:33:53.877072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22430 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:3d:00.0, compute capability: 8.6)
/usr/local/lib64/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py:3503: UserWarning: Even though the tf.config.experimental_run_functions_eagerly option is set, this option does not apply to tf.data functions. tf.data functions are still traced and executed as graphs.
warnings.warn(
2022-12-29 10:33:53.990568: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2022-12-29 10:33:53.997553: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2600000000 Hz
2022-12-29 10:33:54.015124: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-12-29 10:33:54.679072: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2022-12-29 10:33:54.679260: I tensorflow/stream_executor/cuda/cuda_blas.cc:1838] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
625/625 [==============================] - 5s 7ms/step - loss: 0.4327
[[0.44245386]
[0.6916534 ]
[0.49306032]
[0.98741436]
[0.7631112 ]]
[[0.48893338]
[0.54947186]
[0.40105245]
[0.56347597]
[0.32270208]]

@sushreebarsa
Collaborator

@h44632186 Sorry for the late response!
I don't have access to the drive link you shared.
I tried to replicate the issue as described in this ticket; please find the gist here and confirm whether it reproduces the reported behaviour.
Thank you!

@h44632186
Author

@sushreebarsa Thank you for your response. The gist you posted is exactly the same as the ticket I mentioned.
As you can see, the predicted value of the original model (predicted_val in the code) differs from the predicted value of the loaded model (predicted_val2 in the code), so I confirm that you have already replicated this issue.

@h44632186
Author

@sushreebarsa By the way, I have already identified the custom layer as the cause of the above problem. If I replace the custom layer with a stock Dense layer (see the code), the original model and the loaded model give the same results. So I think the problem lies in the custom layer.

@SuryanarayanaY
Collaborator

Hi @h44632186 ,
I was able to replicate the behaviour you mentioned; the gist is attached here. Please cross-check and confirm.

@h44632186
Author

Hi @SuryanarayanaY , I have checked the gist you posted, and confirmed it is correct.


@h44632186
Author

@SuryanarayanaY Hi, do you have any suggestions to solve this issue?

@mattdangerw
Member

I think the issue here is that your custom layer creates a new dense layer every time the model is called, so each call produces a dense layer with new random weights. This means you can completely lose the state of your model (and it gets even more confusing with function tracing, where the traced model will not create new weights).

The recommended approach is not to create variables or layers inside call like this. If you change your custom layer as below, everything works.

@tf.keras.utils.register_keras_serializable()
class Custom_Layer(tf.keras.layers.Layer):
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.dense = tf.keras.layers.Dense(self.units)

    def call(self, x):
        return self.dense(x)

    def get_config(self):
        config = super().get_config()
        config.update(units=self.units)
        return config

https://keras.io/guides/making_new_layers_and_models_via_subclassing/ has an in depth guide.
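To double-check the fix, a save/load round trip can be sketched like this (a minimal example, not the reporter's actual model; the filename and shapes are arbitrary):

```python
import numpy as np
import tensorflow as tf

@tf.keras.utils.register_keras_serializable()
class Custom_Layer(tf.keras.layers.Layer):
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.dense = tf.keras.layers.Dense(self.units)  # created once, in the constructor

    def call(self, x):
        return self.dense(x)

    def get_config(self):
        config = super().get_config()
        config.update(units=self.units)
        return config

# Build a tiny model around the fixed layer and save it as HDF5.
inputs = tf.keras.Input(shape=(4,))
outputs = Custom_Layer(1)(inputs)
model = tf.keras.Model(inputs, outputs)

x = np.random.rand(5, 4).astype("float32")
pred_before = model.predict(x, verbose=0)

model.save("model.h5")
loaded = tf.keras.models.load_model(
    "model.h5", custom_objects={"Custom_Layer": Custom_Layer}
)
pred_after = loaded.predict(x, verbose=0)

# The weights now survive the round trip, so predictions match.
assert np.allclose(pred_before, pred_after)
```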

@SuryanarayanaY
Collaborator

Hi @h44632186 ,

As mentioned in the comment above, when subclassing models or layers you need to define the sublayers inside the constructor so that they are serializable and stateful.

I have modified your code as suggested above, and now both models produce the same outputs. Please refer to the attached gist. Thanks!

@JyotiPDLr

@SuryanarayanaY, the constructor here contains a layer, yet the model is serializable without explicit serialization and deserialization of that layer in the get_config() and from_config() methods. Per the documentation, shouldn't the nested layer be serialized and deserialized in get_config() and from_config()? What am I missing here?
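To illustrate what I mean (a small sketch reusing the fixed layer from above; my understanding is that the config carries only the constructor arguments, while the nested Dense's weights travel with the model's saved weights rather than the config):

```python
import numpy as np
import tensorflow as tf

class Custom_Layer(tf.keras.layers.Layer):
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.dense = tf.keras.layers.Dense(self.units)

    def call(self, x):
        return self.dense(x)

    def get_config(self):
        config = super().get_config()
        config.update(units=self.units)
        return config

x = np.random.rand(2, 4).astype("float32")
layer = Custom_Layer(3)
y = layer(x).numpy()

# The config carries only constructor arguments -- no weights, no nested Dense:
assert layer.get_config()["units"] == 3

# from_config() rebuilds the architecture via __init__ (which recreates the
# Dense); the trained weights are restored separately from the saved weights.
clone = Custom_Layer.from_config(layer.get_config())
clone(x)                                # build the clone's variables
clone.set_weights(layer.get_weights())  # tracked sublayer weights round-trip
assert np.allclose(clone(x).numpy(), y)
```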

@sachinprasadhs sachinprasadhs transferred this issue from keras-team/keras Sep 22, 2023
@github-actions

github-actions bot commented Oct 7, 2023

This issue is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale label Oct 7, 2023
@github-actions

This issue was closed because it has been inactive for 28 days. Please reopen if you'd like to work on this further.
