
using a custom layer, loaded_model cannot give the same predicted value compared with original model [results not reproducible] #64

Closed
h44632186 opened this issue Dec 30, 2022 · 11 comments

Comments

@h44632186

System Info

  • Tensorflow Version: 2.4.3
  • Custom Code: Yes
  • OS Platform and Distribution: CentOS Linux release 8.2.2004
  • Python version: 3.8
  • CUDA/cuDNN version: CUDA11, cuDNN8
  • GPU model and memory: RTX 3090, 24268MiB

Current Behaviour:
I implemented a custom layer and used it to build a model. After training, the original model gives a predicted value A, and the model is saved as an .h5 file. When I load the model back from the .h5 file, the loaded model gives a different predicted value B, i.e. the results are not reproducible. Normally the two models should give identical predictions.
The custom layer is as simple as a Dense layer, and I have already narrowed the problem down to it: if I comment out the line with the custom layer and uncomment the line below it (a stock tf Dense layer), the original model and the loaded model give the same results.
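For reference, a minimal sketch of the kind of layer that triggers this (the full code is in the colab below; this inline version is a simplification that creates a fresh Dense, with fresh random weights, on every call):

```python
import numpy as np
import tensorflow as tf

class Custom_Layer(tf.keras.layers.Layer):
    """Simplified sketch: a Dense is constructed inside call(),
    so every invocation gets a brand-new randomly initialized kernel."""
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units

    def call(self, x):
        dense = tf.keras.layers.Dense(self.units)  # bug: new layer per call
        return dense(x)

layer = Custom_Layer(1)
x = np.random.rand(5, 4).astype("float32")
y1 = layer(x).numpy()
y2 = layer(x).numpy()
assert not np.allclose(y1, y2)  # same input, different outputs
```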

Standalone code to reproduce the issue
https://colab.research.google.com/drive/19_8DqzfC2JadKM9ZykJRLDEcxEkzjrT7?usp=sharing
See also the link where the issue was first posted: tensorflow/tensorflow#59041

Relevant log output
2022-12-29 10:33:52.034627: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-12-29 10:33:53.136713: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-12-29 10:33:53.138072: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2022-12-29 10:33:53.198743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:3d:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2022-12-29 10:33:53.198834: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-12-29 10:33:53.203166: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-12-29 10:33:53.203273: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2022-12-29 10:33:53.204328: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-12-29 10:33:53.204657: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-12-29 10:33:53.209031: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-12-29 10:33:53.209857: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-12-29 10:33:53.210030: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-12-29 10:33:53.212456: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-12-29 10:33:53.335970: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-29 10:33:53.348062: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-12-29 10:33:53.349549: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:3d:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2022-12-29 10:33:53.349604: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-12-29 10:33:53.349644: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-12-29 10:33:53.349654: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2022-12-29 10:33:53.349664: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-12-29 10:33:53.349673: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-12-29 10:33:53.349682: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2022-12-29 10:33:53.349691: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-12-29 10:33:53.349701: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-12-29 10:33:53.352041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2022-12-29 10:33:53.352076: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-12-29 10:33:53.873305: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-12-29 10:33:53.873357: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2022-12-29 10:33:53.873366: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2022-12-29 10:33:53.877072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22430 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:3d:00.0, compute capability: 8.6)
/usr/local/lib64/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py:3503: UserWarning: Even though the tf.config.experimental_run_functions_eagerly option is set, this option does not apply to tf.data functions. tf.data functions are still traced and executed as graphs.
warnings.warn(
2022-12-29 10:33:53.990568: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2022-12-29 10:33:53.997553: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2600000000 Hz
2022-12-29 10:33:54.015124: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-12-29 10:33:54.679072: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2022-12-29 10:33:54.679260: I tensorflow/stream_executor/cuda/cuda_blas.cc:1838] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
625/625 [==============================] - 5s 7ms/step - loss: 0.4327
[[0.44245386]
[0.6916534 ]
[0.49306032]
[0.98741436]
[0.7631112 ]]
[[0.48893338]
[0.54947186]
[0.40105245]
[0.56347597]
[0.32270208]]

@sushreebarsa
Collaborator

@h44632186 Sorry for the late response!
I don't have access to the drive link you shared.
I tried to replicate the issue as described in this ticket; please find the gist here and confirm whether it reproduces the reported behaviour.
Thank you!

@h44632186
Author

@sushreebarsa Thank you for your response. The gist you posted is exactly the same as the ticket I mentioned.
As you can see, the predicted value of the original model (predicted_val in the code) differs from the predicted value of the loaded model (predicted_val2 in the code), so I confirm that you have already replicated this issue.

@h44632186
Author

@sushreebarsa By the way, I have already identified the custom layer as the cause of the above problem. If I replace the custom layer with a stock Dense layer (see the code), the original model and the loaded model give the same results. So I think the problem lies in the custom layer.

@SuryanarayanaY
Collaborator

Hi @h44632186 ,
I was able to replicate the behaviour you mentioned; the gist is attached here. Please cross-check and confirm.

@h44632186
Author

Hi @SuryanarayanaY , I have checked the gist you posted, and confirmed it is correct.


@h44632186
Author

@SuryanarayanaY Hi, do you have any suggestions to solve this issue?

@mattdangerw
Member

I think the issue here is that your custom layer creates a new dense layer every time the model is called, so each call produces a dense layer with new random weights. This means you can completely lose the state of your model (and it gets even more confusing with function tracing, where the traced model will not create new weights).

The recommended approach is not to create variables or layers inside call like this. If you change your custom layer as below, everything works.

@tf.keras.utils.register_keras_serializable()
class Custom_Layer(tf.keras.layers.Layer):
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.dense = tf.keras.layers.Dense(self.units)

    def call(self, x):
        return self.dense(x)

    def get_config(self):
        config = super().get_config()
        config.update(units=self.units)
        return config

https://keras.io/guides/making_new_layers_and_models_via_subclassing/ has an in depth guide.
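To double-check the fix, a save/load round trip can be sketched like this (a minimal example, not the reporter's actual model; the filename and shapes are arbitrary):

```python
import numpy as np
import tensorflow as tf

@tf.keras.utils.register_keras_serializable()
class Custom_Layer(tf.keras.layers.Layer):
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.dense = tf.keras.layers.Dense(self.units)  # created once, in the constructor

    def call(self, x):
        return self.dense(x)

    def get_config(self):
        config = super().get_config()
        config.update(units=self.units)
        return config

# Build a tiny model around the fixed layer and save it as HDF5.
inputs = tf.keras.Input(shape=(4,))
outputs = Custom_Layer(1)(inputs)
model = tf.keras.Model(inputs, outputs)

x = np.random.rand(5, 4).astype("float32")
pred_before = model.predict(x, verbose=0)

model.save("model.h5")
loaded = tf.keras.models.load_model(
    "model.h5", custom_objects={"Custom_Layer": Custom_Layer}
)
pred_after = loaded.predict(x, verbose=0)

# The weights now survive the round trip, so predictions match.
assert np.allclose(pred_before, pred_after)
```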

@SuryanarayanaY
Collaborator

Hi @h44632186 ,

As mentioned in the comment above, when subclassing models or layers you need to define the sublayers inside the constructor so that they are serializable and stateful.

I have modified your code as suggested above, and now both models produce the same outputs. Please refer to the attached gist. Thanks!

@JyotiPDLr

@SuryanarayanaY, the constructor here contains a layer, yet the model is serializable without explicit serialization and deserialization of that layer in the get_config() and from_config() methods. Per the documentation, shouldn't the nested layer be serialized and deserialized in get_config() and from_config()? What am I missing here?
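To illustrate what I mean (a small sketch reusing the fixed layer from above; my understanding is that the config carries only the constructor arguments, while the nested Dense's weights travel with the model's saved weights rather than the config):

```python
import numpy as np
import tensorflow as tf

class Custom_Layer(tf.keras.layers.Layer):
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.dense = tf.keras.layers.Dense(self.units)

    def call(self, x):
        return self.dense(x)

    def get_config(self):
        config = super().get_config()
        config.update(units=self.units)
        return config

x = np.random.rand(2, 4).astype("float32")
layer = Custom_Layer(3)
y = layer(x).numpy()

# The config carries only constructor arguments -- no weights, no nested Dense:
assert layer.get_config()["units"] == 3

# from_config() rebuilds the architecture via __init__ (which recreates the
# Dense); the trained weights are restored separately from the saved weights.
clone = Custom_Layer.from_config(layer.get_config())
clone(x)                                # build the clone's variables
clone.set_weights(layer.get_weights())  # tracked sublayer weights round-trip
assert np.allclose(clone(x).numpy(), y)
```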

@sachinprasadhs sachinprasadhs transferred this issue from keras-team/keras Sep 22, 2023
@github-actions

github-actions bot commented Oct 7, 2023

This issue is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale label Oct 7, 2023
@github-actions

This issue was closed because it has been inactive for 28 days. Please reopen if you'd like to work on this further.
