
usage of the checkpoint throws error #82

Closed
posEdgeOfLife opened this issue Feb 2, 2022 · 9 comments

@posEdgeOfLife

I want to evaluate Axial-DeepLab. Here is my command:

train.py --config_file ../configs/cityscapes/axial_deeplab/max_deeplab_s_backbone_os16.textproto --mode eval --model_dir C:\develop\max_deeplab_s_backbone_os16_axial_deeplab_cityscapes_trainfine\

I also updated initial_checkpoint to C:\develop\max_deeplab_s_backbone_os16_axial_deeplab_cityscapes_trainfine\ckpt-60000, which is the prefix for both the data and index file downloaded from the official checkpoint.

When I run the script, I keep getting a warning:
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer
W0201 20:33:08.415140 51764 util.py:204] Unresolved object in checkpoint: (root).optimizer
....

I believe the dataset pattern and experiment name are correct, since there are no errors in the earlier stages; the warning only appears once the evaluation starts.

Am I using the checkpoint in the wrong way? I am confused by the checkpoint path: should it be a directory, one of the files (data, index), or dir/ckpt-60000?

@markweberdev
Collaborator

Hi @posEdgeOfLife

Could you provide us some more information about the following:

  • Your OS
  • Python version
  • Tensorflow version
  • Does the evaluation still run despite these warnings?

@posEdgeOfLife
Author

Hi @posEdgeOfLife

Could you provide us some more information about the following:

  • Windows 10
  • 3.6.9
  • TensorFlow 2.7
  • Yes, it would run with these warnings.

However, when I try to freeze the graph using the following:

import tensorflow as tf
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

model = tf.keras.models.load_model(r"C:/develop/max_deeplab_l_backbone_os16_axial_deeplab_cityscapes_trainfine_saved_model/")
signatures = model.signatures['serving_default']
frozen_func = convert_variables_to_constants_v2(signatures)

I got error:
ValueError: Node 'DeepLab/max_deeplab_l_backbone/stage4/block2/attention/height_axis/query_rpe/Gather/axis' is not unique

With a little debugging, I found that all the Gather nodes are converted from ResourceGather, which has two inputs:

  1. resource
  2. indices

Both of these get converted into a Gather node with the same name. I am not sure what the proper way to resolve this is. Any suggestions? Thanks for the help.
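
Something like the following can be used to list the ResourceGather nodes and their inputs before freezing (a sketch only, building on the signatures object from the snippet above):

graph_def = signatures.graph.as_graph_def()
for node in graph_def.node:
    if node.op == 'ResourceGather':
        # each ResourceGather references the variable resource plus the index tensor
        print(node.name, list(node.input))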

@csrhddlam
Contributor

Hello,

Thanks for your interest in our work and the code.

WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer
W0201 20:33:08.415140 51764 util.py:204] Unresolved object in checkpoint: (root).optimizer

This warning looks normal to me. For evaluation, only the model weights are needed; the optimizer states are not useful, so they are not loaded from the checkpoint file (which contains both model weights and optimizer states). So please disregard the warning, especially since you replied that your code is able to run with these warnings, but please make sure that the evaluation result matches the expected number.
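
As a side note, when a checkpoint is restored manually with tf.train.Checkpoint, calling expect_partial() on the restore status suppresses exactly this kind of unresolved-object warning. A minimal sketch (not the code path deeplab2 itself uses), with a stand-in model:

import tensorflow as tf

net = tf.keras.layers.Dense(1)  # stand-in for the actual built model
ckpt = tf.train.Checkpoint(model=net)
status = ckpt.restore(r"C:\develop\max_deeplab_s_backbone_os16_axial_deeplab_cityscapes_trainfine\ckpt-60000")
# expect_partial() declares that restoring only part of the checkpoint (e.g. no optimizer slots) is intended,
# which silences the "Unresolved object in checkpoint: (root).optimizer" warnings.
status.expect_partial()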

Am I using the checkpoint in the wrong way? I am confused by the checkpoint path: should it be a directory, one of the files (data, index), or dir/ckpt-60000?

I think you are using it correctly. The checkpoint loader is designed in a flexible way (it supports a file prefix or a directory) so that we can load not only trained weights for evaluation, but also ImageNet-pretrained checkpoints, as well as support online evaluation of the latest checkpoint in a directory while training.
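
For example (a sketch only; the directory is the one from this thread, and tf.train.latest_checkpoint additionally needs a 'checkpoint' state file in that directory):

import tensorflow as tf

ckpt_dir = r"C:\develop\max_deeplab_s_backbone_os16_axial_deeplab_cityscapes_trainfine"

# Point directly at the prefix shared by the .data-* and .index files ...
prefix = ckpt_dir + r"\ckpt-60000"

# ... or let TensorFlow resolve the newest prefix inside a training directory.
latest = tf.train.latest_checkpoint(ckpt_dir)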

ValueError: Node 'DeepLab/max_deeplab_l_backbone/stage4/block2/attention/height_axis/query_rpe/Gather/axis' is not unique

I have not seen such an error before. What are you trying to achieve with the frozen graph? Is it for model exporting/deployment? Does the frozen graph relate to the warning for evaluation mentioned in this issue?

Best,
Huiyu

@posEdgeOfLife
Author

Hi Huiyu, thanks for your answer. I figured out the checkpoint and the warning message (it was not clear from the documents); however, the root cause of the last error is still not clear to me. Yes, I am trying to export the graph to other formats, mainly ONNX. The issue occurs when TensorFlow converts ResourceGather into a frozen node: it actually creates two nodes with the same name, since ResourceGather in this case has two inputs as its edges. I slightly modified the TensorFlow Python code with a map for deduplication, but I am not entirely clear about the root cause.

@csrhddlam
Contributor

csrhddlam commented Mar 28, 2022

Hi, glad to know that the checkpoint issue is resolved. I am not an expert in exporting graphs so I am not certain what the root cause is, but there are three points that may be helpful.

  1. For Axial-DeepLab models, recompute_grad is usually set to True, and recompute_grad usually computes the forward pass for the second time during the backward pass. So I would suggest making sure that recompute_grad is disabled during exporting.
  2. To figure out the root cause, maybe one can try exporting a ResNet model instead. The ResNet model will usually have recompute_grad disabled and will not use any relative positional encodings that call the gather function.
  3. For Axial-DeepLab models, if the gathering nodes are the only issue that blocks the process, I believe the gathering node can be removed by simply storing the gathered output as model parameters. (The gather nodes in the relative positional encoding layers do not depend on the input image.)

That said, were you able to make it work with your deduplication modification?
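
For point 3 above, a minimal illustration of what storing the gathered output as model parameters could look like (the tensors here are made-up stand-ins, not the actual deeplab2 layers):

import tensorflow as tf

table = tf.Variable(tf.random.normal([17, 32]))  # stand-in for a learned relative positional encoding table
indices = tf.constant([0, 3, 5, 7])              # fixed indices that do not depend on the input image

gathered = tf.gather(table, indices)             # what the exported graph computes today
baked_in = tf.Variable(gathered)                 # store the gathered rows as a parameter instead
# Exporting a model that reads baked_in directly removes the Gather node from the graph.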

@posEdgeOfLife
Author

Thanks again for the tips. Deduplication got me around the first issue, but I had to remove the post-processing part to eliminate loops, which are not supported in ONNX or TFLite. However, after testing with ONNX Runtime, the final result does not match. I am stuck there, since debugging layer by layer is a lot of work. I think all three points you mentioned are worth a retry to see if they fix anything. Thanks again!
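
A rough sketch of the kind of output comparison involved (the paths, input shape, and the output picked here are placeholders, not the real ones):

import numpy as np
import onnxruntime as ort
import tensorflow as tf

x = np.random.rand(1, 1025, 2049, 3).astype(np.float32)  # placeholder input shape

tf_model = tf.saved_model.load("path/to/saved_model")
tf_fn = tf_model.signatures['serving_default']
tf_out = tf_fn(tf.constant(x))  # may need a keyword argument matching the signature's input name

sess = ort.InferenceSession("path/to/model.onnx")
onnx_out = sess.run(None, {sess.get_inputs()[0].name: x})

# Compare one pair of corresponding outputs; a large difference points at the conversion.
print(np.abs(list(tf_out.values())[0].numpy() - onnx_out[0]).max())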

@fschvart

Hi @csrhddlam

I am trying to export a COCO Panoptic-DeepLab semantic segmentation model with a resnet_beta backbone to a saved_model file. (Can I / should I export to something else?)
I'm using the 60k checkpoint; I get several warnings, but the export finishes without errors (although nothing is created in saved_model/assets). The saved_model does not perform as well as the eval step of the 60k checkpoint on the same image, so I assume the export isn't working correctly.
So first, if I want to export this checkpoint into a model that will eventually be used in C++, is there anything I can do differently?
I also noticed that resnet_beta has 'axial_use_recompute_grad' set to True. How can I set it to False, and under which section of the config file should it go?

Thanks!

@fschvart

OK, I partly answered my own question: to change the setting, it needs to be edited in the model/encoder/axial_resnet_instances.py file.
I still have an issue, so I'll open a new one.

@aquariusjay
Contributor

Closing the issue, since @fschvart has opened a new one, and @posEdgeOfLife could reopen it if there is still any issue.
