
usage of the checkpoint throws error #82

Closed
posEdgeOfLife opened this issue Feb 2, 2022 · 9 comments

@posEdgeOfLife

I want to evaluate Axial-DeepLab. Here is my command:

train.py --config_file ../configs/cityscapes/axial_deeplab/max_deeplab_s_backbone_os16.textproto --mode eval --model_dir C:\develop\max_deeplab_s_backbone_os16_axial_deeplab_cityscapes_trainfine\

I also updated initial_checkpoint to C:\develop\max_deeplab_s_backbone_os16_axial_deeplab_cityscapes_trainfine\ckpt-60000, which is the prefix for both the data and index file downloaded from the official checkpoint.

When I run the script, I keep getting a warning:
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer
W0201 20:33:08.415140 51764 util.py:204] Unresolved object in checkpoint: (root).optimizer
....

I believe the dataset pattern and experiment name are correct, since there are no errors in the earlier stages; the warning only appears once the evaluation starts.

Am I using the checkpoint in the wrong way? I am confused by the checkpoint path: should it be a directory, one of the files (data, index), or dir/ckpt-60000?

@markweberdev
Collaborator

Hi @posEdgeOfLife

Could you provide us some more information about the following:

  • Your OS
  • Python version
  • Tensorflow version
  • Does the evaluation still run despite these warnings?

@posEdgeOfLife
Author

Hi @posEdgeOfLife

Could you provide us some more information about the following:

  • Windows 10
  • 3.6.9
  • TensorFlow 2.7
  • Yes, it would run with these warnings.

However, when I try to freeze the graph using the following:

import tensorflow as tf
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

model = tf.keras.models.load_model(r"C:/develop/max_deeplab_l_backbone_os16_axial_deeplab_cityscapes_trainfine_saved_model/")
signatures = model.signatures['serving_default']
frozen_func = convert_variables_to_constants_v2(signatures)

I got error:
ValueError: Node 'DeepLab/max_deeplab_l_backbone/stage4/block2/attention/height_axis/query_rpe/Gather/axis' is not unique

With a little debugging, I found that all the Gather nodes are converted from ResourceGather, which has two inputs:

  1. resource
  2. indices

Both of these get converted into a Gather node with the same name. I am not sure what the proper way to resolve this is. Any suggestions? Thanks for the help.
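
Something like the following can be used to list the ResourceGather nodes and their inputs before freezing (a sketch only, building on the signatures object from the snippet above):

graph_def = signatures.graph.as_graph_def()
for node in graph_def.node:
    if node.op == 'ResourceGather':
        # each ResourceGather references the variable resource plus the index tensor
        print(node.name, list(node.input))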

@csrhddlam
Contributor

Hello,

Thanks for your interest in our work and the code.

WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer
W0201 20:33:08.415140 51764 util.py:204] Unresolved object in checkpoint: (root).optimizer

This warning looks normal to me. For evaluation, only the model weights are needed; the optimizer states are not useful, so they are not loaded from the checkpoint file (which contains both model weights and optimizer states). So please disregard the warning, especially since you replied that your code is able to run with these warnings, but please make sure that the evaluation result matches the expected number.
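
As a side note, when a checkpoint is restored manually with tf.train.Checkpoint, calling expect_partial() on the restore status suppresses exactly this kind of unresolved-object warning. A minimal sketch (not the code path deeplab2 itself uses), with a stand-in model:

import tensorflow as tf

net = tf.keras.layers.Dense(1)  # stand-in for the actual built model
ckpt = tf.train.Checkpoint(model=net)
status = ckpt.restore(r"C:\develop\max_deeplab_s_backbone_os16_axial_deeplab_cityscapes_trainfine\ckpt-60000")
# expect_partial() declares that restoring only part of the checkpoint (e.g. no optimizer slots) is intended,
# which silences the "Unresolved object in checkpoint: (root).optimizer" warnings.
status.expect_partial()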

Am I using the checkpoint in the wrong way? I am confused by the checkpoint path: should it be a directory, one of the files (data, index), or dir/ckpt-60000?

I think you are using it correctly. The checkpoint loader is designed in a flexible way (it supports a file prefix or a directory) so that we can load not only trained weights for evaluation, but also ImageNet-pretrained checkpoints, as well as support online evaluation of the latest checkpoint in a directory while training.
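
For example (a sketch only; the directory is the one from this thread, and tf.train.latest_checkpoint additionally needs a 'checkpoint' state file in that directory):

import tensorflow as tf

ckpt_dir = r"C:\develop\max_deeplab_s_backbone_os16_axial_deeplab_cityscapes_trainfine"

# Point directly at the prefix shared by the .data-* and .index files ...
prefix = ckpt_dir + r"\ckpt-60000"

# ... or let TensorFlow resolve the newest prefix inside a training directory.
latest = tf.train.latest_checkpoint(ckpt_dir)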

ValueError: Node 'DeepLab/max_deeplab_l_backbone/stage4/block2/attention/height_axis/query_rpe/Gather/axis' is not unique

I have not seen such an error before. What are you trying to achieve with the frozen graph? Is it for model exporting/deployment? Does the frozen graph relate to the warning for evaluation mentioned in this issue?

Best,
Huiyu

@posEdgeOfLife
Author

Hi Huiyu, thanks for your answer. I figured out the checkpoint and the warning message (it was not clear from the documents); however, the root cause of the last error is still not clear to me. Yes, I am trying to export the graph to other formats, mainly ONNX. The issue occurs when TensorFlow converts ResourceGather into a frozen node: it actually creates two nodes with the same name, since ResourceGather in this case has two inputs as its edges. I slightly modified the TensorFlow Python code with a map for deduplication, but I am not entirely clear about the root cause.

@csrhddlam
Contributor

csrhddlam commented Mar 28, 2022

Hi, glad to know that the checkpoint issue is resolved. I am not an expert in exporting graphs so I am not certain what the root cause is, but there are three points that may be helpful.

  1. For Axial-DeepLab models, recompute_grad is usually set to True, and recompute_grad usually computes the forward pass for the second time during the backward pass. So I would suggest making sure that recompute_grad is disabled during exporting.
  2. To figure out the root cause, maybe one can try exporting a ResNet model instead. The ResNet model will usually have recompute_grad disabled and will not use any relative positional encodings that call the gather function.
  3. For Axial-DeepLab models, if the gathering nodes are the only issue that blocks the process, I believe the gathering node can be removed by simply storing the gathered output as model parameters. (The gather nodes in the relative positional encoding layers do not depend on the input image.)

That said, were you able to make it work with your deduplication modification?
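
For point 3 above, a minimal illustration of what storing the gathered output as model parameters could look like (the tensors here are made-up stand-ins, not the actual deeplab2 layers):

import tensorflow as tf

table = tf.Variable(tf.random.normal([17, 32]))  # stand-in for a learned relative positional encoding table
indices = tf.constant([0, 3, 5, 7])              # fixed indices that do not depend on the input image

gathered = tf.gather(table, indices)             # what the exported graph computes today
baked_in = tf.Variable(gathered)                 # store the gathered rows as a parameter instead
# Exporting a model that reads baked_in directly removes the Gather node from the graph.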

@posEdgeOfLife
Author

Thanks again for the tips. Deduplication got me around the first issue, but I had to remove the post-processing part to eliminate loops, which are not supported in ONNX or TFLite. However, after testing with ONNX Runtime, the final result does not match. I am stuck there, since debugging layer by layer is a lot of work. I think all three points you mentioned are worth a retry to see if they fix anything. Thanks again!
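
A rough sketch of the kind of output comparison involved (the paths, input shape, and the output picked here are placeholders, not the real ones):

import numpy as np
import onnxruntime as ort
import tensorflow as tf

x = np.random.rand(1, 1025, 2049, 3).astype(np.float32)  # placeholder input shape

tf_model = tf.saved_model.load("path/to/saved_model")
tf_fn = tf_model.signatures['serving_default']
tf_out = tf_fn(tf.constant(x))  # may need a keyword argument matching the signature's input name

sess = ort.InferenceSession("path/to/model.onnx")
onnx_out = sess.run(None, {sess.get_inputs()[0].name: x})

# Compare one pair of corresponding outputs; a large difference points at the conversion.
print(np.abs(list(tf_out.values())[0].numpy() - onnx_out[0]).max())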

@fschvart

Hi @csrhddlam

I am trying to export a COCO Panoptic-DeepLab semantic segmentation model with a resnet_beta backbone to a saved_model file. (Can I / should I export to something else?)
I'm using the 60k checkpoint; I get several warnings, but the export finishes without errors (although nothing is created in saved_model/assets). The saved_model does not perform as well as the eval step of the 60k checkpoint on the same image, so I assume the export isn't working correctly.
So first, if I want to export this checkpoint into a model that will eventually be used in C++, is there anything I can do differently?
I also noticed that resnet_beta has 'axial_use_recompute_grad' set to True. How can I set it to False, and under which section of the config file should it go?

Thanks!

@fschvart

OK, I partly answered my own question: to change the setting, it needs to be edited in the model/encoder/axial_resnet_instances.py file.
I still have an issue, so I'll open a new one.

@aquariusjay
Contributor

Closing the issue, since @fschvart has opened a new one, and @posEdgeOfLife could reopen it if there is still any issue.
