ONNX model cannot be simplified or pass onnx.checker, and gives weird output #10

Closed
lucasjinreal opened this issue Oct 14, 2021 · 10 comments
Labels
bug Something isn't working

Comments

@lucasjinreal

The exported ONNX model has very weird dimensions, so it cannot be simplified or pass onnx.checker.check_model.

This is the verbose output of exporting DETR:

image

This is the verbose output of exporting AnchorDETR:

image

Both show the last several layers. As you can see, for DETR the strides seem very small,

but for AnchorDETR they are something like Float(1, 1, 900, 91, strides=[81900, 81900, 91, 1], requires_grad=1, device=cuda:0), a giant value.

This causes the following error when I try to check or simplify the model:

ValueError: Message onnx.ModelProto exceeds maximum protobuf size of 2GB: 3714028571

Any idea?

@tangjiuqi097
Collaborator

Hi, @jinfagang

Both show the last several layers. As you can see, for DETR the strides seem very small,

but for AnchorDETR they are something like Float(1, 1, 900, 91, strides=[81900, 81900, 91, 1], requires_grad=1, device=cuda:0), a giant value.

I think the stride values are normal. The strides are derived from the shape. For example, if the shape of a contiguous tensor is [a, b, c, d], then the strides are [b*c*d, c*d, d, 1].
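To make the shape-to-stride rule concrete, here is a minimal sketch in plain Python. The helper name `contiguous_strides` is made up for illustration and is not part of PyTorch or ONNX; it assumes a contiguous, row-major tensor.

```python
def contiguous_strides(shape):
    """Return the element strides of a contiguous row-major tensor."""
    strides = []
    step = 1
    # Walk the dimensions from last to first; each stride is the
    # product of all dimension sizes to its right.
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return list(reversed(strides))


# The shape from the AnchorDETR export log above.
print(contiguous_strides([1, 1, 900, 91]))  # [81900, 81900, 91, 1]
```

So the large stride values only reflect the 900 x 91 output shape, not a corrupted tensor.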

This causes the following error when I try to check or simplify the model:

The check fails because of onnx_simplifier, not because of the exported ONNX model. The exported ONNX model passes onnx.checker.check_model, and I will push the code with the checker to export_onnx.py.

The model size is still normal before the function eliminate_const_nodes in onnx_simplifier. For this problem, I suggest you open an issue in the onnx-simplifier repo.

@lucasjinreal
Author

lucasjinreal commented Oct 14, 2021

@tangjiuqi097 Yes, I already have, but the issue is not only there.

onnx-simplifier eliminates constant values and simplifies the whole graph by calling ONNX optimizer functions. In other words, without it a transformer model cannot be converted to any other framework, or at least not in an optimized way, which makes the export meaningless.

I just don't know why AnchorDETR cannot pass the simplifier while DETR can.

I have tried DETR. The file sizes of the AnchorDETR and DETR models are at about the same level, but the DETR model can be simplified and run with ONNX Runtime (after simplification).

If you try to run inference with the AnchorDETR ONNX model, you will find that the results are all NaN. This means the model cannot be run correctly with onnxruntime (even though you might not get any error thrown). And even if it can run on CPU with onnxruntime, that doesn't mean it can run on GPU.

@tangjiuqi097
Collaborator

@jinfagang Hi,

If you try to run inference with the AnchorDETR ONNX model, you will find that the results are all NaN.

This problem may be the same as issue #49 in DETR, and I will fix it as they do. But I am not sure whether the simplifier problem is related to it.

@tangjiuqi097 tangjiuqi097 reopened this Oct 14, 2021
@lucasjinreal
Author

@tangjiuqi097 Thanks for noticing this. Given that your ONNX export keeps nested_tensor_list out of the traced scope, it might not be closely related to that problem, but it is worth a try. I am still puzzled about why the model cannot be simplified, because without that it is hard to deploy on TensorRT or TVM.

@tangjiuqi097
Collaborator

@jinfagang The problems are now fixed by following #173 in DETR.

It is because ONNX does not support the slice assignment in nested_tensor_from_tensor_list. As a result, all regions are masked, which leads to NaN in the feature positions and attention weights.
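As a rough illustration of why a fully masked input turns into NaN, here is a plain-Python sketch of a masked softmax. It is not the AnchorDETR attention code; `masked_softmax` is a hypothetical helper, and it only assumes that masked positions are set to negative infinity before the softmax.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores where mask[i] == True marks an ignored position."""
    masked = [float("-inf") if m else s for s, m in zip(scores, mask)]
    peak = max(masked)
    # If every position is masked, peak is -inf and (-inf) - (-inf) is NaN,
    # so every weight below becomes NaN.
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.5, 2.0, -1.0]

# Only the last position is padding: the weights are finite and sum to 1.
print(masked_softmax(scores, [False, False, True]))

# Every position is masked, as when the mask is never cleared by the
# slice assignment: all weights are NaN.
print(masked_softmax(scores, [True, True, True]))  # [nan, nan, nan]
```

Once one attention weight is NaN it propagates through the following layers, which matches the all-NaN outputs reported earlier in the thread.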

The size increase in onnx_simplifier disappeared after fixing the bug in nested_tensor_from_tensor_list. You can open an issue in onnx-simplifier if you are interested in the reason.

@lucasjinreal
Author

@tangjiuqi097 Nice, let me give it a try.

@tangjiuqi097 tangjiuqi097 added the bug Something isn't working label Oct 15, 2021
@tangjiuqi097
Collaborator

tangjiuqi097 commented Oct 15, 2021

Hi,
1.

assert (np.abs(res1[0].cpu().numpy()-res2[0]).max() < 1e-5) and (np.abs(res1[1].cpu().numpy()-res2[1]).max() < 1e-5), "inaccurate results"

AssertionError: inaccurate results

This is because 1e-5 is too strict for pred_logits. The results are actually fine, and I have updated the code.
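For reference, a tolerance check of this kind can be sketched in plain Python as below. The helper names and the sample numbers are made up for illustration, and the looser tolerance of 1e-4 is only an assumption; the thread does not say which value the updated code uses.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two flat lists."""
    return max(abs(x - y) for x, y in zip(reference, candidate))


def outputs_match(reference, candidate, atol):
    """True if every element differs by less than atol."""
    return max_abs_diff(reference, candidate) < atol


# Illustrative values only: logits from two runtimes that differ by
# a few 1e-5, which is typical float32 noise for large-magnitude logits.
pytorch_logits = [-4.61234, 2.30001]
onnx_logits = [-4.61237, 2.30003]

print(outputs_match(pytorch_logits, onnx_logits, atol=1e-5))  # False
print(outputs_match(pytorch_logits, onnx_logits, atol=1e-4))  # True
```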

And I can't get any detections using this onnx inference script:

Did you load the checkpoint before exporting the ONNX model?

@tangjiuqi097 tangjiuqi097 reopened this Oct 15, 2021
@lucasjinreal
Author

image

Now it looks normal. I will do TensorRT acceleration later.

@tangjiuqi097
Collaborator

@jinfagang BTW, since we use focal loss for the classification loss, it is better to use sigmoid instead of softmax to get the confidence scores.
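A small plain-Python sketch of the difference, for one query. This is not the repository's post-processing code; the function names and the logit values are hypothetical and only meant to show why softmax can report a confident score for a query that every class rejects.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def query_confidence(logits):
    """Return (class_index, score) using an independent sigmoid per class."""
    scores = [sigmoid(v) for v in logits]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical logits for a background query: every class is unlikely.
logits = [-6.0, -5.0, -7.0]

print(query_confidence(logits))  # class 1 with a score below 0.01
print(max(softmax(logits)))      # above 0.6, because softmax forces a sum of 1
```

With focal loss each class is trained as its own binary problem, so the per-class sigmoid keeps the low score, while softmax normalizes the same logits into a seemingly confident detection.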

@lucasjinreal
Author

@tangjiuqi097 Thanks for the advice.
