
cannot convert tf savedmodel to onnx #1287

Closed
zhaohb opened this issue Jan 22, 2021 · 32 comments
Labels
pending on user response Waiting for more information or validation from user

Comments

@zhaohb

zhaohb commented Jan 22, 2021

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): tf-nightly-gpu 2.5.0.dev20210119
Python version: 3.6 (Anaconda)
Tensorflow-onnx version: 1.8.0, built from source

My command line:

python -m tf2onnx.convert --saved-model ./model.savedmodel --output fea.onnx --custom-ops Bucketize,AsString,StringToHashBucketFast --signature_def serving_default --tag serve --opset 12 

But I got the following error:

......
2021-01-21 11:29:41,413 - ERROR - Could not find table resource to replace placeholder unknown_172
2021-01-21 11:29:41,415 - ERROR - Could not find table resource to replace placeholder unknown_174
2021-01-21 11:29:41,416 - ERROR - Could not find table resource to replace placeholder unknown_176
2021-01-21 11:29:41,417 - ERROR - Could not find table resource to replace placeholder unknown_178
2021-01-21 11:29:41,418 - ERROR - Could not find table resource to replace placeholder unknown_180
2021-01-21 11:29:41,418 - ERROR - Could not find table resource to replace placeholder unknown_183
2021-01-21 11:29:41,418 - ERROR - Could not find table resource to replace placeholder unknown_185
2021-01-21 11:29:41,418 - ERROR - Could not find table resource to replace placeholder unknown_187
2021-01-21 11:29:41,418 - ERROR - Could not find table resource to replace placeholder unknown_189
2021-01-21 11:29:41,418 - ERROR - Could not find table resource to replace placeholder unknown_193
2021-01-21 11:29:41,418 - ERROR - Could not find table resource to replace placeholder unknown_195
2021-01-21 11:29:41,419 - ERROR - Could not find table resource to replace placeholder unknown_197
......
tensorflow.python.framework.errors_impl.InvalidArgumentError: 'func' argument to TF_GraphCopyFunction cannot be null
Exception ignored in: <bound method CapturableResourceDeleter.__del__ of <tensorflow.python.training.tracking.tracking.CapturableResourceDeleter object at 0x7f70486cbcf8>>
Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/tf2.2-n/lib/python3.6/site-packages/tensorflow/python/training/tracking/tracking.py", line 208, in __del__
    self._destroy_resource()
  File "/usr/local/anaconda3/envs/tf2.2-n/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 797, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/anaconda3/envs/tf2.2-n/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 841, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/usr/local/anaconda3/envs/tf2.2-n/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 695, in _initialize
    *args, **kwds))
  File "/usr/local/anaconda3/envs/tf2.2-n/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2981, in _get_concrete_function_internal_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "/usr/local/anaconda3/envs/tf2.2-n/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 3373, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/usr/local/anaconda3/envs/tf2.2-n/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 3218, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/usr/local/anaconda3/envs/tf2.2-n/lib/python3.6/site-packages/tensorflow/python/framework/func_graph.py", line 998, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/usr/local/anaconda3/envs/tf2.2-n/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 603, in wrapped_fn
    out = weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/usr/local/anaconda3/envs/tf2.2-n/lib/python3.6/site-packages/tensorflow/python/saved_model/function_deserialization.py", line 257, in restored_function_body
    return _call_concrete_function(function, inputs)
  File "/usr/local/anaconda3/envs/tf2.2-n/lib/python3.6/site-packages/tensorflow/python/saved_model/function_deserialization.py", line 75, in _call_concrete_function
    result = function._call_flat(tensor_inputs, function._captured_inputs)  # pylint: disable=protected-access
  File "/usr/local/anaconda3/envs/tf2.2-n/lib/python3.6/site-packages/tensorflow/python/saved_model/load.py", line 116, in _call_flat
    cancellation_manager)
  File "/usr/local/anaconda3/envs/tf2.2-n/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1944, in _call_flat
    flat_outputs = forward_function.call(ctx, args_with_tangents)
  File "/usr/local/anaconda3/envs/tf2.2-n/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 590, in call
    executor_type=executor_type)
  File "/usr/local/anaconda3/envs/tf2.2-n/lib/python3.6/site-packages/tensorflow/python/ops/functional_ops.py", line 1206, in partitioned_call
    f.add_to_graph(graph)
  File "/usr/local/anaconda3/envs/tf2.2-n/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 506, in add_to_graph
    g._add_function(self)
  File "/usr/local/anaconda3/envs/tf2.2-n/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3403, in _add_function
    gradient)

I want to get the ONNX model and am desperate for some advice. Thank you very much!

@TomWildenhain-Microsoft
Contributor

Hi @zhaohb, can you please upload a copy of the saved model?

@zhaohb
Author

zhaohb commented Jan 23, 2021

Hi @TomWildenhain-Microsoft, here is the Colab link: https://colab.research.google.com/drive/1wxu8piPR9qyAC8EjtDd6-STZqek77BO7?usp=sharing
I can also send the model file to you by email. Is that OK?

@TomWildenhain-Microsoft
Contributor

I have requested access to the Colab.

@zhaohb
Author

zhaohb commented Jan 28, 2021

@TomWildenhain-Microsoft, sorry for the late reply. I have added you to the user group.

@TomWildenhain-Microsoft
Contributor

Shoot, it looks like this Colab requires a tar file I don't have. Can you please zip the saved model directory you are trying to convert and upload it to Google Drive?

@zhaohb
Author

zhaohb commented Jan 29, 2021

Here is the model file link: https://drive.google.com/file/d/1OmfoxcalmJMpW3QyOXnFWkUygn58CTZe/view (I have added you to the user group).

@zhaohb
Author

zhaohb commented Jan 29, 2021

@TomWildenhain-Microsoft Were you able to reproduce the problem? I'm waiting for your reply.

@TomWildenhain-Microsoft
Contributor

Taking a look now.

@TomWildenhain-Microsoft
Contributor

The error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: 'func' argument to TF_GraphCopyFunction cannot be null
occurs during resource destruction and can be safely ignored. The bigger problem is the errors of the form:
2021-01-21 11:29:41,413 - ERROR - Could not find table resource to replace placeholder unknown_172

I'm not sure why it isn't finding the tables, but interestingly they are int64 to int64 tables, not the usual string to int64 tables, so currently we won't be able to convert them anyway.

Do you know why this model has int64 -> int64 hash tables? What type of model is this?

@TomWildenhain-Microsoft
Contributor

From the values, it looks like these tables are storing some sort of permutations. Can you change the model to use gather ops instead of tables?

LookupTableExportV2(keys=<tf.Tensor: shape=(31,), dtype=int64, numpy=
array([21,  2, 14, 26,  7, 19,  0, 12, 24,  5, 17, 29, 10, 22,  3, 15, 27,
        8, 20,  1, 13, 25,  6, 18, 30, 23, 11,  4, 28, 16,  9],
      dtype=int64)>, values=<tf.Tensor: shape=(31,), dtype=int64, numpy=
array([28,  4, 11, 29,  0, 10,  8, 15, 23,  1, 18, 26, 13, 20,  2, 17, 27,
        9, 21,  6, 12, 30,  7, 16, 25, 24, 14,  5, 22, 19,  3],
      dtype=int64)>)
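
For concreteness, here is a minimal sketch of the Gather-based replacement (hypothetical code; it relies on the keys above being a dense permutation of 0..30):

import numpy as np
import tensorflow as tf

# Keys/values taken from the LookupTableExportV2 dump above.
keys = np.array([21, 2, 14, 26, 7, 19, 0, 12, 24, 5, 17, 29, 10, 22, 3, 15,
                 27, 8, 20, 1, 13, 25, 6, 18, 30, 23, 11, 4, 28, 16, 9],
                dtype=np.int64)
values = np.array([28, 4, 11, 29, 0, 10, 8, 15, 23, 1, 18, 26, 13, 20, 2, 17,
                   27, 9, 21, 6, 12, 30, 7, 16, 25, 24, 14, 5, 22, 19, 3],
                  dtype=np.int64)

# Build a dense array with table[key] == value, which is valid because the
# keys cover 0..30 exactly once.
table = np.empty_like(values)
table[keys] = values

# The table lookup then becomes a plain Gather, which ONNX supports natively.
queries = tf.constant([0, 7, 30], dtype=tf.int64)
result = tf.gather(tf.constant(table), queries)  # -> [8, 0, 25]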

@zhaohb
Author

zhaohb commented Jan 30, 2021

Thank you for your reply. I also think the most important error is:

 ERROR - Could not find table resource to replace placeholder unknown_172

But I can't change the model now, so should we add a feature to tf2onnx to fix this bug? The table resource probably can't be found because the resource type is used. I think this bug is going to be very common.

@TomWildenhain-Microsoft
Contributor

@MoFHeka Yes, I think that could work, though the hard part is getting the data out of the tables first. I don't know how to read the tables if I don't know their key/value types.

@TomWildenhain-Microsoft
Contributor

Normally we can find the initializer for the table by looking at the imported saved model, but I can't find it in this particular instance. I'm not sure if it is in the Python object, or if TensorFlow has destroyed the data after loading it.

@TomWildenhain-Microsoft
Contributor

I'm able to get the data out of the table if I know the type in advance, which I do in this case. However, if I guess the type incorrectly, TensorFlow aborts.
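
A minimal sketch of that extraction (hypothetical; `table_handle` stands for the table's captured resource tensor, and the int64/int64 dtypes are the guess that has to be right):

import tensorflow as tf

# Export the table via the raw op; Tkeys/Tvalues must match the table's
# actual dtypes, otherwise TensorFlow aborts the process.
keys, values = tf.raw_ops.LookupTableExportV2(
    table_handle=table_handle, Tkeys=tf.int64, Tvalues=tf.int64)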

@TomWildenhain-Microsoft
Contributor

Worst case, we could dig into the protobuf of the saved model directly to find the initializers, but it would be much cleaner if we could get them out of the imported SavedModel Python object.
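
A sketch of that worst-case route, parsing saved_model.pb directly (assuming the table initializers show up as LookupTableImportV2 nodes in the function library, which is how tf.lookup tables are typically restored):

from tensorflow.core.protobuf import saved_model_pb2

sm = saved_model_pb2.SavedModel()
with open("model.savedmodel/saved_model.pb", "rb") as f:
    sm.ParseFromString(f.read())

# Table-import nodes reference the Const nodes holding the keys and values.
for func in sm.meta_graphs[0].graph_def.library.function:
    for node in func.node_def:
        if node.op == "LookupTableImportV2":
            print(func.signature.name, node.name, list(node.input))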

@zhaohb
Author

zhaohb commented Jan 30, 2021

@TomWildenhain-Microsoft https://drive.google.com/file/d/1dNhMOn9h7FtcuLSMXFK9AhBRdU1NitPg/view?usp=sharing
This is a Keras model; we can verify whether we can get the tables from the H5 model.

@TomWildenhain-Microsoft
Contributor

I used a bit of a hacky method, but it should work for this model. I'm not merging it yet, but try converting the model using this branch: #1310

@zhaohb
Author

zhaohb commented Feb 2, 2021

@TomWildenhain-Microsoft OK, thank you very much. I will close this issue.

@TomWildenhain-Microsoft
Contributor

Don't close it just yet. I got a model, but I don't have data to test it. Let me know if you can convert it, and once you do, whether the results are correct and fast enough. I converted the tables by casting the int keys to strings, which might cause a slowdown. If so, I can make a better conversion using Gather ops.

@zhaohb zhaohb closed this as completed Feb 2, 2021
@zhaohb zhaohb reopened this Feb 2, 2021
@zhaohb
Author

zhaohb commented Feb 2, 2021

OK, I have reopened it and will test it as soon as possible.

@zhaohb
Author

zhaohb commented Feb 3, 2021

@TomWildenhain-Microsoft I have tested that branch, and now I can get the ONNX model, which is great. But when I went to implement the Bucketize op (onnxruntime does not implement Bucketize, so it must be provided as a custom op), I found that the dtypes of the Bucketize inputs are abnormal. Some are float32, but others are None:

import onnx_graphsurgeon as gs
import numpy as np
import onnx

graph = gs.import_onnx(onnx.load("fea.onnx"))

# Collect every Bucketize node and print the dtype of its first input.
bucketizes = [node for node in graph.nodes if node.op == "Bucketize"]
for item in bucketizes:
    item_type = item.inputs[0].dtype
    print(item.op, " : ", item_type)

output:

Bucketize  :  float32
Bucketize  :  float32
......
Bucketize  :  float32
Bucketize  :  float32
Bucketize  :  float32
Bucketize  :  float32
Bucketize  :  float32
Bucketize  :  float32
Bucketize  :  float32
Bucketize  :  float32
Bucketize  :  float32
Bucketize  :  float32
Bucketize  :  None
Bucketize  :  float32
Bucketize  :  None
Bucketize  :  None
Bucketize  :  float32
Bucketize  :  None
Bucketize  :  None
Bucketize  :  None
Bucketize  :  float32
Bucketize  :  None
Bucketize  :  float32
Bucketize  :  float32
Bucketize  :  float32
Bucketize  :  float32
Bucketize  :  None
Bucketize  :  None
Bucketize  :  float32
......

@TomWildenhain-Microsoft
Contributor

I think that's just because graph surgeon doesn't know how to do type inference for some of the custom ops in the graph, so it thinks the type is unknown, but really it is float32. The ONNX file only stores types for the input and output tensors; everything else is inferred.
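
You can see the same effect with ONNX's own shape inference, which propagates types until it hits an op it doesn't recognize. A minimal sketch (using the same fea.onnx):

import onnx
from onnx import shape_inference

model = onnx.load("fea.onnx")
# infer_shapes annotates intermediate tensors it can reason about; custom
# ops like Bucketize break the chain, leaving downstream dtypes unknown.
inferred = shape_inference.infer_shapes(model)
print(len(inferred.graph.value_info), "intermediate tensors annotated")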

@TomWildenhain-Microsoft
Contributor

For those Bucketize nodes, are the buckets constant or are they passed in as an input? If they are constant, I might be able to make a conversion for them.

@zhaohb
Author

zhaohb commented Feb 3, 2021

I've solved the Bucketize op problem and it works, but it's very slow, about 10x slower than TensorFlow itself. How would you optimize it? I also found that it cannot be executed in parallel.
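
For reference, TF's Bucketize assigns each element the number of boundaries less than or equal to it, which np.searchsorted reproduces directly. A minimal sketch of the kernel logic (independent of whichever custom-op registration mechanism is actually used):

import numpy as np

def bucketize(x, boundaries):
    # Bucket id = count of boundaries <= x, matching tf.raw_ops.Bucketize.
    return np.searchsorted(boundaries, x, side="right").astype(np.int32)

print(bucketize(np.array([1.0, 5.5, 11.0], dtype=np.float32),
                np.array([0.0, 3.0, 8.0, 11.0])))
# -> [1 2 4]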

@zhaohb
Author

zhaohb commented Feb 3, 2021

I've set up a test environment on Colab: tensorflow-onnx includes the modifications from your branch, onnxruntime has been recompiled, and the custom operators have been implemented in the ort-customops project. You can test the ONNX model in this environment.
The link is as follows:
https://colab.research.google.com/drive/1wxu8piPR9qyAC8EjtDd6-STZqek77BO7?usp=sharing

@zhaohb
Author

zhaohb commented Feb 3, 2021

@TomWildenhain-Microsoft The ONNX model used in the tests was simpler, but it was also converted from a SavedModel based on your branch, and the tests showed that the ONNX model was much slower than the saved model.

@TomWildenhain-Microsoft
Contributor

I'm not too surprised the perf is really bad, since this model seems to use a ton of ops that normally aren't very common and that we haven't optimized for (normally models have 1 or 2 table lookups; this model has hundreds). We can change those table lookups into Gather ops, which should be faster. Also, you can run ORT with profiling turned on so we can see where the slowdown is.
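
A minimal sketch of enabling ORT profiling (hypothetical names; `feed` stands for your input dict, and this ignores the custom-op library registration this particular model also needs):

import onnxruntime as ort

so = ort.SessionOptions()
so.enable_profiling = True  # emit a Chrome-trace JSON when profiling ends
sess = ort.InferenceSession("fea.onnx", so)
outputs = sess.run(None, feed)  # `feed`: input-name -> numpy array
print(sess.end_profiling())  # path of the generated profile JSON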

Out of curiosity, what does this model do and why are you converting it to onnx?

@zhaohb
Author

zhaohb commented Feb 4, 2021

@TomWildenhain-Microsoft I have turned on the profiling option and generated the corresponding JSON files, from which we can see that I/O still accounts for a large proportion of the time. The corresponding model is new_coarse.onnx, which I have shared with you.
JSON link:
https://drive.google.com/file/d/1CBXm6wPXHNRxUhM_OZeKqmGloGpX9_-V/view?usp=sharing
JSON with parallelism enabled:
https://drive.google.com/file/d/1CA-MPsMOzKjK4HfxDlUZlawbkzBlpBtN/view?usp=sharing
ONNX model link:
https://drive.google.com/file/d/1plOQS-aFPukrfeTFw2rBS3-64zU0A8h4/view?usp=sharing

This is a recommendation model; we did this conversion to speed up the model, but right now it's not working well.

@zhaohb
Author

zhaohb commented Feb 7, 2021

You mentioned we can make a better conversion using Gather ops. Will you support this change?

@TomWildenhain-Microsoft
Contributor

If you make the conversion and add tests for it, we will merge it into master and maintain it.

@guschmue
Collaborator

guschmue commented Apr 7, 2021

Assuming this is resolved.

@guschmue guschmue closed this as completed Apr 7, 2021