Add a PyTorch to Relay parser #63
Conversation
Having issues installing torch and torchvision in CI/CD so that the tests can run. Will continue looking into it.
WIP for now; looks like there may be some bugs.
@kevinthesun @zhiics @yongwww Could you guys take a look at this sometime? Also, please ignore the changes to the CI scripts. I had some issues trying to get the CI to pass, but they may have been unrelated (PT is installed here: https://github.com/neo-ai/tvm/blob/dev/docker/install/ubuntu_install_onnx.sh).
LGTM
LGTM
@zhiics Please help take a look.
Thanks for the effort. I have a general question about the handling of dynamic models. I think the current structure for handling ops would have the same or a similar problem as the TF frontend parser when handling dynamic models -- we need to infer value/shape everywhere. I agree we don't see this in most image classification models, but we probably need to be aware of it when extending to models involving dynamic shapes.
@yongwww @kevinthesun what do you think?
Any more feedback? I responded to everything except the question regarding dynamic shapes, which I'm not sure about. Anywhere we need a shape, we tend to use the infer_shape pass. The only thing I'm slightly concerned about is that VGG and AlexNet are kind of flaky in terms of accuracy.
@alexwong Dynamic shape would be a separate issue that we will work on incrementally. I brought it up only because I wanted to make sure the design would not change much when we handle dynamic information in the future. Could you please elaborate a little more on the accuracy problem? What is flaky? And can you please dig a bit more into the details, because I think flaky accuracy on a real model might cause problems for the service in the end.
There have been some issues with VGG and AlexNet failing due to bad accuracy. I can take another look, but it's kind of hard to debug at the moment since it doesn't happen when running locally on CPU and doesn't happen consistently when running on the CI.
How frequently does it happen in the CI? We can check out the Docker image to reproduce it if it never happens on a local machine. Let's discuss offline.
@alexwong It's possible to check out the CI container to reproduce the flakiness on your own EC2 instance. The flaky issue should be fixed if it occurs frequently. I will take a look at this PR again later today.
Okay, trying to reproduce on an EC2 instance with the Docker image. I would say it happens less than half the time (rough guess). It is odd that this only started happening recently, and I don't think there have been too many functional changes in the code. Part of the reason could be pulling in the upstream merge here.
I'm able to reproduce the failures with the ci-gpu container but am having trouble diagnosing the issue. Comparing the initial PT graph and the produced Relay module between passing and failing runs, there is no difference, so my assumption is that the problem is most likely at the operator-implementation level. I will compare the output after each operator/layer between PyTorch and TVM to figure out where it's going wrong. This will require some rework on the PT side to return the output at each layer in the forward function.
I'm still looking into the VGG issue, but @zxy844288792 talked to Vin and he said we can merge this without the fix and track it as a separate issue, since it only occurs with the settings in ci-gpu and doesn't occur in our service. It may be good to temporarily comment out VGG in the test so it doesn't cause other PRs to fail CI occasionally, though. Can we finalize the review and merge this? @yongwww
@alexwong I am okay with commenting out VGG temporarily; I filed issue #80 to track it. I agree with @zhiics about infer_shape for dynamic inputs. Considering we don't have dynamic requirements/customer needs/tests for PT models currently, it is okay with me to skip this part. Overall, LGTM. @alexwong please address all comments.
Thanks for contributing to TVM! Please refer to the guideline https://docs.tvm.ai/contribute/ for useful information and tips. After the pull request is submitted, please request code reviews from Reviewers by @-ing them in the pull request thread.
Support PyTorch natively in TVM by providing a Relay parser. Adapted from #23; this currently only supports traced models (no control flow) and probably only image classification models (the provided test compares against the torchvision implementations).
Like the other frontends, grab the Relay module and parameters to build via:
mod, params = relay.frontend.from_pytorch(trace, input_shapes)
Tested against torchvision models in the included test_forward.py. I will write a discussion post on discuss.tvm.ai to see if we want this upstream.