
nvFuser compiling with -O0 option #1244

Closed
csarofeen opened this issue May 1, 2022 · 8 comments

csarofeen commented May 1, 2022

Describe the bug
I don't know if it's enabled more generally, but when running benchmarks I realized that
https://github.com/rwightman/pytorch-image-models/blame/master/timm/utils/jit.py#L38-L39
was active, which disables fused multiply-add and adds -O0 to the nvrtc compiler arguments.
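For context, here is a rough sketch of the kind of environment-variable toggling those lines do. The variable names below are assumptions on my part (only PYTORCH_NVFUSER_DISABLE_FALLBACK appears verbatim in this issue), so treat the linked jit.py lines as authoritative:

```python
import os

# Hypothetical approximation of the flags in question. The exact names used in
# timm/utils/jit.py may differ; see the linked lines for the real ones.
os.environ['PYTORCH_NVFUSER_DISABLE_FMA'] = '1'    # assumed name: disable fused multiply-add in generated kernels
os.environ['PYTORCH_NVFUSER_JIT_OPT_LEVEL'] = '0'  # assumed name: pass -O0 to nvrtc when compiling fusions

import torch  # set env vars before nvFuser compiles any kernels
```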

On V100 when running:

 TIMM_BENCHMARK_NVFUSER_SKIP_NODE_KINDS='aten::_batch_norm_impl_index;aten::_batch_norm_impl_index_backward' PYTORCH_NVFUSER_DISABLE_FALLBACK=1 python benchmark.py --bench train --model vit_base_patch16_224 --img-size 224 -b 64 --amp --torchscript

I picked this network because it was the worst-performing of all the ones we benchmarked. I disabled BatchNorm because our BN backward fusions still have perf gaps.

Anyway, with the lines above enabled I was getting:

  • nvFuser: 331 ms/iteration

which is a terrible regression relative to eager mode or the default fuser. With those lines commented out I'm getting:

  • nvFuser: 203 ms/iteration

For reference:

  • Default fuser: 223 ms/iteration
  • Eager mode: 176 ms/iteration

I will continue tracking down nvFuser perf issues, but this puts it in a much closer position than with these flags on.
@jjsjann123 why were these lines added?

csarofeen added the bug label on May 1, 2022
csarofeen changed the title from "nvFuser operating with -O0 optimization" to "nvFuser compiling with -O0 option" on May 1, 2022
rwightman (Collaborator) commented

@csarofeen it's currently enabled whenever nvFuser is enabled, but I'm not usually enabling nvFuser with TorchScript right now due to the issues I've run into.

Those env var flags ended up in their current state during my last round of testing a few months ago, where they improved a specific use case I was interested in (not a ViT, but a ResNet/RegNet-like arch with a custom norm layer), but I did not test/validate widely.

I have plans to revisit and test again, but I wasn't sure what a good point in the torch release cycle would be... have there been any large step changes in nvFuser performance and reliability on recent nightlies? Are any expected for 1.12?

csarofeen (Author) commented

We're targeting 1.12 to switch the default TorchScript fuser to nvFuser, so the answer is yes, big differences in nvFuser reliability are to be expected for 1.12. Correctness and performance across TIMM are an absolute must for us if we want to do that, so we're triaging all networks with the primary requirement of not regressing compared to eager mode. It seems the current default doesn't meet that requirement. The secondary requirement is to get consistent gains across TIMM.
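For anyone who wants to opt in before the default changes, here is a minimal sketch of selecting nvFuser for TorchScript today via the torch.jit.fuser context manager ("fuser2" selects nvFuser; the torchvision model below is just a stand-in):

```python
import torch
import torchvision.models as models  # stand-in; any scriptable model (e.g. from timm) works

model = models.resnet50().cuda().eval()
scripted = torch.jit.script(model)
example = torch.randn(8, 3, 224, 224, device='cuda')

# "fuser2" routes TorchScript fusion groups to nvFuser ("fuser1" is the default NNC fuser).
with torch.jit.fuser("fuser2"), torch.no_grad():
    for _ in range(3):      # warm-up passes let the profiling executor specialize and fuse
        scripted(example)
    out = scripted(example)
```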

We're working aggressively to close the gap with our heuristics for TIMM more generally, which is why you're seeing increased activity from myself, @jjsjann123 and @xwang233.

rwightman (Collaborator) commented

@csarofeen that sounds promising, would you recommend any specific PT nightly wheels or recent PT NGC releases for me to do some testing with changed defaults?

csarofeen (Author) commented

I'll let you know when there's something good to repro; I can send you our plots on A100 and V100 and point you to a place to try it yourself. It's promising, but there's always a lot of work to do. I'm aiming for the end of this week or next to have a culmination of improvements over what you've seen.

csarofeen commented May 18, 2022

@rwightman Hey, sorry for the delayed follow-up, we're still sorting through performance. Good news and bad news: there's a lot of potential to speed up the repository with nvFuser, but there are some caveats today and improvements we're going to target over the next 3 months. I'm going to give you some rough numbers as a basis for further conversation. Let me know if you want to hop on a VC and talk in more detail.

  1. I'm going to show you some rough plots. The y-axis is iteration throughput of nvFuser divided by iteration throughput of eager mode alone (there's a caveat to this when I get to the inference numbers, which have CUDA Graphs enabled on both sides). Above 1.0 means nvFuser is doing good stuff; below means we still have some distance to go in adding optimizations to our kernels. Three known areas we have to improve: (1) channels-last BatchNorm (this is an area where cuDNN is highly optimized, and we have some optimizations that are almost done being piped through to improve our performance); (2) transpose support, which we've mostly sorted out but which still needs some automation work so we don't reproduce an issue like the one you hit with View; (3) today we don't recognize tensors with stride == 0 as broadcast, so we don't optimize them the way we optimize tensors with size == 1. I don't have a good idea of when we will support this, but at some point we should.

  2. TorchScript is not going to be a compelling option for you. TorchScript has some limitations that you're just not going to be happy with. There is another option today called AOT Autograd (CC @Chillee) inside functorch, and I think it's reasonable to start early adoption of it. It's hard for me to know exactly which interface makes the most sense for you to interact with longer term, but starting to play around with AOT Autograd + nvFuser is worthwhile (see the sketch after this list). In other words, I think you can confidently abandon TorchScript. AMP flat out does not work in TorchScript and we're not planning to make it work; it can work with AOT Autograd, which simply traces the autocasting done in eager mode.

  3. I believe we've got pretty solid channels-first support, which might not be what you really want; channels-last has serious regressions. I'm confident we can get channels-last into much better shape over the next 3 months. This is not a final state, but hopefully a nice demonstration of progress.

  4. Performance numbers. nvFuser is more compelling on newer and faster hardware. It should provide gains on any hardware, and if it doesn't we will definitely consider that a bug. However, faster GPUs are harder to generate fast kernels for, and this is where nvFuser shines the most. When eager mode lacks optimizations that nvFuser kernels have, eager mode looks relatively better on slower GPUs, where those optimizations matter less, and worse on faster GPUs, where they start to matter much more.

  5. AOT Autograd does not support dynamic shapes today, and neither does CUDA Graphs, which could really help in inference; but I didn't see any dynamic-shape workloads, so I'm assuming that's not an issue for you.

  6. Compilation time is significant! In some of these networks nvFuser compiles 100+ kernels, and each kernel can take over 250 ms to compile, so overheads are very significant today. This is not the greatest user experience, but we're not sure yet what to do about it.
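To make item 2 above concrete, here is a minimal sketch of wrapping a model with AOT Autograd via functorch's memory_efficient_fusion, which hands fusible regions to nvFuser. The API was still evolving at the time, so treat this as illustrative rather than a stable recipe (the model here is a placeholder for a timm model):

```python
import torch
from functorch.compile import memory_efficient_fusion  # functorch ships alongside recent PyTorch builds

# Placeholder model; in practice this would be e.g. timm.create_model('vit_base_patch16_224')
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()

# AOT Autograd traces the forward and backward and sends fusible regions to nvFuser.
fused_model = memory_efficient_fusion(model)

x = torch.randn(64, 1024, device='cuda', requires_grad=True)
with torch.cuda.amp.autocast():  # autocast gets traced, unlike AMP under TorchScript
    loss = fused_model(x).sum()
loss.backward()
```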

Sorry for the long disclaimer, let me give you some results:

Training results are at batch size 128. This can be far from the max batch size possible for each network on a 16GB card, but we need a consistent batch size for an easy comparison of perf. In the plots, red x's are AMP NCHW, green dots are NHWC (yeah, they're not great yet), and the blue line is fp32 NCHW.

V100:
[plot: timm_nvfuser_vs_eager_dgx1v-32g]

A100:
[plot: timm_nvfuser_vs_eager_dgxa100-40g]

As mentioned, a much faster card is harder to generate fast kernels for (V100 is rated at 900 GB/s of memory throughput and A100 at 1.6 TB/s), which is why you see a significant delta in perf between the two. nvFuser is consistently generating fast kernels in channels-first on both, but eager mode struggles to compete on the faster GPU because nvFuser consistently goes after really aggressive optimization strategies.

Inference is an interesting story and I'm still trying to sort out some of the details. For inference, if you're interested, I 100% recommend CUDA Graphs, because they give a large boost in performance at really small batch sizes, though the gains do start trailing off at around batch size 16 (depending on which card you're using). CUDA Graphs should be able to deliver somewhere between 1-5x, depending on how small your batch size is, by removing all of the launch latency. On top of those gains, nvFuser seems to be giving us perf gains on channels-last but regressions on channels-first. I did recently optimize our outer reductions (used heavily for normalizations in channels-last) to perform better at smaller sizes, but this is still very unexpected and we're looking into it. So take these results with a grain of salt.
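For reference, a minimal sketch of the standard CUDA Graphs capture/replay pattern for inference in PyTorch (the model and shapes are placeholders; the real benchmarking setup would differ):

```python
import torch

model = torch.nn.Sequential(torch.nn.Conv2d(3, 64, 3), torch.nn.ReLU()).cuda().eval()
static_input = torch.randn(1, 3, 224, 224, device='cuda')

# Warm up on a side stream so one-time initialization isn't baked into the capture.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side)

# Capture a single forward pass; replaying it later skips per-kernel launch latency.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# To run on new data: copy into the static input buffer in place, then replay.
static_input.copy_(torch.randn(1, 3, 224, 224, device='cuda'))
graph.replay()
result = static_output.clone()
```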

For these plots we're running eager mode with CUDA Graphs and AOT Autograd with CUDA Graphs, which removes all latency concerns and focuses on the quality of the kernels. We only ran these results on A100 40GB, but we could run them on other cards:

Batch size 1:
[plot: timm_cudagraphs_infer_nvfuser_vs_eager_bs1_dgxa100-40g]

Batch size 4:
[plot: timm_cudagraphs_infer_nvfuser_vs_eager_bs4_dgxa100-40g]

Batch size 8:
[plot: timm_cudagraphs_infer_nvfuser_vs_eager_bs8_dgxa100-40g]

Batch size 16:
[plot: timm_cudagraphs_infer_nvfuser_vs_eager_bs16_dgxa100-40g]

If any of this sounds of interest to you, we're happy to help contribute AOT Autograd and CUDA Graphs support to your benchmarking script (CC @xwang233 and @kevinstephano). We're waiting to see whether pytorch/pytorch#77471 makes it into 1.12. If it doesn't, we'll continue to aggressively push improvements upstream, and our goal is to get these numbers consistently above 1x. I'm excited about the potential and the progress we've made; TIMM is a great proving ground and we'll continue to go after it aggressively.

rwightman (Collaborator) commented

@csarofeen I haven't had a chance to fully digest this; I was on a road trip last week. I'll respond this week, but thanks for the extensive data!

rwightman (Collaborator) commented

In any case, I'd definitely be interested in having flags to enable AOT Autograd or CUDA Graphs in the benchmark script (it doesn't matter what makes it into an official PT release, since they'd be optional). @csarofeen @xwang233 @kevinstephano

csarofeen (Author) commented

Closing as the original issue is no longer relevant. Some information is still being disseminated here, but I think it's okay to close now.
