nvFuser compiling with -O0 option #1244
Comments
@csarofeen it's currently enabled when nvfuser is enabled, but I'm not usually enabling nvfuser right now with torchscript use due to the issues I've run into. Those env var flags ended up in their current state in my last round of testing a few months ago, when they were improving a specific use case I was interested in (not a ViT, but a ResNet/RegNet-like arch w/ custom norm layer), but I did not test/validate widely. I have plans to revisit and test again but wasn't sure when a good point would be in the torch release cycle... have there been any large step changes in nvfuser performance and reliability on recent nightlies? Expected for 1.12?
We're targeting 1.12 to switch the default fuser in TorchScript to nvFuser, so the answer is yes, big differences in nvFuser reliability are to be expected for 1.12. Correctness and performance across TIMM are an absolute must for us if we want to do that, so we're triaging all networks with the primary requirement being not to regress compared to eager mode. Seems the current default doesn't meet that requirement. The secondary requirement is to get some consistent gains across TIMM. We're working aggressively to close the gap with our heuristics for TIMM more generally, which is why you're seeing increased activity from myself, @jjsjann123 and @xwang233.
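For anyone who wants to test ahead of the default switch: nvFuser can already be toggled on for TorchScript. A minimal sketch, assuming the private (unstable) `torch._C` hooks from the 1.10–1.12 era; the tiny model here is purely illustrative:

```python
import torch
import torch.nn as nn

# Private hooks from the 1.10-1.12 era; unstable and subject to change.
torch._C._jit_set_nvfuser_enabled(True)       # route TorchScript fusions to nvFuser
torch._C._jit_set_texpr_fuser_enabled(False)  # turn off the default NNC fuser

model = nn.Sequential(nn.Linear(64, 64), nn.GELU()).cuda().eval()
scripted = torch.jit.script(model)
x = torch.randn(8, 64, device='cuda')
with torch.no_grad():
    for _ in range(3):  # warm-up passes let the profiling executor fuse
        scripted(x)
```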
@csarofeen that sounds promising, would you recommend any specific PT nightly wheels or recent PT NGC releases for me to do some testing with the changed defaults?
I'll let you know when there's something good to repro; I can send you our plots on A100 and V100, and can point you to a place to try it yourself. Promising, but there's always a lot of work to do. I'm aiming for the end of this week or next to have a culmination of improvements over what you've seen.
@rwightman Hey, sorry for the delayed follow-up, we're still sorting through performance. Good news and bad news: there's a lot of potential to speed up the repository with nvFuser, but there are some caveats today and improvements we're going to be targeting over the next 3 months. I'm going to give you some rough numbers as a basis for further conversation. Let me know if you want to hang out over VC and talk in more detail.
Sorry for the long disclaimer, let me give you some results.

Training results are at batch size 128. This can be far from the max batch size possible for each network on a 16GB card, but we need a consistent batch size for easy comparison of perf. In the plots, red x's are AMP NCHW, green dots are NHWC (yeah, they're not great yet), and the blue line is fp32 NCHW. As mentioned, a much faster card is a lot harder to generate faster kernels for (the V100 is rated at 900GB/s of memory throughput and the A100 at 1.6TB/s), which is why you see a significant delta in perf between the two. nvFuser is consistently generating fast kernels in channels first on both, and eager mode struggles to compete on the faster GPU because nvFuser is consistently going after really aggressive optimization strategies.

Inference is an interesting story and I'm still trying to sort out some of the details. For inference, I 100% recommend CUDA Graphs, because they will give you a large boost in performance at really small batch sizes by removing launch latency; something between 1x and 5x depending on how small your batch size is, though the gains do start trailing off at around batch size 16 (depending on which card you're using). On top of these gains, nvFuser seems to be giving us perf gains on channels last but regressions on channels first. I did recently optimize our outer reductions (used heavily for normalizations in channels last) to be better on smaller sizes, but this is still very unexpected and we're looking into it, so take these results with a grain of salt. For these plots we're running eager mode with CUDA Graphs and AOT Autograd with CUDA Graphs, which removes all latency concerns and focuses on the quality of the kernels. These results we only ran on A100 40GB, but we could run on other cards:

If any of this sounds of interest to you, we're happy to help contribute AOT Autograd and CUDA Graphs into your benchmarking script (CC @xwang233 and @kevinstephano). We're waiting to see if pytorch/pytorch#77471 gets into 1.12 to know if it will be in there. If not, we'll continue to aggressively push improvements upstream; our goal is to get these numbers consistently above 1x! I'm excited about the potential and the progress we've made, and I think TIMM is a great proving ground, so we'll continue to go after it aggressively.
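For anyone unfamiliar, here is a minimal sketch of the CUDA Graphs capture/replay pattern for inference, using the public `torch.cuda.graph` API (available since PyTorch 1.10); the model and shapes are illustrative, not the networks we benchmarked:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU()).cuda().eval()
static_in = torch.randn(16, 256, device='cuda')  # small batches benefit most

# Warm up on a side stream before capture, as the docs require.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_out = model(static_in)

# Replay: copy new data into the captured input buffer, then relaunch the
# whole graph with a single call, skipping per-kernel launch overhead.
static_in.copy_(torch.randn(16, 256, device='cuda'))
g.replay()  # static_out now holds the result for the new input
```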
@csarofeen I haven't had a chance to fully digest this; I was on a road trip last week. I'll respond this week, but thanks for the extensive data!
In any case, I'd definitely be interested in having flags to enable AOT Autograd or CUDA Graphs in the benchmark script (doesn't matter what makes it into an official PT release, as they'd be optional); see the sketch below. @csarofeen @xwang233 @kevinstephano
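Something as simple as opt-in switches would do; a hypothetical sketch (flag names are made up for illustration, not timm's actual CLI):

```python
import argparse

parser = argparse.ArgumentParser(description='benchmark options (sketch)')
# Hypothetical flag names, for illustration only.
parser.add_argument('--cuda-graphs', action='store_true',
                    help='capture/replay the model with CUDA Graphs')
parser.add_argument('--aot-autograd', action='store_true',
                    help='wrap the model with functorch AOT Autograd first')
args = parser.parse_args()
```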
Closing as the original issue is no longer relevant. There's still some information being disseminated here, but I think it's okay to close now.
Describe the bug
I don't know if it's more generally enabled, but when running benchmarks I realized that
https://github.com/rwightman/pytorch-image-models/blame/master/timm/utils/jit.py#L38-L39
was on, which disables fused multiply-add and adds `-O0` to the compiler arguments of nvrtc.

On V100, when running:
I picked this network because it was the worst-performing of all the ones we benchmarked. I was disabling BatchNorm because our BN backwards fusions still have perf gaps.
Anyways... with those lines above I was getting:
which was a terrible regression from eager mode or the default fuser. Commenting those lines out, I'm getting:
For reference:
I will continue tracking down nvFuser perf issues, but this puts it in a much closer position than with these flags on.
@jjsjann123 why were these lines added?
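For readers without the repo open: the referenced lines are environment-variable toggles along these lines. This is a sketch reconstructed from the description above, using nvFuser's known env-var knobs; treat it as an assumption, not a verbatim copy of the timm source.

```python
import os

# Assumed reconstruction of timm/utils/jit.py#L38-L39, not verbatim:
os.environ['PYTORCH_NVFUSER_DISABLE_FMA'] = '1'    # disable fused multiply-add
os.environ['PYTORCH_NVFUSER_JIT_OPT_LEVEL'] = '0'  # pass -O0 to nvrtc compilation
```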