-
Notifications
You must be signed in to change notification settings - Fork 431
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[NVIDIA] Move collective pipeliner after post-layout opts #12866
Conversation
@frgossen Can you help review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining the motivation. From the description, it is not quite clear why this is needed. Can you expand the motivation, maybe with an example? What fp8 pattern appears in a while loop and how is it transformed vs should be transformed?
Can you add test cases with the pattern variations you care about? I think that could also help understand the motivation behind this change.
Sure. Basically, the original graph looks like
where
Now, with the CP pass, we see that the original graph is transformed into:
where you can see the So, that is why this PR moves the CP after the gemm rewriter. So, it doesn't need to worry about breaking any patterns. Please let me know if the above makes sense to you. At the same time, I am trying to extract the above from our real model training as a unit test to showcase the issue. |
5d8c3bf
to
ffb664a
Compare
@frgossen Any update? We have recently tested this in some real models and we can see the fp8 gemms are back. |
Do you know why moving CP to post layout is not generally good? |
xla/service/gpu/gpu_compiler.cc
Outdated
@@ -1239,6 +1261,8 @@ absl::Status GpuCompiler::OptimizeHloModule( | |||
hlo_module, stream_exec, options, gpu_target_config, | |||
thread_pool.get_mutable())); | |||
|
|||
TF_RETURN_IF_ERROR(RunCollectivePipelinerPasses(hlo_module)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is better to rename and make it clear this is post layout CP: RunPostLayoutCollectivePipelinerPasses
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. PTAL.
I think theoretically it should be fine to move CP to post layout, since the CP is supposed to be agnostic of non-collective ops. The main concern (from @Tixxx ) might be some passes in layout (e.g. layout normalization) might alter the collectives through inserting copies or transposes, for instance. We are not 100% confident that this would pose a problem or not. So, our plan is to keep the flag and set it to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Can you address the build failures? |
ca32297
to
f207a5c
Compare
@frgossen Rebased. PTAL. (Only AMD ROCm is failing but the log implies it is irrelevant to this PR). |
Looks like this runs into a merge conflict. Can you rebase? |
f207a5c
to
6bfc58e
Compare
… unit test Imported from GitHub PR openxla/xla#13920 @kaixih @frgossen This is just a rebased version of openxla/xla#12866 Copybara import of the project: -- dcf940b6f41e3e14b8e999e36055c07bf468f9a1 by shuw <shuw@nvidia.com>: Move collective pipeliner after post-layout opts and add a unit test -- 9a731e9ec6c84f809804776a3fba2d0b11a60294 by shuw <shuw@nvidia.com>: Change to EXPECT_EQ Merging this change closes #13920 FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#13920 from wenscarl:kaixih/move_collective_pipeliner_dev_rebase 9a731e9ec6c84f809804776a3fba2d0b11a60294 PiperOrigin-RevId: 646142320
… unit test Imported from GitHub PR #13920 @kaixih @frgossen This is just a rebased version of #12866 Copybara import of the project: -- dcf940b by shuw <shuw@nvidia.com>: Move collective pipeliner after post-layout opts and add a unit test -- 9a731e9 by shuw <shuw@nvidia.com>: Change to EXPECT_EQ Merging this change closes #13920 COPYBARA_INTEGRATE_REVIEW=#13920 from wenscarl:kaixih/move_collective_pipeliner_dev_rebase 9a731e9 PiperOrigin-RevId: 646428752
… unit test Imported from GitHub PR openxla/xla#13920 @kaixih @frgossen This is just a rebased version of openxla/xla#12866 Copybara import of the project: -- dcf940b6f41e3e14b8e999e36055c07bf468f9a1 by shuw <shuw@nvidia.com>: Move collective pipeliner after post-layout opts and add a unit test -- 9a731e9ec6c84f809804776a3fba2d0b11a60294 by shuw <shuw@nvidia.com>: Change to EXPECT_EQ Merging this change closes #13920 PiperOrigin-RevId: 646428752
Imported from GitHub PR #17453 Moves the collective quantizer pass ahead of the collective pipeliner to preserve FP8 quantization and dequantization patterns preceded or followed by collectives without running the collective pipeliner post layout assignment. See #12866 and #15292. Copybara import of the project: -- d8e8f63 by Philipp Hack <phack@nvidia.com>: Moves the collective quantizer pass before the collective pipeliner. -- b34f203 by Philipp Hack <phack@nvidia.com>: Moves the collective quantizer pass before the collective pipeliner. Merging this change closes #17453 FUTURE_COPYBARA_INTEGRATE_REVIEW=#17453 from philipphack:u_collective_passes_xla b34f203 PiperOrigin-RevId: 681670419
Imported from GitHub PR #17453 Moves the collective quantizer pass ahead of the collective pipeliner to preserve FP8 quantization and dequantization patterns preceded or followed by collectives without running the collective pipeliner post layout assignment. See #12866 and #15292. Copybara import of the project: -- d8e8f63 by Philipp Hack <phack@nvidia.com>: Moves the collective quantizer pass before the collective pipeliner. -- b34f203 by Philipp Hack <phack@nvidia.com>: Moves the collective quantizer pass before the collective pipeliner. Merging this change closes #17453 FUTURE_COPYBARA_INTEGRATE_REVIEW=#17453 from philipphack:u_collective_passes_xla b34f203 PiperOrigin-RevId: 681670419
Imported from GitHub PR #17453 Moves the collective quantizer pass ahead of the collective pipeliner to preserve FP8 quantization and dequantization patterns preceded or followed by collectives without running the collective pipeliner post layout assignment. See #12866 and #15292. Copybara import of the project: -- d8e8f63 by Philipp Hack <phack@nvidia.com>: Moves the collective quantizer pass before the collective pipeliner. -- b34f203 by Philipp Hack <phack@nvidia.com>: Moves the collective quantizer pass before the collective pipeliner. Merging this change closes #17453 FUTURE_COPYBARA_INTEGRATE_REVIEW=#17453 from philipphack:u_collective_passes_xla b34f203 PiperOrigin-RevId: 681670419
Imported from GitHub PR #17453 Moves the collective quantizer pass ahead of the collective pipeliner to preserve FP8 quantization and dequantization patterns preceded or followed by collectives without running the collective pipeliner post layout assignment. See #12866 and #15292. Copybara import of the project: -- d8e8f63 by Philipp Hack <phack@nvidia.com>: Moves the collective quantizer pass before the collective pipeliner. -- b34f203 by Philipp Hack <phack@nvidia.com>: Moves the collective quantizer pass before the collective pipeliner. Merging this change closes #17453 FUTURE_COPYBARA_INTEGRATE_REVIEW=#17453 from philipphack:u_collective_passes_xla b34f203 PiperOrigin-RevId: 681670419
Imported from GitHub PR #17453 Moves the collective quantizer pass ahead of the collective pipeliner to preserve FP8 quantization and dequantization patterns preceded or followed by collectives without running the collective pipeliner post layout assignment. See #12866 and #15292. Copybara import of the project: -- d8e8f63 by Philipp Hack <phack@nvidia.com>: Moves the collective quantizer pass before the collective pipeliner. -- b34f203 by Philipp Hack <phack@nvidia.com>: Moves the collective quantizer pass before the collective pipeliner. Merging this change closes #17453 FUTURE_COPYBARA_INTEGRATE_REVIEW=#17453 from philipphack:u_collective_passes_xla b34f203 PiperOrigin-RevId: 681670419
Imported from GitHub PR #17453 Moves the collective quantizer pass ahead of the collective pipeliner to preserve FP8 quantization and dequantization patterns preceded or followed by collectives without running the collective pipeliner post layout assignment. See #12866 and #15292. Copybara import of the project: -- d8e8f63 by Philipp Hack <phack@nvidia.com>: Moves the collective quantizer pass before the collective pipeliner. -- b34f203 by Philipp Hack <phack@nvidia.com>: Moves the collective quantizer pass before the collective pipeliner. Merging this change closes #17453 FUTURE_COPYBARA_INTEGRATE_REVIEW=#17453 from philipphack:u_collective_passes_xla b34f203 PiperOrigin-RevId: 681670419
Imported from GitHub PR #17453 Moves the collective quantizer pass ahead of the collective pipeliner to preserve FP8 quantization and dequantization patterns preceded or followed by collectives without running the collective pipeliner post layout assignment. See #12866 and #15292. Copybara import of the project: -- d8e8f63 by Philipp Hack <phack@nvidia.com>: Moves the collective quantizer pass before the collective pipeliner. -- b34f203 by Philipp Hack <phack@nvidia.com>: Moves the collective quantizer pass before the collective pipeliner. Merging this change closes #17453 FUTURE_COPYBARA_INTEGRATE_REVIEW=#17453 from philipphack:u_collective_passes_xla b34f203 PiperOrigin-RevId: 681670419
Imported from GitHub PR #17453 Moves the collective quantizer pass ahead of the collective pipeliner to preserve FP8 quantization and dequantization patterns preceded or followed by collectives without running the collective pipeliner post layout assignment. See #12866 and #15292. Copybara import of the project: -- d8e8f63 by Philipp Hack <phack@nvidia.com>: Moves the collective quantizer pass before the collective pipeliner. -- b34f203 by Philipp Hack <phack@nvidia.com>: Moves the collective quantizer pass before the collective pipeliner. Merging this change closes #17453 COPYBARA_INTEGRATE_REVIEW=#17453 from philipphack:u_collective_passes_xla b34f203 PiperOrigin-RevId: 682489977
Imported from GitHub PR #17950 Removes the functionality to optionally run the collective pipeliner pass post layout assignment. This option was introduced in #12866 to preserve FP8 quantization and dequantization patterns and was obsoleted by the reordering of the collective optimization passes in #17453. Copybara import of the project: -- e9e9c1a by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. -- c5489ac by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. Merging this change closes #17950 FUTURE_COPYBARA_INTEGRATE_REVIEW=#17950 from philipphack:u_collective_flag_xla c5489ac PiperOrigin-RevId: 684256309
Imported from GitHub PR #17950 Removes the functionality to optionally run the collective pipeliner pass post layout assignment. This option was introduced in #12866 to preserve FP8 quantization and dequantization patterns and was obsoleted by the reordering of the collective optimization passes in #17453. Copybara import of the project: -- e9e9c1a by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. -- c5489ac by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. Merging this change closes #17950 FUTURE_COPYBARA_INTEGRATE_REVIEW=#17950 from philipphack:u_collective_flag_xla c5489ac PiperOrigin-RevId: 684256309
Imported from GitHub PR #17950 Removes the functionality to optionally run the collective pipeliner pass post layout assignment. This option was introduced in #12866 to preserve FP8 quantization and dequantization patterns and was obsoleted by the reordering of the collective optimization passes in #17453. Copybara import of the project: -- e9e9c1a by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. -- c5489ac by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. Merging this change closes #17950 FUTURE_COPYBARA_INTEGRATE_REVIEW=#17950 from philipphack:u_collective_flag_xla c5489ac PiperOrigin-RevId: 684256309
Imported from GitHub PR #17950 Removes the functionality to optionally run the collective pipeliner pass post layout assignment. This option was introduced in #12866 to preserve FP8 quantization and dequantization patterns and was obsoleted by the reordering of the collective optimization passes in #17453. Copybara import of the project: -- e9e9c1a by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. -- c5489ac by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. Merging this change closes #17950 FUTURE_COPYBARA_INTEGRATE_REVIEW=#17950 from philipphack:u_collective_flag_xla c5489ac PiperOrigin-RevId: 684256309
Imported from GitHub PR #17950 Removes the functionality to optionally run the collective pipeliner pass post layout assignment. This option was introduced in #12866 to preserve FP8 quantization and dequantization patterns and was obsoleted by the reordering of the collective optimization passes in #17453. Copybara import of the project: -- e9e9c1a by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. -- c5489ac by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. Merging this change closes #17950 FUTURE_COPYBARA_INTEGRATE_REVIEW=#17950 from philipphack:u_collective_flag_xla c5489ac PiperOrigin-RevId: 684256309
Imported from GitHub PR #17950 Removes the functionality to optionally run the collective pipeliner pass post layout assignment. This option was introduced in #12866 to preserve FP8 quantization and dequantization patterns and was obsoleted by the reordering of the collective optimization passes in #17453. Copybara import of the project: -- e9e9c1a by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. -- c5489ac by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. Merging this change closes #17950 FUTURE_COPYBARA_INTEGRATE_REVIEW=#17950 from philipphack:u_collective_flag_xla c5489ac PiperOrigin-RevId: 684256309
Imported from GitHub PR #17950 Removes the functionality to optionally run the collective pipeliner pass post layout assignment. This option was introduced in #12866 to preserve FP8 quantization and dequantization patterns and was obsoleted by the reordering of the collective optimization passes in #17453. Copybara import of the project: -- e9e9c1a by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. -- c5489ac by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. Merging this change closes #17950 FUTURE_COPYBARA_INTEGRATE_REVIEW=#17950 from philipphack:u_collective_flag_xla c5489ac PiperOrigin-RevId: 684256309
Imported from GitHub PR #17950 Removes the functionality to optionally run the collective pipeliner pass post layout assignment. This option was introduced in #12866 to preserve FP8 quantization and dequantization patterns and was obsoleted by the reordering of the collective optimization passes in #17453. Copybara import of the project: -- e9e9c1a by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. -- c5489ac by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. Merging this change closes #17950 FUTURE_COPYBARA_INTEGRATE_REVIEW=#17950 from philipphack:u_collective_flag_xla c5489ac PiperOrigin-RevId: 684256309
Imported from GitHub PR #17950 Removes the functionality to optionally run the collective pipeliner pass post layout assignment. This option was introduced in #12866 to preserve FP8 quantization and dequantization patterns and was obsoleted by the reordering of the collective optimization passes in #17453. Copybara import of the project: -- e9e9c1a by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. -- c5489ac by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. Merging this change closes #17950 FUTURE_COPYBARA_INTEGRATE_REVIEW=#17950 from philipphack:u_collective_flag_xla c5489ac PiperOrigin-RevId: 684256309
Imported from GitHub PR #17950 Removes the functionality to optionally run the collective pipeliner pass post layout assignment. This option was introduced in #12866 to preserve FP8 quantization and dequantization patterns and was obsoleted by the reordering of the collective optimization passes in #17453. Copybara import of the project: -- e9e9c1a by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. -- c5489ac by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. Merging this change closes #17950 FUTURE_COPYBARA_INTEGRATE_REVIEW=#17950 from philipphack:u_collective_flag_xla c5489ac PiperOrigin-RevId: 684256309
Imported from GitHub PR #17950 Removes the functionality to optionally run the collective pipeliner pass post layout assignment. This option was introduced in #12866 to preserve FP8 quantization and dequantization patterns and was obsoleted by the reordering of the collective optimization passes in #17453. Copybara import of the project: -- e9e9c1a by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. -- c5489ac by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. Merging this change closes #17950 FUTURE_COPYBARA_INTEGRATE_REVIEW=#17950 from philipphack:u_collective_flag_xla c5489ac PiperOrigin-RevId: 684256309
Imported from GitHub PR #17950 Removes the functionality to optionally run the collective pipeliner pass post layout assignment. This option was introduced in #12866 to preserve FP8 quantization and dequantization patterns and was obsoleted by the reordering of the collective optimization passes in #17453. Copybara import of the project: -- e9e9c1a by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. -- c5489ac by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. Merging this change closes #17950 FUTURE_COPYBARA_INTEGRATE_REVIEW=#17950 from philipphack:u_collective_flag_xla c5489ac PiperOrigin-RevId: 684256309
Imported from GitHub PR #17950 Removes the functionality to optionally run the collective pipeliner pass post layout assignment. This option was introduced in #12866 to preserve FP8 quantization and dequantization patterns and was obsoleted by the reordering of the collective optimization passes in #17453. Copybara import of the project: -- e9e9c1a by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. -- c5489ac by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. Merging this change closes #17950 FUTURE_COPYBARA_INTEGRATE_REVIEW=#17950 from philipphack:u_collective_flag_xla c5489ac PiperOrigin-RevId: 684256309
Imported from GitHub PR #17950 Removes the functionality to optionally run the collective pipeliner pass post layout assignment. This option was introduced in #12866 to preserve FP8 quantization and dequantization patterns and was obsoleted by the reordering of the collective optimization passes in #17453. Copybara import of the project: -- e9e9c1a by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. -- c5489ac by Philipp Hack <phack@nvidia.com>: Removes the option to run the collective pipeliner post layout assignment. Merging this change closes #17950 COPYBARA_INTEGRATE_REVIEW=#17950 from philipphack:u_collective_flag_xla c5489ac PiperOrigin-RevId: 684614799
To perform the fp8 dot, we insert quant and dequant ops before the dot and rely on the gemm-rewriter to pattern match and rewrite to cublaslt calls for GPUs.
When this fp8 pattern appears in while loops that is qualified for the collective pipeliner (CP), the CP might disrupts the pattern by accidentally peeling the quant or dequant ops off the loop.
So, this PR moves the CP after the gemm-rewriter (when the newly-introduced
xla_gpu_collective_pipeliner_post_layout_optimization
is set), since this can save us from redesigning the CP to support this special case.Note, for the long-term solution, we should move the collectives to the "safer" places before applying the CP. For example, in the above use case, we can move the collectives in-between the quant and dequant and then the CP should be able to preserve the dequant->dot pattern. This is actually already supported as an integral part in the gemm-rewriter.
cc. @Tixxx @hx89 @nluehr @philipphack