Support lowering XLA clamp instruction to cuDNN. #13020

elfiegg · 2024-05-23T19:58:13Z

Support lowering XLA clamp instruction to cuDNN.
cc @sergachev
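
For readers unfamiliar with the operation: XLA documents `clamp(min, operand, max)` as `min(max(operand, min), max)`, so it can be expressed with two pointwise operations. The sketch below illustrates that decomposition in plain Python. It is not the code from this PR (the diff is not shown here), and the function name `clamp_via_max_min` is hypothetical; whether the cuDNN lowering is structured this way is an assumption.

```python
def clamp_via_max_min(lo, operand, hi):
    """Element-wise clamp expressed as a max followed by a min.

    Illustrative only: mirrors XLA's documented clamp semantics,
    min(max(operand, lo), hi), with scalar bounds.
    """
    # First pointwise op: lift values that fall below the lower bound.
    lifted = [max(v, lo) for v in operand]
    # Second pointwise op: cap values that exceed the upper bound.
    return [min(v, hi) for v in lifted]


values = [-3.0, 0.5, 7.0]
clamped = clamp_via_max_min(-1.0, values, 1.0)
```

Values already inside the bounds pass through unchanged; only out-of-range elements are replaced by the nearer bound.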

@sergachev

Imported from GitHub PR openxla/xla#13020

Support lowering XLA clamp instruction to cuDNN.
cc @sergachev
Copybara import of the project:

--
47dc71f2a0d5887461a0b7d985328442e0e8da2f by Elfie Guo <elfieg@nvidia.com>:

Support lowering clamp instruction to cuDNN.

Merging this change closes #13020

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#13020 from elfiegg:fp8_triton 47dc71f2a0d5887461a0b7d985328442e0e8da2f
PiperOrigin-RevId: 636814740

Overall, the idea is to collect profile data for each module a given number of times (which can be configured), then recompile the module with the aggregated profile data.

1. We need to track how many times each module was profiled and collect the profiling results. For this I added a ProfileSessionRunner class in profile.py. The class tracks how many times an instance of it was called to profile a session and can also aggregate profile results.
2. We need to associate a profiling session with the module at the interpreter level. To do this I added a dictionary to pjit.py that associates a Jaxpr with a profile session runner.
3. The profile session runner should be passed to pxla.py and then called.
4. We need to deal correctly with the fast path at the interpreter level, so that JAX won't use HLO directly if PGLE data needs to be collected, but also won't recompile the module only for PGLE. See the changes in pjit.py and in lru_cache.h.
5. Once FDO is collected, we need to share it between hosts to keep compilation deterministic.

FUTURE_COPYBARA_INTEGRATE_REVIEW=#13020 from elfiegg:fp8_triton 47dc71f
PiperOrigin-RevId: 617824989
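
The first step in the commit message above (counting profiling runs per module and aggregating the results) could be sketched roughly as follows. This is a minimal illustration, not the class from profile.py: only the name `ProfileSessionRunner` comes from the commit message, while every method, the `max_runs` parameter, the per-instruction latency dictionary, and the choice of averaging as the aggregation are assumptions made for the example.

```python
from collections import defaultdict


class ProfileSessionRunner:
    """Hypothetical sketch: counts profiling runs and aggregates results."""

    def __init__(self, max_runs):
        self.max_runs = max_runs  # assumed configurable number of profiling runs
        self.runs = 0
        self._totals = defaultdict(float)

    def should_profile(self):
        # Keep profiling until the configured number of runs is reached.
        return self.runs < self.max_runs

    def record(self, profile):
        """Add one run's data; `profile` maps an instruction name to a latency."""
        if not self.should_profile():
            return
        self.runs += 1
        for name, latency in profile.items():
            self._totals[name] += latency

    def aggregate(self):
        # Averaging is an assumption; the real aggregation may differ.
        if self.runs == 0:
            return {}
        return {name: total / self.runs for name, total in self._totals.items()}


runner = ProfileSessionRunner(max_runs=2)
runner.record({"dot": 10.0})
runner.record({"dot": 20.0})
runner.record({"dot": 90.0})  # ignored: the run budget is already used up
aggregated = runner.aggregate()
```

Once `should_profile()` returns false, a caller would recompile the module with the aggregated data, as the commit message describes.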

@sergachev

Imported from GitHub PR openxla/xla#13020

Support lowering XLA clamp instruction to cuDNN.
cc @sergachev
Copybara import of the project:

--
47dc71f2a0d5887461a0b7d985328442e0e8da2f by Elfie Guo <elfieg@nvidia.com>:

Support lowering clamp instruction to cuDNN.

Merging this change closes #13020

PiperOrigin-RevId: 636875273

sergeykozub · 2024-05-24T14:06:33Z

This PR causes an "Unsupported elementwise operation %clamp.1" error in an internal test; the error is raised in ir_emitter_triton.cc.
I'll roll back and look into this on Monday.

github-actions · 2024-05-24T15:25:53Z

This PR was rolled back in d114ece!

elfiegg · 2024-05-24T20:18:11Z

Sent https://github.com/openxla/xla/pull/13061/files. Can you help verify whether this fixes the test?

akuegel · 2024-05-28T06:49:49Z

Confirmed, this fixed the test, and we have rolled this PR forward again.

Support lowering clamp instruction to cuDNN.

47dc71f

github-actions bot added the kokoro:force-run label May 23, 2024

github-actions bot assigned kamaljeeti and xla-rotation May 23, 2024

kokoro-team removed the kokoro:force-run label May 23, 2024

sergeykozub approved these changes May 24, 2024

copybara-service bot mentioned this pull request May 24, 2024

PR #13020: Support lowering XLA clamp instruction to cuDNN. tensorflow/tensorflow#68582

Merged

sergeykozub approved these changes May 24, 2024

copybara-service bot mentioned this pull request May 24, 2024

[JAX] Automatically share PGO data for GPU latency-hiding scheduler. #11024

Merged

copybara-service bot closed this in fdff0cd May 24, 2024

copybara-service bot mentioned this pull request May 24, 2024

[JAX] Automatically share PGO data for GPU latency-hiding scheduler. tensorflow/tensorflow#64663

Merged
