Merge OpenAI Triton commit b0f8332 #1725

Merged 15 commits into llvm-target on Jul 30, 2024
Conversation

whitneywhtsang
Contributor

@whitneywhtsang commented on Jul 30, 2024

This PR changes the Triton base from a51de76 to b0f8332 (Jul 29).
Pass rate: 98.49%

Please do not squash and merge this PR.

int3 and others added 14 commits July 25, 2024 20:51
This makes its interface more similar to `do_bench`, making it easier to
switch between the two.
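For reference, a minimal sketch of how `triton.testing.do_bench` is typically called; the kernel, sizes, and block shape below are illustrative assumptions, not code from this PR:

```python
# Illustrative sketch: benchmarking a toy kernel with do_bench.
# Everything here (kernel, sizes, BLOCK) is assumed for the example.
import torch
import triton
import triton.language as tl
import triton.testing


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)


n = 1 << 20
x = torch.rand(n, device="cuda")
y = torch.rand(n, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(n, 1024),)

# do_bench runs the callable repeatedly and reports a time in milliseconds.
ms = triton.testing.do_bench(lambda: add_kernel[grid](x, y, out, n, BLOCK=1024))
print(f"add_kernel: {ms:.3f} ms")
```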
… (#4401)

Previously this would silently do nothing; raising an error here prevents that behavior.
…… (#4187)

… instead of being a separate class
….if as live (#4404)

Summary: when scf.if is marked as live in ForOpDeadArgElim, its condition should be marked as live too. Without this fix, the scf.if in the test case added in this patch would be incorrectly removed.
The core Triton is a small number of people, and we receive many PRs (thank
you!). To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the following
tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]` with
`[x]` to indicate you have done them.

- [x] I am not making a trivial change, such as fixing a typo in a comment.

- [ ] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [ ] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these
    [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
    including the "tests should be minimal" section. (Usually running Python
    code and using the instructions it generates is not minimal.)
PyTorch 2.4 was officially released on July 24, 2024:
https://pytorch.org/blog/pytorch2-4/.

This makes sure the CI picks up the latest PyTorch features.
This PR first promotes common infrastructure in
`lib/Dialect/TritonGPU/Transforms/Pipeliner` to enable inclusion by
other target backends. No other changes have been made to the
lib/include directories.

Second, the `tritonamdgpu-stream-pipeline` pass has been completely
revamped based on code from
`lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp`,
using similar scheduling passes to compute multi-stage pipelines. Some
of this code could be consolidated further in the CoarseSchedule class
(or perhaps a derived LoopScheduler class). This modulo scheduler
collects `tt.load` ops and generates local_storage and management ops
for the ramp-up stage (stage-0), then collects all uses of the loads
for stage-1. Multi-buffering is introduced when num_stages exceeds the
max distance between load and uses. Buffering may be in shared memory
for `tt.dot` uses or in registers for all other uses. The current
implementation does not support peeling the last iteration if the loop is
dynamic.

Lastly, the `tritonamdgpu-reorder-instructions` pass has been enhanced
to move `tt.load` ops as early as possible in their region. This includes
loop bodies as well as func entry blocks for the ramp-up case. This
pass will also move `triton_gpu.local_store` ops as early as possible if
their source is not directly from a `tt.load`. In this way, a
multi-buffered pipeline will overlap in this order:
1. `tt.load` buffer+2
2. `triton_gpu.local_store` buffer+1
3. `tt.dot` buffer+0

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>
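As a rough, hypothetical sketch of the user-facing knob that drives this pipeliner: `num_stages` is the compile option that requests a multi-stage schedule. The kernel below is an illustrative assumption (row-major inputs, shapes divisible by the block sizes, no masking), not code from this PR:

```python
# Hypothetical sketch: num_stages is the compile option that drives the
# pipeliner. All names and shapes below are assumptions for illustration.
import torch
import triton
import triton.language as tl


@triton.jit
def dot_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
               BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
               BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    # The tt.load ops in this loop are what the stream pipeliner multi-buffers
    # (shared memory for the tl.dot operands, per the description above).
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptr + rm[:, None] * K + (k + rk)[None, :])
        b = tl.load(b_ptr + (k + rk)[:, None] * N + rn[None, :])
        acc += tl.dot(a, b)
    tl.store(c_ptr + rm[:, None] * N + rn[None, :], acc)


M = N = K = 256
a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)
c = torch.empty(M, N, device="cuda", dtype=torch.float32)
grid = (M // 64, N // 64)
# num_stages > 1 requests a multi-buffered, software-pipelined loop.
dot_kernel[grid](a, b, c, M, N, K,
                 BLOCK_M=64, BLOCK_N=64, BLOCK_K=32, num_stages=3)
```

With `num_stages=3`, the load for iteration k+2 can overlap the local store for k+1 and the dot for k, matching the ordering listed above.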
…408)

This commit changes the AccelerateAMDMatmul pass
to use common target utils to avoid duplication.

While here, it also cleans up `using` namespaces and symbols.
I'm adding Windows support to XLA, and this PR updates `f2reduce.cpp` so
that it compiles successfully on Windows.
Fixed the FLOPS viewer (which wasn't showing before, since flops metrics have a width).

Co-authored-by: Jokeren <robinho364@gmail.com>
…ersion (#4383)

This is the first PR that replaces the old distributed->distributed
layout conversion with one based on linear layouts.
We tried to match the original conversion mechanism as closely as possible
for now, but will try to improve its memory usage, reduce bank
conflicts, and improve generalizability.

There is a list of TODOs after this PR:

1. Remove the old code
2. Implement conversion within warps
3. Implement DotOpLayout conversion
4. Avoid bank conflicts using swizzling instead of padding
5. Update comments/revisit barriers for reduce/atomic operations

---------

Co-authored-by: Justin Lebar <justin.lebar@gmail.com>
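As a rough illustration (assumed for this write-up, not code from the PR), a register-tile transpose is one pattern whose lowering typically goes through a distributed->distributed `convert_layout`, the path reimplemented here with linear layouts:

```python
# Rough illustration (assumed, not from this PR): tl.trans on a register tile
# typically lowers through a distributed->distributed convert_layout, staged
# via shared memory.
import torch
import triton
import triton.language as tl


@triton.jit
def transpose_kernel(x_ptr, y_ptr, N: tl.constexpr):
    offs = tl.arange(0, N)
    tile = tl.load(x_ptr + offs[:, None] * N + offs[None, :])
    tile_t = tl.trans(tile)  # layout change -> convert_layout in TTGIR
    tl.store(y_ptr + offs[:, None] * N + offs[None, :], tile_t)


N = 64
x = torch.randn(N, N, device="cuda")
y = torch.empty_like(x)
transpose_kernel[(1,)](x, y, N=N)
assert torch.allclose(y, x.t())
```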
@whitneywhtsang changed the title from Merge OpenAI Triton commit 7b617bc to Merge OpenAI Triton commit b0f8332 on Jul 30, 2024
@whitneywhtsang merged commit eb9ed77 into llvm-target on Jul 30, 2024
4 checks passed
@whitneywhtsang deleted the whitneywhtsang/merge branch on July 30, 2024 at 04:16