Merge OpenAI Triton commit b0f8332 #1725

Merged 15 commits into llvm-target on Jul 30, 2024
Conversation

whitneywhtsang
Contributor

@whitneywhtsang commented on Jul 30, 2024

This PR changes the Triton base from a51de76 to b0f8332 (Jul 29).
Pass rate: 98.49%

Please do not squash and merge this PR.

int3 and others added 14 commits July 25, 2024 20:51
This makes its interface more similar to `do_bench`, making it easier to
switch between the two.
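For reference, a minimal sketch of how `triton.testing.do_bench` is typically called; the kernel, sizes, and block shape below are illustrative assumptions, not code from this PR:

```python
# Illustrative sketch: benchmarking a toy kernel with do_bench.
# Everything here (kernel, sizes, BLOCK) is assumed for the example.
import torch
import triton
import triton.language as tl
import triton.testing


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)


n = 1 << 20
x = torch.rand(n, device="cuda")
y = torch.rand(n, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(n, 1024),)

# do_bench runs the callable repeatedly and reports a time in milliseconds.
ms = triton.testing.do_bench(lambda: add_kernel[grid](x, y, out, n, BLOCK=1024))
print(f"add_kernel: {ms:.3f} ms")
```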
… (#4401)

Previously this would silently do nothing; raising an error here prevents that behavior.
…… (#4187)

… instead of being a separate class
….if as live (#4404)

Summary: when scf.if is marked as live in ForOpDeadArgElim, its condition should be marked as live too. Without this fix, the scf.if in the test case added in this patch would be incorrectly removed.
The core Triton is a small number of people, and we receive many PRs (thank
you!). To help us review your code more quickly, **if you are a new
contributor (less than 3 PRs merged) we ask that you complete the following
tasks and include the filled-out checklist in your PR description.**

Complete the following tasks before sending your PR, and replace `[ ]` with
`[x]` to indicate you have done them.

- [x] I am not making a trivial change, such as fixing a typo in a comment.

- [ ] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [ ] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these
    [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
    including the "tests should be minimal" section. (Usually running Python
    code and using the instructions it generates is not minimal.)
PyTorch 2.4 was officially released on July 24, 2024:
https://pytorch.org/blog/pytorch2-4/.

This makes sure the CI picks up the latest PyTorch features.
This PR first promotes common infrastructure in
`lib/Dialect/TritonGPU/Transforms/Pipeliner` to enable inclusion by
other target backends. No other changes have been made to the
lib/include directories.

Second, the `tritonamdgpu-stream-pipeline` pass has been completely
revamped based on code from
`lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp`,
using similar scheduling passes to compute multi-stage pipelines. Some
of this code could be consolidated further in the CoarseSchedule class
(or perhaps a derived LoopScheduler class). This modulo scheduler
collects `tt.load` ops and generates local_storage and management ops
for the ramp-up stage (stage-0), then collects all uses of the loads
for stage-1. Multi-buffering is introduced when num_stages exceeds the
max distance between load and uses. Buffering may be in shared memory
for `tt.dot` uses or in registers for all other uses. The current
implementation does not support peeling the last iteration if the loop is
dynamic.

Lastly, the `tritonamdgpu-reorder-instructions` pass has been enhanced
to move `tt.load` ops as early as possible in their region. This includes
loop bodies as well as func entry blocks for the ramp-up case. This
pass will also move `triton_gpu.local_store` ops as early as possible if
their source is not directly from a `tt.load`. In this way, a
multi-buffered pipeline will overlap in this order:
1. `tt.load` buffer+2
2. `triton_gpu.local_store` buffer+1
3. `tt.dot` buffer+0

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>
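As a rough, hypothetical sketch of the user-facing knob that drives this pipeliner: `num_stages` is the compile option that requests a multi-stage schedule. The kernel below is an illustrative assumption (row-major inputs, shapes divisible by the block sizes, no masking), not code from this PR:

```python
# Hypothetical sketch: num_stages is the compile option that drives the
# pipeliner. All names and shapes below are assumptions for illustration.
import torch
import triton
import triton.language as tl


@triton.jit
def dot_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
               BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
               BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    # The tt.load ops in this loop are what the stream pipeliner multi-buffers
    # (shared memory for the tl.dot operands, per the description above).
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptr + rm[:, None] * K + (k + rk)[None, :])
        b = tl.load(b_ptr + (k + rk)[:, None] * N + rn[None, :])
        acc += tl.dot(a, b)
    tl.store(c_ptr + rm[:, None] * N + rn[None, :], acc)


M = N = K = 256
a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)
c = torch.empty(M, N, device="cuda", dtype=torch.float32)
grid = (M // 64, N // 64)
# num_stages > 1 requests a multi-buffered, software-pipelined loop.
dot_kernel[grid](a, b, c, M, N, K,
                 BLOCK_M=64, BLOCK_N=64, BLOCK_K=32, num_stages=3)
```

With `num_stages=3`, the load for iteration k+2 can overlap the local store for k+1 and the dot for k, matching the ordering listed above.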
…408)

This commit changes the AccelerateAMDMatmul pass
to use common target utils to avoid duplication.

While here, it also cleans up `using` namespaces and symbols.
I'm adding Windows support to XLA, and this PR updates `f2reduce.cpp` so
that it compiles successfully on Windows.
Fixed the FLOPS viewer (which wasn't showing before, since flops metrics have a width).

Co-authored-by: Jokeren <robinho364@gmail.com>
…ersion (#4383)

This is the first PR that replaces the old distributed->distributed
layout conversion with one based on linear layouts.
We tried to match the original conversion mechanism as closely as possible
for now, but will try to improve its memory usage, reduce bank
conflicts, and improve generalizability.

There is a list of TODOs after this PR:

1. Remove the old code
2. Implement conversion within warps
3. Implement DotOpLayout conversion
4. Avoid bank conflicts using swizzling instead of padding
5. Update comments/revisit barriers for reduce/atomic operations

---------

Co-authored-by: Justin Lebar <justin.lebar@gmail.com>
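As a rough illustration (assumed for this write-up, not code from the PR), a register-tile transpose is one pattern whose lowering typically goes through a distributed->distributed `convert_layout`, the path reimplemented here with linear layouts:

```python
# Rough illustration (assumed, not from this PR): tl.trans on a register tile
# typically lowers through a distributed->distributed convert_layout, staged
# via shared memory.
import torch
import triton
import triton.language as tl


@triton.jit
def transpose_kernel(x_ptr, y_ptr, N: tl.constexpr):
    offs = tl.arange(0, N)
    tile = tl.load(x_ptr + offs[:, None] * N + offs[None, :])
    tile_t = tl.trans(tile)  # layout change -> convert_layout in TTGIR
    tl.store(y_ptr + offs[:, None] * N + offs[None, :], tile_t)


N = 64
x = torch.randn(N, N, device="cuda")
y = torch.empty_like(x)
transpose_kernel[(1,)](x, y, N=N)
assert torch.allclose(y, x.t())
```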
@whitneywhtsang changed the title from Merge OpenAI Triton commit 7b617bc to Merge OpenAI Triton commit b0f8332 on Jul 30, 2024
@whitneywhtsang merged commit eb9ed77 into llvm-target on Jul 30, 2024
4 checks passed
@whitneywhtsang deleted the whitneywhtsang/merge branch on July 30, 2024 at 04:16