
[NNPA] Multiple zAIU support with ZHighForkOp #2681

Closed

Conversation

imaihal
Collaborator

@imaihal imaihal commented Jan 15, 2024

This PR replaces #2563

This PR enables creating threads using the async dialect to run operations on multiple NNPA devices. ZHighForkOp and ZHighJoinOp are introduced as high-level IR, and they are lowered into AsyncExecuteOp and AsyncAwaitOp.

Currently, large MatMul ops are supported. Given A(N x K) * B(K x M), B is split along the M dimension for parallelization. MatMul ops whose M is greater than or equal to the threshold specified by a compiler option are parallelized. These MatMul ops are rewritten in the rewrite-onnx-for-zhigh pass by using the Split op, the Concat op, and ZHighForkOp and ZHighJoinOp, which are newly introduced in this PR. ZHighForkOp creates a thread to compute a sub-matrix, and ZHighJoinOp waits for that thread to complete. They are lowered into AsyncExecuteOp and AsyncAwaitOp in the ZHighToZLow pass.
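The rewrite described above can be pictured with a small, hypothetical pure-Python sketch (not the PR's code): B (K x M) is split along M, each column block is multiplied on its own thread (standing in for one zAIU device), and the partial results are concatenated. `ThreadPoolExecutor` plays the role of ZHighForkOp/ZHighJoinOp; all names here are illustrative.

```python
# Hypothetical sketch of the ParallelMatMulPattern rewrite, using plain
# Python lists and threads instead of MLIR ops and NNPA devices.
from concurrent.futures import ThreadPoolExecutor

def matmul(A, B):
    # A: N x K, B: K x M, plain list-of-lists matrix product
    K, M = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(K)) for j in range(M)]
            for row in A]

def split_cols(B, parts):
    # "Split op": cut B into `parts` column blocks (assumes M divisible by parts)
    M = len(B[0])
    step = M // parts
    return [[row[i * step:(i + 1) * step] for row in B] for i in range(parts)]

def concat_cols(blocks):
    # "Concat op": glue the column blocks back together row by row
    return [sum((blk[r] for blk in blocks), []) for r in range(len(blocks[0]))]

def parallel_matmul(A, B, devices=2, threshold=4):
    if len(B[0]) < threshold:        # M below threshold: keep the serial matmul
        return matmul(A, B)
    with ThreadPoolExecutor(max_workers=devices) as pool:
        futs = [pool.submit(matmul, A, blk)            # ZHighForkOp: spawn work
                for blk in split_cols(B, devices)]
        return concat_cols([f.result() for f in futs]) # ZHighJoinOp + Concat
```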

How to run

  • Set an option to specify the number of devices and the threshold.
      --nnpa-matmul-parallel=#DEVICES:THRESHOLD
                            - Enable parallelization with the given number of devices and the threshold on the dimension size.
                              The string is in the format of "#DEVICES":"THRESHOLD".
  • Link and load the async runtime library (${LLVM_PROJECT_HOME}/build/lib/libmlir_async_runtime.so).
    Use -L${LLVM_PROJECT_HOME}/build/lib -lmlir_async_runtime for compilation, and add the library directory to LD_LIBRARY_PATH at runtime.

Example:
Compile: (4 NNPA devices with threshold 128)
$ onnx-mlir -O3 --mtriple=s390x-ibm-loz --mcpu=z16 --maccel=NNPA --nnpa-matmul-parallel=4:128 <onnx model> -L${LLVM_PROJECT_HOME}/build/lib -lmlir_async_runtime

Summary of implementation

  1. ParallelMatMulPattern in the rewrite-onnx-for-zhigh pass
    1) Split matrix B along the M dimension with the Split op
    2) Insert ZHighForkOp and ZHighJoinOp to create threads
    3) Use the Concat op to gather the results of the threads
  2. ONNX to ZHigh
    Lower onnx.MatMul into ZHigh ops as usual
  3. ZHigh to ZLow
    3.1) Move the alloc op outside the ForkOp region so deallocation works correctly
    3.2) Replace the result of the ForkOp with the allocated value
    3.3) Create AsyncExecuteOp and copy the ForkOp region into it
    3.4) Create AsyncAwaitOp and replace ZHighJoinOp with it
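Steps 3.1 through 3.4 can be sketched with a hypothetical Python analogy (not the PR's code): the result buffer is allocated by the parent before forking (3.1), the forked body writes into that buffer (3.2), a thread plus a completion event stands in for AsyncExecuteOp and its token (3.3), and the join becomes a wait on that token (3.4).

```python
# Hypothetical analogy for the ZHigh-to-ZLow lowering of fork/join.
import threading

def fork(body, out):
    # 3.3: like async.execute -- run the region body on a thread.
    # `out` was allocated by the caller (3.1) and is filled by the body (3.2).
    done = threading.Event()          # plays the role of the !async.token
    def run():
        body(out)
        done.set()
    threading.Thread(target=run).start()
    return done

def join(token):
    # 3.4: like async.await -- block until the forked region has completed
    token.wait()

def body(buf):
    # Example region: fill the pre-allocated buffer with squares
    for i in range(len(buf)):
        buf[i] = i * i

out0 = [0] * 4                        # 3.1: alloc hoisted out of the region
tok0 = fork(body, out0)
join(tok0)
# out0 == [0, 1, 4, 9]
```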

imaihal and others added 30 commits July 6, 2023 01:01
Set correct layout for bcast case.
Add code to profile each stick and unstick time

imaihal and others added 6 commits January 18, 2024 02:32
@imaihal imaihal marked this pull request as ready for review January 18, 2024 15:23
@chentong319
Collaborator

Could you add an example of the ZHigh dialect to the PR?

@imaihal
Collaborator Author

imaihal commented Jan 19, 2024

@chentong319

This is an example of the rewriting when using --nnpa-matmul-parallel=2:256 (two devices with a threshold of 256). I also added the same example as a comment in the code.

  • Input
func.func @test_matmul_parallel(%arg0: tensor<1x64xf32>, %arg1: tensor<64x512xf32>) -> tensor<1x512xf32> {
  %0 = "onnx.MatMul"(%arg0, %arg1) : (tensor<1x64xf32>, tensor<64x512xf32>) -> tensor<1x512xf32>
  return %0 : tensor<1x512xf32>
}
  • Result of rewriting.
  func.func @test_matmul_parallel(%arg0: tensor<1x64xf32>, %arg1: tensor<64x512xf32>) -> tensor<1x512xf32> {
    %0 = onnx.Constant dense<256> : tensor<2xi64>
    %1:2 = "onnx.Split"(%arg1, %0) {axis = 1 : si64} : (tensor<64x512xf32>, tensor<2xi64>) -> (tensor<64x256xf32>, tensor<64x256xf32>)
    %2 = "zhigh.Fork"() ({
      %5 = "onnx.MatMul"(%arg0, %1#0) : (tensor<1x64xf32>, tensor<64x256xf32>) -> tensor<1x256xf32>
      onnx.Yield %5 : tensor<1x256xf32>
    }) {id = 0 : si64} : () -> tensor<1x256xf32>
    %3 = "zhigh.Fork"() ({
      %5 = "onnx.MatMul"(%arg0, %1#1) : (tensor<1x64xf32>, tensor<64x256xf32>) -> tensor<1x256xf32>
      onnx.Yield %5 : tensor<1x256xf32>
    }) {id = 1 : si64} : () -> tensor<1x256xf32>
    "zhigh.Join"(%2) : (tensor<1x256xf32>) -> ()
    "zhigh.Join"(%3) : (tensor<1x256xf32>) -> ()
    %4 = "onnx.Concat"(%2, %3) {axis = 1 : si64} : (tensor<1x256xf32>, tensor<1x256xf32>) -> tensor<1x512xf32>
    return %4 : tensor<1x512xf32>
  }

@imaihal imaihal changed the title [NNPA] Multiple zAIU support for MatMulOp with ZHighForkOp [NNPA] Multiple zAIU support with ZHighForkOp Jan 25, 2024
@imaihal
Collaborator Author

imaihal commented Jan 25, 2024

@AlexandreEichenberger @tungld @chentong319 Any comments on this?

@chentong319
Collaborator

The purpose of ZHigh.Join is to mark the place where the value returned by the fork should be ready. It would be better to use
%4 = zhigh.Join(%2) to make sure the use of the result comes after the join. Another issue, to which I do not know the answer, is how to tell the compiler to avoid moving the join up if possible.

@imaihal
Collaborator Author

imaihal commented Mar 21, 2024

This was replaced with PR #2756 using OpenMP.
